[HN Gopher] Common Crawl
___________________________________________________________________
Common Crawl
Author : Aissen
Score : 283 points
Date : 2021-03-26 16:42 UTC (6 hours ago)
(HTM) web link (commoncrawl.org)
(TXT) w3m dump (commoncrawl.org)
| smaddox wrote:
| I was hoping it was going to be a massively multiplayer dungeon
| crawling game...
| breck wrote:
| Love it! Just donated.
| cblconfederate wrote:
| I don't think this is the answer to "only Google can crawl the
| web". This is a huge archive, maybe suitable for making a web
| search engine.
|
| What if you want to make a simple link previewer? An abstract
| crawler for scientific articles? Most websites are behind
| Cloudflare, which will block/captcha you but happily whitelist
| only Google & major social sites. The answer is measures that
| bring the web back to basics, not this over-SEOed, bot-infested
| ecosystem. FAANGs succeeded in sucking all the information out of
| the web, but they suck at creating interoperable protocols (even
| Twitter now needs its own tags!).
|
| Incidentally, maybe the next search engine should use a push
| system, where websites ping it whenever they have updates. If the
| engine has unique features, it might actually be seen as a way to
| reduce the load from bots.
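|
| Something like: whenever a page changes, the site POSTs the URL to
| the engine's notification endpoint. A minimal sketch in Python,
| with a made-up /notify endpoint (no such API exists today):
|
|     import requests
|
|     # Hypothetical push notification: the site tells the indexer
|     # that a page changed, instead of waiting to be crawled.
|     requests.post(
|         "https://indexer.example/notify",
|         json={
|             "url": "https://mysite.example/post/42",
|             "changed": "2021-03-26T16:00:00Z",
|         },
|         timeout=5,
|     )
|
| Conceptually it's the same idea as sitemap pings or WebSub, just
| aimed at search indexing.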
| gillesjacobs wrote:
| This is mainly used for language modeling research. A filtered
| CC was used for GPT-3, and I have personally used data from CC for
| NLP projects.
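|
| For example, once you've downloaded a WARC segment locally, pulling
| pages out of it for an NLP pipeline is only a few lines with the
| warcio library (a rough sketch; the filename is just a placeholder):
|
|     from warcio.archiveiterator import ArchiveIterator
|
|     # Iterate over response records in a downloaded WARC file and
|     # print each captured URL plus the payload size.
|     with open("CC-MAIN-example.warc.gz", "rb") as stream:
|         for record in ArchiveIterator(stream):
|             if record.rec_type == "response":
|                 url = record.rec_headers.get_header("WARC-Target-URI")
|                 body = record.content_stream().read()
|                 print(url, len(body))
|
| From there you'd strip the HTML, filter by language, deduplicate,
| and so on, depending on the project.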
| shadowgovt wrote:
| I don't think we can bring the web back to basics in the sense
| you're envisioning without kicking most users off of it.
|
| Cloudflare's protection is to guard against traffic spikes and
| automated malicious attacks; Google and social sites are allow-
| listed because they're trustable entities with well-defined
| online presences and a vested interest (generally) in not
| breaking sites. "How do we do away with the need to put up an
| anti-automated-traffic screen plus whitelist" is in the same
| category of problem as "How do we change the modern web to
| address automated malicious attacks?"
| neura wrote:
| Another way to look at this is that if CDNs don't allow Google,
| nobody is going to want to use them. Their content doesn't
| get indexed and anybody doing a search is going to get
| directed to someone else that doesn't put their content
| behind a CDN with that level of protection. That or someone
| like Google will just solve the problem themselves and be
| both the CDN and the indexer, bringing them one step closer
| to complete ownership of finding anything on the web.
| th0ma5 wrote:
| Actually, I was able to search URLs across the entire collection
| with very little RAM, as they provide a series of indexes you can
| download.
|
| In theory someone could do something similar with terms, or you
| could first use the URLs to filter down what you download, load
| the text into Elastic or Solr, and do your own custom search that
| way.
|
| The indexes are really neat, though; I highly recommend playing
| with them.
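|
| For example, a single-URL lookup against the public index server
| looks roughly like this (a sketch; the CC-MAIN-2021-10 crawl label
| and example.com are just placeholders):
|
|     import requests
|
|     # Query one crawl's URL index for captures from a given host.
|     resp = requests.get(
|         "https://index.commoncrawl.org/CC-MAIN-2021-10-index",
|         params={"url": "example.com/*", "output": "json"},
|         timeout=30,
|     )
|     for line in resp.text.splitlines():
|         # each line is a JSON record with the WARC filename,
|         # byte offset, and length of that capture
|         print(line)
|
| Each record points at a WARC file plus an offset/length, so you
| can range-request just the pages you care about instead of
| downloading whole segments.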
| thunderbong wrote:
| Can you suggest where I can get them from?
| breischl wrote:
| >maybe the next search engine should use a push-system
|
| So setting up a new search engine that way would require going
| to every site and convincing them to notify you of changes.
| Wouldn't that be even more limited than the current Cloudflare
| whitelist system? At least there's some chance you can get
| around the whitelist system.
| neura wrote:
| This... I couldn't get past the irony of the comment.
| Basically, the problem is that sites only let Google and friends
| index them. The solution? Sites should only send index data to
| Google and friends.
|
| I mean, I get it, then sites can send to any number of
| indexers, but let's be honest, like you say, any new search
| engine has to get sites to push data out to them. That's just
| not. going. to. happen.
| waynesonfire wrote:
| hope these guys team up w/ archive.org
| rektide wrote:
| relates closely to the recent "Only Google is allowed to crawl
| the web"[1][2] post.
|
| [1] https://news.ycombinator.com/item?id=26592635
|
| [2] https://knuckleheads.club/
| staunch wrote:
| It seems like Common Crawl is doing a lot of awesome stuff, but
| they're _not_ attacking Google's stranglehold head on.
|
| Presumably this is because they lack the money to do so. Have
| they attempted to estimate how much it would cost per year to
| crawl the web as aggressively and comprehensively as Google does?
| I've checked their site and didn't find anything like that.
|
| If they came up with a number, say $2 or $10 billion per year, it
| might actually be possible to gather enough donations to dethrone
| Google.
|
| A lot of Google competitors would love to see them dethroned. And
| it would be a huge win for virtually everyone else too. There's
| no one in the world that wants Google to maintain their web
| search monopoly indefinitely.
| ricardo81 wrote:
| Good resource, admirable intention, great that it simply exists.
| Good sized index.
|
| I see a lot of people subscribe to the idea of this being the
| feeder to alternative search engines.
|
| I'd guess part of the problem with doing things this way is
| 'crawl priority': which pages the search engine thinks are the
| next best to crawl is totally out of their hands, or at least
| they'd still need to crawl on top of the Common Crawl data.
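|
| To make "crawl priority" concrete: a crawler's frontier is roughly
| a priority queue of candidate URLs, re-scored as new links and
| signals come in. A toy sketch (the scores are made up):
|
|     import heapq
|
|     # Toy crawl frontier: pop the highest-priority URL next.
|     # Real engines score on link counts, freshness, host budgets...
|     frontier = []  # (negated score, url): heapq pops smallest first
|     heapq.heappush(frontier, (-0.9, "https://example.com/"))
|     heapq.heappush(frontier, (-0.2, "https://example.com/old-page"))
|
|     score, url = heapq.heappop(frontier)
|     print(url)  # fetched first because it scored highest
|
| That scoring loop is exactly the part a third party can't
| outsource to a fixed monthly dump.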
|
| The recent UK CMA report into monopolies in online advertising
| estimated Google's index to be around 500-600 billion pages in
| size and Bing's to be 100-200 billion pages in size [0]. Of
| course, what you define as a 'page' is subjective given URL
| canonicals and page similarity.
|
| At the very least, Common Crawl gets around crawl-rate-limiting
| problems by being one massive download.
|
| Would be interesting to know whether an appreciable % of site
| owners block it, though going on past data (there is some on this
| in the UK CMA report as well), it's not a huge issue.
|
| [0]
| https://assets.publishing.service.gov.uk/media/5efc57ed3a6f4...
| (page 89)
| Leary wrote:
| Great resource. Does anyone have a good free source for popular
| keywords/topics on Google/the internet?
| ttfxxcc wrote:
| https://keywordshitter.com/
| tyingq wrote:
| https://trends.google.com/trends/ is probably the best resource
| for the "top" queries, though it doesn't dive too far down the
| list.
|
| Particularly their "Year in Search" entries, like:
| https://trends.google.com/trends/yis/2020/US/
| dang wrote:
| The interesting past threads seem to be the following. Others?
|
| _Ask HN: What would be the fastest way to grep Common Crawl?_ -
| https://news.ycombinator.com/item?id=22214474 - Feb 2020 (7
| comments)
|
| _Using Common Crawl to play Family Feud_ -
| https://news.ycombinator.com/item?id=16543851 - March 2018 (4
| comments)
|
| _Web image size prediction for efficient focused image crawling_
| - https://news.ycombinator.com/item?id=10107819 - Aug 2015 (5
| comments)
|
| _102TB of New Crawl Data Available_ -
| https://news.ycombinator.com/item?id=6811754 - Nov 2013 (37
| comments)
|
| _SwiftKey's Head Data Scientist on the Value of Common Crawl's
| Open Data [video]_ - https://news.ycombinator.com/item?id=6214874
| - Aug 2013 (2 comments)
|
| _A Look Inside Our 210TB 2012 Web Corpus_ -
| https://news.ycombinator.com/item?id=6208603 - Aug 2013 (36
| comments)
|
| _Blekko donates search data to Common Crawl_ -
| https://news.ycombinator.com/item?id=4933149 - Dec 2012 (36
| comments)
|
| _Common Crawl_ - https://news.ycombinator.com/item?id=3690974 -
| March 2012 (5 comments)
|
| _CommonCrawl: an open repository of web crawl data that is
| universally accessible_ -
| https://news.ycombinator.com/item?id=3346125 - Dec 2011 (8
| comments)
|
| _Tokenising the english text of 30TB common crawl_ -
| https://news.ycombinator.com/item?id=3342543 - Dec 2011 (7
| comments)
|
| _Free 5 Billion Page Web Index Now Available from Common Crawl
| Foundation_ - https://news.ycombinator.com/item?id=3209690 - Nov
| 2011 (39 comments)
| Grimm1 wrote:
| I've used this and it's invaluable for all types of things but a
| feeder for Google killers it is not.
|
| They don't approach the scale of what Google crawls; they state
| as much themselves. Nor do they crawl on the same timeline as
| Google. This is really nice for research or for kick-starting a
| project, but it isn't a long-term viable solution for alternative
| search engines. Between breadth, depth, timeline/speed, priority,
| and information captured, it falls well short.
| ziftface wrote:
| Well that's to be expected, but if a lot of search engines
| start using it, it's likely that websites will start allowing
| it to crawl and index their pages. So there might be potential
| there.
| Grimm1 wrote:
| It's not an issue of being allowed to crawl.
| ziftface wrote:
| Oh I guess I misunderstood then. Why don't they crawl at
| Google's scale?
| samcgraw wrote:
| Love to see this.
|
| As an aside, it always jars me when a site hijacks default
| browser scrolling functionality. In my experience, making it as
| fast as possible is a _far_ better use of dev resources than
| figuring out how to make scrolling unique (no matter what the
| marketing department says).
| yesenadam wrote:
| > it always jars me when a site hijacks default browser
| scrolling functionality
|
| I assume you are saying this because this site does it. What do
| you mean? I can't see any difference from normal scrolling
| functionality on there.
| psKama wrote:
| How feasible would it be to store all that data on a
| decentralized system like IPFS or Sia Skynet, etc., instead of
| Amazon, to add further meaning to the cause?
| gillesjacobs wrote:
| Blockchain storage is going to cost you a pretty penny if you
| were to store all of Common Crawl's petabytes, so not very
| feasible.
| gloriousternary wrote:
| More so than Amazon? From my (limited) experience, blockchain
| storage solutions are often less expensive, although I've never
| worked with petabytes of data, so maybe it's different at that
| scale.
| psKama wrote:
| That's not correct. When it comes to storage and transfer,
| blockchain alternatives are a fraction of the cost of Amazon. For
| example, Sia Skynet offers storage at $5/TB/month [1]. If you
| skip Skynet and run your own Sia node, the price can go even
| lower, to $2/TB/month depending on market conditions.
|
| [1] https://blog.sia.tech/announcing-skynet-premium-plans-
| faster...
| coder543 wrote:
| Amazon is hosting the Common Crawl on S3 for free, so...
| yes, $2/month/TB is a lot more expensive.
| gillesjacobs wrote:
| It seems that, at least on Sia's plans, you can host at most
| 20 TB for $80/month, not even a tenth of a monthly Common Crawl.
|
| Of course, Sia's Skynet plans are package deals right now, and I
| guess they're currently bootstrapping the network with users.
| Filecoin has no operational storage yet. Storj quotes
| $10/TB/month [1], so that will come out expensive.
|
| 1. https://www.storj.io/blog/2019/11/announcing-
| pioneer-2-and-t...
| duskwuff wrote:
| You may not grasp just how large the Common Crawl dataset is.
| It's been growing steadily at 200-300 TB _per month_ for the
| last few years. I'm not certain how large the entire corpus is
| at this point, but it's almost certainly in the tens to low
| hundreds of petabytes. (This is significantly larger than the
| capacity of the entire Sia network, for example.)
|
| Storing a dataset of this size and making it available online
| is not inexpensive. Amazon has generously donated their
| services to handle both of these tasks; it would be foolish to
| turn them down.
| duskwuff wrote:
| (Update: the complete Common Crawl dataset is actually a
| little smaller than I thought, at 6.4 PB. That's still pretty
| big, though.)
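|
| Back-of-the-envelope, using the figures quoted in this thread
| (rough numbers only):
|
|     corpus_tb = 6.4 * 1000  # ~6.4 PB total
|
|     # $/TB/month prices mentioned above: Sia node, Skynet, Storj
|     for price_per_tb in (2, 5, 10):
|         print(f"${price_per_tb}/TB/month -> "
|               f"${corpus_tb * price_per_tb:,.0f}/month")
|
| So roughly $13k-$64k per month just for storage, versus $0 while
| Amazon donates the hosting.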
| new_realist wrote:
| It's not clear, but it looks like the last crawl was 280 TiB
| (100 TiB compressed) and contains a snapshot of the web at
| that point; i.e. you don't need prior snapshots unless you're
| interested in historical content.
|
| EDIT: the state of the crawls is summarized at
| https://commoncrawl.github.io/cc-crawl-statistics/.
| riedel wrote:
| As a European (German), I always wonder about the legal basis for
| a) making copies of copyrighted material and databases available,
| and b) processing the personal data they contain.
|
| The Internet Archive seems more like a library, with exceptions
| applying, but Common Crawl seems to advertise many other purposes
| that go beyond archiving publicly relevant content.
|
| Would this be possible in Europe, too? My feeling is that US
| legislation is different here. Do you have to actively claim
| copyright in the US, or enforce it technically, e.g. via DRM? Can
| anyone use anything without a license as long as nobody finds out?
| new_realist wrote:
| Soon, you and your friends can host your own private search
| engine at modest cost and enjoy total privacy.
| kristopolous wrote:
| I'm kinda surprised this is new to people. It's 10 years old. Is
| this really the first time it's been talked about here?
| TechBro8615 wrote:
| It's the 39th time, apparently, as you can see by clicking the
| domain next to the submission.
| kristopolous wrote:
| Oh I always forget about that feature. Excuse my stupidity.
___________________________________________________________________
(page generated 2021-03-26 23:00 UTC)