[HN Gopher] Crawling a quarter billion webpages in 40 hours (2012)
___________________________________________________________________
Crawling a quarter billion webpages in 40 hours (2012)
Author : swyx
Score : 184 points
Date : 2023-06-15 07:33 UTC (15 hours ago)
(HTM) web link (michaelnielsen.org)
(TXT) w3m dump (michaelnielsen.org)
| merek wrote:
| Great to see this again. This was the article that introduced me
| to Redis (and more broadly the NoSQL rollercoaster) all those
| years ago!
| luckystarr wrote:
| How to get the top million sites list today? Alexa has shifted
| focus recently.
| jonatron wrote:
| There's a list here: https://radar.cloudflare.com/domains
| zX41ZdbW wrote:
| Here is an example of how to obtain a list of the top six
| million domains from Tranco and analyze their content with
| ClickHouse:
| https://github.com/ClickHouse/ClickHouse/issues/18842
| spyder wrote:
| CommonCrawl / Open PageRank ?
|
| https://www.domcop.com/top-10-million-websites
| gnfargbl wrote:
| Check out Tranco [1], which uses Cisco Umbrella, Majestic and
| also now a list sourced from Farsight passive DNS [2]. They're
| "working on" adding Chrome UX and Cloudflare Radar.
|
| There's also a list from Netcraft [3].
|
| [1] https://tranco-list.eu/
|
| [2] https://www.domaintools.com/resources/blog/mirror-mirror-
| on-...
|
| [3] https://trends.netcraft.com/topsites
| bottom999mottob wrote:
| Just use 500 android phones running Tmux they said. It'll be easy
| they said.
| jasonjayr wrote:
| Imagine a Beowulf cluster of android phones running tmux!
| marginalia_nu wrote:
| Be careful with that reference, it's an antique.
| 0xdeadbeefbabe wrote:
| A Beowulf cluster running screen
| onion2k wrote:
| To save you a click: Use 20 machines.
| sethammons wrote:
| 250M / 40hrs / 60min / 60s ~= 1,737 rps. That over 20 machines
| is ~87 rps per machine.
|
| Depending on a few factors, I rough out my backend Go stuff to
| handle between 1k and 5k rps per machine before we have real
| numbers.
| onion2k wrote:
| You didn't write your Go stuff 11 years ago though...
| sethammons wrote:
| We started with Go version 1.2 which was released over ten
| years ago. Pretty darn close to 11 years.
|
| https://go.dev/doc/go1.2
| ddorian43 wrote:
| With async in any language of your choice this should need just
| one server. 250M / 40 hours ~= 1.7K requests/second.
|
| You can probably do it on a single core in a low level language
| like Rust/Java.
| marginalia_nu wrote:
| Threads aren't the only bottleneck in crawling.
|
| Assuming you're crawling at a civilized rate of about 1
| request/second, you're only so many network hiccups away from
| consuming the entire ephemeral port range with connections in
| TIME_WAIT or CLOSE_WAIT.
| sethammons wrote:
| Crank up the ulimit. And what is this 1 req/second nonsense?
| 1 req/sec/domain maybe. I have to agree, my first thought is
| "why is this not a single node in Go?"
| marginalia_nu wrote:
| Oh yeah, I mean 1 req/sec per domain of course.
|
| It's very easy to end up with tens of thousands of not-
| quite-closed connections while crawling, even with
| SO_LINGER(0), proper closure, tweaking the TCP settings,
| and doing all things by the book.
|
| It's a different situation from e.g. having a bunch of
| incoming connections to a node; the traffic patterns are
| not at all similar.
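|
| For anyone wondering what SO_LINGER(0) looks like in practice,
| here's a minimal Python sketch (the host is just a placeholder;
| whether it actually helps depends on your stack and OS):
|
|     import socket
|     import struct
|
|     # linger on, timeout 0: close() sends RST immediately instead
|     # of leaving the socket parked in TIME_WAIT.
|     sock = socket.create_connection(("example.com", 80), timeout=10)
|     sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
|                     struct.pack("ii", 1, 0))
|     try:
|         sock.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\n"
|                      b"Connection: close\r\n\r\n")
|         reply = sock.recv(4096)
|     finally:
|         sock.close()  # aborts the connection rather than lingering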
| spacebacon wrote:
| How to spend $580 in 40 hours. More could be done for much less,
| even in 2012.
| ricardo81 wrote:
| Yes. I used libcurl's multi interface on one $40/month server
| around that time. Indeed at any scale the rate limiting becomes
| the main bottleneck, mainly because a lot of sites are
| concentrated on certain hosts. Speed isn't the problem and
| multiple servers aren't really needed.
| [deleted]
| swyx wrote:
| a bit of context - posted this because i found it via
| https://twitter.com/willdepue/status/1669074758208749572
|
| if they replicate it in 2023 it would be pretty interesting to
| me. i can think of a few times a year i need a good scraper.
|
| but also thought it a good look into 2012 Michael Nielsen, and
| into thinking about performance.
| trevioustrouble wrote:
| [flagged]
| dspillett wrote:
| Nothing to stop you writing a similar project and releasing it
| if you feel so strongly that it should be out there, instead of
| being incensed at not being given the sweat off someone else's
| brow.
| trevioustrouble wrote:
| Okey walley
| bryanrasmussen wrote:
| don't want to release the source code because of moral
| reservations about how people might use it. Ok upstanding
| citizen.
| EGreg wrote:
| Just like AI people
| KomoD wrote:
| No point being a douchebag.
| trevioustrouble wrote:
| Can't help myself to the stupidity.
| krishadi wrote:
| I've recently been reading up on multithreading and
| multiprocessing in Python. You mention that you've taken a multi-
| threaded approach since the processes are I/O bound. Is this the
| same as running the script with asyncio and async/await?
| Jorge1o1 wrote:
| At a 10,000-foot view (pedants will take offense), you should
| look to use multiprocessing for tasks that are CPU-bound, and
| asyncio/threads (but really asyncio if you can) for problems
| that are I/O-bound.
|
| This is a massive simplification, but it's the most useful
| framing for a beginner.
|
| Additionally, asyncio is not the same as multithreading:
| asyncio is typically powered by a single-threaded event loop
| built on a readiness mechanism like select/kqueue/epoll/IOCP.
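|
| A tiny sketch of the split, assuming aiohttp for the I/O-bound
| side (the URL and the CPU-bound function are made up for
| illustration):
|
|     import asyncio
|     from multiprocessing import Pool
|
|     import aiohttp
|
|     # I/O-bound: one thread, one event loop, many in-flight requests.
|     async def fetch_all(urls):
|         timeout = aiohttp.ClientTimeout(total=30)
|         async with aiohttp.ClientSession(timeout=timeout) as session:
|             async def fetch(url):
|                 async with session.get(url) as resp:
|                     return url, resp.status
|             return await asyncio.gather(*(fetch(u) for u in urls))
|
|     # CPU-bound: real parallelism needs processes, since the GIL
|     # stops threads from running Python bytecode in parallel.
|     def crunch(n):
|         return sum(i * i for i in range(n))
|
|     if __name__ == "__main__":
|         print(asyncio.run(fetch_all(["https://example.com"] * 10)))
|         with Pool(4) as pool:
|             print(pool.map(crunch, [10_000_000] * 4))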
| krishadi wrote:
| Oops, I meant to ask the author but realised that the author is
| not the same as the op. Hah!
| samwillis wrote:
| This is a good overview, but being from 2012 it's missing any
| commentary on one now-common area: using a real browser for
| crawling/scraping.
|
| You will often now hear recommendations to use a real browser, in
| headless mode, for crawling. There are two reasons for this:
|
| - SPAs with front end only rendering are hard to scrape with a
| traditional http library.
|
| - Anti-bot/scraping technology fingerprints browsers and looks at
| request patterns and browser behaviour to try to detect and
| block bots.
|
| Using a "real browser" is often advised as a way around these
| issues.
|
| However, in my experience you should avoid headless browser
| crawling until it is absolutely necessary. I have found:
|
| - Headless browser scraping is between 10x and 100x more resource
| intensive, even if you carefully block requests and cache
| resources.
|
| - Most SPAs now have some level of server-side rendering, and
| often that includes a handy JSON blob in the returned document
| that contains the data you actually want.
|
| - Advanced browser fingerprinting is vanishingly rare. At most I
| have seen user agent strings checked against the HTTP headers
| and their order. If you make your HTTP lib look like a current
| browser you are 99.9% of the way there.
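|
| To illustrate that last point, a sketch of browser-like headers
| with Python's requests; the header values approximate a current
| desktop Chrome rather than an exact fingerprint, and the URL is
| a placeholder:
|
|     import requests
|
|     # Roughly what a desktop Chrome sends; many naive bot checks
|     # only look at User-Agent and a few Accept-* headers. Note that
|     # requests doesn't reproduce a browser's exact header order.
|     BROWSER_HEADERS = {
|         "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
|                        "AppleWebKit/537.36 (KHTML, like Gecko) "
|                        "Chrome/114.0.0.0 Safari/537.36"),
|         "Accept": ("text/html,application/xhtml+xml,application/xml;"
|                    "q=0.9,image/webp,*/*;q=0.8"),
|         "Accept-Language": "en-US,en;q=0.9",
|         "Accept-Encoding": "gzip, deflate",
|     }
|
|     session = requests.Session()
|     session.headers.update(BROWSER_HEADERS)
|     resp = session.get("https://example.com", timeout=30)
|     print(resp.status_code, len(resp.text))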
| jaymzcampbell wrote:
| On the topic of JSON and whatnot, that reminded me of this
| awesome post [1] about looking within heap snapshots for entire
| data structures; they may hold a lot of nicely structured data
| that isn't readily available from more direct URL calls.
|
| [1] https://www.adriancooney.ie/blog/web-scraping-via-
| javascript...
|
| (Previous comments about it from 2022 here:
| https://news.ycombinator.com/item?id=31205139)
| moneywoes wrote:
| Is there like a custom, performance-focused headless browser?
|
| How are they scaled?
| tempest_ wrote:
| Site owners adding browser fingerprinting explicitly are
| vanishingly rare; however, site owners who sit behind Cloudflare,
| which very likely fingerprints browsers, are very common.
| londons_explore wrote:
| Headless browsers are expensive, but if you need them, I
| suspect there are easy wins to be had on the performance front.
|
| For example, you can probably skip all layout/rendering work
| and simply return faked values whenever javascript tries to
| read back a canvas it just drew into, or tries to read the
| computed width of some element. The vast majority of those
| things won't prevent the page loading.
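|
| Short of patching the engine, some easy wins are already exposed
| by the automation APIs. A rough Playwright sketch (the URL is a
| placeholder, and the canvas stub is a simplification rather than
| a full skip of layout work):
|
|     from playwright.sync_api import sync_playwright
|
|     SKIP = {"image", "media", "font", "stylesheet"}
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch(headless=True)
|         page = browser.new_page()
|
|         # Abort requests for resources the scraper doesn't need.
|         page.route("**/*", lambda route: route.abort()
|                    if route.request.resource_type in SKIP
|                    else route.continue_())
|
|         # Give canvas readback a cheap constant answer instead of
|         # forcing real raster work.
|         page.add_init_script(
|             "HTMLCanvasElement.prototype.toDataURL = () => 'data:,';")
|
|         page.goto("https://example.com",
|                   wait_until="domcontentloaded")
|         html = page.content()
|         browser.close()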
| arek_nawo wrote:
| Which way to go will depend on your use-case and what websites
| you want to scrape.
|
| For occasional web crawling, a headless browser is great, as it's
| easy to set up and you're almost guaranteed to get the data you
| need.
|
| For frequent or large-scale crawling, it's a different story.
| Sure, you can implement something with just an HTTP library, but
| you'll need to test it thoroughly and do some research
| beforehand. That said, most scraped, content-heavy websites use
| either static HTML or SSR, in which case you can use HTTP no
| problem.
| marginalia_nu wrote:
| Depends on what you want with the data I guess. From a search
| engine point of view, the SPA case isn't _that_ relevant, since
| SPAs can't reliably be linked to, tend in general not to be
| very stable, and it's overall difficult to figure out how to
| enumerate and traverse their views.
|
| I think a good middle ground might be to do a first pass with a
| "stupid" crawler, and then re-visit, with a headless browser, the
| sites where you were blocked or that just contained a bunch of
| javascript.
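|
| The heuristic for that second pass can be pretty crude; a sketch
| (thresholds invented, BeautifulSoup assumed):
|
|     from bs4 import BeautifulSoup
|
|     def needs_headless(status_code: int, html: str) -> bool:
|         """Guess whether a page is worth re-visiting with a real
|         browser."""
|         if status_code in (403, 429):      # likely bot-blocked
|             return True
|         soup = BeautifulSoup(html, "html.parser")
|         for tag in soup(["script", "style"]):
|             tag.decompose()
|         visible_text = soup.get_text(" ", strip=True)
|         # A non-trivial document with almost no visible text is
|         # probably just a javascript shell.
|         return len(visible_text) < 200 and len(html) > 5_000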
| abhibeckert wrote:
| The SPAs we create can be reliably linked to (the current
| URL changes as the user moves around, even though the page
| hasn't reloaded) and they are "stable" because our business
| would go bankrupt if Google couldn't crawl our content.
|
| If Google can crawl it, then you can too. And while Google
| doesn't use a headless browser (or at least I assume they
| don't) they absolutely do execute javascript before loading
| the content of the page. And they execute the click event
| handlers on every link/button, and when we use
| "history.pushState()" to change the URL, Google considers that
| a new page.
|
| You're just going to get a loading spinner with no content if
| you do a dumb crawl (I disagree with that and think we should
| be running a headless browser server side to execute
| javascript and generate the initial page content for all our
| pages... but so far management hasn't prioritised that
| change... instead they just keep telling us to make our
| client side javascript run faster... imagine if there was no
| javascript to execute at all? At least none before first
| contentful paint)
| marginalia_nu wrote:
| > The SPAs we create can be reliably linked to (the
| current URL changes as the user moves around, even though
| the page hasn't reloaded) and they are "stable" because our
| business would go bankrupt if Google couldn't crawl our
| content.
|
| This is true for some SPAs, but not all SPAs, and there's
| not really any way of telling which is which.
|
| I don't personally attempt to crawl SPAs because it's not
| the sort of content I want to index.
| r3trohack3r wrote:
| I have a pet theory that there are two forms of the web:
| the document web and the application web. SPAs have some
| very attractive properties for the application web but
| complicate/break the document web.
|
| That being said, with sites like HN, Reddit, LinkedIn,
| Twitter, news outlets, etc. the lines between "document"
| and "application" get blurred. In some ways they've built
| a micro-application that hosts documents. Content can be
| user submitted in-browser. Content can be "engaged with"
| in browser. Some handle this blurring better than others.
| HN is an example IMO of getting it right where nearly
| everything that should be addressable (like comments) can
| be linked to. Others not so much.
|
| (As an aside, I love marginalia!)
| marginalia_nu wrote:
| For application websites like the ones you listed, you'd
| typically end up building a special integration for
| crawling against their API or data dumps. This is also
| true for github, stackoverflow, and even document-y
| websites like wikipedia.
|
| It's simply not feasible to treat them as any other
| website if you wanna index their data.
| cacoos wrote:
| I've always wondered why not use request interceptors, get the
| html/json/xml/whatever URLs, and then just call them directly.
|
| If you need cookies/headers, you can always open the browser,
| log in and then make the same requests in the console, instead
| of waiting for the browser to load and scrape the UI (by xpath,
| etc.)
|
| It sounds weird, going in circles: the SPA calls some URL, the
| SPA uses the response data to populate the UI, and then you
| scrape the UI, instead of just calling the URL inside the
| browser? Am I missing something?
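|
| For illustration, intercepting those underlying responses rather
| than scraping the UI might look like this with Playwright (the
| URL and the content-type filter are placeholders):
|
|     from playwright.sync_api import sync_playwright
|
|     captured = []
|
|     def on_response(response):
|         # Keep the JSON responses the SPA fetches to build its UI.
|         ctype = response.headers.get("content-type", "")
|         if "application/json" in ctype:
|             try:
|                 captured.append((response.url, response.json()))
|             except Exception:
|                 pass  # body wasn't valid JSON after all
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch(headless=True)
|         page = browser.new_page()
|         page.on("response", on_response)
|         page.goto("https://example.com/app",
|                   wait_until="networkidle")
|         browser.close()
|
|     for url, payload in captured:
|         print(url, str(payload)[:80])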
| rocho wrote:
| It also depends on what the goal is. If you need to extract
| some data from a specific site (as opposed to obtaining
| faithful representations of pages from arbitrary sites like a
| search engine might need to), then SPAs might be the easiest
| targets of all. Just do some inspection beforehand to find out
| where the SPA is loading the data from; very often the easiest
| thing is to query that API directly.
|
| Sometimes the API will need some form of session tokens found
| in HTTP meta tags, form inputs, or cookies. Sometimes they have
| some CORS checks in place that are very easy to bypass from a
| server-side environment by spoofing the Origin header.
| Nakroma wrote:
| I do a lot of scraping at my day job, and I agree the times we
| need to use a headless browser are very rare. Vast majority of
| the time you can either find an API or JSON endpoint, or scrape
| the JS contained in the returned SSR document to get what you
| want.
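|
| As an example of the SSR case: Next.js-style pages embed their
| server-rendered state as a JSON blob in the document, which is
| far easier to consume than the rendered HTML. A sketch assuming
| requests and BeautifulSoup, with a placeholder URL (the exact
| shape of the blob varies per site):
|
|     import json
|
|     import requests
|     from bs4 import BeautifulSoup
|
|     url = "https://example.com/item/123"
|     html = requests.get(url, timeout=30).text
|     soup = BeautifulSoup(html, "html.parser")
|
|     # Next.js ships its state in a script tag with this id; other
|     # frameworks use similar conventions (window.__INITIAL_STATE__).
|     tag = soup.find("script", id="__NEXT_DATA__")
|     if tag is not None:
|         data = json.loads(tag.string)
|         print(list(data.keys()))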
| ricardo81 wrote:
| The issues you list can probably be split into two main issues:
|
| - The host has a tendency to block unknown user agents, or user
| agents that claim to be a browser but are not
|
| - Anything that requires client side rendering
|
| I'd suppose both problems are more pertinent in 2023 than they
| were in 2012.
|
| At web scale, the issue becomes at what point you would be
| required to use a headless browser when not using one in the
| first place, e.g. if React is included/referenced. Perhaps some
| simple fingerprinting of JS files would do; IMO in reality the
| line is very blurry, so you either do or you don't.
| bluedino wrote:
| > Headless browser scraping is between 10x and 100x more
| resource intensive, even if you carefully block requests and
| cache resources.
|
| Instead of setting up some kind of partnership with our
| vendors, where they just send us information or provide an API,
| we scrape their websites.
|
| The old version ran in an hour, using one thread on one machine.
| It downloaded PDFs and extracted the values.
|
| The new version is Selenium-based, uses 20 cores, 300GB of
| memory, and takes all night to run. It does the same thing, but
| from inside of a browser.
|
| As a bonus, the 'web scrapers' were blamed for every
| performance issue we had for a long time.
| marginalia_nu wrote:
| Archive link: https://archive.is/yUWjh
|
| Also, I kinda wish the author had paid any sort of attention to
| the fact that doing this incorrectly may create a flood of DNS
| queries. At least have the decency to set up a bind cache or
| something.
| swyx wrote:
| what is a bind cache? always assume most of us have terrible
| knowledge of networking
| marginalia_nu wrote:
| I mean running bind[1] locally configured to act as a DNS
| cache.
|
| The operating system does some DNS caching as well, but it's
| not really tuned for crawling, and as a result it's very easy
| to end up spamming innocent DNS servers with an insane number
| of lookup requests.
|
| [1] https://www.isc.org/bind/
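|
| If running bind feels like too much, the very least you can do is
| make the crawler cache lookups in-process; aiohttp, for example,
| exposes knobs for that (the values below are arbitrary, and this
| doesn't replace a proper shared resolver cache):
|
|     import asyncio
|
|     import aiohttp
|
|     async def main():
|         connector = aiohttp.TCPConnector(
|             use_dns_cache=True,   # cache resolved addresses in-process
|             ttl_dns_cache=300,    # keep entries for 5 minutes
|             limit=200,            # total concurrent connections
|             limit_per_host=2,     # stay polite per host
|         )
|         session = aiohttp.ClientSession(connector=connector)
|         async with session:
|             async with session.get("https://example.com") as resp:
|                 print(resp.status)
|
|     asyncio.run(main())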
| swyx wrote:
| ok but in my understanding isn't DNS also cached on the
| nearest wifi/ISP routers? the whole DNS system is just
| layer after layer of caches right? i.e. does caching on the
| local machine actually matter? (real question, i don't know)
| marginalia_nu wrote:
| Yeah sure, but most of those caching layers (including
| possibly on the ISP level) aren't really configured for
| the DNS flood a crawling operation may result in.
|
| If you're going to do way more DNS lookups than are
| expected from your connection, it's good practice to
| provide your own caching layer that's scaled accordingly.
|
| Not doing so probably won't break anything, but it risks
| degrading the DNS performance of other people using the
| same resolvers.
| swyx wrote:
| gotcha. thanks for indulging my curiosity! hopefully
| others will learn good dns hygiene from this as well.
| m4r71n wrote:
| A note of caution: never scrape the web from your
| local/residential network. A few months back I wanted to verify
| a set of around 200k URLs from a large data set that included a
| set of URL references for each object, and naively wrote a
| simple Python script that would use ten concurrent threads to
| ping each URL and record the HTTP status code that came back. I
| let this run for some time and was happy with the results, only
| to find out later that a large CDN provider had identified me as
| a spammy client via their client reputation score and had
| blocked my IP address on all of the websites that they serve.
|
| Changing your IP address with AT&T is a pain (even though the
| claim is that your IP is dynamic, in practice it hardly ever
| changes) so I opted to contact the CDN vendor by filling out a
| form and luckily my ban was lifted in a day or two. Nevertheless,
| it was annoying that suddenly a quarter of the websites I
| normally visit were not accessible to me since the CDN covers a
| large swath of the Internet.
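|
| (For context, the script was roughly this shape; a reconstructed
| sketch, not the original code, with invented file names:)
|
|     import csv
|     from concurrent.futures import ThreadPoolExecutor
|
|     import requests
|
|     def check(url):
|         try:
|             resp = requests.head(url, timeout=10, allow_redirects=True)
|             return url, resp.status_code
|         except requests.RequestException as exc:
|             return url, type(exc).__name__
|
|     urls = [line.strip() for line in open("urls.txt") if line.strip()]
|
|     with ThreadPoolExecutor(max_workers=10) as pool, \
|             open("results.csv", "w", newline="") as out:
|         writer = csv.writer(out)
|         for url, status in pool.map(check, urls):
|             writer.writerow([url, status])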
| ricardo81 wrote:
| It makes very little difference what IP you scrape from, unless
| you're from a very dodgy subnet.
|
| The major content providers tend to take a whitelist-only
| approach: you're either a human-like visitor or you're facing
| their anti-scraping methodologies.
| remram wrote:
| I think the emphasis is on "never scrape from YOUR
| local/residential network".
| delfinom wrote:
| >(even though the claim is that your IP is dynamic, in practice
| it hardly ever changes)
|
| Every ISP just uses DHCP for router IPs. It's dynamic, you just
| have to let the lease time expire to renew it.
|
| Or have your own configurable router instead of the ISP's, so
| that you can actually send a DHCP release command, though they
| don't all support this. Changing the MAC address will work
| otherwise.
| cand0r wrote:
| When the lease expires, the same IP is prioritized for
| renewal. Leases are generally for a week or two, but I've
| noticed dynamic IPs staying for 3 months or more. Swapping
| modems is really the best way to get a new external IP.
| pknerd wrote:
| Most probably, cloud-based scraping services were not available
| in 2012. Now there are services like scraperapi and others that
| don't need you to install anything on your end. You pay them and
| use their cloud infra, infinite proxies, and even headless
| browsers. Shameless plug: I wrote about it a few years ago in a
| blog post [1].
|
| [1] https://blog.adnansiddiqi.me/scraping-dynamic-websites-
| using...
| timeon wrote:
| Not sure how it is in Python, but what about using something
| like arti-client? Would it already be blocked?
| hinkley wrote:
| I've learned a bunch of stuff about batch processing in the
| last few years that I would have sworn I already knew.
|
| We had a periodic script that had all of these caveats about
| checking telemetry on the affected systems before running it,
| and even when it was happy it took gobs of hardware and ran for
| over 30 minutes.
|
| There were all sorts of mistakes about traffic shaping that
| made it very bursty, like batching versus rate limiting, so the
| settings were determined by trial and error, essentially based
| on the 95th percentile of worst case (which is to say
| occasionally you'd get unlucky and knock things over). It also
| had to gather data from three services to feed the fourth and
| it was very spammy about that as well.
|
| I reworked the whole thing with actual rate limiting, some
| different async blocks to interleave traffic to different
| services, and some composite rate limiting so we would call
| service C no faster than Service D could retire requests.
|
| At one point I cut the cluster core count by 70% and the run
| time down to 8 minutes. Around a 12x speed up. Doing exactly
| the same amount of work, but doing it smarter.
|
| CDNs and SaaS companies are in a weird spot where typical
| spider etiquette falls down. Good spiders limit themselves to N
| simultaneous requests per domain, trying to balance their
| burden across the entire internet. But they are capable of M*N
| total simultaneous requests, and so if you have a narrow domain
| or get unlucky they can spider twenty of your sites at the same
| time. Depending on how your cluster works (ie, cache expiry)
| that may actually cause more stress on the cluster than just
| blowing up one Host at a time.
|
| People can get quite grumpy about this behind closed doors, and
| punishing the miscreants definitely gets discussed.
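|
| For the per-domain politeness part, a minimal asyncio sketch of
| the "cap per host, with a global cap on top" idea (the numbers
| are arbitrary, aiohttp assumed):
|
|     import asyncio
|     from collections import defaultdict
|     from urllib.parse import urlsplit
|
|     import aiohttp
|
|     async def crawl(urls):
|         # Cap total in-flight requests, and cap per-host concurrency
|         # so no single site (or CDN) sees the whole firehose at once.
|         global_slots = asyncio.Semaphore(200)
|         per_host = defaultdict(lambda: asyncio.Semaphore(2))
|
|         async def fetch(session, url):
|             host = urlsplit(url).netloc
|             async with global_slots, per_host[host]:
|                 async with session.get(url) as resp:
|                     status, body = resp.status, await resp.read()
|                 # Holding the per-host slot through the sleep keeps a
|                 # gap between consecutive hits on the same host.
|                 await asyncio.sleep(1.0)
|             return url, status, len(body)
|
|         timeout = aiohttp.ClientTimeout(total=30)
|         async with aiohttp.ClientSession(timeout=timeout) as session:
|             return await asyncio.gather(
|                 *(fetch(session, u) for u in urls),
|                 return_exceptions=True)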
| marginalia_nu wrote:
| I run a search engine crawler from my residential network. I
| get this too sometimes, but a lot of the time the IP shit-
| listing is temporary. It also seems to happen more often if you
| don't use a high enough crawl delay, ignore robots.txt, do deep
| crawls ignoring HTTP 429 errors and so on. You know, overall
| being a bad bot.
|
| Overall, it's not as bad as it seems. I doubt anyone would
| accidentally damage their IP reputation doing otherwise above-
| board stuff.
| chaostheory wrote:
| What makes this post more interesting is that Reddit might now be
| ushering in a new era of the crawler arms race
| polote wrote:
| For all the people who say this is easy: try it! It's not easy
| at all; I tried it and spent a few weeks getting similar
| performance. Receiving thousands of requests is not the same as
| making thousands of requests: you can saturate your network, get
| saturated by the latency of random websites, hit sites that
| never time out, parse multi-megabyte malformed HTML, and get
| infinite redirections.
|
| My fastest implementation in Python was actually using threads,
| and it was much faster than any async variant.
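|
| Whatever concurrency model you pick, the defensive fetching ends
| up looking roughly the same; a sketch with aiohttp (the limits
| are arbitrary):
|
|     import asyncio
|
|     import aiohttp
|
|     MAX_BODY = 2 * 1024 * 1024   # don't slurp multi-megabyte pages
|     TIMEOUT = aiohttp.ClientTimeout(total=20, connect=5)
|
|     async def safe_fetch(session, url):
|         try:
|             async with session.get(url, timeout=TIMEOUT,
|                                     allow_redirects=True,
|                                     max_redirects=5) as resp:
|                 body = await resp.content.read(MAX_BODY)
|                 return url, resp.status, body
|         except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
|             return url, None, type(exc).__name__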
___________________________________________________________________
(page generated 2023-06-15 23:01 UTC)