[HN Gopher] Crawling a quarter billion webpages in 40 hours (2012)
       ___________________________________________________________________
        
       Crawling a quarter billion webpages in 40 hours (2012)
        
       Author : swyx
       Score  : 184 points
       Date   : 2023-06-15 07:33 UTC (15 hours ago)
        
 (HTM) web link (michaelnielsen.org)
 (TXT) w3m dump (michaelnielsen.org)
        
       | merek wrote:
       | Great to see this again. This was the article that introduced me
       | to Redis (and more broadly the NoSQL rollercoaster) all those
       | years ago!
        
       | luckystarr wrote:
        | How do you get a top-million-sites list today? Alexa has
        | shifted focus recently.
        
         | jonatron wrote:
         | There's a list here: https://radar.cloudflare.com/domains
        
         | zX41ZdbW wrote:
         | Here is an example of how to obtain a list of the top six
         | million domains from Tranco and analyze their content with
         | ClickHouse:
         | https://github.com/ClickHouse/ClickHouse/issues/18842
        
         | spyder wrote:
         | CommonCrawl / Open PageRank ?
         | 
         | https://www.domcop.com/top-10-million-websites
        
         | gnfargbl wrote:
         | Check out Tranco [1], which uses Cisco Umbrella, Majestic and
         | also now a list sourced from Farsight passive DNS [2]. They're
         | "working on" adding Chrome UX and Cloudflare Radar.
         | 
         | There's also a list from Netcraft [3].
         | 
         | [1] https://tranco-list.eu/
         | 
         | [2] https://www.domaintools.com/resources/blog/mirror-mirror-
         | on-...
         | 
         | [3] https://trends.netcraft.com/topsites
        
       | bottom999mottob wrote:
        | Just use 500 Android phones running tmux, they said. It'll be
        | easy, they said.
        
         | jasonjayr wrote:
         | Imagine a Beowulf cluster of android phones running tmux!
        
           | marginalia_nu wrote:
           | Be careful with that reference, it's an antique.
        
             | 0xdeadbeefbabe wrote:
             | A Beowulf cluster running screen
        
       | onion2k wrote:
       | To save you a click: Use 20 machines.
        
         | sethammons wrote:
         | 250M / 40hrs / 60min / 60s ~= 1,737 rps. That over 20 machines
         | is ~87 rps per machine.
         | 
          | Depending on a few factors, I ballpark my backend Go stuff
          | at 1-5k rps per machine until we have real numbers.
        
           | onion2k wrote:
           | You didn't write your Go stuff 11 years ago though...
        
             | sethammons wrote:
             | We started with Go version 1.2 which was released over ten
             | years ago. Pretty darn close to 11 years.
             | 
             | https://go.dev/doc/go1.2
        
       | ddorian43 wrote:
        | With async in any language of your choice this should need
        | just one server: 250M / 40 hours ~= 1.7K requests/second.
       | 
       | You can probably do it on a single core in a low level language
       | like Rust/Java.
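        | 
        | A minimal sketch of the single-node async approach in Python,
        | assuming aiohttp (the concurrency cap and URL list are made
        | up, not the author's setup):
        | 
        |     import asyncio
        | 
        |     import aiohttp
        | 
        |     CONCURRENCY = 2000  # made-up cap; tune to ulimit/bandwidth
        | 
        |     async def fetch(session, sem, url):
        |         # Bound total in-flight requests so we don't exhaust
        |         # sockets or file descriptors.
        |         async with sem:
        |             try:
        |                 timeout = aiohttp.ClientTimeout(total=30)
        |                 async with session.get(url,
        |                                        timeout=timeout) as r:
        |                     return url, r.status, await r.read()
        |             except Exception as exc:
        |                 return url, None, exc
        | 
        |     async def crawl(urls):
        |         sem = asyncio.Semaphore(CONCURRENCY)
        |         async with aiohttp.ClientSession() as session:
        |             return await asyncio.gather(
        |                 *(fetch(session, sem, u) for u in urls))
        | 
        |     # asyncio.run(crawl(["https://example.com/"]))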
        
         | marginalia_nu wrote:
          | Threads aren't the only bottleneck in crawling.
         | 
         | Assuming you're crawling at a civilized rate of about 1
         | request/second, you're only so many network hiccups away from
         | consuming the entire ephemeral port range with connections in
         | TIME_WAIT or CLOSE_WAIT.
        
           | sethammons wrote:
            | Crank up the ulimit. And what is this 1 req/second nonsense?
            | 1 req/sec/domain, maybe. I have to agree, my first thought
            | is "why is this not a single node in Go?"
        
             | marginalia_nu wrote:
             | Oh yeah, I mean 1 req/sec per domain of course.
             | 
             | It's very easy to end up with tens of thousands of not-
             | quite-closed connections while crawling, even with
             | SO_LINGER(0), proper closure, tweaking the TCP settings,
             | and doing all things by the book.
             | 
              | It's a different situation from e.g. having a bunch of
              | incoming connections to a node; the traffic patterns are
              | not at all similar.
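              | 
              | For reference, a sketch of one of those tweaks in
              | Python: SO_LINGER with a zero timeout makes close()
              | send an RST and skip TIME_WAIT entirely (whether you
              | actually want that is another question):
              | 
              |     import socket
              |     import struct
              | 
              |     s = socket.socket(socket.AF_INET,
              |                       socket.SOCK_STREAM)
              |     # l_onoff=1, l_linger=0: reset on close, no
              |     # TIME_WAIT lingering on the port.
              |     s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
              |                  struct.pack("ii", 1, 0))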
        
       | spacebacon wrote:
        | How to spend $580 in 40 hours. Even in 2012, more could be
        | done for much less.
        
         | ricardo81 wrote:
          | Yes. I used libcurl's multi interface on one $40/month
          | server around that time. Indeed, at any scale, rate limiting
          | becomes the main bottleneck, mainly because a lot of sites
          | are concentrated on certain hosts. Speed isn't the problem,
          | and multiple servers aren't really needed.
        
       | [deleted]
        
       | swyx wrote:
       | a bit of context - posted this because i found it via
       | https://twitter.com/willdepue/status/1669074758208749572
       | 
       | if they replicate it in 2023 it would be pretty interesting to
       | me. i can think of a few times a year i need a good scraper.
       | 
       | but also thought it a good look into 2012 Michael Nielsen, and
       | into thinking about performance.
        
       | trevioustrouble wrote:
       | [flagged]
        
         | dspillett wrote:
          | Nothing to stop you writing a similar project and releasing
          | it if you feel so strongly that it should be out there,
          | instead of being incensed at not being given the sweat off
          | someone else's brow.
        
           | trevioustrouble wrote:
           | Okey walley
        
         | bryanrasmussen wrote:
         | don't want to release the source code because of moral
         | reservations about how people might use it. Ok upstanding
         | citizen.
        
           | EGreg wrote:
           | Just like AI people
        
         | KomoD wrote:
         | No point being a douchebag.
        
           | trevioustrouble wrote:
            | Can't help myself to the stupidity
        
       | krishadi wrote:
        | I've recently been reading up on multithreading and
        | multiprocessing in Python. You mention that you've taken a
        | multi-threaded approach since the processes are I/O bound. Is
        | this the same as running the script with asyncio and
        | async/await?
        
         | Jorge1o1 wrote:
         | At a 10,000 foot view (pedants will take offense) you should
         | look to use multiprocessing for tasks which are CPU bound, and
         | asyncio/threads (but really asyncio if you can) for problems
         | which are IO-bound.
         | 
         | This is a massive simplification but most useful for a
         | beginner.
         | 
          | Additionally, asyncio is not the same as multithreading:
          | typically asyncio is powered by a single-threaded event loop
          | and a readiness mechanism like select/kqueue/IOCP/epoll.
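          | 
          | A rough illustration of that split in Python (the function
          | names and workloads are placeholders, not from the article):
          | 
          |     import asyncio
          |     from concurrent.futures import ProcessPoolExecutor
          | 
          |     # CPU-bound: real parallelism needs separate processes,
          |     # since the GIL keeps threads from running Python
          |     # bytecode in parallel.
          |     def parse_page(html: str) -> int:   # placeholder work
          |         return len(html.split())
          | 
          |     def parse_many(pages):
          |         with ProcessPoolExecutor() as pool:
          |             return list(pool.map(parse_page, pages))
          | 
          |     # IO-bound: one event loop thread can overlap thousands
          |     # of socket waits without extra threads or processes.
          |     # fetch_one is assumed to be an async callable you supply.
          |     async def fetch_many(urls, fetch_one):
          |         return await asyncio.gather(
          |             *(fetch_one(u) for u in urls))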
        
         | krishadi wrote:
         | Oops, I meant to ask the author but realised that the author is
         | not the same as the op. Hah!
        
       | samwillis wrote:
        | This is a good overview, but being from 2012 it's missing any
        | comment on one now-common area: using a real browser for
        | crawling/scraping.
       | 
       | You will often now hear recommendations to use a real browser, in
       | headless mode, for crawling. There are two reasons for this:
       | 
       | - SPAs with front end only rendering are hard to scrape with a
       | traditional http library.
       | 
        | - Anti-bot/scraping technology fingerprints browsers and looks
        | at request patterns and browser behaviour to try to detect and
        | block bots.
       | 
       | Using a "real browser" is often advised as a way around these
       | issues.
       | 
        | However, in my experience you should avoid headless browser
        | crawling until it is absolutely necessary. I have found:
       | 
       | - Headless browser scraping is between 10x and 100x more resource
       | intensive, even if you carefully block requests and cache
       | resources.
       | 
       | - Most SPAs now have some level of server side rendering, and
       | often that includes having a handy JSON in the returned document
       | that contains the data you actually want.
       | 
        | - Advanced browser fingerprinting is vanishingly rare. At most
        | I have seen detection of user-agent strings and checks that
        | they are consistent with the other HTTP headers and their
        | order. If you make your HTTP lib look like a current browser
        | (see the sketch below), you are 99.9% of the way there.
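        | 
        | A minimal sketch of "make your HTTP lib look like a browser"
        | using requests. The header values are examples only (not tied
        | to any specific browser build), and requests gives only
        | partial control over header ordering:
        | 
        |     import requests
        | 
        |     BROWSER_HEADERS = {
        |         # Copy a real, current browser's headers and keep them
        |         # up to date if a site compares UA and header set.
        |         "User-Agent": (
        |             "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        |             "AppleWebKit/537.36 (KHTML, like Gecko) "
        |             "Chrome/114.0.0.0 Safari/537.36"),
        |         "Accept": ("text/html,application/xhtml+xml,"
        |                    "application/xml;q=0.9,*/*;q=0.8"),
        |         "Accept-Language": "en-US,en;q=0.9",
        |     }
        | 
        |     resp = requests.get("https://example.com/",
        |                         headers=BROWSER_HEADERS, timeout=30)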
        
         | jaymzcampbell wrote:
          | On the topic of JSON and whatnot, that reminded me of this
          | awesome post [1] about looking within heap snapshots for
          | entire data structures, which may hold a lot of nicely
          | structured data that isn't readily available from more
          | direct URL calls.
         | 
         | [1] https://www.adriancooney.ie/blog/web-scraping-via-
         | javascript...
         | 
         | (Previous comments about it from 2022 here:
         | https://news.ycombinator.com/item?id=31205139)
        
         | moneywoes wrote:
          | Is there a custom, performance-focused headless browser out
          | there?
          | 
          | How are they scaled?
        
         | tempest_ wrote:
          | Site owners adding browser fingerprinting explicitly is
          | vanishingly rare; however, site owners sitting behind
          | Cloudflare, which very likely fingerprints browsers, are
          | very common.
        
         | londons_explore wrote:
         | Headless browsers are expensive, but if you needed them, I
         | suspect there are easy wins to be had on the performance front.
         | 
         | For example, you can probably skip all layout/rendering work
         | and simply return faked values whenever javascript tries to
         | read back a canvas it just drew into, or tries to read the
         | computed width of some element. The vast majority of those
         | things won't prevent the page loading.
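          | 
          | One related win that's available today without modifying the
          | browser is blocking heavy resource types at the network
          | layer (the "block requests" point mentioned upthread). A
          | sketch with Playwright for Python (assuming Playwright;
          | which resource types to block is a judgment call):
          | 
          |     from playwright.sync_api import sync_playwright
          | 
          |     BLOCKED = {"image", "media", "font", "stylesheet"}
          | 
          |     with sync_playwright() as p:
          |         browser = p.chromium.launch(headless=True)
          |         page = browser.new_page()
          |         # Abort requests for resources we don't need in
          |         # order to scrape the document.
          |         page.route("**/*", lambda route: route.abort()
          |                    if route.request.resource_type in BLOCKED
          |                    else route.continue_())
          |         page.goto("https://example.com/")
          |         html = page.content()
          |         browser.close()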
        
         | arek_nawo wrote:
         | Which way to go will depend on your use-case and what websites
         | you want to scrape.
         | 
          | For occasional web crawling, a headless browser is great, as
          | it's easy to set up and you're almost guaranteed to get the
          | data you need.
          | 
          | For frequent or large-scale crawling, it's a different story.
          | Sure, you can implement something with just an HTTP library,
          | but you'll need to test it thoroughly and do some research
          | beforehand. That said, most scraped, content-heavy websites
          | use either static HTML or SSR, in which case you can use
          | plain HTTP no problem.
        
         | marginalia_nu wrote:
          | Depends on what you want with the data, I guess. From a
          | search engine point of view, the SPA case isn't _that_
          | relevant, since SPAs can't reliably be linked to, in general
          | tend not to be very stable, and it's overall difficult to
          | figure out how to enumerate and traverse their views.
         | 
         | I think a good middle ground might be to do a first pass with a
         | "stupid" crawler, and then re-visit the sites where you were
         | blocked or that just contained a bunch of javascript with a
         | headless browser.
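          | 
          | A crude sketch of that two-pass idea in Python (the "looks
          | like a JS shell" heuristic and its threshold here are made
          | up, not how any particular crawler does it):
          | 
          |     import re
          |     import requests
          | 
          |     SCRIPT_RE = re.compile(r"<script\b.*?</script>",
          |                            re.S | re.I)
          |     TAG_RE = re.compile(r"<[^>]+>")
          | 
          |     def needs_headless(url, min_text_chars=500):
          |         # First pass with a plain HTTP client; if we get
          |         # blocked or the page is mostly scripts with little
          |         # visible text, queue it for a headless re-visit.
          |         resp = requests.get(url, timeout=30)
          |         if resp.status_code in (403, 429):
          |             return True
          |         text = TAG_RE.sub(" ", SCRIPT_RE.sub(" ", resp.text))
          |         return len(text.strip()) < min_text_chars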
        
           | abhibeckert wrote:
           | The SPA's we create can be reliably linked to (the current
           | URL changes as the user moves around, even though the page
           | hasn't reloaded) and they are "stable" because our business
           | would go bankrupt if Google couldn't crawl our content.
           | 
           | If Google can crawl it, then you can too. And while Google
           | doesn't use a headless browser (or at least I assume they
           | don't) they absolutely do execute javascript before loading
           | the content of the page. And they execute the click event
           | handlers on every link/button and when we use
           | "history.pushState()" to change the URL Google considers that
           | a new page.
           | 
           | You're just going to get a loading spinner with no content if
           | you do a dumb crawl (I disagree with that and think we should
           | be running a headless browser server side to execute
           | javascript and generate the initial page content for all our
           | pages... but so far management hasn't prioritised that
           | change... instead they just keep telling us to make our
           | client side javascript run faster... imagine if there was no
           | javascript to execute at all? At least none before first
            | contentful paint)
        
             | marginalia_nu wrote:
             | > The SPA's we create can be reliably linked to (the
             | current URL changes as the user moves around, even though
             | the page hasn't reloaded) and they are "stable" because our
             | business would go bankrupt if Google couldn't crawl our
             | content.
             | 
             | This is true for some SPAs, but not all SPAs, and there's
             | not really any way of telling which is which.
             | 
             | I don't personally attempt to crawl SPAs because it's not
             | the sort of content I want to index.
        
               | r3trohack3r wrote:
               | I have a pet theory that there are two forms of the web:
               | the document web and the application web. SPAs have some
               | very attractive properties for the application web but
               | complicate/break the document web.
               | 
               | That being said, with sites like HN, Reddit, LinkedIn,
               | Twitter, news outlets, etc. the lines between "document"
               | and "application" get blurred. In some ways they've built
               | a micro-application that hosts documents. Content can be
               | user submitted in-browser. Content can be "engaged with"
               | in browser. Some handle this blurring better than others.
               | HN is an example IMO of getting it right where nearly
               | everything that should be addressable (like comments) can
               | be linked to. Others not so much.
               | 
               | (As an aside, I love marginalia!)
        
               | marginalia_nu wrote:
               | For application websites like the ones you listed, you'd
               | typically end up building a special integration for
               | crawling against their API or data dumps. This is also
                | true for github, stackoverflow, and even document-y
               | websites like wikipedia.
               | 
               | It's simply not feasible to treat them as any other
               | website if you wanna index their data.
        
         | cacoos wrote:
          | I've always wondered why not use request interceptors, get
          | the HTML/JSON/XML/whatever URL, and then just call it.
          | 
          | If you need cookies/headers, you can always open the browser,
          | log in, and then make the same requests in the console,
          | instead of waiting for the browser to load and scraping the
          | UI (by XPath, etc.)
          | 
          | It sounds weird, going in circles:
          | 
          | - the SPA calls some URL
          | 
          | - the SPA uses the response data to populate the UI
          | 
          | - you scrape the UI
          | 
          | instead of just calling the URL inside the browser. Am I
          | missing something?
        
         | rocho wrote:
         | It also depends on what the goal is. If you need to extract
         | some data from a specific site (as opposed to obtaining
         | faithful representations of pages from arbitrary sites like a
         | search engine might need to), then SPAs might be the easiest
          | targets of all. Just do some inspection beforehand to find
          | out where the SPA is loading the data from; very often the
          | easiest thing is to query the API directly.
         | 
         | Sometimes the API will need some form of session tokens found
         | in HTTP meta tags, form inputs, or cookies. Sometimes they have
         | some CORS checks in place that are very easy to bypass from a
         | server-side environment by spoofing the Origin header.
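          | 
          | A sketch of that pattern with requests and BeautifulSoup
          | (the meta tag name, API path, and token header here are
          | hypothetical; every site is different):
          | 
          |     import requests
          |     from bs4 import BeautifulSoup
          | 
          |     session = requests.Session()
          | 
          |     # Load the page once to pick up cookies and whatever
          |     # session token the SPA sends with its API calls.
          |     page = session.get("https://example.com/app", timeout=30)
          |     soup = BeautifulSoup(page.text, "html.parser")
          |     token = soup.find("meta",
          |                       {"name": "csrf-token"})["content"]
          | 
          |     # Hit the backing API directly; server-side, nothing
          |     # stops us from setting Origin to whatever we want.
          |     api = session.get(
          |         "https://example.com/api/v1/items",
          |         headers={"Origin": "https://example.com",
          |                  "X-CSRF-Token": token},
          |         timeout=30)
          |     print(api.json())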
        
         | Nakroma wrote:
         | I do a lot of scraping at my day job, and I agree the times we
         | need to use a headless browser are very rare. Vast majority of
         | the time you can either find an API or JSON endpoint, or scrape
         | the JS contained in the returned SSR document to get what you
         | want.
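          | 
          | As an example of the "JSON in the SSR document" case: many
          | server-rendered SPAs embed their initial state in a script
          | tag (Next.js uses an element with id __NEXT_DATA__; other
          | frameworks use different conventions, so inspect the page
          | first). A sketch assuming requests and BeautifulSoup:
          | 
          |     import json
          | 
          |     import requests
          |     from bs4 import BeautifulSoup
          | 
          |     resp = requests.get("https://example.com/some-page",
          |                         timeout=30)
          |     soup = BeautifulSoup(resp.text, "html.parser")
          | 
          |     tag = soup.find("script", id="__NEXT_DATA__")
          |     if tag is not None:
          |         data = json.loads(tag.string)
          |         # The useful payload usually sits under props; the
          |         # exact path depends on the site.
          |         print(data.get("props", {}).get("pageProps"))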
        
         | ricardo81 wrote:
          | The issues you list can probably be split into two main
          | issues:
         | 
         | - The host has a tendency to block unknown user agents, or user
         | agents that claim to be a browser but are not
         | 
         | - Anything that requires client side rendering
         | 
         | I'd suppose both problems are more pertinent in 2023 than they
         | were in 2012.
         | 
          | At web scale, the question becomes at what point you'd be
          | required to use a headless browser when you're not using one
          | in the first place, e.g. if React is included/referenced.
          | Perhaps some simple fingerprinting of JS files would do; IMO
          | in reality the line is very blurry, so you either do or you
          | don't.
        
         | bluedino wrote:
         | > Headless browser scraping is between 10x and 100x more
         | resource intensive, even if you carefully block requests and
         | cache resources.
         | 
         | Instead of setting up some kind of partnership with our
         | vendors, where they just send us information or provide an API,
         | we scrape their websites.
         | 
          | The old version ran in an hour, using one thread on one
          | machine. It downloaded PDFs and extracted the values.
         | 
         | The new version is Selenium based, uses 20 cores, 300GB of
         | memory, and takes all night to run. It does the same thing, but
         | from inside of a browser.
         | 
         | As a bonus, the 'web scrapers' were blamed for every
         | performance issue we had for a long time.
        
       | marginalia_nu wrote:
       | Archive link: https://archive.is/yUWjh
       | 
       | Also kinda wish the author paid any sort of attention to the fact
       | that doing this incorrectly may create a flood of DNS queries. At
       | least have the decency to set up a bind cache or something.
        
         | swyx wrote:
         | what is a bind cache? always assume most of us have terrible
         | knowledge of networking
        
           | marginalia_nu wrote:
           | I mean running bind[1] locally configured to act as a DNS
           | cache.
           | 
           | The operating system does some DNS caching as well, but it's
           | not really tuned for crawling, and as a result it's very easy
           | to end up spamming innocent DNS servers with an insane amount
           | of lookup requests.
           | 
           | [1] https://www.isc.org/bind/
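            | 
            | Running a real caching resolver locally is the proper fix;
            | as a cruder illustration of the same idea, you can at
            | least memoize lookups inside the crawler process (note
            | this ignores DNS TTLs, which bind would honour):
            | 
            |     import functools
            |     import socket
            | 
            |     @functools.lru_cache(maxsize=100_000)
            |     def resolve(host):
            |         # One lookup per host for the life of the process.
            |         return socket.gethostbyname(host)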
        
             | swyx wrote:
              | ok but in my understanding isn't DNS also cached on the
              | nearest wifi/ISP routers? the whole DNS system is just
              | layer after layer of caches right? i.e. does caching on
              | the local machine actually matter? (real question, i
              | don't know)
        
               | marginalia_nu wrote:
               | Yeah sure, but most of those caching layers (including
               | possibly on the ISP level) aren't really configured for
               | the DNS flood a crawling operation may result in.
               | 
                | If you're going to do way more DNS lookups than are
                | expected from your connection, it's good practice to
                | provide your own caching layer that's scaled
                | accordingly.
               | 
               | Not doing so probably won't break anything, but it risks
               | degrading the DNS performance of other people using the
               | same resolvers.
        
               | swyx wrote:
               | gotcha. thanks for indulging my curiosity! hopefully
               | others will learn good dns hygiene from this as well.
        
       | m4r71n wrote:
       | A note of caution: never scrape the web from your
        | local/residential network. A few months back I wanted to
        | verify a set of around 200k URLs from a large data set that
        | included a set of URL references for each object, and naively
        | wrote a simple Python script that would use ten concurrent
        | threads to ping each URL and record the HTTP status code that
        | came back. I let this run for some time and was happy with the
        | results, only to find out later that a large CDN provider had
        | identified me as a spammy client with their client reputation
        | score and blocked my IP address on all of the websites that
        | they serve.
       | 
       | Changing your IP address with AT&T is a pain (even though the
       | claim is that your IP is dynamic, in practice it hardly ever
       | changes) so I opted to contact the CDN vendor by filling out a
       | form and luckily my ban was lifted in a day or two. Nevertheless,
       | it was annoying that suddenly a quarter of the websites I
       | normally visit were not accessible to me since the CDN covers a
       | large swath of the Internet.
        
         | ricardo81 wrote:
         | It makes very little difference what IP you scrape from, unless
         | you're from a very dodgy subnet.
         | 
          | The major content providers tend to take a whitelist-only
          | approach: you're either a human-like visitor or you're
          | facing their anti-scraping methodologies.
        
           | remram wrote:
           | I think the emphasis is on "never scrape from YOUR
           | local/residential network".
        
         | delfinom wrote:
         | >(even though the claim is that your IP is dynamic, in practice
         | it hardly ever changes)
         | 
         | Every ISP just uses DHCP for router IPs. It's dynamic, you just
         | have to let the lease time expire to renew it.
         | 
            | Or, have your own configurable router instead of the
            | ISP's, so that you can actually send a DHCP release
            | command, though they don't all support that. Changing the
            | MAC address will work otherwise.
        
           | cand0r wrote:
           | When the lease expires, the same IP is prioritized for
           | renewal. Leases are generally for a week or two, but I've
           | noticed dynamic IPs staying for 3 months or more. Swapping
           | modems is really the best way to get a new external IP.
        
         | pknerd wrote:
          | Cloud-based scraping services were most probably not
          | available in 2012. Now there are services like scraperapi
          | and others that don't require you to install anything at
          | your end. You pay them and use their cloud infra, infinite
          | proxies, and even headless browsers. Shameless plug: I wrote
          | about it a few years ago on my blog [1]
         | 
         | [1] https://blog.adnansiddiqi.me/scraping-dynamic-websites-
         | using...
        
         | timeon wrote:
          | Not sure how it is in Python, but what about using something
          | like arti-client? Would it already be blocked?
        
         | hinkley wrote:
         | I've learned a bunch of stuff about batch processing in the
         | last few years that I would have sworn I already knew.
         | 
         | We had a periodic script that had all of these caveats about
         | checking telemetry on the affected systems before running it,
         | and even when it was happy it took gobs of hardware and ran for
         | over 30 minutes.
         | 
         | There were all sorts of mistakes about traffic shaping that
         | made it very bursty, like batching versus rate limiting, so the
         | settings were determined by trial and error, essentially based
         | on the 95th percentile of worst case (which is to say
         | occasionally you'd get unlucky and knock things over). It also
         | had to gather data from three services to feed the fourth and
         | it was very spammy about that as well.
         | 
         | I reworked the whole thing with actual rate limiting, some
         | different async blocks to interleave traffic to different
         | services, and some composite rate limiting so we would call
         | service C no faster than Service D could retire requests.
         | 
         | At one point I cut the cluster core count by 70% and the run
         | time down to 8 minutes. Around a 12x speed up. Doing exactly
         | the same amount of work, but doing it smarter.
         | 
         | CDNs and SaaS companies are in a weird spot where typical
         | spider etiquette falls down. Good spiders limit themselves to N
         | simultaneous requests per domain, trying to balance their
         | burden across the entire internet. But they are capable of M*N
         | total simultaneous requests, and so if you have a narrow domain
         | or get unlucky they can spider twenty of your sites at the same
         | time. Depending on how your cluster works (ie, cache expiry)
         | that may actually cause more stress on the cluster than just
         | blowing up one Host at a time.
         | 
         | People can get quite grumpy about this behind closed doors, and
         | punishing the miscreants definitely gets discussed.
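          | 
          | A sketch of the per-domain cap plus global cap that good
          | spiders use, in Python asyncio (the limits and the fetch_one
          | callable are placeholders):
          | 
          |     import asyncio
          |     from collections import defaultdict
          |     from urllib.parse import urlsplit
          | 
          |     GLOBAL_LIMIT = 200     # M*N ceiling across all domains
          |     PER_DOMAIN_LIMIT = 2   # N in-flight requests per domain
          | 
          |     global_sem = asyncio.Semaphore(GLOBAL_LIMIT)
          |     domain_sems = defaultdict(
          |         lambda: asyncio.Semaphore(PER_DOMAIN_LIMIT))
          | 
          |     async def polite_fetch(url, fetch_one):
          |         host = urlsplit(url).hostname
          |         # Both the global and the per-domain limit must be
          |         # free before the request goes out.
          |         async with global_sem, domain_sems[host]:
          |             return await fetch_one(url)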
        
         | marginalia_nu wrote:
         | I run a search engine crawler from my residential network. I
         | get this too sometimes, but a lot of the time the IP shit-
         | listing is temporary. It also seems to happen more often if you
         | don't use a high enough crawl delay, ignore robots.txt, do deep
         | crawls ignoring HTTP 429 errors and so on. You know, overall
         | being a bad bot.
         | 
         | Overall, it's not as bad as it seems. I doubt anyone would
         | accidentally damage their IP reputation doing otherwise above-
         | board stuff.
        
       | chaostheory wrote:
       | What makes this post more interesting is that Reddit might now be
       | ushering in a new era of the crawler arms race
        
       | polote wrote:
        | For all the people saying this is easy: try it! It's not easy
        | at all. I've tried it and spent a few weeks getting similar
        | performance. Receiving thousands of requests is not the same
        | as making thousands of requests: you can saturate your
        | network, get swamped by the latency of random websites, hit
        | sites that never time out, parse multi-megabyte malformed
        | HTML, and get infinite redirections.
        | 
        | My fastest implementation in Python actually used threads and
        | was much faster than any async variant.
        
       ___________________________________________________________________
       (page generated 2023-06-15 23:01 UTC)