[HN Gopher] Notes on Writing Web Scrapers
       ___________________________________________________________________
        
       Notes on Writing Web Scrapers
        
       Author : cushychicken
       Score  : 57 points
       Date   : 2021-12-02 18:03 UTC (4 hours ago)
        
 (HTM) web link (cushychicken.github.io)
 (TXT) w3m dump (cushychicken.github.io)
        
       | throwaway894345 wrote:
       | > When you're reading >1000 results a day, and you're inserting
       | generous politeness intervals, an end-to-end scrape is expensive
       | and time consuming. As such, you don't want to interrupt it. Your
       | exception handling should be air. Tight.
       | 
       | Seems like it would be more robust to just persist the state of
       | your scrape so you can resume it. In general, I try to minimize
       | the amount of code that a developer _MUST_ get right.
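       | 
       | Roughly what I have in mind, as an untested sketch (scrape()
       | and all_urls are placeholders for whatever fetch/parse logic
       | and URL list you already have):
       | 
       |     import json, os
       | 
       |     DONE_FILE = "scraped_urls.json"  # checkpoint of done URLs
       | 
       |     done = set()
       |     if os.path.exists(DONE_FILE):
       |         with open(DONE_FILE) as f:
       |             done = set(json.load(f))
       | 
       |     for url in all_urls:
       |         if url in done:
       |             continue     # already handled on a previous run
       |         scrape(url)      # your fetch + parse + insert logic
       |         done.add(url)
       |         with open(DONE_FILE, "w") as f:
       |             json.dump(sorted(done), f)  # persist progress
       | 
       | A crash or Ctrl-C then just means re-running the script; it
       | picks up where it left off.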
        
       | marginalia_nu wrote:
       | > Be nice to your data sources.
       | 
       | Very much this. If I know I'm about to possibly consume a non-
       | negligible amount of resources from a server by indexing a
       | website with lots of subdomains, I typically send the webmaster
       | an email asking them if this is fine, telling them what I use the
       | data for and how often my crawler re-visits a site, asking if I
       | should set any specific crawl delays or do it in particular times
       | of day.
       | 
       | In every case I've done this, they have been fine with it. It
       | goes down a lot better than just barging in and grabbing data.
       | This also gives them a way of reaching me if something should go
       | wrong. I'm not just some IP address grabbing thousands of
       | documents, I'm a human person they can talk to.
       | 
       | If I'd get a no, then I would respect that too, not try to
       | circumvent it like some entitled asshole. No means no.
        
         | cushychicken wrote:
         | I've never actually reached out to a webmaster to ask
         | permission, but I think that's a great idea. (They may even
         | have some suggestions for a better way to achieve what I'm
         | doing.)
         | 
         | How do you typically find contact info for someone like that?
         | 
         | I'm running _very_ generous politeness intervals at the moment
         | to try and ensure I'm not a nuisance - one query every two
         | seconds.
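         | 
         | Conceptually the throttle is nothing fancier than this kind of
         | sketch, where fetch() stands in for the actual HTTP call:
         | 
         |     import time
         | 
         |     MIN_INTERVAL = 2.0   # seconds between requests
         |     _last = 0.0
         | 
         |     def polite_get(url):
         |         global _last
         |         wait = MIN_INTERVAL - (time.monotonic() - _last)
         |         if wait > 0:
         |             time.sleep(wait)
         |         _last = time.monotonic()
         |         return fetch(url)   # placeholder for the real request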
        
           | marginalia_nu wrote:
           | If you can't find the address on the website, sometimes it's
           | in robots.txt (quick sketch below); you could also try
           | support@domain, admin@domain, or webmaster@domain.
           | 
           | Contact forms can work as well; usually they seem to get
           | forwarded to the IT department if they look like they have IT
           | stuff in them, and I've had reasonable success getting hold of
           | the right people that way.
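           | 
           | For the robots.txt check, here's the kind of quick, untested
           | sketch I mean - it just prints any line that has an address
           | or a crawl delay in it:
           | 
           |     import urllib.request
           | 
           |     def robots_hints(domain):
           |         url = f"https://{domain}/robots.txt"
           |         with urllib.request.urlopen(url) as r:
           |             text = r.read().decode("utf-8", errors="replace")
           |         for line in text.splitlines():
           |             low = line.lower()
           |             # contact info tends to sit in comment lines
           |             if "@" in line or low.startswith("crawl-delay"):
           |                 print(line.strip())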
        
             | cushychicken wrote:
             | I'll give that a shot!
             | 
             | Do you know if this is generally an in-house position for
             | companies that use third party platforms?
             | 
             | I ask because Workday has been the absolute bane of my
             | indexing existence, and I suspect they make it hard so they
             | can own distribution of jobs to approved search engines.
             | (Makes it easier to upcharge that way, I suppose.)
             | 
             | If the administrator for the job site is the Workday
             | customer (e.g. Qualcomm or NXP or whoever is using Workday
             | to host their job ads), I suspect I'd have a chance at
             | getting a better way to index their jobs. (My god, I'd love
             | API access if that's a thing I can get. I'd be a fly on the
             | wall in most cases - one index a day is plenty for my
             | purposes!)
        
         | gjs278 wrote:
         | complete waste of time
        
         | buffet_overflow wrote:
         | This is a nice approach. I generally leave a project-specific
         | email in the request headers with a similar short summary of my
         | goals.
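         | 
         | With Python's requests, that's just something along these
         | lines (every value here is made up; the From header is the
         | standard spot for a contact address):
         | 
         |     import requests
         | 
         |     HEADERS = {
         |         "User-Agent": "mybot/0.1 (+https://example.com/bot)",
         |         "From": "scraper-contact@example.com",
         |     }
         | 
         |     resp = requests.get("https://example.com/jobs",
         |                         headers=HEADERS, timeout=30)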
        
           | marginalia_nu wrote:
           | Yeah, my User-Agent is "search.marginalia.nu", and my contact
           | information is not hard to find on that site. Misbehaving
           | bots are very annoying and it's incredibly frustrating when
           | you can't get hold of the owner to tell them about it.
        
       | enigmatic02 wrote:
       | Ah the sanitization links were great, thanks!
       | 
        | Do you plan on handling sanitization of roles so people can
        | search by that? I ended up using a LONG CASE WHEN statement to
        | group roles into buckets - probably not ideal (rough sketch of
        | the idea below).
       | 
       | Doing something similar to you, but focused on startups and jobs
       | funded by Tier 1 investors: https://topstartups.io/
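       | 
       | For the role bucketing, roughly the shape of the thing as a
       | keyword-rule sketch in Python rather than SQL (the buckets and
       | keywords here are invented):
       | 
       |     # The real list is much longer and messier.
       |     ROLE_BUCKETS = [
       |         ("engineering", ["engineer", "developer", "firmware"]),
       |         ("data", ["data scientist", "analyst", "ml engineer"]),
       |         ("product", ["product manager", "product owner"]),
       |         ("design", ["designer", "ux researcher"]),
       |     ]
       | 
       |     def bucket_for(title):
       |         t = title.lower()
       |         for bucket, keywords in ROLE_BUCKETS:
       |             if any(k in t for k in keywords):
       |                 return bucket
       |         return "other"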
        
         | yolo3000 wrote:
         | Do you think there are a lot of job boards lately? Or have
         | they always been there? It feels like I've seen a lot of them
         | popping up this year.
        
       | cushychicken wrote:
       | Author heeyah. Would love your feedback, either here or at
       | @cushychicken on the tweet street.
        
         | funnyflamigo wrote:
         | Can you elaborate on what you mean by not interrupting the
         | scrape and instead flagging those pages?
         | 
         | Let's say you're scraping product info from a large list of
         | products. I'm assuming you mean handling strange one-off type
         | errors as they come up, and stopping altogether if too many
         | fail? Otherwise you'd just be DoS'ing the site.
        
           | cushychicken wrote:
           | _Can you elaborate on what you mean by not interrupting the
           | scrape and instead flagging those pages?_
           | 
           | Sure! I can get a little more concrete about this project
           | more easily than I can comment on your hypothetical about a
           | large list of products, though, so forgive me in advance for
           | pivoting on the scenario here.
           | 
           | I'm scraping job pages. Typically, one job posting == one
           | link. I can go through that link for the job posting and
           | extract data from given HTML elements using CSS selectors or
           | XPath statements. However, sometimes the data I'm looking for
           | isn't structured in a way I expect. The major area where I
           | notice variation in job ad data is location. There are a
           | zillion little variations in how you can structure the
           | location of a job ad. City+country, city+state+country, comma
           | separated, space separated, localized states, no states or
           | provinces, all the permutations thereof.
           | 
           | I've written the extractor to expect a certain format of
           | location data for a given job site - let's say "<city>,
           | <country>", for example. If the scraper comes across an entry
           | that happens to be "<city>, <state>, <country>", it's
           | generally not smart enough to generalize its transform logic
           | to deal with that. So, to handle it, I mark that particular
           | job page link as needing human review, so it pops up as an
           | ERROR in my logs, and as an entry in the database that has
           | post_status == 5. After that, it gets inserted into the
           | database, but not posted live onto the site.
           | 
           | That way, I can go in and manually fix the posting, approve
           | it to go on the site (if it's relevant), and, ideally, tweak
           | the scraper logic so that it handles transforms of that style
           | of data formatting as well as the "<city>, <country>" format
           | I originally expected.
           | 
           | Does that make sense?
           | 
           | I suspect I'm just writing logic to deal with
           | malformed/irregular entries that humans type into job sites
           | XD
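           | 
           | In rough Python terms, the handling boils down to a sketch
           | like this (extract_fields, parse_location, and insert_into_db
           | are stand-ins, not my real code, and the 0 for "ready to
           | publish" is just illustrative):
           | 
           |     import logging
           | 
           |     NEEDS_REVIEW = 5   # post_status for "needs a human look"
           | 
           |     def process_job(link, html):
           |         job = extract_fields(html)   # CSS/XPath extraction
           |         try:
           |             loc = parse_location(job["location_raw"])
           |             job["city"], job["country"] = loc
           |             job["post_status"] = 0   # made-up "ready" value
           |         except ValueError:
           |             logging.error("odd location on %s: %r",
           |                           link, job.get("location_raw"))
           |             job["post_status"] = NEEDS_REVIEW
           |         insert_into_db(job)   # stored either way
           | 
           | Rows flagged for review get inserted but never go live on
           | the site until I've looked at them.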
        
             | marginalia_nu wrote:
             | I've had a lot of success just saving the data into gzipped
             | tarballs, like a few thousand documents per tarball. That
             | way I can replay the data and tweak the algorithms without
             | causing traffic.
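             | 
             | In Python terms, a rough equivalent of the idea (not my
             | actual crawler code; fetched_pages and reprocess are
             | placeholders):
             | 
             |     import io, tarfile
             | 
             |     # Write each fetched page into a gzipped tarball.
             |     with tarfile.open("batch-0001.tar.gz", "w:gz") as tar:
             |         for url, html in fetched_pages:  # (str, bytes)
             |             name = url.replace("/", "_")
             |             info = tarfile.TarInfo(name=name)
             |             info.size = len(html)
             |             tar.addfile(info, io.BytesIO(html))
             | 
             |     # Replay later: read the archive, don't re-fetch.
             |     with tarfile.open("batch-0001.tar.gz", "r:gz") as tar:
             |         for member in tar:
             |             reprocess(tar.extractfile(member).read())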
        
               | cushychicken wrote:
               | Is that still practical even if you're storing the page
               | text?
               | 
               | The reason I don't do that is because I have a few
               | functions that analyze the job descriptions for
               | relevance, but don't store the post text. I mostly did
               | that to save space - I'm just aggregating links to
               | relevant roles, not hosting job posts.
               | 
               | I figured saving ~1000 job descriptions would take up a
               | needlessly large chunk of space, but truth be told I
               | never did the math to check.
               | 
               | Edit: I understand scrapy does something similar to what
               | you're describing; have considered using that as my
               | scraper frontend but haven't gotten around to doing the
               | work for it yet.
        
               | marginalia_nu wrote:
               | Yeah, sure. The text itself is usually at most a few
               | hundred KB, and HTML compresses extremely well. It's
               | pretty slow to unpack and replay the documents, but it's
               | still a lot faster than downloading them again.
        
               | MrMetlHed wrote:
               | And it's friendlier to the server you're getting the data
               | from.
               | 
               | As a journalist, I have to scrape government sites now
               | and then for datasets they won't hand over via FOIA
               | requests ("It's on our site, that's the bare minimum to
               | comply with the law so we're not going to give you the
               | actual database we store this information in.") They're
               | notoriously slow and often will block any type of
               | systematic scraping. Better to get whatever you can and
               | save it, then run your parsing and analysis on that
               | instead of hoping you can get it from the website again.
        
               | muxator wrote:
               | First of all, thanks for marginalia.nu.
               | 
               | Have you considered storing compressed blobs in a sqlite
               | file? Works fine for me; you can do indexed searches on
               | your "stored" data, and can extract single pages if you
               | want.
        
               | marginalia_nu wrote:
               | The main reason I'm doing it this way is because I'm
               | saving this stuff to a mechanical drive, and I want
               | consistent write performance and low memory overhead.
               | Since it's essentially just an archive copy, I don't mind
               | if it takes half an hour to chew through looking for some
               | particular set of files. Since this is a format designed
               | for tape drives, it causes very little random access.
               | It's important that writes are relatively consistent,
               | since my crawler writes while it's crawling, and it can
               | reach speeds of 50-100 documents per second, which would
               | be extremely rough on any sort of database based on a
               | single mechanical hard drive.
               | 
               | These archives are just an intermediate stage that's used
               | if I need to reconstruct the index to tweak, say, keyword
               | extraction or something, so random access performance
               | isn't particularly important.
        
         | gringo_flamingo wrote:
         | I can truly relate to this article, especially where you
         | mentioned trying to extract only the specific contents of
         | elements that you need, without bloating your software. To me,
         | that seemed intuitive with the minimal experience I have in web
         | scraping. However, I ended up fighting the frameworks. Me being
         | stubborn, I did not try your approach and kept trying to be a
         | perfectionist about it LMAO. Thank you for this read, glad I am
         | not the only one who has been through this. Haha...
        
           | cushychicken wrote:
           | Yeah it's an easy thing to get into a perfectionist streak
           | over.
           | 
           | Thinking about separation of concerns helped me a _lot_ in
           | getting over the hump of perfectionism. Once I realized I was
           | trying to make my software do too much, it was easier to see
           | how it would be much less work to write as two separate
           | programs bundled together. (Talking specifically about the
           | extract /transform stages here.)
           | 
           | Upon reflection, this project has been just as much self-
           | study of good software engineering practices as it has been
           | learning how to scrape job sites XD
        
         | cogburnd02 wrote:
         | There's a project called woob (https://woob.tech/) that
         | implements Python scripts that are a little bit like scrapers,
         | but only 'scrape' on demand, in response to requests from
         | console & desktop programs.
         | 
         | How much of this article do you think would apply to something
         | like that? e.g. something like 'wait a second (or even two!)
         | between successive hits' might not be necessary (one could
         | perhaps shorten it to 1/4 second) if one is only doing a few
         | requests followed by long periods of no requests.
        
           | cushychicken wrote:
           | Interesting question. My first instinct is to say that woob
           | seems closer in use case to a browser than a scraper, as it
           | seems largely geared towards making rich websites more easily
           | accessible. (If I'm reading the page right, anyway.) A
           | scraper is basically just hitting a web page over and over
           | again, as fast as you can manage.
           | 
           | The trick, IMO, is to keep your load on the server closer to
           | browser-level than scraper-level. Make sense?
        
       | ggambetta wrote:
       | I've written scrapers over the years, mostly for fun, and I've
       | followed a different approach. Re. "don't interrupt the scrape",
       | whenever URLs are stable, I keep a local cache of downloaded
       | pages, and have a bit of logic that checks the cache first when
        | retrieving a URL. This way you can restart the scrape at any
       | time and most accesses will not hit the network until the point
       | where the previous run was interrupted.
       | 
       | This also helps with the "grab more than you think you need" part
       | - just grab the whole page! If you later realize you needed to
       | extract more than you thought, you have everything in the local
       | cache, ready to be processed again.
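       | 
       | The cache check itself is tiny - a minimal version looks
       | something like this sketch (hashing the URL for the file name
       | is just one option):
       | 
       |     import hashlib, os, urllib.request
       | 
       |     CACHE_DIR = "page_cache"
       | 
       |     def get_page(url):
       |         os.makedirs(CACHE_DIR, exist_ok=True)
       |         name = hashlib.sha256(url.encode()).hexdigest()
       |         path = os.path.join(CACHE_DIR, name)
       |         if os.path.exists(path):     # cache hit: no network
       |             with open(path, "rb") as f:
       |                 return f.read()
       |         with urllib.request.urlopen(url) as r:   # cache miss
       |             data = r.read()
       |         with open(path, "wb") as f:
       |             f.write(data)
       |         return data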
        
         | cushychicken wrote:
         | You're not the first person in this thread to suggest grabbing
         | the whole page text. I've never tried it, just because I assumed
         | it would take so much space as to be impractical, but I don't
         | see the harm in trying!
        
           | muxator wrote:
           | My current favorite cache for whole pages is a single sqlite
           | file with the page source stored with brotli compression.
           | Additional columns for any metadata you might need (URL,
           | scraping sessionid, age). The resulting file is big (but
           | brotli for this is even better than zstd), and having a
           | single file is very convenient.
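           | 
           | A stripped-down sketch of the idea, with zlib from the
           | standard library standing in for brotli (which needs an
           | extra package); html_bytes is the raw page source:
           | 
           |     import sqlite3, time, zlib
           | 
           |     con = sqlite3.connect("pages.db")
           |     con.execute("""CREATE TABLE IF NOT EXISTS pages (
           |                        url TEXT PRIMARY KEY,
           |                        session_id TEXT,
           |                        fetched_at REAL,
           |                        body BLOB)""")
           | 
           |     def store(url, html_bytes, session_id):
           |         q = "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)"
           |         con.execute(q, (url, session_id, time.time(),
           |                         zlib.compress(html_bytes)))
           |         con.commit()
           | 
           |     def load(url):
           |         q = "SELECT body FROM pages WHERE url = ?"
           |         row = con.execute(q, (url,)).fetchone()
           |         return zlib.decompress(row[0]) if row else None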
        
       ___________________________________________________________________
       (page generated 2021-12-02 23:01 UTC)