[HN Gopher] Notes on Writing Web Scrapers
___________________________________________________________________
Notes on Writing Web Scrapers
Author : cushychicken
Score : 57 points
Date : 2021-12-02 18:03 UTC (4 hours ago)
(HTM) web link (cushychicken.github.io)
(TXT) w3m dump (cushychicken.github.io)
| throwaway894345 wrote:
| > When you're reading >1000 results a day, and you're inserting
| generous politeness intervals, an end-to-end scrape is expensive
| and time consuming. As such, you don't want to interrupt it. Your
| exception handling should be airtight.
|
| Seems like it would be more robust to just persist the state of
| your scrape so you can resume it. In general, I try to minimize
| the amount of code that a developer _MUST_ get right.
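|
| A minimal sketch of that idea, assuming a simple JSON
| checkpoint file (the filename and structure are only
| illustrative):
|
|     import json
|     import os
|
|     CHECKPOINT = "scrape_state.json"  # hypothetical file
|
|     def load_done():
|         # Resume from a previous run if a checkpoint exists.
|         if os.path.exists(CHECKPOINT):
|             with open(CHECKPOINT) as f:
|                 return set(json.load(f))
|         return set()
|
|     def mark_done(done, url):
|         # Record each finished URL so an interrupted run can
|         # pick up where it left off instead of starting over.
|         done.add(url)
|         with open(CHECKPOINT, "w") as f:
|             json.dump(sorted(done), f)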
| marginalia_nu wrote:
| > Be nice to your data sources.
|
| Very much this. If I know I'm about to consume a non-negligible
| amount of resources from a server by indexing a website with
| lots of subdomains, I typically send the webmaster an email
| asking whether that's fine, telling them what I use the data for
| and how often my crawler re-visits a site, and asking whether I
| should set any specific crawl delays or crawl at particular
| times of day.
|
| In every case I've done this, they have been fine with it. It
| goes down a lot better than just barging in and grabbing data.
| This also gives them a way of reaching me if something should go
| wrong. I'm not just some IP address grabbing thousands of
| documents, I'm a human person they can talk to.
|
| If I got a no, I would respect that too and not try to
| circumvent it like some entitled asshole. No means no.
| cushychicken wrote:
| I've never actually reached out to a webmaster to ask
| permission, but I think that's a great idea. (They may even
| have some suggestions for a better way to achieve what I'm
| doing.)
|
| How do you typically find contact info for someone like that?
|
| I'm running _very_ generous politeness intervals at the moment
| to try and ensure I'm not a nuisance - one query every two
| seconds.
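|
| (A rough sketch of how an interval like that can be enforced -
| nothing fancier than a sleep; the two seconds is just the
| number above.)
|
|     import time
|
|     POLITENESS_INTERVAL = 2.0  # seconds between requests
|     _last_request = 0.0
|
|     def polite_wait():
|         # Sleep so consecutive requests are at least
|         # POLITENESS_INTERVAL seconds apart.
|         global _last_request
|         elapsed = time.monotonic() - _last_request
|         if elapsed < POLITENESS_INTERVAL:
|             time.sleep(POLITENESS_INTERVAL - elapsed)
|         _last_request = time.monotonic()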
| marginalia_nu wrote:
| If you can't find the address on the website, sometimes it's
| in robots.txt, you could also try support@domain,
| admin@domain or webmaster@domain.
|
| Contact forms can work as well; they usually seem to get
| forwarded to the IT department if they look like they have IT
| stuff in them, and I've had reasonable success getting hold of
| the right people that way.
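|
| robots.txt is also easy to check programmatically; a rough
| sketch (the email-spotting regex is just an assumption, not
| any kind of standard):
|
|     import re
|     import urllib.request
|     from urllib.robotparser import RobotFileParser
|
|     def robots_info(domain):
|         url = f"https://{domain}/robots.txt"
|         # Some sites leave a contact address in a comment,
|         # so look for anything email-shaped in the raw file.
|         raw = urllib.request.urlopen(url).read()
|         raw = raw.decode("utf-8", "replace")
|         emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", raw)
|         # RobotFileParser exposes any Crawl-delay directive.
|         rp = RobotFileParser(url)
|         rp.read()
|         return emails, rp.crawl_delay("*")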
| cushychicken wrote:
| I'll give that a shot!
|
| Do you know if this is generally an in-house position for
| companies that use third party platforms?
|
| I ask because Workday has been the absolute bane of my
| indexing existence, and I suspect they make it hard so they
| can own distribution of jobs to approved search engines.
| (Makes it easier to upcharge that way, I suppose.)
|
| If the administrator for the job site is the Workday
| customer (e.g. Qualcomm or NXP or whoever is using Workday
| to host their job ads), I suspect I'd have a chance at
| getting a better way to index their jobs. (My god, I'd love
| API access if that's a thing I can get. I'd be a fly on the
| wall in most cases - one index a day is plenty for my
| purposes!)
| gjs278 wrote:
| complete waste of time
| buffet_overflow wrote:
| This is a nice approach. I generally leave a project specific
| email in the request headers with a similar short summary of my
| goals.
| marginalia_nu wrote:
| Yeah, my User-Agent is "search.marginalia.nu", and my contact
| information is not hard to find on that site. Misbehaving
| bots are very annoying and it's incredibly frustrating when
| you can't get hold of the owner to tell them about it.
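|
| In Python/requests terms it's basically a one-liner; the
| agent string and address below are placeholders, not my
| real ones:
|
|     import requests
|
|     HEADERS = {
|         # Identify the bot and give the webmaster a way to
|         # reach you if it misbehaves.
|         "User-Agent": "example-bot "
|                       "(https://example.com; bot@example.com)",
|     }
|
|     resp = requests.get("https://example.com/page",
|                         headers=HEADERS)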
| enigmatic02 wrote:
| Ah the sanitization links were great, thanks!
|
| Do you plan on handling sanitization of roles so people can
| search by that? I ended up using a LONG CASE WHEN statement to
| group roles into buckets, which is probably not ideal.
|
| Doing something similar to you, but focused on startups and jobs
| funded by Tier 1 investors: https://topstartups.io/
| yolo3000 wrote:
| Do you think there are a lot of job boards lately? Or have
| always been there? It feels like I've seen a lot of them
| popping up this year.
| cushychicken wrote:
| Author heeyah. Would love your feedback, either here or at
| @cushychicken on the tweet street.
| funnyflamigo wrote:
| Can you elaborate on what you mean by not interrupting the
| scrape and instead flagging those pages?
|
| Let's say you're scraping product info from a large list of
| products. I'm assuming you mean handling the strange one-off
| errors, and that you'd stop altogether if too many fail?
| Otherwise you'd just be DoS'ing the site.
| cushychicken wrote:
| _Can you elaborate on what you mean by not interrupting the
| scrape and instead flagging those pages?_
|
| Sure! It's easier for me to get concrete about this project
| than about your hypothetical large list of products, though,
| so forgive me in advance for pivoting on the scenario here.
|
| I'm scraping job pages. Typically, one job posting == one
| link. I can go through that link for the job posting and
| extract data from given HTML elements using CSS selectors or
| XPath statements. However, sometimes the data I'm looking for
| isn't structured in a way I expect. The major area I notice
| variations in job ad data is location data. There are a
| zillion little variations in how you can structure the
| location of a job ad. City+country, city+state+country, comma
| separated, space separated, localized states, no states or
| provinces, all the permutations thereof.
|
| I've written the extractor to expect a certain format of
| location data for a given job site - let's say "<city>,
| <country>", for example. If the scraper comes across an entry
| that happens to be "<city>, <state>, <country>", it's
| generally not smart enough to generalize its transform logic
| to deal with that. So, to handle it, I mark that particular
| job page link as needing human review: it pops up as an ERROR
| in my logs and gets inserted into the database with
| post_status == 5, which keeps it from being posted live onto
| the site.
|
| That way, I can go in and manually fix the posting, approve
| it to go on the site (if it's relevant), and, ideally, tweak
| the scraper logic so that it handles transforms of that style
| of data formatting as well as the "<city>, <country>" format
| I originally expected.
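|
| As a rough sketch of that flow (log and insert_job are
| hypothetical stand-ins for my actual logging and DB code;
| the parsing itself is simplified):
|
|     import logging
|
|     log = logging.getLogger(__name__)
|
|     NEEDS_REVIEW = 5  # post_status for "human must look"
|     LIVE = 1          # hypothetical "approved" status
|
|     def parse_location(raw):
|         # Expect "<city>, <country>"; anything else is
|         # treated as unexpected.
|         parts = [p.strip() for p in raw.split(",")]
|         if len(parts) == 2:
|             return {"city": parts[0], "country": parts[1]}
|         return None
|
|     def handle_job(link, raw_location):
|         loc = parse_location(raw_location)
|         if loc is None:
|             # Don't interrupt the scrape: log it loudly and
|             # store the row with a status that keeps it off
|             # the live site.
|             log.error("bad location %r at %s",
|                       raw_location, link)
|             # insert_job() stands in for the real DB insert.
|             insert_job(link, None, post_status=NEEDS_REVIEW)
|         else:
|             insert_job(link, loc, post_status=LIVE)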
|
| Does that make sense?
|
| I suspect I'm just writing logic to deal with
| malformed/irregular entries that humans make into job sites
| XD
| marginalia_nu wrote:
| I've had a lot of success just saving the data into gzipped
| tarballs, like a few thousand documents per tarball. That
| way I can replay the data and tweak the algorithms without
| causing traffic.
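|
| Not my actual code, but in Python the idea looks roughly
| like this (batch naming and sizes are arbitrary):
|
|     import io
|     import tarfile
|
|     def write_batch(batch_no, docs):
|         # docs: iterable of (name, html bytes). Each batch
|         # of a few thousand documents becomes one tarball.
|         path = f"crawl-{batch_no:05d}.tar.gz"
|         with tarfile.open(path, "w:gz") as tar:
|             for name, html in docs:
|                 info = tarfile.TarInfo(name=name)
|                 info.size = len(html)
|                 tar.addfile(info, io.BytesIO(html))
|
|     def replay(path):
|         # Re-read a batch later to tweak extraction logic
|         # without causing any traffic.
|         with tarfile.open(path, "r:gz") as tar:
|             for member in tar:
|                 f = tar.extractfile(member)
|                 if f is not None:
|                     yield member.name, f.read()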
| cushychicken wrote:
| Is that still practical even if you're storing the page
| text?
|
| The reason I don't do that is because I have a few
| functions that analyze the job descriptions for
| relevance, but don't store the post text. I mostly did
| that to save space - I'm just aggregating links to
| relevant roles, not hosting job posts.
|
| I figured saving ~1000 job descriptions would take up a
| needlessly large chunk of space, but truth be told I
| never did the math to check.
|
| Edit: I understand scrapy does something similar to what
| you're describing; I've considered using that as my
| scraper frontend but haven't gotten around to doing the
| work for it yet.
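|
| (If I remember the settings right, it's just a toggle in
| settings.py - treat the names below as approximate:)
|
|     # settings.py
|     HTTPCACHE_ENABLED = True
|     HTTPCACHE_DIR = "httpcache"
|     HTTPCACHE_EXPIRATION_SECS = 0  # never expire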
| marginalia_nu wrote:
| Yeah, sure. The text itself is usually at most a few
| hundred KB, and HTML compresses extremely well. Like it's
| pretty slow to unpack and replay the documents, but it's
| still a lot faster than downloading them again.
| MrMetlHed wrote:
| And it's friendlier to the server you're getting the data
| from.
|
| As a journalist, I have to scrape government sites now
| and then for datasets they won't hand over via FOIA
| requests ("It's on our site, that's the bare minimum to
| comply with the law so we're not going to give you the
| actual database we store this information in.") They're
| notoriously slow and often will block any type of
| systematic scraping. Better to get whatever you can and
| save it, then run your parsing and analysis on that
| instead of hoping you can get it from the website again.
| muxator wrote:
| First of all, thanks for marginalia.nu.
|
| Have you considered storing compressed blobs in a sqlite
| file? Works fine for me, you can do indexed searches on
| your "stored" data, and can extract single pages if you
| want.
| marginalia_nu wrote:
| The main reason I'm doing it this way is because I'm
| saving this stuff to a mechanical drive, and I want
| consistent write performance and low memory overhead.
| Since it's essentially just an archive copy, I don't mind
| if it takes half an hour to chew through looking for some
| particular set of files. Since this is a format designed
| for tape drives, it causes very little random access.
| It's important that writes are relatively consistent,
| since my crawler writes while it's crawling, and it can
| reach speeds of 50-100 documents per second, which would
| be extremely rough on any sort of database sitting on a
| single mechanical hard drive.
|
| These archives are just an intermediate stage that's used
| if I need to reconstruct the index to tweak say keyword
| extraction or something, so random access performance
| isn't something that is particularly useful.
| gringo_flamingo wrote:
| I can truly relate to this article, especially where you
| mentioned trying to extract only the specific contents of
| elements that you need, without bloating your software. To me,
| that seemed intuitive with the minimal experience I have in web
| scraping. However, I ended up fighting the frameworks. Me being
| stubborn, I did not try your approach and kept trying to be a
| perfectionist about it LMAO. Thank you for this read, glad I am
| not the only one who has been through this. Haha...
| cushychicken wrote:
| Yeah it's an easy thing to get into a perfectionist streak
| over.
|
| Thinking about separation of concerns helped me a _lot_ in
| getting over the hump of perfectionism. Once I realized I was
| trying to make my software do too much, it was easier to see
| how it would be much less work to write as two separate
| programs bundled together. (Talking specifically about the
| extract/transform stages here.)
|
| Upon reflection, this project has been just as much self-
| study of good software engineering practices as it has been
| learning how to scrape job sites XD
| cogburnd02 wrote:
| There's a project called woob (https://woob.tech/) that
| implements python scripts that are a little bit like scrapers
| but only 'scrape' on demand, in response to requests from
| console & desktop programs.
|
| How much of this article do you think would apply to something
| like that? e.g. something like 'wait a second (or even two!)
| between successive hits' might not be necessary (one could
| perhaps shorten it to 1/4 second) if one is only doing a few
| requests followed by long periods of no requests.
| cushychicken wrote:
| Interesting question. My first instinct is to say that woob
| seems closer in use case to a browser than a scraper, as it
| seems largely geared towards making rich websites more easily
| accessible. (If I'm reading the page right, anyway.) A
| scraper is basically just hitting a web page over and over
| again, as fast as you can manage.
|
| The trick, IMO, is to keep the load you put on a server
| closer to browser level than scraper level. Make sense?
| ggambetta wrote:
| I've written scrapers over the years, mostly for fun, and I've
| followed a different approach. Re. "don't interrupt the scrape",
| whenever URLs are stable, I keep a local cache of downloaded
| pages, and have a bit of logic that checks the cache first when
| retrieving a URL. This way you can restart the scrape at any
| time and most accesses will not hit the network until the point
| where the previous run was interrupted.
|
| This also helps with the "grab more than you think you need" part
| - just grab the whole page! If you later realize you needed to
| extract more than you thought, you have everything in the local
| cache, ready to be processed again.
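|
| A bare-bones version of that cache-first lookup (the
| on-disk layout here is just one way to do it):
|
|     import hashlib
|     import pathlib
|     import requests
|
|     CACHE_DIR = pathlib.Path("page_cache")
|
|     def fetch(url):
|         # Check the local cache first; only hit the network
|         # on a miss, so a restarted scrape is mostly free.
|         CACHE_DIR.mkdir(exist_ok=True)
|         key = hashlib.sha256(url.encode()).hexdigest()
|         path = CACHE_DIR / key
|         if path.exists():
|             return path.read_bytes()
|         html = requests.get(url).content
|         path.write_bytes(html)
|         return html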
| cushychicken wrote:
| You're not the first person in this thread to suggest grabbing
| the whole page text. I've never tried it, just because I
| assumed it would take too much space to be practical, but I
| don't see the harm in trying!
| muxator wrote:
| My current favorite cache for whole pages is a single sqlite
| file with the page source stored with brotli compression.
| Additional columns for any metadata you might need (URL,
| scraping sessionid, age). The resulting file is big (but
| brotli for this is even better than zstd), and having a
| single file is very convenient.
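|
| The schema can be as small as this; a sketch assuming the
| third-party brotli package (column names are just my
| choice):
|
|     import sqlite3
|     import brotli
|
|     db = sqlite3.connect("pages.db")
|     db.execute("""
|         CREATE TABLE IF NOT EXISTS pages (
|             url        TEXT PRIMARY KEY,
|             session_id TEXT,
|             fetched_at TEXT,
|             body       BLOB  -- brotli-compressed source
|         )""")
|
|     def store(url, session_id, fetched_at, html):
|         db.execute(
|             "INSERT OR REPLACE INTO pages VALUES (?,?,?,?)",
|             (url, session_id, fetched_at,
|              brotli.compress(html)))
|         db.commit()
|
|     def load(url):
|         row = db.execute(
|             "SELECT body FROM pages WHERE url = ?",
|             (url,)).fetchone()
|         return brotli.decompress(row[0]) if row else None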
___________________________________________________________________
(page generated 2021-12-02 23:01 UTC)