[HN Gopher] The State of Web Scraping in 2021
___________________________________________________________________
The State of Web Scraping in 2021
Author : marvram
Score : 186 points
Date : 2021-10-11 12:21 UTC (10 hours ago)
(HTM) web link (mihaisplace.blog)
(TXT) w3m dump (mihaisplace.blog)
| nathell wrote:
| I'll chime in with mine: Skyscraper (Clojure) [0] builds on
| Enlive/Reaver (which in turn build on JSoup), but tries to
| address cross-cutting concerns like caching, fetching HTML
| (preferably in parallel), throttling, retries, navigation,
| emitting the output as a dataset, etc.
| Jenkins2000 wrote:
| For Java/Kotlin, HtmlUnit is generally pretty great.
| say_it_as_it_is wrote:
 | Pyppeteer is feature complete and worth noting:
| https://github.com/pyppeteer/pyppeteer
| amelius wrote:
| > Crawl at off-peak traffic times. If a news service has most of
| its users present between 9 am and 10 pm - then it might be good
| to crawl around 11 pm or in the wee hours of the morning.
|
| How do you know this if it is not your website?
|
| Also, the internet has no time zone.
| chucky wrote:
| For sites where there is a peak usage time, it's probably
| obvious what that peak usage time is. A news service (their
| example) presumably primarily serves a country or a region -
| then off-peak traffic times are likely at night.
|
| The Internet has no time zone, but its human users all do.
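The off-peak heuristic discussed above can be sketched with Python's stdlib; the 9am-10pm window comes from the article's example, and the target timezone is an assumption you would make per site:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def is_off_peak(utc_now: datetime, tz_name: str,
                peak_start: int = 9, peak_end: int = 22) -> bool:
    """True if the site's presumed local time falls outside the
    9am-10pm peak window from the article's example."""
    local = utc_now.astimezone(ZoneInfo(tz_name))
    return not (peak_start <= local.hour < peak_end)

# A news site presumed to serve a US-east audience:
now = datetime(2021, 10, 11, 4, 0, tzinfo=timezone.utc)  # midnight EDT
print(is_off_peak(now, "America/New_York"))
```

The timezone guess is the weak link, as the parent comment points out; for a site with a clearly regional audience it is usually a safe one.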
| numeralls wrote:
 | If you're scraping a popular website, Google Trends should be
 | a pretty good proxy.
| m_ke wrote:
| Another tip, there are a few browser extensions that can record
| your interactions and generate a playwright script.
|
| Here's one: https://chrome.google.com/webstore/detail/headless-
| recorder/...
| sidharthv wrote:
| If you don't want to install another extension, Playwright has
| built in support for recording.
|
| npx playwright codegen wikipedia.org
|
| https://playwright.dev/docs/next/codegen
| jrochkind1 wrote:
 | I'm not familiar with "playwright"; it doesn't seem to be
 | mentioned in the OP either.
|
| When I google, I see it advertised as a "testing" tool.
|
| Can I also use it for scraping? Where would I learn more about
| doing so?
| xnyan wrote:
| Playwright is essentially a headless chrome, firefox, and
| webkit browser with a nice API that's intended for
 | automation/scraping. It's far heavier than something like
| curl, but it has all the capabilities of any browser you want
| (not just chrome as with puppeteer) and makes stuff like
| interacting with javascript a breeze.
|
| It's similar to Google's puppeteer, but in my opinion even
| with chrome much more pleasant and productive. Microsoft's
| best developer tool IMO, saves me tons of time.
| rmetzler wrote:
| Playwright is similar to Puppeteer, but can use different
| browsers not only Chrome.
| lucasverra wrote:
| any recommendation to scrape behind google social login?
| SubiculumCode wrote:
| I've not followed this space. When I did, there were a lot of
| questions concerning the legality of automated scraping. Have
| those legal issues been resolved?
| rafale wrote:
 | What's the best way to get around AWS/Azure/... IP range bans and
 | VPN bans when scraping?
| killingtime74 wrote:
| There are other providers. I think big time scrapers use
| residential IPs
| jmuguy wrote:
| The large proxy providers operate in a sort of gray market. You
| pay for "residential" or "ISP" based IP addresses. In some
| instances these proxy connections are literally being tunneled
| through browser extensions running on a real world system
| somewhere (https://hola.org/ for instance)
| [deleted]
| beardyw wrote:
 | I tried Python/BeautifulSoup and Node/Puppeteer recently. It may
| be because my Python is poor, but puppeteer seemed more natural
| to me. Injecting functionality into a properly formed web page
| felt quite powerful and started me thinking about what you could
| do with it.
| Quessked73 wrote:
| Does anyone have a resource for getting into app-based scraping,
| if the API is obfuscated or rate limited?
| elorant wrote:
| Run an http debugger, or some proxy, and find the endpoints.
| colinramsay wrote:
| If you're familiar with Go, there's Colly too [1]. I liked its
| simplicity and approach and even wrote a little wrapper around it
| to run it via Docker and a config file:
|
| https://gotripod.com/insights/super-simple-site-crawling-and...
|
| [1] http://go-colly.org/
| IceWreck wrote:
| +1 for Go. Its easy concurrency makes it an awesome language
| for web scraping.
|
| The go-colly framework was a bit too restrictive for my needs,
| but its very easy to build something on top of the standard
| lib's net/http, its cookiejar, and a third party library called
| goquery (afaik go-colly uses this too).
|
 | Fun fact: we were scraping something from an Azure blob
 | container with apparently zero rate limits, and we had to
 | enumerate around a million URLs daily (we didn't know which URLs
 | actually existed, so we guessed an offset and enumerated from
 | there; we also had to do it at a fixed time daily). We had
 | proxies at our disposal but didn't need them because the blob
 | container did not rate-limit.
|
| I wrote the scraper in Go, but a friend wrote it in Rust. Using
| Go was fast enough, satisfying all our requirements, but it
| turned out that the Rust one was 3-5 times faster. I tried to
| improve the Go scraper's speed by tweaking net/http transport's
| parameters, increasing workers, removing all NOFILE limits from
| SystemD and tried to profile and remove the low hanging speed
| issues. Nothing reduced the gap. Then I replaced the net/http
| client with valyala/fasthttp (another http implementation in
| Go) , which made it as fast as or slightly faster than the Rust
| one which was using the reqwest crate as http client.
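The commenter's code is Go and Rust, but the guess-and-enumerate pattern itself is language-agnostic. A minimal stdlib Python sketch with a thread pool; the URL template, offsets, and the `fetch` stub are all made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical container layout: sequential numeric blob names.
BASE = "https://example.blob.core.windows.net/container/{}"

def fetch(url: str) -> bool:
    # Placeholder for a real HTTP HEAD/GET; returns True if the
    # blob exists. Here we fake it so the sketch runs offline.
    return url.endswith(("0", "5"))

def enumerate_blobs(start: int, count: int, workers: int = 32) -> list[str]:
    """Probe `count` guessed URLs from `start`, in parallel."""
    urls = [BASE.format(start + i) for i in range(count)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(fetch, urls)
    return [u for u, ok in zip(urls, hits) if ok]

print(len(enumerate_blobs(1000, 20)))
```

With a real `fetch` doing network I/O, the worker count (not raw language speed) usually dominates throughput, which is consistent with the fasthttp swap closing the gap in the anecdote above.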
| hivacruz wrote:
| I used this library to get familiar with Go. It is indeed very
| powerful and really easy to create a scraper.
|
 | My main concern though was about testing. What if you want to
 | create tests to check that your scraper still gets the data you
 | want? Colly allows nested scraping and it's easy to implement,
 | but you have all your logic in one big function, making it
 | harder to test.
|
| Did you find a solution to this? I'm considering switching to
| net/http + GoQuery only to have more freedom.
| colinramsay wrote:
| Not yet but my plan was to just have a static HTML site which
| the tests could run against.
| throwawaysea wrote:
| Is there open source software that can extract the "content" part
| of a given page cleanly? I'm thinking about what the reader mode
| in browsers can do as an example, where the main content is
| somehow isolated and displayed.
| specproc wrote:
| I believe the main library for reader mode is called
| readability. I played around with a python implementation a
| while back. Just pipe in your raw html as part of the process.
| It's good, but not flawless. If I remember correctly, it
| included some quotes and image text as part of the body for the
| site I tried it on.
| stef25 wrote:
| There's a PHP port of Readability and it works for some
| sites, for others not at all. Very far from perfect.
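For a flavor of what these libraries do, here is a deliberately naive stdlib sketch that keeps only text inside `<p>` tags; the real Readability algorithm goes much further, scoring DOM nodes by link density, text length, class names, and so on:

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Crude reader-mode stand-in: collect text found inside <p>
    tags while skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside <p>
        self.skip = 0    # nesting depth inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.chunks.append(data.strip())

def extract_paragraphs(html: str) -> str:
    p = ParagraphText()
    p.feed(html)
    return " ".join(c for c in p.chunks if c)
```

As the commenters note, even the real implementations misfire on quotes and captions; a heuristic this simple would misfire far more often.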
| tannhaeuser wrote:
| You can use SGML (on which HTML is/was based) and my LGPL-
| licensed sgmljs package [1] for that, plus my SGML DTD grammar
| for HTML5. [2] describes common tasks in preservation of Web
| content to give you a flavor, but you can customize what SGML
| does with your markup to death really; in your case, you'll
| probably want to throw away divs and navs to get clean semantic
| HTML which you can do using SGML link processes (= pipeline of
| markup filters and transformations), but you could also convert
| HTML into canonical markup (eg XML) and use Turing-complete XML
| processing tools such as XSLT as described in the linked
| tutorial.
|
| [1]: http://sgmljs.net
|
| [2]: http://sgmljs.net/docs/parsing-html-tutorial/parsing-html-
| tu...
| toastal wrote:
 | OCaml's Lambda Soup (https://aantron.github.io/lambdasoup/) is an
 | amazing library, especially for those who prefer functional
 | programming.
| dec0dedab0de wrote:
| Scraping things that don't want to be scraped is one of my
| favorite things to do. At work this is usually an interface for
| some sort of "network appliance." Though with the push for REST
 | APIs over the last 6 years or so, I don't have a need to do it
 | all too often. Plus with things like selenium it's too easy to
 | just run the page as is, and I can't justify spending the time
 | to figure out the undocumented API.
|
| My favorite one implemented CSRF protections by polling an
| endpoint, and adding in the hashed data from that endpoint and a
| timestamp on every request.
|
 | When I hear a junior dev give up on something because the API
 | doesn't provide the functionality of the UI, it makes me very sad
 | that they're missing out.
| eastendguy wrote:
| > Scraping things that don't want to be scraped
|
| If all else fails, no website can withstand OCR-based screen
| scraping. It is slow(er), but fast enough for many use cases.
| elorant wrote:
| Assuming that you eventually manage to load the page somehow.
| Which in some edge cases may entail simulating mouse
| movements and random delays.
| mkl wrote:
| A browser extension is probably an easier way to extract text
| than OCR (unless you're targeting a wide range of sites, I
| suppose).
| timwis wrote:
| Have you tried on a page protected by cloudflare captcha?
| tmpz22 wrote:
| To be fair selenium style scraping can take a lot of time to
| setup if you aren't already familiar with the tooling, and the
| browser rendering apis are unintuitive and sometimes flat out
| broken.
| dec0dedab0de wrote:
| Maybe it's because I'm using the python bindings, but it took
| me about an hour to go from never using it to having it do
| what I needed it to do. I just messed around in a jupyter
| notebook until I got what I needed working. Tab complete on
| live objects is your friend. The hardest part was figuring
| out where to download a headless browser from.
|
| Though I do prefer requests/bs4. I wrote a helper to generate
| a requests.Session object from a selenium Browser object. I
| had something recently where the only thing I needed the
| javascript engine for was a login form that changed. So by
| doing it this way I didn't have to rewrite the whole thing.
| Still kind of bothers me I didn't take the time to figure out
| how to do it without the headless browser, but it works fine,
| and I have other things to do.
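The cookie handoff the commenter describes (selenium browser to plain-HTTP session) can be sketched without either third-party library. This assumes only the list-of-dicts shape that selenium's `get_cookies()` returns, and builds a stdlib `CookieJar` you could attach to a `urllib` opener; with `requests` you would populate `session.cookies` the same way:

```python
from http.cookiejar import Cookie, CookieJar

def cookies_to_jar(cookie_dicts) -> CookieJar:
    """Convert selenium-style cookie dicts ({'name', 'value',
    'domain', 'path', ...}) into a stdlib CookieJar."""
    jar = CookieJar()
    for c in cookie_dicts:
        jar.set_cookie(Cookie(
            version=0, name=c["name"], value=c["value"],
            port=None, port_specified=False,
            domain=c.get("domain", ""),
            domain_specified=bool(c.get("domain")),
            domain_initial_dot=c.get("domain", "").startswith("."),
            path=c.get("path", "/"), path_specified=True,
            secure=c.get("secure", False), expires=c.get("expiry"),
            discard=False, comment=None, comment_url=None, rest={},
        ))
    return jar
```

The appeal of this split is exactly what the comment describes: drive the headless browser only through the JavaScript-heavy login, then do the bulk of the crawling with a lightweight HTTP client carrying the authenticated cookies.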
| ipaddr wrote:
| That's why things like laravel's Dusk exists to put a layer
| over that complex experience.
| f311a wrote:
| For Python, instead of BeautifulSoup I prefer to use selectolax
| which is 3-5 times faster.
|
| Also, I think very few people use MechanicalSoup nowadays. There
| are libraries that allow you to use headless Chrome, e.g.
| Playwright.
|
| It looks like the author of the article just googled some
| libraries for each language and didn't research the topic.
| xnyan wrote:
| I agree with your conclusion, but in any discussion about web
| scraping it's probably a good idea to mention BeautifulSoup
| given how popular it is (virtually a builtin in terms how much
| it's used) and given all the documentation available for it, a
| good starting point if perf is not going to be a concern.
| jacurtis wrote:
| > It looks like the author of the article just googled some
| libraries for each language and didn't research the topic
|
| Yep, this seemed like an aggregate Google results page.
|
| I was initially intrigued by the article and then realized it
| was a list of libraries the author found via Google. There were
| significantly notable omissions from this list and a bunch of
| weird stuff that feels unnecessary. I don't think the author
| has actually scraped a page before.
| mdaniel wrote:
| Lazyweb link: https://github.com/rushter/selectolax
|
| although I don't follow the need to have what appears to be two
| completely separate HTML parsing C libraries as dependencies;
| seeing this in the readme for Modest gives me the shivers
| because lxml has _seen some shit_
|
| > Modest is a fast HTML renderer implemented as a pure C99
| library with no outside dependencies.
|
| although its other dep seems much more cognizant about the
| HTML5 standard, for whatever that's worth:
| https://github.com/lexbor/lexbor#lexbor
|
| ---
|
| > It looks like the author of the article just googled some
| libraries for each language and didn't research the topic
|
| Heh, oh, new to the Internet, are you? :-D
| heavyset_go wrote:
| requests-html is faster than bs4 using lxml. It's a wrapper
| over lxml. I built something similar years ago using a similar
| method, it was much faster than bs4, too.
| holoduke wrote:
 | Still using casperjs and phantomjs. Both have been deprecated for
 | many years, but I cannot find any replacement. Some of my
 | scraping programs have been running for over 10 years without any
 | issues.
| gcatalfamo wrote:
 | Why no mention of selenium? Is it not cool anymore? I have never
 | heard of mechanicalsoup: is it a selenium replacement?
| duckmysick wrote:
 | I moved from selenium to playwright. It has a pleasant API and
 | things just work out of the box. I ran into odd problems with
| selenium before, especially when waiting for a specific
| element. Selenium didn't register it, but I could see it load.
|
| It was uncharacteristic of me, because I tend to use boring,
| older technologies. But this gamble paid off for me.
|
| https://playwright.dev/
| IceWreck wrote:
| > is it selenium replacement
|
 | No, completely different use case. Selenium is browser
 | automation. MechanicalSoup/Mechanize/RoboBrowser are not
 | actually web browsers, and they have no JavaScript support
 | either. They're Python libraries that simulate a web browser by
 | doing GET requests, storing cookies across requests, filling
 | HTTP POST forms, etc.
|
| The downside is that they don't work with websites which rely
| on JavaScript to load content. But if you're scraping a website
| like that, then it might be easier and way way faster to
| analyze web requests using dev tools or mitmproxy, then
| automating those API calls instead of automating a browser.
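What libraries like MechanicalSoup automate, a shared cookie jar plus form-encoded POSTs, can be sketched with the stdlib alone (the User-Agent string here is just an arbitrary example):

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def make_browser():
    """An opener that keeps cookies across requests: the core of
    what Mechanize/MechanicalSoup-style libraries manage for you."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (example)")]
    return opener, jar

def form_body(fields: dict) -> bytes:
    """Encode fields as an application/x-www-form-urlencoded body."""
    return urllib.parse.urlencode(fields).encode()

# opener.open(url) for GET;
# opener.open(url, data=form_body({...})) for POST.
```

This is also the natural starting point for the "automate the API calls instead" approach the comment recommends: once mitmproxy or dev tools reveal the endpoints, the same opener can hit them directly.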
| gcatalfamo wrote:
| I started reverse engineering web apis years ago as a more
| efficient way to scrape. Unfortunately there are new website
 | builders (like Salesforce Lightning) that make it really hard to
 | recreate an HTTP request due to the complexity of the
 | parameters.
| nicoburns wrote:
| Selenium is famously unreliable, so a lot of people have been
| replacing it with headless chrome where they can.
| melomal wrote:
| Interesting. I was about to start on some web automation and
| so far I've had hammered into my head that Selenium is the
| 'language of the internet' or something along those lines.
|
| What would be a better solution, if you have any to
| recommend?
| xzel wrote:
 | I'd suggest Puppeteer / Playwright. Both are great. IIRC
 | the puppeteer team largely moved to playwright.
| kjkjadksj wrote:
| Puppeteer is frustrating to me. When I tried to use it I
| couldn't get it to click buttons, but I did get it to
| hover on the button so I know I had the correct element
| in my code. Their click function just did nothing at all.
| I resorted to tabbing a certain amount of times and
| hitting enter.
| gcatalfamo wrote:
| I have been using selenium with chromedriver, I mistakenly
| thought those were basically the same thing.
|
| Can you tell me more?
| MrDresden wrote:
| I've been working on a scraping project in Scrapy over the last
| month, using Selenium as well. My Python skills are mediocre
| (mostly a Java/Kotlin dev).
|
| Not only has it been a blast to try out, but also surprisingly
| easy to setup.
|
 | I now have around 11 domains being scraped 4 times a day through
 | a well-defined pipeline + ETL that then pipes the data to
 | Firebase Firestore for consumption.
|
| Next step is to write the page on top of it.
| heavyset_go wrote:
| Are you using Scrapy mainly for scraping, or do you do
| crawling, as well?
| [deleted]
| benzible wrote:
| I just needed a service to reliably fetch raw pages that I can
| process in my own application and so far I've been happy with
| this: https://promptapi.com/marketplace/description/adv_scraper-
| ap...
|
| $30 / month for 300K requests, rotating residential proxies, uses
| headless Chromium, etc.
| synergy20 wrote:
 | In my own experience puppeteer is much better/more capable than
 | selenium, but the problem is that puppeteer requires nodejs. Its
 | Python wrapper https://github.com/pyppeteer/pyppeteer was not as
 | good as selenium when you'd like to use Python.
| novaleaf wrote:
| Self promotion: my SaaS is the lowest cost web scraping tool for
| high volume, and has been in business since 2016.
|
| https://PhantomJsCloud.com
|
| My SaaS requires some technical knowledge to use (call a web api)
| which I suppose is why it's not ever in these lists.
|
 | Some of my customers are *very* large businesses. If you are
 | looking at evading bot countermeasures, my product probably isn't
 | the best for you, but for TCO nothing beats it.
| lloydatkinson wrote:
| Isn't phantomjs deprecated and unmaintained?
| spiffytech wrote:
| Yep, according to PhantomJS' README, their "development is
| suspended until further notice".
|
| It looks like phantomjscloud.com also supports Puppeteer.
| novaleaf wrote:
| yes, bad naming on my part. While it does support PhantomJs
| still, the default is a Puppeteer backend.
| mro_name wrote:
 | I have been scraping radio broadcast pages for a decade now. I
 | started with (ruby) scrapy, then nokogiri, then moved on to go
 | and its html package.
|
| Currently sport a mix of curl + grep + xsltproc + lambdasoup
| (OCaml) and am happy with it. Sounds like a mess but is shallow,
| inspectable, changeable and concise.
| http://purl.mro.name/recorder
| abzug wrote:
| On the Ruby side both Nokogiri and Mechanize should be
| mentioned...
| travisporter wrote:
| I'm looking to roll my own Plaid-like service so I can download
| the CSVfiles from my bank account and credit card. Would selenium
| be the way to go?
| juanse wrote:
 | Nowadays it's more and more common for websites to have some kind
 | of rate limiting middleware such as Rack::Attack for Ruby. It
 | would be interesting to explore the strategies to deal with it.
| hdjjhhvvhga wrote:
| Same as always - proxy farms, random popular UAs with random
| delays etc.
| bryanrasmussen wrote:
| so will Google's freezing of the UA lead to less ability to
| web scrape for the non big company scrapers out there?
| dreyfan wrote:
| What? It's just a text string in the header. How in the
| world would that possibly make it more difficult to scrape?
|
| All Chrome is doing is stop appending the current semver in
| the UA it sends.
| heavyset_go wrote:
| The switch from UA to browser fingerprinting makes it
| harder to scrape without being stopped.
|
| Yes, at any time the UA could be ignored and clients
| could be fingerprinted, but now the UA is being made next
| to useless, so fingerprinting will now become the default
| everywhere.
| bryanrasmussen wrote:
 | I mean, I realize it probably isn't problematic, just
 | wondering, but on the other hand it shouldn't be so
 | difficult to follow the reasoning based on the context, I
 | would think:
|
| poster says - in order to be able to scrape effectively
| you should appear to be a real human, use different UAs
| etc.
|
| So as this change happens different UAs become one less
| thing that you can easily change to seem less suspicious,
| as a non-frozen UA would then be a suspicious sign after
| some time.
|
| So a sort of side effect.
| [deleted]
| [deleted]
| vivekv wrote:
| Sorry can you please elaborate what is Google doing?
| bryanrasmussen wrote:
| sorry, I thought it was a well known thing here given the
| various discussions over past year or so
| https://groups.google.com/a/chromium.org/g/blink-
| dev/c/-2JIR...
|
| on edit: so I'm thinking as there will only be one UA
| floating around then, sure, older UAs can exist, but
| those become progressively more suspicious.
| hdfinenwvdun wrote:
| You can easily spoof the UA
| juanse wrote:
| Sorry, what is a UA?
| shmoogy wrote:
| User Agent
| Veen wrote:
| User agent (browser or some other web client).
| heavyset_go wrote:
| > _It would be interesting to explore the strategies to deal
| with it._
|
| The strategy to deal with it is to behave well when making
| requests so that you don't get rate limited.
| marginalia_nu wrote:
 | Why not just lower the crawl rate? My search engine crawler
 | visits tens of millions of documents in a week at a rate of 1
| doc/second, but a few hundred different domains at the same
| time.
|
| Going as low as 0.2 dps could easily be doable I think.
| Ozzie_osman wrote:
| Fundamentally, what most scrapers learn is that the more their
| scraper can behave like a human browsing the site, the less
| likely they are to get detected and blocked.
|
 | This does put limits on how quickly they can crawl, of course,
 | but scrapers find ways around it like changing IP and user
 | agent (IP is probably the main one, because you can then pretend
 | that you are multiple humans browsing the site normally).
| prox wrote:
| Yeah, there are services that give you a range of IPs for a
| certain time.
| stef25 wrote:
| Rotate through a bunch of proxies.
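The rotate-and-delay approach from this subthread, as a minimal sketch; the proxy addresses and UA strings are placeholders, and in practice a rotation service supplies the pool:

```python
import itertools
import random

# Placeholder pools; a paid proxy service would supply these.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Example/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Example/1.0",
]

proxy_pool = itertools.cycle(PROXIES)

def next_request_config(min_delay: float = 1.0, max_delay: float = 5.0):
    """Pick the next proxy, a random popular UA, and a humanizing
    delay to sleep before the request goes out."""
    return {
        "proxy": next(proxy_pool),
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "delay": random.uniform(min_delay, max_delay),
    }
```

Per the UA-freeze discussion below, the UA half of this trick is losing value over time; the proxy rotation is what actually lets you pose as multiple visitors.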
| jmnicolas wrote:
| Last year I needed some quick scraping and I used a headless
| Chromium to render webpages and print the HTML then analyze it
| with C#.
|
 | I don't remember exactly, but I think it was around 100 or 200
 | LOC, so not exactly something that took long to write. In fact
 | the most difficult thing was to figure out how to pass the right
 | args to Chromium.
|
| I wonder what does a scraping framework offer?
| Veen wrote:
| > I wonder what does a scraping framework offer?
|
| HTTP requests, HTML parsing, crawling, data extraction,
| wrapping complex browser APIs etc. Nothing you couldn't do
| yourself, but like most frameworks, they abstract the messy
| details so you can get a scraper working quickly without having
| to cobble together a bunch of libraries or re-invent the wheel.
| jmnicolas wrote:
| I see thanks.
| jrochkind1 wrote:
| Just for one example, when you have to get a form, and then
| submit the form, with the CSRF protection that was in the
| form... of course you COULD write that yourself by printing
| HTML and then analyzing it with C# (which triggers more
| requests to chromium I guess), but you're probably going to
| wonder why you are reinventing the wheel when you want to be
| getting on to the domain-specific stuff.
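The get-the-form-then-submit dance can be sketched with the stdlib `html.parser`: scoop up every hidden input (which is where CSRF tokens usually live) and echo them back in the POST body. Field names in the usage comment are hypothetical:

```python
from html.parser import HTMLParser

class HiddenInputs(HTMLParser):
    """Collect hidden <input> name/value pairs, e.g. CSRF tokens."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def hidden_fields(form_html: str) -> dict:
    p = HiddenInputs()
    p.feed(form_html)
    return p.fields

# Merge these into the POST alongside the visible fields, e.g.:
# body = {**hidden_fields(page), "username": "...", "password": "..."}
```

A framework does exactly this behind its `submit_form`-style helpers, which is the point of the parent comment: the wheel exists.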
| jmnicolas wrote:
| Ah yes I see. Mine was read-only, so no need for complex
| stuff.
| elorant wrote:
 | Throttling is a prime example. If you start loading multitudes
 | of sites in asynchronous fashion you'll have to introduce some
 | delay, otherwise you run the risk of choking the server on
 | misconfigured sites. I've DDoSed sites accidentally this way.
| You can of course build a framework on your own, and that's
| pretty much what every scraper does eventually, but it takes
| time and a lot of effort.
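A per-domain throttle is the kind of thing a framework gives you for free; a minimal hand-rolled sketch (the 1-second default mirrors the crawl rate mentioned elsewhere in the thread):

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain.
    Single-threaded sketch; add a lock for concurrent use."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if this domain was hit too recently; return the
        time actually slept."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        wait_for = self.min_interval - (
            now - self.last_hit.get(domain, float("-inf")))
        if wait_for > 0:
            time.sleep(wait_for)
        self.last_hit[domain] = time.monotonic()
        return max(wait_for, 0.0)
```

Keying the delay by domain is what lets a crawler stay polite per site while still sustaining high aggregate throughput across hundreds of domains.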
| gizdan wrote:
 | Surprised Woob[0] (formerly Weboob) isn't on the list. It's
 | designed for specific tasks, such as getting transactions from
 | your bank, events from different event pages, and much more.
|
| [0] https://woob.tech/
| marban wrote:
| Cloudflare's protection is quite a b*tch to circumvent with any
| headless or python library.
| heavyset_go wrote:
| It's a pain even when you aren't a bot. For a while there,
| Cloudflare's fingerprinting page would trigger Firefox on Linux
| to crash instantly.
| password4321 wrote:
| https://news.ycombinator.com/item?id=28514998#28515629
|
| > _Cloudflare 's bot protection mostly makes use of TLS
| fingerprinting, and thus pretty easy to bypass._
|
| https://news.ycombinator.com/item?id=28251700 ->
| https://github.com/refraction-networking/utls
|
| Disclaimer: haven't tried it.
| alphabet9000 wrote:
| with node, i've had success with puppeteer-extra using
| puppeteer-extra-plugin-stealth
| zlib wrote:
| What kind of stuff are people needing to scrape?
| mdaniel wrote:
| I would expect it's roughly the same answers, just varying in
| the specifics:
|
| * those which don't offer a _reasonable_ API, or (I would guess
| a larger subset) those which don't expose all the same
| information over their API
|
| * those things which one wishes to preserve (yes, I'm aware
| that submitting them to the Internet Archive might achieve that
| goal)
|
| * and then the subset of projects where it's just a fun
| challenge or the ubiquitous $other
|
| As an example answer to your question, some sites are even
| offering bounties for scraped data, so one could scratch a
| technical itch and help data science at the same time:
|
| https://www.dolthub.com/repositories/pdap/datasets/bounties
| rmetzler wrote:
 | Websites which change over time and don't provide a simpler way
 | of getting updates (e.g. an RSS feed or a JSON API).
| conradfr wrote:
| I have a side-project where I display the schedule of the day
| of 100+ French radios, like you would for TV channels.
|
| Scraping works great to get the data.
|
 | I don't like node/js but I use it to do the scraping, as I view
 | the code as throwaway, full of edge cases and unreliable data /
 | types. I can't complain; a dynamic scripting language is great
 | for that.
| [deleted]
| ianhawes wrote:
| Scraping saved untold lives this past spring when large
 | healthcare providers (e.g. Walgreens & CVS) opted to hide their
| vaccination appointments behind redundant survey questions.
| This made it more difficult to quickly ascertain when an
| appointment slot would become available. The elderly were less
| likely to look more than once a day, delaying vaccines for
| those that needed it the most.
|
| GoodRX built a scraping system that tapped into all the major
 | providers. That's what a group of vaccine hunters in my state
| used to get appointments for folks that had tried but were
| unable to.
| hyeomans wrote:
| I scrape multiple government sites to fill all the data for
| https://www.quienmerepresenta.com.mx/
|
 | It tells you who your governor, local/federal
 | representative, senator, and municipal president are. Each
 | representative lives on a different website, so I wrote
 | scrapers for each one.
___________________________________________________________________
(page generated 2021-10-11 23:01 UTC)