[HN Gopher] The State of Web Scraping in 2021
       ___________________________________________________________________
        
       The State of Web Scraping in 2021
        
       Author : marvram
       Score  : 186 points
       Date   : 2021-10-11 12:21 UTC (10 hours ago)
        
 (HTM) web link (mihaisplace.blog)
 (TXT) w3m dump (mihaisplace.blog)
        
       | nathell wrote:
       | I'll chime in with mine: Skyscraper (Clojure) [0] builds on
       | Enlive/Reaver (which in turn build on JSoup), but tries to
       | address cross-cutting concerns like caching, fetching HTML
       | (preferably in parallel), throttling, retries, navigation,
       | emitting the output as a dataset, etc.
        
       | Jenkins2000 wrote:
       | For Java/Kotlin, HtmlUnit is generally pretty great.
        
       | say_it_as_it_is wrote:
        | Pyppeteer is feature complete and worth noting:
       | https://github.com/pyppeteer/pyppeteer
        
       | amelius wrote:
       | > Crawl at off-peak traffic times. If a news service has most of
       | its users present between 9 am and 10 pm - then it might be good
       | to crawl around 11 pm or in the wee hours of the morning.
       | 
       | How do you know this if it is not your website?
       | 
       | Also, the internet has no time zone.
        
         | chucky wrote:
         | For sites where there is a peak usage time, it's probably
         | obvious what that peak usage time is. A news service (their
         | example) presumably primarily serves a country or a region -
         | then off-peak traffic times are likely at night.
         | 
         | The Internet has no time zone, but its human users all do.
        
         | numeralls wrote:
          | If you're scraping a popular website, Google Trends should
          | be a pretty good proxy.
        
       | m_ke wrote:
        | Another tip: there are a few browser extensions that can record
       | your interactions and generate a playwright script.
       | 
       | Here's one: https://chrome.google.com/webstore/detail/headless-
       | recorder/...
        
         | sidharthv wrote:
         | If you don't want to install another extension, Playwright has
          | built-in support for recording.
         | 
         | npx playwright codegen wikipedia.org
         | 
         | https://playwright.dev/docs/next/codegen
        
         | jrochkind1 wrote:
          | I'm not familiar with "playwright", and it doesn't seem to
          | be mentioned in the OP either.
         | 
         | When I google, I see it advertised as a "testing" tool.
         | 
         | Can I also use it for scraping? Where would I learn more about
         | doing so?
        
           | xnyan wrote:
           | Playwright is essentially a headless chrome, firefox, and
           | webkit browser with a nice API that's intended for
            | automation/scraping. It's far heavier than something like
           | curl, but it has all the capabilities of any browser you want
           | (not just chrome as with puppeteer) and makes stuff like
           | interacting with javascript a breeze.
           | 
            | It's similar to Google's puppeteer, but in my opinion it's
            | much more pleasant and productive, even with Chrome.
            | Microsoft's best developer tool IMO; saves me tons of time.
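            | 
            | To give a flavor, a rough sketch with the Python bindings
            | (the URL and selector are just placeholders):
            | 
            |   from playwright.sync_api import sync_playwright
            | 
            |   with sync_playwright() as p:
            |       browser = p.firefox.launch()
            |       page = browser.new_page()
            |       # Renders the page like a real browser, JS included
            |       page.goto("https://example.com")
            |       page.wait_for_selector("h1")
            |       print(page.inner_text("h1"))
            |       browser.close()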
        
           | rmetzler wrote:
            | Playwright is similar to Puppeteer, but it can drive
            | different browsers, not only Chrome.
        
       | lucasverra wrote:
        | any recommendations for scraping behind Google social login?
        
       | SubiculumCode wrote:
       | I've not followed this space. When I did, there were a lot of
       | questions concerning the legality of automated scraping. Have
       | those legal issues been resolved?
        
       | rafale wrote:
        | What's the best way to get around AWS/Azure/... IP range bans
        | and VPN bans when scraping?
        
         | killingtime74 wrote:
         | There are other providers. I think big time scrapers use
         | residential IPs
        
         | jmuguy wrote:
         | The large proxy providers operate in a sort of gray market. You
         | pay for "residential" or "ISP" based IP addresses. In some
         | instances these proxy connections are literally being tunneled
         | through browser extensions running on a real world system
         | somewhere (https://hola.org/ for instance)
        
         | [deleted]
        
       | beardyw wrote:
        | I tried Python/BeautifulSoup and Node/Puppeteer recently. It may
       | be because my Python is poor, but puppeteer seemed more natural
       | to me. Injecting functionality into a properly formed web page
       | felt quite powerful and started me thinking about what you could
       | do with it.
        
       | Quessked73 wrote:
       | Does anyone have a resource for getting into app-based scraping,
       | if the API is obfuscated or rate limited?
        
         | elorant wrote:
         | Run an http debugger, or some proxy, and find the endpoints.
        
       | colinramsay wrote:
       | If you're familiar with Go, there's Colly too [1]. I liked its
       | simplicity and approach and even wrote a little wrapper around it
       | to run it via Docker and a config file:
       | 
       | https://gotripod.com/insights/super-simple-site-crawling-and...
       | 
       | [1] http://go-colly.org/
        
         | IceWreck wrote:
         | +1 for Go. Its easy concurrency makes it an awesome language
         | for web scraping.
         | 
          | The go-colly framework was a bit too restrictive for my
          | needs, but it's very easy to build something on top of the
          | standard lib's net/http, its cookiejar, and a third-party
          | library called goquery (afaik go-colly uses this too).
         | 
          | Fun fact: we were scraping something from an apparently
          | zero-rate-limit Azure blob container, and we had to enumerate
          | around a million URLs daily (we didn't know which URLs
          | actually existed, so we guessed an offset and enumerated from
          | there; we also had to do it at a fixed time daily). We had
          | proxies at our disposal but didn't need them because the blob
          | container did not rate-limit.
         | 
          | I wrote the scraper in Go, but a friend wrote one in Rust. Go
          | was fast enough, satisfying all our requirements, but it
          | turned out the Rust one was 3-5 times faster. I tried to
          | close the gap by tweaking net/http transport's parameters,
          | increasing workers, removing all NOFILE limits from systemd,
          | and profiling away the low-hanging speed issues. Nothing
          | reduced the gap. Then I replaced the net/http client with
          | valyala/fasthttp (another HTTP implementation in Go), which
          | made it as fast as or slightly faster than the Rust one,
          | which was using the reqwest crate as its HTTP client.
        
         | hivacruz wrote:
          | I used this library to get familiar with Go. It is indeed
          | very powerful, and it makes it really easy to create a
          | scraper.
         | 
          | My main concerns, though, were about testing. What if you
          | want to create tests to check that your scraper still gets
          | the data you want? Colly allows nested scraping and it's easy
          | to implement, but you end up with all your logic in one big
          | function, making it harder to test.
         | 
         | Did you find a solution to this? I'm considering switching to
         | net/http + GoQuery only to have more freedom.
        
           | colinramsay wrote:
           | Not yet but my plan was to just have a static HTML site which
           | the tests could run against.
        
       | throwawaysea wrote:
       | Is there open source software that can extract the "content" part
       | of a given page cleanly? I'm thinking about what the reader mode
       | in browsers can do as an example, where the main content is
       | somehow isolated and displayed.
        
         | specproc wrote:
         | I believe the main library for reader mode is called
         | readability. I played around with a python implementation a
         | while back. Just pipe in your raw html as part of the process.
         | It's good, but not flawless. If I remember correctly, it
         | included some quotes and image text as part of the body for the
         | site I tried it on.
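          | 
          | If you want to try it, a minimal sketch using the
          | readability-lxml package (the URL is a placeholder):
          | 
          |   import requests
          |   from readability import Document
          | 
          |   html = requests.get("https://example.com/article").text
          |   doc = Document(html)
          |   print(doc.title())    # extracted title
          |   print(doc.summary())  # main content, as cleaned-up HTML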
        
           | stef25 wrote:
           | There's a PHP port of Readability and it works for some
           | sites, for others not at all. Very far from perfect.
        
         | tannhaeuser wrote:
         | You can use SGML (on which HTML is/was based) and my LGPL-
         | licensed sgmljs package [1] for that, plus my SGML DTD grammar
         | for HTML5. [2] describes common tasks in preservation of Web
         | content to give you a flavor, but you can customize what SGML
         | does with your markup to death really; in your case, you'll
         | probably want to throw away divs and navs to get clean semantic
         | HTML which you can do using SGML link processes (= pipeline of
         | markup filters and transformations), but you could also convert
         | HTML into canonical markup (eg XML) and use Turing-complete XML
         | processing tools such as XSLT as described in the linked
         | tutorial.
         | 
         | [1]: http://sgmljs.net
         | 
         | [2]: http://sgmljs.net/docs/parsing-html-tutorial/parsing-html-
         | tu...
        
       | toastal wrote:
        | OCaml's Lambda Soup (https://aantron.github.io/lambdasoup/) is
        | an amazing library, especially for those who prefer functional
        | programming.
        
       | dec0dedab0de wrote:
       | Scraping things that don't want to be scraped is one of my
       | favorite things to do. At work this is usually an interface for
       | some sort of "network appliance." Though with the push for REST
        | APIs over the last 6 years or so, I don't have a need to do it
        | all too often. Plus, with things like selenium it's too easy to
        | just run the page as is, and I can't justify spending the time
        | to figure out the undocumented API.
       | 
       | My favorite one implemented CSRF protections by polling an
       | endpoint, and adding in the hashed data from that endpoint and a
       | timestamp on every request.
       | 
       | When I hear a junior dev give up on something because the API
        | doesn't provide the functionality of the UI, it makes me very sad
       | that they're missing out.
        
         | eastendguy wrote:
         | > Scraping things that don't want to be scraped
         | 
         | If all else fails, no website can withstand OCR-based screen
         | scraping. It is slow(er), but fast enough for many use cases.
        
           | elorant wrote:
           | Assuming that you eventually manage to load the page somehow.
           | Which in some edge cases may entail simulating mouse
           | movements and random delays.
        
           | mkl wrote:
           | A browser extension is probably an easier way to extract text
           | than OCR (unless you're targeting a wide range of sites, I
           | suppose).
        
           | timwis wrote:
           | Have you tried on a page protected by cloudflare captcha?
        
         | tmpz22 wrote:
          | To be fair, selenium-style scraping can take a lot of time
          | to set up if you aren't already familiar with the tooling,
          | and the
         | browser rendering apis are unintuitive and sometimes flat out
         | broken.
        
           | dec0dedab0de wrote:
           | Maybe it's because I'm using the python bindings, but it took
           | me about an hour to go from never using it to having it do
           | what I needed it to do. I just messed around in a jupyter
           | notebook until I got what I needed working. Tab complete on
           | live objects is your friend. The hardest part was figuring
           | out where to download a headless browser from.
           | 
           | Though I do prefer requests/bs4. I wrote a helper to generate
           | a requests.Session object from a selenium Browser object. I
           | had something recently where the only thing I needed the
           | javascript engine for was a login form that changed. So by
           | doing it this way I didn't have to rewrite the whole thing.
           | Still kind of bothers me I didn't take the time to figure out
           | how to do it without the headless browser, but it works fine,
           | and I have other things to do.
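            | 
            | The helper boils down to copying the browser's cookies into
            | the session. Roughly (a sketch, not my exact code):
            | 
            |   import requests
            | 
            |   def session_from_browser(browser):
            |       # Reuse the selenium driver's logged-in state in a
            |       # plain requests.Session; no browser needed afterwards
            |       session = requests.Session()
            |       for c in browser.get_cookies():
            |           session.cookies.set(c["name"], c["value"],
            |                               domain=c.get("domain"))
            |       return session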
        
           | ipaddr wrote:
            | That's why things like laravel's Dusk exist: to put a
            | layer over that complex experience.
        
       | f311a wrote:
        | For Python, instead of BeautifulSoup I prefer to use
        | selectolax, which is 3-5 times faster.
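        | 
        | For reference, usage looks roughly like this (the selector is
        | a placeholder):
        | 
        |   from selectolax.parser import HTMLParser
        | 
        |   tree = HTMLParser(html)  # html fetched elsewhere
        |   for node in tree.css("a.result"):
        |       print(node.text(), node.attributes.get("href"))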
       | 
       | Also, I think very few people use MechanicalSoup nowadays. There
       | are libraries that allow you to use headless Chrome, e.g.
       | Playwright.
       | 
       | It looks like the author of the article just googled some
       | libraries for each language and didn't research the topic.
        
         | xnyan wrote:
         | I agree with your conclusion, but in any discussion about web
         | scraping it's probably a good idea to mention BeautifulSoup
          | given how popular it is (virtually a builtin in terms of how
          | much it's used) and all the documentation available for it.
          | It's a good starting point if perf is not going to be a
          | concern.
        
         | jacurtis wrote:
         | > It looks like the author of the article just googled some
         | libraries for each language and didn't research the topic
         | 
         | Yep, this seemed like an aggregate Google results page.
         | 
         | I was initially intrigued by the article and then realized it
         | was a list of libraries the author found via Google. There were
          | some notable omissions from this list and a bunch of
         | weird stuff that feels unnecessary. I don't think the author
         | has actually scraped a page before.
        
         | mdaniel wrote:
         | Lazyweb link: https://github.com/rushter/selectolax
         | 
         | although I don't follow the need to have what appears to be two
         | completely separate HTML parsing C libraries as dependencies;
         | seeing this in the readme for Modest gives me the shivers
         | because lxml has _seen some shit_
         | 
         | > Modest is a fast HTML renderer implemented as a pure C99
         | library with no outside dependencies.
         | 
          | although its other dep seems much more cognizant of the
          | HTML5 standard, for whatever that's worth:
         | https://github.com/lexbor/lexbor#lexbor
         | 
         | ---
         | 
         | > It looks like the author of the article just googled some
         | libraries for each language and didn't research the topic
         | 
         | Heh, oh, new to the Internet, are you? :-D
        
         | heavyset_go wrote:
          | requests-html is faster than bs4 since it's a wrapper over
          | lxml. I built something similar years ago using the same
          | method; it was much faster than bs4, too.
        
       | holoduke wrote:
        | Still using casperjs and phantomjs. Both have been deprecated
        | for many years, but I cannot find any replacement. Some of my
        | scraping programs have been running for over 10 years without
        | any issues.
        
       | gcatalfamo wrote:
       | Why no mention of selenium? Is it not cool anymore? I have never
        | heard of mechanicalsoup: is it a selenium replacement?
        
         | duckmysick wrote:
         | I moved from selenium to playwright. It has a pleasant API and
          | things just work out of the box. I ran into odd problems with
         | selenium before, especially when waiting for a specific
         | element. Selenium didn't register it, but I could see it load.
         | 
         | It was uncharacteristic of me, because I tend to use boring,
         | older technologies. But this gamble paid off for me.
         | 
         | https://playwright.dev/
        
         | IceWreck wrote:
         | > is it selenium replacement
         | 
          | No, it's a completely different use case. Selenium is browser
          | automation. MechanicalSoup/Mechanize/RoboBrowser are not
          | actually web browsers, and they have no javascript support
          | either. They're python libraries that simulate a web browser
          | by doing GET requests, storing cookies across requests,
          | filling http POST forms, etc. (minimal sketch below).
         | 
         | The downside is that they don't work with websites which rely
         | on JavaScript to load content. But if you're scraping a website
          | like that, it might be easier and way, way faster to analyze
          | web requests using dev tools or mitmproxy, and then automate
          | those API calls instead of automating a browser.
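          | 
          | The promised minimal sketch (URL and form fields are made
          | up):
          | 
          |   import mechanicalsoup
          | 
          |   browser = mechanicalsoup.StatefulBrowser()
          |   browser.open("https://example.com/login")
          |   browser.select_form('form[action="/login"]')
          |   browser["username"] = "me"
          |   browser["password"] = "secret"
          |   browser.submit_selected()  # cookies persist in the session
          |   page = browser.open("https://example.com/account").soup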
        
           | gcatalfamo wrote:
           | I started reverse engineering web apis years ago as a more
           | efficient way to scrape. Unfortunately there are new website
            | builders (like salesforce lightning) that make it really
            | hard to recreate an http request due to the complexity of
            | the parameters.
        
         | nicoburns wrote:
         | Selenium is famously unreliable, so a lot of people have been
         | replacing it with headless chrome where they can.
        
           | melomal wrote:
           | Interesting. I was about to start on some web automation and
            | so far I've had it hammered into my head that Selenium is the
           | 'language of the internet' or something along those lines.
           | 
           | What would be a better solution, if you have any to
           | recommend?
        
             | xzel wrote:
             | I'd suggest Puppeteer / Playwright. Both are great. Iirc
             | the puppeteer team largely moved to playwright.
        
               | kjkjadksj wrote:
               | Puppeteer is frustrating to me. When I tried to use it I
               | couldn't get it to click buttons, but I did get it to
               | hover on the button so I know I had the correct element
               | in my code. Their click function just did nothing at all.
                | I resorted to tabbing a certain number of times and
                | hitting enter.
        
           | gcatalfamo wrote:
            | I have been using selenium with chromedriver; I mistakenly
            | thought those were basically the same thing.
           | 
           | Can you tell me more?
        
       | MrDresden wrote:
       | I've been working on a scraping project in Scrapy over the last
       | month, using Selenium as well. My Python skills are mediocre
       | (mostly a Java/Kotlin dev).
       | 
        | Not only has it been a blast to try out, it was also
        | surprisingly easy to set up.
        | 
        | I now have around 11 domains being scraped 4 times a day
        | through a well-defined pipeline; an ETL step then pipes the
        | data to Firebase Firestore for consumption.
        | 
        | Next step is to write the page that sits on top of it.
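        | 
        | For anyone curious, a bare-bones Scrapy spider looks something
        | like this (names and selectors are made up, not my actual
        | code):
        | 
        |   import scrapy
        | 
        |   class ExampleSpider(scrapy.Spider):
        |       name = "example"
        |       start_urls = ["https://example.com/listing"]
        | 
        |       def parse(self, response):
        |           for item in response.css("div.entry"):
        |               yield {
        |                   "title": item.css("h2::text").get(),
        |                   "url": item.css("a::attr(href)").get(),
        |               }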
        
         | heavyset_go wrote:
         | Are you using Scrapy mainly for scraping, or do you do
         | crawling, as well?
        
       | [deleted]
        
       | benzible wrote:
       | I just needed a service to reliably fetch raw pages that I can
       | process in my own application and so far I've been happy with
       | this: https://promptapi.com/marketplace/description/adv_scraper-
       | ap...
       | 
       | $30 / month for 300K requests, rotating residential proxies, uses
       | headless Chromium, etc.
        
       | synergy20 wrote:
        | In my own experience puppeteer is much more capable than
        | selenium, but the problem is that puppeteer requires nodejs.
        | Its python wrapper, https://github.com/pyppeteer/pyppeteer,
        | was not as good as selenium when you'd rather use python.
        
       | novaleaf wrote:
       | Self promotion: my SaaS is the lowest cost web scraping tool for
       | high volume, and has been in business since 2016.
       | 
       | https://PhantomJsCloud.com
       | 
       | My SaaS requires some technical knowledge to use (call a web api)
       | which I suppose is why it's not ever in these lists.
       | 
        | Some of my customers are *very* large businesses. If you are
        | looking at evading bot countermeasures, my product probably
        | isn't the best for you, but for TCO nothing beats it.
        
         | lloydatkinson wrote:
         | Isn't phantomjs deprecated and unmaintained?
        
           | spiffytech wrote:
           | Yep, according to PhantomJS' README, their "development is
           | suspended until further notice".
           | 
           | It looks like phantomjscloud.com also supports Puppeteer.
        
           | novaleaf wrote:
            | yes, bad naming on my part. While it does still support
            | PhantomJS, the default is a Puppeteer backend.
        
       | mro_name wrote:
        | I've been scraping radio broadcast pages for a decade now. I
        | started with (ruby) scrapy, then nokogiri, then moved on to go
        | and its html package.
       | 
       | Currently sport a mix of curl + grep + xsltproc + lambdasoup
       | (OCaml) and am happy with it. Sounds like a mess but is shallow,
       | inspectable, changeable and concise.
       | http://purl.mro.name/recorder
        
       | abzug wrote:
       | On the Ruby side both Nokogiri and Mechanize should be
       | mentioned...
        
       | travisporter wrote:
       | I'm looking to roll my own Plaid-like service so I can download
        | the CSV files from my bank account and credit card. Would selenium
       | be the way to go?
        
       | juanse wrote:
        | Nowadays it's more and more common for websites to have some
        | kind of rate-limiting middleware, such as Rack::Attack for
        | Ruby. It would be interesting to explore the strategies to
        | deal with it.
        
         | hdjjhhvvhga wrote:
         | Same as always - proxy farms, random popular UAs with random
         | delays etc.
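          | 
          | E.g., something like this (the UA strings are truncated
          | examples):
          | 
          |   import random, time
          |   import requests
          | 
          |   USER_AGENTS = [
          |       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
          |       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
          |   ]
          | 
          |   def polite_get(url):
          |       time.sleep(random.uniform(2, 8))  # random delay
          |       headers = {"User-Agent": random.choice(USER_AGENTS)}
          |       return requests.get(url, headers=headers)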
        
           | bryanrasmussen wrote:
            | so will Google's freezing of the UA lead to less ability
            | to web scrape for the non-big-company scrapers out there?
        
             | dreyfan wrote:
             | What? It's just a text string in the header. How in the
             | world would that possibly make it more difficult to scrape?
             | 
              | All Chrome is doing is no longer appending the current
              | semver in the UA it sends.
        
               | heavyset_go wrote:
               | The switch from UA to browser fingerprinting makes it
               | harder to scrape without being stopped.
               | 
               | Yes, at any time the UA could be ignored and clients
               | could be fingerprinted, but now the UA is being made next
               | to useless, so fingerprinting will now become the default
               | everywhere.
        
               | bryanrasmussen wrote:
                | I mean, I realize it probably isn't problematic, just
                | wondering. But on the other hand it shouldn't be so
                | difficult to follow the reasoning based on the
                | context, I would think:
               | 
               | poster says - in order to be able to scrape effectively
               | you should appear to be a real human, use different UAs
               | etc.
               | 
               | So as this change happens different UAs become one less
               | thing that you can easily change to seem less suspicious,
               | as a non-frozen UA would then be a suspicious sign after
               | some time.
               | 
               | So a sort of side effect.
        
               | [deleted]
        
               | [deleted]
        
             | vivekv wrote:
              | Sorry, can you please elaborate on what Google is doing?
        
               | bryanrasmussen wrote:
                | sorry, I thought it was a well-known thing here given
                | the various discussions over the past year or so:
               | https://groups.google.com/a/chromium.org/g/blink-
               | dev/c/-2JIR...
               | 
                | on edit: so I'm thinking that since there will only be
                | one UA floating around, older UAs can still exist, but
                | they become progressively more suspicious.
        
             | hdfinenwvdun wrote:
             | You can easily spoof the UA
        
           | juanse wrote:
           | Sorry, what is a UA?
        
             | shmoogy wrote:
             | User Agent
        
             | Veen wrote:
             | User agent (browser or some other web client).
        
         | heavyset_go wrote:
         | > _It would be interesting to explore the strategies to deal
         | with it._
         | 
         | The strategy to deal with it is to behave well when making
         | requests so that you don't get rate limited.
        
         | marginalia_nu wrote:
          | Why not just lower the crawl rate? My search engine crawler
          | visits tens of millions of documents in a week at a rate of
          | 1 doc/second, but across a few hundred different domains at
          | the same time.
         | 
         | Going as low as 0.2 dps could easily be doable I think.
        
         | Ozzie_osman wrote:
         | Fundamentally, what most scrapers learn is that the more their
         | scraper can behave like a human browsing the site, the less
         | likely they are to get detected and blocked.
         | 
          | This does put limits on how quickly they can crawl, of
          | course, but scrapers find ways around it, like changing IP
          | and user agent (IP is probably the main one, because you can
          | then pretend that you are multiple humans browsing the site
          | normally).
        
           | prox wrote:
           | Yeah, there are services that give you a range of IPs for a
           | certain time.
        
         | stef25 wrote:
         | Rotate through a bunch of proxies.
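          | 
          | Roughly (the proxy addresses are placeholders):
          | 
          |   import itertools
          |   import requests
          | 
          |   PROXIES = itertools.cycle([
          |       "http://10.0.0.1:8080",
          |       "http://10.0.0.2:8080",
          |   ])
          | 
          |   def get_via_proxy(url):
          |       proxy = next(PROXIES)
          |       return requests.get(url, proxies={"http": proxy,
          |                                         "https": proxy})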
        
       | jmnicolas wrote:
       | Last year I needed some quick scraping and I used a headless
       | Chromium to render webpages and print the HTML then analyze it
       | with C#.
       | 
        | I don't remember exactly, but I think it was around 100 or 200
        | loc, so not exactly something that took long to write. In fact
        | the most difficult thing was figuring out how to pass the
        | right args to Chromium.
       | 
        | I wonder what a scraping framework offers?
        
         | Veen wrote:
          | > I wonder what a scraping framework offers?
         | 
         | HTTP requests, HTML parsing, crawling, data extraction,
         | wrapping complex browser APIs etc. Nothing you couldn't do
         | yourself, but like most frameworks, they abstract the messy
         | details so you can get a scraper working quickly without having
         | to cobble together a bunch of libraries or re-invent the wheel.
        
           | jmnicolas wrote:
           | I see thanks.
        
         | jrochkind1 wrote:
         | Just for one example, when you have to get a form, and then
         | submit the form, with the CSRF protection that was in the
         | form... of course you COULD write that yourself by printing
         | HTML and then analyzing it with C# (which triggers more
         | requests to chromium I guess), but you're probably going to
         | wonder why you are reinventing the wheel when you want to be
         | getting on to the domain-specific stuff.
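          | 
          | By hand it looks something like this (the URL and field
          | names are hypothetical):
          | 
          |   import requests
          |   from bs4 import BeautifulSoup
          | 
          |   session = requests.Session()
          |   # Fetch the form and pull out the CSRF token it embeds
          |   html = session.get("https://example.com/login").text
          |   token = BeautifulSoup(html, "html.parser").find(
          |       "input", {"name": "csrf_token"})["value"]
          |   # Re-submit the token along with the form data
          |   session.post("https://example.com/login",
          |                data={"user": "me", "pass": "secret",
          |                      "csrf_token": token})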
        
           | jmnicolas wrote:
           | Ah yes I see. Mine was read-only, so no need for complex
           | stuff.
        
         | elorant wrote:
          | Throttling is a prime example. If you start loading a
          | multitude of sites asynchronously, you'll have to add some
          | delay, otherwise you run the risk of choking the server on
          | misconfigured sites. I've DDoSed sites accidentally this
          | way.
         | You can of course build a framework on your own, and that's
         | pretty much what every scraper does eventually, but it takes
         | time and a lot of effort.
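          | 
          | A sketch of the kind of throttling a framework handles for
          | you, here with asyncio/aiohttp (the limits are arbitrary,
          | and a real crawler would throttle per domain):
          | 
          |   import asyncio
          |   import aiohttp
          | 
          |   async def fetch(session, sem, url):
          |       async with sem:  # cap concurrent requests
          |           async with session.get(url) as resp:
          |               body = await resp.text()
          |           await asyncio.sleep(1)  # delay to spare the server
          |           return body
          | 
          |   async def main(urls):
          |       sem = asyncio.Semaphore(10)  # at most 10 in flight
          |       async with aiohttp.ClientSession() as session:
          |           return await asyncio.gather(
          |               *(fetch(session, sem, u) for u in urls))
          | 
          |   # pages = asyncio.run(main(url_list))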
        
       | gizdan wrote:
       | Surprised Woob[0] (formerly Weboob) isn't on the list. It's
        | designed for specific tasks, such as getting transactions from
        | your bank or events from different event pages, and much more.
       | 
       | [0] https://woob.tech/
        
       | marban wrote:
       | Cloudflare's protection is quite a b*tch to circumvent with any
       | headless or python library.
        
         | heavyset_go wrote:
         | It's a pain even when you aren't a bot. For a while there,
         | Cloudflare's fingerprinting page would trigger Firefox on Linux
         | to crash instantly.
        
         | password4321 wrote:
         | https://news.ycombinator.com/item?id=28514998#28515629
         | 
          | > _Cloudflare's bot protection mostly makes use of TLS
         | fingerprinting, and thus pretty easy to bypass._
         | 
         | https://news.ycombinator.com/item?id=28251700 ->
         | https://github.com/refraction-networking/utls
         | 
         | Disclaimer: haven't tried it.
        
         | alphabet9000 wrote:
         | with node, i've had success with puppeteer-extra using
         | puppeteer-extra-plugin-stealth
        
       | zlib wrote:
       | What kind of stuff are people needing to scrape?
        
         | mdaniel wrote:
         | I would expect it's roughly the same answers, just varying in
         | the specifics:
         | 
         | * those which don't offer a _reasonable_ API, or (I would guess
         | a larger subset) those which don't expose all the same
         | information over their API
         | 
         | * those things which one wishes to preserve (yes, I'm aware
         | that submitting them to the Internet Archive might achieve that
         | goal)
         | 
         | * and then the subset of projects where it's just a fun
         | challenge or the ubiquitous $other
         | 
         | As an example answer to your question, some sites are even
         | offering bounties for scraped data, so one could scratch a
         | technical itch and help data science at the same time:
         | 
         | https://www.dolthub.com/repositories/pdap/datasets/bounties
        
         | rmetzler wrote:
         | Websites which change over time and don't provide a simpler way
         | of getting an update (e.g. an RSS feed or a JSON api).
        
         | conradfr wrote:
        | I have a side project where I display the day's schedule for
        | 100+ French radio stations, like you would for TV channels.
         | 
         | Scraping works great to get the data.
         | 
        | I don't like node/js, but I use it to do the scraping since I
        | view that code as throwaway, full of edge cases and unreliable
        | data/types. And I can't complain: a dynamic scripting language
        | is great for that.
        
         | [deleted]
        
         | ianhawes wrote:
         | Scraping saved untold lives this past spring when large
          | healthcare providers (e.g. Walgreens & CVS) opted to hide their
         | vaccination appointments behind redundant survey questions.
         | This made it more difficult to quickly ascertain when an
         | appointment slot would become available. The elderly were less
         | likely to look more than once a day, delaying vaccines for
         | those that needed it the most.
         | 
         | GoodRX built a scraping system that tapped into all the major
          | providers. That's what a group of vaccine hunters in my
          | state used to get appointments for folks who had tried but
          | were unable to.
        
         | hyeomans wrote:
         | I scrape multiple government sites to fill all the data for
         | https://www.quienmerepresenta.com.mx/
         | 
          | It tells you who your governor, local/federal
          | representative, senator, and municipal president are. Each
          | representative's data lives on a different website, so I
          | wrote scrapers for each one.
        
       ___________________________________________________________________
       (page generated 2021-10-11 23:01 UTC)