[HN Gopher] Web Scraping 101 with Python
       ___________________________________________________________________
        
       Web Scraping 101 with Python
        
       Author : daolf
       Score  : 235 points
       Date   : 2021-02-10 15:17 UTC (7 hours ago)
        
 (HTM) web link (www.scrapingbee.com)
 (TXT) w3m dump (www.scrapingbee.com)
        
       | NDizzle wrote:
        | I've been involved in many web scraping jobs over the past 25
        | years or so. The most recent one, which was a long time ago at
        | this point, used Scrapy. I went with XML tools for traversing
        | the DOM.
       | 
       | It's worked unbelievably well. It's been running for roughly 5
       | years at this point. I send a command at a random time between
       | 11pm and 4am to wake up an ec2 instance. It checks its tags to
       | see if it should execute the script. If so, it does so. When it's
       | done with its scraping for the day, it turns itself off.
       | 
       | This is a tiny snapshot of why it's been so difficult for me to
       | go from python2 to python3. I'm strongly in the camp of "if it
       | ain't broke, don't fix it".
        
         | silicon2401 wrote:
         | why can't you just keep using python2? surely some people out
         | there are interested enough to keep updating and maintaining
         | it?
        
           | NDizzle wrote:
           | I certainly can keep using it. There have been so many
           | efforts to get people to update Python 2 code to Python 3
           | code that it's on my backlog to do it. Will I get to it this
           | year? Probably not.
        
         | Topgamer7 wrote:
          | Using `2to3` might get you 80% of the way there, though
          | cases like this make tests really valuable.
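          | 
          | A typical invocation (the filename is illustrative; -w
          | rewrites the file in place, keeping a .bak backup):
          | 
          |     2to3 -w legacy_scraper.py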
        
       | jC6fhrfHRLM9b3 wrote:
       | This is an ad.
        
       | ohmyblock wrote:
       | Any advantage/disadvantage in using Javascript instead of Python
       | for web scraping?
        
         | ddorian43 wrote:
         | It's just a language. Might be faster. Use what you know best.
        
       | pythonbase wrote:
       | I do web scraping for fun and profit, primarily using Python.
       | Wrote a post some time back about it.
       | 
       | https://www.kashifaziz.me/web-scraping-python-beautifulsoup....
        
       | [deleted]
        
       | kruchone wrote:
        | In my career I've found several reasons not to use regular
        | expressions for parsing an HTML response, but the biggest is
        | that while they may work for 'properly formed' documents, you
        | would be surprised how lax browsers are about requiring a
        | document to be well-formed. Your regex, unless written with
        | particular care, will not be able to handle sites like that
        | (and there are a lot of them, at least in my experience). You
        | may be able to work 'edge cases' into your regex, but good
        | luck finding anyone but the expression's author who fully
        | understands it and can confidently change it as time goes on.
        | It is also a PITA to debug when groupings etc. aren't working
        | (and there will be a LOT of these cases with HTML/XML
        | documents).
       | 
       | It is honestly almost never worth it unless you have constraints
       | on what packages you can use and you MUST use regular
       | expressions. Just do your future-self a favor and use
       | BeautifulSoup or some other package designed to parse the tree-
       | like structure of these documents.
       | 
        | One way regex can be used appropriately is just finding a
        | pattern in the document, without caring where it sits relative
        | to the rest of the document. But even then, do you really want
        | to match <!-- <div> --> ?
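        | 
        | A minimal sketch of the difference, assuming bs4 is installed
        | (the markup is illustrative):
        | 
        |     from bs4 import BeautifulSoup
        |     
        |     # An unquoted attribute plus a commented-out tag: a naive
        |     # regex either misses the unquoted div or also matches
        |     # inside the comment. The parser handles both correctly.
        |     html = ('<div class=price>4.99</div>'
        |             '<!-- <div class=price>old</div> -->')
        |     soup = BeautifulSoup(html, "html.parser")
        |     divs = soup.find_all("div", class_="price")
        |     print([d.get_text() for d in divs])  # ['4.99']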
        
         | btown wrote:
         | For all the things jQuery got wrong, it got one thing right:
         | arguably the most intuitive way to target a set of data in a
         | document is by having a concise DSL that works on a parsed
         | representation of the document.
         | 
         | I'd love to see more innovation/developer-UX research on the
         | interactions between regexes, document parse trees, and NLP.
         | For instance, "match every verb phrase where the verb has
         | similar meaning to 'call' within the context of a specific CSS
         | selector, and be able to capture any data along that path in
         | capturing groups, and do something with it" right now takes
         | significant amounts of coding.
         | 
         | https://spacy.io/usage/rule-based-matching does a lot, but (a)
         | it's not particularly concise, (b) there's not a standardized
         | syntax for e.g. replacement strings once you detect something,
         | and (c) there's no real facilities to bake in a knowledge of
         | hierarchy within a larger markup-language document.
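          | 
          | For comparison, a minimal spaCy (v3) Matcher sketch,
          | assuming the en_core_web_sm model is installed - even this
          | trivial verb-lemma match takes a fair bit of setup:
          | 
          |     import spacy
          |     from spacy.matcher import Matcher
          |     
          |     nlp = spacy.load("en_core_web_sm")
          |     matcher = Matcher(nlp.vocab)
          |     # One token pattern: a verb whose lemma is "call"
          |     matcher.add("CALL", [[{"LEMMA": "call",
          |                            "POS": "VERB"}]])
          |     
          |     doc = nlp("Please call me tomorrow.")
          |     for _, start, end in matcher(doc):
          |         print(doc[start:end].text)  # -> call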
        
         | xapata wrote:
         | > confidently change it
         | 
         | Having a good variety of tests helps.
         | 
         | > tree structure
         | 
          | You'll need more than a regular language to parse a tree.
        
       | turtlebits wrote:
       | Been scraping for a long time. If handling JS isn't a
        | requirement, XPath is 100% the way to go. It's a standard
       | query language, very powerful, and there are great browser
       | extensions for helping you write queries.
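        | 
        | A minimal sketch with requests + lxml (the URL and queries
        | are illustrative):
        | 
        |     import requests
        |     from lxml import html
        |     
        |     page = requests.get("https://example.com")
        |     tree = html.fromstring(page.content)
        |     
        |     # XPath queries: every link href and the first heading
        |     hrefs = tree.xpath("//a/@href")
        |     title = tree.xpath("//h1/text()")
        |     print(title, hrefs[:5])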
        
       | VBprogrammer wrote:
        | One tip I would pass on when trying to scrape data from a
        | website: start by using wget in mirror mode to download the
       | useful pages. It's much faster to iterate on scraping the data
       | once you have it locally. Also, less likely to accidentally kill
       | the site or attract the attention of the host.
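        | 
        | A typical invocation might look like this (URL illustrative;
        | --mirror recurses, --no-parent stays below the start path,
        | --wait throttles requests, --convert-links rewrites links for
        | local browsing):
        | 
        |     wget --mirror --no-parent --wait=2 --convert-links \
        |         https://example.com/listings/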
        
         | strin wrote:
          | That only works for static pages though. Many modern pages
          | require you to run Selenium or Puppeteer to scrape the
          | content.
        
           | thaumasiotes wrote:
           | That's never required; the data shows up in the web page
           | because you requested it from somewhere. You can do the same
           | thing in your scraper.
        
             | dewey wrote:
             | > You can do the same thing in your scraper
             | 
              | Rendering the page in Puppeteer / Selenium and then
              | scraping it from there sounds a lot easier than somehow
              | trying to replicate that in your scraper?
        
               | thaumasiotes wrote:
               | Sure. How does that relate to the claim that your scraper
               | is actually unable to make the same requests your browser
               | does?
        
               | dewey wrote:
               | How are you going to deal with values generated by JS and
               | used to sign requests?
        
               | thaumasiotes wrote:
               | If they're really being generated client-side, you're
               | free to generate them yourself by any means you want. But
               | also, that's a strange thing for the website to do, since
               | it's applying a security feature (signatures) in a way
               | that prevents it from providing any security.
               | 
               | If they're generated server-side like you would expect,
               | and sent to the client, you'd get them the same way you
               | get anything else, by asking for them.
        
               | tester756 wrote:
               | >If they're really being generated client-side, you're
               | free to generate them yourself by any means you want. But
               | also, that's a strange thing for the website to do
               | 
               | what??
               | 
               | Page loads -> Javascript sends request to backend -> it
               | returns data -> javascript does stuff with it and renders
               | it.
        
               | thaumasiotes wrote:
               | Sure, that's the model from several comments up. It
               | doesn't involve signing anything.
        
               | dewey wrote:
                | I'm not sure what your point is. Of course you can
               | replicate every request in your scraper / with curl if
               | you want to if you know all the input variables.
               | 
               | Doing that for web scraping purposes where everything is
               | changing all the time and you have more than one target
               | website is just not feasible if you have to reverse
               | engineer some custom JS for every site. Using some kind
               | of headless browser for modern websites will be way
               | easier and more reliable.
        
               | pocket_cheese wrote:
               | As someone who has done a good bit of scraping, how a
               | website is designed dictates how I scrape.
               | 
               | If it's a static website that has consistently structured
               | HTML and is easy to enumerate through all the webpages
               | I'm looking for, then simple python requests code will
               | work.
               | 
               | The less clear case is when to use a headless browser vs
               | reverse engineering JS/server side APIs. Typically, I
               | will do like a 10 minute dive into the client side js and
               | monitor ajax requests to see if it would be super easy to
               | hit some API that returns JSON to get my data. If reverse
                | engineering seems too hairy, then I will just use a
                | headless browser.
               | 
                | I have a really strong preference for hitting JSON
                | APIs directly because, well, you get JSON! Also you
                | usually get more data than you even knew existed.
               | 
               | Then again, if I was creating a spider to recursively
               | crawl a non-static website, then I think Headless is the
               | path of least resistance. But usually, I'm trying to get
               | data in the HTML, and not the whole document.
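                | 
                | A sketch of that approach (the endpoint and
                | field names are hypothetical - find the real
                | ones in the network tab):
                | 
                |     import requests
                |     
                |     resp = requests.get(
                |         "https://example.com/api/items",
                |         params={"page": 1},
                |         headers={"User-Agent": "Mozilla/5.0"},
                |     )
                |     resp.raise_for_status()
                |     for it in resp.json().get("items", []):
                |         print(it.get("name"), it.get("price"))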
        
           | edmundsauto wrote:
           | For these sites, I crawl using a JS powered engine, and just
           | save the relevant page content to disk.
           | 
           | Then I can craft my regex/selectors/etc., once I have the
           | data stored locally.
           | 
           | This helps if you get caught and shut down - it won't turn
           | off your development effort, and you can create a separate
           | task to proxy requests.
        
         | spsphulse wrote:
         | Definitely! Scrape the disk, not the web.
        
         | johtso wrote:
         | Or just use scrapy's caching functionality. Super convenient.
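          | 
          | A minimal sketch - in a Scrapy project's settings.py:
          | 
          |     # Replay responses from a local disk cache on re-runs
          |     HTTPCACHE_ENABLED = True
          |     HTTPCACHE_EXPIRATION_SECS = 0  # 0 means never expire
          |     HTTPCACHE_DIR = "httpcache"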
        
       | fudged71 wrote:
       | I really appreciate the tips in the comments here.
       | 
       | As a beginner it makes a lot of sense to iterate on a local copy
       | with jupyter rather than fetching resources over and over until
       | you get it right. I wish more tutorials focused on this workflow.
        
       | dastx wrote:
       | For data extraction I highly recommend weboob. Despite the
       | unfortunate name, it does some really cool stuff. Writing modules
       | is quite straightforward and the structure they've chosen makes a
       | lot of sense.
       | 
       | I do wish there was a Go version of it, mostly because I much
        | prefer working with Go, but also because a single binary is
       | extremely useful.
        
       | ackbar03 wrote:
        | I've always had pretty bad experiences with web scraping; it's
        | such a pain in the ass and frequently breaks. I'm not sure if
        | I'm doing it wrong or if that's how it's supposed to be.
        
         | edmundsauto wrote:
         | It's heavily dependent on the site you're scraping. If they put
          | in active countermeasures, have a complex structure, or
          | update their templates frequently, it's going to be an
          | uphill battle.
         | 
         | Most sites IME are pretty easy.
        
         | hansvm wrote:
         | > pain in the ass
         | 
         | Yes, unequivocally.
         | 
         | > frequently breaks
         | 
         | It can definitely depend on what you're scraping, but in the
         | last few years or so the only project I had trouble with was
         | one where they changed the units for the unpublished API (the
         | real UI made two requests which mattered, one to grab the
          | units, and I missed that in my initial inspection -- it bit
          | me a while later when they changed the default behavior for
          | both locations).
         | 
         | A few tips:
         | 
         | As much as possible, try to find the original source for the
         | data. E.g., are there any hidden APIs, or is the data maybe
         | just sitting around in a script being used to populate the
         | HTML? Selenium is great when you need it, but in my experience
         | UI details change much more frequently than the raw data.
         | 
         | When choosing data selectors you'll get a feel for those which
         | might not be robust. E.g., the nth item in a list is prone to
         | breakage as minor UI tweaks are made.
         | 
         | If robustness is important, consider selecting the same data
         | multiple ways and validating your assumptions about the page.
         | E.g., you might want the data with a particular ID, combination
         | of classes, preceding title, or which is the only text element
         | formatted like a version number. When all of those methods
         | agree you're much more likely to have found the right thing,
         | and if they don't then you still have options for graceful
         | degradation; use a majority vote to guess at a value, use the
         | last known value, record N/A or some indication that we're not
         | sure right now, etc. Critically though, your monitoring can
         | instantly report that something is amiss so that you can
         | inspect the problem in more detail while the service still
         | operates in a hopefully acceptable degraded state.
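          | 
          | A sketch of the redundant-selector idea with bs4 (the
          | selectors and page structure are hypothetical):
          | 
          |     from bs4 import BeautifulSoup
          |     
          |     def looks_like_version(text):
          |         return text is not None and text.count(".") == 2
          |     
          |     def extract_version(soup):
          |         picks = [
          |             lambda s: s.find(id="app-version"),
          |             lambda s: s.select_one("span.version"),
          |             lambda s: s.find("span",
          |                              string=looks_like_version),
          |         ]
          |         found = [p(soup) for p in picks]
          |         values = [el.get_text(strip=True)
          |                   for el in found if el is not None]
          |         if not values:
          |             return None  # record N/A, alert monitoring
          |         # Majority vote; any disagreement is a signal to
          |         # inspect the page while the service degrades
          |         # gracefully
          |         return max(set(values), key=values.count)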
        
       | 1vuio0pswjnm7 wrote:
       | One thing I notice with all blog articles, and HN comments, on
       | scraping is that they always omit the actual use case, i.e., the
       | specific website that someone is trying to scrape. Any examples
       | tend to be so trivial as to be practically meaningless. They do
       | not prove anything.
       | 
       | If authors did name websites they wanted to scrape, or show tests
       | on actual websites, then we might see others come forward with
       | different solutions. Some of them might beat the ones being put
       | forward by the pre-packaged software libraries/frameworks and
       | commercial scraping services built on them, e.g., less brittle,
       | faster, less code, easier to repair.
       | 
       | We will never know.
        
       | tyingq wrote:
        | Pyppeteer might be worth a look as well. It's basically a port
        | of the JS Puppeteer project that drives headless Chrome via
        | the DevTools API.
        | 
        | As mentioned elsewhere, using anything other than headless
        | isn't useful beyond a fairly narrow scope these days.
       | 
       | https://github.com/pyppeteer/pyppeteer
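        | 
        | A minimal sketch, assuming pyppeteer is installed (it
        | downloads its own Chromium on first run):
        | 
        |     import asyncio
        |     from pyppeteer import launch
        |     
        |     async def fetch(url):
        |         browser = await launch()  # headless by default
        |         page = await browser.newPage()
        |         await page.goto(url, {"waitUntil": "networkidle2"})
        |         html = await page.content()
        |         await browser.close()
        |         return html
        |     
        |     print(len(asyncio.run(fetch("https://example.com"))))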
        
         | daolf wrote:
          | I think you'd be surprised by the number of websites you can
          | scrape without a headless browser.
          | 
          | Even Google SERPs can be scraped with a simple HTTP client.
        
           | tyingq wrote:
           | That's what I meant by narrow...a known set of sites and data
           | you want to extract.
           | 
           | I imagine, for example, building on the SERP example might
            | hit a wall if you added logged-in vs. logged-out SERPs,
            | iterating over carousel data, reading advertisement data,
            | etc.
        
             | daolf wrote:
              | A login wall can easily be bypassed with an HTTP client
              | by setting the correct auth header.
             | 
              | From what I can observe, 2/3 of websites can be scraped
             | without using a headless browser.
        
         | JackC wrote:
         | There's an official Python library for Playwright as well:
         | https://github.com/microsoft/playwright-python
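          | 
          | A minimal sketch with its sync API (after pip-installing,
          | run `playwright install` once to fetch the browsers):
          | 
          |     from playwright.sync_api import sync_playwright
          |     
          |     with sync_playwright() as p:
          |         browser = p.chromium.launch()  # headless default
          |         page = browser.new_page()
          |         page.goto("https://example.com")
          |         print(page.title())
          |         browser.close()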
        
         | pknerd wrote:
          | I am often contacted by people who ask me to scrape
          | dynamic/JS-rendered websites. You might be surprised to know
          | that many such dynamic websites actually depend on some API
          | endpoint, accessed via AJAX-like functionality, which you
          | can hit directly to get the required data. I have also often
          | found that data not fetched from an external source was
          | already available in the page, either in a data attribute or
          | in some JSON-like structure, hence there was no need to use
          | Selenium with a headless browser.
        
           | tyingq wrote:
            | Sure. This one happens not to be Selenium.
        
       | philshem wrote:
        | Before jumping into frameworks, if your data is lucky enough
        | to be stored in an html table:
        | 
        |     import pandas as pd
        |     dfs = pd.read_html(url)
        | 
        | Where 'dfs' is a list of dataframes - one item for each html
        | table on the page.
       | 
       | https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
        
         | andreilys wrote:
          | +1 this has saved me countless hours
        
         | mikesholiu wrote:
         | what does this do
        
           | SirSourdough wrote:
           | It reads HTML and returns the tables contained in the HTML as
           | pandas dataframes. It's a simple way to scrape tabular data
           | from websites.
        
         | JosephRedfern wrote:
         | Woah. I've used pandas a fair amount and had no idea about
         | this. Thank you!
        
       | FL33TW00D wrote:
       | I recently undertook my first scraping project, and after trying
       | a number of things landed upon Scrapy.
       | 
       | It's been a blessing. Not only can it handle difficult sites, but
       | it's super quick to write another spider for the easy sites that
       | provide the JSON blob in a handy single API call.
       | 
        | The only problem I had was getting around Cloudflare; I tried
        | a few things like Puppeteer but had no luck.
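        | 
        | For the easy case mentioned above, a spider can be tiny (the
        | endpoint and field names are hypothetical):
        | 
        |     import scrapy
        |     
        |     class ProductsSpider(scrapy.Spider):
        |         name = "products"
        |         start_urls = ["https://example.com/api/products"]
        |     
        |         def parse(self, response):
        |             # The site serves one JSON blob with everything
        |             for item in response.json().get("items", []):
        |                 yield {"name": item["name"],
        |                        "price": item["price"]}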
        
       | toolslive wrote:
        | Fetching HTML, parsing it, and navigating the parsed result
        | (or using regexps) is what worked 20 years ago. These days,
        | with all these reactive JavaScript frameworks, you'd better
        | skip to item number 5: headless browsing. Also mind that
        | Facebook, Instagram, ... will have anti-scraping measures in
        | place. It's a race ;)
        
         | lemagedurage wrote:
         | It's not all bad, many modern sites just expose a JSON API that
         | can be used. It really depends on how protective and large the
         | company behind it is.
        
         | A4ET8a8uTh0 wrote:
         | This. Even relatively simple websites are much harder to parse
         | today. I did a minor side project for a customer scraping some
         | info and anti-scraping measures were in full force. It feels
         | like an all out war.
        
           | iagovar wrote:
            | Such as? I've never encountered anything I wasn't able to
            | overcome.
        
             | dewey wrote:
             | Once recaptcha is in the mix it'll get tricky pretty
             | quickly. Everything else is easy to overcome most of the
             | time.
        
             | ttoomm28 wrote:
             | Datadome, Incapsula
        
             | toolslive wrote:
             | please click on all the traffic lights you see below.
        
             | amenod wrote:
             | Recaptcha comes to mind.
             | 
             | That said, there are quite a few services which battle
             | these systems for you nowadays (such as scraperapi - not
             | affiliated, not a user). They are not always successful,
             | but they have an advantage of maaany residential proxies
             | (no doubt totally ethically obtained /s, but that's another
             | story).
        
             | thinkingkong wrote:
             | Usually starts from simple to difficult. User agent stuff,
             | IP address detection, aggressive rate limiting, captcha
             | checking, browser fingerprinting, etc.
        
             | A4ET8a8uTh0 wrote:
              | It can be overcome and, admittedly, I am new to this, so
              | for me that means way more time spent trying to make it
              | work. The odd one that I got stuck on for a while was a
              | list where each individual record held the pertinent
              | details, but the list had, seemingly randomly, items
              | that looked like records on the surface but were not
              | (so those had to be identified and ignored). Small
              | things like that.
             | 
             | Still, I would love to learn more about your approach if
             | you would be willing to share.
        
         | almost wrote:
         | I've found that a lot of the time that's not needed. Often
         | you'll find the data as a JSON blob in the page and can just
         | read it directly from there. Or find that there's an API
         | endpoint that the javascript reads.
        
           | monkeybutton wrote:
           | I find this method works best. Skip looking at the page and
           | instead watch all the network requests as the page loads.
        
             | WrtCdEvrydy wrote:
             | ZAP HUD Proxy is the best option here...
             | 
              | Load the page on it and it shows you all the requests
              | being made along with their payloads as they happen.
             | 
             | Find the one you need, copy the data, endpoint and HTTP
             | verb and recreate it in your language of choice :D
        
       | pknerd wrote:
        | I have been developing scrapers and crawlers and writing[1]
        | about them for many years, and have used many Python-based
        | libs so far, including Selenium. I have written such scrapers
        | for individuals and startups for several purposes. The biggest
        | issues I faced were the rendering of dynamic sites and the
        | blocking of IPs due to the absence of proxies, which are not
        | cheap at all, especially for individuals.
        | 
        | Services like Scrapingbee and ScraperAPI serve such needs
        | quite well. I personally liked ScraperAPI for rendering
        | dynamic websites due to the better response time.
        | 
        | Shameless plug: in case anyone is interested, a long time back
        | I wrote about this on my blog, which you can read here[2]. Now
        | you do not need to set up a remote Chrome instance or
        | anything. All that's required is to hit an API endpoint to
        | fetch content from dynamic JS-rendered websites.
       | 
       | [1] http://blog.adnansiddiqi.me/tag/scraping/
       | 
       | [2] http://blog.adnansiddiqi.me/scraping-dynamic-websites-
       | using-...
        
       | mikece wrote:
       | Aside from the Beautiful Soup library, is there something about
       | Python that makes it a better choice for web scraping than
       | languages such as Java, JavaScript, Go, Perl or even C#?
        
         | lemagedurage wrote:
         | I think Python makes sense, at least for the prototyping phase.
         | There's a lot of trial and error involved, and Python is quick
         | to write.
        
         | freedomben wrote:
         | I find javascript (node) to be best suited to web scraping
         | personally. Using the same language to scrape/process as you
         | use to develop those interfaces seems most natural.
        
           | isbvhodnvemrwvn wrote:
           | Especially with stuff like Puppeteer which allows you to
           | execute JS in context of the browser (which admittedly can
           | lead to weird bugs as the functions are serialized and lose
           | context)
        
         | monkeybutton wrote:
         | I like python for the ease of use and scraping is I/O bound
         | anyways so there's no pressure to switch to a more performant
         | language.
        
           | RhodesianHunter wrote:
           | I'd say that really depends on your scale and what you're
           | doing with the content you scrape.
           | 
            | In my experience with large scale scraping you're much
            | better off using something like Java where you can more
            | easily have a thread pool with thousands of threads (or
            | better yet, Kotlin coroutines) handling the crawling
            | itself and a thread pool sized to the number of CPU cores
            | handling CPU-bound tasks like parsing.
        
             | monkeybutton wrote:
             | Could you give a ballpark figure for what you mean by large
              | scale scraping? I've only worked on a couple of
              | projects: one was broad (100K to 500K domains) and
              | shallow (root + 1 level of page depth, with a low cap on
              | the number of child pages); the other was just a single
              | domain, but scraping around 50K pages from it.
        
               | RhodesianHunter wrote:
               | My experience was with e-commerce scraping. Not many
               | domains, but a massive catalogue.
        
               | tluyben2 wrote:
               | I would say millions of domains regularly. That's where
               | the pricing of most 'scraping services' falls down too
               | compared to just doing it yourself.
        
         | jbergstroem wrote:
         | The Scrapy library written in Python (https://scrapy.org) is
         | excellent for writing and deploying scrapers.
        
         | diarrhea wrote:
         | Don't know about large scales, but just today I threw together
         | a script using selenium, imported pandas to mangle the scraped
         | data and quickly exported to json. For quick and dirty,
         | possibly one-off jobs like that, Python is a great choice.
        
       | max_ wrote:
       | How does one do scraping properly on dynamic client side rendered
       | pages?
        
         | Topgamer7 wrote:
          | Most SPA pages will honour direct URI requests and route you
          | properly in JavaScript. You just need to have your scraping
          | pipeline use PhantomJS or Selenium to wait until the page
          | stops loading, then scrape the HTML.
          | 
          | Although it might just be easier to scrape their API
          | endpoints directly instead of mucking with HTML if it's a
          | dynamic page. The data is structured that way, and easier to
          | query.
        
         | jmt_ wrote:
         | Usually the approach is to use a headless browser. The headless
         | browser instance runs purely in memory without a GUI then
         | renders the website you're interested in. Then, it comes down
         | to regular DOM parsing. A common library that I enjoy is
         | Selenium with Python.
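          | 
          | A minimal sketch (assumes Selenium 4 and a chromedriver on
          | the PATH; the selector is illustrative):
          | 
          |     from selenium import webdriver
          |     from selenium.webdriver.chrome.options import Options
          |     from selenium.webdriver.common.by import By
          |     from selenium.webdriver.support.ui import WebDriverWait
          |     from selenium.webdriver.support import (
          |         expected_conditions as EC)
          |     
          |     opts = Options()
          |     opts.add_argument("--headless")
          |     driver = webdriver.Chrome(options=opts)
          |     try:
          |         driver.get("https://example.com")
          |         # Wait until client-side rendering has produced
          |         # the element we want
          |         el = WebDriverWait(driver, 10).until(
          |             EC.presence_of_element_located(
          |                 (By.CSS_SELECTOR, "h1")))
          |         print(el.text)
          |     finally:
          |         driver.quit()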
        
       | spsphulse wrote:
       | Is there a SOTA library for common web scraping issues at scale(
       | especially distributed over cluster of nodes) for Captcha
       | detection, IP rotation, Rate throttling, Queue Management etc.?
        
         | ddorian43 wrote:
         | What's a "SOTA library" ?
        
           | banana_giraffe wrote:
           | A contextual guess: "'State of the art' library"
           | 
           | In other words: Is there a drop in library to solve all the
           | big common issues people run into scraping websites in the
           | wild?
           | 
           | At least, that's how I read it.
        
             | ddorian43 wrote:
             | There is no "state of the art library" to build your own
             | google. But "Rate throttling/limiting" can be done with
             | Redis, rotating ip is still rate-limiting with Redis,
             | Captcha Detection - You have to pay $$ I think.
        
       | Tistel wrote:
       | It's fun to combine jupyter notebooks and py scraping. If you are
       | working 15 pages/screens deep, you can "stay at the coal face"
       | and not have to rerun the whole script after making a change to
       | the latest step.
        
         | psychomugs wrote:
         | I write scrapers for fun and notebooks for work but never
         | thought to combine the two. Great idea!
        
         | CapriciousCptl wrote:
         | Oh! That's a good idea. My goto has always been pipelining
         | along a series of functions, but never thought of just using
         | Jupyter for some reason.
        
         | diarrhea wrote:
         | `ipython -i <script>` also works similarly for debugging, by
         | having the powerful interpreter open after running the script,
         | without the jupyter overhead.
        
         | whoisburbansky wrote:
         | I love the imagery of this being "at the coal face" thanks for
         | that
        
       | cameroncairns wrote:
       | I think this article does an OK job covering how to scrape
       | websites rendered serverside, but I strongly discourage people
       | from scraping SPAs using a headless browser unless they
       | absolutely have to. The article's author touches on this briefly,
       | but you're far better off using the network tab in your browser's
       | debug tools to see what AJAX requests are being made and figuring
       | out how those APIs work. This approach results in far less server
       | load for the target website as you don't need to request a bunch
       | of other resources, reduces the overall bandwidth costs, and
       | greatly speeds up the runtime of your script since you don't need
       | to spend time running javascript in the headless browser. That
       | can be especially slow if your script has to click/interact with
       | elements on the page to get the results you need.
       | 
       | Other than that, I'd strongly caution anyone looking into making
       | parallel requests. Always keep in mind the sysadmin and engineers
        | behind the site you are targeting. It can be tempting to value
        | your own time by making a ton of parallel requests to reduce
        | the overall runtime of your script, but you can potentially
        | cause massive server load for the site you're targeting. If
        | that isn't motivation enough to give you pause, keep in mind
        | that the site owner is more likely to make the site hostile to
        | scrapers if there are too many bad actors hitting it heavily.
        
         | jamra wrote:
         | How would you deal with authentication?
        
           | cameroncairns wrote:
            | I don't! As far as I know, scraping data behind a login is
            | illegal in the United States. You can look into the court
            | case Facebook v. Power Ventures for the background on
            | that.
           | This page https://www.rcfp.org/scraping-not-violation-cfaa/
           | seems to have a decent overview of scraping laws in general.
           | It's definitely a legal gray area so I'd suggest doing your
           | research! This doesn't constitute legal advice and all that,
           | I'm not a lawyer just a guy who does some scraping here and
           | there :)
        
           | ddorian43 wrote:
           | There was (is?) a DARPA project called "Memex" that was built
           | to crawl the hidden web that has many tools like crawling
           | with authentication, automatic registration, machine-learning
           | to detect search-forms, auto detecting pagination etc etc etc
           | etc https://github.com/darpa-i2o/memex-program-index
        
       | tluyben2 wrote:
       | I wanted to do some larger distributed scraping jobs recently and
       | although it was easy to get everything running on one machine
       | (with different tools including Scrapy), I was surprised how hard
        | it was to do at scale. The open source ones I could find were
        | hard or impossible to get working, overly complex, badly
        | documented, etc.
       | 
        | The services I found were reasonably priced for small jobs,
        | but at scale they quickly become vastly more expensive than
        | setting this up yourself, especially when you need to run
        | these jobs every month or so, even if you have to write some
        | code to make the open source solutions actually work.
        
         | nr2x wrote:
         | It gets more complicated when you need to leverage real browser
         | engines (eg Chrome). I've got jobs spread across ~ 20 machines/
         | 140 concurrent browser instances, it's non-trivial.
        
           | ddorian43 wrote:
            | How many pages can you render per second per vCPU core?
        
         | turtlebits wrote:
         | AWS Lambdas are an easy way to get scheduled scraping jobs
         | running.
         | 
          | I use their Python-based chalice framework
          | (https://github.com/aws/chalice) which allows you to add a
          | decorator to a method for a schedule:
          | 
          |     @app.schedule(Rate(30, unit=Rate.MINUTES))
          | 
          | It's also a breeze to deploy:
          | 
          |     chalice deploy
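          | 
          | A fuller sketch of a scheduled job (the URL and handler
          | body are illustrative):
          | 
          |     from urllib.request import urlopen
          |     from chalice import Chalice, Rate
          |     
          |     app = Chalice(app_name="scraper")
          |     
          |     @app.schedule(Rate(30, unit=Rate.MINUTES))
          |     def scrape(event):
          |         with urlopen("https://example.com") as resp:
          |             body = resp.read()
          |         # ...parse `body` and store the results, e.g. in S3
          |         return {"bytes": len(body)}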
        
       | cubano wrote:
       | My last contract job was to build a 100% perfect website
       | mirroring program for a group of lawyers who were interested in
        | building class action lawsuits against some of the more
        | heinous scammers out there.
       | 
       | I ended up building like 8 versions of it, literally using every
       | PHP and Python library and resource I could find.
       | 
        | I tried httrack, php-ultimate-web-scraper (from GitHub),
        | headless Chromium, headless Selenium, and a few others.
       | 
        | By far the biggest problem was dealing with JS links... you
        | wouldn't think from the start that it would be such a big
        | deal, and yet it was.
       | 
       | Selenium with python turned out to be the winning combination,
       | and of course, it was the last one I tried. Also, this is an
        | ideal project to implement recursion, although you have to be
        | careful about exit conditions.
       | 
        | One thing that was VERY important for performance was not
        | visiting any page more than once because, obviously, certain
        | links in headers and footers are duped sometimes 100s of
        | times.
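        | 
        | A sketch of that visited-set recursion (driver is e.g. a
        | Selenium WebDriver; save_page and extract_links are
        | hypothetical helpers):
        | 
        |     from urllib.parse import urldefrag, urljoin
        |     
        |     def crawl(driver, url, visited, max_pages=10000):
        |         url, _ = urldefrag(url)  # drop #fragments
        |         if url in visited or len(visited) >= max_pages:
        |             return  # exit conditions for the recursion
        |         visited.add(url)
        |         driver.get(url)
        |         save_page(url, driver.page_source)
        |         for link in extract_links(driver):
        |             crawl(driver, urljoin(url, link),
        |                   visited, max_pages)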
       | 
        | JS links often made it very difficult to discover the linked
        | page, and certain library calls that were supposed to get this
        | info for you often didn't work.
       | 
       | It was a super fun project, and in the end considering I only
       | worked for 2 months, I shipped some decent code that was getting
       | like 98.6% of the pages perfectly.
       | 
        | The final presentation was interesting... for some reason my
        | client had, I think, gotten it into his head that I wasn't a
        | very good programmer or something. We ran through his list of
        | sample sites expecting my program to error out or incorrectly
        | mirror them, but it handled all 10 of the sites nearly
        | perfectly. He was rather flabbergasted; he told me it would
        | have taken him a week of hand-clicking per site to build a
        | mirror, but the program did them all in under an hour.
        
         | lysecret wrote:
         | What stack did you end up using ?
        
           | sdfsrrte543 wrote:
           | >Selenium with python turned out to be the winning
           | combination, and of course, it was the last one I tried.
        
         | banana_giraffe wrote:
         | I had to solve nearly the exact same problem for the same
         | reasons. I too ended up with Selenium.
         | 
         | My favorite part was having a nice working system, then
          | throwing it in the cloud and finding out a shocking number
          | of sites tell you to go away if you come at them from a
          | cloud-based IP.
         | 
         | Shouldn't be surprising, but it was still annoying.
        
           | Terretta wrote:
           | There are a number of so-called "residential VPN" services
           | with clients that also serve as the firm's p2p VPN / proxy
           | edge. Some can be subscribed to commercially to resolve
           | precisely the above issue.
           | 
           | Preferably, only give money to one that tells their users
           | this is how it works.
        
         | gazelle21 wrote:
         | Ha! I am currently building something very similar at work, and
          | JS links are driving me up a wall. It's interesting because
          | you would think it's super simple, but I still haven't found
          | a good solution. Luckily my boss is understanding.
        
       | js8 wrote:
       | Does anyone know how could I script Save Page WE extension in
       | Firefox? It does a really nice job of saving the page as it
       | looks, including dynamic content.
        
       | inovica wrote:
       | We created a fun side project to grab the index page of every
       | domain - we downloaded a list of approx 200m domains. However, we
       | ran into problems when our provider complained. It was something
       | to do with the DNS side of things and we were told to run our own
       | DNS server. If there is anyone on here with experience of
       | crawling across this number of domain names it would be great to
       | talk!
        
       | bityard wrote:
       | Is web scraping going to continue to be a viable thing, now that
       | the web is mainly an app delivery platform rather than a content
       | delivery platform?
       | 
       | Can you scrape a webasm site?
        
         | tluyben2 wrote:
         | Maybe it will turn out to be that way, but this is far from
         | reality at the moment. There are not many sites that cannot be
         | scraped statically and there definitely are very few sites/apps
         | that are webasm.
         | 
         | It'll change, but who knows how much. At least currently, most
         | scraping professionals are not even using headless browsers as
         | their targets are statically rendered.
        
         | jjice wrote:
         | I'm not 100% sure what you mean by a 'webasm' site (web
          | assembly powered?), but the article describes scraping via
         | headless browsers which actually render the page and allow you
         | to select elements that are client rendered.
        
       ___________________________________________________________________
       (page generated 2021-02-10 23:01 UTC)