[HN Gopher] Web Scraping 101 with Python
___________________________________________________________________
Web Scraping 101 with Python
Author : daolf
Score : 235 points
Date : 2021-02-10 15:17 UTC (7 hours ago)
(HTM) web link (www.scrapingbee.com)
(TXT) w3m dump (www.scrapingbee.com)
| NDizzle wrote:
 | I've been involved in many web scraping jobs over the past 25
 | years or so. The most recent one, which was a long time ago at
 | this point, was using Scrapy. I went with XML tools for
 | controlling the DOM.
|
| It's worked unbelievably well. It's been running for roughly 5
| years at this point. I send a command at a random time between
| 11pm and 4am to wake up an ec2 instance. It checks its tags to
| see if it should execute the script. If so, it does so. When it's
| done with its scraping for the day, it turns itself off.
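 |
 | (For the curious, a minimal sketch of that wake / check-tags /
 | shut-down loop, assuming boto3; the instance id and tag name
 | are hypothetical.)
 |
 |     import boto3
 |
 |     ec2 = boto3.client("ec2")
 |     INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance id
 |
 |     # Sent from elsewhere at a random time between 11pm and 4am:
 |     ec2.start_instances(InstanceIds=[INSTANCE_ID])
 |
 |     # On the instance itself: check a tag to decide whether to run,
 |     # then power off once the day's scraping is done.
 |     tags = ec2.describe_tags(
 |         Filters=[{"Name": "resource-id", "Values": [INSTANCE_ID]}]
 |     )["Tags"]
 |     should_run = any(t["Key"] == "run-scraper" and t["Value"] == "true"
 |                      for t in tags)
 |     if should_run:
 |         ...  # run the day's scraping here
 |     ec2.stop_instances(InstanceIds=[INSTANCE_ID])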
|
| This is a tiny snapshot of why it's been so difficult for me to
| go from python2 to python3. I'm strongly in the camp of "if it
| ain't broke, don't fix it".
| silicon2401 wrote:
| why can't you just keep using python2? surely some people out
| there are interested enough to keep updating and maintaining
| it?
| NDizzle wrote:
| I certainly can keep using it. There have been so many
| efforts to get people to update Python 2 code to Python 3
| code that it's on my backlog to do it. Will I get to it this
| year? Probably not.
| Topgamer7 wrote:
| Using `2to3` might get you 80% of the way there. Although cases
| like this make tests really valuable.
| jC6fhrfHRLM9b3 wrote:
| This is an ad.
| ohmyblock wrote:
| Any advantage/disadvantage in using Javascript instead of Python
| for web scraping?
| ddorian43 wrote:
| It's just a language. Might be faster. Use what you know best.
| pythonbase wrote:
| I do web scraping for fun and profit, primarily using Python.
| Wrote a post some time back about it.
|
| https://www.kashifaziz.me/web-scraping-python-beautifulsoup....
| [deleted]
| kruchone wrote:
 | In my career I found several reasons not to use regular
 | expressions for parsing an HTML response, but the largest was
 | this: they may work for 'properly formed' documents, but you
| would be surprised how lax all browsers are about requiring the
| document to be well-formed. Your regex, unless particularly
| handled, will not be able to handle sites like this (and there
| are a lot, at least from my career experience). And you may be
| able to work 'edge cases' into your RegEx, but good luck finding
| anyone but the expression author who fully understands and can
| confidently change it as time goes on. It is also a PITA to debug
| when groupings/etc. aren't working (and there will be a LOT of
| these cases with HTML/XML documents).
|
| It is honestly almost never worth it unless you have constraints
| on what packages you can use and you MUST use regular
| expressions. Just do your future-self a favor and use
| BeautifulSoup or some other package designed to parse the tree-
| like structure of these documents.
|
| One way it can be used appropriately is just finding a pattern in
 | the document - without caring where it is w.r.t. the rest of the
| document. But even then, do you really want to match: <!-- <div>
| --> ?
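 |
 | A tiny illustration of that exact pitfall (a sketch using re and
 | BeautifulSoup; the HTML is made up):
 |
 |     import re
 |     from bs4 import BeautifulSoup
 |
 |     html = "<!-- <div>old markup</div> --> <div>real content</div>"
 |
 |     # The naive regex happily matches the div inside the comment.
 |     print(re.findall(r"<div>(.*?)</div>", html))
 |     # -> ['old markup', 'real content']
 |
 |     # A real parser knows the first div is commented out.
 |     soup = BeautifulSoup(html, "html.parser")
 |     print([d.get_text() for d in soup.find_all("div")])
 |     # -> ['real content']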
| btown wrote:
| For all the things jQuery got wrong, it got one thing right:
| arguably the most intuitive way to target a set of data in a
| document is by having a concise DSL that works on a parsed
| representation of the document.
|
| I'd love to see more innovation/developer-UX research on the
| interactions between regexes, document parse trees, and NLP.
| For instance, "match every verb phrase where the verb has
| similar meaning to 'call' within the context of a specific CSS
| selector, and be able to capture any data along that path in
| capturing groups, and do something with it" right now takes
| significant amounts of coding.
|
| https://spacy.io/usage/rule-based-matching does a lot, but (a)
| it's not particularly concise, (b) there's not a standardized
| syntax for e.g. replacement strings once you detect something,
| and (c) there's no real facilities to bake in a knowledge of
| hierarchy within a larger markup-language document.
| xapata wrote:
| > confidently change it
|
| Having a good variety of tests helps.
|
| > tree structure
|
| You'll need a complete language to parse a tree.
| turtlebits wrote:
| Been scraping for a long time. If handling JS isn't a
 | requirement, XPath is 100% the way to go. It's a standard
| query language, very powerful, and there are great browser
| extensions for helping you write queries.
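 |
 | For example, with lxml (a minimal sketch; the URL and query are
 | just placeholders):
 |
 |     import requests
 |     from lxml import html
 |
 |     page = requests.get("https://example.com")
 |     tree = html.fromstring(page.content)
 |
 |     # Text of every link inside the main navigation.
 |     links = tree.xpath("//nav//a/text()")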
| VBprogrammer wrote:
 | One tip I would pass on when trying to scrape data from a
 | website: start by using wget in mirror mode to download the
 | useful pages. It's much faster to iterate on scraping the data
 | once you have it locally. Also, you're less likely to
 | accidentally kill the site or attract the attention of the host.
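 |
 | Something like `wget --mirror --wait=1 https://example.com` gets
 | the pages onto disk; after that, iterating locally is a simple
 | loop (a sketch - the mirror directory and selector are
 | assumptions):
 |
 |     from pathlib import Path
 |     from bs4 import BeautifulSoup
 |
 |     for path in Path("example.com").rglob("*.html"):
 |         soup = BeautifulSoup(path.read_text(errors="ignore"),
 |                              "html.parser")
 |         for row in soup.select("table.prices tr"):
 |             print([td.get_text(strip=True) for td in row.find_all("td")])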
| strin wrote:
 | That only works for static pages though. Many modern pages would
 | require you to run Selenium or Puppeteer to scrape the
 | content.
| thaumasiotes wrote:
| That's never required; the data shows up in the web page
| because you requested it from somewhere. You can do the same
| thing in your scraper.
| dewey wrote:
| > You can do the same thing in your scraper
|
| Rendering the page in Puppeteer / Selenium and then
 | scraping it from there sounds a lot easier than
| somehow trying to replicate that in your scraper?
| thaumasiotes wrote:
| Sure. How does that relate to the claim that your scraper
| is actually unable to make the same requests your browser
| does?
| dewey wrote:
| How are you going to deal with values generated by JS and
| used to sign requests?
| thaumasiotes wrote:
| If they're really being generated client-side, you're
| free to generate them yourself by any means you want. But
| also, that's a strange thing for the website to do, since
| it's applying a security feature (signatures) in a way
| that prevents it from providing any security.
|
| If they're generated server-side like you would expect,
| and sent to the client, you'd get them the same way you
| get anything else, by asking for them.
| tester756 wrote:
| >If they're really being generated client-side, you're
| free to generate them yourself by any means you want. But
| also, that's a strange thing for the website to do
|
| what??
|
| Page loads -> Javascript sends request to backend -> it
| returns data -> javascript does stuff with it and renders
| it.
| thaumasiotes wrote:
| Sure, that's the model from several comments up. It
| doesn't involve signing anything.
| dewey wrote:
 | I'm not sure what your point is. Of course you can
| replicate every request in your scraper / with curl if
| you want to if you know all the input variables.
|
| Doing that for web scraping purposes where everything is
| changing all the time and you have more than one target
| website is just not feasible if you have to reverse
| engineer some custom JS for every site. Using some kind
| of headless browser for modern websites will be way
| easier and more reliable.
| pocket_cheese wrote:
| As someone who has done a good bit of scraping, how a
| website is designed dictates how I scrape.
|
| If it's a static website that has consistently structured
| HTML and is easy to enumerate through all the webpages
| I'm looking for, then simple python requests code will
| work.
|
 | The less clear case is when to use a headless browser vs
 | reverse engineering JS/server-side APIs. Typically, I
 | will do a 10 minute dive into the client-side JS and
 | monitor AJAX requests to see if it would be super easy to
 | hit some API that returns JSON to get my data. If reverse
 | engineering seems too hairy, then I will just use a headless
 | browser.
 |
 | I have a really strong preference for hitting JSON APIs
 | directly because, well, you get JSON! Also you usually
 | get more data than you even knew existed.
|
| Then again, if I was creating a spider to recursively
| crawl a non-static website, then I think Headless is the
| path of least resistance. But usually, I'm trying to get
| data in the HTML, and not the whole document.
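 |
 | A sketch of that "hit the JSON API directly" path (the endpoint,
 | params and headers below are hypothetical - use whatever the
 | network tab shows the page actually calling):
 |
 |     import requests
 |
 |     session = requests.Session()
 |     session.headers.update({"User-Agent": "Mozilla/5.0"})
 |
 |     resp = session.get(
 |         "https://example.com/api/v1/products",
 |         params={"page": 1, "per_page": 50},
 |     )
 |     data = resp.json()  # structured data, often more than the page shows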
| edmundsauto wrote:
| For these sites, I crawl using a JS powered engine, and just
| save the relevant page content to disk.
|
| Then I can craft my regex/selectors/etc., once I have the
| data stored locally.
|
| This helps if you get caught and shut down - it won't turn
| off your development effort, and you can create a separate
| task to proxy requests.
| spsphulse wrote:
| Definitely! Scrape the disk, not the web.
| johtso wrote:
| Or just use scrapy's caching functionality. Super convenient.
| fudged71 wrote:
| I really appreciate the tips in the comments here.
|
| As a beginner it makes a lot of sense to iterate on a local copy
| with jupyter rather than fetching resources over and over until
| you get it right. I wish more tutorials focused on this workflow.
| dastx wrote:
| For data extraction I highly recommend weboob. Despite the
| unfortunate name, it does some really cool stuff. Writing modules
| is quite straightforward and the structure they've chosen makes a
| lot of sense.
|
| I do wish there was a Go version of it, mostly because I much
 | prefer working with Go, but also because a single binary is
| extremely useful.
| ackbar03 wrote:
 | I've always had pretty bad experiences with web scraping; it's
| such a pain in the ass and frequently breaks. I'm not sure if I'm
| doing it wrong or if that's how it's supposed to be.
| edmundsauto wrote:
| It's heavily dependent on the site you're scraping. If they put
| in active counter measures, have a complex structure, or update
| their templates frequently, it's going to be an uphill battle.
|
| Most sites IME are pretty easy.
| hansvm wrote:
| > pain in the ass
|
| Yes, unequivocally.
|
| > frequently breaks
|
| It can definitely depend on what you're scraping, but in the
| last few years or so the only project I had trouble with was
| one where they changed the units for the unpublished API (the
| real UI made two requests which mattered, one to grab the
| units, and I missed that in my initial inspection -- it bit me
 | a while later when they changed the default behavior for both
| locations).
|
| A few tips:
|
| As much as possible, try to find the original source for the
| data. E.g., are there any hidden APIs, or is the data maybe
| just sitting around in a script being used to populate the
| HTML? Selenium is great when you need it, but in my experience
| UI details change much more frequently than the raw data.
|
| When choosing data selectors you'll get a feel for those which
| might not be robust. E.g., the nth item in a list is prone to
| breakage as minor UI tweaks are made.
|
| If robustness is important, consider selecting the same data
| multiple ways and validating your assumptions about the page.
| E.g., you might want the data with a particular ID, combination
| of classes, preceding title, or which is the only text element
| formatted like a version number. When all of those methods
| agree you're much more likely to have found the right thing,
| and if they don't then you still have options for graceful
| degradation; use a majority vote to guess at a value, use the
| last known value, record N/A or some indication that we're not
| sure right now, etc. Critically though, your monitoring can
| instantly report that something is amiss so that you can
| inspect the problem in more detail while the service still
| operates in a hopefully acceptable degraded state.
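 |
 | A rough sketch of that "select it several ways and compare" idea
 | (the selectors are hypothetical; soup is a BeautifulSoup object
 | for the page):
 |
 |     from collections import Counter
 |
 |     def extract_version(soup):
 |         dt = soup.find("dt", string="Version")
 |         candidates = [
 |             soup.find(id="app-version"),
 |             soup.select_one(".sidebar .version"),
 |             dt.find_next_sibling("dd") if dt else None,
 |         ]
 |         values = [c.get_text(strip=True) for c in candidates if c]
 |         if not values:
 |             return None  # record N/A and let monitoring flag it
 |         value, votes = Counter(values).most_common(1)[0]
 |         if votes < len(values):
 |             print("selector disagreement:", values)  # alert here
 |         return value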
| 1vuio0pswjnm7 wrote:
| One thing I notice with all blog articles, and HN comments, on
| scraping is that they always omit the actual use case, i.e., the
| specific website that someone is trying to scrape. Any examples
| tend to be so trivial as to be practically meaningless. They do
| not prove anything.
|
| If authors did name websites they wanted to scrape, or show tests
| on actual websites, then we might see others come forward with
| different solutions. Some of them might beat the ones being put
| forward by the pre-packaged software libraries/frameworks and
| commercial scraping services built on them, e.g., less brittle,
| faster, less code, easier to repair.
|
| We will never know.
| tyingq wrote:
 | Pyppeteer might be worth a look as well. Basically a port of the
 | JS Puppeteer project that drives headless Chrome via the DevTools
 | API.
|
| As mentioned elsewhere, using anything other than headless isn't
| useful beyond a fairly narrow scope these days.
|
| https://github.com/pyppeteer/pyppeteer
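 |
 | A minimal pyppeteer sketch (async like the JS original; the URL
 | is a placeholder):
 |
 |     import asyncio
 |     from pyppeteer import launch
 |
 |     async def fetch(url):
 |         browser = await launch(headless=True)
 |         page = await browser.newPage()
 |         await page.goto(url)
 |         content = await page.content()  # fully rendered HTML
 |         await browser.close()
 |         return content
 |
 |     html = asyncio.run(fetch("https://example.com"))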
| daolf wrote:
 | I think you'd be surprised by the number of websites you can
 | scrape without a headless browser.
 |
 | Even Google SERPs can be scraped with a simple HTTP client.
| tyingq wrote:
| That's what I meant by narrow...a known set of sites and data
| you want to extract.
|
| I imagine, for example, building on the SERP example might
 | hit a wall if you added logged-in vs. not-logged-in SERPs,
 | iterating over carousel data, reading advertisement data, etc.
| daolf wrote:
 | A login wall can easily be bypassed with an HTTP client by
 | setting the correct auth header.
 |
 | From what I can observe, 2 out of 3 websites can be scraped
 | without using a headless browser.
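 |
 | For instance (a sketch; the header/cookie names depend entirely
 | on the site and are assumptions here):
 |
 |     import requests
 |
 |     session = requests.Session()
 |     # Reuse credentials captured from a logged-in browser session.
 |     session.headers["Authorization"] = "Bearer <token from devtools>"
 |     # or: session.cookies.set("sessionid", "<value from the browser>")
 |
 |     html = session.get("https://example.com/members-only").text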
| JackC wrote:
| There's an official Python library for Playwright as well:
| https://github.com/microsoft/playwright-python
| pknerd wrote:
 | I am often contacted by people who ask me to scrape
 | dynamic/JS-rendered websites. You might be surprised to know
 | that many such dynamic websites actually depend on
 | some API endpoint that is accessed via AJAX-like
 | calls, which you can hit directly to get the
 | required data. I often faced situations where the data was not
 | fetched from an external source at all but was already available
 | in a data attribute or some JSON-like structure, hence no need
 | to use Selenium with a headless browser.
| tyingq wrote:
 | Sure. This one happens not to be Selenium.
| philshem wrote:
 | Before jumping into frameworks, if your data is lucky enough to
 | be stored in an HTML table:
 |
 |     import pandas as pd
 |     dfs = pd.read_html(url)
 |
 | Where 'dfs' is a list of DataFrames - one item for each HTML
 | table on the page.
|
| https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
| andreilys wrote:
 | +1 this has saved me countless hours
| mikesholiu wrote:
| what does this do
| SirSourdough wrote:
| It reads HTML and returns the tables contained in the HTML as
| pandas dataframes. It's a simple way to scrape tabular data
| from websites.
| JosephRedfern wrote:
| Woah. I've used pandas a fair amount and had no idea about
| this. Thank you!
| FL33TW00D wrote:
| I recently undertook my first scraping project, and after trying
| a number of things landed upon Scrapy.
|
| It's been a blessing. Not only can it handle difficult sites, but
| it's super quick to write another spider for the easy sites that
| provide the JSON blob in a handy single API call.
|
 | The only problem I had was getting around Cloudflare; I tried a
 | few things like Puppeteer but had no luck.
| toolslive wrote:
 | Fetching HTML, parsing it, and navigating the parsed result
 | (or using regexps) is what used to work 20 years ago. These days,
 | with all these reactive JavaScript frameworks, you'd better skip
 | to item number 5: headless browsing. Also mind that Facebook,
 | Instagram, ... will have anti-scraping measures in place. It's a
 | race ;)
| lemagedurage wrote:
| It's not all bad, many modern sites just expose a JSON API that
| can be used. It really depends on how protective and large the
| company behind it is.
| A4ET8a8uTh0 wrote:
| This. Even relatively simple websites are much harder to parse
| today. I did a minor side project for a customer scraping some
| info and anti-scraping measures were in full force. It feels
| like an all out war.
| iagovar wrote:
 | Such as? I've never encountered anything I wasn't able to
| overcome.
| dewey wrote:
| Once recaptcha is in the mix it'll get tricky pretty
| quickly. Everything else is easy to overcome most of the
| time.
| ttoomm28 wrote:
| Datadome, Incapsula
| toolslive wrote:
| please click on all the traffic lights you see below.
| amenod wrote:
| Recaptcha comes to mind.
|
| That said, there are quite a few services which battle
| these systems for you nowadays (such as scraperapi - not
| affiliated, not a user). They are not always successful,
| but they have an advantage of maaany residential proxies
| (no doubt totally ethically obtained /s, but that's another
| story).
| thinkingkong wrote:
| Usually starts from simple to difficult. User agent stuff,
| IP address detection, aggressive rate limiting, captcha
| checking, browser fingerprinting, etc.
| A4ET8a8uTh0 wrote:
| It can be overcome and, admittedly, I am new to this so for
| me that means way more time spent trying to make it work.
 | The odd one that I got stuck on for a while was a rendered
 | list where each individual record held the pertinent details, but
 | the list had, seemingly randomly, items that looked like
 | records on the surface but were not (so those had to be
 | identified and ignored). Small things like that.
|
| Still, I would love to learn more about your approach if
| you would be willing to share.
| almost wrote:
| I've found that a lot of the time that's not needed. Often
| you'll find the data as a JSON blob in the page and can just
| read it directly from there. Or find that there's an API
| endpoint that the javascript reads.
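 |
 | The "JSON blob in the page" case often looks like this (a sketch;
 | the script id varies per site - __NEXT_DATA__ is just one common
 | convention):
 |
 |     import json
 |     import requests
 |     from bs4 import BeautifulSoup
 |
 |     page = requests.get("https://example.com").text
 |     soup = BeautifulSoup(page, "html.parser")
 |     blob = soup.find("script", id="__NEXT_DATA__")
 |     data = json.loads(blob.string)  # the same data the JS renders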
| monkeybutton wrote:
| I find this method works best. Skip looking at the page and
| instead watch all the network requests as the page loads.
| WrtCdEvrydy wrote:
| ZAP HUD Proxy is the best option here...
|
 | Load the page on it and it shows you all the requests being
| made along with the payload as it happens.
|
| Find the one you need, copy the data, endpoint and HTTP
| verb and recreate it in your language of choice :D
| pknerd wrote:
 | I have been developing scrapers and crawlers, and writing[1] about
 | them, for many years and have used many Python-based libs so far,
 | including Selenium. I have written such scrapers for individuals
 | and startups for several purposes. The biggest issues I faced were
 | rendering dynamic sites and getting IPs blocked due to the absence
 | of proxies, which are not cheap at all, especially for individuals.
 |
 | Services like ScrapingBee and ScraperAPI serve quite well
 | for such problems. I personally liked ScraperAPI for rendering
 | dynamic websites due to the better response time.
 |
 | Shameless plug: in case anyone is interested, a long time back
 | I wrote about it on my blog, which you can read here[2]. Now
 | you do not need to set up a remote Chrome instance or anything. All
 | that is required is to hit an API endpoint to fetch content from
 | dynamic JS-rendered websites.
|
| [1] http://blog.adnansiddiqi.me/tag/scraping/
|
| [2] http://blog.adnansiddiqi.me/scraping-dynamic-websites-
| using-...
| mikece wrote:
| Aside from the Beautiful Soup library, is there something about
| Python that makes it a better choice for web scraping than
| languages such as Java, JavaScript, Go, Perl or even C#?
| lemagedurage wrote:
| I think Python makes sense, at least for the prototyping phase.
| There's a lot of trial and error involved, and Python is quick
| to write.
| freedomben wrote:
| I find javascript (node) to be best suited to web scraping
| personally. Using the same language to scrape/process as you
| use to develop those interfaces seems most natural.
| isbvhodnvemrwvn wrote:
| Especially with stuff like Puppeteer which allows you to
| execute JS in context of the browser (which admittedly can
| lead to weird bugs as the functions are serialized and lose
| context)
| monkeybutton wrote:
 | I like Python for the ease of use, and scraping is I/O bound
 | anyway, so there's no pressure to switch to a more performant
 | language.
| RhodesianHunter wrote:
| I'd say that really depends on your scale and what you're
| doing with the content you scrape.
|
 | In my experience with large-scale scraping you're much better
 | off using something like Java where you can more easily have
 | a thread pool with thousands of threads (or better yet,
 | Kotlin coroutines) handling the crawling itself and a pool
 | sized to the number of cores handling CPU-bound tasks like
 | parsing.
| monkeybutton wrote:
| Could you give a ballpark figure for what you mean by large
 | scale scraping? I've only worked on a couple of projects: one
 | was broad (100K to 500K domains) and shallow (root + 1
 | level of page depth, also with a low cap on the number of
 | child pages). The other was just a single domain, but
 | scraping around 50K pages from it.
| RhodesianHunter wrote:
| My experience was with e-commerce scraping. Not many
| domains, but a massive catalogue.
| tluyben2 wrote:
| I would say millions of domains regularly. That's where
| the pricing of most 'scraping services' falls down too
| compared to just doing it yourself.
| jbergstroem wrote:
| The Scrapy library written in Python (https://scrapy.org) is
| excellent for writing and deploying scrapers.
| diarrhea wrote:
| Don't know about large scales, but just today I threw together
| a script using selenium, imported pandas to mangle the scraped
| data and quickly exported to json. For quick and dirty,
| possibly one-off jobs like that, Python is a great choice.
| max_ wrote:
| How does one do scraping properly on dynamic client side rendered
| pages?
| Topgamer7 wrote:
 | Most SPA pages will honour direct URI requests and route you
 | properly in JavaScript. You just need to have your scraping
 | pipeline use PhantomJS or Selenium to wait until the page stops
 | loading, then scrape the HTML.
 |
 | Although it might just be easier to scrape their API endpoints
 | directly instead of mucking with HTML if it's a dynamic page.
 | The data is structured that way, and easier to query.
| jmt_ wrote:
| Usually the approach is to use a headless browser. The headless
| browser instance runs purely in memory without a GUI then
| renders the website you're interested in. Then, it comes down
| to regular DOM parsing. A common library that I enjoy is
| Selenium with Python.
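 |
 | A bare-bones version of that (Selenium with headless Chrome; the
 | URL and the element waited on are placeholders):
 |
 |     from selenium import webdriver
 |     from selenium.webdriver.common.by import By
 |     from selenium.webdriver.support.ui import WebDriverWait
 |     from selenium.webdriver.support import expected_conditions as EC
 |
 |     options = webdriver.ChromeOptions()
 |     options.add_argument("--headless")
 |     driver = webdriver.Chrome(options=options)
 |
 |     driver.get("https://example.com")
 |     # Wait for the client-side render to finish before parsing.
 |     WebDriverWait(driver, 10).until(
 |         EC.presence_of_element_located((By.CSS_SELECTOR, "#app .results"))
 |     )
 |     html = driver.page_source
 |     driver.quit()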
| spsphulse wrote:
 | Is there a SOTA library for common web scraping issues at scale
 | (especially distributed over a cluster of nodes) for captcha
 | detection, IP rotation, rate throttling, queue management, etc.?
| ddorian43 wrote:
| What's a "SOTA library" ?
| banana_giraffe wrote:
| A contextual guess: "'State of the art' library"
|
| In other words: Is there a drop in library to solve all the
| big common issues people run into scraping websites in the
| wild?
|
| At least, that's how I read it.
| ddorian43 wrote:
 | There is no "state of the art library" to build your own
 | Google. But rate throttling/limiting can be done with
 | Redis, rotating IPs is still rate limiting with Redis, and
 | for captcha detection you have to pay $$ I think.
| Tistel wrote:
 | It's fun to combine Jupyter notebooks and Python scraping. If you are
| working 15 pages/screens deep, you can "stay at the coal face"
| and not have to rerun the whole script after making a change to
| the latest step.
| psychomugs wrote:
| I write scrapers for fun and notebooks for work but never
| thought to combine the two. Great idea!
| CapriciousCptl wrote:
| Oh! That's a good idea. My goto has always been pipelining
| along a series of functions, but never thought of just using
| Jupyter for some reason.
| diarrhea wrote:
| `ipython -i <script>` also works similarly for debugging, by
| having the powerful interpreter open after running the script,
| without the jupyter overhead.
| whoisburbansky wrote:
 | I love the imagery of this being "at the coal face" - thanks for
| that
| cameroncairns wrote:
| I think this article does an OK job covering how to scrape
 | websites rendered server-side, but I strongly discourage people
| from scraping SPAs using a headless browser unless they
| absolutely have to. The article's author touches on this briefly,
| but you're far better off using the network tab in your browser's
| debug tools to see what AJAX requests are being made and figuring
| out how those APIs work. This approach results in far less server
| load for the target website as you don't need to request a bunch
| of other resources, reduces the overall bandwidth costs, and
| greatly speeds up the runtime of your script since you don't need
| to spend time running javascript in the headless browser. That
| can be especially slow if your script has to click/interact with
| elements on the page to get the results you need.
|
| Other than that, I'd strongly caution anyone looking into making
| parallel requests. Always keep in mind the sysadmin and engineers
 | behind the site you are targeting. It can be tempting to value
| your own time by making a ton of parallel requests to reduce the
| overall time of your script, but you can potentially cause
| massive server load for the site you're targeting. If that isn't
 | enough motivation to give you pause, keep in mind that the site
| owner is more likely to make the site hostile to scrapers if
| there are too many bad actors hitting the site heavily.
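 |
 | Even a trivial delay goes a long way (a sketch; the URLs and the
 | 1.5 s pause are arbitrary):
 |
 |     import time
 |     import requests
 |
 |     urls = ["https://example.com/page/%d" % i for i in range(1, 6)]
 |     session = requests.Session()
 |     for url in urls:
 |         resp = session.get(url, timeout=30)
 |         # ... parse resp.text ...
 |         time.sleep(1.5)  # keep the load on their servers gentle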
| jamra wrote:
| How would you deal with authentication?
| cameroncairns wrote:
 | I don't! As far as I know, scraping data behind a login is
 | illegal in the United States. You can look into the case
 | Facebook v. Power Ventures for information behind that.
| This page https://www.rcfp.org/scraping-not-violation-cfaa/
| seems to have a decent overview of scraping laws in general.
| It's definitely a legal gray area so I'd suggest doing your
| research! This doesn't constitute legal advice and all that,
| I'm not a lawyer just a guy who does some scraping here and
| there :)
| ddorian43 wrote:
 | There was (is?) a DARPA project called "Memex" that was built
 | to crawl the hidden web and has many tools for crawling
 | with authentication, automatic registration, machine learning
 | to detect search forms, auto-detecting pagination, etc. etc.
 | https://github.com/darpa-i2o/memex-program-index
| tluyben2 wrote:
| I wanted to do some larger distributed scraping jobs recently and
| although it was easy to get everything running on one machine
| (with different tools including Scrapy), I was surprised how hard
 | it was to do at scale. The open source ones I could find were
 | hard/impossible to get working, overly complex, badly documented,
 | etc.
|
 | I found the services to be reasonably priced for small jobs, but
 | at scale they quickly become vastly more expensive than setting
| this up yourself. Especially when you need to run these jobs
| every month or so. Even if you have to write some code to make
| the open source solutions actually work.
| nr2x wrote:
| It gets more complicated when you need to leverage real browser
| engines (eg Chrome). I've got jobs spread across ~ 20 machines/
| 140 concurrent browser instances, it's non-trivial.
| ddorian43 wrote:
 | How many pages can you render in a second per vCPU core?
| turtlebits wrote:
| AWS Lambdas are an easy way to get scheduled scraping jobs
| running.
|
| I use their Python-based chalice framework
| (https://github.com/aws/chalice) which allows you to add a
| decorator to a method for a schedule,
| @app.schedule(Rate(30, unit=Rate.MINUTES))
|
| It's also a breeze to deploy. chalice deploy
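 |
 | Filled out a little (a minimal sketch; the app name and the body
 | of the scheduled function are assumptions):
 |
 |     from chalice import Chalice, Rate
 |
 |     app = Chalice(app_name="scraper")
 |
 |     @app.schedule(Rate(30, unit=Rate.MINUTES))
 |     def run(event):
 |         ...  # your scraping code goes here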
| cubano wrote:
| My last contract job was to build a 100% perfect website
| mirroring program for a group of lawyers who were interested in
 | building class action lawsuits against some of the more heinous
| scammers out there.
|
| I ended up building like 8 versions of it, literally using every
| PHP and Python library and resource I could find.
|
 | I tried httrack, php-ultimate-web-scraper (from GitHub), headless
 | Chromium, headless Selenium, and a few others.
|
| By far the biggest problem was dealing with JS links...you
| wouldn't think from the start it would be such a big deal but
| yet..it was.
|
 | Selenium with Python turned out to be the winning combination,
 | and of course, it was the last one I tried. Also, this is an
 | ideal project to implement recursion, although you have to be
 | careful about exit conditions.
|
| One thing that was VERY important for performance was not
 | visiting any page more than once because, obviously, certain
| links in headers and footers are duped sometimes 100s of times.
|
| JS links often made it very difficult to discover the linked
 | page, and certain library calls that were supposed to get this
| info for you often didn't work.
|
| It was a super fun project, and in the end considering I only
| worked for 2 months, I shipped some decent code that was getting
| like 98.6% of the pages perfectly.
|
 | The final presentation was interesting... for some reason my
 | client, I think, had got it in his head that I wasn't a very good
 | programmer or something. We ran through his list of sample sites
 | expecting my program to error out or incorrectly mirror them,
 | but it handled all 10 of the sites almost perfectly, and he was
 | rather flabbergasted because he told me it would have taken him a
 | week of hand-clicking through a site to make a mirror, but instead
 | the program did them all in under an hour.
| lysecret wrote:
| What stack did you end up using ?
| sdfsrrte543 wrote:
| >Selenium with python turned out to be the winning
| combination, and of course, it was the last one I tried.
| banana_giraffe wrote:
| I had to solve nearly the exact same problem for the same
| reasons. I too ended up with Selenium.
|
| My favorite part was having a nice working system, then
 | throwing it in the cloud and finding out a shocking number of
| sites tell you to go away if you come at them from a cloud-
| based IP.
|
| Shouldn't be surprising, but it was still annoying.
| Terretta wrote:
| There are a number of so-called "residential VPN" services
| with clients that also serve as the firm's p2p VPN / proxy
| edge. Some can be subscribed to commercially to resolve
| precisely the above issue.
|
| Preferably, only give money to one that tells their users
| this is how it works.
| gazelle21 wrote:
| Ha! I am currently building something very similar at work, and
 | JS links are driving me up a wall. It's interesting because you
 | would think it's super simple, but I still haven't found a good
| solution. Luckily my boss is understanding.
| js8 wrote:
 | Does anyone know how I could script the Save Page WE extension in
| Firefox? It does a really nice job of saving the page as it
| looks, including dynamic content.
| inovica wrote:
| We created a fun side project to grab the index page of every
| domain - we downloaded a list of approx 200m domains. However, we
| ran into problems when our provider complained. It was something
| to do with the DNS side of things and we were told to run our own
| DNS server. If there is anyone on here with experience of
| crawling across this number of domain names it would be great to
| talk!
| bityard wrote:
| Is web scraping going to continue to be a viable thing, now that
| the web is mainly an app delivery platform rather than a content
| delivery platform?
|
| Can you scrape a webasm site?
| tluyben2 wrote:
| Maybe it will turn out to be that way, but this is far from
| reality at the moment. There are not many sites that cannot be
| scraped statically and there definitely are very few sites/apps
| that are webasm.
|
| It'll change, but who knows how much. At least currently, most
| scraping professionals are not even using headless browsers as
| their targets are statically rendered.
| jjice wrote:
| I'm not 100% sure what you mean by a 'webasm' site (web
 | assembly powered?), but the article describes scraping via
 | headless browsers, which actually render the page and allow you
 | to select elements that are client-rendered.
___________________________________________________________________
(page generated 2021-02-10 23:01 UTC)