[HN Gopher] Web scraping with Python open knowledge
___________________________________________________________________
Web scraping with Python open knowledge
Author : PigiVinci83
Score : 114 points
Date : 2022-05-27 16:45 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| account-5 wrote:
| I was reading another thread about web scraping where someone
| mentioned CSS selectors being way quicker than XPath. I'm easy
| either way, but apart from a more powerful syntax, what other
| benefits are there?
| PigiVinci83 wrote:
| With a large codebase like ours, we found that XPath is more
| readable, but I understand it's a personal preference. We don't
| do high-frequency scraping, so the performance of CSS vs XPath
| wasn't a factor. It's an interesting point I'd like to write
| more about, thanks for sharing.
| showerst wrote:
| CSS is nice because it's more readable than XPATH for longer
| queries, and is friendlier to newer programmers who didn't come
| up when XML was big.
|
| XPATH is generally more powerful for really gnarly things and
| for backtracking: "show me the 3rd paragraph that's a sibling
| of the fourth div with id='subhed' and contains the text
| 'starting'".
| dotancohen wrote:
| > XPATH is generally more powerful...
|
| That's a convincing argument if you can back it up with an
| XPATH expression.
| mdaniel wrote:
| Well, the rest of their sentence summed it up pretty well;
| try and implement that example using CSS selectors
|
| Hell, even "find id=subhead and _go up one element_" isn't
| possible in CSS because that's not a problem it was
| designed to solve
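|
| For what it's worth, a sketch of what those could look like in
| XPath (my reading of the example above; the exact axes depend
| on the real page, so treat it as illustrative):
|         from parsel import Selector
|         sel = Selector(text=html)  # `html` assumed to hold the page source
|         # "3rd paragraph that's a sibling of the fourth div with
|         # id='subhed' and contains the text 'starting'"
|         sel.xpath('(//div[@id="subhed"])[4]'
|                   '/following-sibling::p[contains(., "starting")][3]')
|         # "find id=subhed and go up one element"
|         sel.xpath('//*[@id="subhed"]/..')
| Neither has a plain CSS-selector equivalent, since CSS has no
| parent axis and no text-content predicate.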
| mdaniel wrote:
| In my experience, it's not that CSS selectors are "more
| powerful," but rather "more legible." XPath is for sure more
| powerful, but also usually lower signal to noise ratio
| response.css("#the-id") # vs
| response.xpath("//*[@id='the-id']")
|
| Thankfully, Scrapy (well, pedantically "parsel") allows mixing
| and matching, using the one which makes the most sense
| response.css(".someClass").xpath(".//*[starts-with(text(),
| 'Price')]")
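|
| A minimal runnable version of that mixing, using parsel directly
| (the selector library behind Scrapy); the HTML below is made up
| purely for illustration:
|         from parsel import Selector
|         html = ('<div class="someClass"><span>Price: 10 EUR</span>'
|                 '<span>Shipping: 2 EUR</span></div>')
|         sel = Selector(text=html)
|         # CSS for the easy part, XPath for the text test CSS can't express
|         prices = sel.css(".someClass").xpath(
|             ".//*[starts-with(text(), 'Price')]/text()").getall()
|         print(prices)  # ['Price: 10 EUR']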
| PigiVinci83 wrote:
| A work-in-progress guide about web scraping in Python, anti-bot
| software and techniques, and so on. Please feel free to share
| it and to contribute with your own experience too.
| alexchamberlain wrote:
| The tab formatting seems like an odd (and rather unPythonic)
| addition. What's the intention there?
| datalopers wrote:
| Why is whitespace even discussed in a tutorial about web
| scraping? I think it speaks to how amateur this documentation
| is, but hey, it's a tutorial on web scraping in Python, which
| is an HN crowd favorite and guaranteed to hit the front page
| [1][2].
|
| [1] 3 days ago: https://news.ycombinator.com/item?id=31500007
|
| [2] 12 days ago: https://news.ycombinator.com/item?id=31387248
| PigiVinci83 wrote:
| Because it's not a tutorial on web scraping but a mix of what
| we suggest doing internally and what we've learned from our
| experience in this field over the years. For our codebase we
| prefer tabs over spaces, but I understand it's a debate that
| has lasted for decades :) Thanks for the point though, I'll
| rephrase that part of the guide.
| civilized wrote:
| It's not a best practice, it's just a random thing your
| team does. It does make the team sound amateurish if it
| can't distinguish between meaningful best practices and
| just conventions the team happens to have.
| datalopers wrote:
| It's odd to me that your apparent revenue stream is from
| scraping difficult-to-scrape sites and you're
| broadcasting the exact tactics you use to bypass anti-
| scraping systems. You're making your own life difficult
| by giving Cloudflare/PerimeterX/etc the information
| necessary to improve their tooling.
|
| You also seem to advertise many of the sites/datasets
| you're scraping, which opens you up to litigation.
| Especially if they're employing anti-scraping tooling and
| you're brazenly bypassing those. It doesn't matter that
| it's legal in most jurisdictions of the world, you'll
| still have to handle cease and desists or potential
| lawsuits, which is a major cost and distraction.
| punnerud wrote:
| > You also seem to advertise many of the sites/datasets you're
| > scraping, which opens you up to litigation.
|
| Isn't that settled now, after the "LinkedIn vs HiQ" case?
| Public information is only protected by copyright, but you can
| use the by-product however it suits your new business?
| datalopers wrote:
| The only clear outcome from the LinkedIn case, afaik, is that
| scraping of publicly accessible data is not a federal crime
| under the CFAA [1]. There are still plenty of
| other civil ways that someone can sue you to stop
| scraping their site: breach of contract, trespass to
| chattels, trademark infringement, etc. And they can do so
| over and over again til you're broke. OP is based in
| Italy anyway so I have absolutely no clue what does and
| doesn't apply.
|
| I'd like to point out that, while HiQ Labs "won" the
| case, that company is basically dead. The CEO and CTO are
| both working for other companies now. So I think the
| bigger takeaway is: don't get yourself sued while you're
| a tiny little startup.
|
| [1] https://www.natlawreview.com/article/hiq-labs-v-linkedin
| hrbf wrote:
| Same here. It feels out of place and unnecessary, and its
| rationalization is unconvincing. Considering it's Python, an
| outright weird suggestion.
| wheelerof4te wrote:
| Python allows indenting using tabs, so I don't understand
| why it's a weird decision.
|
| In fact, they even stated their reasoning in the document.
| I don't see why anyone has to blindly follow PEP 8, nor do I
| get why a 4-space indent has to be considered the standard.
| etskinner wrote:
| On the page about canvas fingerprinting[0], it only mentions
| Cloudflare. From what I can tell, reCaptcha v3 also uses canvas
| fingerprinting [1]
|
| [0] https://github.com/reanalytics-databoutique/webscraping-
| open...
|
| [1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/
| PigiVinci83 wrote:
| Thanks for sharing, I'll update the page soon.
| jamestimmins wrote:
| I appreciate the inclusion of anti-bot software. As someone who
| builds plugins for enterprise apps (currently Airtable), I
| really want to build automated tests for my apps with Selenium,
| but keep getting foiled by anti-bot measures.
|
| Can anyone recommend other resources for understanding anti-bot
| tech and their workarounds?
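|
| For what it's worth, the first things these products seem to
| check are the obvious automation signals; a sketch of the usual
| mitigations (standard Selenium/Chrome options only, and no
| guarantee they help against any given vendor):
|         from selenium import webdriver
|         options = webdriver.ChromeOptions()
|         # drop the "controlled by automated software" hints
|         options.add_argument("--disable-blink-features=AutomationControlled")
|         options.add_experimental_option("excludeSwitches", ["enable-automation"])
|         options.add_experimental_option("useAutomationExtension", False)
|         driver = webdriver.Chrome(options=options)
|         # navigator.webdriver is one of the most commonly checked flags
|         driver.execute_cdp_cmd(
|             "Page.addScriptToEvaluateOnNewDocument",
|             {"source": "Object.defineProperty(navigator, 'webdriver',"
|                        " {get: () => undefined})"})
|         driver.get("https://example.com")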
| captn3m0 wrote:
| Good list, though I'm confused about the "tabs weighing less"
| bit. Isn't that a preference left to the end devs?
|
| Another tip I've found useful is to check whether the data is
| accessible in a mobile app and to proxy it to see if there's a
| JSON API available.
| PigiVinci83 wrote:
| Thanks for your reply, mobile data is a topic I need to add
| soon. Usually we check with Fiddler whether there's an API
| underneath, but only for really problematic websites.
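|
| When there is one, the scraper usually reduces to something
| like this (the endpoint and headers below are hypothetical
| placeholders, not a real API):
|         import requests
|         resp = requests.get(
|             "https://api.example.com/v1/products",  # endpoint seen in the proxy
|             params={"page": 1},
|             headers={
|                 # mimic the app's own User-Agent
|                 "User-Agent": "ExampleApp/4.2.0 (Android 12)",
|                 "Accept": "application/json",
|             },
|             timeout=30)
|         resp.raise_for_status()
|         data = resp.json()  # structured data, no HTML parsing needed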
| [deleted]
| Xeoncross wrote:
| Plug for https://commoncrawl.org/ if you need billions of pages
| but don't want to deal with scraping the web yourself.
| [deleted]
| squiggy22 wrote:
| Are there subsets of Common Crawl anywhere for individual
| sites? YouTube, for example?
| magundu wrote:
| You can query a subset for specific sites from Common Crawl
| itself.
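|
| Roughly like this, via the CDX index API (the crawl id below is
| just one 2022 crawl; pick whichever is current):
|         import gzip, json, requests
|         INDEX = "https://index.commoncrawl.org/CC-MAIN-2022-21-index"
|         # 1. ask the index which captures exist for a URL pattern
|         resp = requests.get(INDEX, params={"url": "example.com/*",
|                                            "output": "json"})
|         records = [json.loads(line) for line in resp.text.splitlines() if line]
|         # 2. pull one captured page out of its WARC file with a range request
|         rec = records[0]
|         start = int(rec["offset"])
|         end = start + int(rec["length"]) - 1
|         warc = requests.get("https://data.commoncrawl.org/" + rec["filename"],
|                             headers={"Range": f"bytes={start}-{end}"})
|         # each record slice is an independently gzipped WARC record
|         print(gzip.decompress(warc.content).decode("utf-8", "replace")[:500])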
| mdaniel wrote:
| It would thrill me if common crawl were updated with such
| frequency that it would allow new search engines to enter the
| market
|
| I haven't dug into it enough to know if there's some technical
| reason it's not currently the case, or just lack of
| (interest|willpower)
| input_sh wrote:
| I'd argue that one broad crawl every 2-3 months in addition
| to their updated-daily news crawl[0] should be good enough to
| make a rudimentary search engine.
|
| [0] https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html
| mdaniel wrote:
| You're right about the "rudimentary" part, because I don't
| know how they do it but the major players have some not-
| kidding-around freshness:
|
| https://www.google.com/search?hl=en&q=%22thrill%20me%22%20%
| 2...
|
| https://www.bing.com/search?q=%22thrill+me%22+%22common+cra
| w... _(and DDG similarly, because bing)_
|
| ed: I was curious if maybe HN publishes a sitemap, and it
| seems no. Then again, hnreplies knows about the HN API so
| maybe it's special-cased by the big crawlers
| https://github.com/ggerganov/hnreplies#hnreplies
| afandian wrote:
| If you're the kind of person who wants "open data" (read as
| broadly as you like) and could get it in snapshots direct from
| the source without having to scrape, what would your ideal format
| be?
|
| I know it's a very open ended question.
| PigiVinci83 wrote:
| Thanks for the question. I can speak to what we've encountered
| in these years of web scraping, and nothing beats an API
| returning JSON, but I'm sure there are formats that are even
| friendlier to read.
| jshen wrote:
| Probably RDF serialized as hextuples
| https://github.com/ontola/hextuples
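|
| If I read that spec right, it's ND-JSON where every statement
| is an array of six strings, [subject, predicate, value,
| datatype, language, graph]; a made-up line for illustration:
|         import json
|         stmt = ["https://example.com/item/1",               # subject
|                 "http://schema.org/name",                   # predicate
|                 "Example item",                             # value (literal)
|                 "http://www.w3.org/2001/XMLSchema#string",  # datatype
|                 "en",                                       # language tag
|                 ""]                                         # graph (blank here)
|         print(json.dumps(stmt))  # one line of the .hext file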
| afandian wrote:
| Looks interesting. From that page I couldn't see what the
| 'graph' field relates to. Is it the identifier for a distinct
| named graph? It was blank in the examples.
|
| Do you use it? What for?
| sgtquack wrote:
| As someone who recently dealt with scraping sites behind
| Cloudflare... I never want to scrape again.
| jonatron wrote:
| As the second sentence says, it's a cat and mouse game, so
| there's no incentive on either side of bot vs anti-bot to share
| information.
| PigiVinci83 wrote:
| I'm sure no one will add their secret sauce here :)
___________________________________________________________________
(page generated 2022-05-27 23:00 UTC)