[HN Gopher] Show HN: Flyscrape - A standalone and scriptable web...
___________________________________________________________________
Show HN: Flyscrape - A standalone and scriptable web scraper in Go
Author : philippta
Score : 138 points
Date : 2023-11-11 14:18 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lucgagan wrote:
| This looks great. I wish I had this a few months ago! Giving it a
| try.
| philippta wrote:
| Glad to hear! You're welcome to leave any feedback on Github
| (as an Issue) or right in here.
| bryanrasmussen wrote:
| Looks like it doesn't have the option of running as a particular
| browser etc. Which I guess makes it fine for a lot of pages, but
| a lot of scraping tasks would also be affected. Am I right or did
| I miss something?
| philippta wrote:
| Yes, this is correct. As of right now there is no built-in
| support for running as a browser.
|
| What is possible, though, is to use a service like ScrapingBee
| (not affiliated) and set it as the proxy. This would render the
| page in a browser on their end.
| acheong08 wrote:
| Try tls-client. It gets around the TLS fingerprinting done by
| Cloudflare.
| snake117 wrote:
| Looks interesting, and thank you for sharing this! One common
| issue with scraping web pages is dealing with data that is
| dynamically loaded. Is there a solution for this? For example,
| when using Scrapy, you can have Splash running in Docker via
| scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).
| figmert wrote:
| Can't you load the URL that is being dynamically loaded
| directly within your scraper?
| mdaniel wrote:
| Not only can you, in my experience it is substantially less
| drama and arguably less load on the target system, since the
| full page makes many, many other requests that a presentation
| layer would care about but I don't.
|
| The trade-offs usually fall into:
|
| - authing to the endpoint can sometimes be weird
|
| - it for sure makes the traffic stand out since it isn't
| otherwise surrounded by those extraneous requests
|
| - it, as with all good things scraping, carries its own
| maintenance and monitoring burden
|
| However, similar to those tradeoffs, it's also been my
| experience that a full page load offers a ton more tracking
| opportunities that are not present in a direct endpoint
| fetch. I mean, look how many "stealth" plugins are out there,
| designed to mask the fact that a headless browser is headless.
|
| But, having said all of that: without question the biggest
| risk to modern day scraping is Cloudflare and Akamai
| gatekeeping. I do appreciate the arguments of "but ddos!11",
| and yet I would rather only actors that are actually
| exhibiting bad behavior[1] be blocked, instead of everyone
| with a copy of Python who has set reasonable rate limits.
|
| 1 = this setting aside that "bad behavior" can be defined as
| "downloading data that the site makes freely available to
| Chrome but not freely available to python"
| philippta wrote:
| Thanks! As mentioned in another comment, there is currently no
| built-in support for this.
|
| As a workaround, one could use a service like ScrapingBee (not
| affiliated) as a proxy, which renders the page in a browser for
| you.
|
| Admittedly, relying on a service for this is not always ideal.
| I am also working on a small wrapper that turns Chrome into an
| HTTPS proxy, which you could plug right into flyscrape.
| Unfortunately it is still very experimental and not public yet.
| I have not yet decided whether to release it as part of
| flyscrape or as a separate project.
| xyzzy_plugh wrote:
| I like web scraping in Go. The support for parsing HTML in
| x/net/html is pretty good, and libraries like
| github.com/PuerkitoBio/goquery go a long way toward matching the
| ergonomics of other tools. This project uses both, but then also
| goes on to use github.com/dop251/goja, which is a JavaScript VM
| _and_ its accompanying Node.js compatibility layer _and_ even
| esbuild, in order to _interpret scraping instruction scripts_.
|
| I mean, at this point I am not sure Go is the right tool for the
| job (I am _actually_ pretty confident that it is _not_ ).
|
| A pretty neat stack of engineering, sure! This is cool, nicely
| done. But I can't help but feel disturbed.
| cxr wrote:
| Your comment was posted 4 minutes ago. That means you still
| have enough time to edit your comment to change it so it
| contains real URLs that link to the project repos for the
| packages mentioned:
|
| <https://github.com/PuerkitoBio/goquery>
|
| <https://github.com/dop251/goja>
|
| (Please do not reply to this comment of mine--if you do, I
| won't be able to delete it once the previous post is fixed,
| because the existence of the replies will prevent that.)
| cheapgeek wrote:
| Ok
| xyzzy_plugh wrote:
| Even if I saw this post in time, I wouldn't have edited it.
| They are all proper Go package names.
| sunshadow wrote:
| These days, I'm not even using Go for scraping that much, as
| webpage changes drive me crazy and JS code evaluation is a
| lifesaver, so I moved to TypeScript+Playwright. (The Crawlee
| framework is cool, though not strictly necessary.)
|
| It's been 8+ years since I started scraping. I even wrote a
| popular Go web scraping framework previously:
| (https://github.com/geziyor/geziyor).
|
| My favorite stack as of 2023:
| TypeScript+Playwright+Crawlee (optional). If you're serious
| about scraping, you should learn JavaScript; thus, Playwright
| should be a good fit.
|
| Note: There are niche cases where a lower-level language would
| be required (C++, Go, etc.), but probably <5% of them.
| mikercampbell wrote:
| Have you seen Crul??
|
| I love the JS flow, but I thought crul was an interesting newer
| tool!!
|
| But I agree, you gotta get in there and it's easier with JS
| reyostallenberg wrote:
| Can you add a link to it?
| mdaniel wrote:
| I'm sorry to hear that your searches for that very specific
| name didn't provide the information you were looking for
|
| its Show HN: https://news.ycombinator.com/item?id=34970917
|
| tfl: https://www.crul.com/
| sunshadow wrote:
| Crul looks nice, though you cannot imagine how many startups
| I've seen fail doing something very similar to Crul. I
| wouldn't rely on it. The problem is complex: humans
| generating messy pages.
| hipadev23 wrote:
| How does that help you mitigate when a site changes? If you're
| fetching some value in a given <div> under a long XPath and
| they decide to change that path?
| sunshadow wrote:
| You don't use XPath/CSS selectors at all (except when you have
| no choice). You rely on more generic things, e.g. "the button
| that has 'Sign in' on it": await
| page.getByRole('button', { name: 'Sign in' }).click();
|
| See playwright locators: https://playwright.dev/docs/locators
| 8n4vidtmkvmk wrote:
| I started putting data-testid attributes in my web app for
| automated testing using playwright. Prevents me from
| breaking my own script but it sure would make me more
| scrapable if anyone cared. Well.. I guess I only do it on
| inputs, not the rendered page which is what scrapers care
| most about.
| sunshadow wrote:
| Unless you start a war against scrapers, you don't need
| to worry about that, as I'll always find a way to scrape
| your site as long as it's valuable to 'me'. Even if it
| requires a real browser + OCR :)
| erhaetherth wrote:
| Oh I know I couldn't prevent it. But if you wanted to
| scrape me, you'd have to pay the monthly subscription
| because everything is behind a pay wall/login. And then
| you'd only have access to data you entered because it's
| just that kind of app :-)
| latchkey wrote:
| This is where you just train an LLM so you can write:
|
| 'get button named "sign in" and click'
|
| Then on the back end, it generates your example code.
| nurettin wrote:
| Don't know about the poster, but I try to find divs and
| buttons in a fuzzy way. Usually via element text. Sometimes
| it mitigates changes, sometimes it doesn't. It's a guessing
| game. Especially when they start using shadow elements or
| iframes in the page. If I'm looking for something specific
| like a price or dimensions, I can sometimes get away with it
| by collecting dollar amounts or X x Y x Z from the raw text.
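That raw-text fallback can be sketched with plain regular expressions. Here is a Go version (Go being the thread's project language); the patterns are illustrative, not exhaustive:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Dollar amounts like "$1,299.00" or "$45".
	priceRe = regexp.MustCompile(`\$\d[\d,]*(?:\.\d{2})?`)
	// Dimensions like "48 x 24 x 30", with or without decimals.
	dimsRe = regexp.MustCompile(`\d+(?:\.\d+)?\s*x\s*\d+(?:\.\d+)?\s*x\s*\d+(?:\.\d+)?`)
)

// extract pulls prices and W x H x D dimensions out of raw page text,
// ignoring the markup (and any selector churn) entirely.
func extract(text string) (prices, dims []string) {
	return priceRe.FindAllString(text, -1), dimsRe.FindAllString(text, -1)
}

func main() {
	raw := "Deluxe desk, now $1,299.00 (was $1,499). Size: 48 x 24 x 30 in."
	prices, dims := extract(raw)
	fmt.Println(prices) // [$1,299.00 $1,499]
	fmt.Println(dims)   // [48 x 24 x 30]
}
```

As the comment says, this is a guessing game: it survives layout changes but breaks when the site reformats the text itself.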
| slig wrote:
| Thanks for sharing! Just a small nit: the links at the bottom of
| this page are broken [1].
|
| [1]:
| https://github.com/philippta/flyscrape/blob/master/docs/read...
| fyzix wrote:
| What happens if 'find()' returns a list and you call '.text()'?
| Intuition tells me it should fail, but maybe it implicitly gets
| the text from the first item if it exists.
|
| Either way, I think you should create a separate method
| 'find_all()' that returns a list, to make the API easier to
| reason about.
___________________________________________________________________
(page generated 2023-11-11 23:00 UTC)