[HN Gopher] Scraperr - A Self Hosted Webscraper
___________________________________________________________________
Scraperr - A Self Hosted Webscraper
Author : jpyles
Score : 64 points
Date : 2025-05-11 18:29 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| iSloth wrote:
| Interesting, wish it had markdown output like firecrawl for
| embedding/llm use cases
| smartmic wrote:
| My preferred "self-hosted" webscraper is a local, single binary
| called xidel [1]. The feature I really like is that it can also
| follow links.
|
| [1] https://github.com/benibela/xidel
| darkwater wrote:
| Wow, it's written in Pascal! That surely takes me down memory
| lane.
| _QrE wrote:
| Is there a reason for using Selenium over something like
| Playwright? I haven't had very many positive experiences with
| Selenium, and I found Playwright easier to use and more
| flexible.
|
| Also, for stuff like this:
|
| `modified_value = original_value.replace("HeadlessChrome",
| "Chrome")`
|
| There are quite a few ways to figure out that a browser is a bot,
| and I don't think replacing a few values like this does much. Not
| asking you to reveal any tricks, just saying that if you're using
| something like Playwright, you can e.g. run scripts in the
| browser to adjust your fingerprint more easily.
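A minimal sketch of the string replacement quoted above, wrapped in a function so its effect is easy to inspect. The name `mask_headless` and the sample User-Agent string are assumptions for illustration, not code from Scraperr. It only rewrites the User-Agent text; other signals a site may check, such as `navigator.webdriver`, are untouched, which is the commenter's point.

```python
def mask_headless(user_agent: str) -> str:
    """Return the User-Agent with the 'HeadlessChrome' token rewritten to 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Hypothetical User-Agent of a headless Chrome session (illustrative value).
original_value = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/124.0.0.0 Safari/537.36"
)

modified_value = mask_headless(original_value)
print(modified_value)
```

Strings without the token pass through unchanged, so the function is safe to apply to any User-Agent.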
| jpyles wrote:
| I am quite aware, but I actually built most of the scraping
| logic a long time ago, before I even knew that playwright was a
| thing.
|
| I am looking to refactor a lot of this, and switching over to
| playwright is a high priority, using something like camoufox
| for scraping, instead of just chromium.
|
| Most of my work on this over the past month has been simple
| nice-to-have additions.
| michaeljx wrote:
| I was in a similar boat with my scrapers. Started with
| Selenium 5-6 years ago and only discovered Playwright 2 years
| ago. Spent a month or so swapping the two, which was well
| worth it. Cleaner API, async support.
| jpyles wrote:
| Luckily, I have some experience with playwright, so
| swapping shouldn't take me too long.
|
| Currently working on a PR to swap over
| nkozyra wrote:
| Playwright was miles ahead of selenium but what I think is
| really overlooked is chromedp
| jpyles wrote:
| With custom headers, you can actually get a lot of sites with bot
| protection to load (even big sites like YouTube, where I have had
| success).
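A rough sketch of sending custom headers with a request, using only Python's standard library. The header values, `build_request`, and `fetch` are assumptions chosen for illustration; this is not how Scraperr itself is implemented, and whether any given site accepts such a request is not guaranteed.

```python
import urllib.request

# Hypothetical header values; adjust to mimic the browser you want to resemble.
CUSTOM_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=CUSTOM_HEADERS):
    """Build a Request object carrying the custom headers (no network access)."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch(url, timeout=10.0):
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


# Building the request does not touch the network; call fetch() to send it.
req = build_request("https://example.com/")
print(req.header_items())
```

Keeping request construction separate from sending makes it possible to inspect the headers before anything goes over the wire.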
| dotancohen wrote:
| How do you work around pop-ups for newsletters and such? Look
| at the BBC for a good example.
| windexh8er wrote:
| If you're a fan of Playwright check out Crawlee [0]. I've used
| it for a few small projects and it's been faster for me to get
| what I've needed done.
|
| [0] https://crawlee.dev/
| throwaway81523 wrote:
| Last time I looked, Selenium was able to use Firefox. IDK about
| Playwright, but Puppeteer was Chrome-only.
| renegat0x0 wrote:
| Not a web scraper, but web crawler software. It lets you specify
| the crawling method (Selenium and others) and returns data in JSON
| (status code, text contents, etc.).
|
| [1] https://github.com/rumca-js/crawler-buddy
| 3abiton wrote:
| > extract data from websites with precision using XPath
| selectors.
|
| I've used XPath for crawling with Selenium, and it used to be my
| favorite way, but it turned out to be quite unreliable if you don't
| combine it with other selectors, as certain websites are really
| badly designed and have no good patterns. So what's the added
| value over pure Selenium?
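One way to read "combine it with other selectors" is to try several selectors in order and take the first that matches. The sketch below is an illustration only: it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than Selenium, it requires well-formed markup, and `first_match`, `PRICE_XPATHS`, and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET


def first_match(root, xpaths):
    """Return the stripped text of the first element matched by any XPath, else None."""
    for xp in xpaths:
        el = root.find(xp)
        if el is not None and el.text and el.text.strip():
            return el.text.strip()
    return None


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <div class="product"><span class="amount">19.99</span></div>
</body></html>"""

# Ordered from most specific to most general; later entries are fallbacks.
PRICE_XPATHS = [
    ".//span[@id='price']",
    ".//div[@class='product']/span[@class='amount']",
    ".//span[@class='amount']",
]

root = ET.fromstring(PAGE)
price = first_match(root, PRICE_XPATHS)
print(price)
```

Here the first selector finds nothing and the second one supplies the value, so a layout change that breaks one pattern does not break the whole extraction.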
| leelou2 wrote:
| It's great to see more self-hosted scraping solutions like
| Scraperr emerging, especially with a focus on flexibility and
| open source. The discussion here highlights how the ecosystem is
| evolving--tools like Playwright and chromedp are making browser
| automation more accessible, and the ability to customize headers
| or output formats is becoming essential for modern scraping
| tasks. I hope Scraperr continues to add features like markdown
| output and easier handling of pop-ups, as these are increasingly
| important for LLM and data pipeline use cases. Kudos to the
| developer for being open to community feedback and rapid
| iteration!
| evertedsphere wrote:
| would appreciate if you could post the prompt as well so the
| rest of us could learn how to generate our hn comments too
___________________________________________________________________
(page generated 2025-05-11 23:00 UTC)