hngopher.com

       [HN Gopher] Scraperr - A Self Hosted Webscraper
       ___________________________________________________________________
        
       Scraperr - A Self Hosted Webscraper
        
       Author : jpyles
       Score  : 64 points
       Date   : 2025-05-11 18:29 UTC (4 hours ago)
        
 (HTM) web link (github.com)
        
       | iSloth wrote:
       | Interesting, wish it had markdown output like firecrawl for
       | embedding/llm use cases
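
For anyone wondering what "markdown output" could look like in practice, here is a minimal sketch using only the Python standard library. It is not how Scraperr or firecrawl do it; the class name `MarkdownSketch` and the helper `html_to_markdown` are made up for illustration, and only headings, paragraphs, and links are handled.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, and links only."""

    def __init__(self):
        super().__init__()
        self.parts = []   # collected Markdown fragments
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on.
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            # Block-level elements end with a blank line.
            self.parts.append("\n\n")
        elif tag == "a" and self.href is not None:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to Markdown text (hypothetical helper)."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<h1>Title</h1><p>See <a href="https://example.com">the docs</a>.</p>'
markdown = html_to_markdown(sample)
print(markdown)
```

A real pipeline would also need lists, tables, code blocks, and whitespace normalization, which is where a dedicated converter library earns its keep.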
        
       | smartmic wrote:
       | My preferred "self-hosted" webscraper is a local, single binary
       | called xidel [1]. The feature I really like is that it can also
       | follow links.
       | 
       | [1] https://github.com/benibela/xidel
        
         | darkwater wrote:
         | Wow, it's written in Pascal! That surely takes me down memory
         | lane.
        
       | _QrE wrote:
       | Is there a reason for using Selenium over something like
       | Playwright? I haven't had many positive experiences with
       | Selenium, and I found Playwright easier to use and more
       | flexible.
       | 
       | Also, for stuff like this:
       | 
       | `modified_value = original_value.replace("HeadlessChrome",
       | "Chrome")`
       | 
       | There are quite a few ways to figure out that a browser is a bot,
       | and I don't think replacing a few values like this does much. Not
       | asking you to reveal any tricks, just saying that if you're using
       | something like Playwright, you can e.g. run scripts in the
       | browser to adjust your fingerprint more easily.
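
To make the point about the `replace` call concrete, here is a rough sketch of why patching the User-Agent alone may not be enough. The `fingerprint` dictionary, its field values, and the rules in `suspicious_signals` are all assumptions for illustration; real bot detection is far more involved and is not described in this thread.

```python
def patch_user_agent(original_value):
    """Apply the substitution quoted above to a User-Agent string."""
    return original_value.replace("HeadlessChrome", "Chrome")


def suspicious_signals(fingerprint):
    """Return the fields that still look automated (illustrative rules only)."""
    signals = []
    if "HeadlessChrome" in fingerprint["user_agent"]:
        signals.append("user_agent")
    if fingerprint["webdriver"]:
        signals.append("webdriver")
    if not fingerprint["languages"]:
        signals.append("languages")
    return signals


# Hypothetical values a detection script might read from the browser.
fingerprint = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36",
    "webdriver": True,   # stands in for navigator.webdriver
    "languages": [],     # stands in for navigator.languages
}

before = suspicious_signals(fingerprint)
fingerprint["user_agent"] = patch_user_agent(fingerprint["user_agent"])
after = suspicious_signals(fingerprint)

print("before:", before)
print("after: ", after)
```

The string replacement clears only the `user_agent` entry; the other two assumed signals are untouched, which is why adjusting the fingerprint from inside the browser comes up as the alternative.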
        
         | jpyles wrote:
         | I am quite aware, but I actually built most of the scraping
         | logic a long time ago, before I even knew that playwright was a
         | thing.
         | 
         | I am looking to refactor a lot of this, and switching over to
         | playwright is a high priority, using something like camoufox
         | for scraping, instead of just chromium.
         | 
         | Most of my work on this over the past month has been simple
         | additions that are nice-to-haves.
        
           | michaeljx wrote:
           | I was in a similar boat with my scrapers. Started with
           | Selenium 5-6 years ago and only discovered Playwright 2 years
           | ago. Spent a month or so swapping the two, which was well
           | worth it. Cleaner API, async support.
        
             | jpyles wrote:
             | Luckily, I have some experience with playwright, so
             | swapping shouldn't take me too long.
             | 
             | Currently working on a PR to swap over
        
             | nkozyra wrote:
             | Playwright was miles ahead of selenium but what I think is
             | really overlooked is chromedp
        
         | jpyles wrote:
         | With the custom headers, you can actually get a lot of sites
         | with bot protection to load (I have had success even with big
         | sites like YouTube).
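
As a rough illustration of what "custom headers" means at the HTTP level, here is a sketch using Python's standard `urllib.request`. This is not Scraperr's implementation; the header values, `BROWSER_HEADERS`, and `build_request` are assumptions chosen for the example, and the request is only constructed, never sent, so it runs offline.

```python
import urllib.request

# Hypothetical browser-like headers; the values are examples only.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Build (but do not send) a GET request carrying custom headers."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


request = build_request("https://example.com/")

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")

# Sending it would be: urllib.request.urlopen(request)
```

Whether a given site accepts such a request depends on its protection; headers are only one of the signals it may check.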
        
           | dotancohen wrote:
           | How do you work around pop-ups for newsletters and such? Look
           | at the BBC for a good example.
        
         | windexh8er wrote:
         | If you're a fan of Playwright check out Crawlee [0]. I've used
         | it for a few small projects and it's been faster for me to get
         | what I've needed done.
         | 
         | [0] https://crawlee.dev/
        
         | throwaway81523 wrote:
         | Last time I looked, Selenium was able to use Firefox. IDK about
         | Playwright, but Puppeteer was Chrome-only.
        
       | renegat0x0 wrote:
       | Not a web scraper, but web crawler software. It lets you specify
       | the crawling method (Selenium and others) and returns data as
       | JSON (status code, text contents, etc.).
       | 
       | [1] https://github.com/rumca-js/crawler-buddy
        
       | 3abiton wrote:
       | > extract data from websites with precision using XPath
       | > selectors.
       | 
       | I've used XPath for crawling with Selenium, and it used to be my
       | favorite way, but it turned out to be quite unreliable if you
       | don't combine it with other selectors, as certain websites are
       | really badly designed and have no good patterns. So what's the
       | added value over pure Selenium?
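
One way to soften the brittleness described here is a fallback chain: try the most specific selector first and fall back to looser ones. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup, so treat it as an illustration rather than a scraper. The page snippet, `PRICE_SELECTORS`, and `first_match` are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page where the expected id and class names are missing.
PAGE = """
<html><body>
  <div class="product">
    <span class="cost">19.99</span>
  </div>
</body></html>
""".strip()

# Hypothetical selectors, most specific first.
PRICE_SELECTORS = [
    ".//span[@id='price']",
    ".//div[@class='product']/span[@class='price']",
    ".//div[@class='product']/span",
]


def first_match(root, selectors):
    """Return (selector, text) for the first selector that finds an element."""
    for selector in selectors:
        element = root.find(selector)
        if element is not None:
            return selector, element.text
    return None, None


root = ET.fromstring(PAGE)
selector, price = first_match(root, PRICE_SELECTORS)
print(selector, price)
```

Real-world HTML is rarely well-formed XML, so in practice the same pattern would sit on top of an HTML-aware parser or the browser driver's own element lookup.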
        
       | leelou2 wrote:
       | It's great to see more self-hosted scraping solutions like
       | Scraperr emerging, especially with a focus on flexibility and
       | open source. The discussion here highlights how the ecosystem is
       | evolving--tools like Playwright and chromedp are making browser
       | automation more accessible, and the ability to customize headers
       | or output formats is becoming essential for modern scraping
       | tasks. I hope Scraperr continues to add features like markdown
       | output and easier handling of pop-ups, as these are increasingly
       | important for LLM and data pipeline use cases. Kudos to the
       | developer for being open to community feedback and rapid
       | iteration!
        
         | evertedsphere wrote:
         | would appreciate if you could post the prompt as well so the
         | rest of us could learn how to generate our hn comments too
        
       ___________________________________________________________________
       (page generated 2025-05-11 23:00 UTC)