[HN Gopher] ScrapeGraphAI: Web scraping using LLM and direct gra...
       ___________________________________________________________________
        
       ScrapeGraphAI: Web scraping using LLM and direct graph logic
        
       Author : ulrischa
       Score  : 73 points
       Date   : 2024-05-07 19:41 UTC (3 hours ago)
        
 (HTM) web link (scrapegraph-doc.onrender.com)
 (TXT) w3m dump (scrapegraph-doc.onrender.com)
        
       | simonw wrote:
       | Typo on your homepage: "You just have to implment just some lines
       | of code and the work is done"
        
       | sethx wrote:
       | At jobstash.xyz we have similar tech as part of our generalized
       | scraping infra, and it's been live for half a year performing
       | optimally.
        
       | nextworddev wrote:
       | Would be nice if the docs had a comparison between traditional
       | scraping (e.g. using headless browsers, BeautifulSoup, etc.)
       | versus this approach. Exactly how is AI used?
        
       | ushakov wrote:
       | There's also llm-scraper in TypeScript
       | 
       | https://github.com/mishushakov/llm-scraper
        
         | lucgagan wrote:
         | Something similar I worked on in the past
         | https://github.com/lucgagan/auto-playwright/
        
           | worldsayshi wrote:
           | Does it use ChatGPT every time you run the test or only when
           | a test fails (to check if the selector has changed)?
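
The second option in the question above (call the model only when a stored selector stops matching) can be sketched as follows. This is a minimal illustration, not how auto-playwright or any other library mentioned in this thread actually behaves. `make_cached_locator` and `ask_model` are hypothetical names, and a regular expression stands in for a real CSS selector so the sketch runs on the standard library alone.

```python
import re


def make_cached_locator(ask_model):
    """Return a locate(field, html) function that remembers working patterns.

    ask_model(field, html) is a hypothetical callable standing in for an LLM
    request; it must return a regex with one capture group for the value.
    """
    cache = {}  # field name -> pattern that matched on a previous run

    def locate(field, html):
        pattern = cache.get(field)
        if pattern is not None:
            match = re.search(pattern, html)
            if match:
                # Cheap path: the stored pattern still works, no model call.
                return match.group(1)
        # Expensive path: nothing cached, or the page changed and the
        # stored pattern no longer matches, so ask the model again.
        pattern = ask_model(field, html)
        cache[field] = pattern
        match = re.search(pattern, html)
        return match.group(1) if match else None

    return locate
```

With this shape the model is called once per field, plus once each time the page layout changes enough to break the stored pattern, rather than on every run.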
        
       | nodoodles wrote:
       | What I'd love to see is a scraper builder that uses LLMs/'magic'
       | to generate optimised scraping rules for any page, i.e. CSS
       | selectors and processing rules mapped to output keys. So you can
       | run the scraping itself at low cost and high performance.
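
A minimal sketch of the "rules mapped to output keys" idea, assuming the rules were already produced once (by an LLM or by hand) and are then applied without any model call. `RULES`, `RuleExtractor`, and `extract` are hypothetical names for illustration. Each rule is a simplified (tag, class) pair rather than a full CSS selector, and nested elements with the same tag name are not handled.

```python
from html.parser import HTMLParser

# Hypothetical rule set: output key -> (tag, class). In the scenario
# described above this would be generated once per site and then reused.
RULES = {
    "title": ("h1", "product-title"),
    "price": ("span", "price"),
}


class RuleExtractor(HTMLParser):
    """Collect the text of the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.result = {}
        self._active = None  # output key currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        for key, (rule_tag, rule_class) in self.rules.items():
            if key not in self.result and tag == rule_tag and rule_class in classes:
                self._active = key
                self.result[key] = ""
                break

    def handle_data(self, data):
        if self._active is not None:
            self.result[self._active] += data

    def handle_endtag(self, tag):
        # Stop capturing when the matched element's tag closes.
        if self._active is not None and tag == self.rules[self._active][0]:
            self.result[self._active] = self.result[self._active].strip()
            self._active = None


def extract(html, rules=RULES):
    """Apply the rule set to one page and return a dict of output keys."""
    parser = RuleExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.result
```

Running the rules is plain parsing, so the per-page cost does not depend on a model; only generating or repairing the rules would.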
        
         | cpobuda wrote:
         | I have been working on this. Feel free to DM me.
        
       | mariopt wrote:
       | What is the point of using LLMs for the scraping itself instead
       | of using them to generate the boring code for mimicking HTTP
       | requests, CSS/XPath selectors, etc.?
       | 
       | I get that it may be interesting for small tasks combined with a
       | browser extension, but for real scraping it just seems to be
       | overkill and expensive.
        
       | spaniard89277 wrote:
       | This is completely unrealistic unless you want to burn money.
        
         | infecto wrote:
         | I have not used this specific library, but it's far from
         | unrealistic and hardly a money pit. An LLM can fit in nicely
         | with scraping libraries. Sure, if you are crawling the web like
         | Google, it makes no sense, but if you have a hit list, this can
         | be a cost-effective way to avoid spending engineering hours
         | maintaining the crawler.
        
           | spaniard89277 wrote:
           | Which LLM do you use? Because I can't see a scraper running
           | daily without being very expensive.
        
             | mrbungie wrote:
             | GPT-3.5/GPT-4 ain't the only LLMs available. A Flan-T5/T5
             | or Llama2/3 8B model can be fine-tuned for this use case
             | and run much more cheaply.
        
       ___________________________________________________________________
       (page generated 2024-05-07 23:00 UTC)