[HN Gopher] ScrapeGraphAI: Web scraping using LLM and direct gra...
___________________________________________________________________
ScrapeGraphAI: Web scraping using LLM and direct graph logic
Author : ulrischa
Score : 73 points
Date : 2024-05-07 19:41 UTC (3 hours ago)
(HTM) web link (scrapegraph-doc.onrender.com)
(TXT) w3m dump (scrapegraph-doc.onrender.com)
| simonw wrote:
| Typo on your homepage: "You just have to implment just some lines
| of code and the work is done"
| sethx wrote:
| At jobstash.xyz we have similar tech as part of our generalized
| scraping infra, and it's been live for half a year performing
| optimally.
| nextworddev wrote:
| Would be nice if the docs had a comparison between traditional
| scraping (e.g. using headless browsers, BeautifulSoup, etc.)
| versus this approach. Exactly how is AI used?
| ushakov wrote:
| There's also llm-scraper in TypeScript
|
| https://github.com/mishushakov/llm-scraper
| lucgagan wrote:
| Something similar I worked on in the past
| https://github.com/lucgagan/auto-playwright/
| worldsayshi wrote:
| Does it use ChatGPT every time you run the test or only when
| a test fails (to check if the selector has changed)?
| nodoodles wrote:
| What I'd love to see is a scraper builder that uses LLMs/'magic' to
| generate optimised scraping rules for any page, i.e. CSS selectors
| and processing rules mapped to output keys, so you can run the
| scraping itself at low cost and high performance.
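A minimal sketch of the idea in the comment above: the rules are produced once (here hard-coded, standing in for whatever an LLM might emit), and every later run applies them without calling a model. Everything named here (`RULES`, `PROCESSORS`, `apply_rules`, `SAMPLE_PAGE`) is hypothetical and not part of ScrapeGraphAI. To stay within the standard library, it swaps CSS selectors for the limited XPath subset in `xml.etree.ElementTree`, which also means it assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: output key -> where to look and how to clean the text.
# In the proposed workflow an LLM would generate these once per site.
RULES = {
    "title": {"path": ".//h1[@class='title']", "process": "strip"},
    "price": {"path": ".//span[@class='price']", "process": "to_float"},
}

# Named processing steps that a rule can refer to.
PROCESSORS = {
    "strip": lambda text: text.strip(),
    "to_float": lambda text: float(text.strip().lstrip("$")),
}


def apply_rules(markup, rules):
    """Extract one record from well-formed markup using stored rules only."""
    root = ET.fromstring(markup)
    record = {}
    for key, rule in rules.items():
        node = root.find(rule["path"])
        if node is None or node.text is None:
            # A missing node is the signal that the rules may need regenerating.
            record[key] = None
            continue
        record[key] = PROCESSORS[rule["process"]](node.text)
    return record


# Illustrative page, invented for this sketch.
SAMPLE_PAGE = """<html><body>
<h1 class="title">  Example Widget </h1>
<span class="price">$19.99</span>
</body></html>"""

if __name__ == "__main__":
    print(apply_rules(SAMPLE_PAGE, RULES))
```

In this arrangement the model cost is paid only when the rules are first generated or when a key starts coming back as `None`, which is one plausible trigger for regenerating them.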
| cpobuda wrote:
| I have been working on this. Feel free to DM me.
| mariopt wrote:
| What is the point of using LLMs for the scraping itself instead
| of using them to generate the boring code for mimicking HTTP
| requests, CSS/XPath selectors, etc.?
|
| I get that it may be interesting for small tasks combined with a
| browser extension, but for real scraping it just seems to be
| overkill and expensive.
| spaniard89277 wrote:
| This is completely unrealistic unless you want to burn money.
| infecto wrote:
| I have not used this specific library, but it's far from
| unrealistic and hardly a money pit. An LLM can fit in nicely
| with scraping libraries. Sure, if you are crawling the web like
| Google, it makes no sense, but if you have a hit list, this can
| be a cost-effective way to avoid spending engineering hours
| maintaining the crawler.
| spaniard89277 wrote:
| Which LLM do you use? Because I can't see a scraper running
| daily without being very expensive.
| mrbungie wrote:
| GPT-3.5/GPT-4 ain't the only LLMs available. Flan-T5/T5
| or Llama2/3 8B models can be fine-tuned for this use case
| and run much more cheaply.
___________________________________________________________________
(page generated 2024-05-07 23:00 UTC)