[HN Gopher] Web scraping with GPT-4o: powerful but expensive
___________________________________________________________________
Web scraping with GPT-4o: powerful but expensive
Author : edublancas
Score : 30 points
Date : 2024-09-02 19:50 UTC (3 hours ago)
(HTM) web link (blancas.io)
(TXT) w3m dump (blancas.io)
| luigi23 wrote:
| Why are scrapers so popular nowadays?
| rietta wrote:
| Because publishers don't push structured data or APIs enough to
| satisfy demand for the data.
| luigi23 wrote:
| Got it, but why is it booming now and often it's a showcase
| of llm model? Is there some secret market/ usecase for it?
| drusepth wrote:
| Scrapers have always been notoriously brittle and prone to
| breaking completely when pages make even the smallest of
| structural changes.
|
| Scraping with LLMs bypasses that pitfall because it's more
| of a summarization task on the whole document, rather than
| working specifically on a hard-coded document structure to
| extract specific data.
| bobajeff wrote:
| Personally I find it's better for archiving as most sites
| that don't provide a convenient way to save their content
| directly. Occasionally, I do it just to make a better
| interface over the data.
| IanCal wrote:
| Building scrapers sucks.
|
| It's generally not hard because it's conceptually very
| difficult, or that it requires extremely high level
| reasoning.
|
| It sucks because when someone changes "<section
| class='bio'>" to "<div class='section bio'>" your scraper
| breaks. I just want the bio and it's obvious what to grab,
| but machines have no nuance.
|
| LLMs have enough common sense to be able to deal with these
| things and they take almost no time to work with. I can
| throw html at something, with a vague description and pull
| out structured data with no engineer required, _and_ it 'll
| probably work when the page changes.
|
| There's a huge number of one-off jobs people will do where
| perfect isn't the goal, and a fast solution + a bit of
| cleanup is hugely beneficial.
| drusepth wrote:
| I'd say scrapers have always been popular, but I imagine
| they're even more popular nowadays with all the tools (AI but
| also non-AI) readily available to do cool stuff on a lot of
| data.
| bongodongobob wrote:
| Bingo. During the pandemic, I started a project to keep
| myself busy by trying to scrape stock market ticker data and
| then do some analysis and make some pretty graphs out of it.
| I know there are paid services for this, but I wanted to pull
| it from various websites for free. It took me a couple months
| to get it right. There are so many corner cases to deal with
| if the pages aren't exactly the same each time you load them.
| Now with the help of AI, you can slap together a scraping
| program in a couple of hours.
| MaxPock wrote:
| Was it profitable?
| kimoz wrote:
| Is it possible to achieve good results using the open source
| models for scrapping?
| ammario wrote:
| To scale such an approach you could have the LLM generate JS to
| walk the DOM and extract content, caching the JS for each page.
| kcorbitt wrote:
| Funnily enough, web scraping was actually the motivating use-case
| that started my co-founder and I building what is now
| openpipe.ai. GPT-4 is really good at it, but extremely expensive.
| But it's actually pretty easy to distill its skill at scraping a
| specific class of site down to a fine-tuned model that's way
| cheaper and also really good at scraping that class of site
| reliably.
| tom1337 wrote:
| OpenAI recently announced a Batch API [1] which allows you to
| prepare all prompts and then run them as a batch. This reduces
| costs as its just 50% the price. Used it a lot with GPT-4o mini
| in the past and was able to prompt 3000 Items in less than 5min.
| Could be great for non-realtime applications.
|
| [1] https://platform.openai.com/docs/guides/batch
| wslh wrote:
| Isn't ollama an answer to this? Or is there something inherent to
| OpenAI that makes it significantly better for web scraping?
___________________________________________________________________
(page generated 2024-09-02 23:00 UTC)