[HN Gopher] Web scraping with GPT-4o: powerful but expensive
       ___________________________________________________________________
        
       Web scraping with GPT-4o: powerful but expensive
        
       Author : edublancas
       Score  : 30 points
       Date   : 2024-09-02 19:50 UTC (3 hours ago)
        
 (HTM) web link (blancas.io)
 (TXT) w3m dump (blancas.io)
        
       | luigi23 wrote:
       | Why are scrapers so popular nowadays?
        
         | rietta wrote:
         | Because publishers don't push structured data or APIs enough to
         | satisfy demand for the data.
        
           | luigi23 wrote:
           | Got it, but why is it booming now and often it's a showcase
           | of llm model? Is there some secret market/ usecase for it?
        
             | drusepth wrote:
             | Scrapers have always been notoriously brittle and prone to
             | breaking completely when pages make even the smallest of
             | structural changes.
             | 
             | Scraping with LLMs bypasses that pitfall because it's more
             | of a summarization task on the whole document, rather than
             | working specifically on a hard-coded document structure to
             | extract specific data.
        
             | bobajeff wrote:
             | Personally I find it's better for archiving as most sites
             | that don't provide a convenient way to save their content
             | directly. Occasionally, I do it just to make a better
             | interface over the data.
        
             | IanCal wrote:
             | Building scrapers sucks.
             | 
             | It's generally not hard because it's conceptually very
             | difficult, or that it requires extremely high level
             | reasoning.
             | 
             | It sucks because when someone changes "<section
             | class='bio'>" to "<div class='section bio'>" your scraper
             | breaks. I just want the bio and it's obvious what to grab,
             | but machines have no nuance.
             | 
             | LLMs have enough common sense to be able to deal with these
             | things and they take almost no time to work with. I can
             | throw html at something, with a vague description and pull
             | out structured data with no engineer required, _and_ it 'll
             | probably work when the page changes.
             | 
             | There's a huge number of one-off jobs people will do where
             | perfect isn't the goal, and a fast solution + a bit of
             | cleanup is hugely beneficial.
        
         | drusepth wrote:
         | I'd say scrapers have always been popular, but I imagine
         | they're even more popular nowadays with all the tools (AI but
         | also non-AI) readily available to do cool stuff on a lot of
         | data.
        
           | bongodongobob wrote:
           | Bingo. During the pandemic, I started a project to keep
           | myself busy by trying to scrape stock market ticker data and
           | then do some analysis and make some pretty graphs out of it.
           | I know there are paid services for this, but I wanted to pull
           | it from various websites for free. It took me a couple months
           | to get it right. There are so many corner cases to deal with
           | if the pages aren't exactly the same each time you load them.
           | Now with the help of AI, you can slap together a scraping
           | program in a couple of hours.
        
             | MaxPock wrote:
             | Was it profitable?
        
       | kimoz wrote:
       | Is it possible to achieve good results using the open source
       | models for scrapping?
        
       | ammario wrote:
       | To scale such an approach you could have the LLM generate JS to
       | walk the DOM and extract content, caching the JS for each page.
        
       | kcorbitt wrote:
       | Funnily enough, web scraping was actually the motivating use-case
       | that started my co-founder and I building what is now
       | openpipe.ai. GPT-4 is really good at it, but extremely expensive.
       | But it's actually pretty easy to distill its skill at scraping a
       | specific class of site down to a fine-tuned model that's way
       | cheaper and also really good at scraping that class of site
       | reliably.
        
       | tom1337 wrote:
       | OpenAI recently announced a Batch API [1] which allows you to
       | prepare all prompts and then run them as a batch. This reduces
       | costs as its just 50% the price. Used it a lot with GPT-4o mini
       | in the past and was able to prompt 3000 Items in less than 5min.
       | Could be great for non-realtime applications.
       | 
       | [1] https://platform.openai.com/docs/guides/batch
        
       | wslh wrote:
       | Isn't ollama an answer to this? Or is there something inherent to
       | OpenAI that makes it significantly better for web scraping?
        
       ___________________________________________________________________
       (page generated 2024-09-02 23:00 UTC)