[HN Gopher] Minifying HTML for GPT-4o: Remove all the HTML tags
       ___________________________________________________________________
        
       Minifying HTML for GPT-4o: Remove all the HTML tags
        
       Author : edublancas
       Score  : 129 points
       Date   : 2024-09-05 13:51 UTC (2 days ago)
        
 (HTM) web link (blancas.io)
 (TXT) w3m dump (blancas.io)
        
       | giancarlostoro wrote:
       | I wonder if this is due to some template engines looking
       | minimalist like that. I think maybe Pug?
       | 
       | https://github.com/pugjs/pug?tab=readme-ov-file#syntax
       | 
        | It is whitespace-sensitive, but essentially looks like that. I
        | doubt Pug is the only template engine like this though.
        
       | cj wrote:
       | Related article from 4 days ago (with comments on scraping,
       | specifically discussing removing HTML tags)
       | 
       | https://news.ycombinator.com/item?id=41428274
       | 
       | Edit: looks like it's actually the same author
        
       | throwup238 wrote:
       | I don't think that Mercury Prize table is a representative
       | example because each column has an obviously unique structure
       | that the LLM can key in on: (year) (Single Artist/Album pair)
       | (List of Artist/Album pairs) (image) (citation link)
       | 
       | I think a much better test would be something like "List of
       | elements by atomic properties" [1] that has a lot of adjacent
       | numbers in a similar range and overlapping first/last column
        | types. However, the danger with that table is that the values
        | might be easy for the LLM to infer just from the element names,
        | since they're well-known physical constants. The table of
        | countries by population density might be less predictable [2], or
        | the list of largest cities [3].
       | 
       | The test should be repeated with every available sorting function
       | too, to see if that causes any new errors.
       | 
       | [1]
       | https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...
       | 
       | [2]
       | https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
       | 
       | [3] https://en.wikipedia.org/wiki/List_of_largest_cities#List
        
         | edublancas wrote:
         | thanks a lot for the feedback! you're right, this is much
         | better input data. I'll re-run the code with these tables!
        
           | andybak wrote:
            | Also - is there a chance GPT is relying on its training data
           | for some questions? i.e. you don't even need to give it the
           | table.
           | 
            | To be sure - shouldn't you be asking questions based on data
            | that is guaranteed not to be in its training set?
        
         | cal85 wrote:
         | Good points. But I feel like even with the cities article it
         | could still 'cheat' by recognising what the data is supposed to
         | be and filling in the blanks. Does it even need to be real
         | though? What about generating a fake article to use as a test
         | so it can't possibly recognise the contents? You could even get
         | GPT to generate it, just give it the 'Largest cities' HTML and
         | tell it to output identical HTML but with all the names and
         | statistics changed randomly.
        
           | wizzwizz4 wrote:
           | > _You could even get GPT to generate it_
           | 
           | This isn't a good idea, if you want a fair test. See
           | https://gwern.net/doc/reinforcement-
           | learning/safe/2023-krako..., specifically
           | https://arxiv.org/abs/1712.02950.
        
         | curl-up wrote:
         | Additionally, using any Wiki page is misleading, as LLMs have
         | seen their format many times during training, and can probably
         | reproduce the original HTML from the stripped version fairly
         | well.
         | 
         | Instead, using some random, messy, scattered-with-spam site
         | would be a much more realistic test environment.
        
           | furyofantares wrote:
           | Also it can get partial credit on some of these questions
           | without feeding in any data at all.
        
         | zaptrem wrote:
         | LLMs are trained on Wikipedia (and, since it's high quality
         | open license data, probably repeatedly), so this test is
         | contaminated.
        
       | topaz0 wrote:
       | Is .8 or .9 considered good enough accuracy for something as
       | simple as this?
        
         | edublancas wrote:
         | I'd say how much is good enough highly depends on your use
         | case. For something that still has to be reviewed by a human, I
         | think even .7 is great; if you're planning to automate
         | processes end-to-end, I'd aim for higher than .95
        
         | LunaSea wrote:
         | Well, when "simply" extracting the core text of an article is a
         | task where most solutions (rule-based, visual, traditional
         | classifiers and LLMs) rarely score above 0.8 in precision on
         | datasets with a variety of websites and / or multilingual
         | pages, I would consider that not too bad.
        
         | moralestapia wrote:
         | Yes, because the prompt is simple as well.
         | 
         | Chain of thought or some similar strategies (I hate that they
         | have their own name and like a paper and authors, lol) can help
         | you push that 0.9 to a 0.95-0.99.
        
       | yawnxyz wrote:
       | I found that reducing html down to markdown using turndown or
       | https://github.com/romansky/dom-to-semantic-markdown works well;
       | 
       | if you want the AI to be able to select stuff, give it cheerio or
       | jQuery access to navigate through the html document;
       | 
       | if you need to give tags, classes, and ids to the llm, I use an
       | html-to-pug converter like https://www.npmjs.com/package/html2pug
       | which strips a lot of text and cuts costs. I don't think LLMs are
       | particularly trained on pug content though so take this with a
       | grain of salt
        
         | rcarmo wrote:
         | Hmmm. That's interesting. I wish there was a Node-RED node for
         | the first library (I can always import the library directly and
         | build my own subflow, but since I have cheerio for Node-RED and
         | use it for paring down input to LLMs already...)
        
         | andybak wrote:
          | But OP did an (admittedly flawed) test. Have you got anything to
         | back up your claim here? We've all got our own hunches but this
         | post was an attempt to test those hypotheses.
        
       | ravedave5 wrote:
        | ChatGPT is clearly trained on Wikipedia; is there any concern
       | about its knowledge from there polluting the responses? Seems
       | like it would be better to try against data it didn't potentially
       | already know.
        
         | tedsanders wrote:
         | Yep - one good option is to use Wikipedia pages from the recent
         | Olympics, which GPT has no knowledge of:
         | https://github.com/openai/openai-cookbook/blob/457f4310700f9...
        
       | IncreasePosts wrote:
       | Isn't GPT-4o multimodal? Shouldn't I be able to just feed in an
       | image of the rendered HTML, instead of doing work to strip tags
       | out?
        
         | spencerchubb wrote:
         | it is theoretically possible, but the results and bandwidth
         | would be worse. sending an image that large would take a lot
         | longer than sending text
        
           | brookst wrote:
           | This. Images passed to LLMs are typically downsampled to
           | something like 512x512 because that's perfectly good for
           | feature extraction. Getting text would mean very large images
           | so the text is still readable.
        
         | tedsanders wrote:
         | Images are much less reliable than text, unfortunately.
        
       | CharlieDigital wrote:
       | I roughly came to the same conclusion a few months back and wrote
       | a simple, containerized, open source general purpose scraper for
       | use with GPT using Playwright in C# and TypeScript that's fairly
       | easy to deploy and use with GPT function calling[0]. My
       | observation was that using `document.body.innerText` was
       | sufficient for GPT to "understand" the page and
       | `document.body.innerText` preserves some whitespace in Firefox
       | (and I think Chrome).
       | 
       | I use more or less this code as a starting point for a variety of
       | use cases and it seems to work just fine for my use cases
       | (scraping and processing travel blogs which tend to have pretty
       | consistent layouts/structures).
       | 
       | Some variations can make this better by adding logic to look for
       | the `main` content and ignore `nav` and `footer` (or variants
       | thereof whether using semantic tags or CSS selectors) and taking
       | only the `innerText` from the main container.
       | 
       | [0] https://github.com/CharlieDigital/playwright-scrape-api
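The `main`/`nav`/`footer` filtering described above can be sketched with Python's standard-library HTML parser. This is a hypothetical illustration of the idea, not code from the linked repo (which uses Playwright in C#/TypeScript):

```python
from html.parser import HTMLParser

# Tags whose entire subtrees we want to drop before handing text to an LLM.
SKIP_TAGS = {"script", "style", "nav", "footer"}


class TextExtractor(HTMLParser):
    """Collects innerText-style content, ignoring skipped subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a subtree we want to ignore
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside the skipped subtrees.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text("<body><nav>Menu</nav><main><p>Hello world</p></main><footer>(c) 2024</footer></body>")` keeps only the `main` content, `"Hello world"`. A production version would also prefer the `main` container (or CSS-selector equivalents) rather than just dropping known chrome tags.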
        
       | beepbooptheory wrote:
       | You step back and realize: we are thinking about how to best
       | remove _some_ symbols from documents that not a moment ago we
       | were deciding certainly needed to be in there, all to feed a
       | certain kind of symbol machine which has seen all the symbols
        | before anyway, all so we don't pay as many cents for the symbols
       | we know or think we need.
       | 
       | If I was not a human but some other kind of being suspended above
       | this situation, with no skin in the game so to speak, it would
        | all seem so terribly inefficient... But as a fleshy mortal I do
       | understand how we got here.
        
         | Shadowmist wrote:
         | It's all one big piece of tape.
        
       | cpursley wrote:
       | What I do is convert to markdown, that way you still get some
       | semantic structure. Even built an Elixir library for this:
       | https://github.com/agoodway/html2markdown
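As a rough illustration of why Markdown keeps some semantic structure, here is a toy Python stand-in (not the Elixir library linked above) that handles just a few common tags:

```python
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Minimal HTML -> Markdown sketch: headings, paragraphs, list items, bold."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}

    def __init__(self):
        super().__init__()
        self.lines = []
        self.current = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIX:
            self.current = self.PREFIX[tag]
        elif tag in ("b", "strong"):
            self.current += "**"

    def handle_endtag(self, tag):
        if tag in self.PREFIX or tag == "p":
            self.lines.append(self.current.rstrip())
            self.current = ""
        elif tag in ("b", "strong"):
            self.current += "**"

    def handle_data(self, data):
        self.current += data


def html_to_markdown(html: str) -> str:
    conv = MarkdownConverter()
    conv.feed(html)
    return "\n".join(line for line in conv.lines if line)
```

So `html_to_markdown("<h1>Title</h1><p>Some <b>bold</b> text</p>")` yields `"# Title\nSome **bold** text"`: the heading/list markers survive while the tag noise is gone. Real converters handle nesting, links, and tables as well.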
        
         | bearjaws wrote:
          | Seems to be the most common method I've seen; it makes sense
          | given how well LLMs understand markdown.
        
           | wis wrote:
           | Why do LLMs understand markdown really well? (besides the
           | simple, terse and readable syntax of markdown)
           | 
           | They say "LLMs are trained on the web", are the web pages
           | converted from HTML into markdown before being fed into
           | training?
        
             | nprateem wrote:
             | I think it says in the Anthropic docs they use markdown
             | internally (I assume that means were trained on it to a
             | significant extent).
        
               | cpursley wrote:
               | I think Anthropic actually uses xml and OpenAI markdown.
        
         | audessuscest wrote:
          | I did that with json too, and got better results
        
       | simonw wrote:
       | I built a CLI tool (and Python library) for this a while ago
       | called strip-tags: https://github.com/simonw/strip-tags
       | 
        | By default it will strip all HTML tags and return just the text:
        | 
        |     curl 'https://simonwillison.net/' | strip-tags
       | 
        | But you can also tell it you just want to get back the area of a
        | page identified by one or more CSS selectors:
        | 
        |     curl 'https://simonwillison.net/' | strip-tags .quote
       | 
        | Or you can ask it to keep specific tags if you think those might
        | help provide extra context to the LLM:
        | 
        |     curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
       | 
        | Add "-m" to minify the output (basically stripping most
        | whitespace).
        | 
        | Running this command:
        | 
        |     curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
       | 
        | Gives me back output that starts like this:
        | 
        |     <div class="quote segment"> <blockquote>history | tail -n 2000 | llm -s "Write aliases for my zshrc based on my terminal history. Only do this for most common features. Don't use any specific files or directories."</blockquote> -- anjor # 3:01 pm / ai, generative-ai, llms, llm </div>
        |     <div class="quote segment"> <blockquote>Art is notoriously hard to define, and so are the differences between good art and bad art. But let me offer a generalization: art is something that results from making a lot of choices. [...] to oversimplify, we can imagine that a ten-thousand-word short story requires something on the order of ten thousand choices. When you give a generative-A.I. program a prompt, you are making very few choices; if you supply a hundred-word prompt, you have made on the order of a hundred choices. If an A.I. generates a ten-thousand-word story based on your prompt, it has to fill in for all of the choices that you are not making.</blockquote> -- Ted Chiang # 10:09 pm / art, new-yorker, ai, generative-ai, ted-chiang </div>
       | 
       | I also often use the https://r.jina.ai/ proxy - add a URL to that
       | and it extracts the key content (using Puppeteer) and returns it
       | converted to Markdown, e.g.
       | https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
        
       | sergiotapia wrote:
       | In Elixir, I select the `<body>`, then remove all script and
       | style tags. Then extract the text.
       | 
        | This results in a kind of innerText like what you get in
        | browsers, great and light to pass into LLMs.
        | 
        |     defp extract_inner_text(html) do
        |       html
        |       |> Floki.parse_document!()
        |       |> Floki.find("body")
        |       |> Floki.traverse_and_update(fn
        |         {tag, _attrs, _children} when tag in ["script", "style"] ->
        |           nil
        |         node ->
        |           node
        |       end)
        |       |> Floki.text(sep: " ")
        |       |> String.trim()
        |       |> String.replace(~r/\s+/, " ")
        |     end
        
         | bufferout wrote:
         | An example of where this approach is problematic: many
         | ecommerce product pages feature embedded json that is used to
         | dynamically update sections of the page.
        
       | longnguyen wrote:
       | I've been building an AI chat client and I use this exact
       | technique to develop the "Web Browsing" plugin. Basically I use
       | Function Calling to extract content from a web page and then pass
       | it to the LLM.
       | 
       | There are a few optimizations we can make:
       | 
        | - strip all content in <script/> and <style/>
        | - use Readability.js for articles
        | - extract structured content from oEmbed
       | 
       | It works surprisingly well for me, even with gpt-4o-mini
        
       | coddle-hark wrote:
       | Anecdotally, the same seems to apply to the output format as
       | well. I've seen much better performance when instructing the
       | model to output something like this:
        | 
        |     name=john,age=23
        |     name=anna,age=26
        | 
        | Rather than this:
        | 
        |     {
        |       matches: [
        |         { name: "john", age: 23 },
        |         { name: "anna", age: 26 }
        |       ]
        |     }
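A sketch of producing that compact record format for comparison against the JSON wrapper (the `name`/`age` fields are just illustrative):

```python
import json

def to_compact(records):
    """Serialize a list of flat dicts as one key=value,... line per record."""
    return "\n".join(
        ",".join(f"{k}={v}" for k, v in record.items()) for record in records
    )

records = [{"name": "john", "age": 23}, {"name": "anna", "age": 26}]

compact = to_compact(records)   # "name=john,age=23\nname=anna,age=26"
as_json = json.dumps({"matches": records})

# The compact form is noticeably shorter than the JSON wrapper, which
# generally means fewer output tokens for the model to emit.
assert len(compact) < len(as_json)
```

The same trick works on the input side: fewer structural characters per record usually means a cheaper prompt, at the cost of losing nesting.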
        
         | audessuscest wrote:
         | markdown works better than json too
        
       | cfcfcf wrote:
       | I'm curious. Scraping seems to come up a lot lately. What is
       | everyone scraping? And why?
        
         | samrolken wrote:
         | Context for LLMs, and use cases uniquely enabled by LLMs,
         | mostly I think.
        
         | jstanley wrote:
         | With people making LLMs act as agents in the world, the line
         | between "scraping" and "ordinary web usage" is becoming very
         | blurred.
        
       | bbarnett wrote:
       | A simple |htmltotext works well here, I suspect. Why rewrite the
       | thing from scratch? It even outputs formatted text if requested.
       | 
        | It's certainly good enough for GPT input.
        
       | simplecto wrote:
       | One of my projects is a virtual agency of multiple LLMs for a
       | variety of back-office services (copywriting, copy-editing,
       | social media, job ads, etc).
       | 
        | We ingest your data wherever you point our crawlers and then
        | clean it for use in RAG pipelines or chained LLMs.
       | 
       | One library we like a lot is Trafilatura [1]. It does a great job
       | of taking the full HTML page and returning the most semantically
       | relevant parts.
       | 
       | It works well for LLM work as well as generating embeddings for
       | vectors and downstream things.
       | 
       | [1] - https://trafilatura.readthedocs.io/en/latest/
        
         | ukuina wrote:
         | +1 for Trafilatura. Simple, no fuss, and rarely breaks.
         | 
         | I use it nearly hourly for my HN summarizer HackYourNews
         | (https://hackyournews.com).
        
         | nickpsecurity wrote:
         | The paper is good, too, for understanding how it works. The
         | author also mentions many related tools in it.
         | 
         | https://aclanthology.org/2021.acl-demo.15/
        
       ___________________________________________________________________
       (page generated 2024-09-07 23:01 UTC)