[HN Gopher] Minifying HTML for GPT-4o: Remove all the HTML tags
___________________________________________________________________
Minifying HTML for GPT-4o: Remove all the HTML tags
Author : edublancas
Score : 129 points
Date : 2024-09-05 13:51 UTC (2 days ago)
(HTM) web link (blancas.io)
(TXT) w3m dump (blancas.io)
| giancarlostoro wrote:
| I wonder if this is due to some template engines looking
| minimalist like that. I think maybe Pug?
|
| https://github.com/pugjs/pug?tab=readme-ov-file#syntax
|
| It is whitespace-sensitive, but it essentially looks like that.
| I doubt Pug is the only template engine with this kind of
| syntax, though.
| cj wrote:
| Related article from 4 days ago (with comments on scraping,
| specifically discussing removing HTML tags)
|
| https://news.ycombinator.com/item?id=41428274
|
| Edit: looks like it's actually the same author
| throwup238 wrote:
| I don't think that Mercury Prize table is a representative
| example because each column has an obviously unique structure
| that the LLM can key in on: (year) (Single Artist/Album pair)
| (List of Artist/Album pairs) (image) (citation link)
|
| I think a much better test would be something like "List of
| elements by atomic properties" [1] that has a lot of adjacent
| numbers in a similar range and overlapping first/last column
| types. However, the danger with that table is that the LLM might
| be able to infer the values just from the element names, since
| they're well-known physical constants. The table of countries by
| population density might be less predictable [2], or the list of
| largest cities [3].
|
| The test should be repeated with every available sorting function
| too, to see if that causes any new errors.
|
| [1]
| https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...
|
| [2]
| https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
|
| [3] https://en.wikipedia.org/wiki/List_of_largest_cities#List
| edublancas wrote:
| thanks a lot for the feedback! you're right, this is much
| better input data. I'll re-run the code with these tables!
| andybak wrote:
| Also - is there a chance GPT is relying on its training data
| for some questions? i.e. you don't even need to give it the
| table.
|
| To be sure - shouldn't you be asking questions based on data
| that is guaranteed not to be in its training?
| cal85 wrote:
| Good points. But I feel like even with the cities article it
| could still 'cheat' by recognising what the data is supposed to
| be and filling in the blanks. Does it even need to be real
| though? What about generating a fake article to use as a test
| so it can't possibly recognise the contents? You could even get
| GPT to generate it, just give it the 'Largest cities' HTML and
| tell it to output identical HTML but with all the names and
| statistics changed randomly.
| wizzwizz4 wrote:
| > _You could even get GPT to generate it_
|
| This isn't a good idea, if you want a fair test. See
| https://gwern.net/doc/reinforcement-learning/safe/2023-krako...,
| specifically https://arxiv.org/abs/1712.02950.
| curl-up wrote:
| Additionally, using any Wiki page is misleading, as LLMs have
| seen their format many times during training, and can probably
| reproduce the original HTML from the stripped version fairly
| well.
|
| Instead, using some random, messy, scattered-with-spam site
| would be a much more realistic test environment.
| furyofantares wrote:
| Also it can get partial credit on some of these questions
| without feeding in any data at all.
| zaptrem wrote:
| LLMs are trained on Wikipedia (and, since it's high quality
| open license data, probably repeatedly), so this test is
| contaminated.
| topaz0 wrote:
| Is .8 or .9 considered good enough accuracy for something as
| simple as this?
| edublancas wrote:
| I'd say what counts as good enough depends heavily on your use
| case. For something that still has to be reviewed by a human, I
| think even .7 is great; if you're planning to automate processes
| end-to-end, I'd aim for higher than .95.
| LunaSea wrote:
| Well, when "simply" extracting the core text of an article is a
| task where most solutions (rule-based, visual, traditional
| classifiers and LLMs) rarely score above 0.8 in precision on
| datasets with a variety of websites and / or multilingual
| pages, I would consider that not too bad.
| moralestapia wrote:
| Yes, because the prompt is simple as well.
|
| Chain of thought or some similar strategies (I hate that they
| have their own name and like a paper and authors, lol) can help
| you push that 0.9 to a 0.95-0.99.
| yawnxyz wrote:
| I found that reducing html down to markdown using turndown or
| https://github.com/romansky/dom-to-semantic-markdown works well;
|
| if you want the AI to be able to select stuff, give it cheerio or
| jQuery access to navigate through the html document;
|
| if you need to give tags, classes, and ids to the llm, I use an
| html-to-pug converter like https://www.npmjs.com/package/html2pug
| which strips a lot of text and cuts costs. I don't think LLMs are
| particularly trained on pug content though so take this with a
| grain of salt
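|
| A minimal TypeScript sketch of the markdown route, assuming the
| `turndown` npm package (the options and removed tags below are
| one reasonable choice, not the library's defaults):
|
|     import TurndownService from "turndown";
|
|     const turndown = new TurndownService({
|       headingStyle: "atx",      // "# Heading" instead of underlined headings
|       codeBlockStyle: "fenced", // fenced code blocks instead of indented ones
|     });
|
|     // Drop non-content elements entirely instead of converting them.
|     turndown.remove(["script", "style", "nav", "footer"]);
|
|     export function htmlToMarkdown(html: string): string {
|       return turndown.turndown(html);
|     }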
| rcarmo wrote:
| Hmmm. That's interesting. I wish there were a Node-RED node for
| the first library (I can always import the library directly and
| build my own subflow, but since I have cheerio for Node-RED and
| use it for paring down input to LLMs already...)
| andybak wrote:
| But OP did an (admittedly flawed) test. Have you got anything to
| back up your claim here? We've all got our own hunches but this
| post was an attempt to test those hypotheses.
| ravedave5 wrote:
| ChatGPT is clearly trained on wikipedia, is there any concern
| about its knowledge from there polluting the responses? Seems
| like it would be better to try against data it didn't potentially
| already know.
| tedsanders wrote:
| Yep - one good option is to use Wikipedia pages from the recent
| Olympics, which GPT has no knowledge of:
| https://github.com/openai/openai-cookbook/blob/457f4310700f9...
| IncreasePosts wrote:
| Isn't GPT-4o multimodal? Shouldn't I be able to just feed in an
| image of the rendered HTML, instead of doing work to strip tags
| out?
| spencerchubb wrote:
| it is theoretically possible, but the results and bandwidth
| would be worse. sending an image that large would take a lot
| longer than sending text
| brookst wrote:
| This. Images passed to LLMs are typically downsampled to
| something like 512x512 because that's perfectly good for
| feature extraction. Getting text would mean very large images
| so the text is still readable.
| tedsanders wrote:
| Images are much less reliable than text, unfortunately.
| CharlieDigital wrote:
| I roughly came to the same conclusion a few months back and wrote
| a simple, containerized, open source general purpose scraper for
| use with GPT using Playwright in C# and TypeScript that's fairly
| easy to deploy and use with GPT function calling[0]. My
| observation was that `document.body.innerText` was sufficient
| for GPT to "understand" the page, and that it preserves some
| whitespace in Firefox (and I think Chrome).
|
| I use more or less this code as a starting point for a variety of
| use cases and it seems to work just fine for my use cases
| (scraping and processing travel blogs which tend to have pretty
| consistent layouts/structures).
|
| Some variations can make this better by adding logic to look for
| the `main` content and ignore `nav` and `footer` (or variants
| thereof whether using semantic tags or CSS selectors) and taking
| only the `innerText` from the main container.
|
| [0] https://github.com/CharlieDigital/playwright-scrape-api
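|
| A rough TypeScript sketch of that approach (not the code from
| [0]; assumes Playwright, and prefers a `main` element over
| `body` when one exists):
|
|     import { chromium } from "playwright";
|
|     async function scrapeInnerText(url: string): Promise<string> {
|       const browser = await chromium.launch();
|       try {
|         const page = await browser.newPage();
|         await page.goto(url, { waitUntil: "networkidle" });
|         // innerText reflects rendered text and keeps some whitespace.
|         return await page.evaluate(() => {
|           const root = document.querySelector("main") ?? document.body;
|           return (root as HTMLElement).innerText;
|         });
|       } finally {
|         await browser.close();
|       }
|     }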
| beepbooptheory wrote:
| You step back and realize: we are thinking about how to best
| remove _some_ symbols from documents that not a moment ago we
| were deciding certainly needed to be in there, all to feed a
| certain kind of symbol machine which has seen all the symbols
| before anyway, all so we don't pay as many cents for the symbols
| we know or think we need.
|
| If I were not a human but some other kind of being suspended
| above this situation, with no skin in the game so to speak, it
| would all seem so terribly inefficient... But as a fleshy mortal
| I do understand how we got here.
| Shadowmist wrote:
| It's all one big piece of tape.
| cpursley wrote:
| What I do is convert to markdown; that way you still get some
| semantic structure. Even built an Elixir library for this:
| https://github.com/agoodway/html2markdown
| bearjaws wrote:
| Seems to be the most common method I've seen; it makes sense
| given how well LLMs understand markdown.
| wis wrote:
| Why do LLMs understand markdown really well? (besides the
| simple, terse and readable syntax of markdown)
|
| They say "LLMs are trained on the web", are the web pages
| converted from HTML into markdown before being fed into
| training?
| nprateem wrote:
| I think it says in the Anthropic docs they use markdown
| internally (I assume that means the models were trained on it
| to a significant extent).
| cpursley wrote:
| I think Anthropic actually uses xml and OpenAI markdown.
| audessuscest wrote:
| I did that with JSON too, and got better results.
| simonw wrote:
| I built a CLI tool (and Python library) for this a while ago
| called strip-tags: https://github.com/simonw/strip-tags
|
| By default it will strip all HTML tags and return just the text:
| curl 'https://simonwillison.net/' | strip-tags
|
| But you can also tell it you just want to get back the area of a
| page identified by one or more CSS selectors:
| curl 'https://simonwillison.net/' | strip-tags .quote
|
| Or you can ask it to keep specific tags if you think those might
| help provide extra context to the LLM:
| curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
|
| Add "-m" to minify the output (basically stripping most
| whitespace)
|
| Running this command:
| curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
|
| Gives me back output that starts like this:
|
|     <div class="quote segment"> <blockquote>history | tail -n
|     2000 | llm -s "Write aliases for my zshrc based on my
|     terminal history. Only do this for most common features.
|     Don't use any specific files or directories."</blockquote>
|     -- anjor # 3:01 pm / ai, generative-ai, llms, llm </div>
|     <div class="quote segment"> <blockquote>Art is notoriously
|     hard to define, and so are the differences between good art
|     and bad art. But let me offer a generalization: art is
|     something that results from making a lot of choices. [...]
|     to oversimplify, we can imagine that a ten-thousand-word
|     short story requires something on the order of ten thousand
|     choices. When you give a generative-A.I. program a prompt,
|     you are making very few choices; if you supply a
|     hundred-word prompt, you have made on the order of a hundred
|     choices. If an A.I. generates a ten-thousand-word story
|     based on your prompt, it has to fill in for all of the
|     choices that you are not making.</blockquote> -- Ted Chiang
|     # 10:09 pm / art, new-yorker, ai, generative-ai, ted-chiang
|     </div>
|
| I also often use the https://r.jina.ai/ proxy - add a URL to that
| and it extracts the key content (using Puppeteer) and returns it
| converted to Markdown, e.g.
| https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato...
| sergiotapia wrote:
| In Elixir, I select the `<body>`, then remove all script and
| style tags. Then extract the text.
|
| This results in a kind of innerText you get in browsers, great
| and light to pass into LLMs.
|
|     defp extract_inner_text(html) do
|       html
|       |> Floki.parse_document!()
|       |> Floki.find("body")
|       |> Floki.traverse_and_update(fn
|         {tag, _attrs, _children} when tag in ["script", "style"] -> nil
|         node -> node
|       end)
|       |> Floki.text(sep: " ")
|       |> String.trim()
|       |> String.replace(~r/\s+/, " ")
|     end
| bufferout wrote:
| An example of where this approach is problematic: many
| ecommerce product pages feature embedded json that is used to
| dynamically update sections of the page.
| longnguyen wrote:
| I've been building an AI chat client and I use this exact
| technique to develop the "Web Browsing" plugin. Basically I use
| Function Calling to extract content from a web page and then pass
| it to the LLM.
|
| There are a few optimizations we can make:
|
| - strip all content in <script/> and <style/>
| - use Readability.js for articles
| - extract structured content from oEmbed
|
| It works surprisingly well for me, even with gpt-4o-mini
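|
| A sketch of the first two optimizations in TypeScript (not the
| plugin's actual code; assumes the `jsdom` and
| `@mozilla/readability` packages):
|
|     import { JSDOM } from "jsdom";
|     import { Readability } from "@mozilla/readability";
|
|     function extractArticleText(html: string, url: string): string {
|       const doc = new JSDOM(html, { url }).window.document;
|
|       // Strip all content in <script/> and <style/>.
|       doc.querySelectorAll("script, style").forEach((el) => el.remove());
|
|       // Readability.js works well for article-like pages.
|       const article = new Readability(doc).parse();
|
|       // Fall back to the raw body text if no article is detected.
|       return article?.textContent ?? doc.body.textContent ?? "";
|     }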
| coddle-hark wrote:
| Anecdotally, the same seems to apply to the output format as
| well. I've seen much better performance when instructing the
| model to output something like this:
|
|     name=john,age=23
|     name=anna,age=26
|
| Rather than this:
|
|     { matches: [
|         { name: "john", age: 23 },
|         { name: "anna", age: 26 }
|     ] }
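|
| For completeness, a tiny TypeScript parser for that compact
| format (a hypothetical helper, just to show how little structure
| the output needs):
|
|     // Turns lines like "name=john,age=23" into plain objects.
|     function parseCompactRows(output: string): Record<string, string>[] {
|       return output
|         .trim()
|         .split("\n")
|         .filter((line) => line.includes("="))
|         .map((line) =>
|           Object.fromEntries(
|             line.split(",").map((pair) => {
|               const [key, ...rest] = pair.split("=");
|               return [key.trim(), rest.join("=").trim()];
|             })
|           )
|         );
|     }
|
|     // parseCompactRows("name=john,age=23\nname=anna,age=26")
|     // => [ { name: "john", age: "23" }, { name: "anna", age: "26" } ]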
| audessuscest wrote:
| markdown works better than json too
| cfcfcf wrote:
| I'm curious. Scraping seems to come up a lot lately. What is
| everyone scraping? And why?
| samrolken wrote:
| Context for LLMs, and use cases uniquely enabled by LLMs,
| mostly I think.
| jstanley wrote:
| With people making LLMs act as agents in the world, the line
| between "scraping" and "ordinary web usage" is becoming very
| blurred.
| bbarnett wrote:
| A simple |htmltotext works well here, I suspect. Why rewrite the
| thing from scratch? It even outputs formatted text if requested.
|
| Certainly good enough for GPT input; it's quite good.
| simplecto wrote:
| One of my projects is a virtual agency of multiple LLMs for a
| variety of back-office services (copywriting, copy-editing,
| social media, job ads, etc).
|
| We ingest your data wherever you point our crawlers and then
| clean it for use in RAGs or chained LLMs.
|
| One library we like a lot is Trafilatura [1]. It does a great job
| of taking the full HTML page and returning the most semantically
| relevant parts.
|
| It works well for LLM work, as well as for generating vector
| embeddings and downstream things.
|
| [1] - https://trafilatura.readthedocs.io/en/latest/
| ukuina wrote:
| +1 for Trafilatura. Simple, no fuss, and rarely breaks.
|
| I use it nearly hourly for my HN summarizer HackYourNews
| (https://hackyournews.com).
| nickpsecurity wrote:
| The paper is good, too, for understanding how it works. The
| author also mentions many related tools in it.
|
| https://aclanthology.org/2021.acl-demo.15/
___________________________________________________________________
(page generated 2024-09-07 23:01 UTC)