[HN Gopher] Show HN: Convert HTML DOM to semantic markdown for use in LLMs
___________________________________________________________________
Show HN: Convert HTML DOM to semantic markdown for use in LLMs
Author : leroman
Score : 86 points
Date : 2024-07-23 08:13 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gradientDissent wrote:
| Nice work. Main content extraction based on the <main> tag won't
| work on most web pages these days. Arc90 could help.
| leroman wrote:
| Thank you! This is exactly why there's support for this
| specific use case: https://github.com/romansky/dom-to-semantic-
| markdown/blob/ma... (see `findContentByScoring`)
|
| And if you pass the optional `extractMainContent` flag, it will
| use some heuristics to find the main content container if there
| is no such tag.
| gmaster1440 wrote:
| > Semantic Clarity: Converts web content to a format more easily
| "understandable" for LLMs, enhancing their processing and
| reasoning capabilities.
|
| Are there any data or benchmarks available that show what kind of
| text content LLMs understand best? Is it generally understood at
| this point that they "understand" markdown better than html?
| sigmoid10 wrote:
| They understand best whatever was used during their training.
| For OpenAI's GPTs we don't really know since they don't
| disclose anything anymore, but there are good reasons to assume
| they used markdown or something closely related.
| jddj wrote:
| Just out of curiosity, what are some of those good reasons?
|
| It's clear enough that they can use and consume markdown, but
| is the suggestion here that they've seen more markdown than
| xml?
|
| I'd have guessed, possibly naively, that they fed in more
| straight HTML, but I'd be interested to know why that's
| unlikely to be the case.
| leroman wrote:
| Author here. It's a good point to have some benchmarks (which I
| don't have..) but I think it's well understood that minimizing
| noise by reducing tokens will improve the quality of the
| answer. And I think by now LLMs are well versed in Markdown, as
| it's the preferred markup language used when generating
| responses.
| mistercow wrote:
| I wouldn't be so sure on reducing tokens. Every token in
| context is space for the LLM to do more computation. Noise is
| obviously bad, because the computations will be irrelevant,
| but as long as your HTML is cleaned up, the extra tokens
| _aren't_ noise, but information about semantic structure.
| leroman wrote:
| Markdown, being a very minimal markup language, has no need
| for much of the structural and presentational stuff (CSS,
| structural HTML); HTML has many artifacts which are huge
| bloat and give no semantic value IMO. The goal here is to
| capture any markup with semantic value, so if you have
| examples this library might miss, you are welcome to share
| them and I will look into it!
| mistercow wrote:
| Well, markdown and HTML are encoding the same
| information, but markdown is effectively compressing the
| semantic information. This works well for humans, because
| the renderer (whether markdown or plaintext) decompresses
| it for us. Two line breaks, for example, "decompress"
| from two characters to an entire line of empty space. To
| an LLM, though, it's just a string of tokens.
|
| So consider this extreme case: suppose we take a large
| chunk of plaintext and compress it with something like
| DEFLATE (but in a tokenizer friendly way), so that it
| uses 500 tokens instead of 2000 tokens. For the sake of
| argument, say we've done our best to train an LLM on
| these compressed samples.
|
| Is that going to work well? After all, we've got the same
| information in a quarter as many tokens. I think the
| answer is pretty obviously "no". Not only does the model get
| a fraction of the time and space to process the information,
| but it will also be forced to waste much of that computation
| on decompressing the data.
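|
| To make the thought experiment concrete, here's a rough Node
| sketch of the kind of compression I mean (illustrative only;
| bytes are just a proxy for tokens):
|
|     import { deflateSync } from 'node:zlib';
|
|     const text = 'some long plain text to compress. '.repeat(100);
|     const compressed = deflateSync(Buffer.from(text));
|
|     // Same information, far fewer bytes -- but a model reading
|     // the compressed form must spend computation undoing it.
|     console.log(text.length, '->', compressed.length);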
| michaelmior wrote:
| I think one big difference is that DEFLATE, like most other
| standard compression algorithms, is dictionary-based. So
| compressing in this way, you're really messing with the
| locality of tokens in a way that is likely unrelated to the
| semantics of what you're compressing.
|
| For example, adding a repeated word somewhere in a
| completely different part of the document could change
| the dictionary and the entirety of the compressed text.
| That's not the case with the "compression" offered by
| converting HTML to Markdown. This compression more or
| less preserves locality and potentially removes
| information that is semantically meaningless (e.g. nested
| `div`s used for styling). Of course, this is really just
| conjecture on my part, but I think HTML-to-Markdown conversion
| is likely to work well. It would certainly be interesting to
| have a good benchmark for this.
| mistercow wrote:
| Absolutely. I'm just making a more general point that
| "the same information in fewer tokens" does not mean
| "more comprehensible to an LLM". And we have more
| practical evidence that that's not the case, like the
| recent "Let's Think Dot by Dot" paper, which found that
| you can get many of the benefits of chain-of-thought
| simply by adding filler tokens to your context (if your
| model is trained to deal with filler tokens). For that
| matter, chain-of-thought itself is an example of
| increasing the tokens:information ratio, and generally
| improves LLM performance.
|
| That's not to say that I think that converting to
| markdown is pointless or particularly harmful. Reducing
| tokens is useful for other reasons; it reduces cost,
| makes generation faster, and gives you more room in the
| context window to cram information into. And markdown is
| a nice choice because it's more comprehensible to
| _humans,_ which is a win for debuggability.
|
| I just don't think you can justifiably claim, without
| specific research to back it up, that markdown is more
| comprehensible to LLMs than HTML.
|
| https://arxiv.org/abs/2404.15758
| michaelmior wrote:
| I think it's a _reasonable_ claim. But I would agree that
| it's worthy of more detailed investigation.
| pseudosavant wrote:
| My anecdotal experience is that Markdown usually does work
| better than HTML. I only leave it as HTML if the LLM needs to
| understand more about it than just the content, like the
| attributes on the elements (which would typically just add
| noise and excess input tokens). I've found this to be
| especially true when using AI/LLMs in RAG scenarios.
| mistercow wrote:
| I haven't found any specific research, but I suspect it's
| actually the opposite, particularly for models like Claude,
| which seem to have been specifically trained on XML-like
| structures.
|
| My hunch is that the fact that HTML has explicit matching
| closing tags makes it a bit easier for an LLM to understand
| structure, whereas markdown tends to lean heavily on line
| breaks. That works great when you're viewing the text as a
| two-dimensional field of pixels, but that's not how LLMs see
| the world.
|
| But I think the difference is fairly marginal, and my hunch
| should be taken with a grain of salt. From experience, all I
| can say is that I've seen stripped down HTML work fine, and
| I've seen markdown work fine. The one place where markdown
| clearly shines is that it tends to use fewer tokens.
| mistercow wrote:
| This is cool. When dealing with tables, you might want to explore
| departing from markdown. I've found that LLMs tend to struggle
| with tables that have large numbers of columns containing similar
| data types. Correlating a row is easy enough, because the data is
| all together, but connecting a cell back to its column becomes a
| counting task, which appears to be pretty rough.
|
| A trick I've found seems to work well is leaving some kind of id
| or coordinate marker on each column, and adding that to each
| cell. You could probably do that while still having valid
| markdown if you put the metadata in HTML comments, although it's
| hard to say how an LLM will do at understanding that format.
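|
| For instance, the marker idea might look something like this,
| with the metadata in HTML comments so the table stays valid
| markdown:
|
|     | Name <!--c1--> | Q1 <!--c2--> | Q2 <!--c3--> |
|     |---|---|---|
|     | Widget <!--c1--> | 10 <!--c2--> | 14 <!--c3--> |
|     | Gadget <!--c1--> | 7 <!--c2--> | 9 <!--c3--> |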
| leroman wrote:
| Thanks for sharing, will look into adding this as a flag in the
| options!
| michaelmior wrote:
| SpreadsheetLLM[0] might be worth looking into. It's designed
| for Excel (and similar) spreadsheets, so I'd imagine you could
| do something far simpler for the majority of HTML tables.
|
| [0] https://arxiv.org/abs/2407.09025v1
| explosion-s wrote:
| How is this different from any other HTML-to-markdown library,
| like Showdown or Turndown? Are there any specific features that
| make it better for LLMs specifically, instead of just converting
| HTML to MD?
| leroman wrote:
| Will add some side-by-side comparisons soon! The goal is not
| just to translate HTML to markdown 1:1 but to preserve any
| semantic information, which is generally not the goal of these
| tools. Some specific features and examples are in the README,
| like URL minification and optional main section detection and
| extraction (ignoring footer / header stuff).
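|
| As a rough illustration of the URL minification (the exact
| output format may differ slightly):
|
|     Before: [pricing](https://example.com/pricing?utm_source=x&sid=abc123)
|     After:  [pricing](ref1)
|
| plus a separate ref1 -> original-URL map for translating back.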
| throwthrowuknow wrote:
| Thank you! I'm always looking for new options to use for
| archiving and ingesting web pages and this looks great! Even
| better that it's an npm package!
| jejeyyy77 wrote:
| hah, out of curiosity, what are you archiving and ingesting
| webpages for?
| throwthrowuknow wrote:
| Mostly for integration with my Obsidian vault so I don't have
| to leave the app and can add notes and links and avoid
| linkrot.
| leroman wrote:
| You might find this useful: just added code & instructions on
| how to make it a global CLI utility:
| https://github.com/romansky/dom-to-semantic-markdown/blob/ma...
| Zetaphor wrote:
| A browser demo would be a nice addition to this readme
| leroman wrote:
| Please see here: https://github.com/romansky/dom-to-semantic-
| markdown/blob/ma...
| leroman wrote:
| Ah, I suppose you mean a web page one could visit to see a demo
| :) Added to the backlog!
| nbbaier wrote:
| This is really cool! Any plans to add Deno support? This would be
| a great fit for environments like val.town[0], but they are based
| on a Deno runtime and I don't think this will work out of the
| box.
|
| Also, when trying to run the node example from your readme, I had
| to use `new dom.window.DOMParser()` instead of
| `dom.window.DOMParser`
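|
| For reference, a minimal sketch of the Node setup that worked
| for me (the option name here is from memory and may not match
| the readme exactly):
|
|     import { JSDOM } from 'jsdom';
|     import { convertHtmlToMarkdown } from 'dom-to-semantic-markdown';
|
|     const dom = new JSDOM(html);
|     const markdown = convertHtmlToMarkdown(html, {
|       // instantiating with `new` is what fixed it for me
|       overrideDOMParser: new dom.window.DOMParser(),
|     });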
|
| [0]: https://val.town
| leroman wrote:
| Afraid to say that other than bumping into a talk about Deno, I
| haven't played around with it yet. So thanks for the heads up,
| will look into it.
|
| Thanks for the bug report!
| nbbaier wrote:
| Happy to also take a swing at it, but it would take me a bit
| because I've never added such compatibility to a library
| before.
|
| Any specific guidelines for contributing? I see that you're
| open to contributions.
| leroman wrote:
| By all means, you can be the first contributor :) You are
| welcome to either open an issue and brainstorm possible
| approaches together, or send me a pull request with what you
| came up with and we'll start there.
| DevX101 wrote:
| Problem is, with modern websites, everything is a div and you
| can't necessarily infer semantic meaning from the DOM elements.
| leroman wrote:
| After removing the noise you can distill the semantic stuff
| wherever possible, like metadata from images, buttons, etc.,
| and see structures emerge like footers, nav, and body. And
| many times, for the sake of SEO and accessibility, websites
| do adopt quite a few semantic HTML elements and annotations
| in the respective tags.
| goatlover wrote:
| What happened to using the semantic elements? Did that fall out
| of favor, or did the push for it get abandoned because popular
| frameworks just generate divs with semantic classes
| (hopefully)?
| KolenCh wrote:
| I am curious how it would compare to using pandoc with a
| readability algorithm, for example.
| leroman wrote:
| Bumped this together with the side-by-side comparison task, so
| will look into it :)
| KolenCh wrote:
| Has anyone compared performance between HTML input and other
| formats? I did an informal comparison, and from a few tests it
| seems HTML input is better. I thought markdown input would be
| more efficient too, but I'd like to see a more systematic
| comparison to see whether that is the case.
| la_fayette wrote:
| The scoring approach seems interesting for extracting the main
| content of web pages. I am aware of the decades of research on
| that subject, with sophisticated image- or NLP-based
| approaches. Since this extraction is critical to the quality of
| the LLM response, it would be good to know how well this
| performs. E.g., you could test it against a test dataset
| (https://github.com/scrapinghub/article-extraction-benchmark).
| Also, you could provide the option to plug in another extraction
| algorithm, since there are other implementations available...
| just some ideas for improvement...
| leroman wrote:
| This totally makes sense, I will look into adding support for
| additional ways to detect the main content, super interesting!
| ianbicking wrote:
| This is a great idea! There's an exceedingly large amount of junk
| in a typical HTML page that an LLM can't use in any useful way.
|
| A few thoughts:
|
| 1. URL Refification[sic] would only save tokens if a link is
| referred to many times, right? Otherwise it seems best to keep
| locality of reference. Though to the degree that URLs are opaque
| to the LLM, I suppose they could be turned into references
| without any destination in the source at all, and if the LLM
| refers to a ref link you just look up the real link in the
| mapping.
|
| 2. Several of the suggestions here could be alternate
| serializations of the AST, but it's not clear to me how abstract
| the AST is (especially since it's labelled as htmlToMarkdownAST).
| And now that I look at the source it's kind of abstract but not
| entirely: https://github.com/romansky/dom-to-semantic-
| markdown/blob/ma... - when writing code like this, I find that
| keeping the AST fairly abstract also helps with the
| implementation. (That said, you'll probably still be making
| something that is Markdown-ish because you'll be preserving only
| the data Markdown is able to represent.)
|
| 3. With a more formal AST you could replace the big switch in
| https://github.com/romansky/dom-to-semantic-markdown/blob/ma...
| with a class that can be subclassed to override how particular
| nodes are serialized (see the sketch after this list).
|
| 4. But I can also imagine something where there's a node type
| like "markdown-literal" and to change the serialization someone
| could, say, go through and find all the type:"table" nodes and
| translate them into type:"markdown-literal" and then serialize
| the result.
|
| 5. A more advanced parser might also turn things like headers
| into sections, and introduce more of a tree of nodes (I think the
| AST is flat currently?). I think it's likely that an LLM would
| follow `<header-name-slug>...</header-name-slug>` better than `#
| Header Name\n ....` (at least sometimes, as an option).
|
| 6. Even fancier: run it with some full renderer (not sure what
| the options are these days), and start to use
| getComputedStyle() and heuristics based on bounding boxes and
| stuff like that to infer even more structure.
|
| 7. Another use case that could be useful is to be able to "name"
| pieces of the document so the LLM can refer to them. The result
| doesn't have to be valid Markdown, really, just a unique
| identifier put in the right position. (In a sense this is what
| URL reification can do, but only for URLs?)
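|
| For point 3, a rough sketch of the subclassable serializer I
| have in mind (the node shape is invented for illustration, not
| the library's actual AST types):
|
|     type AstNode = {
|       type: string;
|       content?: string;
|       children?: AstNode[];
|     };
|
|     class MarkdownSerializer {
|       serialize(node: AstNode): string {
|         // Dispatch on node type; default to serializing children.
|         const method = (this as any)[`serialize_${node.type}`];
|         if (method) return method.call(this, node);
|         return (node.children ?? [])
|           .map((c) => this.serialize(c))
|           .join('');
|       }
|       serialize_heading(node: AstNode): string {
|         return `# ${node.content}\n\n`;
|       }
|     }
|
|     // Override only the node types you want rendered differently:
|     class TagHeadingSerializer extends MarkdownSerializer {
|       serialize_heading(node: AstNode): string {
|         return `<heading>${node.content}</heading>\n\n`;
|       }
|     }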
| leroman wrote:
| This is some great feedback, thanks!
|
| 1. There are some crazy links with lots of arguments and
| tracking stuff in them, so they get very long; refification
| turns them into a numbered "ref[n]" scheme, where you also get
| a map of ref[n]->url to do the reverse translation. It really
| saves a lot, in my experience. It's also optional, so you can
| be mindful about when you want to use this feature.
|
| 2. I tried to keep it domain-specific (not to reinvent HTML...)
| so it's mostly Markdown components with some flexibility to add
| HTML elements (img, footer, etc.).
|
| 3. Not sure I'm sold on replacing the switch; it's very useful
| there because of the many fall-through cases. I find it
| maintainable, but if you point me to some specific issue there
| it would help.
|
| 4. There are some built-in functions to traverse and modify the
| AST. It is just JSON at the end of the day, so you could
| leverage the types and write your own logic to parse it; as
| long as it conforms to the format you can always serialize it,
| as you mentioned.
|
| 5. The AST is recursive, so not flat. Sounds like you want to
| either write your own AST->Semantic-Markdown implementation or
| plug into the existing one, so I'll keep this in mind in the
| future.
|
| 6. Sounds cool but out of scope at the moment :)
|
| 7. This feature would serve to help with scraping and kind of
| point the LLM to some element? The part I'm missing is how you
| would code this in advance. There could be some metadata tag
| you could add, and it would be taken through the pipeline and
| added on the other side to the generated elements in some
| way.
___________________________________________________________________
(page generated 2024-07-23 23:09 UTC)