[HN Gopher] Show HN: Convert HTML DOM to semantic markdown for use in LLMs
       ___________________________________________________________________
        
       Show HN: Convert HTML DOM to semantic markdown for use in LLMs
        
       Author : leroman
       Score  : 86 points
       Date   : 2024-07-23 08:13 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gradientDissent wrote:
        | Nice work. Main content extraction based on the <main> tag
        | won't work with most web pages these days. Arc90 could help.
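        |
        | For reference, the gist of Arc90-style scoring: reward long,
        | punctuation-heavy text blocks, penalize link density, and pick
        | the best-scoring container. A minimal illustrative sketch, not
        | the actual Arc90 code:
        |
        |   // Score a candidate block: reward prose, penalize link-heavy areas.
        |   function scoreElement(el: Element): number {
        |       const text = el.textContent ?? '';
        |       let score = Math.min(Math.floor(text.length / 100), 3); // reward longer text
        |       score += (text.match(/,/g) ?? []).length;               // commas suggest prose
        |       const linkText = [...el.querySelectorAll('a')]
        |           .reduce((n, a) => n + (a.textContent?.length ?? 0), 0);
        |       const linkDensity = text.length ? linkText / text.length : 0;
        |       return score * (1 - linkDensity);                       // demote nav/footer blocks
        |   }
        |
        |   // Pick the highest-scoring candidate container.
        |   function findMainContent(doc: Document): Element | null {
        |       const candidates = [...doc.querySelectorAll('article, section, div')];
        |       return candidates.reduce<Element | null>(
        |           (best, el) => !best || scoreElement(el) > scoreElement(best) ? el : best,
        |           null);
        |   }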
        
         | leroman wrote:
         | Thank you! this is exactly why there's support for this
         | specific use case- https://github.com/romansky/dom-to-semantic-
         | markdown/blob/ma... (see `findContentByScoring`)
         | 
         | And if you pass an optional flag `extractMainContent` it will
         | use some heuristics to find the main content container if there
         | is no such tag..
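          |
          | Something like this in Node (see the README for the exact
          | API; names here are from memory):
          |
          |   import { JSDOM } from 'jsdom';
          |   import { convertHtmlToMarkdown } from 'dom-to-semantic-markdown';
          |
          |   const html = '<div><nav>menu</nav><article><h1>Title</h1>' +
          |       '<p>Body text.</p></article></div>';
          |   const dom = new JSDOM(html);
          |
          |   const markdown = convertHtmlToMarkdown(html, {
          |       extractMainContent: true,                      // heuristic main-content detection
          |       overrideDOMParser: new dom.window.DOMParser(), // needed outside the browser
          |   });
          |   console.log(markdown);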
        
       | gmaster1440 wrote:
       | > Semantic Clarity: Converts web content to a format more easily
       | "understandable" for LLMs, enhancing their processing and
       | reasoning capabilities.
       | 
       | Are there any data or benchmarks available that show what kind of
       | text content LLMs understand best? Is it generally understood at
        | this point that they "understand" markdown better than HTML?
        
         | sigmoid10 wrote:
         | They understand best whatever was used during their training.
         | For OpenAI's GPTs we don't really know since they don't
         | disclose anything anymore, but there are good reasons to assume
         | they used markdown or something closely related.
        
           | jddj wrote:
           | Just out of curiosity, what are some of those good reasons?
           | 
            | It's clear enough that they can use and consume markdown,
            | but is the suggestion here that they've seen more markdown
            | than XML?
            |
            | I'd have guessed, possibly naively, that they fed in more
            | raw HTML, but I'd be interested to know why that's
            | unlikely to be the case.
        
         | leroman wrote:
         | Author here- it's a good point to have some benchmarks (which I
         | don't have..) but I think it's well understood that minimizing
         | noise by reducing tokens will improve the quality of the
         | answer. And I think by now LLMs are well versed in Markdown, as
         | it's the preferred markup language used when generating
         | responses
        
           | mistercow wrote:
            | I wouldn't be so sure about reducing tokens. Every token
            | in context is space for the LLM to do more computation.
            | Noise is obviously bad, because the computations will be
            | irrelevant, but as long as your HTML is cleaned up, the
            | extra tokens _aren't_ noise; they're information about
            | semantic structure.
        
             | leroman wrote:
             | Markdown being a very minimal Markup language has no need
             | for much of the structural and presentational stuff (CSS,
             | structural HTML), HTML has many many artifacts which are a
             | huge bloat and give no semantic value IMO.. It's the goal
             | here to capture any markup with semantic value, if you have
             | examples this library might miss, you are welcome to share
             | and I will look into it!
        
               | mistercow wrote:
               | Well, markdown and HTML are encoding the same
               | information, but markdown is effectively compressing the
               | semantic information. This works well for humans, because
               | the renderer (whether markdown or plaintext) decompresses
               | it for us. Two line breaks, for example, "decompress"
               | from two characters to an entire line of empty space. To
               | an LLM, though, it's just a string of tokens.
               | 
                | So consider this extreme case: suppose we take a
                | large chunk of plaintext and compress it with
                | something like DEFLATE (but in a tokenizer-friendly
                | way), so that it uses 500 tokens instead of 2000
                | tokens. For the sake of argument, say we've done our
                | best to train an LLM on these compressed samples.
                |
                | Is that going to work well? After all, we've got the
                | same information in a quarter as many tokens. I think
                | the answer is pretty obviously "no". Not only are we
                | giving the model a fraction of the time and space to
                | process the information, but the LLM will be forced
                | to waste much of that computation on decompressing
                | the data.
        
               | michaelmior wrote:
                | I think one big difference is that DEFLATE and most
                | other standard compression algorithms are dictionary-
                | based. So compressing in this way, you're really
                | messing with the locality of tokens in a way that is
                | likely unrelated to the semantics of what you're
                | compressing.
               | 
               | For example, adding a repeated word somewhere in a
               | completely different part of the document could change
               | the dictionary and the entirety of the compressed text.
               | That's not the case with the "compression" offered by
               | converting HTML to Markdown. This compression more or
               | less preserves locality and potentially removes
               | information that is semantically meaningless (e.g. nested
                | `div`s used for styling). Of course, this is really
                | just conjecture on my part, but I think HTML-to-
                | Markdown conversion is likely to work well. It would
                | certainly be interesting to have a good benchmark for
                | this.
        
               | mistercow wrote:
               | Absolutely. I'm just making a more general point that
               | "the same information in fewer tokens" does not mean
               | "more comprehensible to an LLM". And we have more
               | practical evidence that that's not the case, like the
               | recent "Let's Think Dot by Dot" paper, which found that
               | you can get many of the benefits of chain-of-thought
               | simply by adding filler tokens to your context (if your
               | model is trained to deal with filler tokens). For that
               | matter, chain-of-thought itself is an example of
               | increasing the tokens:information ratio, and generally
               | improves LLM performance.
               | 
               | That's not to say that I think that converting to
               | markdown is pointless or particularly harmful. Reducing
               | tokens is useful for other reasons; it reduces cost,
               | makes generation faster, and gives you more room in the
               | context window to cram information into. And markdown is
               | a nice choice because it's more comprehensible to
               | _humans,_ which is a win for debuggability.
               | 
               | I just don't think you can justifiably claim, without
               | specific research to back it up, that markdown is more
               | comprehensible to LLMs than HTML.
               | 
               | https://arxiv.org/abs/2404.15758
        
               | michaelmior wrote:
                | I think it's a _reasonable_ claim. But I would agree
                | that it's worthy of more detailed investigation.
        
           | pseudosavant wrote:
           | My anecdotal experience is that Markdown usually does work
           | better than HTML. I only leave it as HTML if the LLM needs to
           | understand more about it than just the content, like the
            | attributes on the elements (which would typically be a lot
            | of noise and excess token input). I've found this to be
            | especially true when using AI/LLMs in RAG scenarios.
        
         | mistercow wrote:
         | I haven't found any specific research, but I suspect it's
         | actually the opposite, particularly for models like Claude,
         | which seem to have been specifically trained on XML-like
         | structures.
         | 
         | My hunch is that the fact that HTML has explicit matching
         | closing tags makes it a bit easier for an LLM to understand
         | structure, whereas markdown tends to lean heavily on line
          | breaks. That works great when you're viewing the text as a
          | two-dimensional field of pixels, but that's not how LLMs see
          | the world.
         | 
         | But I think the difference is fairly marginal, and my hunch
         | should be taken with a grain of salt. From experience, all I
         | can say is that I've seen stripped down HTML work fine, and
         | I've seen markdown work fine. The one place where markdown
         | clearly shines is that it tends to use fewer tokens.
        
       | mistercow wrote:
       | This is cool. When dealing with tables, you might want to explore
       | departing from markdown. I've found that LLMs tend to struggle
       | with tables that have large numbers of columns containing similar
       | data types. Correlating a row is easy enough, because the data is
       | all together, but connecting a cell back to its column becomes a
       | counting task, which appears to be pretty rough.
       | 
       | A trick I've found seems to work well is leaving some kind of id
       | or coordinate marker on each column, and adding that to each
       | cell. You could probably do that while still having valid
       | markdown if you put the metadata in HTML comments, although it's
       | hard to say how an LLM will do at understanding that format.
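        |
        | For example, something like this, which stays valid markdown
        | because the markers are HTML comments (hypothetical ids):
        |
        |   | name <!--c1--> | age <!--c2--> | city <!--c3--> |
        |   | --- | --- | --- |
        |   | Ada <!--c1--> | 36 <!--c2--> | London <!--c3--> |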
        
         | leroman wrote:
         | Thanks for sharing, will look into adding this as a flag in the
         | options!
        
         | michaelmior wrote:
         | SpreadsheetLLM[0] might be worth looking into. It's designed
         | for Excel (and similar) spreadsheets, so I'd imagine you could
         | do something far simpler for the majority of HTML tables.
         | 
         | [0] https://arxiv.org/abs/2407.09025v1
        
       | explosion-s wrote:
       | How is this different than any other HTML to markdown library,
       | like Showdown or Turndown? Is there any specific features that
       | make it better for LLMS specifically instead of just converting
       | HTML to MD?
        
         | leroman wrote:
         | Will add some side-by-side comparisons soon! the goal is not
         | just to translate 1:1 HTML to markdown but to preserve any
         | semantic information, this is generally not the goal for these
         | tools. Some specific features and examples are in the README,
         | like URL minification and optional main section detection and
         | extraction (ignoring footer / header stuff).
        
       | throwthrowuknow wrote:
       | Thank you! I'm always looking for new options to use for
       | archiving and ingesting web pages and this looks great! Even
       | better that it's an npm package!
        
         | jejeyyy77 wrote:
         | hah, out of curiosity, what are you archiving and ingesting
         | webpages for?
        
           | throwthrowuknow wrote:
           | Mostly for integration with my Obsidian vault so I don't have
           | to leave the app and can add notes and links and avoid
           | linkrot.
        
         | leroman wrote:
         | You might find this useful- just added code & instructions on
         | how to make it a global CLI utility-
         | https://github.com/romansky/dom-to-semantic-markdown/blob/ma...
        
       | Zetaphor wrote:
       | A browser demo would be a nice addition to this readme
        
         | leroman wrote:
         | Please see here- https://github.com/romansky/dom-to-semantic-
         | markdown/blob/ma...
        
         | leroman wrote:
         | Ah, I suppose you mean a web page one could visit to see a demo
         | :) Added to the backlog!
        
       | nbbaier wrote:
       | This is really cool! Any plans to add Deno support? This would be
       | a great fit for environments like val.town[0], but they are based
       | on a Deno runtime and I don't think this will work out of the
       | box.
       | 
       | Also, when trying to run the node example from your readme, I had
       | to use `new dom.window.DOMParser()` instead of
       | `dom.window.DOMParser`
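        |
        | i.e., with the readme's jsdom-based example, only the
        | DOMParser line changes (option name as I remember it from the
        | readme):
        |
        |   const dom = new JSDOM(html);
        |   const markdown = convertHtmlToMarkdown(html, {
        |       overrideDOMParser: new dom.window.DOMParser(), // an instance, not the constructor
        |   });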
       | 
       | [0]: https://val.town
        
         | leroman wrote:
         | Afraid to say that other than bumping into a talk about Deno, I
         | haven't played around with it yet.. So thanks for the heads up,
         | will look into it.
         | 
         | Thanks for the bug report !
        
           | nbbaier wrote:
           | Happy to also take a swing at it, but it would take me a bit
           | because I've never added such compatibility to a library
           | before.
           | 
           | Any specific guidelines for contributing? I see that you're
           | open to contributions.
        
             | leroman wrote:
             | By all means, you can be the first contributor :) You are
             | welcome to either open an issue and brain storm together on
             | possible approaches or send me a pull request with what you
             | came up with and we start there
        
       | DevX101 wrote:
       | Problem is, with modern websites, everything is a div and you
       | can't necessarily infer semantic meaning from the DOM elements.
        
         | leroman wrote:
         | After removing the noise you can distill the semantic stuff
         | where ever possible, like meta-deta from images, buttons, etc,
         | and see some structures emerge like footers and nav and body..
         | And many times for the sake of SEO and accessibility, websites
         | do adopt quite a bit of semantic HTML elements and annotations
         | in respective tags..
        
         | goatlover wrote:
         | What happened to using the semantic elements? Did that fall out
         | of favor or the push for it get abandoned because popular
         | frameworks just generate divs with semantic classes
         | (hopefully)?
        
       | KolenCh wrote:
       | I am curious how it would compare to using pandoc with
       | readability algorithm for example.
        
         | leroman wrote:
         | Bumped this together with the side-by-side comparison task.. so
         | will look into it :)
        
       | KolenCh wrote:
       | Does anyone compare the performance between HTML input and other
       | formats? I did an informal comparison and from a few tests it
       | seems the HTML input is better. I thought having markdown input
       | would be more efficient too but I'd like to see more systematic
       | comparison to see it is the case.
        
       | la_fayette wrote:
       | The scoring approach seems interesting to extract the main
       | content of web pages. I am aware of the large body of decades of
       | research on that subject, with sophisticated image or nlp based
       | approaches. Since this extraction is critical to the quality of
       | the LLM response, it would be good to know how well this
       | performs. E.g., you could test it against a test dataset
       | (https://github.com/scrapinghub/article-extraction-benchmark).
       | Also, you could provide the option to plugin another extraction
       | algorithm, since there are other implementations available...
       | just some ideas for improvement...
        
         | leroman wrote:
         | This totally makes sense, I will look into adding support for
         | additional ways to detect the main content, super interesting!
        
       | ianbicking wrote:
       | This is a great idea! There's an exceedingly large amount of junk
       | in a typical HTML page that an LLM can't use in any useful way.
       | 
       | A few thoughts:
       | 
        | 1. URL Refification[sic] would only save tokens if a link is
        | referred to many times, right? Otherwise it seems best to keep
        | locality of reference. Though to the degree that URLs are
        | opaque to the LLM, I suppose they could be turned into
        | references without any destination in the source at all, and
        | if the LLM refers to a ref link you just look up the real link
        | in the mapping.
       | 
        | 2. Several of the suggestions here could be alternate
        | serializations of the AST, but it's not clear to me how
        | abstract the AST is (especially since it's labelled as
        | htmlToMarkdownAST). And now that I look at the source it's
        | kind of abstract but not entirely: https://github.com/romansky
        | /dom-to-semantic-markdown/blob/ma... - when writing code like
        | this I find that keeping the AST fairly abstract also helps
        | with the implementation. (That said, you'll probably still be
        | making something Markdown-ish, because you'll be preserving
        | only the data Markdown is able to represent.)
       | 
        | 3. With a more formal AST you could replace the big switch in
        | https://github.com/romansky/dom-to-semantic-markdown/blob/ma...
        | with a class that can be subclassed to override how particular
        | nodes are serialized (rough sketch at the end of this
        | comment).
       | 
       | 4. But I can also imagine something where there's a node type
       | like "markdown-literal" and to change the serialization someone
       | could, say, go through and find all the type:"table" nodes and
       | translate them into type:"markdown-literal" and then serialize
       | the result.
       | 
       | 5. A more advanced parsing might also turn things like headers
       | into sections, and introduce more of a tree of nodes (I think the
       | AST is flat currently?). I think it's likely that an LLM would
       | follow `<header-name-slug>...</header-name-slug>` better than `#
       | Header Name\n ....` (at least sometimes, as an option).
       | 
        | 6. It would be even fancier to run it with some full renderer
        | (not sure what the options are these days) and use
        | getComputedStyle() and heuristics based on bounding boxes and
        | the like to infer even more structure.
       | 
       | 7. Another use case that could be useful is to be able to "name"
       | pieces of the document so the LLM can refer to them. The result
       | doesn't have to be valid Markdown, really, just a unique
       | identifier put in the right position. (In a sense this is what
       | URL reification can do, but only for URLs?)
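        |
        | To make 3 (and a bit of 5) concrete, a rough sketch with
        | hypothetical node types -- the real AST surely differs:
        |
        |   type AstNode = { type: string; content?: AstNode[] | string; level?: number };
        |
        |   class MarkdownSerializer {
        |       serialize(node: AstNode): string {
        |           switch (node.type) {
        |               case 'heading': return this.heading(node);
        |               default:        return this.children(node);
        |           }
        |       }
        |       protected heading(n: AstNode): string {
        |           return '#'.repeat(n.level ?? 1) + ' ' + this.children(n) + '\n';
        |       }
        |       protected children(n: AstNode): string {
        |           return Array.isArray(n.content)
        |               ? n.content.map(c => this.serialize(c)).join('')
        |               : String(n.content ?? '');
        |       }
        |   }
        |
        |   // A subclass overrides only what it wants to change, e.g. the
        |   // header-name-slug idea from point 5:
        |   class SluggedHeadingSerializer extends MarkdownSerializer {
        |       protected heading(n: AstNode): string {
        |           const slug = this.children(n).toLowerCase().replace(/\s+/g, '-');
        |           return '<' + slug + '>' + this.children(n) + '</' + slug + '>\n';
        |       }
        |   }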
        
         | leroman wrote:
         | This is some great feedback, thanks!
         | 
          | 1. There are some crazy links with lots of arguments and
          | tracking stuff in them, so they get very long; refification
          | turns them into a numbered "ref[n]" scheme, where you also
          | get a ref[n]->url map to do the reverse translation (see the
          | sketch at the end of this comment). It really saves a lot,
          | in my experience. It's also optional, so you can be mindful
          | about when to use this feature.
         | 
          | 2. I tried to keep it domain-specific (not to reinvent
          | HTML...) so it's mostly Markdown components, with some
          | flexibility to add HTML elements (img, footer, etc.).
         | 
          | 3. Not sure I'm sold on replacing the switch; it's very
          | useful there because of the many fall-through cases. I find
          | it maintainable, but if you point me to a specific issue
          | there, that would help.
         | 
          | 4. There are some built-in functions to traverse and modify
          | the AST. It is just JSON at the end of the day, so you could
          | leverage the types and write your own logic to parse it; as
          | long as it conforms to the format you can always serialize
          | it, as you mentioned.
         | 
          | 5. The AST is recursive, so not flat. It sounds like you
          | want to either write your own AST->Semantic-Markdown
          | implementation or plug into the existing one, so I'll keep
          | this in mind for the future.
         | 
         | 6. Sounds cool but out of scope at the moment :)
         | 
          | 7. This feature would serve to help with scraping, kind of
          | pointing the LLM to some element? The part I'm missing is
          | how you would code this in advance. There could be some
          | metadata tag you could add that would be carried through the
          | pipeline and attached on the other side to the generated
          | elements in some way.
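          |
          | To illustrate point 1 (made-up URL; the exact output format
          | may differ slightly):
          |
          |   before: [project page](https://example.com/project?utm_source=hn&utm_medium=social&fbclid=AbC123)
          |
          |   after:  [project page](ref1)
          |           ref1 -> https://example.com/project?utm_source=hn&utm_medium=social&fbclid=AbC123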
        
       ___________________________________________________________________
       (page generated 2024-07-23 23:09 UTC)