[HN Gopher] Htmd: A turndown.js inspired HTML-to-Markdown conver...
       ___________________________________________________________________
        
       Htmd: A turndown.js inspired HTML-to-Markdown converter for Rust
        
       Author : letmutex
       Score  : 79 points
       Date   : 2024-06-16 08:43 UTC (14 hours ago)
        
 (HTM) web link (github.com)
        
       | christophilus wrote:
       | I'm curious why one would use this vs. something like pandoc?
        
         | jak2k wrote:
         | Probably because it's Rust. Realistically, you would only need
         | it when you have already written part of your software in
         | Rust.
        
           | simicd wrote:
           | To add to that, an additional benefit is that you can
           | compile and release it as a Python package (PyO3/maturin) or
           | compile it to WASM so it runs in the browser (with JavaScript
           | bindings). This makes the code portable while benefiting from
           | Rust's performance/memory safety.
        
           | ComputerGuru wrote:
           | The majority of use cases are surely as a separate binary and
           | not integrated into your code?
        
       | jak2k wrote:
       | Why even convert HTML to markdown? Isn't it usually the other way
       | around?
        
         | setopt wrote:
         | The only use case I can think of is saving web pages to your
         | Markdown notes. Web links usually break after a year or two.
         | Unfortunately.
        
         | tarasglek wrote:
         | 1. For reading. I simplify all documents to Markdown, then
         | render them back to HTML on my readers (e-readers and RSS).
         | 
         | 2. It saves context for LLMs, which are often trained on
         | Markdown and so work best with it.
         | 
         | 3. For search. Postgres can search Markdown much better than
         | HTML.
         | 
         | Wrote https://markdown.download to help me with these
        
         | pjerem wrote:
         | If you want to store Markdown in your database but want the
         | user to work in a basic WYSIWYG/contenteditable editor, it
         | lets you avoid a full-blown Markdown editor.
        
         | AdrienBrault wrote:
         | To feed content to LLMs
        
           | simonw wrote:
           | Yeah, this. Markdown uses fewer tokens than HTML and most LLMs
           | have been trained on large amounts of Markdown.
           | 
           | That's why tools like this exist: https://jina.ai/reader/
           | 
           | Demo:
           | https://r.jina.ai/https://news.ycombinator.com/item?id=40695...
        
         | rbut wrote:
         | We have HTML templates for sending transactional email in our
         | SaaS applications.
         | 
         | The templates are very basic, e.g. <p>, <b>, <a>, etc. Users
         | can also customise these via a WYSIWYG editor.
         | 
         | We then use turndown.js to convert the rendered HTML email to
         | markdown which we then use for the text version of the email.
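
For a template vocabulary as small as the one described above (<p>, <b>, <a>), the conversion step can be sketched in plain Rust. This is only an illustration under stated assumptions, not how turndown.js or htmd work: `tiny_html_to_markdown` and `extract_href` are hypothetical helpers, the input is assumed to be well formed, `href` is assumed to be double-quoted, and entities, nesting checks, and every other tag are ignored.

```rust
/// Pull the value of a double-quoted href attribute out of a tag body
/// such as `a href="https://example.com"`. Returns an empty string if
/// no such attribute is found. Hypothetical helper for this sketch.
fn extract_href(tag: &str) -> String {
    let key = "href=\"";
    match tag.find(key) {
        Some(start) => {
            let value = &tag[start + key.len()..];
            match value.find('"') {
                Some(end) => value[..end].to_string(),
                None => String::new(),
            }
        }
        None => String::new(),
    }
}

/// Convert a tiny HTML subset (<p>, <b>, <a href="...">) to Markdown.
/// Assumes well-formed input; unknown tags are silently dropped.
fn tiny_html_to_markdown(html: &str) -> String {
    let mut out = String::new();
    let mut hrefs: Vec<String> = Vec::new(); // link targets awaiting </a>
    let mut rest = html;

    while let Some(lt) = rest.find('<') {
        let gt = match rest[lt..].find('>') {
            Some(offset) => lt + offset,
            None => break, // unterminated tag: keep the remainder as text
        };
        out.push_str(&rest[..lt]); // text before the tag is copied verbatim
        let tag = rest[lt + 1..gt].trim();
        rest = &rest[gt + 1..];

        match tag {
            "p" => {}                        // paragraph start: nothing to emit
            "/p" => out.push_str("\n\n"),    // paragraph end: blank line
            "b" | "/b" => out.push_str("**"),
            "/a" => {
                if let Some(href) = hrefs.pop() {
                    out.push_str("](");
                    out.push_str(&href);
                    out.push(')');
                }
            }
            t if t.starts_with("a ") => {
                out.push('[');
                hrefs.push(extract_href(t));
            }
            _ => {} // anything else is outside this sketch's scope
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Calling `tiny_html_to_markdown` on `<p>Hi <b>Ann</b>,</p>` yields `Hi **Ann**,`. A real converter parses the document into a tree first, which is what makes arbitrary nesting and malformed markup tractable; string scanning like this only holds up while the templates stay this simple.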
        
         | sureglymop wrote:
         | I created an RSS reader which has a uniform reader mode. I use
         | something similar to this to parse each RSS article to a
         | similar format. I'm sure there are many other use cases also.
        
         | codetrotter wrote:
         | Advent of Code exercises are almost pure markdown, but rendered
         | to HTML.
         | 
         | I've sometimes been converting it back to md to include the
         | text for each exercise alongside my solutions.
         | 
         | In my case I used a custom HTML to Markdown converter that was
         | specifically built to support only what I needed in order to
         | convert those Advent of Code exercises to markdown.
         | 
         | Mine was also written in Rust.
        
         | chaosharmonic wrote:
         | Storage.
         | 
         | My use case involves scraping job boards so that I don't have
         | to doomscroll them myself anymore, and storing them in Markdown
         | makes them smaller while also removing a bunch of extraneous
         | classes and structure.
         | 
         | Further, the side project I'm working on for managing all of
         | this can then _render_ them in a way that makes sense.
        
           | vladstudio wrote:
           | You might be interested in
           | https://www.kadoa.com/use-cases/jobs if you prefer to fully
           | automate the process of scraping job boards.
        
         | seanhunter wrote:
         | I had to do this to recover my personal blog after both it and
         | the backups had been lost due to two unrelated snafus during
         | covid. I downloaded the pages from the Internet Archive, used
         | my own shell script to extract the text as Markdown, and then
         | republished it using a static site generator.
         | 
         | Not exactly a common use case, I wouldn't think, but it's good
         | to be able to do this.
        
         | mason_mpls wrote:
         | fun project, saving pages to Obsidian
        
       | udev4096 wrote:
       | Nice! I made a CLI in Go for quickly converting Markdown to HTML
       | some time ago: https://github.com/thebigbone/markhtml
        
       | JasonSage wrote:
       | > Fast, it takes less than 200ms to convert a ~1.4MB Wikipedia
       | page on an i5 7th gen CPU
       | 
       | Okay, maybe I'm way off base here, but is this fast? 1.4MB of
       | Wikipedia page is, what, 20k lines? This doesn't sound like fast
       | Rust to me.
       | 
       | I would guess that the amount of HTML parsing that's happening is
       | way more than is actually needed to render markdown.
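
The numbers quoted above can be turned into rough rates. The sketch below only does the arithmetic: the 1.4 MB and 200 ms figures come from the quoted README line, while the 20k line count is the commenter's guess rather than a measurement, and both function names are made up for illustration.

```rust
/// Throughput in MB/s for `megabytes` of input processed in `millis` ms.
fn throughput_mb_per_s(megabytes: f64, millis: f64) -> f64 {
    megabytes / (millis / 1000.0)
}

/// Average time spent per line, in microseconds.
fn micros_per_line(millis: f64, lines: f64) -> f64 {
    millis * 1000.0 / lines
}
```

With 1.4 MB in 200 ms, `throughput_mb_per_s` gives about 7 MB/s, and `micros_per_line(200.0, 20_000.0)` gives 10 microseconds per line. Since the claim is "less than 200ms", these are a lower bound on throughput and an upper bound on per-line cost.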
        
         | tingletech wrote:
         | I'm guessing it's fast compared to the JavaScript version they
         | ported to Rust?
        
       | JohannesKauf wrote:
       | Cool to see another library in this space!
       | 
       | I see that you took the test cases from Turndown. However,
       | Turndown isn't actually that accurate. This is especially
       | noticeable when converting entire websites.
       | 
       | The best comparison would be against Pandoc. That is (in my
       | opinion) the best _html to markdown_ converter right now.
       | 
       | That said, it is extremely difficult to handle every edge case.
       | As an example, this usually causes problems:
       | <p>nitty<em>-gritty-</em>details</p>
       | 
       | Note: Six years ago I open sourced a Golang library [1].
       | Currently I am rewriting it completely with the aim of getting
       | even better than Pandoc. I also wrote about the edge cases I
       | encountered [2].
       | 
       | [1] https://github.com/JohannesKaufmann/html-to-markdown
       | 
       | [2] https://html-to-markdown.com/edge-cases
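
One plausible reason the `<p>nitty<em>-gritty-</em>details</p>` example is awkward, assuming CommonMark's delimiter rules are what the commenter has in mind: wrapping the inner text in `*` gives `nitty*-gritty-*details`, where the opening `*` is followed by punctuation but preceded by a letter, so it is not left-flanking and cannot open emphasis. The sketch below checks that condition. All function names are hypothetical, and punctuation is approximated with ASCII punctuation only, whereas the specification uses Unicode punctuation.

```rust
/// Start or end of text counts as whitespace for flanking purposes.
fn is_ws(c: Option<char>) -> bool {
    c.map_or(true, |c| c.is_whitespace())
}

/// ASCII-only approximation of the spec's punctuation class.
fn is_punct(c: Option<char>) -> bool {
    c.map_or(false, |c| c.is_ascii_punctuation())
}

/// A `*` can open emphasis only if it is left-flanking.
fn star_can_open(prev: Option<char>, next: Option<char>) -> bool {
    if is_ws(next) {
        return false;
    }
    !is_punct(next) || is_ws(prev) || is_punct(prev)
}

/// A `*` can close emphasis only if it is right-flanking.
fn star_can_close(prev: Option<char>, next: Option<char>) -> bool {
    if is_ws(prev) {
        return false;
    }
    !is_punct(prev) || is_ws(next) || is_punct(next)
}

/// Naive conversion of `before<em>inner</em>after`.
fn naive_em(before: &str, inner: &str, after: &str) -> String {
    format!("{before}*{inner}*{after}")
}

/// Would the naive output still be read as emphasis?
fn emphasis_survives(before: &str, inner: &str, after: &str) -> bool {
    let opens = star_can_open(before.chars().last(), inner.chars().next());
    let closes = star_can_close(inner.chars().last(), after.chars().next());
    opens && closes
}
```

Here `emphasis_survives("nitty", "-gritty-", "details")` is false, while `emphasis_survives("nitty ", "gritty", " details")` is true. A converter therefore has to notice such cases and pick another strategy, and that decision is where implementations tend to differ.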
        
       ___________________________________________________________________
       (page generated 2024-06-16 23:01 UTC)