[HN Gopher] Htmd: A turndown.js inspired HTML-to-Markdown conver...
___________________________________________________________________
Htmd: A turndown.js inspired HTML-to-Markdown converter for Rust
Author : letmutex
Score : 79 points
Date : 2024-06-16 08:43 UTC (14 hours ago)
(HTM) web link (github.com)
| christophilus wrote:
| I'm curious why one would use this vs something like pandoc?
| jak2k wrote:
| Probably because it's Rust. Realistically, you would only need
| it when you have already written a part of your software in
| Rust.
| simicd wrote:
| To add to that, an additional benefit would be you can
| compile and release it as a Python package (PyO3/maturin) or
| compile to WASM so it runs in the browser (with JavaScript
| bindings). This makes the code portable while benefiting from
| Rust's performance/memory safety.
| ComputerGuru wrote:
| The majority of use cases are surely as a separate binary and
| not integrated into your code?
| jak2k wrote:
| Why even convert HTML to markdown? Isn't it usually the other way
| around?
| setopt wrote:
| Only use case I can think of is to save web pages to your
| Markdown notes. Web links usually break after a year or two.
| Unfortunately.
| tarasglek wrote:
| 1. For reading. I simplify all documents to Markdown, then render
| back to HTML on my readers (e-readers and RSS).
|
| 2. Saves context for LLMs, and they are often trained on
| Markdown so work best with it.
|
| 3. For search. Can search Markdown much better than HTML with
| Postgres.
|
| Wrote https://markdown.download to help me with these
| pjerem wrote:
| If you want to store markdown in your database but you want the
| user to use a basic WYSIWYG/contenteditable editor, it lets you
| avoid a full-blown markdown editor.
| AdrienBrault wrote:
| To feed content to LLMs
| simonw wrote:
| Yeah, this. Markdown uses fewer tokens than HTML and most LLMs
| have been trained on large amounts of Markdown.
|
| That's why tools like this exist: https://jina.ai/reader/
|
| Demo: https://r.jina.ai/https://news.ycombinator.com/item?id=
| 40695...
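A rough way to see how much of an HTML document is markup rather than content is to count the bytes that sit inside tags. The sketch below does that in plain Rust. The function name `tag_byte_fraction` is made up for illustration, bytes are only a crude stand-in for tokens (no tokenizer is involved), and the scan assumes no `>` appears inside attribute values.

```rust
/// Fraction of the input's bytes that fall inside `<...>` tags.
/// Illustrative assumption: a tag starts at '<' and ends at the next '>'.
fn tag_byte_fraction(html: &str) -> f64 {
    if html.is_empty() {
        return 0.0;
    }
    let mut in_tag = false;
    let mut tag_bytes = 0usize;
    for c in html.chars() {
        if c == '<' {
            in_tag = true;
        }
        if in_tag {
            // Count the delimiters themselves as markup too.
            tag_bytes += c.len_utf8();
        }
        if c == '>' {
            in_tag = false;
        }
    }
    tag_bytes as f64 / html.len() as f64
}
```

For `<p>hi</p>` this reports 7 of 9 bytes as markup. Real pages differ a lot, so treat the number as a hint rather than a token count.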
| rbut wrote:
| We have HTML templates for sending transactional email in our
| SaaS applications.
|
| The templates are very basic, e.g. <p> <b> <a>, etc. Users can
| also customise these via a WYSIWYG editor.
|
| We then use turndown.js to convert the rendered HTML email to
| markdown which we then use for the text version of the email.
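For templates limited to a handful of tags like the ones listed above, the conversion can be sketched in a few lines. This is not turndown.js or htmd, just a hand-rolled illustration. The names `tiny_html_to_markdown` and `extract_href` are hypothetical, and the sketch assumes well-formed input, double-quoted `href` values, no HTML entities, and no Markdown escaping.

```rust
/// Convert a tiny HTML subset (<p>, <b>, <strong>, <i>, <em>, <a href="...">)
/// to Markdown. Illustrative only: no entity decoding, no escaping.
fn tiny_html_to_markdown(html: &str) -> String {
    let mut out = String::new();
    let mut hrefs: Vec<String> = Vec::new(); // link targets awaiting </a>
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        out.push_str(&rest[..start]); // text before the tag
        let Some(len) = rest[start..].find('>') else {
            // Unterminated tag: keep the remainder as literal text.
            out.push_str(&rest[start..]);
            rest = "";
            break;
        };
        let tag = &rest[start + 1..start + len];
        rest = &rest[start + len + 1..];
        let name = tag
            .split_whitespace()
            .next()
            .unwrap_or("")
            .to_ascii_lowercase();
        match name.as_str() {
            "b" | "/b" | "strong" | "/strong" => out.push_str("**"),
            "i" | "/i" | "em" | "/em" => out.push('*'),
            "/p" => out.push_str("\n\n"), // paragraph break
            "a" => {
                out.push('[');
                hrefs.push(extract_href(tag).unwrap_or_default());
            }
            "/a" => {
                let url = hrefs.pop().unwrap_or_default();
                out.push_str(&format!("]({url})"));
            }
            _ => {} // every other tag, including <p>, is dropped
        }
    }
    out.push_str(rest);
    out.trim_end().to_string()
}

/// Pull the value out of a double-quoted href attribute, if present.
fn extract_href(tag: &str) -> Option<String> {
    let start = tag.find("href=\"")? + "href=\"".len();
    let end = tag[start..].find('"')? + start;
    Some(tag[start..end].to_string())
}
```

Anything beyond this subset (lists, tables, nested or malformed markup) is where a real HTML parser and a library become worth it.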
| sureglymop wrote:
| I created an RSS reader which has a uniform reader mode. I use
| something similar to this to parse each RSS article to a
| similar format. I'm sure there are many other use cases also.
| codetrotter wrote:
| Advent of Code exercises are almost pure markdown, but rendered
| to HTML.
|
| I've sometimes been converting it back to md to include the
| text for each exercise alongside my solutions.
|
| In my case I used a custom HTML to Markdown converter that was
| specifically built to support only what I needed in order to
| convert those Advent of Code exercises to markdown.
|
| Mine was also written in Rust.
| chaosharmonic wrote:
| Storage.
|
| My use case involves scraping job boards so that I don't have
| to doomscroll them myself anymore, and storing them in Markdown
| makes them smaller while also removing a bunch of extraneous
| classes and structure.
|
| Further, the side project I'm working on for managing all of
| this can then _render_ them in a way that makes sense.
| vladstudio wrote:
| you might be interested in https://www.kadoa.com/use-cases/jobs
| if you prefer to fully automate the process of job board
| scraping
| seanhunter wrote:
| I had to do this to recover my personal blog after both it and
| the backups had been lost due to two unrelated snafus during
| covid. I downloaded the pages from the internet archive and
| used my own shellscript to extract the text as markdown and
| then republished it using a static site generator.
|
| Not exactly a common use case, I wouldn't think, but it's good to
| be able to do this.
| mason_mpls wrote:
| fun project, saving pages to Obsidian
| udev4096 wrote:
| Nice! I made a CLI in Go for quickly converting Markdown to HTML
| some time ago: https://github.com/thebigbone/markhtml
| JasonSage wrote:
| > Fast, it takes less than 200ms to convert a ~1.4MB Wikipedia
| page on an i5 7th gen CPU
|
| Okay, maybe I'm way off base here, but is this fast? 1.4MB of
| Wikipedia page is, what, 20k lines? This doesn't sound like fast
| Rust to me.
|
| I would guess that the amount of HTML parsing that's happening is
| way more than is actually needed to render markdown.
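To put the quoted figure in perspective, it helps to turn it into a throughput number. A minimal sketch, assuming 1 MB = 1,000,000 bytes; the helper name `throughput_mb_per_s` is made up for illustration:

```rust
/// Throughput in MB/s (1 MB = 1_000_000 bytes) for `bytes` processed in `millis`.
/// Returns infinity if `millis` is zero and `bytes` is non-zero.
fn throughput_mb_per_s(bytes: u64, millis: u64) -> f64 {
    let megabytes = bytes as f64 / 1_000_000.0;
    let seconds = millis as f64 / 1_000.0;
    megabytes / seconds
}
```

With the quoted numbers (about 1.4 MB in under 200 ms) this works out to at least roughly 7 MB/s, so whether that counts as fast depends on what you compare it against.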
| tingletech wrote:
| I'm guessing it's fast compared to the JavaScript version they
| ported to Rust?
| JohannesKauf wrote:
| Cool to see another library in this space!
|
| I see that you took the test cases from Turndown. However,
| Turndown isn't actually that accurate. This is especially
| noticeable when converting entire websites.
|
| The best comparison would be against Pandoc. That is (in my
| opinion) the best _html to markdown_ converter right now.
|
| Although it is extremely difficult to handle every edge case. As
| an example, this usually causes problems:
| <p>nitty<em>-gritty-</em>details</p>
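One reason this input is awkward: a converter that simply swaps the tags for `*` produces `nitty*-gritty-*details`, and as far as I read the CommonMark flanking rules, the first `*` there cannot open emphasis, because it is followed by punctuation and preceded by a letter. The sketch below checks that rule for a single `*`. The function name `is_left_flanking` is hypothetical, and it treats only ASCII punctuation as punctuation, which is a simplification of the spec.

```rust
/// Simplified CommonMark "left-flanking" test for a one-character delimiter
/// (such as `*`) at byte index `i` of `text`.
/// Assumptions: the delimiter is a single ASCII byte, and only ASCII
/// punctuation counts as punctuation.
fn is_left_flanking(text: &str, i: usize) -> bool {
    let before = text[..i].chars().next_back();
    let after = text[i + 1..].chars().next();
    // Start and end of the text count as whitespace.
    let is_ws = |c: Option<char>| c.map_or(true, |c| c.is_whitespace());
    let is_punct = |c: Option<char>| c.map_or(false, |c| c.is_ascii_punctuation());

    if is_ws(after) {
        return false; // followed by whitespace: never left-flanking
    }
    if !is_punct(after) {
        return true; // followed by an ordinary character
    }
    // Followed by punctuation: needs whitespace or punctuation before it.
    is_ws(before) || is_punct(before)
}
```

For the first `*` in `nitty*-gritty-*details` the check is false, while in `nitty *gritty* details` it is true. A converter therefore has to move the punctuation outside the markers, escape it, or fall back to inline HTML.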
|
| Note: Six years ago I open sourced a Golang library [1].
| Currently I am re-writing it completely with the aim of getting
| even better than Pandoc. And wrote about the encountered edge-
| cases [2].
|
| [1] https://github.com/JohannesKaufmann/html-to-markdown
|
| [2] https://html-to-markdown.com/edge-cases
___________________________________________________________________
(page generated 2024-06-16 23:01 UTC)