[HN Gopher] Show HN: HTML-to-Markdown - convert entire websites ...
       ___________________________________________________________________
        
       Show HN: HTML-to-Markdown - convert entire websites to Markdown
       with Golang/CLI
        
       Hey HN!  I originally built "html-to-markdown" back in 2018 (while
       still in high school) to handle complex HTML conversions where
       other libraries struggled.  Now, I've released v2 -- a complete
       rewrite designed to handle even more edge cases. It supports entire
       websites with high accuracy.  Example use: I've used it in my RSS
       reader to strip HTML down to clean Markdown, similar to the "Reader
       Mode" in your browser.  It can be used as a Golang package or as a
       CLI.  Give it a try & tell me what edge cases you encounter!
        
       Author : JohannesKauf
       Score  : 230 points
       Date   : 2024-11-09 09:48 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sureglymop wrote:
        | Reminds me of Aaron Swartz's html2text, which I think serves
        | the same purpose: http://www.aaronsw.com/2002/html2text/
        
         | stevekemp wrote:
          | Same idea I guess, but Aaron's has been broken for years -
         | and probably for the best because it didn't stop people
         | specifying things like "file:////etc/passwd" as the URL to
         | export to markdown.
        
       | cpursley wrote:
       | This is really nice, especially for feeding LLMs web page data
       | (they generally understand markdown well).
       | 
       | I built something similar for the Elixir world but it's much more
       | limited (I might borrow some of your ideas):
       | 
       | https://github.com/agoodway/html2markdown
        
         | jaggirs wrote:
         | Why not just give the html to the llm?
        
           | kgeist wrote:
           | I remember there was a paper which found that LLMs understand
            | HTML pretty well, so you don't need additional preprocessing.
           | The downside is that HTML produces more tokens than Markdown.
        
             | simonw wrote:
             | Right: the token savings can be enormous here.
             | 
             | Use https://tools.simonwillison.net/jina-reader to fetch
             | the https://news.ycombinator.com/ homepage as Markdown and
             | paste it into https://tools.simonwillison.net/claude-token-
             | counter - 1550 tokens.
             | 
             | Same thing as HTML: 13367 tokens.
        
           | zexodus wrote:
           | Context size limits are usually the reason. Most websites I
           | want to scrape end up being over 200K tokens. Tokenization
           | for HTML isn't optimal because symbols like '<', '>', '/',
           | etc. end up being separate tokens, whereas whole words can be
           | one token if we're talking about plain text.
           | 
           | Possible approaches include transforming the text to MD or
           | minimizing the HTML (e.g., removing script tags, comments,
           | etc.).
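
The minimizing approach described above can be sketched in a few lines of Go. This is a naive regex version (a real implementation should walk a parsed DOM, e.g. with golang.org/x/net/html, since regexes break on edge cases like "</script>" inside a JS string):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripNoise removes <script> and <style> blocks plus HTML comments
// before the content is handed to a tokenizer or converter.
func stripNoise(doc string) string {
	scripts := regexp.MustCompile(`(?is)<script\b.*?</script>`)
	styles := regexp.MustCompile(`(?is)<style\b.*?</style>`)
	comments := regexp.MustCompile(`(?s)<!--.*?-->`)

	out := scripts.ReplaceAllString(doc, "")
	out = styles.ReplaceAllString(out, "")
	out = comments.ReplaceAllString(out, "")
	return strings.TrimSpace(out)
}

func main() {
	page := `<p>Hello</p><script>var x = 1;</script><!-- tracker -->`
	fmt.Println(stripNoise(page)) // <p>Hello</p>
}
```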
        
           | dtjohnnyb wrote:
           | I was trying to do this recently for Web page summarization.
           | As said below the token sizes would end up over the context
           | length, so I trimmed the html to fit just to see what would
           | happen. I found that the LLM was able to extract information,
           | but it very commonly would start trying to continue the html
           | blocks that had been left open in the trimmed input.
            | Presumably this is due to instruction tuning on coding
            | tasks.
           | 
           | I'd love to figure out a way to do it though, it seems to me
           | that there's a bunch of rich description of the website in
            | the HTML.
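
One workaround for the dangling-tags problem described above is to close whatever the truncation left open before sending the HTML to the model. A rough sketch that ignores malformed markup and mismatched nesting:

```go
package main

import (
	"fmt"
	"regexp"
)

// voidTags are elements that never take a closing tag.
var voidTags = map[string]bool{
	"br": true, "img": true, "hr": true, "meta": true,
	"link": true, "input": true,
}

var tagPattern = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// closeOpenTags appends closing tags for every element left open after
// the HTML was cut off, so a model no longer sees dangling blocks.
func closeOpenTags(truncated string) string {
	var stack []string
	for _, m := range tagPattern.FindAllStringSubmatch(truncated, -1) {
		isClosing, name := m[1] == "/", m[2]
		if voidTags[name] {
			continue
		}
		if isClosing {
			if len(stack) > 0 && stack[len(stack)-1] == name {
				stack = stack[:len(stack)-1]
			}
		} else {
			stack = append(stack, name)
		}
	}
	out := truncated
	for i := len(stack) - 1; i >= 0; i-- {
		out += "</" + stack[i] + ">"
	}
	return out
}

func main() {
	fmt.Println(closeOpenTags(`<div><p>cut off mid`)) // <div><p>cut off mid</p></div>
}
```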
        
         | JohannesKauf wrote:
         | > built something similar for the Elixir
         | 
         | We interact with the web so much that it's worth having such a
         | library in every language... Great that you took the time and
         | wrote one for the Elixir community!
         | 
         | Feel free to contact me if you want to ping-pong some ideas!
         | 
         | > feeding LLMs web page data
         | 
          | Exactly, that's one use case that got quite popular. There is
         | also the feature of keeping specific HTML tags (e.g. <article>
         | and <footer>) to give the LLM a bit more context about the
         | page.
        
       | Savageman wrote:
       | I remember a long time ago I used Pandoc for this.
       | 
       | Fresh tools and more choice is very welcome, thanks for your
       | work!
        
         | JohannesKauf wrote:
         | Pandoc is amazing. Especially because they support so many
         | formats.
         | 
         | And their html to markdown converter is (in my opinion) still
         | the best right now.
         | 
         | But html-to-markdown is getting close. My goal is to cover more
         | edge cases than pandoc...
        
       | paradite wrote:
       | I have been using these two:
       | 
       | https://farnots.github.io/RedditToMarkdown/
       | 
       | https://urltomarkdown.com/
       | 
       | Incredibly useful for leveraging LLMs and building AI apps.
        
       | oezi wrote:
       | Does it also include logic to download JS-driven sites properly
       | or is this out of scope?
        
         | simonw wrote:
         | It doesn't. For that you would need to execute a full headless
         | browser first, extract the HTML (document.body.innerHTML after
         | the page has finished loading can work) and process the result.
         | 
         | If you're already running a headless browser you may as well
         | run the conversion in JavaScript though - I use this recipe
         | pretty often with my shot-scraper tool: https://shot-
         | scraper.datasette.io/en/stable/javascript.html#... - adding
         | https://github.com/mixmark-io/turndown to the mix will get you
         | Markdown conversion as well.
        
         | JohannesKauf wrote:
         | That is unfortunately out of scope. I like the philosophy of
         | doing one thing really well.
         | 
         | But nowadays--with Playwright and Puppeteer--there are great
          | choices for browser automation.
        
         | jot wrote:
         | We do that with Urlbox's markdown feature:
         | https://urlbox.com/extracting-text
        
         | bni wrote:
         | I used https://github.com/mozilla/readability for this
        
       | plaidwombat wrote:
       | Great work. I thank you for it. I've used your library for a few
       | years in a Lambda function which takes a URL and converts it to
       | Markdown for storage in S3. I hooked it into every "bookmark" app
       | I use as a webhook so I save a Markdown copy of everything I
       | bookmark, which makes it very handy for importing into Obsidian.
        
         | JohannesKauf wrote:
         | Oh very nice to hear, thank you very much!
         | 
         | That's actually a great idea!
         | 
         | I personally use Raindrop for bookmarking articles. But I can't
         | find stuff in the search.
         | 
         | The other day "Embeddings are underrated" was on HN. That would
         | actually be a good approach for finding stuff later on. Using
          | webhooks, converting to markdown, generating embeddings and
          | then enjoying a better search. You just gave me the idea for
          | my next weekend project :-)
         | weekend project :-)
        
       | yayoohooyahoo wrote:
       | Turndown works quite well too: https://github.com/mixmark-
       | io/turndown
        
       | rty32 wrote:
       | Nice! And glad to see it's MIT licensed.
       | 
       | I wonder if it is feasible to use this as a replacement for p2k,
       | instapaper etc for the purpose of reading on Kindle. One
       | annoyance with these services is that the rendering is way off --
       | h elements not showing up as headers, elements missing randomly,
       | source code incorrectly rendered in 10 different ways. Some are
       | better than others, but generally they are disappointing. (Yet
       | they expect you to pay a subscription fee.) If this is an
       | actively maintained project that welcomes contribution, I could
       | test it out with various articles and report/fix issues. Although
       | I wonder how much work there will be for handling edge cases of
       | all the websites out there.
        
         | JohannesKauf wrote:
         | There are two parts to it:
         | 
          | 1) Convert HTML to Markdown
         | 
         | This is what my library specifically addresses, and I believe
         | it handles this task robustly. There was a lot of testing
         | involved. For example, I used the CommonCrawl Dataset to
         | automatically catch edge cases.
         | 
         | 2) Identify article content
         | 
         | This is the more challenging aspect. You need to identify and
         | extract the main content while removing peripheral elements
         | (navigation bars, sidebars, ads, etc.)
         | 
         | For example, the top of the markdown document will have lots of
         | links from the navbar otherwise.
         | 
         | Mozilla's "Readability" project (and its various ports) is the
         | most used solution in this space. However, it relies on
         | heuristic rules that need adjustments to work on every website.
         | 
         | ---
         | 
         | The html-to-markdown project in combination with some heuristic
          | would be a great match! There is actually a comment below [1]
         | about this topic. Feel free to contact me if you start this
         | project, would be happy to help!
         | 
         | [1] https://news.ycombinator.com/item?id=42094012
        
           | dleeftink wrote:
           | I'm working on a Textify API that collates elements based on
           | the visible/running flow of text elements. It's not quite
           | there yet, but is able to get the running content of HTML
           | pages quite consistently. You can check it out here:
           | 
           | [0]: https://github.com/dleeftink/plainmark
        
       | throwup238 wrote:
       | This is probably out of scope for your tool but it'd be nice to
       | have built in n-gram deduplication where the tool strips any
       | identical content from the header and footer, like navigation,
       | when pointed at a few of these markdown files.
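
The deduplication idea above can be approximated with a line-based comparison across pages: anything every page shares at its top and bottom is almost certainly converted navigation or footer. A sketch (line-level rather than true n-grams, which is enough when boilerplate converts identically on every page):

```go
package main

import (
	"fmt"
	"strings"
)

// stripShared removes the longest run of lines that all pages share at
// their start and end -- typically the converted header and footer.
func stripShared(pages []string) []string {
	if len(pages) < 2 {
		return pages
	}
	split := make([][]string, len(pages))
	minLen := -1
	for i, p := range pages {
		split[i] = strings.Split(p, "\n")
		if minLen < 0 || len(split[i]) < minLen {
			minLen = len(split[i])
		}
	}
	// Longest shared prefix of lines.
	prefix := 0
	for prefix < minLen {
		same := true
		for _, s := range split[1:] {
			if s[prefix] != split[0][prefix] {
				same = false
				break
			}
		}
		if !same {
			break
		}
		prefix++
	}
	// Longest shared suffix, not overlapping the prefix.
	suffix := 0
	for suffix < minLen-prefix {
		same := true
		for _, s := range split[1:] {
			if s[len(s)-1-suffix] != split[0][len(split[0])-1-suffix] {
				same = false
				break
			}
		}
		if !same {
			break
		}
		suffix++
	}
	out := make([]string, len(pages))
	for i, s := range split {
		out[i] = strings.Join(s[prefix:len(s)-suffix], "\n")
	}
	return out
}

func main() {
	pages := []string{
		"[Home](/)\n# Post A\nbody A\n(c) 2024 Example",
		"[Home](/)\n# Post B\nbody B\n(c) 2024 Example",
	}
	for _, p := range stripShared(pages) {
		fmt.Println(p)
	}
}
```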
        
         | JohannesKauf wrote:
          | My final university project was about a clean-up approach
          | for the HTML nodes _before_ sending them to the
          | html-to-markdown converter. But that was extremely difficult
          | and dependent on
         | some heuristics that had to be tweaked.
         | 
         | Your idea of comparing multiple pages would be a great
          | approach. It would be amazing if you built something like this!
         | This would enable so many more use cases... For example a
         | better "send to kindle" (see other comment from rty32 [1]).
         | 
         | [1] https://news.ycombinator.com/item?id=42093964
        
       | miki123211 wrote:
       | If you need this sort of thing in any other language, there's a
       | free, no-auth, no-api-key-required, no-strings-attached API that
       | can do this at https://jina.ai/reader/
       | 
       | You just fetch a URL like
       | `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and
       | get a markdown document for the "inner" URL.
       | 
       | I've actually used this and it's not perfect, there are websites
       | (mostly those behind Cloudflare and other such proxies) that it
          | can't handle, but it does 90% of the job, and is a one-liner in
       | most languages with a decent HTTP requests library.
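
The one-liner mentioned above looks roughly like this in Go. The prefixing scheme is taken verbatim from the example URL; the helper names are mine:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// readerURL prefixes a target page with the Jina Reader endpoint; the
// inner URL is passed through verbatim, as in the example above.
func readerURL(target string) string {
	return "https://r.jina.ai/" + target
}

// fetchMarkdown does the actual network call and returns the Markdown
// rendering of the inner URL.
func fetchMarkdown(target string) (string, error) {
	resp, err := http.Get(readerURL(target))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	fmt.Println(readerURL("https://www.asimov.press/p/mitochondria"))
}
```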
        
         | JohannesKauf wrote:
         | Thanks, Jina actually looks quite nice for use in LLMs.
         | 
          | I also provide a REST API [1] that you can use for free
          | (within limits). However, you have to get an API key by
          | registering with GitHub (see reason below).
         | 
         | ---
         | 
          | The demo was previously hosted on Vercel. Someone misused
          | the demo and sent ~5 million requests per day -- and would
          | not stop -- which quickly brought me over the bandwidth
          | limits of Vercel. And bandwidth is really, really expensive!
         | 
         | So that is the reason for requiring API Keys and hosting it on
         | a VPS... Lessons learned!
         | 
         | [1] https://html-to-markdown.com/api
        
           | emptiestplace wrote:
           | Seems pretty risky to not implement rate limits either way.
        
             | JohannesKauf wrote:
             | The problem was: Doing rate limiting on the application
             | level was not enough. Once the request hit my backend the
             | _incoming bandwidth_ was _already_ consumed -- and I was
             | charged for it.
             | 
             | I contacted Vercel's Support to block that specific IP
             | address but unfortunately they weren't helpful.
        
               | emptiestplace wrote:
               | So you're probably still vulnerable to this even with the
               | key requirement, but they stopped once you removed the
               | incentive? Did you notice what they were scraping?
        
               | JohannesKauf wrote:
               | Sorry, I mixed up a few topics here:
               | 
               | - Moved everything to a VPS - way better value for money.
               | Extra TB of traffic only costs EUR1-10 with
                | Hetzner/DigitalOcean compared to EUR400 with Vercel's
                | old pricing.
               | 
               | - Put Cloudflare in front - gives me an extra layer of
               | control (if I ever need it)
               | 
               | - Built a proper REST API - now there's an _official_ way
               | to use the converter programmatically
               | 
               | - Made email registration mandatory for API keys - lets
               | me reach out before having to block anyone
               | 
               | That other server was probably running a scraper and then
                | converting the HTML pages to markdown. After about 2
               | weeks they noticed that I was just returning garbage and
               | it stopped :)
        
               | emptiestplace wrote:
               | Ah! Makes sense now, thanks for sharing.
               | 
               | I've had good success with Cloudflare's free-tier
               | features for rate limiting. If you haven't tried it, it
               | only takes a couple minutes to enable and should be
               | pretty set-and-forget for your API.
        
         | petercooper wrote:
         | I use this too and, not to detract from your enthusiasm, it's
         | not exactly no-strings-attached. There's a token limit on free
         | use and you can't use it for any commercial purposes. Luckily
         | the pricing for unrestricted use is reasonable though at 2
         | cents per million tokens.
         | 
         | People will also want to note that it's LLM-powered which has
         | pros and cons. One pro being that you can download and run
         | their model yourself for non commercial use cases:
         | https://huggingface.co/jinaai/reader-lm-1.5b
        
       | NotACracker wrote:
       | Pandoc
       | 
       | http://www.cantoni.org/2019/01/27/converting-html-markdown-u...
        
         | bbor wrote:
         | For clarity: I'm a pandoc diehard (especially because it's
         | written by a philosopher!) but it intentionally doesn't
         | approach this level of functionality, AFAIK.
        
       | jot wrote:
       | This is great!
       | 
       | If you also want to grab an accurate screenshot with the markdown
       | of a webpage you can get both with Urlbox.
       | 
       | We have a couple of free tools that use this feature:
       | 
       | https://screenshotof.com https://url2text.com
        
       | ssousa666 wrote:
       | I have been looking for a similar lib to use in a Kotlin/Spring
       | app - any recommendations? My specific use-case does not need to
       | support sanitizing during the HTML -> MD conversion, as the HTML
       | doc strings that I will be converting are sanitized during the
       | extraction phase (using JSoup).
        
       | inhumantsar wrote:
       | i've made some modest contributions to Mozilla's Readability
       | library and didn't see anything like their heuristics in this.
       | 
       | are you using a separate library for that or did I miss something
       | in this?
        
         | inhumantsar wrote:
         | oops, refreshed the page and saw other comments addressing
         | this! nevermind!
        
       | juliuskiesian wrote:
        | One of the pain points of using this kind of tool is handling
       | syntax highlighted code blocks. How does html-to-markdown perform
       | in such scenarios?
        
         | JohannesKauf wrote:
          | Yeah, good point, that's actually difficult. Syntax
          | highlighters use many `<span>` tags to color individual
          | words and syntax.
         | 
         | But I wrote logic to handle that. It probably needs to be
         | adapted at some point, but works surprisingly well. Have a look
         | at the testdata files ("code.in.html" and "code.out.md" files
         | [1]).
         | 
         | Feel free to give it a try & let me know if you notice any edge
         | cases!
         | 
         | [1] https://github.com/JohannesKaufmann/html-to-
         | markdown/blob/ma...
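
The core of handling those span-heavy blocks is flattening them back to plain source text before wrapping the result in a fenced code block. A naive regex sketch of that first step (not the library's actual implementation):

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

var anyTag = regexp.MustCompile(`<[^>]+>`)

// codeText flattens a syntax-highlighted block into plain source text by
// dropping every tag and then unescaping entities. Tags are stripped
// first so that escaped markup in the code itself survives as text.
func codeText(highlighted string) string {
	return html.UnescapeString(anyTag.ReplaceAllString(highlighted, ""))
}

func main() {
	block := `<span class="kw">if</span> x <span class="op">&lt;</span> 1 {`
	fmt.Println(codeText(block)) // if x < 1 {
}
```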
        
       | lollobomb wrote:
        | This is nice. I tried a plugin for pandoc in the past but it
        | didn't really work well.
        
       | hello_computer wrote:
       | This is honorable work. Thank you.
        
       ___________________________________________________________________
       (page generated 2024-11-09 23:00 UTC)