[HN Gopher] Show HN: HTML-to-Markdown - convert entire websites ...
___________________________________________________________________
Show HN: HTML-to-Markdown - convert entire websites to Markdown
with Golang/CLI
Hey HN! I originally built "html-to-markdown" back in 2018 (while
still in high school) to handle complex HTML conversions where
other libraries struggled. Now, I've released v2 -- a complete
rewrite designed to handle even more edge cases. It supports entire
websites with high accuracy. Example use: I've used it in my RSS
reader to strip HTML down to clean Markdown, similar to the "Reader
Mode" in your browser. It can be used as a Golang package or as a
CLI. Give it a try & tell me what edge cases you encounter!
Author : JohannesKauf
Score : 230 points
Date : 2024-11-09 09:48 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sureglymop wrote:
| Reminds me of Aaron Swartz' html2text that I think serves the
| same purpose: http://www.aaronsw.com/2002/html2text/
| stevekemp wrote:
| Same idea I guess, but Aaron's has been broken for years -
| and probably for the best because it didn't stop people
| specifying things like "file:////etc/passwd" as the URL to
| export to markdown.
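
The `file:` URL problem described above is usually handled by checking the URL scheme before fetching anything. The following is a minimal sketch of that idea, not code from html2text or html-to-markdown; the names `ALLOWED_SCHEMES` and `is_safe_url` are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: only plain web URLs should ever be fetched by a converter service.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs that name a host."""
    parsed = urlparse(url.strip())
    # Rejects file:, ftp:, data: and scheme-less local paths.
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)
```

A scheme check like this only blocks local-file access. It does not stop requests to internal network hosts, which would need separate handling.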
| cpursley wrote:
| This is really nice, especially for feeding LLMs web page data
| (they generally understand markdown well).
|
| I built something similar for the Elixir world but it's much more
| limited (I might borrow some of your ideas):
|
| https://github.com/agoodway/html2markdown
| jaggirs wrote:
| Why not just give the html to the llm?
| kgeist wrote:
| I remember there was a paper which found that LLMs understand
| HTML pretty well, you don't need additional preprocessing.
| The downside is that HTML produces more tokens than Markdown.
| simonw wrote:
| Right: the token savings can be enormous here.
|
| Use https://tools.simonwillison.net/jina-reader to fetch
| the https://news.ycombinator.com/ homepage as Markdown and
| paste it into
| https://tools.simonwillison.net/claude-token-counter - 1550
| tokens.
|
| Same thing as HTML: 13367 tokens.
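
To put the two token counts quoted above side by side, here is a small arithmetic sketch. The function names are hypothetical; the numbers are the ones reported in the comment.

```python
def token_ratio(html_tokens: int, markdown_tokens: int) -> float:
    """How many times more tokens the HTML version uses."""
    return html_tokens / markdown_tokens


def token_savings(html_tokens: int, markdown_tokens: int) -> float:
    """Fraction of tokens saved by sending Markdown instead of HTML."""
    return 1 - markdown_tokens / html_tokens


# Counts reported for the news.ycombinator.com homepage in the comment above.
ratio = token_ratio(13367, 1550)
savings = token_savings(13367, 1550)
print(f"HTML uses {ratio:.1f}x the tokens; Markdown saves {savings:.0%}")
```

For this one page the HTML is roughly 8.6 times larger in tokens. Other pages will differ.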
| zexodus wrote:
| Context size limits are usually the reason. Most websites I
| want to scrape end up being over 200K tokens. Tokenization
| for HTML isn't optimal because symbols like '<', '>', '/',
| etc. end up being separate tokens, whereas whole words can be
| one token if we're talking about plain text.
|
| Possible approaches include transforming the text to MD or
| minimizing the HTML (e.g., removing script tags, comments,
| etc.).
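
The second approach mentioned above, minimizing the HTML by dropping script tags and comments, can be sketched with Python's standard `html.parser`. This is an illustrative assumption about how such a filter might look, not what any tool in this thread does; `HTMLMinimizer`, `SKIPPED_TAGS` and `minimize_html` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: these elements carry no content worth sending to an LLM.
SKIPPED_TAGS = {"script", "style", "noscript"}


class HTMLMinimizer(HTMLParser):
    """Re-emit HTML while dropping comments and the skipped elements.

    Comments and doctype declarations are dropped simply because their
    handlers are not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a skipped region.
        if self.skip_depth == 0 and tag not in SKIPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def minimize_html(html: str) -> str:
    parser = HTMLMinimizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, `minimize_html('<p>Hi<!-- note --> there</p><script>track()</script>')` keeps only the paragraph. Attributes are left untouched here; stripping `class` and `style` attributes would likely save more tokens.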
| dtjohnnyb wrote:
| I was trying to do this recently for Web page summarization.
| As said below the token sizes would end up over the context
| length, so I trimmed the html to fit just to see what would
| happen. I found that the LLM was able to extract information,
| but it very commonly would start trying to continue the html
| blocks that had been left open in the trimmed input.
| Presumably this is due to instruction tuning on coding tasks.
|
| I'd love to figure out a way to do it though, it seems to me
| that there's a bunch of rich description of the website in
| the html
| JohannesKauf wrote:
| > built something similar for the Elixir
|
| We interact with the web so much that it's worth having such a
| library in every language... Great that you took the time and
| wrote one for the Elixir community!
|
| Feel free to contact me if you want to ping-pong some ideas!
|
| > feeding LLMs web page data
|
| Exactly, that's one use case that got quite popular. There is
| also the feature of keeping specific HTML tags (e.g. <article>
| and <footer>) to give the LLM a bit more context about the
| page.
| Savageman wrote:
| I remember a long time ago I used Pandoc for this.
|
| Fresh tools and more choice are very welcome, thanks for your
| work!
| JohannesKauf wrote:
| Pandoc is amazing. Especially because they support so many
| formats.
|
| And their html to markdown converter is (in my opinion) still
| the best right now.
|
| But html-to-markdown is getting close. My goal is to cover more
| edge cases than pandoc...
| paradite wrote:
| I have been using these two:
|
| https://farnots.github.io/RedditToMarkdown/
|
| https://urltomarkdown.com/
|
| Incredibly useful for leveraging LLMs and building AI apps.
| oezi wrote:
| Does it also include logic to download JS-driven sites properly
| or is this out of scope?
| simonw wrote:
| It doesn't. For that you would need to execute a full headless
| browser first, extract the HTML (document.body.innerHTML after
| the page has finished loading can work) and process the result.
|
| If you're already running a headless browser you may as well
| run the conversion in JavaScript though - I use this recipe
| pretty often with my shot-scraper tool:
| https://shot-scraper.datasette.io/en/stable/javascript.html#...
| - adding https://github.com/mixmark-io/turndown to the mix will
| get you Markdown conversion as well.
| JohannesKauf wrote:
| That is unfortunately out of scope. I like the philosophy of
| doing one thing really well.
|
| But nowadays, with Playwright and Puppeteer, there are great
| choices for browser automation.
| jot wrote:
| We do that with Urlbox's markdown feature:
| https://urlbox.com/extracting-text
| bni wrote:
| I used https://github.com/mozilla/readability for this
| plaidwombat wrote:
| Great work. I thank you for it. I've used your library for a few
| years in a Lambda function which takes a URL and converts it to
| Markdown for storage in S3. I hooked it into every "bookmark" app
| I use as a webhook so I save a Markdown copy of everything I
| bookmark, which makes it very handy for importing into Obsidian.
| JohannesKauf wrote:
| Oh very nice to hear, thank you very much!
|
| That's actually a great idea!
|
| I personally use Raindrop for bookmarking articles. But I can't
| find stuff in the search.
|
| The other day "Embeddings are underrated" was on HN. That would
| actually be a good approach for finding stuff later on. Using
| webhooks, converting to markdown, generating embeddings and then
| enjoying a better search. You just gave me the idea for my next
| weekend project :-)
| yayoohooyahoo wrote:
| Turndown works quite well too:
| https://github.com/mixmark-io/turndown
| rty32 wrote:
| Nice! And glad to see it's MIT licensed.
|
| I wonder if it is feasible to use this as a replacement for p2k,
| instapaper etc for the purpose of reading on Kindle. One
| annoyance with these services is that the rendering is way off --
| h elements not showing up as headers, elements missing randomly,
| source code incorrectly rendered in 10 different ways. Some are
| better than others, but generally they are disappointing. (Yet
| they expect you to pay a subscription fee.) If this is an
| actively maintained project that welcomes contribution, I could
| test it out with various articles and report/fix issues. Although
| I wonder how much work there will be for handling edge cases of
| all the websites out there.
| JohannesKauf wrote:
| There are two parts to it:
|
| 1) convert html to markdown
|
| This is what my library specifically addresses, and I believe
| it handles this task robustly. There was a lot of testing
| involved. For example, I used the CommonCrawl Dataset to
| automatically catch edge cases.
|
| 2) Identify article content
|
| This is the more challenging aspect. You need to identify and
| extract the main content while removing peripheral elements
| (navigation bars, sidebars, ads, etc.)
|
| For example, the top of the markdown document will have lots of
| links from the navbar otherwise.
|
| Mozilla's "Readability" project (and its various ports) is the
| most used solution in this space. However, it relies on
| heuristic rules that need adjustments to work on every website.
|
| ---
|
| The html-to-markdown project in combination with some heuristics
| would be a great match! There is actually a comment below [1]
| about this topic. Feel free to contact me if you start this
| project, would be happy to help!
|
| [1] https://news.ycombinator.com/item?id=42094012
| dleeftink wrote:
| I'm working on a Textify API that collates elements based on
| the visible/running flow of text elements. It's not quite
| there yet, but is able to get the running content of HTML
| pages quite consistently. You can check it out here:
|
| [0]: https://github.com/dleeftink/plainmark
| throwup238 wrote:
| This is probably out of scope for your tool but it'd be nice to
| have built in n-gram deduplication where the tool strips any
| identical content from the header and footer, like navigation,
| when pointed at a few of these markdown files.
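
A simpler cousin of the n-gram deduplication suggested above is to compare whole lines: strip the leading and trailing lines that are identical across several converted pages of the same site. The sketch below does only that line-level comparison, not true n-gram matching, and `strip_shared_boilerplate` is a hypothetical name rather than a feature of html-to-markdown.

```python
def strip_shared_boilerplate(docs):
    """Remove leading and trailing lines that are identical in every document.

    Assumes `docs` is a list of Markdown strings converted from pages of
    the same site, so shared navigation shows up as a common prefix/suffix.
    """
    split = [doc.splitlines() for doc in docs]
    if len(split) < 2:
        return list(docs)  # nothing to compare against
    shortest = min(len(lines) for lines in split)

    # Count lines shared at the top (e.g. a navbar).
    head = 0
    while head < shortest and len({lines[head] for lines in split}) == 1:
        head += 1

    # Count lines shared at the bottom (e.g. a footer), without overlapping the head.
    tail = 0
    while tail < shortest - head and len({lines[-1 - tail] for lines in split}) == 1:
        tail += 1

    return ["\n".join(lines[head:len(lines) - tail]) for lines in split]
```

This breaks down as soon as the navigation differs slightly between pages (for instance a highlighted "current page" link), which is where fuzzier n-gram matching would help.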
| JohannesKauf wrote:
| My final university project was about a clean-up-approach on
| the HTML nodes _before_ sending it to the html-to-markdown
| converter. But that was extremely difficult and dependent on
| some heuristics that had to be tweaked.
|
| Your idea of comparing multiple pages would be a great
| approach. It would be amazing if you build something like this!
| This would enable so many more use cases... For example a
| better "send to kindle" (see other comment from rty32 [1]).
|
| [1] https://news.ycombinator.com/item?id=42093964
| miki123211 wrote:
| If you need this sort of thing in any other language, there's a
| free, no-auth, no-api-key-required, no-strings-attached API that
| can do this at https://jina.ai/reader/
|
| You just fetch a URL like
| `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and
| get a markdown document for the "inner" URL.
|
| I've actually used this and it's not perfect, there are websites
| (mostly those behind Cloudflare and other such proxies) that it
| can't handle, but it does 90% of the job, and is a one-liner in
| most languages with a decent HTTP requests library.
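
As a rough illustration of the URL-prefix pattern described above, the sketch below builds the reader URL and fetches it with the standard library. Only the `https://r.jina.ai/` prefix comes from the comment; the function names, the User-Agent string, the timeout and the UTF-8 decoding are assumptions.

```python
from urllib.request import Request, urlopen

# Prefix taken from the comment above; everything else is illustrative.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prefix the target ("inner") URL with the reader endpoint."""
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of target_url (performs a network request)."""
    request = Request(
        reader_url(target_url),
        headers={"User-Agent": "markdown-fetch-example"},  # hypothetical value
    )
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the service responds with UTF-8 text.
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_markdown("https://www.asimov.press/p/mitochondria")` would request the example URL from the comment. Usage limits and terms of the service are discussed further down the thread.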
| JohannesKauf wrote:
| Thanks, Jina actually looks quite nice for use in LLMs.
|
| I also provide a REST API [1] that you can use for free (within
| limits). However, you have to get an API key by registering with
| GitHub (see reason below).
|
| ---
|
| The demo was previously hosted on Vercel. Someone misused the
| demo and sent ~5 million requests per day. And would not stop
| -- which quickly brought me over the bandwidth limits of
| Vercel. And bandwidth is really really expensive!
|
| So that is the reason for requiring API Keys and hosting it on
| a VPS... Lessons learned!
|
| [1] https://html-to-markdown.com/api
| emptiestplace wrote:
| Seems pretty risky to not implement rate limits either way.
| JohannesKauf wrote:
| The problem was: Doing rate limiting on the application
| level was not enough. Once the request hit my backend the
| _incoming bandwidth_ was _already_ consumed -- and I was
| charged for it.
|
| I contacted Vercel's Support to block that specific IP
| address but unfortunately they weren't helpful.
| emptiestplace wrote:
| So you're probably still vulnerable to this even with the
| key requirement, but they stopped once you removed the
| incentive? Did you notice what they were scraping?
| JohannesKauf wrote:
| Sorry, I mixed up a few topics here:
|
| - Moved everything to a VPS - way better value for money.
| Extra TB of traffic only costs €1-10 with
| Hetzner/DigitalOcean compared to €400 with Vercel's old
| pricing.
|
| - Put Cloudflare in front - gives me an extra layer of
| control (if I ever need it)
|
| - Built a proper REST API - now there's an _official_ way
| to use the converter programmatically
|
| - Made email registration mandatory for API keys - lets
| me reach out before having to block anyone
|
| That other server was probably running a scraper and then
| converting the html-websites to markdown. After about 2
| weeks they noticed that I was just returning garbage and
| it stopped :)
| emptiestplace wrote:
| Ah! Makes sense now, thanks for sharing.
|
| I've had good success with Cloudflare's free-tier
| features for rate limiting. If you haven't tried it, it
| only takes a couple minutes to enable and should be
| pretty set-and-forget for your API.
| petercooper wrote:
| I use this too and, not to detract from your enthusiasm, it's
| not exactly no-strings-attached. There's a token limit on free
| use and you can't use it for any commercial purposes. Luckily
| the pricing for unrestricted use is reasonable though at 2
| cents per million tokens.
|
| People will also want to note that it's LLM-powered which has
| pros and cons. One pro being that you can download and run
| their model yourself for non commercial use cases:
| https://huggingface.co/jinaai/reader-lm-1.5b
| NotACracker wrote:
| Pandoc
|
| http://www.cantoni.org/2019/01/27/converting-html-markdown-u...
| bbor wrote:
| For clarity: I'm a pandoc diehard (especially because it's
| written by a philosopher!) but it intentionally doesn't
| approach this level of functionality, AFAIK.
| jot wrote:
| This is great!
|
| If you also want to grab an accurate screenshot with the markdown
| of a webpage you can get both with Urlbox.
|
| We have a couple of free tools that use this feature:
|
| https://screenshotof.com https://url2text.com
| ssousa666 wrote:
| I have been looking for a similar lib to use in a Kotlin/Spring
| app - any recommendations? My specific use-case does not need to
| support sanitizing during the HTML -> MD conversion, as the HTML
| doc strings that I will be converting are sanitized during the
| extraction phase (using JSoup).
| inhumantsar wrote:
| i've made some modest contributions to Mozilla's Readability
| library and didn't see anything like their heuristics in this.
|
| are you using a separate library for that or did I miss something
| in this?
| inhumantsar wrote:
| oops, refreshed the page and saw other comments addressing
| this! nevermind!
| juliuskiesian wrote:
| One of the pain points of using these kinds of tools is handling
| syntax highlighted code blocks. How does html-to-markdown perform
| in such scenarios?
| JohannesKauf wrote:
| Yeah good point, that's actually difficult. They use many
| `<span>` html tags to color individual words and syntax.
|
| But I wrote logic to handle that. It probably needs to be
| adapted at some point, but works surprisingly well. Have a look
| at the testdata files ("code.in.html" and "code.out.md" files
| [1]).
|
| Feel free to give it a try & let me know if you notice any edge
| cases!
|
| [1] https://github.com/JohannesKaufmann/html-to-markdown/blob/ma...
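
To make the `<span>` problem above concrete: a highlighted code block is plain text chopped into many coloured fragments, so one basic strategy is to concatenate the text inside `<pre>` and ignore the wrapping tags. The sketch below shows that idea only. It is not the logic used in html-to-markdown, and `CodeTextExtractor` and `highlighted_code_to_markdown` are hypothetical names.

```python
from html.parser import HTMLParser


class CodeTextExtractor(HTMLParser):
    """Collect the raw text found inside <pre> elements, ignoring inner tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &lt; into < and so on
        self.pre_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            self.chunks.append(data)


def highlighted_code_to_markdown(html: str, language: str = "") -> str:
    parser = CodeTextExtractor()
    parser.feed(html)
    parser.close()
    code = "".join(parser.chunks).strip("\n")
    # Simplification: assumes the code itself contains no triple backticks.
    fence = "`" * 3
    return f"{fence}{language}\n{code}\n{fence}"
```

Real-world highlighters add complications this ignores, such as line-number columns, one `<div>` per line, or the language hidden in a class name.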
| lollobomb wrote:
| This is nice, I tried a plugin for pandoc in the past but it didn't
| really work well.
| hello_computer wrote:
| This is honorable work. Thank you.
___________________________________________________________________
(page generated 2024-11-09 23:00 UTC)