[HN Gopher] Show HN: Defuddle, an HTML-to-Markdown alternative t...
___________________________________________________________________
Show HN: Defuddle, an HTML-to-Markdown alternative to Readability
Defuddle is an open-source JS library I built to parse and extract
the main content and metadata from web pages. It can also return
the content as Markdown. I built Defuddle while working on
Obsidian Web Clipper[1] (also MIT-licensed) because Mozilla's
Readability[2] appears to be mostly abandoned, and didn't work well
for many sites. It's still very much a work in progress, but I
thought I'd share it today, in light of the announcement that
Mozilla is shutting down Pocket. This library could be helpful to
anyone building a read-it-later app. Defuddle is also available as
a CLI: https://github.com/kepano/defuddle-cli
[1] https://github.com/obsidianmd/obsidian-clipper
[2] https://github.com/mozilla/readability
Author : kepano
Score : 39 points
Date : 2025-05-22 21:40 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| busymom0 wrote:
| In the playground, after I enter a url, I can't seem to figure
| out how to submit it to fetch the url? I tried pressing the
| return key on iOS keyboard but it didn't do anything. Am I
| missing something?
| kepano wrote:
| The input is there to test the url option -- which I admit is a
| bit confusing, so I have removed it for now. I haven't found a
| good and free way to proxy requests from a GitHub page (yet).
| rcarmo wrote:
| The Python analogues seem to be well maintained. I did my own
| implementation of the Readability algorithm years ago and dropped
| it in favor of them, and I have a few scrapers going strong with
| regular updates.
| kepano wrote:
| Are there any in particular you can recommend?
| khimaros wrote:
| not parent, but this one looks maintained
| https://github.com/buriy/python-readability
| fkfyshroglk wrote:
| For those not in the know:
| [Readability](https://github.com/mozilla/readability)
| billconan wrote:
| Are you using AI models behind the scenes? I saw Gemini and
| others in the code. I am asking mainly to understand the cost of
| using yours vs. Readability. Thanks!
| kepano wrote:
| No it's all rules-based. I think the code you're referring to
| is "extractors", which are website-specific rules that I'm
| working on to standardize the output from sites with comments
| threads (e.g. HN, Reddit) and conversational chats (ChatGPT,
| Claude, Gemini).
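
The idea of website-specific "extractors" described above can be sketched roughly as follows. This is a minimal illustration in Python, not Defuddle's actual code: the names `Extractor`, `EXTRACTORS`, `pick_extractor`, and `extract`, the `class="comment"` markup, and the regex-based parsing are all assumptions made for the example (a real implementation would work on a parsed DOM).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class Extractor:
    """A hypothetical site-specific rule set, selected by hostname."""
    name: str
    host_pattern: re.Pattern
    extract: Callable[[str], dict]


def extract_generic(html: str) -> dict:
    # Fallback rule: crudely strip tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", html)
    return {"kind": "article", "content": " ".join(text.split())}


def extract_thread(html: str) -> dict:
    # Assumed markup: each comment sits in <div class="comment">...</div>.
    comments = re.findall(r'<div class="comment">(.*?)</div>', html, flags=re.S)
    return {"kind": "thread", "comments": [c.strip() for c in comments]}


# Registry of site-specific rules; the entry here is illustrative only.
EXTRACTORS = [
    Extractor(
        name="hackernews",
        host_pattern=re.compile(r"(^|\.)news\.ycombinator\.com$"),
        extract=extract_thread,
    ),
]


def pick_extractor(url: str) -> Optional[Extractor]:
    # Match the URL's hostname against each registered pattern.
    host = urlparse(url).hostname or ""
    for extractor in EXTRACTORS:
        if extractor.host_pattern.search(host):
            return extractor
    return None


def extract(url: str, html: str) -> dict:
    # Use a site-specific rule if one matches, else the generic fallback.
    extractor = pick_extractor(url)
    result = extractor.extract(html) if extractor else extract_generic(html)
    result["extractor"] = extractor.name if extractor else "generic"
    return result
```

The point of the design is that every rule is deterministic and keyed on the site, so pages with comment threads or chat transcripts can be normalized to one output shape without any model calls.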
| tmpfs wrote:
| Interesting, as I was researching this recently and was certainly
| not impressed with the quality of the Readability implementations
| in various languages. Although Readability.js was clearly the best,
| it being JavaScript didn't suit my project.
|
| In the end I found the Python trafilatura library to extract the
| best quality content with accurate metadata.
|
| You might want to compare your implementation to trafilatura to see
| if there is room for improvement.
___________________________________________________________________
(page generated 2025-05-22 23:00 UTC)