hngopher.com

       [HN Gopher] Show HN: Defuddle, an HTML-to-Markdown alternative t...
       ___________________________________________________________________
        
       Show HN: Defuddle, an HTML-to-Markdown alternative to Readability
        
       Defuddle is an open-source JS library I built to parse and extract
       the main content and metadata from web pages. It can also return
       the content as Markdown.

       I built Defuddle while working on Obsidian Web Clipper[1] (also
       MIT-licensed) because Mozilla's Readability[2] appears to be mostly
       abandoned, and didn't work well for many sites.

       It's still very much a work in progress, but I thought I'd share it
       today, in light of the announcement that Mozilla is shutting down
       Pocket. This library could be helpful to anyone building a
       read-it-later app.

       Defuddle is also available as a CLI:
       https://github.com/kepano/defuddle-cli

       [1] https://github.com/obsidianmd/obsidian-clipper
       [2] https://github.com/mozilla/readability
        
       Author : kepano
       Score  : 39 points
       Date   : 2025-05-22 21:40 UTC (1 hour ago)
        
        
       | busymom0 wrote:
       | In the playground, after I enter a url, I can't seem to figure
       | out how to submit it to fetch the url? I tried pressing the
       | return key on iOS keyboard but it didn't do anything. Am I
       | missing something?
        
         | kepano wrote:
         | The input is there to test the url option -- which I admit is a
         | bit confusing, so I have removed it for now. I haven't found a
         | good and free way to proxy requests from a GitHub page (yet).
        
       | rcarmo wrote:
       | The Python analogues seem to be well maintained. I did my own
       | implementation of the Readability algorithm years ago and dropped
       | it in favor of them, and I have a few scrapers going strong with
       | regular updates.
        
         | kepano wrote:
         | Are there any in particular you can recommend?
        
           | khimaros wrote:
           | not parent, but this one looks maintained
           | https://github.com/buriy/python-readability
        
       | fkfyshroglk wrote:
       | For those not in the know:
       | [Readability](https://github.com/mozilla/readability)
        
       | billconan wrote:
       | Are you using AI models behind the scenes? I saw Gemini and
       | others in the code. I am asking mainly to understand the cost of
       | using yours vs. Readability. Thanks!
        
         | kepano wrote:
         | No it's all rules-based. I think the code you're referring to
         | is "extractors", which are website-specific rules that I'm
         | working on to standardize the output from sites with comment
         | threads (e.g. HN, Reddit) and conversational chats (ChatGPT,
         | Claude, Gemini).
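
The "extractors" idea described above (rules keyed to a specific site rather than a model) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Defuddle's actual code: the SITE_RULES table, the class names in it, and the extract_comments helper are all hypothetical, and the class names have not been verified against live markup.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _ClassTextCollector(HTMLParser):
    """Collect the text of every element whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.depth and tag == "br":
                self.chunks[-1] += " "  # treat a line break as whitespace
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested element inside a match
        elif self.wanted in classes:
            self.depth = 1              # entering a new matching element
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


def texts_by_class(html, class_name):
    """Return whitespace-normalized text for each element carrying class_name."""
    parser = _ClassTextCollector(class_name)
    parser.feed(html)
    parser.close()
    return [" ".join(chunk.split()) for chunk in parser.chunks]


# Hypothetical per-site rules: hostname -> class that marks one comment body.
# The class names are assumptions for illustration only.
SITE_RULES = {
    "news.ycombinator.com": "commtext",
    "old.reddit.com": "usertext-body",
}


def extract_comments(url, html):
    """Apply the site-specific rule for url, or return None if there is none."""
    host = urlparse(url).hostname or ""
    class_name = SITE_RULES.get(host)
    if class_name is None:
        return None  # caller would fall back to generic content extraction
    return [{"text": text} for text in texts_by_class(html, class_name)]
```

The point of the design is that every site maps to the same output shape (a list of {"text": ...} records), so downstream code such as a Markdown renderer does not need to know which site the page came from.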
        
       | tmpfs wrote:
       | Interesting, as I was researching this recently and was certainly
       | not impressed with the quality of the Readability implementations
       | in various languages. Although Readability.js was clearly the
       | best, it being JavaScript didn't suit my project.
       | 
       | In the end I found that the Python trafilatura library extracted
       | the best-quality content, with accurate metadata.
       | 
       | You might want to compare your implementation to trafilatura to see
       | if there is room for improvement.
        
       ___________________________________________________________________
       (page generated 2025-05-22 23:00 UTC)