[HN Gopher] Trafilatura: Python tool to gather text on the Web
       ___________________________________________________________________
        
       Trafilatura: Python tool to gather text on the Web
        
       Author : kevin_hu
       Score  : 42 points
       Date   : 2023-08-14 18:10 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bnewbold wrote:
       | This tool is so great for robustly dealing with content in old
       | and poorly formatted HTML. There are a lot of similar tools for
       | extracting "the main text" from free-form HTML, but this was the
       | most reliable in my experience, especially when dealing with web
       | archives containing hand-written HTML back to the 1990s, working
       | with non-English languages, etc.
        
       | evanmcgillivray wrote:
       | What is the gap between this and beautiful soup?
        
         | simonw wrote:
         | The feature list answers that question pretty well:
         | https://github.com/adbar/trafilatura#features
         | 
         | Basically: you could implement all of this on top of
         | BeautifulSoup - polite crawling policies, sitemap and feed
         | parsing, URL de-duplication, parallel processing, download
         | queues, heuristics for extracting just the main article
         | content, metadata extraction, language detection... but it
         | would require writing an enormous amount of extra code.
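
         As a rough illustration of just one item on that list, URL
         de-duplication, here is a minimal standard-library sketch.
         It is not how Trafilatura does it internally; the names
         normalize_url, deduplicate and TRACKING_PREFIXES are
         hypothetical, and the normalization rules are assumptions
         picked for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes to ignore (an assumption).
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` used only for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/post/" and "/post" as the same page; keep the root as "/".
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so their order is irrelevant.
    query_pairs = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, urlencode(query_pairs), ""))


def deduplicate(urls):
    """Yield each distinct page once, preserving first-seen order."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```

         Even this one feature needs decisions about trailing
         slashes, parameter order and tracking parameters, which
         gives a feel for how much code the full list implies.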
        
         | rmbyrro wrote:
         | This tool can extract data in a structured format from
         | virtually any website, with any HTML structure.
         | 
         | With Beautiful Soup, you'd need to explicitly specify where
         | each piece of data lives by referencing HTML tags, ids,
         | classes, etc., for each website you want to process.
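
         To make the contrast concrete, here is a toy sketch of a
         structure-agnostic extractor built on the standard
         library's html.parser. It keeps text blocks that contain
         enough words and skips obvious boilerplate containers, with
         no per-site selectors. This is only an assumed, simplified
         heuristic for illustration; Trafilatura's real extraction
         logic is considerably more elaborate and is not reproduced
         here. BlockCollector, main_text, the tag sets and the
         min_words threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real tools use far richer rules.
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BlockCollector(HTMLParser):
    """Collect the text of each block, ignoring boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in document order
        self._buffer = []     # text fragments of the block being read
        self._skip_depth = 0  # > 0 while inside a SKIP_TAGS element

    def flush_block(self):
        """Close the current block, collapsing runs of whitespace."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            # Flushing on open tags also copes with unclosed <p> elements.
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def main_text(html: str, min_words: int = 8) -> str:
    """Return the blocks with at least `min_words` words, one per line."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush_block()
    return "\n".join(b for b in collector.blocks if len(b.split()) >= min_words)
```

         The same main_text call can be applied to a modern page
         with nav and footer elements or to a 1990s table layout,
         whereas a selector-based scraper would need different
         rules for each. The word-count threshold is crude and will
         drop short paragraphs, which is part of why dedicated tools
         exist.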
        
         | dmillar wrote:
         | Maybe bs4 + newspaper3k rolled into one? But still, what's the
         | gap?
        
       | rolisz wrote:
       | Cool tool, I used it for a scraping project and it performed
       | quite well for extracting clean text and the date.
        
       | dominick-cc wrote:
       | I wish there was a web service that used this tool to scrape
       | nicely-formatted plain text from any website, then archive it and
       | serve it as a super basic web reader.
        
         | 0cf8612b2e1e wrote:
         | ArchiveBox (https://archivebox.io/) will create a local dump
         | of any site in a multitude of formats, including raw HTML,
         | printed PDF, and extracted body text. It also has an option
         | to ask the Internet Archive to trigger a scrape of the page.
        
         | nghota wrote:
         | Check out the nghota.com API. It can pull out the main text
         | from most non-ecommerce web pages and return it to you as
         | JSON.
        
           | bravura wrote:
           | In general I'd be curious to try this, but your homepage is
           | not very convincing.
           | 
           | The "demo" doesn't look like typing, it's a fade right, and
           | it's painfully slow. And then, there's no library, it's just
           | 'import requests', so even the demo is extra long. (Why not
           | show curl then?)
           | 
           | Also, are there any benchmarks? Why should I take the time to
           | evaluate this myself against existing open-source tools? It
           | seems like it should be your responsibility, not mine, to
           | spend the time doing a detailed comparison and evaluation,
           | in a way that feels open and trustworthy.
           | 
           | I respect what you are doing and share this feedback from the
           | heart.
        
         | dleeftink wrote:
         | Not sure how it fares nowadays, but I used to employ Mercury
         | Reader/API for this, now called Postlight Reader[1]. While not
         | perfect, I found it to work for most daily reading needs.
         | 
         | [1]: https://reader.postlight.com/
        
         | mxuribe wrote:
         | You sort of, kind of, maybe just asked for roughly what RSS
         | (Really Simple Syndication) provides...although your wish is
         | more of a "pull", while RSS is more of a "push" in content
         | access/distribution. :-) Don't get me wrong, I'm in agreement
         | with you. I wish every website, web app, well, pretty much
         | everything digital had an automated RSS feed available to
         | consume and subscribe to!
        
       ___________________________________________________________________
       (page generated 2023-08-14 23:01 UTC)