[HN Gopher] Trafilatura: Python tool to gather text on the Web
___________________________________________________________________
Trafilatura: Python tool to gather text on the Web
Author : kevin_hu
Score : 42 points
Date : 2023-08-14 18:10 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bnewbold wrote:
| This tool is so great for robustly dealing with content in old
| and poorly formatted HTML. There are a lot of similar tools for
| extracting "the main text" from free-form HTML, but this was the
| most reliable in my experience, especially when dealing with web
| archives containing hand-written HTML back to the 1990s, working
| with non-English languages, etc.
| evanmcgillivray wrote:
 | What is the gap between this and Beautiful Soup?
| simonw wrote:
| The feature list answers that question pretty well:
| https://github.com/adbar/trafilatura#features
|
| Basically: you could implement all of this on top of
| BeautifulSoup - polite crawling policies, sitemap and feed
| parsing, URL de-duplication, parallel processing, download
| queues, heuristics for extracting just the main article
| content, metadata extraction, language detection... but it
| would require writing an enormous amount of extra code.
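 |
 | A minimal sketch of the high-level API (fetch_url and extract
 | are the documented entry points; the URL is a placeholder):
 |
 |     import trafilatura
 |
 |     # polite download; returns None on failure
 |     downloaded = trafilatura.fetch_url("https://example.com/post")
 |     if downloaded is not None:
 |         # keep only the main article content as plain text
 |         print(trafilatura.extract(downloaded))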
| rmbyrro wrote:
| This tool can extract data in a structured format from
| virtually any website, with any HTML structure.
|
 | With Beautiful Soup, you'd need to explicitly specify where
 | each piece of data lives by referencing HTML tags, ids,
 | classes, etc., for each website you'd want to process.
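 |
 | For comparison, a hand-rolled Beautiful Soup extractor needs
 | site-specific selectors (the class names below are made up
 | purely for illustration):
 |
 |     import requests
 |     from bs4 import BeautifulSoup
 |
 |     html = requests.get("https://example.com/post").text
 |     soup = BeautifulSoup(html, "html.parser")
 |     # selectors like these break on any other site's markup
 |     title = soup.select_one("h1.article-title").get_text(strip=True)
 |     body = soup.select_one("div.article-body").get_text(" ", strip=True)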
| dmillar wrote:
| Maybe bs4 + newspaper3k rolled into one? But still, what's the
| gap?
| rolisz wrote:
 | Cool tool. I used it for a scraping project and it performed
 | quite well at extracting clean text and the date.
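 |
 | The date comes out alongside the text if you ask for JSON
 | output (a sketch based on the documented output_format option):
 |
 |     import json
 |     import trafilatura
 |
 |     downloaded = trafilatura.fetch_url("https://example.com/post")
 |     if downloaded:
 |         result = trafilatura.extract(downloaded, output_format="json")
 |         if result:
 |             data = json.loads(result)
 |             print(data["date"], data["title"])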
| dominick-cc wrote:
 | I wish there were a web service that used this tool to scrape
 | nicely-formatted plain text from any website, then archived it
 | and served it as a super basic web reader.
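 |
 | A tiny version of that service is easy to sketch with Flask
 | (a hypothetical /read endpoint, no archiving layer):
 |
 |     from flask import Flask, Response, abort, request
 |     import trafilatura
 |
 |     app = Flask(__name__)
 |
 |     @app.route("/read")
 |     def read():
 |         url = request.args.get("url")
 |         if not url:
 |             abort(400)  # missing ?url= parameter
 |         downloaded = trafilatura.fetch_url(url)
 |         text = trafilatura.extract(downloaded) if downloaded else None
 |         if text is None:
 |             abort(404)  # nothing extractable at that URL
 |         return Response(text, mimetype="text/plain")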
| 0cf8612b2e1e wrote:
 | ArchiveBox (https://archivebox.io/) will create a local dump
 | of any site in a multitude of formats, from raw HTML and
 | printed PDF to extracted body text. It also has an option to
 | ask the Internet Archive to trigger a scrape of the page.
| nghota wrote:
 | Check out the nghota.com API. It is able to pull the main
 | text out of most non-ecommerce web pages and return it to
 | you as JSON.
| bravura wrote:
| In general I'd be curious to try this, but your homepage is
| not very convincing.
|
| The "demo" doesn't look like typing, it's a fade right, and
| it's painfully slow. And then, there's no library, it's just
| 'import requests', so even the demo is extra long. (Why not
| show curl then?)
|
| Also, are there any benchmarks? Why should I take the time to
| evaluate this myself against existing open-source tools? It
| seems like that should be your responsibility, not mine, to
| spend the time doing a detailed comparison and evaluation. In
| a way that feels open and trustworthy.
|
| I respect what you are doing and share this feedback from the
| heart.
| dleeftink wrote:
| Not sure how it fares nowadays, but I used to employ Mercury
| Reader/API for this, now called Postlight Reader[1]. While not
| perfect, I found it to work for most daily reading needs.
|
| [1]: https://reader.postlight.com/
| mxuribe wrote:
| You sort of, kind of, maybe just asked for roughly what RSS
| (Really Simple Syndication) provides...although your wish is
| more of a "pull", while RSS is more of a "push" in content
| access/distribution. :-) Don't get me wrong, I'm in agreement
| with you. I wish every website, web app, well, pretty much
| everything digital had an automated RSS feed available to
| consume and subscribe to!
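 |
 | Trafilatura itself leans that way: its feeds helper can
 | discover a site's RSS/Atom feeds and list the article URLs
 | (a sketch using the documented find_feed_urls function):
 |
 |     from trafilatura import feeds
 |
 |     # article URLs discovered via the site's RSS/Atom feeds
 |     urls = feeds.find_feed_urls("https://example.com")
 |     print(urls[:5])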
| [deleted]
___________________________________________________________________
(page generated 2023-08-14 23:01 UTC)