https://github.com/adbar/trafilatura
adbar / trafilatura (Public)

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Documentation: trafilatura.readthedocs.io
License: GPL-3.0
1.4k stars, 140 forks, 2 branches, 26 tags (default branch: master)
Latest commit: c6f4559, Aug 7, 2023

courlan changes: adapt parameter and tests (#389)

* adapt parameter and tests
* diagnose issue with is_on
* revert
* setup: darwin compatibility
* update tests
* cleanup

1,418 commits

Files

Name              Latest commit message                                               Commit time
.github           setup: fix and update CI workflows (#380)                           June 21, 2023 14:24
docs              include_links: convert relative URLs to absolute if possible (#377) June 21, 2023 13:30
tests             courlan changes: adapt parameter and tests (#389)                   August 7, 2023 16:40
trafilatura       courlan changes: adapt parameter and tests (#389)                   August 7, 2023 16:40
.coveragerc       tests: improve coverage                                             January 25, 2022 15:48
.gitattributes    fix proogramming language detection                                 October 6, 2020 13:26
.gitignore        Improvements for Chinese web pages (#186)                           March 17, 2022 11:44
.readthedocs.yml  rtd config                                                          December 17, 2019 13:53
CITATION.cff      add CITATION.cff file                                               November 18, 2021 19:19
CONTRIBUTING.md   docs: updated contribution info                                     January 24, 2022 17:27
HISTORY.md        prepare version 1.6.1 (#371)                                        June 15, 2023 14:53
LICENSE           name changed                                                        April 8, 2019 14:04
MANIFEST.in       prepare new version: 1.2.0                                          March 7, 2022 12:41
README.rst        docs roundup (#364)                                                 June 8, 2023 14:01
pytest.ini        tests: remove tox setting                                           November 15, 2021 15:03
setup.py          courlan changes: adapt parameter and tests (#389)                   August 7, 2023 16:40

README.rst

A Python package & command-line tool to gather text on the Web

Description

Trafilatura is a Python package and command-line tool designed to gather text on the Web.
It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast: it runs in production on millions of documents.

This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
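The balance between precision and recall mentioned above can be made concrete with a small, standard-library-only sketch. It is not taken from the Trafilatura code base or its evaluation script: the function names and the segment counts are hypothetical, and it assumes that the F-Score column of the README's evaluation table is the usual harmonic mean of precision and recall (F1).

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of extracted segments that are valid text (limits noise)."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of valid text segments that were actually extracted."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def f_score(prec: float, rec: float) -> float:
    """Harmonic mean of precision and recall (F1, assumed here)."""
    total = prec + rec
    return 2 * prec * rec / total if total else 0.0


# Hypothetical counts for illustration only, not taken from the benchmark:
# 90 text segments kept correctly, 10 boilerplate segments kept by mistake,
# 20 text segments missed.
prec = precision(true_pos=90, false_pos=10)
rec = recall(true_pos=90, false_neg=20)
print(f"precision={prec:.3f} recall={rec:.3f} f_score={f_score(prec, rec):.3f}")
```

Under the F1 assumption, calling `f_score(0.914, 0.904)` with the precision and recall reported for trafilatura 1.2.2 in the evaluation table gives about 0.909, which agrees with the F-Score listed there.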
Features

* Web crawling and text discovery:
   o Focused crawling and politeness rules
   o Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
   o URL management (blacklists, filtering and de-duplication)
* Seamless and parallel processing, online and offline:
   o URLs, HTML files or parsed HTML trees usable as input
   o Efficient and polite processing of download queues
   o Conversion of previously downloaded files
* Robust and efficient extraction:
   o Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml)
   o Metadata (title, author, date, site name, categories and tags)
   o Formatting and structural elements: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
   o Comments (if applicable)
* Output formats:
   o Text (minimal formatting or Markdown)
   o CSV (with metadata, tab-separated values)
   o JSON (with metadata)
   o XML (with metadata, text formatting and page structure) and TEI-XML
* Optional add-ons:
   o Language detection on extracted content
   o Graphical user interface (GUI)
   o Speed optimizations

Evaluation and alternatives

For more detailed results, see the benchmark and evaluation script. To reproduce the tests, clone the repository, install all necessary packages and run the evaluation script with the data provided in the tests directory.

750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8

Python Package                   Precision  Recall  Accuracy  F-Score  Diff.
html_text 0.5.2                  0.529      0.958   0.554     0.682    2.2x
inscriptis 2.2.0 (html to txt)   0.534      0.959   0.563     0.686    3.5x
newspaper3k 0.2.8                0.895      0.593   0.762     0.713    12x
justext 3.0.0 (custom)           0.865      0.650   0.775     0.742    5.2x
boilerpy3 1.0.6 (article mode)   0.814      0.744   0.787     0.777    4.1x
baseline (text markup)           0.757      0.827   0.781     0.790    1x
goose3 3.1.9                     0.934      0.690   0.821     0.793    22x
readability-lxml 0.8.1           0.891      0.729   0.820     0.801    5.8x
news-please 1.5.22               0.898      0.734   0.826     0.808    61x
readabilipy 0.2.0                0.877      0.870   0.874     0.874    248x
trafilatura 1.2.2 (standard)     0.914      0.904   0.910     0.909    7.1x

Other evaluations:

* Most efficient open-source library in ScrapingHub's article extraction benchmark
* Best overall tool according to Gaël Lejeune & Adrien Barbaresi, Bien choisir son outil d'extraction de contenu à partir du Web (2020, PDF, French)

Usage and documentation

For more information please refer to the documentation:

* Installation
* Usage: On the command-line, With Python, With R
* Core Python functions
* Python Notebook Trafilatura Overview
* Tutorials

For video tutorials see this YouTube playlist:

* Web scraping how-tos and tutorials

License

Trafilatura is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

Context

Contributing

Contributions are welcome! See CONTRIBUTING.md for more information. Bug reports can be filed on the dedicated page.

Many thanks to the contributors who submitted features and bugfixes!

Roadmap

For planned enhancements and relevant milestones, see the issues page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing).
Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

* Barbaresi, A. "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction", Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, pp. 122-131.
* Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
* Barbaresi, A. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

Reference DOI: 10.18653/v1/2021.acl-demo.15
Zenodo archive DOI: 10.5281/zenodo.3460969

    @inproceedings{barbaresi-2021-trafilatura,
      title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
      author = "Barbaresi, Adrien",
      booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
      pages = "122--131",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2021.acl-demo.15",
      year = 2021,
    }

You can contact me via my contact page or on GitHub.

Software ecosystem

* Trafilatura: Italian word for wire drawing.
* Known uses of the software.
* Corresponding posts on Bits of Language (blog).
About

Topics: nlp, crawler, text-mining, news, html-to-markdown, scraping, corpus, news-aggregator, text-extraction, web-scraping, rss-feed, readability, tei, html2text, news-crawler, corpus-builder, corpus-tools, article-extractor, text-cleaning, text-preprocessing

Releases: 26 (latest: trafilatura-1.6.1, Jun 15, 2023)
Watchers: 22
Used by: 755
Contributors: 30
Languages: Python 100.0%