https://github.com/adbar/trafilatura

adbar / trafilatura Public

Python & command-line tool to gather text on the Web: web crawling/
scraping, extraction of text, metadata, comments

trafilatura.readthedocs.io

License

GPL-3.0 license

2 branches 26 tags

Latest commit

adbar, c6f4559, Aug 7, 2023
courlan changes: adapt parameter and tests (#389)

  * adapt parameter and tests
  * diagnose issue with is_on
  * revert
  * setup: darwin compatibility
  * update tests
  * cleanup

Git stats

  * 1,418 commits

Files

Name              Latest commit message                                                Commit time
.github           setup: fix and update CI workflows (#380)                            June 21, 2023 14:24
docs              include_links: convert relative URLs to absolute if possible (#377)  June 21, 2023 13:30
tests             courlan changes: adapt parameter and tests (#389)                    August 7, 2023 16:40
trafilatura       courlan changes: adapt parameter and tests (#389)                    August 7, 2023 16:40
.coveragerc       tests: improve coverage                                              January 25, 2022 15:48
.gitattributes    fix proogramming language detection                                  October 6, 2020 13:26
.gitignore        Improvements for Chinese web pages (#186)                            March 17, 2022 11:44
.readthedocs.yml  rtd config                                                           December 17, 2019 13:53
CITATION.cff      add CITATION.cff file                                                November 18, 2021 19:19
CONTRIBUTING.md   docs: updated contribution info                                      January 24, 2022 17:27
HISTORY.md        prepare version 1.6.1 (#371)                                         June 15, 2023 14:53
LICENSE           name changed                                                         April 8, 2019 14:04
MANIFEST.in       prepare new version: 1.2.0                                           March 7, 2022 12:41
README.rst        docs roundup (#364)                                                  June 8, 2023 14:01
pytest.ini        tests: remove tox setting                                            November 15, 2021 15:03
setup.py          courlan changes: adapt parameter and tests (#389)                    August 7, 2023 16:40

README.rst

 A Python package & command-line tool to gather text on the Web


 Description

Trafilatura is a Python package and command-line tool designed to
gather text on the Web. It includes discovery, extraction and text
processing components. Its main applications are web crawling,
downloads, scraping, and extraction of main texts, metadata and
comments. It aims to stay handy and modular: no database is
required, and the output can be converted to various commonly used
formats.

Going from raw HTML to essential parts can alleviate many problems
related to text quality, first by avoiding the noise caused by
recurring elements (headers, footers, links/blogroll, etc.) and second
by including information such as author and date in order to make
sense of the data. The extractor tries to strike a balance between
limiting noise (precision) and including all valid parts (recall). It
also has to be robust and reasonably fast: it runs in production on
millions of documents.

This tool can be useful for quantitative research in corpus
linguistics, natural language processing, computational social
science and beyond: it is relevant to anyone interested in data
science, information extraction, text mining, and scraping-intensive
use cases like search engine optimization, business analytics or
information security.

 Features

  * Web crawling and text discovery:
      o Focused crawling and politeness rules
      o Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
      o URL management (blacklists, filtering and de-duplication)

  * Seamless and parallel processing, online and offline:
      o URLs, HTML files or parsed HTML trees usable as input
      o Efficient and polite processing of download queues
      o Conversion of previously downloaded files

  * Robust and efficient extraction:
      o Main text (with LXML, common patterns and generic
        algorithms: jusText, fork of readability-lxml)
      o Metadata (title, author, date, site name, categories and
        tags)
      o Formatting and structural elements: paragraphs, titles,
        lists, quotes, code, line breaks, in-line text formatting
      o Comments (if applicable)

  * Output formats:
      o Text (minimal formatting or Markdown)
      o CSV (with metadata, tab-separated values)
      o JSON (with metadata)
      o XML (with metadata, text formatting and page structure)
        and TEI-XML

  * Optional add-ons:
      o Language detection on extracted content
      o Graphical user interface (GUI)
      o Speed optimizations

 Evaluation and alternatives

For more detailed results, see the benchmark and evaluation script. To
reproduce the tests, clone the repository, install all necessary
packages, and run the evaluation script with the data provided in the
tests directory.

750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8

Python Package                 Precision Recall Accuracy F-Score Diff.
html_text 0.5.2                0.529     0.958  0.554    0.682   2.2x
inscriptis 2.2.0 (html to txt) 0.534     0.959  0.563    0.686   3.5x
newspaper3k 0.2.8              0.895     0.593  0.762    0.713   12x
justext 3.0.0 (custom)         0.865     0.650  0.775    0.742   5.2x
boilerpy3 1.0.6 (article mode) 0.814     0.744  0.787    0.777   4.1x
baseline (text markup)         0.757     0.827  0.781    0.790   1x
goose3 3.1.9                   0.934     0.690  0.821    0.793   22x
readability-lxml 0.8.1         0.891     0.729  0.820    0.801   5.8x
news-please 1.5.22             0.898     0.734  0.826    0.808   61x
readabilipy 0.2.0              0.877     0.870  0.874    0.874   248x
trafilatura 1.2.2 (standard)   0.914     0.904  0.910    0.909   7.1x
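
The scores in the table can be related to each other as follows. This is a sketch that assumes the standard definitions over segments (a true positive is a text segment that was kept, a false positive a boilerplate segment that was kept); the evaluation script may differ in detail, and the function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Share of kept segments that are actual text."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of text segments that were kept."""
    return tp / (tp + fn)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all segments (text and boilerplate) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


if __name__ == "__main__":
    # Precision and recall taken from the trafilatura row of the table.
    print(round(f_score(0.914, 0.904), 3))  # 0.909
```

Because the table lists rounded values, recomputing the F-score from the printed precision and recall can differ in the last digit for some rows.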

 Other evaluations:

  * Most efficient open-source library in ScrapingHub's article
    extraction benchmark
  * Best overall tool according to Gaël Lejeune & Adrien Barbaresi,
    Bien choisir son outil d'extraction de contenu à partir du Web
    (2020, PDF, French)

 Usage and documentation

For more information please refer to the documentation:

  * Installation
  * Usage: On the command-line, With Python, With R
  * Core Python functions
  * Python Notebook Trafilatura Overview
  * Tutorials

For video tutorials see this YouTube playlist:

  * Web scraping how-tos and tutorials

 License

Trafilatura is distributed under the GNU General Public License v3.0.
If you wish to redistribute this library but feel bound by the
license conditions, please try interacting at arm's length,
multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

 Context

 Contributing

Contributions are welcome! See CONTRIBUTING.md for more information.
Bug reports can be filed on the dedicated page.

Many thanks to the contributors who submitted features and bugfixes!

 Roadmap

For planned enhancements and relevant milestones see the issues page.

 Author

This effort is part of methods to derive information from web
documents in order to build text databases for research (chiefly
linguistic analysis and natural language processing). Extracting and
pre-processing web texts to the exacting standards of scientific
research presents a substantial challenge for those who conduct such
research. Web corpus construction involves numerous design decisions,
and this software package can help facilitate text data collection
and enhance corpus quality.

  * Barbaresi, A. "Trafilatura: A Web Scraping Library and
    Command-Line Tool for Text Discovery and Extraction", Proceedings
    of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
  * Barbaresi, A. "Generic Web Content Extraction with Open-Source
    Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts,
    2019.
  * Barbaresi, A. "Efficient construction of metadata-enhanced web
    corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X),
    2016.

Reference DOI: 10.18653/v1/2021.acl-demo.15
Zenodo archive DOI: 10.5281/zenodo.3460969

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

You can contact me via my contact page or on GitHub.

 Software ecosystem

Trafilatura: Italian word for wire drawing.

Known uses of the software.

Corresponding posts on Bits of Language (blog).

Topics

nlp crawler text-mining news html-to-markdown scraping corpus 
news-aggregator text-extraction web-scraping rss-feed readability tei
html2text news-crawler corpus-builder corpus-tools article-extractor 
text-cleaning text-preprocessing

Stars: 1.4k
Watchers: 22
Forks: 140
Releases: 26 (latest: trafilatura-1.6.1, Jun 15, 2023)

Used by 755

Contributors 30

Languages

  * Python 100.0%
