https://github.com/NVIDIA/NeMo-Curator

NVIDIA / NeMo-Curator Public

Scalable data pre processing and curation toolkit for LLMs

License

Apache-2.0 license
NeMo Curator

 

 The GPU-Accelerated Open Source Framework for Efficient Large
Language Model Data Curation 


NeMo Curator is a Python library designed for fast and scalable
dataset preparation and curation for large language model (LLM) use
cases such as foundation model pretraining, domain-adaptive
pretraining (DAPT), supervised fine-tuning (SFT), and
parameter-efficient fine-tuning (PEFT). It accelerates data curation
by leveraging GPUs with Dask and RAPIDS, resulting in significant
time savings. The library provides a customizable and modular
interface, simplifying pipeline expansion and accelerating model
convergence through the preparation of high-quality tokens.

At the core of NeMo Curator is the DocumentDataset, which serves as
the main dataset class. It acts as a straightforward wrapper around a
Dask DataFrame. The library offers easy-to-use methods for expanding
the functionality of your curation pipeline while eliminating
scalability concerns.

Key Features

 

NeMo Curator provides a collection of scalable data-mining modules.
Some of the key features include:

  * Data download and text extraction

      + Default implementations for downloading and extracting Common
        Crawl, Wikipedia, and ArXiv data
      + Easily customize the download and extraction and extend to
        other datasets
  * Language identification and separation with fastText and pycld2

  * Text reformatting and cleaning to fix unicode decoding errors via
    ftfy

  * Quality filtering

      + Multilingual heuristic-based filtering
      + Classifier-based filtering via fastText
  * Document-level deduplication

      + Exact and fuzzy (near-identical) deduplication are
        accelerated using cuDF and Dask
      + For fuzzy deduplication, our implementation follows the
        method described in Microsoft Turing NLG 530B
      + For semantic deduplication, our implementation follows the
        method described in SemDeDup by Meta AI (FAIR)
        facebookresearch/SemDeDup
  * Multilingual downstream-task decontamination following the
    approach of OpenAI GPT3 and Microsoft Turing NLG 530B

  * Distributed data classification

      + Multi-node, multi-GPU classifier inference
      + Provides sophisticated domain and quality classification
      + Flexible interface for extending to your own classifier
        network
  * Personally identifiable information (PII) redaction for removing
    addresses, credit card numbers, social security numbers, and more
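
To make the deduplication bullets above more concrete, here is a minimal, standard-library-only sketch of the two ideas: exact deduplication by hashing the full text, and fuzzy deduplication by comparing MinHash signatures built from word shingles. It is an illustration only and not NeMo Curator's GPU implementation; the function names, the shingle size of 3, and the 64 hash functions are assumptions chosen for this example.

```python
import hashlib


def exact_key(text: str) -> str:
    """Identical documents hash to the same key, so duplicates share a key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each salted hash function, keep the smallest digest over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river"
doc_c = "gpu accelerated data curation pipelines scale to trillions of tokens"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated texts score low
```

In a real fuzzy deduplication pipeline the signatures are typically bucketed (for example with locality-sensitive hashing) so that not every pair of documents has to be compared; that step is omitted here for brevity.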

These modules offer flexibility and permit reordering, with only a
few exceptions. In addition, the NeMo Framework Launcher provides
pre-built pipelines that can serve as a foundation for your
customization use cases.
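
As a concrete illustration of one module type listed above, the sketch below shows a simple regular-expression approach to PII redaction. It is a deliberately small stand-in and makes no claim about how NeMo Curator detects PII; the two patterns (US-style social security numbers and 16-digit card numbers), the placeholder tokens, and the `redact_pii` name are assumptions for this example.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far more care.
PII_PATTERNS = [
    ("<CREDIT_CARD>", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("<SSN>", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def redact_pii(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact_pii("SSN 123-45-6789, card 4111 1111 1111 1111."))
# prints: SSN <SSN>, card <CREDIT_CARD>.
```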

Resources

 

  * Documentation
  * Examples
  * Tutorials
  * Blog posts
      + Curating Trillion-Token Datasets: Introducing NVIDIA NeMo
        Data Curator
      + Scale and Curate High-Quality Datasets for LLM Training with
        NVIDIA NeMo Curator
      + Curating Custom Datasets for LLM Training with NVIDIA NeMo
        Curator
      + Curating Custom Datasets for LLM Parameter-Efficient
        Fine-Tuning with NVIDIA NeMo Curator
      + Streamlining Data Processing for Domain Adaptive Pretraining
        with NVIDIA NeMo Curator

Get Started

 

This section explains how to install NeMo Curator and use the Python
library, Python modules, and CLI scripts. It also includes a list of
tutorials to help you get started right away. Finally, this section
explains how to use the NeMo Framework Launcher as an alternative
method for interfacing with NeMo Curator.

Install NeMo Curator

 

Requirements

 

Before installing NeMo Curator, ensure that the following
requirements are met:

  * Python 3.10
  * Ubuntu 22.04/20.04
  * NVIDIA GPU (optional)
      + Volta(tm) or higher (compute capability 7.0+)
      + CUDA 12 (or above)
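
A quick pre-install check can catch an unsupported interpreter early. The sketch below is not part of NeMo Curator; `meets_python_requirement` is a hypothetical helper that compares a version tuple against the Python 3.10 requirement listed above. Treating the requirement as exactly 3.10 is an assumption, since other releases of the library may support other versions.

```python
import sys

REQUIRED_PYTHON = (3, 10)  # taken from the requirements list above


def meets_python_requirement(version_info=sys.version_info) -> bool:
    """Return True when the major.minor version equals the required 3.10."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if __name__ == "__main__":
    status = "OK" if meets_python_requirement() else "unsupported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```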

You can install NeMo Curator in one of three ways:

 1. From PyPI
 2. From source
 3. Through the NeMo Framework container

From PyPI

 

To install the CPU-only modules:

pip install cython
pip install nemo-curator

To install the CPU and CUDA-accelerated modules:

pip install cython
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"

From Source

 

 1. Clone the NeMo Curator repository in GitHub.

    git clone https://github.com/NVIDIA/NeMo-Curator.git
    cd NeMo-Curator

 2. Install the modules that you need.

    To install the CPU-only modules:

    pip install cython
    pip install .

    To install the CPU and CUDA-accelerated modules:

    pip install cython
    pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Using Nightly Dependencies for RAPIDS

 

You can also install NeMo Curator against the RAPIDS nightly builds
by setting the environment variable RAPIDS_NIGHTLY=1.

# installing from pypi
RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple "nemo-curator[cuda12x]"

# installing from source
RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple ".[cuda12x]"

When the RAPIDS_NIGHTLY variable is set to 0 (the default), the
installation uses the stable version of RAPIDS.

From the NeMo Framework Container

 

The latest release of NeMo Curator comes preinstalled in the NeMo
Framework Container. If you want the latest commit inside the
container, you can reinstall NeMo Curator using:

pip uninstall nemo-curator
rm -r /opt/NeMo-Curator
git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com "/opt/NeMo-Curator[cuda12x]"

These commands perform a from-source install inside the container.
For the CPU-only variant, follow the from-source instructions above.

Use NeMo Curator

 

Python API Quick Example

 

The following snippet demonstrates how to create a small data
curation pipeline that downloads and curates a small subset of the
Common Crawl dataset.

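# Import paths follow the nemo_curator package layout; verify them against your installed version
from nemo_curator import Modify, ScoreFilter, Sequential, TaskDecontamination
from nemo_curator.download import download_common_crawl
from nemo_curator.filters import FastTextQualityFilter, WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

# Download your dataset (url_limit=10 keeps the download small)
dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
# Build your pipeline
curation_pipeline = Sequential([
  # Fix unicode
  Modify(UnicodeReformatter()),
  # Discard short records
  ScoreFilter(WordCountFilter(min_words=80)),
  # Discard low-quality records (model.bin is a fastText model you supply)
  ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
  # Discard records that overlap with evaluation benchmarks to prevent test set leakage
  TaskDecontamination([Winogrande(), Squad(), TriviaQA()])
])
# Execute the pipeline on your dataset
curated_dataset = curation_pipeline(dataset)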
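
The pattern in the snippet above (wrap a modifier or a filter, then chain the steps) can be illustrated without NeMo Curator installed. The following is a minimal, standard-library-only sketch that operates on a plain list of strings. `ToySequential`, `ToyModify`, and `ToyScoreFilter` are hypothetical stand-ins, not the library's classes, and the real modules operate on a DocumentDataset rather than a list.

```python
from typing import Callable, Iterable, List


class ToyModify:
    """Apply a text -> text function to every document."""

    def __init__(self, modifier: Callable[[str], str]):
        self.modifier = modifier

    def __call__(self, docs: List[str]) -> List[str]:
        return [self.modifier(doc) for doc in docs]


class ToyScoreFilter:
    """Score every document and keep those whose score passes the predicate."""

    def __init__(self, score: Callable[[str], float], keep: Callable[[float], bool]):
        self.score = score
        self.keep = keep

    def __call__(self, docs: List[str]) -> List[str]:
        return [doc for doc in docs if self.keep(self.score(doc))]


class ToySequential:
    """Run each step in order, feeding the output of one into the next."""

    def __init__(self, steps: Iterable[Callable[[List[str]], List[str]]]):
        self.steps = list(steps)

    def __call__(self, docs: List[str]) -> List[str]:
        for step in self.steps:
            docs = step(docs)
        return docs


pipeline = ToySequential([
    # Normalize whitespace (a stand-in for a text-cleaning modifier)
    ToyModify(lambda text: " ".join(text.split())),
    # Keep documents with at least 3 words (a stand-in for a word-count filter)
    ToyScoreFilter(score=lambda text: len(text.split()), keep=lambda n: n >= 3),
])

curated = pipeline(["too short", "this   one has\nenough words", ""])
print(curated)  # prints: ['this one has enough words']
```

Keeping each step a callable with the same input and output type is what makes the steps easy to reorder or swap.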

Explore NeMo Curator Tutorials

 

To get started with NeMo Curator, you can follow the tutorials in the
tutorials directory of the repository. These tutorials include:

  * tinystories which focuses on data curation for training LLMs from
    scratch.
  * peft-curation which focuses on data curation for LLM
    parameter-efficient fine-tuning (PEFT) use-cases.
  * distributed_data_classification which focuses on using the
    quality and domain classifiers to help with data annotation.
  * single_node_tutorial which demonstrates an end-to-end data
    curation pipeline for curating Wikipedia data in Thai.

Access Python Modules

 

The NeMo Curator section of the NeMo Framework User Guide provides
in-depth information about how the Python modules work. The examples
directory in the GitHub repository provides scripts that showcase
these modules.

Use CLI Scripts

 

NeMo Curator also offers CLI scripts for you to use. The scripts in
nemo_curator/scripts map closely to the supplied Python modules.
Refer to the NeMo Framework User Guide for more information about the
Python modules and scripts.

Use NeMo Framework Launcher

 

As an alternative method for interfacing with NeMo Curator, you can
use the NeMo Framework Launcher. The launcher enables you to easily
configure the parameters and cluster. It can also automatically
generate the SLURM batch scripts that wrap around the CLI scripts
required to run your pipeline.

In addition, other methods are available to run NeMo Curator on
SLURM. For example, refer to the example scripts in examples/slurm
for information on how to run NeMo Curator on SLURM without the NeMo
Framework Launcher.

Module Ablation and Compute Performance

 

The modules within NeMo Curator were primarily designed to curate
high-quality documents from Common Crawl snapshots in a scalable
manner. To evaluate the quality of the curated Common Crawl
documents, we conducted a series of ablation experiments. In these
experiments, we trained a 357M-parameter GPT-style model using
datasets generated at various stages of our data curation pipeline,
which was implemented in NeMo Curator.

The following figure shows that the use of different data curation
modules implemented in NeMo Curator led to improved model zero-shot
downstream task performance.

[Figure: zero-shot downstream task performance across data curation stages]

In terms of scalability and compute performance, combining RAPIDS and
Dask for fuzzy deduplication enabled us to deduplicate the
1.1-trillion-token Red Pajama dataset in 1.8 hours using 64 NVIDIA
A100 Tensor Core GPUs.
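
For a rough sense of scale, the figures above can be turned into GPU-hours and an average per-GPU throughput. This is back-of-envelope arithmetic only: it assumes the work was spread evenly over all 64 GPUs for the full 1.8 hours, which the source does not state, and the variable names are chosen for this example.

```python
tokens = 1.1e12   # 1.1 trillion tokens in the dataset
hours = 1.8       # reported wall-clock time
gpus = 64         # reported number of A100 GPUs

gpu_hours = hours * gpus                  # total GPU time consumed
tokens_per_gpu_hour = tokens / gpu_hours  # average throughput per GPU

print(f"GPU-hours: {gpu_hours:.1f}")
print(f"Tokens per GPU-hour: {tokens_per_gpu_hour:.2e}")
```

Under these assumptions the run used about 115 GPU-hours, or roughly 9.5 billion tokens per GPU-hour.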

Additionally, the following table shows the time required and the
resulting data size reduction for each CPU-based processing step on
the Common Crawl snapshot from November/December 2020, run on 30 CPU
nodes (with hardware similar to the c5.24xlarge Amazon AWS C5
instance).

Dataset               Download and text extraction   Text cleaning         Quality filtering
                      Time     Output size           Time   Output size    Time    Output size
Common Crawl 2020-50  36 hrs   2.8 TB                1 hr   2.8 TB         0.2 hr  0.52 TB
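
The table's numbers can be combined into an end-to-end time and an overall size reduction. The short calculation below uses only the values in the table; the node-hour figure assumes all 30 nodes were busy for every stage, which is an assumption and not something the source states.

```python
# (stage name, hours, output size in TB), copied from the table above
stages = [
    ("Download and text extraction", 36.0, 2.8),
    ("Text cleaning", 1.0, 2.8),
    ("Quality filtering", 0.2, 0.52),
]
cpu_nodes = 30

total_hours = sum(hours for _, hours, _ in stages)
node_hours = total_hours * cpu_nodes
# Fraction of the extracted text removed by the time quality filtering finishes
size_reduction = 1 - stages[-1][2] / stages[0][2]

print(f"Total time: {total_hours:.1f} hrs ({node_hours:.0f} node-hours)")
print(f"Size reduction after quality filtering: {size_reduction:.1%}")
```

Download and extraction dominates the runtime, while quality filtering accounts for the whole size reduction, from 2.8 TB to 0.52 TB (about 81%).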

Contribute to NeMo Curator

 

We welcome community contributions! Please refer to CONTRIBUTING.md
for the process.

Topics

python data data-processing data-preparation deduplication 
data-quality data-curation data-prep fine-tuning fast-data-processing
data-processing-pipelines datacuration large-language-models llm 
llmapps large-scale-data-processing datarecipes 
semantic-deduplication llm-data-quality


Releases 3

 
v0.4.1 Latest
Oct 3, 2024
+ 2 releases


Contributors 20

  * @ryantwolf
  * @sarahyurick
  * @ayushdg
  * @Maghoumi
  * @VibhuJawa
  * @praateekmahajan
  * @miguelusque
  * @chrisalexiuk-nvidia
  * @rjzamora
  * @jgerh
  * @yury-tokpanov
  * @aschilling-nv
  * @nicoleeeluo
  * @terrykong

+ 6 contributors

Languages

  * Jupyter Notebook 79.5%
  * Python 20.3%
  * Other 0.2%
