https://github.com/NVIDIA/NeMo-Curator
NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

License: Apache-2.0

NeMo Curator

The GPU-Accelerated Open Source Framework for Efficient Large Language Model Data Curation

NeMo Curator is a Python library designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings.
The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.

At the core of NeMo Curator is DocumentDataset, which serves as the main dataset class. It acts as a straightforward wrapper around a Dask DataFrame. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.

Key Features

NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:

* Data download and text extraction
  + Default implementations for downloading and extracting Common Crawl, Wikipedia, and ArXiv data
  + Easily customize the download and extraction and extend to other datasets
* Language identification and separation with fastText and pycld2
* Text reformatting and cleaning to fix unicode decoding errors via ftfy
* Quality filtering
  + Multilingual heuristic-based filtering
  + Classifier-based filtering via fastText
* Document-level deduplication
  + Exact and fuzzy (near-identical) deduplication are accelerated using cuDF and Dask
  + For fuzzy deduplication, our implementation follows the method described in Microsoft Turing NLG 530B
  + For semantic deduplication, our implementation follows the method described in SemDeDup by Meta AI (FAIR) (facebookresearch/SemDeDup)
* Multilingual downstream-task decontamination following the approach of OpenAI GPT3 and Microsoft Turing NLG 530B
* Distributed data classification
  + Multi-node, multi-GPU classifier inference
  + Provides sophisticated domain and quality classification
  + Flexible interface for extending to your own classifier network
* Personally identifiable information (PII) redaction for removing addresses, credit card numbers, social security numbers, and more

These modules offer flexibility and permit reordering, with only a few exceptions.
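To make the difference between exact and fuzzy deduplication in the feature list above concrete, here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's implementation, which the list describes as accelerated with cuDF and Dask. The function names `exact_dedup` and `jaccard` are hypothetical, and word 3-gram Jaccard similarity is an assumed stand-in for a near-duplicate signal.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct text hash (hypothetical helper)."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def jaccard(a, b, n=3):
    """Jaccard similarity between the word n-gram sets of two strings (illustrative only)."""

    def shingles(text):
        words = text.lower().split()
        # Texts shorter than n words yield a single (possibly short) shingle.
        return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


docs = [
    {"id": 1, "text": "the quick brown fox jumps"},
    {"id": 2, "text": "the quick brown fox jumps"},  # exact duplicate of id 1
    {"id": 3, "text": "the quick brown fox leaps"},  # near duplicate of id 1
]

unique_docs = exact_dedup(docs)
similarity = jaccard(docs[0]["text"], docs[2]["text"])
```

Exact deduplication drops only the byte-identical copy, while the similarity score flags the third document as a near-duplicate candidate. The threshold above which a pair counts as a duplicate is a design choice that this sketch leaves open.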
In addition, the NeMo Framework Launcher provides pre-built pipelines that can serve as a foundation for your customization use cases.

Resources

* Documentation
* Examples
* Tutorials
* Blog posts
  + Curating Trillion-Token Datasets: Introducing NVIDIA NeMo Data Curator
  + Scale and Curate High-Quality Datasets for LLM Training with NVIDIA NeMo Curator
  + Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator
  + Curating Custom Datasets for LLM Parameter-Efficient Fine-Tuning with NVIDIA NeMo Curator
  + Streamlining Data Processing for Domain Adaptive Pretraining with NVIDIA NeMo Curator

Get Started

This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.

Install NeMo Curator

Requirements

Before installing NeMo Curator, ensure that the following requirements are met:

* Python 3.10
* Ubuntu 22.04/20.04
* NVIDIA GPU (optional)
  + Volta™ or higher (compute capability 7.0+)
  + CUDA 12 (or above)

You can install NeMo Curator in one of three ways:

1. From PyPI
2. From source
3. Through the NeMo Framework container

From PyPI

To install the CPU-only modules:

    pip install cython
    pip install nemo-curator

To install the CPU and CUDA-accelerated modules:

    pip install cython
    pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"

From Source

1. Clone the NeMo Curator repository from GitHub.

       git clone https://github.com/NVIDIA/NeMo-Curator.git
       cd NeMo-Curator

2. Install the modules that you need.

   To install the CPU-only modules:

       pip install cython
       pip install .

   To install the CPU and CUDA-accelerated modules:

       pip install cython
       pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Using Nightly Dependencies for RAPIDS

You can also install NeMo Curator using the RAPIDS Nightly Builds.
To do so, set the environment variable RAPIDS_NIGHTLY=1:

    # Installing from PyPI
    RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple "nemo-curator[cuda12x]"

    # Installing from source
    RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple ".[cuda12x]"

When RAPIDS_NIGHTLY is set to 0 (the default), the installation uses the stable version of RAPIDS.

From the NeMo Framework Container

The latest release of NeMo Curator comes preinstalled in the NeMo Framework Container. If you want the latest commit inside the container, you can reinstall NeMo Curator using:

    pip uninstall nemo-curator
    rm -r /opt/NeMo-Curator
    git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
    pip install --extra-index-url https://pypi.nvidia.com "/opt/NeMo-Curator[cuda12x]"

Then follow the instructions above for installing from source.

Use NeMo Curator

Python API Quick Example

The following snippet demonstrates how to create a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset.

    # Import paths are assumed from the package layout; verify them against your installed version.
    from nemo_curator import Modify, ScoreFilter, Sequential, TaskDecontamination
    from nemo_curator.download import download_common_crawl
    from nemo_curator.filters import FastTextQualityFilter, WordCountFilter
    from nemo_curator.modifiers import UnicodeReformatter
    from nemo_curator.tasks import Squad, TriviaQA, Winogrande

    # Download your dataset
    dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)

    # Build your pipeline
    curation_pipeline = Sequential([
        # Fix unicode
        Modify(UnicodeReformatter()),
        # Discard short records
        ScoreFilter(WordCountFilter(min_words=80)),
        # Discard low-quality records
        ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
        # Discard records that overlap with evaluation benchmarks to prevent test set leakage
        TaskDecontamination([Winogrande(), Squad(), TriviaQA()]),
    ])

    # Execute the pipeline on your dataset
    curated_dataset = curation_pipeline(dataset)

Explore NeMo Curator Tutorials

To get started with NeMo Curator, you can follow the tutorials in the tutorials directory of the repository. These tutorials include:

* tinystories, which focuses on data curation for training LLMs from scratch.
* peft-curation, which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use cases.
* distributed_data_classification, which focuses on using the quality and domain classifiers to help with data annotation.
* single_node_tutorial, which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.

Access Python Modules

The NeMo Curator section of the NeMo Framework User Guide provides in-depth information about how the Python modules work. The examples directory in the GitHub repository provides scripts that showcase these modules.

Use CLI Scripts

NeMo Curator also offers CLI scripts for you to use. The scripts in nemo_curator/scripts map closely to the supplied Python modules. Refer to the NeMo Framework User Guide for more information about the Python modules and scripts.

Use NeMo Framework Launcher

As an alternative method for interfacing with NeMo Curator, you can use the NeMo Framework Launcher. The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline.

In addition, other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in examples/slurm for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.

Module Ablation and Compute Performance

The modules within NeMo Curator were primarily designed to curate high-quality documents from Common Crawl snapshots in a scalable manner. To evaluate the quality of the curated Common Crawl documents, we conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.
The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

[Figure: zero-shot downstream task performance at different stages of the data curation pipeline]

In terms of scalability and compute performance, combining RAPIDS and Dask for fuzzy deduplication enabled us to deduplicate the 1.1-trillion-token Red Pajama dataset in 1.8 hours with 64 NVIDIA A100 Tensor Core GPUs.

Additionally, using the CPU-based modules, the following table shows the time required and the resulting data size reduction for each processing step on a Common Crawl snapshot from November/December of 2020, using 30 CPU nodes (with hardware similar to the c5.24xlarge Amazon AWS C5 instance).

| Dataset | Download and text extraction: Time | Download and text extraction: Output Size | Text cleaning: Time | Text cleaning: Output Size | Quality filtering: Time | Quality filtering: Output Size |
|---|---|---|---|---|---|---|
| Common Crawl 2020-50 | 36 hrs | 2.8 TB | 1 hr | 2.8 TB | 0.2 hr | 0.52 TB |

Contribute to NeMo Curator

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.