# Large Concept Models: Language Modeling in a Sentence Representation Space

[Blog] [Paper]

This repository provides the official implementations and experiments for Large Concept Models (LCM).

![space](space.svg)

The LCM operates on an explicit higher-level semantic representation, which we name a "concept". Concepts are language- and modality-agnostic and represent a higher-level idea. In this work, a concept corresponds to a sentence, and we use the SONAR embedding space, which supports up to 200 languages in text and 57 languages in speech. See the list of supported languages here.

## Approach

![lcm](lcm.svg)

The LCM is a sequence-to-sequence model in the concept space, trained to perform auto-regressive sentence prediction. We explore multiple approaches:

* MSE regression (`base_lcm` in this code).
* Variants of diffusion-based generation (we include `two_tower_diffusion_lcm` in this release).
* Models operating in a quantized SONAR space (coming soon).

These explorations are performed using 1.6B-parameter models and training data on the order of 1.3T tokens. This repository includes recipes to reproduce the training and finetuning of the 1.6B MSE LCM and the two-tower diffusion LCM. See the instructions below.
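To make the MSE-regression variant concrete, below is a deliberately tiny, schematic sketch of autoregressive next-concept prediction with an MSE loss. This is an illustration of the idea only, not the repository's actual 1.6B architecture; all module names and sizes are invented for the example:

```python
# Schematic sketch of MSE next-concept prediction (illustration only; the
# real base_lcm architecture in this repository differs).
import torch
import torch.nn as nn

EMBED_DIM = 1024  # SONAR sentence embeddings are 1024-dimensional


class TinyConceptLM(nn.Module):
    """A toy causal Transformer over sequences of sentence embeddings."""

    def __init__(self, dim: int = EMBED_DIM, layers: int = 2, heads: int = 8):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, seq, dim) -- one SONAR vector per sentence
        mask = nn.Transformer.generate_square_subsequent_mask(concepts.size(1))
        return self.head(self.backbone(concepts, mask=mask))


model = TinyConceptLM()
docs = torch.randn(4, 16, EMBED_DIM)   # a batch of embedded documents
pred = model(docs[:, :-1])             # predict concepts 1..15 from 0..14
loss = nn.functional.mse_loss(pred, docs[:, 1:])
loss.backward()
```

The diffusion variants replace the pointwise MSE prediction with a learned conditional distribution over the next embedding, which the paper motivates as a better fit for the continuous SONAR space.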
## Installing

### Using UV

The LCM repository relies on fairseq2. If you have `uv` installed on your system, you can create a virtual environment with all the necessary packages by running:

```bash
uv sync --extra cpu --extra eval --extra data
```

You can also use `uv run` to run the demo commands with the correct environment.

Note that we only provide requirements for CPU dependencies. If you want GPU support, you will have to choose the variants of torch and fairseq2 that work for your system. For example, for torch 2.5.1 with CUDA 12.1, you would do something like:

```bash
uv pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/cu121 --upgrade
uv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu121 --upgrade
```

Check the fairseq2 variants for other available builds. Note that LCM currently relies on the release candidate fairseq2 0.3.0rc1.

### Using pip

To install with pip, the commands are very similar, but you will have to manage your own environment and make sure to install fairseq2 manually first. For instance, for a CPU install:

```bash
pip install --upgrade pip
pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cpu
pip install -e ".[data,eval]"
```

If fairseq2 does not provide a build for your machine, check the README of that project to build it locally.

## Usage

> **Note:** If using `uv`, prefix all commands with `uv run` to use the environment created by default in `.venv`, e.g., `uv run torchrun --standalone`. Alternatively, you can activate the environment once and for all with `source .venv/bin/activate`.

### Preparing data

The LCM can be trained and evaluated on textual data that has been split into sentences and embedded with SONAR. We provide a sample processing pipeline that can be used to prepare such training data; you can run it with:

```bash
uv run --extra data scripts/prepare_wikipedia.py /output/dir/for/the/data
```

This pipeline shows how to fetch a dataset from Hugging Face and process it with SONAR and SaT. Check out the file for more details on processing your own data. While the script provides an example pulling data from Hugging Face, we also provide APIs to process JSONL, Parquet, and CSV files.
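If you want to sanity-check SONAR embeddings on a few sentences before processing a full dataset, the SONAR library exposes a text-to-embedding pipeline. A minimal sketch, assuming the `sonar-space` package and its default text encoder assets are installed (see the SONAR repository for the authoritative usage):

```python
# Minimal SONAR text-embedding sketch (assumes `pip install sonar-space`;
# the model/tokenizer names below are SONAR's standard text encoder assets).
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

sentences = [
    "The LCM operates on concepts rather than tokens.",
    "Here, each sentence becomes one 1024-dimensional SONAR vector.",
]
# The source language is given as an NLLB-style code, e.g. "eng_Latn".
embeddings = t2vec.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([2, 1024])
```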
### Datacards

The trainer described below relies on datacards to configure the datasets. These datacards are YAML files with pointers to the dataset files (local or on S3) and information about their schema. We provide some sample datacards in `lcm/datacards/datacards.yaml`. Once you have processed some data, you can update the datacards with your paths.

### Fitting a normalizer

To fit a new embedding-space normalizer on a given weighted mixture of datasets, use the following command:

```bash
python scripts/fit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path "path/to/new/normalizer.pt" --max_nb_samples 1000000
```

Here, `dataset1`, `dataset2`, and `dataset3` are the names of datasets declared in the datacards as shown above, and 4, 1, and 10 are their respective relative weights. The resulting normalizer can then be declared as a model, as shown in `lcm/cards/sonar_normalizer.yaml`, and referenced in all model training configs.
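Conceptually, a normalizer of this kind stores per-dimension statistics fit on a sample of SONAR embeddings, standardizes vectors before training, and de-standardizes model outputs. A schematic sketch of the idea, not the repository's actual class:

```python
# Schematic embedding-space normalizer (illustration only; see
# lcm/cards/sonar_normalizer.yaml for how the real one is declared).
import torch


class EmbeddingNormalizer:
    def fit(self, sample: torch.Tensor) -> "EmbeddingNormalizer":
        # sample: (n_samples, dim) embeddings drawn from the dataset mixture
        self.mean = sample.mean(dim=0)
        self.std = sample.std(dim=0).clamp_min(1e-6)  # avoid division by zero
        return self

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

    def denormalize(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.std + self.mean


norm = EmbeddingNormalizer().fit(torch.randn(10_000, 1024))
z = norm.normalize(torch.randn(8, 1024))  # standardized embeddings
```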
## Pre-training models

### Base MSE LCM

To train an MSE LCM, use one of the following commands:

**Option 1.** Training with SLURM using submitit via stopes's launcher:

```bash
python -m lcm.train \
    +pretrain=mse \
    ++trainer.output_dir="checkpoints/mse_lcm" \
    ++trainer.experiment_name=training_mse_lcm
```

With this command, we submit a SLURM job named `training_mse_lcm` with the recipe's requirements, in this case:

```yaml
requirements:
  nodes: 4
  tasks_per_node: 8
  gpus_per_node: 8
  cpus_per_task: 32
  mem_gb: 0
  timeout_min: 10000
```

You can override the job's requirements, such as the timeout limit, and the launcher's SLURM partition with:

```bash
python -m lcm.train \
    +pretrain=mse \
    ++trainer.output_dir="checkpoints/mse_lcm" \
    ++trainer.experiment_name=training_mse_lcm \
    ++trainer.requirements.timeout_min=100 \
    ++trainer.requirements.cpus_per_task=8 \
    ++launcher.partition=$partition_name
```

**Option 2.** Training locally with torchrun (e.g., using only 2 GPUs) with a smaller batch size (overriding `++trainer.data_loading_config.max_tokens=1000`):

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \
    -m lcm.train launcher=standalone \
    +pretrain=mse \
    ++trainer.data_loading_config.max_tokens=1000 \
    ++trainer.output_dir="checkpoints/mse_lcm" \
    +trainer.use_submitit=false
```

> **Important:** Since we're changing the number of GPUs required by the recipe, this will not reproduce the experimental setup of the paper.

The checkpoints directory `checkpoints/mse_lcm` will be structured as:

```
.
+-- checkpoints
|   +-- step_2000
|   +-- ...
|   +-- step_250000
+-- config_logs
+-- executor_logs
+-- model_card.yaml
+-- tb       # tensorboard logs
+-- wandb    # W&B logs
```

Note that W&B logging is skipped unless wandb is available. You can install it with `uv pip install wandb`. W&B arguments can be changed by overriding Hydra config values in the recipe:

```bash
++trainer.wandb_project=$project_name
++trainer.wandb_run_name=$run_name
```

### Two-tower diffusion LCM

Similar to the base MSE LCM, we can submit a training job following the recipe in `./recipes/train/pretrain/two_tower.yaml` via:

```bash
python -m lcm.train \
    +pretrain=two_tower \
    ++trainer.output_dir="checkpoints/two_tower_lcm" \
    ++trainer.experiment_name=training_two_tower_lcm
```

> **Tip:** To understand the different ingredients of the training recipes, check this README.

## Finetuning models

To finetune the previously pre-trained two-tower diffusion LCM on supervised data, follow these steps:

**Step 1.** Register the pre-trained checkpoint as a fairseq2 asset. You can finetune the final checkpoint via the card `checkpoints/two_tower_lcm/model_card.yaml`, or any checkpoint after a specific number of training steps, e.g., `checkpoints/two_tower_lcm/checkpoints/step_2000/model_card.yaml`. To register the selected checkpoint, copy the automatically created YAML file to `./lcm/cards/mycards.yaml` and rename the model to replace the default `on_the_fly_lcm`. `./lcm/cards/mycards.yaml` will then look like:

```yaml
__source__: inproc
checkpoint: file://path_to/large_concept_model/checkpoints/two_tower_lcm/checkpoints/step_2000/model.pt
model_arch: two_tower_diffusion_lcm_1_6B
model_family: two_tower_diffusion_lcm
name: my_pretrained_two_tower
```

For more on how to manage fairseq2 assets, see the fairseq2 documentation.

**Step 2.** Launch a finetuning job pointing to the model to finetune, in this instance `my_pretrained_two_tower`:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 \
    -m lcm.train launcher=standalone \
    +finetune=two_tower \
    ++trainer.output_dir="checkpoints/finetune_two_tower_lcm" \
    ++trainer.data_loading_config.max_tokens=1000 \
    +trainer.use_submitit=false \
    ++trainer.model_config_or_name=my_pretrained_two_tower
```

or

```bash
python -m lcm.train \
    +finetune=two_tower \
    ++trainer.output_dir="checkpoints/finetune_two_tower_lcm" \
    ++trainer.experiment_name=finetune_two_tower_lcm \
    ++trainer.model_config_or_name=my_pretrained_two_tower
```

Similarly, to finetune an MSE LCM, follow the same instructions for registering a pre-trained checkpoint and submit a finetuning job with the appropriate recipe (`./recipes/train/finetune/mse.yaml`) via:

```bash
python -m lcm.train \
    +finetune=mse \
    ++trainer.output_dir="checkpoints/finetune_mse_lcm" \
    ++trainer.experiment_name=finetune_mse_lcm \
    ++trainer.model_config_or_name=my_pretrained_mse_lcm
```

## Evaluating models

> **Note:** For advanced evaluation (benchmarking different tasks, comparing results with LLMs, etc.), check the evaluation documentation.

**Step 0.** Download the NLTK data required for evaluating ROUGE:

```bash
python -m nltk.downloader punkt_tab
```

**Step 1.** Generate and score the outputs of a model, either by pointing to its `model_card` YAML file or after registering it as a fairseq2 asset (the same way we registered `my_pretrained_two_tower`):

```bash
model_card=./checkpoints/finetune_two_tower_lcm/checkpoints/step_1000/model_card.yaml
OUTPUT_DIR=evaluation_outputs/two_tower

torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
    --predictor two_tower_diffusion_lcm \
    --show_progress true \
    --data_loading.max_samples 100 \
    --model_card ${model_card} \
    --launcher standalone \
    --dataset.source_suffix_text '[MODEL]:' \
    --tasks finetuning_data_lcm.validation \
    --task_args '{"max_gen_len": 10, "eos_config": {"text": "End of text."}}' \
    --data_loading.batch_size 4 --generator_batch_size 4 \
    --dump_dir ${OUTPUT_DIR} \
    --inference_timesteps 40 \
    --initial_noise_scale 0.6 \
    --guidance_scale 3 \
    --guidance_rescale 0.7
```

In this example we evaluate only 100 samples (`--data_loading.max_samples 100`) and limit the model's output length to 10 sentences (`--task_args '{"max_gen_len": 10}'`). The outputs dumped in `./evaluation_outputs/two_tower` will be structured as:

```
.
+-- metadata.jsonl
+-- metrics.eval.jsonl
+-- raw_results
+-- results
+-- tb
```

where `metrics.eval.jsonl` contains corpus-level scores.
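Since `metrics.eval.jsonl` is plain JSON lines, a few lines of Python suffice to inspect the scores (the exact field names depend on the task and metrics configured):

```python
# Print the corpus-level scores dumped by lcm.evaluation.
import json

with open("evaluation_outputs/two_tower/metrics.eval.jsonl") as f:
    for line in f:
        print(json.dumps(json.loads(line), indent=2))
```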
To evaluate an MSE LCM, use the associated predictor (`base_lcm`) and evaluate with:

```bash
model_card=./checkpoints/finetune_mse_lcm/checkpoints/step_1000/model_card.yaml
OUTPUT_DIR=evaluation_outputs/mse_lcm

torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
    --predictor base_lcm --sample_latent_variable False \
    --show_progress true \
    --data_loading.max_samples 100 \
    --model_card ${model_card} \
    --launcher standalone \
    --dataset.source_suffix_text '[MODEL]:' \
    --tasks finetuning_data_lcm.validation \
    --task_args '{"max_gen_len": 10, "eos_config": {"text": "End of text."}}' \
    --data_loading.batch_size 4 --generator_batch_size 4 \
    --dump_dir ${OUTPUT_DIR}
```

Note that in this example we only show how to evaluate the LCM on its finetuning dataset (validation split). To evaluate on a downstream task and compare results with LLMs, refer to the evaluation documentation.

## Contributing

See the CONTRIBUTING file for how to help out.

## Citation

If you use this codebase, please cite:

```bibtex
@article{lcm2024,
  author    = {{LCM team}, Lo\"{i}c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-juss\`{a}, David Dale, Hady Elsahar, Kevin Heffernan, Jo\~{a}o Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sanchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk},
  title     = {{Large Concept Models}: Language Modeling in a Sentence Representation Space},
  publisher = {arXiv},
  year      = {2024},
  url       = {https://arxiv.org/abs/2412.08821},
}
```

## License

This code is released under the MIT license (see LICENSE).