March 28, 2023 | Machine Learning, Software, Cluster, Blog

Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models

Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models set new benchmarks for accuracy and compute efficiency.

Abstract

State-of-the-art language models are extremely challenging to train: they require huge compute budgets, complex distributed-compute techniques, and deep ML expertise. As a result, few organizations train large language models (LLMs) from scratch. And increasingly, those that do have the resources and expertise are not open-sourcing the results, a significant change from even a few months ago.

At Cerebras, we believe in fostering open access to the most advanced models. With this in mind, we are proud to announce the release to the open-source community of Cerebras-GPT, a family of seven GPT models ranging from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget. Cerebras-GPT offers faster training times, lower training costs, and lower energy consumption than any publicly available model to date.

All models were trained on CS-2 systems that are part of the Andromeda AI supercomputer using our simple, data-parallel weight streaming architecture. Because we did not have to worry about model partitioning, we were able to train these models in just a few weeks.

Training these seven models allowed us to derive a new scaling law. Scaling laws predict model accuracy based on the training compute budget and have been hugely influential in guiding AI research. To the best of our knowledge, Cerebras-GPT is the first scaling law that predicts model performance for a public dataset.

Today's release is designed to be usable and reproducible by anyone. All models, weights, and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license. Additionally, we provide detailed information on our training methods and performance results in our forthcoming paper. The Cerebras CS-2 systems used for training are also available on demand via the Cerebras AI Model Studio.

Cerebras-GPT: A New Model For Open LLM Development

Artificial intelligence has the potential to transform the world economy, but access to it is increasingly gated. The latest large language model, OpenAI's GPT-4, was released with no information on its model architecture, training data, training hardware, or hyperparameters. Companies are increasingly building large models using closed datasets and offering model outputs only via API access.
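Cerebras-GPT takes the opposite approach: because the weights and checkpoints are published on Hugging Face under the Apache 2.0 license, they can be loaded and run locally with standard open tooling. Below is a minimal sketch, assuming the Hugging Face transformers library and a repository ID of the form cerebras/Cerebras-GPT-111M; the exact model IDs are an assumption here, so consult the Cerebras organization page on Hugging Face for the released checkpoints.

```python
# Minimal sketch: load a Cerebras-GPT checkpoint from Hugging Face and generate text.
# The repository ID "cerebras/Cerebras-GPT-111M" is assumed for illustration; the
# family spans 111M through 13B parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Generative AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```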
For LLMs to be an open and accessible technology, we believe it's important to have access to state-of-the-art models that are open, reproducible, and royalty-free for both research and commercial applications. To that end, we have trained a family of transformer models, using the latest techniques and open datasets, that we call Cerebras-GPT. These models are the first family of GPT models trained using the Chinchilla formula and released under the Apache 2.0 license.

[Scaling-laws-blog-comparison] Figure 1. A comparison of different large language models and their openness and training philosophy.

Large language models can be broadly categorized into two camps. The first group includes models such as OpenAI's GPT-4 and DeepMind's Chinchilla, which are trained on private data to achieve the highest level of accuracy. However, the trained weights and source code of these models are not available to the public. The second group includes models such as Meta's OPT and Eleuther's Pythia, which are open source but not trained in a compute-optimal manner.

By "compute-optimal," we refer to DeepMind's finding that large language models achieve the highest accuracy for a fixed compute budget when 20 data tokens are used for every parameter in the model. Therefore, a one-billion-parameter model should be trained on 20 billion data tokens to reach optimal results for a fixed training budget. This is sometimes referred to as the "Chinchilla recipe" (a short worked example of the budgets it implies appears below).

An implication of this finding is that it is not optimal to use the same amount of training data across a family of model sizes. For instance, training a small model with too much data yields diminishing returns and smaller accuracy gains per FLOP; it would be better to use a larger model with less data. Conversely, a large model trained on too little data does not reach its potential; it would be better to reduce the model size and feed it more data. In each case, using 20 tokens per parameter is optimal, per the Chinchilla recipe.

[Scaling-Laws-blog-fig-2] Figure 2. Cerebras-GPT vs. Pythia. Lower curves show greater compute efficiency for a given loss level.

EleutherAI's Pythia open-source model suite is highly valuable for the research community because it provides a wide range of model sizes trained on the public Pile dataset under a controlled training methodology. However, Pythia was trained with a fixed number of tokens across all model sizes, with the objective of providing an apples-to-apples baseline across all models. Designed to be complementary to Pythia, Cerebras-GPT covers a wide range of model sizes using the same public Pile dataset and establishes a training-efficient scaling law and family of models. Cerebras-GPT consists of seven models with 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B parameters, all of which are trained using 20 tokens per parameter. By using the optimal number of training tokens for each model size, Cerebras-GPT achieves the lowest loss per unit of compute across all model sizes (Figure 2).

New Scaling Law

Training a large language model can be an expensive and time-consuming process, requiring significant computational resources and expertise to optimize the model's performance. One way to address this challenge is to train a family of models of varying sizes, which can be used to establish a scaling law describing the relationship between training compute and model performance.
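To make the Chinchilla recipe concrete, here is a minimal sketch of the token and compute budgets it implies for the Cerebras-GPT model sizes. The FLOPs estimate C ~ 6·N·D is a standard approximation from the scaling-law literature and is an assumption for illustration, not a figure taken from this post.

```python
# Minimal sketch: token and compute budgets implied by the Chinchilla recipe
# (20 training tokens per parameter) for the Cerebras-GPT model sizes.
# The FLOPs estimate C ~ 6 * N * D is a standard approximation from the
# scaling-law literature, used here only for illustration.
TOKENS_PER_PARAM = 20

model_sizes = {  # parameter counts
    "111M": 111e6, "256M": 256e6, "590M": 590e6,
    "1.3B": 1.3e9, "2.7B": 2.7e9, "6.7B": 6.7e9, "13B": 13e9,
}

for name, n_params in model_sizes.items():
    n_tokens = TOKENS_PER_PARAM * n_params  # compute-optimal token count
    flops = 6 * n_params * n_tokens         # approximate training compute
    print(f"{name:>5}: {n_tokens / 1e9:7.1f}B tokens, ~{flops:.2e} training FLOPs")
```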
[Scaling-law-chart-no-logo] Figure 3. Cerebras-GPT scaling law.

Scaling laws are vital to LLM development because they allow researchers to predict the expected loss of a model before training, thus avoiding costly hyperparameter search. OpenAI was the first to establish a scaling law showing a power-law relationship between compute and model loss. DeepMind followed with the Chinchilla study, demonstrating an optimal ratio between compute and data. However, these studies were performed using closed datasets, making it difficult to apply the results to other datasets. Cerebras-GPT continues this line of research by establishing a scaling law based on the open Pile dataset. The resulting scaling law provides a compute-efficient recipe for training LLMs of any size using Pile. By publishing our findings, we hope to provide a valuable resource for the community and to further advance the development of large language models.

Model Performance on Downstream Tasks

We evaluated the performance of Cerebras-GPT on several downstream language tasks, such as sentence completion and question answering. These evaluations matter because even when a model has good general natural language understanding, that may not translate to specialized downstream tasks. We show that Cerebras-GPT preserves state-of-the-art training efficiency on most common downstream tasks, as shown in the examples in Figure 4. Notably, while previous scaling laws have shown scaling for pre-training loss, this is the first time results have been published showing scaling for downstream natural language tasks.

[Downstream-tasks-figure] Figure 4. Example downstream-task performance comparison of Cerebras-GPT and other open-source models. Cerebras-GPT preserves its training-efficiency advantage across downstream tasks.

Cerebras CS-2: Simple, Data-Parallel Training

It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling. To understand why, consider the existing LLM scaling techniques on GPUs shown in Figure 5.

The simplest way to scale is data parallelism. Data-parallel scaling replicates the model on each device, trains each replica on a different batch, and averages the gradients (a minimal sketch of this step follows the figures below). Clearly, this does not address the issue of model size: it fails if the entire model does not fit on a single GPU. A common alternative is pipelined model parallelism, which runs different layers on different GPUs as a pipeline. However, as the pipeline grows, the activation memory increases quadratically with the pipeline depth, which can be prohibitive for large models. To avoid that, another common approach is to split individual layers across GPUs, called tensor model parallelism, but this imposes significant communication between the GPUs, which complicates the implementation and can be slow. Because of these complexities, there is no single way to scale on GPU clusters today. Training large models on GPUs requires a hybrid approach combining all forms of parallelism; the implementations are complicated and hard to bring up, and there are significant performance issues.

[Scaling-laws-blog-parallel-techniques] Figure 5. Existing scaling techniques on distributed GPU clusters and their challenges. Scaling on GPU clusters requires a complex combination of all forms of parallelism.

[Scaling-laws-blog-training-HW] Figure 6. GPU scaling requires the use of multiple parallelism techniques. The Cerebras CS-2 uses data-parallel scaling for any model size.
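For reference, the sketch below illustrates the data-parallel step described above: every replica holds the same weights, computes gradients on its own batch, and the gradients are averaged before one identical update. This is an illustrative NumPy toy, not Cerebras or GPU-cluster code.

```python
# Minimal sketch of data-parallel training: each replica computes gradients on its
# own batch, then the gradients are averaged (an "all-reduce") and every replica
# applies the same weight update. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_replicas, dim, lr = 4, 8, 0.01
weights = rng.normal(size=dim)  # one model, replicated on every device

def local_gradient(w, batch):
    # Toy least-squares gradient computed independently on one replica's batch.
    x, y = batch
    return 2.0 * x.T @ (x @ w - y) / len(y)

# Each replica sees a different batch of the training data.
batches = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(n_replicas)]

grads = [local_gradient(weights, b) for b in batches]  # computed in parallel in practice
avg_grad = np.mean(grads, axis=0)                      # all-reduce: average across replicas
weights -= lr * avg_grad                               # identical update on every replica
```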
Two recent large language models illustrate the complexities involved in splitting large models across many GPUs (Figure 6). Meta's OPT model, ranging from 125M to 175B parameters, was trained on 992 GPUs using a combination of data parallelism and tensor parallelism along with various memory-optimization techniques. Eleuther's 20B-parameter GPT-NeoX used a combination of data, tensor, and pipeline parallelism to train the model across 96 GPUs.

Cerebras-GPT was trained using standard data parallelism on 16 CS-2 systems. This is possible because the Cerebras CS-2 systems are fitted with enough memory to run even the largest models on a single device without splitting the model. We then designed the purpose-built Cerebras Wafer-Scale Cluster around the CS-2 to enable easy scale-out. It uses a HW/SW co-designed execution mode called weight streaming that enables independent scaling of model size and cluster size, without model parallelism. With this architecture, scaling to larger clusters is as simple as changing the number of systems in a configuration file, as shown in Figure 7.

[Cluster-code-gif-2-1] Figure 7. Push-button scaling to multiple CS-2 systems in the Cerebras Wafer-Scale Cluster using only simple data-parallel scaling.

We trained all Cerebras-GPT models on a 16x CS-2 Cerebras Wafer-Scale Cluster called Andromeda. The cluster enabled all experiments to be completed quickly, without the traditional distributed-systems engineering and model-parallel tuning needed on GPU clusters. Most importantly, it enabled our researchers to focus on the design of the ML instead of the distributed system. We believe the capability to easily train large models is a key enabler for the broad community, so we have made the Cerebras Wafer-Scale Cluster available on the cloud through the Cerebras AI Model Studio.

Conclusion

At Cerebras, we believe democratizing large models requires both solving the training-infrastructure challenge and opening more models to the community. To that end, we have designed the Cerebras Wafer-Scale Cluster with push-button scaling, and we are open-sourcing the Cerebras-GPT family of large generative models. We hope that, as the first public suite of large GPT models with state-of-the-art training efficiency, Cerebras-GPT will serve as a recipe for efficient training and as a reference for further community research. Additionally, we are making both the infrastructure and the models available on the cloud through the Cerebras AI Model Studio. We believe it is through better training infrastructure and more community sharing that we can, together, further advance the large generative AI industry.

Authors: Nolan Dey, Research Scientist; Joel Hestness, Principal Research Scientist; Sean Lie, Chief Hardware Architect and Co-founder | Mar 28, 2023

Contributing Authors: Nolan Dey, Gurpreet Gosal, Charles Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness.

Additional Resources
* arXiv Paper (coming soon)
* Hugging Face
* Cerebras Model Zoo
* Cerebras AI Model Studio
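As a closing illustration of how the scaling-law recipe can be reused on your own training runs, the sketch below fits a simple power law, loss ~ a·C^(-b), to a handful of (compute, loss) measurements. The data points are placeholders, not Cerebras-GPT results, and the functional form is the basic power law discussed above; the published fit in the paper may differ.

```python
# Minimal sketch: fit a power law, loss ~ a * C**(-b), to (training compute, loss)
# measurements from a family of runs. The data points are placeholders for
# illustration only, NOT Cerebras-GPT results.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs (placeholder values)
loss = np.array([3.9, 3.2, 2.7, 2.3])         # final pre-training loss (placeholder values)

# A power law is a straight line in log-log space: log(loss) = intercept + slope * log(C).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted law: loss ~ {a:.2f} * C^(-{b:.3f})")

# Predict the loss of a larger run before spending the compute on it.
target_flops = 1e22
print(f"predicted loss at {target_flops:.0e} FLOPs: {a * target_flops ** (-b):.2f}")
```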