March 28, 2023 | Machine Learning, Software, Cluster, Blog

Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models

Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models set new benchmarks for accuracy and compute efficiency.

Abstract

State-of-the-art language models are extremely challenging to train: they require huge compute budgets, complex distributed-compute techniques, and deep ML expertise. As a result, few organizations train large language models (LLMs) from scratch. And increasingly, those that do have the resources and expertise are not open-sourcing the results, a significant change from even a few months ago.

At Cerebras, we believe in fostering open access to the most advanced models. With this in mind, we are proud to announce the release to the open-source community of Cerebras-GPT, a family of seven GPT models ranging from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget. Cerebras-GPT offers faster training times, lower training costs, and lower energy consumption than any publicly available model to date.

All models were trained on CS-2 systems that are part of the Andromeda AI supercomputer using our simple, data-parallel weight streaming architecture. Because we did not have to worry about model partitioning, we were able to train these models in just a few weeks.

Training these seven models allowed us to derive a new scaling law. Scaling laws predict model accuracy based on the training compute budget and have been hugely influential in guiding AI research. To the best of our knowledge, Cerebras-GPT is the first scaling law that predicts model performance for a public dataset.

Today's release is designed to be usable and reproducible by anyone. All models, weights, and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license. Additionally, we provide detailed information on our training methods and performance results in our forthcoming paper. The Cerebras CS-2 systems used for training are also available on demand via the Cerebras AI Model Studio.

Cerebras-GPT: A New Model For Open LLM Development

Artificial intelligence has the potential to transform the world economy, but access to it is increasingly gated. The latest large language model, OpenAI's GPT-4, was released with no information on its model architecture, training data, training hardware, or hyperparameters. Companies are increasingly building large models using closed datasets and offering model outputs only via API access.
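Cerebras-GPT takes the opposite approach: because the weights and checkpoints are published on Hugging Face under the Apache 2.0 license, they can be loaded and run locally with standard open tooling. Below is a minimal sketch, assuming the Hugging Face transformers library and a repository ID of the form cerebras/Cerebras-GPT-111M; the exact model IDs are an assumption here, so consult the Cerebras organization page on Hugging Face for the released checkpoints.

```python
# Minimal sketch: load a Cerebras-GPT checkpoint from Hugging Face and generate text.
# The repository ID "cerebras/Cerebras-GPT-111M" is assumed for illustration; the
# family spans 111M through 13B parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Generative AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```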
For LLMs to be an open and accessible technology, we believe it's important to have access to state-of-the-art models that are open, reproducible, and royalty-free for both research and commercial applications. To that end, we have trained a family of transformer models, using the latest techniques and open datasets, that we call Cerebras-GPT. These models are the first family of GPT models trained using the Chinchilla formula and released under the Apache 2.0 license.

[Scaling-laws-blog-comparison] Figure 1. A comparison of different large language models and their openness and training philosophy.

Large language models can be broadly categorized into two camps. The first group includes models such as OpenAI's GPT-4 and DeepMind's Chinchilla, which are trained on private data to achieve the highest level of accuracy. However, the trained weights and source code of these models are not available to the public. The second group includes models such as Meta's OPT and Eleuther's Pythia, which are open source but not trained in a compute-optimal manner.

By "compute-optimal," we refer to DeepMind's finding that large language models achieve the highest accuracy for a fixed compute budget when 20 data tokens are used for every parameter in the model. Therefore, a one-billion-parameter model should be trained on 20 billion data tokens to reach optimal results for a fixed training budget. This is sometimes referred to as the "Chinchilla recipe" (a short worked example of the budgets it implies appears below).

An implication of this finding is that it is not optimal to use the same amount of training data across a family of model sizes. For instance, training a small model with too much data yields diminishing returns and smaller accuracy gains per FLOP; it would be better to use a larger model with less data. Conversely, a large model trained on too little data does not reach its potential; it would be better to reduce the model size and feed it more data. In each case, using 20 tokens per parameter is optimal, per the Chinchilla recipe.

[Scaling-Laws-blog-fig-2] Figure 2. Cerebras-GPT vs. Pythia. Lower curves show greater compute efficiency for a given loss level.

EleutherAI's Pythia open-source model suite is highly valuable for the research community because it provides a wide range of model sizes trained on the public Pile dataset under a controlled training methodology. However, Pythia was trained with a fixed number of tokens across all model sizes, with the objective of providing an apples-to-apples baseline across all models. Designed to be complementary to Pythia, Cerebras-GPT covers a wide range of model sizes using the same public Pile dataset and establishes a training-efficient scaling law and family of models. Cerebras-GPT consists of seven models with 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B parameters, all of which are trained using 20 tokens per parameter. By using the optimal number of training tokens for each model size, Cerebras-GPT achieves the lowest loss per unit of compute across all model sizes (Figure 2).

New Scaling Law

Training a large language model can be an expensive and time-consuming process, requiring significant computational resources and expertise to optimize the model's performance. One way to address this challenge is to train a family of models of varying sizes, which can be used to establish a scaling law describing the relationship between training compute and model performance.
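To make the Chinchilla recipe concrete, here is a minimal sketch of the token and compute budgets it implies for the Cerebras-GPT model sizes. The FLOPs estimate C ~ 6·N·D is a standard approximation from the scaling-law literature and is an assumption for illustration, not a figure taken from this post.

```python
# Minimal sketch: token and compute budgets implied by the Chinchilla recipe
# (20 training tokens per parameter) for the Cerebras-GPT model sizes.
# The FLOPs estimate C ~ 6 * N * D is a standard approximation from the
# scaling-law literature, used here only for illustration.
TOKENS_PER_PARAM = 20

model_sizes = {  # parameter counts
    "111M": 111e6, "256M": 256e6, "590M": 590e6,
    "1.3B": 1.3e9, "2.7B": 2.7e9, "6.7B": 6.7e9, "13B": 13e9,
}

for name, n_params in model_sizes.items():
    n_tokens = TOKENS_PER_PARAM * n_params  # compute-optimal token count
    flops = 6 * n_params * n_tokens         # approximate training compute
    print(f"{name:>5}: {n_tokens / 1e9:7.1f}B tokens, ~{flops:.2e} training FLOPs")
```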
[Scaling-law-chart-no-logo] Figure 3. Cerebras-GPT scaling law.

Scaling laws are vital to LLM development because they allow researchers to predict the expected loss of a model before training, thus avoiding costly hyperparameter search. OpenAI was the first to establish a scaling law showing a power-law relationship between compute and model loss. DeepMind followed with the Chinchilla study, demonstrating an optimal ratio between compute and data. However, these studies were performed using closed datasets, making it difficult to apply the results to other datasets. Cerebras-GPT continues this line of research by establishing a scaling law based on the open Pile dataset. The resulting scaling law provides a compute-efficient recipe for training LLMs of any size using Pile. By publishing our findings, we hope to provide a valuable resource for the community and to further advance the development of large language models.

Model Performance on Downstream Tasks

We evaluated the performance of Cerebras-GPT on several downstream language tasks, such as sentence completion and question answering. These evaluations matter because even when a model has good general natural language understanding, that may not translate to specialized downstream tasks. We show that Cerebras-GPT preserves state-of-the-art training efficiency on most common downstream tasks, as shown in the examples in Figure 4. Notably, while previous scaling laws have shown scaling for pre-training loss, this is the first time results have been published showing scaling for downstream natural language tasks.

[Downstream-tasks-figure] Figure 4. Example downstream-task performance comparison of Cerebras-GPT and other open-source models. Cerebras-GPT preserves its training-efficiency advantage across downstream tasks.

Cerebras CS-2: Simple, Data-Parallel Training

It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling. To understand why, consider the existing LLM scaling techniques on GPUs shown in Figure 5.

The simplest way to scale is data parallelism. Data-parallel scaling replicates the model on each device, trains each replica on a different batch, and averages the gradients (a minimal sketch of this step follows the figures below). Clearly, this does not address the issue of model size: it fails if the entire model does not fit on a single GPU. A common alternative is pipelined model parallelism, which runs different layers on different GPUs as a pipeline. However, as the pipeline grows, the activation memory increases quadratically with the pipeline depth, which can be prohibitive for large models. To avoid that, another common approach is to split individual layers across GPUs, called tensor model parallelism, but this imposes significant communication between the GPUs, which complicates the implementation and can be slow. Because of these complexities, there is no single way to scale on GPU clusters today. Training large models on GPUs requires a hybrid approach combining all forms of parallelism; the implementations are complicated and hard to bring up, and there are significant performance issues.

[Scaling-laws-blog-parallel-techniques] Figure 5. Existing scaling techniques on distributed GPU clusters and their challenges. Scaling on GPU clusters requires a complex combination of all forms of parallelism.

[Scaling-laws-blog-training-HW] Figure 6. GPU scaling requires the use of multiple parallelism techniques. The Cerebras CS-2 uses data-parallel scaling for any model size.
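For reference, the sketch below illustrates the data-parallel step described above: every replica holds the same weights, computes gradients on its own batch, and the gradients are averaged before one identical update. This is an illustrative NumPy toy, not Cerebras or GPU-cluster code.

```python
# Minimal sketch of data-parallel training: each replica computes gradients on its
# own batch, then the gradients are averaged (an "all-reduce") and every replica
# applies the same weight update. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_replicas, dim, lr = 4, 8, 0.01
weights = rng.normal(size=dim)  # one model, replicated on every device

def local_gradient(w, batch):
    # Toy least-squares gradient computed independently on one replica's batch.
    x, y = batch
    return 2.0 * x.T @ (x @ w - y) / len(y)

# Each replica sees a different batch of the training data.
batches = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(n_replicas)]

grads = [local_gradient(weights, b) for b in batches]  # computed in parallel in practice
avg_grad = np.mean(grads, axis=0)                      # all-reduce: average across replicas
weights -= lr * avg_grad                               # identical update on every replica
```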
Two recent large language models illustrate the complexities involved in splitting large models across many GPUs (Figure 6). Meta's OPT model, ranging from 125M to 175B parameters, was trained on 992 GPUs using a combination of data parallelism and tensor parallelism along with various memory-optimization techniques. Eleuther's 20B-parameter GPT-NeoX used a combination of data, tensor, and pipeline parallelism to train the model across 96 GPUs.

Cerebras-GPT was trained using standard data parallelism on 16 CS-2 systems. This is possible because the Cerebras CS-2 systems are fitted with enough memory to run even the largest models on a single device without splitting the model. We then designed the purpose-built Cerebras Wafer-Scale Cluster around the CS-2 to enable easy scale-out. It uses a HW/SW co-designed execution mode called weight streaming that enables independent scaling of model size and cluster size, without model parallelism. With this architecture, scaling to larger clusters is as simple as changing the number of systems in a configuration file, as shown in Figure 7.

[Cluster-code-gif-2-1] Figure 7. Push-button scaling to multiple CS-2 systems in the Cerebras Wafer-Scale Cluster using only simple data-parallel scaling.

We trained all Cerebras-GPT models on a 16x CS-2 Cerebras Wafer-Scale Cluster called Andromeda. The cluster enabled all experiments to be completed quickly, without the traditional distributed-systems engineering and model-parallel tuning needed on GPU clusters. Most importantly, it enabled our researchers to focus on the design of the ML instead of the distributed system. We believe the capability to easily train large models is a key enabler for the broad community, so we have made the Cerebras Wafer-Scale Cluster available on the cloud through the Cerebras AI Model Studio.

Conclusion

At Cerebras, we believe democratizing large models requires both solving the training-infrastructure challenge and opening more models to the community. To that end, we have designed the Cerebras Wafer-Scale Cluster with push-button scaling, and we are open-sourcing the Cerebras-GPT family of large generative models. We hope that, as the first public suite of large GPT models with state-of-the-art training efficiency, Cerebras-GPT will serve as a recipe for efficient training and as a reference for further community research. Additionally, we are making both the infrastructure and the models available on the cloud through the Cerebras AI Model Studio. We believe it is through better training infrastructure and more community sharing that we can, together, further advance the large generative AI industry.

Authors: Nolan Dey, Research Scientist; Joel Hestness, Principal Research Scientist; Sean Lie, Chief Hardware Architect and Co-founder | Mar 28, 2023

Contributing Authors: Nolan Dey, Gurpreet Gosal, Charles Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness.

Additional Resources
* arXiv Paper (coming soon)
* Hugging Face
* Cerebras Model Zoo
* Cerebras AI Model Studio
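As a closing illustration of how the scaling-law recipe can be reused on your own training runs, the sketch below fits a simple power law, loss ~ a·C^(-b), to a handful of (compute, loss) measurements. The data points are placeholders, not Cerebras-GPT results, and the functional form is the basic power law discussed above; the published fit in the paper may differ.

```python
# Minimal sketch: fit a power law, loss ~ a * C**(-b), to (training compute, loss)
# measurements from a family of runs. The data points are placeholders for
# illustration only, NOT Cerebras-GPT results.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs (placeholder values)
loss = np.array([3.9, 3.2, 2.7, 2.3])         # final pre-training loss (placeholder values)

# A power law is a straight line in log-log space: log(loss) = intercept + slope * log(C).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted law: loss ~ {a:.2f} * C^(-{b:.3f})")

# Predict the loss of a larger run before spending the compute on it.
target_flops = 1e22
print(f"predicted loss at {target_flops:.0e} FLOPs: {a * target_flops ** (-b):.2f}")
```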