https://research.ibm.com/blog/science-expert-LLM

12 Mar 2024 | Technical note | 2 minute read

IBM and NASA build language models to make scientific knowledge more accessible

By Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Bharath Dandala, Rong Zhang, and Yousef El-Kurdi

In a new collaboration, IBM and NASA created a suite of efficient language models by training on scientific literature. Based on the transformer architecture, these models can be used in a variety of applications, from classification and entity extraction to question answering and information retrieval. They achieve high performance across a variety of domains and respond quickly. We have open-sourced the models on Hugging Face for the benefit of the scientific and academic community.

Transformer-based language models, which include BERT, RoBERTa, and IBM's Slate and Granite families of models, are invaluable for a range of natural language understanding tasks. What powers these models is a statistical understanding of how language works. They are trained with a masked language modeling objective, in which the model learns by reconstructing sentences in which some of the words have been obscured. Tokenizers, which break text down into the units the model works with, play a critical role in learning a vast vocabulary. While general-purpose tokenizers trained on datasets like Wikipedia or BooksCorpus are effective for everyday text, scientific domains require specialized tokenizers that can handle terms like "phosphatidylcholine."

We trained our models on 60 billion tokens drawn from a corpus of astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences data. Unlike a generic tokenizer, the one we developed recognizes scientific terms such as "axes" and "polycrystalline." More than half of the 50,000 tokens in our models' vocabulary are not shared with the vocabulary of the open-source RoBERTa model on Hugging Face.

The IBM-NASA models, trained on this domain-specific vocabulary, outperformed the open RoBERTa model by 5% on the popular BLURB benchmark, which evaluates performance on biomedical tasks. They also showed a 2.4% F1 score improvement on an internal scientific question-answering benchmark and a 5.5% improvement on internal Earth science entity recognition tests.

Our trained encoder model can be fine-tuned for many non-generative language tasks, and it can also produce information-rich embeddings for document retrieval through retrieval-augmented generation (RAG). RAG commonly follows a two-step framework: a retriever model first encodes the question and retrieves relevant documents from a vector database; these documents are then passed to a generative model, which answers the question while staying faithful to the retrieved material.
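To make this two-step pattern concrete, here is a minimal retrieve-then-generate sketch. It is illustrative only: the embedding model, the toy document list, and the prompt format are stand-ins rather than the IBM-NASA pipeline.

```python
# Minimal retrieve-then-generate (RAG) sketch: illustrative only.
# Assumption: sentence-transformers is installed, and the public
# all-MiniLM-L6-v2 checkpoint stands in for a domain-tuned retriever.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# A toy "vector database": embed the document collection once up front.
documents = [
    "Phosphatidylcholine is a major component of biological membranes.",
    "The heliosphere is the region of space dominated by the solar wind.",
    "Polycrystalline silicon is widely used in photovoltaic cells.",
]
doc_embeddings = retriever.encode(documents, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Step 1: encode the question and return the most similar documents."""
    q_emb = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    best = scores.topk(top_k).indices.tolist()
    return [documents[i] for i in best]

question = "What region of space does the solar wind dominate?"
context = retrieve(question)

# Step 2: hand the retrieved passages to a generative model. The prompt
# format here is a placeholder; any instruction-tuned LLM could be used.
prompt = f"Answer using only the context.\nContext: {context}\nQuestion: {question}"
print(prompt)  # pass `prompt` to the generator of your choice
```

In practice the document embeddings would live in a vector database, and the retriever would be a domain-tuned encoder like the one described next.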
We built a retriever model on top of our encoder model to produce information-rich embeddings that capture the similarity between pairs of text. Specifically, we optimize a contrastive loss function, pushing the embedding of an anchor text closer to that of a relevant ("positive") document and farther away from that of a random ("negative") document; a short sketch of this kind of objective appears at the end of this post. The retriever was trained on about 268 million text pairs, including titles paired with abstracts and questions paired with answers. As a result, it excels at retrieving relevant passages on a test set of about 400 questions curated by NASA, with a 6.5% improvement over a similarly fine-tuned RoBERTa model and a 5% improvement over BGE-base, another popular open-source embedding model.

The significant gains achieved by our models can be attributed to the specialized training data, the custom tokenizer, and the training methodology. Consistent with IBM and NASA's commitment to open and transparent AI, both models are available on Hugging Face: the encoder model can be further fine-tuned for applications in the space domain, while the retriever model can be used for information retrieval in RAG applications. We are also collaborating with NASA to enhance its science search engine using these models.
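Because both models are released on Hugging Face, loading them should follow the standard transformers workflow. The sketch below is illustrative: this post does not list the model IDs, so the public roberta-base checkpoint stands in for the encoder, and the behavior shown (subword tokenization and mask filling) is generic to any RoBERTa-style encoder.

```python
# Illustrative loading sketch. "roberta-base" is a stand-in: substitute the
# released IBM-NASA encoder's Hugging Face ID, which is not given in this post.
from transformers import AutoTokenizer, pipeline

MODEL_ID = "roberta-base"  # placeholder for the IBM-NASA encoder

# A generic tokenizer fragments rare scientific terms into many subword
# pieces; a domain-specific tokenizer keeps them largely intact.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(tokenizer.tokenize("phosphatidylcholine"))

# The encoder is pretrained with masked language modeling, so it can fill in
# masked tokens out of the box, and it can be fine-tuned for classification,
# entity extraction, or question answering.
fill_mask = pipeline("fill-mask", model=MODEL_ID)
sentence = f"Polycrystalline silicon is used in solar {fill_mask.tokenizer.mask_token}."
for candidate in fill_mask(sentence):
    print(candidate["token_str"], round(candidate["score"], 3))
```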
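Finally, for readers curious about the contrastive objective mentioned above, here is a minimal sketch of training on (anchor, positive, negative) triples. The post does not specify the exact loss or negative-sampling strategy, so a standard triplet margin loss over cosine distance is used as a stand-in, with a dummy encoder for illustration.

```python
# Minimal sketch of a contrastive objective over (anchor, positive, negative)
# triples. The exact loss and mining strategy behind the IBM-NASA retriever
# are not given in this post; a triplet margin loss on cosine distance is
# shown here as a generic stand-in.
import torch
import torch.nn.functional as F

def contrastive_step(encode, anchors, positives, negatives, margin=0.25):
    """One step: pull anchors toward positives, push them away from negatives.

    `encode` maps a list of strings to a (batch, dim) tensor of embeddings,
    e.g. the mean-pooled output of the retriever encoder.
    """
    a = F.normalize(encode(anchors), dim=-1)
    p = F.normalize(encode(positives), dim=-1)
    n = F.normalize(encode(negatives), dim=-1)

    pos_dist = 1 - (a * p).sum(dim=-1)  # cosine distance to the relevant document
    neg_dist = 1 - (a * n).sum(dim=-1)  # cosine distance to the random document
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()

# Toy usage with a dummy encoder that returns random embeddings.
fake_encode = lambda texts: torch.randn(len(texts), 768)
loss = contrastive_step(
    fake_encode,
    anchors=["What drives the solar wind?"],
    positives=["The solar wind is driven by the hot corona expanding into space."],
    negatives=["Polycrystalline silicon is used in photovoltaic cells."],
)
print(loss.item())
```

In real training, `encode` would be the retriever encoder itself, and the negatives would be sampled from the scientific corpus rather than written by hand.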