https://research.ibm.com/blog/science-expert-LLM

12 Mar 2024 | Technical note | 2 minute read

IBM and NASA build language models to make scientific knowledge more accessible

By Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Bharath Dandala, Rong Zhang, and Yousef El-Kurdi

In a new collaboration, IBM and NASA created a suite of efficient language models by training on scientific literature. Based on the transformer architecture, these models can be used in a variety of applications, from classification and entity extraction to question answering and information retrieval. They achieve high performance across a variety of domains and respond quickly. We have open-sourced the models on Hugging Face for the benefit of the scientific and academic community.

Transformer-based language models, which include BERT, RoBERTa, and IBM's Slate and Granite families of models, are invaluable for a range of natural language understanding tasks. What powers these models is a statistical understanding of how language works. They are trained with a masked language modeling objective, in which the model learns by reconstructing sentences in which some of the words have been obscured. Tokenizers, which break text down into the units the model works with, play a critical role in learning a vast vocabulary. While general-purpose tokenizers trained on datasets like Wikipedia or BooksCorpus are effective for everyday text, scientific domains require specialized tokenizers that can handle terms like "phosphatidylcholine."

We trained our models on 60 billion tokens drawn from a corpus of astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences data. Unlike a generic tokenizer, the one we developed recognizes scientific terms such as "axes" and "polycrystalline." More than half of the 50,000 tokens in our models' vocabulary are not shared with the vocabulary of the open-source RoBERTa model on Hugging Face.

The IBM-NASA models, trained on this domain-specific vocabulary, outperformed the open RoBERTa model by 5% on the popular BLURB benchmark, which evaluates performance on biomedical tasks. They also showed a 2.4% F1 score improvement on an internal scientific question-answering benchmark and a 5.5% improvement on internal Earth science entity recognition tests.

Our trained encoder model can be fine-tuned for many non-generative language tasks, and it can also produce information-rich embeddings for document retrieval through retrieval-augmented generation (RAG). RAG commonly follows a two-step framework: a retriever model first encodes the question and retrieves relevant documents from a vector database; these documents are then passed to a generative model, which answers the question while staying faithful to the retrieved material.
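To make this two-step pattern concrete, here is a minimal retrieve-then-generate sketch. It is illustrative only: the embedding model, the toy document list, and the prompt format are stand-ins rather than the IBM-NASA pipeline.

```python
# Minimal retrieve-then-generate (RAG) sketch: illustrative only.
# Assumption: sentence-transformers is installed, and the public
# all-MiniLM-L6-v2 checkpoint stands in for a domain-tuned retriever.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# A toy "vector database": embed the document collection once up front.
documents = [
    "Phosphatidylcholine is a major component of biological membranes.",
    "The heliosphere is the region of space dominated by the solar wind.",
    "Polycrystalline silicon is widely used in photovoltaic cells.",
]
doc_embeddings = retriever.encode(documents, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Step 1: encode the question and return the most similar documents."""
    q_emb = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    best = scores.topk(top_k).indices.tolist()
    return [documents[i] for i in best]

question = "What region of space does the solar wind dominate?"
context = retrieve(question)

# Step 2: hand the retrieved passages to a generative model. The prompt
# format here is a placeholder; any instruction-tuned LLM could be used.
prompt = f"Answer using only the context.\nContext: {context}\nQuestion: {question}"
print(prompt)  # pass `prompt` to the generator of your choice
```

In practice the document embeddings would live in a vector database, and the retriever would be a domain-tuned encoder like the one described next.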
We built a retriever model on top of our encoder model to produce information-rich embeddings that capture the similarity between pairs of text. Specifically, we optimize a contrastive loss function, pushing the embedding of an anchor text closer to that of a relevant ("positive") document and farther away from that of a random ("negative") document; a short sketch of this kind of objective appears at the end of this post. The retriever was trained on about 268 million text pairs, including titles paired with abstracts and questions paired with answers. As a result, it excels at retrieving relevant passages on a test set of about 400 questions curated by NASA, with a 6.5% improvement over a similarly fine-tuned RoBERTa model and a 5% improvement over BGE-base, another popular open-source embedding model.

The significant gains achieved by our models can be attributed to the specialized training data, the custom tokenizer, and the training methodology. Consistent with IBM and NASA's commitment to open and transparent AI, both models are available on Hugging Face: the encoder model can be further fine-tuned for applications in the space domain, while the retriever model can be used for information retrieval in RAG applications. We are also collaborating with NASA to enhance its science search engine using these models.
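Because both models are released on Hugging Face, loading them should follow the standard transformers workflow. The sketch below is illustrative: this post does not list the model IDs, so the public roberta-base checkpoint stands in for the encoder, and the behavior shown (subword tokenization and mask filling) is generic to any RoBERTa-style encoder.

```python
# Illustrative loading sketch. "roberta-base" is a stand-in: substitute the
# released IBM-NASA encoder's Hugging Face ID, which is not given in this post.
from transformers import AutoTokenizer, pipeline

MODEL_ID = "roberta-base"  # placeholder for the IBM-NASA encoder

# A generic tokenizer fragments rare scientific terms into many subword
# pieces; a domain-specific tokenizer keeps them largely intact.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(tokenizer.tokenize("phosphatidylcholine"))

# The encoder is pretrained with masked language modeling, so it can fill in
# masked tokens out of the box, and it can be fine-tuned for classification,
# entity extraction, or question answering.
fill_mask = pipeline("fill-mask", model=MODEL_ID)
sentence = f"Polycrystalline silicon is used in solar {fill_mask.tokenizer.mask_token}."
for candidate in fill_mask(sentence):
    print(candidate["token_str"], round(candidate["score"], 3))
```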
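Finally, for readers curious about the contrastive objective mentioned above, here is a minimal sketch of training on (anchor, positive, negative) triples. The post does not specify the exact loss or negative-sampling strategy, so a standard triplet margin loss over cosine distance is used as a stand-in, with a dummy encoder for illustration.

```python
# Minimal sketch of a contrastive objective over (anchor, positive, negative)
# triples. The exact loss and mining strategy behind the IBM-NASA retriever
# are not given in this post; a triplet margin loss on cosine distance is
# shown here as a generic stand-in.
import torch
import torch.nn.functional as F

def contrastive_step(encode, anchors, positives, negatives, margin=0.25):
    """One step: pull anchors toward positives, push them away from negatives.

    `encode` maps a list of strings to a (batch, dim) tensor of embeddings,
    e.g. the mean-pooled output of the retriever encoder.
    """
    a = F.normalize(encode(anchors), dim=-1)
    p = F.normalize(encode(positives), dim=-1)
    n = F.normalize(encode(negatives), dim=-1)

    pos_dist = 1 - (a * p).sum(dim=-1)  # cosine distance to the relevant document
    neg_dist = 1 - (a * n).sum(dim=-1)  # cosine distance to the random document
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()

# Toy usage with a dummy encoder that returns random embeddings.
fake_encode = lambda texts: torch.randn(len(texts), 768)
loss = contrastive_step(
    fake_encode,
    anchors=["What drives the solar wind?"],
    positives=["The solar wind is driven by the hot corona expanding into space."],
    negatives=["Polycrystalline silicon is used in photovoltaic cells."],
)
print(loss.item())
```

In real training, `encode` would be the retriever encoder itself, and the negatives would be sampled from the scientific corpus rather than written by hand.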