[HN Gopher] IBM and NASA build language models to make scientifi...
       ___________________________________________________________________
        
       IBM and NASA build language models to make scientific knowledge
       more accessible
        
       Author : rbanffy
       Score  : 104 points
       Date   : 2024-03-13 19:54 UTC (3 hours ago)
        
 (HTM) web link (research.ibm.com)
 (TXT) w3m dump (research.ibm.com)
        
       | jwuphysics wrote:
       | UniverseTBD (https://universetbd.org) is also making great
       | strides in the space of large language models and astronomy.
        
         | mncharity wrote:
         | Briefly hopeful, I ask astrollama-7b-chat-alpha[1] "What color
         | is the Sun?". It replies "The Sun has no color as it emits
         | radiation across all wavelengths from ultraviolet to infrared.
         | [...] there isn't an answer for what color the Sun truly is
         | since it doesn't have one but rather produces every visible
         | spectrum imaginable!". Sigh. Hmm, I wonder if LLM replies might
         | be usefully mined to generate misconception lists?
         | 
         | [1]
         | https://huggingface.co/spaces/universeTBD/astrollama-7b-chat...
        
       | wolverine876 wrote:
       | Who wants to learn from anything less than the best sources?
       | 
       | I've often thought that a search engine that indexes only the
       | highest quality, probably hand-curated, sources would be highly
        | desirable. I'm not really interested in learning from everyone
       | about, for example, physics or history or climate change or the
       | invasion of Ukraine; I only want the best. I'm not missing out,
       | practically: there is far more than enough of the 'best' to
       | consume all my time; there's a large opportunity cost to reading
       | other things. Choosing the 'best' is somewhat subjective, but it
       | is far better than arbitrarily or randomly choosing sources.
       | 
       | LLMs, used for knowledge discovery and retrieval, would seem to
       | benefit from the same sources.
        
         | javiramos wrote:
          | With Perplexity [0] you can narrow your LLM interactions to only
          | reference academic articles or other high-quality sources.
         | 
         | [0] https://www.perplexity.ai/
        
           | Tomte wrote:
            | Ignoring the question of whether LLMs can produce the best, the
           | idea wouldn't be to cite the best sources, but to train only
           | on the best sources.
           | 
           | A garbage hallucination with a link to the Stanford
           | Encyclopedia of Philosophy won't help anyone.
        
             | throwanem wrote:
             | On the other hand, given the query "facial recognition in
              | paper wasps", Perplexity not only gave an answer in accord
              | with my prior understanding derived from reading in
             | the field, but also surfaced a paper [1] published two
             | months ago that I hadn't yet seen.
             | 
             | I expected less, and I suspect a researcher could easily
              | find gaps. But from the perspective of an amateur autodidact,
             | that's still a fairly impressive result.
             | 
             | [1] https://www.pnas.org/doi/10.1073/pnas.1918592117
        
         | porphyra wrote:
          | That makes sense, but the principle of "more data = more better"
          | suggests that maybe training an LLM on all the available data
          | and then fine-tuning it to spit out only the best answers might
          | be better than training it only on the best data to begin
          | with.
        
           | wolverine876 wrote:
           | How will training it on false data, for example, result in
           | better output?
        
         | vecinu wrote:
         | This might be a naive question but how does one determine what
         | "best" is for multiple subjects?
         | 
         | Even in your example, physics and mathematics could be curated
         | for "best" when dealing with equations and foundational
          | knowledge that has been hardened over decades. For history,
          | climate change, or the invasion of Ukraine, isn't that
          | sensitive to bias, manipulation, and interpretation? These are
          | not exact sciences.
        
           | wolverine876 wrote:
           | What do you think of how that was addressed in the GP?
        
           | colechristensen wrote:
           | You have to spend quite a lot of time thinking about quality
           | and values. It becomes impossible as the size of the "best
           | slice" you're seeking gets smaller (top half is much easier
            | than top ten percent, etc.).
           | 
           | If your values are "everyone should agree with my opinions"
           | you'll have a garbage biased data set. There are other values
           | though. Bias free is also impossible because having a
           | definition of a perfectly neutral bias is itself a very
           | strong bias.
        
           | animal_spirits wrote:
           | "Best" will be chosen by the creators of software for
           | specific application uses. Medical software will use the
           | "best" medical LLM under the hood. Programming software
            | (Copilot et al.) will use the "best" programming LLM.
           | General purpose language models will probably still be used
           | by the public when doing internet searches. Or, an idea that
           | just popped into my head, use a classifier to determine which
           | model can most accurately answer the user's query, and send
           | the query off to that model for a response.
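The routing idea in that last sentence can be sketched with a toy classifier. This is purely illustrative: the model names and keyword lists are made up, and a real system would use a learned classifier rather than keyword overlap.

```python
# Hypothetical sketch: a lightweight classifier routes each query to the
# specialist model most likely to answer it well. Model names and keyword
# sets below are invented for illustration only.

KEYWORDS = {
    "medical-llm": {"symptom", "diagnosis", "dosage", "patient"},
    "code-llm": {"python", "function", "compile", "bug"},
    "general-llm": set(),  # fallback when nothing else matches
}

def route(query: str) -> str:
    """Return the model whose keyword set best overlaps the query."""
    tokens = set(query.lower().split())
    best, best_score = "general-llm", 0
    for model, words in KEYWORDS.items():
        score = len(tokens & words)
        if score > best_score:
            best, best_score = model, score
    return best

print(route("Why does my python function have a bug?"))   # code-llm
print(route("What dosage should the patient take?"))      # medical-llm
print(route("Tell me about the history of Rome"))         # general-llm
```

In practice the "classifier" would itself be a small model scoring the query against each specialist's domain, but the dispatch structure is the same.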
        
         | robrenaud wrote:
         | Diversity and quantity are important for LLM training.
         | 
         | A search engine can index more than just "the best sources",
         | and show results from the tail when no relevant matches are in
         | the best sources.
         | 
          | I would agree with a softer restatement of your thesis,
          | though: I am sure there is a lot of diminishing marginal
          | utility in indexing the web broadly, especially as it keeps
          | getting more and more full of spam and nonsense.
         | 
         | For pre-training LLMs, the quality/quantity/diversity story is
         | more nuanced. They do seem to benefit a lot from quantity. For
         | a fixed LLM training budget, the choice to train on the same
         | high quality documents for more epochs, or to train on lower
         | quality but unseen data is an interesting area of research.
          | Empirically, the research finds that the benefit of additional
          | epochs on the same data starts to diminish after the fourth.
         | research I've read tends to have an all or nothing flavor to
         | data selection. Either it makes it in, and gets processed the
         | same number of times, or it doesn't get in at all. There is
         | probably some juice in the middle ground, where high quality
         | data gets 4x'ed, bad data is still eliminated, but the lesser
         | but not terrible data gets in once.
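The "middle ground" mix described above can be sketched in a few lines. The quality tiers and per-tier epoch counts here are illustrative assumptions (4x for high quality, 1x for lesser-but-usable, 0x for bad), not numbers from any particular paper.

```python
# Toy sketch of a quality-weighted training mix: high-quality documents
# are repeated for 4 epochs, mid-quality documents are seen once, and
# low-quality data is dropped entirely. Tier names and epoch counts are
# assumptions for illustration.

EPOCHS_BY_QUALITY = {"high": 4, "mid": 1, "low": 0}

def build_training_mix(corpus):
    """corpus: list of (document, quality) pairs -> flat list of documents."""
    mix = []
    for doc, quality in corpus:
        mix.extend([doc] * EPOCHS_BY_QUALITY.get(quality, 0))
    return mix

corpus = [("arxiv paper", "high"), ("forum post", "mid"), ("spam page", "low")]
print(build_training_mix(corpus))
# ['arxiv paper', 'arxiv paper', 'arxiv paper', 'arxiv paper', 'forum post']
```

A real pipeline would shuffle the repeats across the run rather than emitting them back to back, but the weighting logic is the point here.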
        
           | wolverine876 wrote:
           | Thanks for an informed response!
        
         | staplers wrote:
          | > a search engine that indexes only the highest quality
         | 
         | Any for-profit search engine eventually loses quality as it
         | succumbs to ad spend.
         | 
         | It'd require subsidies to remain profit-neutral (skew towards
          | quality). Think Y Combinator and HN.
         | 
         | Even a subscription model will eventually skew towards
         | placating the masses with "dumbed down" content.
        
           | wolverine876 wrote:
           | > Even a subscription model will eventually skew towards
           | placating the masses with "dumbed down" content.
           | 
           | Accuracy and simplicity are not the same. I can see that most
           | people won't want to read the Stanford Encyclopedia of
           | Philosophy's take on Plato. But anyone can read the
           | Associated Press rather than someone's misinfo on the topic.
           | Cut out the latter.
        
         | StableAlkyne wrote:
         | > I've often thought that a search engine that indexes only the
          | highest quality, probably hand-curated, sources would be highly
          | desirable
         | 
         | That's what I miss about the old internet, where folks would
         | have link pages that were just other cool sites
         | 
          | Sure, discovery was harder, but it was harder to astroturf
          | with SEO too
        
       | kingkongjaffa wrote:
        | Is it weird that they mentioned these examples and not OpenAI,
        | Anthropic, Gemini, etc.?
       | 
       | > Transformer-based language models -- which include BERT,
       | RoBERTa, and IBM's Slate and Granite family of models
       | 
       | Why would they not mention the most popular transformer based
       | language models?
        
         | hackinthebochs wrote:
          | BERT and RoBERTa aren't competing against IBM's products. You
         | don't advertise your competitor in your own ad.
        
           | hiddencost wrote:
            | IBM isn't competing against Anthropic/OpenAI/Google.
           | 
           | IBM's business model is to be worse but sell to lots of
           | clients because the clients don't know any better.
        
       | fghorow wrote:
       | ELI5: Does one need to write code to use these, or is there a
       | front-end somewhere?
        
         | bottom999mottob wrote:
         | Using pre-trained language models like the encoder and
         | retrieval models mentioned [1] typically doesn't require
         | writing a lot of code, but there are still a few steps
         | involved.
         | 
         | The retrieval model, [2], is hosted on the Hugging Face
         | platform. To use it, you can use Hugging Face's Inference API
         | to send HTTP requests to their servers and receive responses
         | from the model.
         | 
          | Hugging Face's docs [3] provide instructions on how to use the
         | Inference API, including code examples in Python and other
         | languages. Essentially, you'll need to format your input text
         | according to the model's requirements, send an HTTP request to
         | the API endpoint, and then process the response.
         | 
         | This does require some basic programming knowledge to interact
         | with APIs and handle the requests/responses.
         | 
         | There are some third-party applications and services that
         | provide a front-end for accessing pre-trained language models
         | like this one, like Hugging Face Spaces, Replicate.ai, and
         | Google Colab. However, these often come with additional costs
         | or limitations...
         | 
         | Here's a related model by IBM and NASA for geospatial stuff
         | [4].
         | 
         | [1] https://research.ibm.com/blog/science-expert-LLM
         | 
         | [2] https://huggingface.co/nasa-impact/nasa-smd-ibm-st
         | 
         | [3]
         | https://huggingface.co/docs/huggingface_hub/v0.14.1/en/guide...
         | 
         | [4] https://huggingface.co/ibm-nasa-geospatial
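The steps in that comment (format the input, POST to the API endpoint, process the response) follow the standard Inference API pattern. A minimal stdlib-only sketch is below; the token is a placeholder, and the sentence-similarity payload shape is an assumption based on typical sentence-transformer model cards, so check the model card for the exact input format this model expects.

```python
# Minimal sketch of calling the Hugging Face Inference API over HTTP.
# "hf_YOUR_TOKEN" is a placeholder for your own access token; the payload
# shape (source_sentence + candidate sentences) is assumed from the usual
# sentence-similarity pipeline and may differ for this specific model.
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/nasa-impact/nasa-smd-ibm-st"

def build_request(inputs, token="hf_YOUR_TOKEN"):
    """Package the model inputs as an HTTP POST request for the Inference API."""
    data = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_request({
    "source_sentence": "solar wind interactions with the magnetosphere",
    "sentences": ["heliophysics", "exoplanet transit photometry"],
})
# response = urllib.request.urlopen(req)   # uncomment to actually send
# print(json.load(response))               # typically a list of similarity scores
```

Separating request construction from sending keeps the formatting step testable without a network call, which is also a reasonable way to structure a small client.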
        
       | givinguflac wrote:
       | This looks great! I'm excited to play with it.
       | 
       | Can anyone point me to a resource on how to load it?
       | 
       | I tried downloading the model into LM Studio on my Mac but it
       | seems there is more to be done than just loading it.
       | 
       | Any pointers much appreciated!
        
       | Alifatisk wrote:
        | What does IBM contribute in this collaboration? The
       | development?
        
       | occamrazor wrote:
        | Note that the model is based on RoBERTa and has only 125M
        | parameters. It is not competing against any of the new popular
        | models, not even small ones like Phi or Gemma.
        
       ___________________________________________________________________
       (page generated 2024-03-13 23:00 UTC)