[HN Gopher] IBM and NASA build language models to make scientifi...
___________________________________________________________________
IBM and NASA build language models to make scientific knowledge
more accessible
Author : rbanffy
Score : 104 points
Date : 2024-03-13 19:54 UTC (3 hours ago)
(HTM) web link (research.ibm.com)
(TXT) w3m dump (research.ibm.com)
| jwuphysics wrote:
| UniverseTBD (https://universetbd.org) is also making great
| strides in the space of large language models and astronomy.
| mncharity wrote:
| Briefly hopeful, I ask astrollama-7b-chat-alpha[1] "What color
| is the Sun?". It replies "The Sun has no color as it emits
| radiation across all wavelengths from ultraviolet to infrared.
| [...] there isn't an answer for what color the Sun truly is
| since it doesn't have one but rather produces every visible
| spectrum imaginable!". Sigh. Hmm, I wonder if LLM replies might
| be usefully mined to generate misconception lists?
|
| [1]
| https://huggingface.co/spaces/universeTBD/astrollama-7b-chat...
| wolverine876 wrote:
| Who wants to learn from anything less than the best sources?
|
| I've often thought that a search engine that indexes only the
| highest quality, probably hand-curated, sources would be highly
| desirable. I'm not really interested in learning from everyone
| about, for example, physics or history or climate change or the
| invasion of Ukraine; I only want the best. I'm not missing out,
| practically: there is far more than enough of the 'best' to
| consume all my time; there's a large opportunity cost to reading
| other things. Choosing the 'best' is somewhat subjective, but it
| is far better than arbitrarily or randomly choosing sources.
|
| LLMs, used for knowledge discovery and retrieval, would seem to
| benefit from the same sources.
| javiramos wrote:
| With Perplexity [0] you can narrow your LLM interactions to
| reference only academic articles or other high-quality sources.
|
| [0] https://www.perplexity.ai/
| Tomte wrote:
| Ignoring the question of whether LLMs can produce the best, the
| idea wouldn't be to cite the best sources, but to train only
| on the best sources.
|
| A garbage hallucination with a link to the Stanford
| Encyclopedia of Philosophy won't help anyone.
| throwanem wrote:
| On the other hand, given the query "facial recognition in
| paper wasps", Perplexity just gave not only an answer in
| accord with my prior understanding derived from reading in
| the field, but also surfaced a paper [1] published two
| months ago that I hadn't yet seen.
|
| I expected less, and I suspect a researcher could easily
| find gaps. But from the perspective of an amateur autodidact,
| that's still a fairly impressive result.
|
| [1] https://www.pnas.org/doi/10.1073/pnas.1918592117
| porphyra wrote:
| That makes sense, but the principle of "more data = more better"
| suggests that training an LLM on all the available data and then
| fine-tuning it to only spit out the best answers might be better
| than training it only on the best data to begin with.
| wolverine876 wrote:
| How will training it on false data, for example, result in
| better output?
| vecinu wrote:
| This might be a naive question but how does one determine what
| "best" is for multiple subjects?
|
| Even in your example, physics and mathematics could be curated
| for "best" when dealing with equations and foundational
| knowledge that has been hardened over decades. For history,
| climate change, or the invasion of Ukraine, isn't that sensitive to
| bias, manipulation and interpretation? These are not exact
| sciences.
| wolverine876 wrote:
| What do you think of how that was addressed in the GP?
| colechristensen wrote:
| You have to spend quite a lot of time thinking about quality
| and values. It becomes impossible as the size of the "best
| slice" you're seeking gets smaller (top half is much easier
| than top ten percent, etc)
|
| If your values are "everyone should agree with my opinions"
| you'll have a garbage biased data set. There are other values
| though. Bias free is also impossible because having a
| definition of a perfectly neutral bias is itself a very
| strong bias.
| animal_spirits wrote:
| "Best" will be chosen by the creators of software for
| specific application uses. Medical software will use the
| "best" medical LLM under the hood. Programming software
| (Copilot et al.) will use the "best" programming LLM.
| General purpose language models will probably still be used
| by the public when doing internet searches. Or, an idea that
| just popped into my head, use a classifier to determine which
| model can most accurately answer the user's query, and send
| the query off to that model for a response.
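The routing idea in the comment above can be sketched as a dispatcher that classifies each query and forwards it to a specialist model. The keyword rules and model names below are hypothetical placeholders; a real router would use a trained classifier rather than keyword matching.

```python
import re

# Hypothetical specialist models and the keywords that route to them.
ROUTES = {
    "medical-llm": {"symptom", "diagnosis", "dosage", "patient"},
    "code-llm": {"python", "compile", "function", "stack", "bug"},
}
DEFAULT_MODEL = "general-llm"

def route(query: str) -> str:
    """Pick the specialist model whose keywords overlap the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for model, keywords in ROUTES.items():
        if words & keywords:
            return model
    return DEFAULT_MODEL
```

For example, `route("How do I fix this Python bug?")` dispatches to the code model, while an unmatched query falls through to the general-purpose model.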
| robrenaud wrote:
| Diversity and quantity are important for LLM training.
|
| A search engine can index more than just "the best sources",
| and show results from the tail when no relevant matches are in
| the best sources.
|
| I would agree with a softer restatement of your thesis,
| though: I am sure there is a lot of diminishing marginal
| utility in search indexing broadly, especially as the web keeps
| getting more and more full of spam and nonsense.
|
| For pre-training LLMs, the quality/quantity/diversity story is
| more nuanced. They do seem to benefit a lot from quantity. For
| a fixed LLM training budget, the choice to train on the same
| high quality documents for more epochs, or to train on lower
| quality but unseen data is an interesting area of research.
| Empirically, the research finds that returns from additional
| epochs on the same data diminish after the fourth pass. All the
| research I've read tends to have an all or nothing flavor to
| data selection. Either it makes it in, and gets processed the
| same number of times, or it doesn't get in at all. There is
| probably some juice in the middle ground, where high quality
| data gets 4x'ed, bad data is still eliminated, but the lesser
| but not terrible data gets in once.
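The middle ground described above could be expressed as a per-document epoch schedule: repeat high-quality data up to the point where returns diminish, include mediocre data once, and drop the worst entirely. The quality-score thresholds here are made up for illustration; only the 4-epoch cap comes from the research the comment cites.

```python
def epochs_for(quality_score: float) -> int:
    """Map a document's quality score (0..1, hypothetical scale)
    to how many training epochs it should appear in."""
    if quality_score >= 0.8:   # high quality: repeat up to the useful limit
        return 4
    if quality_score >= 0.3:   # lesser but not terrible: seen once
        return 1
    return 0                   # bad data: excluded entirely
```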
| wolverine876 wrote:
| Thanks for an informed response!
| staplers wrote:
| a search engine that indexes only the highest quality
|
| Any for-profit search engine eventually loses quality as it
| succumbs to ad spend.
|
| It'd require subsidies to remain profit-neutral (skew towards
| quality). Think Y Combinator and HN.
|
| Even a subscription model will eventually skew towards
| placating the masses with "dumbed down" content.
| wolverine876 wrote:
| > Even a subscription model will eventually skew towards
| placating the masses with "dumbed down" content.
|
| Accuracy and simplicity are not the same. I can see that most
| people won't want to read the Stanford Encyclopedia of
| Philosophy's take on Plato. But anyone can read the
| Associated Press rather than someone's misinfo on the topic.
| Cut out the latter.
| StableAlkyne wrote:
| > I've often thought that a search engine that indexes only the
| highest quality, probably hand-curated, sources would be highly
| desirable
|
| That's what I miss about the old internet, where folks would
| have link pages that were just other cool sites
|
| Sure, discovery was harder, but it was harder to astroturf
| with SEO, too
| kingkongjaffa wrote:
| Is it weird they mentioned these examples and not, OpenAI,
| Anthropic, Gemini etc.?
|
| > Transformer-based language models -- which include BERT,
| RoBERTa, and IBM's Slate and Granite family of models
|
| Why would they not mention the most popular transformer based
| language models?
| hackinthebochs wrote:
| BERT and RoBERTa aren't competing against IBM's products. You
| don't advertise your competitor in your own ad.
| hiddencost wrote:
| IBM isn't competing against Anthropic, OpenAI, or Google.
|
| IBM's business model is to be worse but sell to lots of
| clients because the clients don't know any better.
| fghorow wrote:
| ELI5: Does one need to write code to use these, or is there a
| front-end somewhere?
| bottom999mottob wrote:
| Using pre-trained language models like the encoder and
| retrieval models mentioned [1] typically doesn't require
| writing a lot of code, but there are still a few steps
| involved.
|
| The retrieval model, [2], is hosted on the Hugging Face
| platform. To use it, you can use Hugging Face's Inference API
| to send HTTP requests to their servers and receive responses
| from the model.
|
| Hugging Face's docs [3] provide instructions on how to use the
| Inference API, including code examples in Python and other
| languages. Essentially, you'll need to format your input text
| according to the model's requirements, send an HTTP request to
| the API endpoint, and then process the response.
|
| This does require some basic programming knowledge to interact
| with APIs and handle the requests/responses.
|
| There are some third-party applications and services that
| provide a front-end for accessing pre-trained language models
| like this one, like Hugging Face Spaces, Replicate.ai, and
| Google Colab. However, these often come with additional costs
| or limitations...
|
| Here's a related model by IBM and NASA for geospatial stuff
| [4].
|
| [1] https://research.ibm.com/blog/science-expert-LLM
|
| [2] https://huggingface.co/nasa-impact/nasa-smd-ibm-st
|
| [3]
| https://huggingface.co/docs/huggingface_hub/v0.14.1/en/guide...
|
| [4] https://huggingface.co/ibm-nasa-geospatial
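The steps described in the comment above (format the input, POST to the API endpoint, process the response) can be sketched with just the standard library. The model ID is the one linked in [2]; `HF_TOKEN` is a placeholder for your own Hugging Face access token, and the exact response shape depends on the model's pipeline.

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/nasa-impact/nasa-smd-ibm-st"
HF_TOKEN = "hf_..."  # placeholder: substitute your own access token

def build_request(text: str) -> urllib.request.Request:
    """Format the input as JSON and attach the auth header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
    )

def query(text: str):
    """Send the request and parse the JSON response."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)
```

With a valid token, `query("solar flare precursors")` would return the model's output as parsed JSON; error handling (rate limits, model cold-start 503s) is omitted for brevity.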
| givinguflac wrote:
| This looks great! I'm excited to play with it.
|
| Can anyone point me to a resource on how to load it?
|
| I tried downloading the model into LM Studio on my Mac but it
| seems there is more to be done than just loading it.
|
| Any pointers much appreciated!
| Alifatisk wrote:
| What does IBM contribute in this collaboration? The
| development?
| occamrazor wrote:
| Note that the model is based on RoBERTa and has only 125M
| parameters. It is not competing against any of the new popular
| models, not even small ones like Phi or Gemma.
___________________________________________________________________
(page generated 2024-03-13 23:00 UTC)