[HN Gopher] Nucleotide Transformer: building robust foundation m...
___________________________________________________________________
Nucleotide Transformer: building robust foundation models for human
genomics
Author : bookofjoe
Score : 47 points
Date : 2024-12-01 22:47 UTC (6 days ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| bilsbie wrote:
| Cool! I don't understand what the genomic tasks it solves are.
| What can it actually do?
|
| Also can we train this same model on regular language data so we
| can converse about the genomes? I suppose a normal multimodal
| model can talk about what it sees in images in English. Could we
| have a similar thing with genomes? I.e., DNA is just another
| modality in a multimodal model.
| iwd wrote:
| I've been trialing a bunch of these models at work. They
| basically learn where the DNA has important functions, and what
| those functions are. It's very approximate, but up to now that's
| been very hard to do from just the sequence and no other data.
| bilsbie wrote:
| That's really cool. Can you share any insights the models
| have given you? My biggest point of confusion is what type of
| practical things these models can do.
|
| (Or email in profile if you can't share publicly)
| throwawaymaths wrote:
| > It's very approximate, but up to now that's been very hard
| to do from just the sequence and no other data.
|
| the synthetic syn 1.0 project used a promoter search
| algorithm written in cobol by one of the leaders. one of the
| professors on the project had a wordperfect macro that found
| protein sequences, point being they weren't the best
| programmers in the world. i would hardly say it's been "very
| hard"
| mbreese wrote:
| _> from just the sequence and no other data_
|
| This is my real question with these... we already have a
| _ton_ of other data for genomics. So, many of the important
| regions are already known and studied. And really, the
| functional importance of any given region/sequence is highly
| context/cell type specific. So, given this, what are the use
| cases? What kind of hypothesis generation can these models
| lead to that we aren't currently doing in genomics?
| BioGeek wrote:
| > Also can we train this same model on regular language data so
| we can converse about the genomes?
|
| Yes! That is what has been done in ChatNT [1] where you can ask
| natural language questions like "Determine the degradation rate
| of the human RNA sequence @myseq.fna on a scale from -5 to 5."
| and ChatNT will answer with "The degradation rate for this
| sequence is 1.83."
|
| > My biggest point of confusion is what type of practical
| things these models can do.
|
| See for example this notebook [2], where the Nucleotide
| Transformer is fine-tuned to classify genomic sequences for two
| of the most basic genomic motifs: promoters and enhancer
| types.
|
| Disclaimer: I work at InstaDeep but was not involved in either
| of the above projects.
|
| [1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2
| [2] https://github.com/huggingface/notebooks/blob/main/examples/...
| hirenj wrote:
| Possibly a dumb question - but are these models useful for
| homology finding? If you have two homologous genes, do they
| have similar embeddings?
|
| The reason I ask is I have a bunch of genes where I can't get
| much better than a 1:many orthology mapping, and if this
| method can capture related promoters/intronic regions etc per
| gene, and tell me if they are related, that would be a huge
| help (assuming this works on eukaryotic genomes).
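|
| A minimal sketch of the comparison being asked about, under loud
| assumptions: `kmer_embedding` is a toy stand-in (overlapping k-mer
| counts), not the Nucleotide Transformer or its API, and the
| sequences are made up. With a real model you would swap in its
| per-sequence embedding and keep the cosine-similarity step. Whether
| homologous genes actually land close together in the model's
| embedding space is the open question here, not something this
| sketch shows.

```python
from collections import Counter
from math import sqrt


def kmer_embedding(seq: str, k: int = 3) -> Counter:
    """Toy stand-in for a model embedding: counts of overlapping k-mers.

    Hypothetical helper for illustration only; a real pipeline would
    return the foundation model's embedding vector for `seq` instead.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sequence has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up sequences: gene_b differs from gene_a by one substitution,
# while `unrelated` shares no 3-mers with gene_a.
gene_a = "ATGGCGTACGTTAGCCTGAAGGCT"
gene_b = "ATGGCGTACGTTAGCATGAAGGCT"
unrelated = "TTTTTTTTTTTTTTTTTTTTTTTT"

emb_a = kmer_embedding(gene_a)
print("a vs b:        ", cosine_similarity(emb_a, kmer_embedding(gene_b)))
print("a vs unrelated:", cosine_similarity(emb_a, kmer_embedding(unrelated)))
```

| For a 1:many orthology mapping, the same idea would be to embed
| each candidate (or its promoter/intronic regions) and rank the
| candidates by similarity to the query gene.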
___________________________________________________________________
(page generated 2024-12-07 23:01 UTC)