[HN Gopher] Nucleotide Transformer: building robust foundation m...
       ___________________________________________________________________
        
       Nucleotide Transformer: building robust foundation models for human
       genomics
        
       Author : bookofjoe
       Score  : 47 points
       Date   : 2024-12-01 22:47 UTC (6 days ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | bilsbie wrote:
       | Cool! I don't understand what the genomic tasks it solves are.
       | What can it actually do?
       | 
       | Also can we train this same model on regular language data so we
       | can converse about the genomes? I suppose a normal multimodal
       | model can talk about what it sees in images in English. Could we
       | have a similar thing with genomes? I.e., DNA is just another
       | modality in a multimodal model.
        
         | iwd wrote:
         | I've been trialing a bunch of these models at work. They
         | basically learn where the DNA has important functions, and what
         | those functions are. It's very approximate, but up to now that's
         | been very hard to do from just the sequence and no other data.
        
           | bilsbie wrote:
           | That's really cool. Can you share any insights the models
           | have given you? My biggest point of confusion is what type of
           | practical things these models can do.
           | 
           | (Or email in profile if you can't share publicly.)
        
           | throwawaymaths wrote:
           | > It's very approximate, but up to now that's been very hard
           | to do from just the sequence and no other data.
           | 
           | the synthetic syn 1.0 project used a promoter search
           | algorithm written in COBOL by one of the leaders. one of the
           | professors on the project had a WordPerfect macro that found
           | protein sequences. point being, they weren't the best
           | programmers in the world. i would hardly say it's been "very
           | hard".
        
           | mbreese wrote:
           | _> from just the sequence and no other data_
           | 
           | This is my real question with these... we already have a
           | _ton_ of other data for genomics. So, many of the important
           | regions are already known and studied. And really, the
           | functional importance of any given region/sequence is highly
           | context/cell type specific. So, given this, what are the use
           | cases? What kind of hypothesis generation can these models
           | lead to that we aren't currently doing in genomics?
        
         | BioGeek wrote:
         | > Also can we train this same model on regular language data so
         | we can converse about the genomes?
         | 
         | Yes! That is what has been done in ChatNT [1] where you can ask
         | natural language questions like "Determine the degradation rate
         | of the human RNA sequence @myseq.fna on a scale from -5 to 5."
         | and ChatNT will answer with "The degradation rate for this
         | sequence is 1.83."
         | 
         | > My biggest point of confusion is what type of practical
         | things these models can do.
         | 
         | See, for example, this notebook [2], where the Nucleotide
         | Transformer is fine-tuned to classify genomic sequences for
         | two of the most basic genomic motifs: promoters and enhancer
         | types.
         | 
         | Disclaimer: I work at InstaDeep but was not involved in either
         | of the above projects.
         | 
         | [1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2
         | [2] https://github.com/huggingface/notebooks/blob/main/examples/...
        
           | hirenj wrote:
           | Possibly a dumb question - but are these models useful for
           | homology finding? If you have two homologous genes, do they
           | have similar embeddings?
           | 
           | The reason I ask is I have a bunch of genes where I can't get
           | much better than a 1:many orthology mapping, and if this
           | method can capture related promoters/intronic regions etc per
           | gene, and tell me if they are related, that would be a huge
           | help (assuming this works on eukaryotic genomes).
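
The embedding comparison asked about above can be sketched roughly as follows. This is a minimal illustration and not a method from the paper or the thread: it assumes you already have per-token embedding vectors for each gene region from some model, averages them into one vector per region, and ranks candidate orthologs by cosine similarity. The vectors and the names `mean_pool`, `cosine_similarity`, and `rank_candidates` are made up for the example. Whether homologous genes actually land close together in a given model's embedding space is an empirical question the thread does not settle.

```python
import math


def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-length vector per sequence."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up vectors standing in for real model output (assumption).
gene_a_tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
query = mean_pool(gene_a_tokens)

candidates = {
    "candidate_1": [4.0, 0.0, 2.0],  # points the same way as the query
    "candidate_2": [0.0, 5.0, 0.0],  # orthogonal to the query
}
ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For a 1:many orthology mapping, the same ranking could be run once per region type (promoter, intron, and so on) and the per-region scores compared, though any similarity threshold would need to be calibrated against gene pairs whose relationship is already known.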
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:01 UTC)