[HN Gopher] Large language models generate functional protein se...
       ___________________________________________________________________
        
       Large language models generate functional protein sequences across
       families
        
       Author : samwillis
       Score  : 84 points
       Date   : 2023-05-13 15:36 UTC (7 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | sroussey wrote:
       | 2022, btw
        
         | tomohelix wrote:
         | >Published: 26 January 2023
         | 
          | Well, technically it came out in 2023, but sure, you can argue
          | it was probably in the preprint state since 2022...
        
       | hecanjog wrote:
       | I'm surprised that Salesforce has a research division, and
       | they're working on something like this.
        
         | ipnon wrote:
         | The game theory of useless corporate research departments is
         | that by spinning plates for you they're not building moats for
         | your competitors. There is quite a lot of money to be saved by
         | essentially nerd sniping with large wads of cash.
        
         | sb8244 wrote:
         | I know that Salesforce is committed to the 1% initiative, so
         | maybe this falls into that. 1% can do a lot at their revenue.
        
       | samwillis wrote:
       | PDF: http://cdn.fraserlab.com/publications/2023_madani.pdf
       | 
       | Code: https://github.com/salesforce/progen
        
       | tomohelix wrote:
        | Some notable technical information: it is a 1.2B-parameter
        | model, trained on raw sequences only, that can generate full-
        | length functional proteins of about 150-200 residues (approx.
        | lysozyme size). The generated proteins are very different from
        | native ones (30-40% identity).
       | 
        | The interesting thing about this model is that it also exhibits
        | emergent capabilities. It was trained only on raw sequences but
        | somehow managed to capture information about the functionality
        | and solubility of the folded proteins, and that is reflected in
        | the generated sequences.
       | 
        | Amino acid sequences are just a bunch of jumbled words if you
        | compare them to English. They usually have to go through folding
        | to form proper "sentences" with meaning. I guess you can compare
        | this to "grammar". The model probably managed to learn protein
        | grammar purely by brute force. Now if only we could get a model
        | in the range of 100bln parameters...
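        | 
        | For anyone who wants to poke at it, here is a rough sketch of
        | how sampling from a causal protein LM looks. The checkpoint id
        | and loading path below are my assumptions, not the repo's actual
        | entry point (that lives in the linked salesforce/progen code);
        | this just illustrates the idea:
        | 
        |   import torch
        |   from transformers import AutoTokenizer, AutoModelForCausalLM
        | 
        |   # Hypothetical checkpoint id -- the real weights ship via the
        |   # salesforce/progen GitHub release.
        |   CKPT = "salesforce/progen-small"
        | 
        |   tok = AutoTokenizer.from_pretrained(CKPT)
        |   model = AutoModelForCausalLM.from_pretrained(CKPT)
        |   model.eval()
        | 
        |   # Condition on a short prefix (e.g. a family control tag) and
        |   # sample the rest of the sequence token by token.
        |   prompt = "M"  # purely illustrative prefix
        |   ids = tok(prompt, return_tensors="pt").input_ids
        |   with torch.no_grad():
        |       out = model.generate(ids, do_sample=True, top_p=0.95,
        |                            temperature=0.8, max_length=200)
        |   print(tok.decode(out[0]))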
        
       | lysozyme wrote:
       | When evaluating this work, it's important to remember that the
       | functional labels and protein family assignments on each of the
       | 280 million input sequences were originally assigned by an HMM
        | built from human-curated sequence groups as part of the Pfam
        | project, so the model is predicting a prediction (or perhaps
        | "conditioned on a prediction" would be more accurate).
       | 
        | Furthermore, the authors have to apply a lot of human curation
        | to ensure the sequences they generate are active. First, they
        | pick an easy target. Second, they apply classical bioinformatics
        | techniques by hand to the sequences after they are generated.
        | For example, they manually align them and select those which
        | contain specific important amino acids at specific positions,
        | residues that are present in 100% of functional proteins of that
        | class and are required for function. This is all done by a human
        | bioinformatics expert (or automated) before they test the
        | generated sequences. It is the protein equivalent of cherry-
        | picking great ChatGPT responses and presenting them as if the
        | model only ever made predictions like that.
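        | 
        | To make that screening step concrete, the filter is roughly of
        | this shape (a toy sketch; the alignment columns and required
        | residues below are made up, not the paper's actual ones, and
        | the sequences are assumed to be pre-aligned):
        | 
        |   # Keep only generated sequences that carry the invariant
        |   # catalytic residues at the expected alignment columns.
        |   REQUIRED = {3: "E", 11: "D"}   # column (0-based) -> residue
        | 
        |   def passes_screen(aligned_seq: str) -> bool:
        |       return all(aligned_seq[pos] == aa
        |                  for pos, aa in REQUIRED.items())
        | 
        |   aligned = {                    # toy aligned sequences
        |       "gen_001": "MKAEVL-TSGKDAA",
        |       "gen_002": "MRAQVL-TSGKDAA",
        |   }
        |   keep = [n for n, s in aligned.items() if passes_screen(s)]
        |   print(keep)                    # -> ['gen_001']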
       | 
        | One other comment: in protein science, a sequence with 40%
        | identity to another sequence is not "very different" if it is
        | homologous. Since this model is essentially generating homologs
        | from a particular class, it's no surprise that, at the pairwise
        | amino acid level, the generated sequences show this degree of
        | similarity. Take proteins in any functional family and compare
        | them. They will have the same overall 3-D structure--called their
        | "fold"--yet have pairwise sequence identities much lower than
        | 30-40%. This "degeneracy", the notion that there are many diverse
        | sequences that all fold into the same shape, is both a
        | fundamental empirical observation in protein science and a
        | well-grounded physical theory.
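        | 
        | (For reference, "percent identity" here is just the fraction of
        | matching residues over aligned, ungapped columns; conventions
        | vary on the denominator, but a quick sketch looks like this:)
        | 
        |   def percent_identity(a: str, b: str) -> float:
        |       # a and b are the two rows of a pairwise alignment (same
        |       # length, gaps as '-'); count matches over columns where
        |       # both rows have a residue.
        |       assert len(a) == len(b)
        |       cols = [(x, y) for x, y in zip(a, b)
        |               if x != "-" and y != "-"]
        |       matches = sum(1 for x, y in cols if x == y)
        |       return 100.0 * matches / len(cols)
        | 
        |   print(percent_identity("MKT-AYIAK", "MRTQAYLAK"))  # 75.0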
       | 
       | Not to be negative. I really enjoyed reading this paper and I
       | think the work is important. Some related work by Meta AI is the
       | ESM series of models [1] trained on the same data (the UniProt
       | dataset [2]).
       | 
       | One thing I wonder is about the vocabulary size of this model.
       | The number of tokens is 26 for the 20 amino acids and some
        | extras, whereas for an LLM like Meta's LLaMa the vocab size is
        | 32,000. I wonder how that changes training and inference, and how
        | we can adapt the transformer architecture for this scenario.
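        | 
        | For what it's worth, a character-level protein tokenizer is
        | tiny; something like the sketch below (my guess at the general
        | shape, not the repo's actual code):
        | 
        |   # ~26 symbols: 20 canonical amino acids plus a few ambiguity
        |   # codes and special tokens. Here: 20 + 3 specials = 23.
        |   AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
        |   SPECIALS = ["<pad>", "<bos>", "<eos>"]
        |   vocab = {t: i
        |            for i, t in enumerate(SPECIALS + list(AMINO_ACIDS))}
        | 
        |   def encode(seq: str) -> list[int]:
        |       return ([vocab["<bos>"]] + [vocab[aa] for aa in seq]
        |               + [vocab["<eos>"]])
        | 
        |   print(len(vocab))        # 23, vs. 32,000 for LLaMa-style BPE
        |   print(encode("MKTAYIAK"))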
       | 
       | 1. https://github.com/facebookresearch/esm
       | 
       | 2. https://www.uniprot.org/help/downloads
        
         | tomohelix wrote:
          | I consider all the manual curation to be effectively a form of
          | RLHF that can be imposed automatically later on. We saw how
          | much this can improve a raw LLM by looking at the output of
          | ChatGPT. Without it, the criticism that LLMs are just glorified
          | autocomplete machines isn't that far from reality. In other
          | words, curation is just an expected requirement for LLMs to be
          | effective.
         | 
          | You are probably right that lysozyme is an easy target and may
          | have large sequence variety between homologs, so saying "very
          | different" for 30-40% is not correct. But that is only in the
          | context of biology and protein structure and function. This is
          | an LLM trained on primary sequences only. It doesn't know
          | anything about folds or domains or functional sites (unless I
          | am wrong and those are part of the metadata fed to it during
          | training). Yet it learned enough to generalize to the point
          | that even at only 30-40% identity, it still produces soluble
          | proteins with the same function. I am sure you know that at 40%
          | identity, one protein can be in an entirely different
          | superfamily from another. So it is still an impressively low
          | identity score.
         | 
          | Also, I think it is more appropriate to compare amino acids to
          | letters of the alphabet than to a vocabulary. Domains would
          | probably be the closer equivalent of the LLaMa vocab.
        
       | ajuc wrote:
       | AI generating a virus/prion/whatever that we synthesize without
       | understanding what it does is the easiest way to the bad
       | singularity people were warning us about.
        
       ___________________________________________________________________
       (page generated 2023-05-13 23:00 UTC)