[HN Gopher] Large language models generate functional protein se...
___________________________________________________________________
Large language models generate functional protein sequences across
families
Author : samwillis
Score : 84 points
Date : 2023-05-13 15:36 UTC (7 hours ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| sroussey wrote:
| 2022, btw
| tomohelix wrote:
| >Published: 26 January 2023
|
| Well, technically it came out in 2023, but sure, you can argue
| it was probably in preprint since 2022...
| hecanjog wrote:
| I'm surprised that Salesforce has a research division, and
| they're working on something like this.
| ipnon wrote:
| The game theory of useless corporate research departments is
| that by spinning plates for you they're not building moats for
| your competitors. There is quite a lot of money to be saved by
| essentially nerd sniping with large wads of cash.
| sb8244 wrote:
| I know that Salesforce is committed to the 1% initiative, so
| maybe this falls into that. 1% can do a lot at their revenue.
| samwillis wrote:
| PDF: http://cdn.fraserlab.com/publications/2023_madani.pdf
|
| Code: https://github.com/salesforce/progen
| tomohelix wrote:
| Some notable technical information: it is a 1.2bln-param model,
| trained on raw sequences only, and can generate full-length
| functional proteins of about 150-200 residues (approx. lysozyme
| size). The generated proteins are very different from native
| ones (30-40% similarity).
|
| The interesting thing about this model is that it also exhibits
| emergent capabilities. It was trained only on raw sequences, but
| somehow managed to capture information about the functionality
| and solubility of the folded proteins, and then implemented that
| in the generated sequences.
|
| Amino acid sequences are just a bunch of jumbled words if you
| compare them to English. They usually have to go through folding
| to form proper "sentences" with meaning. I guess you can compare
| this to "grammar". This is probably how the model managed to
| learn protein grammar purely by brute force. Now if only we can
| get a model in the range of 100bln parameters...
| lysozyme wrote:
| When evaluating this work, it's important to remember that the
| functional labels and protein family assignments on each of the
| 280 million input sequences were originally assigned by an HMM
| using human-curated sequence groups as part of the Pfam
| project, so the model is predicting a prediction (or perhaps
| conditioned on a prediction would be more accurate).
|
| Furthermore, the authors must engage in a lot of human curation
| to ensure the sequences they generate are active. First, they pick
| an easy target. Second, they employ by-hand classical
| bioinformatics techniques on their predicted sequences after they
| are generated. For example, they manually align them and select
| those which contain specific important amino acids at specific
| positions which are present in 100% of functional proteins of
| that class, and are required for function. This is all done by a
| human bioinformatics expert (or automated) before they test the
| generated sequences. This is the protein equivalent of cherry-
| picking great examples of, for example, ChatGPT responses and
| presenting them as if the model only made predictions like that.
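The by-hand screening step described above can be sketched as a simple filter over pre-aligned sequences. This is a minimal illustration, not the authors' actual pipeline; the alignment columns and required residues below are invented stand-ins, not real lysozyme conservation data:

```python
# Hypothetical sketch of a conserved-residue filter: keep only generated
# sequences that carry the required amino acids at fixed alignment
# columns. Column indices and residues here are invented for illustration.

REQUIRED = {34: "E", 52: "D"}  # alignment column (0-based) -> residue

def passes_conservation_filter(aligned_seq: str, required=REQUIRED) -> bool:
    """True iff every required column holds its required residue."""
    return all(
        pos < len(aligned_seq) and aligned_seq[pos] == aa
        for pos, aa in required.items()
    )

# Toy aligned candidates: the first keeps both required residues; the
# second has the column-34 glutamate mutated and would be discarded.
candidates = [
    "A" * 34 + "E" + "A" * 17 + "D" + "A" * 10,
    "A" * 34 + "Q" + "A" * 17 + "D" + "A" * 10,
]
kept = [s for s in candidates if passes_conservation_filter(s)]
```

In practice this kind of check runs on a multiple-sequence alignment of the generated sequences against known family members, which is exactly where the human expertise comes in.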
|
| One other comment, in protein science, a sequence with 40%
| identity to another sequence is not "very different" if it is
| homologous. Since this model is essentially generating homologs
| from a particular class, it's no surprise that, at a pairwise
| amino acid level, the generated sequences have this degree of
| similarity. Take proteins in any functional family and compare
| them. They will have the same overall 3-D structure--called their
| "fold"--yet have pairwise sequence identities much lower than
| 30-40%. This "degeneracy", the notion that there are many diverse
| sequences that all fold into the same shape, is both a
| fundamental empirical observation in protein science as well as a
| grounded physical theory.
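For concreteness, pairwise percent identity over an alignment can be computed in a few lines. This is a minimal sketch; real pipelines use tools like BLAST or EMBOSS needle, and conventions differ on how gap columns are counted:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned, equal-length sequences.
    Columns where either sequence has a gap ('-') are skipped here;
    other tools may count gaps differently."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

# Two toy aligned fragments: 4 matches out of 6 gap-free columns (~66.7%).
identity = percent_identity("MKV-LST", "MRVALSA")
```

At this granularity, two homologs from the same fold family can easily score below 30% while still sharing structure and function, which is the degeneracy point above.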
|
| Not to be negative. I really enjoyed reading this paper and I
| think the work is important. Some related work by Meta AI is the
| ESM series of models [1] trained on the same data (the UniProt
| dataset [2]).
|
| One thing I wonder is about the vocabulary size of this model.
| The number of tokens is 26 for the 20 amino acids and some
| extras, whereas for a LLM like Meta's LLaMa the vocab size is
| 32,000. I wonder how that changes training and inference, and
| how we can adapt the transformer architecture to this scenario.
|
| 1. https://github.com/facebookresearch/esm
|
| 2. https://www.uniprot.org/help/downloads
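A residue-level tokenizer makes the vocabulary contrast concrete. The special tokens below are a guess at the shape of such a vocab (the paper's actual 26-token set may differ); the point is that the embedding table shrinks enormously versus a 32,000-entry subword vocab, while sequences tokenize longer because one token covers one residue:

```python
# Sketch of a character-level protein tokenizer with a ~26-symbol vocab.
# The special tokens are hypothetical, chosen only for illustration.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>", "-", "X"]  # 6 extras

vocab = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode(seq: str) -> list[int]:
    """Map each residue character to its token id; unknowns -> <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in seq]

ids = encode("MKVLST")

# With d_model = 1024, the embedding table is 26 x 1024 parameters,
# versus 32,000 x 1024 for a LLaMA-sized subword vocab: over 1000x
# smaller, so almost all capacity shifts into the transformer layers.
```

One practical consequence: with so few symbols, the softmax over the vocabulary is cheap, and nearly all of the model's 1.2bln parameters sit in the attention and MLP blocks rather than the embeddings.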
| tomohelix wrote:
| I consider all the manual curation to be effectively a form of
| RLHF that could be automated later on. We saw how much
| this can improve a raw LLM by looking at the output of ChatGPT.
| Otherwise, the criticism of LLMs being just glorified
| autocomplete machines isn't that far from reality. In other
| words, it is just an expected requirement for LLMs to be
| effective.
|
| You are probably right that lysozyme is an easy target and may
| have large sequence variety between homologs, so saying "very
| different" for 30-40% is not correct. But that is only in the
| context of biology and protein structure and function. This is
| an LLM trained on primary sequences only. It doesn't know
| anything about the folds or domains or functional sites (unless
| I am wrong and those are part of the metadata fed to it during
| training). Yet it did learn enough to generalize to the point
| that even with only 30-40% identity, it still produces soluble
| proteins with the same function. I am sure you know that at a
| 40% difference, one protein can be in an entirely different
| superfamily from another. So it is still an impressively low
| identity score.
|
| Also, I think it is more appropriate to compare the amino acids
| to the alphabet than to vocabulary. Domains would probably be
| the equivalent of LLaMa's vocab.
| ajuc wrote:
| AI generating a virus/prion/whatever that we synthesize without
| understanding what it does is the easiest way to the bad
| singularity people were warning us about.
___________________________________________________________________
(page generated 2023-05-13 23:00 UTC)