[HN Gopher] AlphaFold reveals the structure of the protein universe
___________________________________________________________________
AlphaFold reveals the structure of the protein universe
Author : MindGods
Score : 376 points
Date : 2022-07-28 11:18 UTC (5 hours ago)
(HTM) web link (www.deepmind.com)
(TXT) w3m dump (www.deepmind.com)
| codedokode wrote:
| Today I learned that there are bacteria that have a protein
| helping to form ice on plants [1] to destroy them and extract
| nutrients (however I didn't understand how bacteria themselves
| survive this).
|
| Machine learning typically uses existing data to predict new
| data. Please explain: Does it mean that AlphaFold can only use
| known types of interactions between atoms and will mispredict the
| structure of proteins that use not yet known interactions?
|
| And why we cannot just simulate protein behaviour and
| interactions using quantum mechanics?
|
| [1] https://pubs.acs.org/doi/10.1021/acs.jpcb.1c09342
| flobosg wrote:
| > And why we cannot just simulate protein behaviour and
| interactions using quantum mechanics?
|
| QM calculations have been done in proteins, but they're
| computationally very expensive. IIRC, there are hybrid
| approaches where only a small portion of interest in the
| protein structure is modelled by QM and the rest by classical
| molecular mechanics.
| beanwood wrote:
| >And why we cannot just simulate protein behaviour and
| interactions using quantum mechanics?
|
| If you wanted to simulate the behaviour of an entire protein
| using quantum mechanics, the sheer number of calculations
| required would be infeasible.
|
| For what it's worth, I have a background in computational
| physics and am studying a PhD in structural biology. For any
| system (of any size) that you want to simulate, you have to
| consider how much information you're willing to 'ignore' in
| order to focus on the information you would like to 'get out'
| of a set of simulations. Being aware of the approximations you
| make and how this impacts your results is crucial.
|
| For example, if I am interested in how the electrons of a group
| of Carbon atoms (radius ~ 170 picometres) behave, I may want to
| use Density Functional Theory (DFT), a quantum mechanical
| method.
|
| For a single, small protein (e.g. ubiquitin, radius ~ 2
| nanometres), I may want to use atomistic molecular dynamics
| (AMD), which models the motion of every single atom in response
| to thermal motion, electrostatic interactions, etc using
| Newton's 2nd law. Electron/proton detail has been approximated
| away to focus on overall atomic motion.
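|
| A minimal sketch of what "the motion of every atom using Newton's
| 2nd law" looks like in code, assuming a single particle on a
| harmonic spring in reduced units and the velocity Verlet integrator
| (a common choice in MD codes). The function names and parameters
| are illustrative, not taken from any particular MD package.
|
```python
def velocity_verlet(x, v, mass, k, dt, n_steps):
    """Integrate m * a = -k * x for one particle in 1D (reduced units)."""
    a = -k * x / mass  # Newton's 2nd law: a = F / m
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt  # position update
        a_new = -k * x / mass               # force at the new position
        v = v + 0.5 * (a + a_new) * dt      # velocity update with averaged acceleration
        a = a_new
    return x, v


def total_energy(x, v, mass, k):
    """Kinetic plus harmonic potential energy."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, mass=1.0, k=1.0, dt=0.01, n_steps=1000)
drift = abs(total_energy(x1, v1, 1.0, 1.0) - total_energy(x0, v0, 1.0, 1.0))
print(f"energy drift after 1000 steps: {drift:.2e}")
```
|
| A real force field replaces the single spring with bonded and
| non-bonded terms summed over all atoms, but the update loop has the
| same shape.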
|
| In my line of work, we are interested in how big proteins (e.g.
| the dynein motor protein, ~ 40 nanometres in length) move
| around and interact with other proteins at longer time (micro-
| to millisecond) and length (nano- to micrometre) scales than
| DFT or AMD. We 'coarse-grain' protein structures by
| representing groups of atoms as tetrahedra in a continuous mesh
| (continuum mechanics). We approximate away atomic detail to
| focus on long-term motion of the whole protein.
|
| Clearly, it's not feasible to calculate the movement of dynein
| for hundreds of nanoseconds using DFT! The motor domain alone
| in dynein contains roughly one million atoms (and it has
| several more 'subunits' attached to it). Assuming these are
| mostly Carbon, Oxygen or Nitrogen, then you're looking at
| around ten million electrons in your DFT calculations, for a
| single step in time (rounding up). If you're dealing with the
| level of atomic bonds, you're probably going to use time steps
| between a femto- (10^-15 s) and a picosecond (10^-12 s). The
| numbers get a bit ridiculous. There are techniques that combine
| QM and AMD, although I am not too knowledgeable in this area.
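|
| As a rough sketch of that arithmetic: the atom count is the figure
| quoted above, while the ~7 electrons per atom, the 100 ns run
| length and the 1 fs time step are assumptions picked for
| illustration.
|
```python
ATOMS = 1_000_000        # motor domain size quoted above
ELECTRONS_PER_ATOM = 7   # assumption: average of C (6), N (7) and O (8)
SIM_LENGTH_S = 100e-9    # assumption: 100 ns, the low end of "hundreds of nanoseconds"
TIME_STEP_S = 1e-15      # assumption: 1 fs, the short end of the quoted range

electrons = ATOMS * ELECTRONS_PER_ATOM
steps = round(SIM_LENGTH_S / TIME_STEP_S)

print(f"electrons per DFT calculation: {electrons:,}")
print(f"time steps needed:             {steps:,}")
```
|
| That is on the order of 10^7 electrons, recomputed on the order of
| 10^8 times.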
|
| Some further reading, if you're interested (I find Wikipedia
| articles on these topics to generally be quite good):
|
| DFT: https://en.wikipedia.org/wiki/Density_functional_theory
|
| Biological continuum mechanics:
| https://doi.org/10.1371/journal.pcbi.1005897
|
| Length scales in biological simulations:
| https://doi.org/10.1107/S1399004714026777
|
| Electronic time scales:
| https://www.pnas.org/doi/10.1073/pnas.0601855103
| kache_ wrote:
| This is an incredible gift to humanity. A huge positive impact.
| The team should be proud
| carbocation wrote:
| The press release is a bit difficult to place into historical
| context. I believe that the first AlphaFold release was mostly
| human and mouse proteins, and this press release marks the
| release of structures for additional species.
| azangru wrote:
| > I believe that the first AlphaFold release was mostly human
| and mouse proteins,
|
| More than that. The press release actually contains an
| infographic comparing the amount of published protein models
| for different clades of organisms. The infographic shows that
| the previous release (~1mln proteins) contained proteins of
| some animal, plant, bacterial, and fungal species.
| cm2187 wrote:
| How do you know that the predicted structure will be correct? I
| presume researchers will need to validate the structure
| empirically. Do we know how good the model has been at predicting
| so far?
| donut2d wrote:
| A fun way I've been thinking about all this is what
| nanotech/nanobots are actually going to look like. Tiny little
| protein machines doing what they've been doing since the dawn of
| life. We now have a library of components, and as we start
| figuring out what they can do, and how to stack them, we can
| start building truly complex machinery for whatever crazy tasks
| we can imagine. The impact goes so far beyond drugs and
| treatments.
| candiddevmike wrote:
| Is folding@home obsolete now?
| foxhop wrote:
| I would say no, the two approaches may be used to validate each
| other.
| flobosg wrote:
| Folding@home answers a related but different question. While
| AlphaFold returns the picture of a folded protein in its most
| energetically stable conformation, Folding@home returns a video
| of the protein undergoing folding, traversing its energy
| landscape.
| dekhn wrote:
| It's not, but the question is (and has long been) whether the
| energy expended by folding@home is worth the scientific result.
| IMHO- probably not.
| seventytwo wrote:
| Good question... I'd imagine that other methods of folding
| solutions are still valuable, because AlphaFold needs to be
| checked.
| klemola wrote:
| As an aside, the protein structure visualizations in the article
| are pretty. Is there a good source for more?
| alphabetting wrote:
| https://alphafold.ebi.ac.uk/
| flobosg wrote:
| * https://pdb101.rcsb.org/motm/
|
| * https://ccsb.scripps.edu/goodsell/
|
| * https://pdb101.rcsb.org/sci-art/geis-archive/irving-geis
|
| * https://www.digizyme.com/portfolio.html
|
| * https://www.drewberry.com/
|
| * https://biochem.web.utah.edu/iwasa/projects.html
|
| * http://onemicron.com/
|
| * The art of Jane Richardson, of which I couldn't find a link
|
| * This blog has plenty of good links:
| https://blogs.oregonstate.edu/psquared/
| yuan43 wrote:
| > Today, I'm incredibly excited to share the next stage of this
| journey. In partnership with EMBL's European Bioinformatics
| Institute (EMBL-EBI), we're now releasing predicted structures
| for nearly all catalogued proteins known to science, which will
| expand the AlphaFold DB by over 200x - from nearly 1 million
| structures to over 200 million structures - with the potential to
| dramatically increase our understanding of biology.
|
| And later:
|
| > Today's update means that most pages on the main protein
| database UniProt will come with a predicted structure. All 200+
| million structures will also be available for bulk download via
| Google Cloud Public Datasets, making AlphaFold even more
| accessible to scientists around the world.
|
| This is the actual announcement.
|
| UniProt is a large database of protein structure and function.
| The inclusion of the predicted structures alongside the
| experimental data makes it easier to include the predictions in
| workflows already set up to work with the other experimental and
| computed properties.
|
| It's not completely clear from the article whether any of the
| 200+ million predicted structures deposited to UniProt have not
| been previously released.
|
| Protein structure determines function. Before AlphaFold,
| experimental structure determination was the only option, and
| that's very costly. AlphaFold's predictions appear to be good
| enough to jumpstart investigations without an experimental
| structure determination. That has the potential to accelerate
| many areas of science and could percolate up to therapeutics.
|
| One area that doesn't get much discussion in the press is the
| difference between solid state structure and solution state
| structure. It's possible to obtain a solid state structure
| determination (x-ray) that has nothing to do with actual behavior
| in solution. Given that AlphaFold was trained to a large extent
| on solid state structures, it could be propagating that bias into
| its predicted structures.
|
| This paper talks about that:
|
| > In the recent Critical Assessment of Structure Prediction
| (CASP) competition, AlphaFold2 performed outstandingly. Its worst
| predictions were for nuclear magnetic resonance (NMR) structures,
| which has two alternative explanations: either the NMR structures
| were poor, implying that Alpha-Fold may be more accurate than
| NMR, or there is a genuine difference between crystal and
| solution structures. Here, we use the program Accuracy of NMR
| Structures Using RCI and Rigidity (ANSURR), which measures the
| accuracy of solution structures, and show that one of the NMR
| structures was indeed poor. We then compare Alpha-Fold
| predictions to NMR structures and show that Alpha-Fold tends to
| be more accurate than NMR ensembles. There are, however, some
| cases where the NMR ensembles are more accurate. These tend to be
| dynamic structures, where Alpha-Fold had low confidence. We
| suggest that Alpha-Fold could be used as the model for NMR-
| structure refinements and that Alpha-Fold structures validated by
| ANSURR may require no further refinement.
|
| https://pubmed.ncbi.nlm.nih.gov/35537451/
| sseagull wrote:
| > Before AlphaFold, experimental structure determination was
| the only option
|
| Other computational methods have existed for a long time.
| Folding@home was founded 22 years ago.
| dekhn wrote:
| folding@home doesn't predict structures, it simulates protein
| folding. Different area with some overlap.
| COGlory wrote:
| Before my comment gets dismissed, I will disclaim I am a
| professional structural biologist that works in this field every
| day.
|
| These threads are always the same: lots of comments about protein
| folding, how amazing DeepMind is, how AlphaFold is a success
| story, how it has flipped an entire field on its head, etc. The
| language from Google is so deceptive about what they've actually
| done, I think it's actually intentionally disingenuous.
|
| At the end of the day, AlphaFold is amazing homology modeling. I
| love it, I think it's an awesome application of machine learning,
| and I use it frequently. But it's doing the same thing we've been
| doing for 2 decades: pattern matching sequences of proteins with
| unknown structure to sequences of proteins with known structure,
| and about 2x as well as we used to be able to.
|
| That's extremely useful, but it's not knowledge of protein
| folding. It can't predict a fold de novo, it can't predict folds
| that haven't been seen (EDIT: this is maybe not strictly true,
| depending on how you slice it), it fails in a number of edge
| cases (remember, in biology, edge cases are everything) and
| again, I can't stress this enough, we have no new information on
| how proteins fold. We know all the information (most of it at
| least) for a protein's final fold is in the sequence. But we don't know
| much about the in-between.
|
| I like AlphaFold, it's convenient and I use it (although for
| anything serious or anything interacting with anything else, I
| still need a real structure), but I feel as though it has been
| intentionally and deceptively oversold. There are 3-4 other deep
| learning projects I think have had a much greater impact on my
| field.
|
| EDIT: See below: https://news.ycombinator.com/item?id=32265662
| for information on predicting new folds.
| flobosg wrote:
| > AlphaFold is amazing homology modeling
|
| If it is homology modelling, then how can it work without input
| template structures?
| COGlory wrote:
| It has template structures. AlphaFold uses the following
| databases: BFD, MGnify,
| PDB70, PDB (structures in the mmCIF format),
| PDB seqres - only for AlphaFold-Multimer, Uniclust30,
| UniProt - only for AlphaFold-Multimer, UniRef90.
| flobosg wrote:
| Those databases are used to derive the evolutionary
| couplings and distance matrices used by the algorithm.
| Several of those databases aren't even structural ones.
| Furthermore, AlphaFold can function with only a MSA as an
| input, without retrieving a single PDB coordinate.
| COGlory wrote:
| It's all about boosting signal by finding other proteins
| that are similar, until you get to the point that you can
| identify a fold to assign to a region of the protein.
| That's why some are structural, and some are not.
|
| >Furthermore, AlphaFold can function with only a MSA as
| an input, without retrieving a single PDB coordinate.
|
| Yes, it has a very nice model of what sequences should
| look like in 3D. That model is derived from experimental
| data. So if I give AlphaFold an MSA of a new, unknown
| protein fold (substantively away from any known fold), it
| cannot predict it.
| flobosg wrote:
| > Yes, it has a very nice model of what sequences should
| look like in 3D.
|
| A structural model, you would say.
|
| > That model is derived from experimental data.
|
| That doesn't make it a template-based model, or a
| homology one.
|
| > if I give AlphaFold an MSA of a new, unknown protein
| fold (substantively away from any known fold), it cannot
| predict it
|
| That will depend on the number of effective sequences
| found to derive couplings. Domains with novel folds
| usually have a low number of remotely homologous sequences
| and for that reason the method will fail, not just
| because they are novel.
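|
| A minimal sketch of one common way to count "effective sequences"
| in an MSA: weight each sequence by one over the number of sequences
| at or above some identity cutoff to it, then sum the weights. The
| 80% cutoff, the function name and the toy alignment are assumptions
| for illustration; this is not claimed to be AlphaFold's exact
| definition.
|
```python
def effective_sequences(msa, identity_cutoff=0.8):
    """Sum of 1 / (cluster size) over all rows of an aligned, equal-length MSA."""

    def identity(a, b):
        # Fraction of aligned columns where the two rows agree.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    total = 0.0
    for row in msa:
        # Count rows (including this one) that are near-duplicates of `row`.
        neighbours = sum(1 for other in msa if identity(row, other) >= identity_cutoff)
        total += 1.0 / neighbours
    return total


toy_msa = [
    "ACDEFGHIKL",
    "ACDEFGHIKL",  # exact duplicate of the first row
    "ACDEFGHIKV",  # 90% identical to the first two rows
    "MNPQRSTVWY",  # unrelated row
]
print(effective_sequences(toy_msa))
```
|
| Four rows collapse to roughly two effective sequences here, which
| is the sense in which a deep but redundant alignment can still
| carry little coupling signal.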
| zack-m wrote:
| Yup. It's great, but there are still many aspects to unpack and
| work on. Hence why Rosetta is a thing.
| flobosg wrote:
| Rosetta methods are also moving towards ML. Here's an article
| from last week:
| https://www.science.org/doi/10.1126/science.abn2100
| [deleted]
| Lloydksk wrote:
| I mean like whats this about AlphaFold is gone
| adamsmith143 wrote:
| > it can't predict folds that haven't been seen
|
| This seems strange to me. The entire point of these types of
| models is to predict things on unseen data. Are you saying
| Deepmind is completely lying about their model?
|
| Deepmind solved CASP, isn't the entire point of that
| competition to predict unseen structures?
|
| If AlphaFold doesn't predict anything then what are you using
| it to do?
| COGlory wrote:
| AlphaFold figures out that my input sequence (which has no
| structural data) is similar to this other protein that has
| structural data. Or maybe different parts of different
| proteins. It does this extremely well.
| flobosg wrote:
| This is a gross misrepresentation of the method.
| COGlory wrote:
| Perhaps you'd care to explain how? AlphaFold does not
| work on new folds. It ultimately relies on mapping
| sequence to structure. It does it better than anyone
| else, and in ways a human probably couldn't, but if you
| give it a brand new fold with no relation to other folds,
| it cannot predict it. I routinely see areas of extremely low
| confidence in many of my AlphaFold models. I work in
| organisms that have virtually 0 sequence identity. This
| is a problem I deal with every day. I wish AlphaFold
| worked in the way you are suggesting, but it just flat
| out does not.
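|
| For anyone who wants to look for those low-confidence regions:
| AlphaFold model files carry the per-residue confidence score
| (pLDDT, 0-100) in the B-factor column. A minimal sketch, assuming
| PDB-format text with the standard fixed-width ATOM columns and a
| cutoff of 50 (the "very low" band on AlphaFold DB); the function
| name and the synthetic records are illustrative.
|
```python
def low_confidence_residues(pdb_text, threshold=50.0):
    """Return residue numbers (in file order) whose C-alpha pLDDT is below threshold."""
    flagged = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":  # columns 13-16: atom name
            continue
        res_seq = int(line[22:26])       # columns 23-26: residue number
        plddt = float(line[60:66])       # columns 61-66: B-factor field, holding pLDDT
        if plddt < threshold:
            flagged.append(res_seq)
    return flagged


# Two synthetic C-alpha records laid out in the fixed-width ATOM format.
TEMPLATE = "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
demo = "\n".join([
    TEMPLATE.format(1, "CA", "MET", "A", 1, 0.0, 0.0, 0.0, 1.0, 92.5),
    TEMPLATE.format(2, "CA", "SER", "A", 2, 3.8, 0.0, 0.0, 1.0, 31.2),
])
print(low_confidence_residues(demo))
```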
| dekhn wrote:
| No organisms have virtually 0 sequence identity. That's
| nonsense. Can you give an example? Even some random
| million-year-isolated archaea share the majority of their
| genes with common bacteria.
| flobosg wrote:
| > It ultimately relies on mapping sequence to structure.
|
| So does every structural prediction method.
|
| > if you give it a brand new fold with no relation to
| other folds, it cannot predict it
|
| That will depend on the number of effective sequences,
| not the actual fold.
|
| > I work in organisms that have virtually 0 sequence
| identity.
|
| Then the problem is low sequence coverage, not the
| protein fold. On a side note, there are sensitive
| homology search protocols that rely very little on actual
| sequence identity.
| [deleted]
| mupuff1234 wrote:
| > There are 3-4 other deep learning projects I think have had a
| much greater impact on my field.
|
| Don't leave us hanging... which projects?
| COGlory wrote:
| 1) Isonet - takes low SNR cryo-electron tomography images
| (that are extremely dose limited, so just incredibly blurry
| and frequently useless) and does two things:
|
| * Deconvolutes some image aberrations and "de-noises" the
| images
|
| * Compensates for missing wedge artifacts (missing wedge is
| the fact that the tomography isn't done -90deg --> +90deg,
| but usually instead -60deg --> +60deg, leaving a 30deg wedge
| on the top and bottom of basically no information) which
| usually are some sort of directionality in image density. So
| if you have a sphere, the top and bottom will be extremely
| noisy and stretched up and down (in Z).
|
| https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1
|
| 2) Topaz, but topaz really counts as 2 or 3 different
| algorithms. Topaz has denoising of tomograms and of flat
| micrographs (i.e. images taken with a microscope, as opposed
| to 3D tomogram volumes). That denoising is helpful because it
| increases contrast (which is the fundamental problem in Cryo-
| EM for looking at biomolecules). Topaz also has a deep
| learning particle picker which is good at finding views of
| your protein that are under-represented, or otherwise
| missing, which again, normally results in artifacts when you
| build your 3D structure.
|
| https://emgweb.nysbc.org/topaz.html
|
| 3) EMAN2 convolutional neural network for tomogram
| segmentation/Amira CNN for segmentation/flavor of the week
| CNN for tomogram segmentation. Basically, we can get a 3D
| volume of a cell or virus or whatever, but then they are
| noisy. To do anything worthwhile with it, even after
| denoising, we have to say "this is cell membrane, this is
| virus, this is nucleic acid" etc. CNNs have proven to be
| substantially better at doing this (provided you have an
| adequate "ground truth") than most users.
|
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/
| [deleted]
| dekhn wrote:
| I've directly communicated with the leaders of CASP and at DM
| that they should stop representing this as a form of protein
| folding and just call it "crystal/cryoEM structure prediction"
| (they filter out all the NMR structures from PDB since they
| aren't good for prediction). They know it's disingenuous and
| they do it on purpose to give it more impact than it really
| deserves.
|
| I would like to correct something here- it does predict
| structures de novo and predict folds that haven't been seen
| before. That's because of the design of the NN- it uses
| sequence information to create structural constraints. If those
| constraints push the modeller in the direction of a novel fold,
| it will predict that.
|
| To me what's important about this is that it demonstrated the
| obvious (I predicted this would happen eventually, shortly
| after losing CASP in 2000).
| COGlory wrote:
| >I would like to correct something here- it does predict
| structures de novo and predict folds that haven't been seen
| before. That's because of the design of the NN- it uses
| sequence information to create structural constraints. If
| those constraints push the modeller in the direction of a
| novel fold, it will predict that.
|
| Could you expand on this? Basically it looks at the data, and
| figures out what's an acceptable position in 3D space for
| residues to occupy, based on what's known about other
| structure?
|
| I will update my original post to point out I may be not
| entirely correct there.
|
| The distinction I'm trying to make is that there's a
| difference between looking at pre-existing data and modeling
| (ultimately homology modeling, but maybe slightly different)
| and understanding how protein folding works, being able to
| predict de novo how an amino acid sequence will become a 3D
| structure.
|
| Also thank you for contacting CASP about this.
| dekhn wrote:
| From what I can tell, the model DM built is mining subtle
| relationships between aligned columns of multiple sequence
| alignments and any structural information which is tangibly
| related to those sequences. Those relationships can be used
| to infer rough atomic distances ("this atom should be
| within 3 and 7 angstroms of this other atom"). A large
| matrix (partially filled out) of distances is output, and
| those distances are used as constraints in a force field
| (which also includes lots of prior knowledge about protein
| structure) and then they run simulations which attempt to
| minimize both the force field and constraint terms.
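|
| A minimal sketch of what a distance constraint term like "within 3
| and 7 angstroms" can look like, assuming a flat-bottomed quadratic
| penalty (zero inside the allowed range, growing outside it). The
| functional form, weight and names are assumptions for illustration,
| not DeepMind's actual loss.
|
```python
import math


def restraint_penalty(distance, lower, upper, weight=1.0):
    """Zero inside [lower, upper], quadratic in the violation outside it."""
    if distance < lower:
        return weight * (lower - distance) ** 2
    if distance > upper:
        return weight * (distance - upper) ** 2
    return 0.0


def total_penalty(coords, restraints):
    """Sum penalties over (i, j, lower, upper) restraints on 3D coordinates."""
    total = 0.0
    for i, j, lower, upper in restraints:
        d = math.dist(coords[i], coords[j])
        total += restraint_penalty(d, lower, upper)
    return total


coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
restraints = [(0, 1, 3.0, 7.0), (0, 2, 3.0, 7.0)]  # distances in angstroms
print(total_penalty(coords, restraints))
```
|
| A minimizer would then move the coordinates to lower this term
| together with the ordinary force field energy.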
|
| In principle you don't even need a physical force field- if
| you have enough distance information between pairs of
| atoms, you can derive a plausible structure by embedding
| the distances in R3
| (https://en.wikipedia.org/wiki/Distance_geometry and
| https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21...)
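|
| A minimal sketch of that embedding step, using classical
| multidimensional scaling (one standard way to turn distances into
| coordinates) with a plain power iteration for the eigenvectors. It
| assumes a complete, exact Euclidean distance matrix, which is
| simpler than the partially filled matrix described above; all names
| are illustrative.
|
```python
import math
import random


def classical_mds(dist, dim=3, iters=200, seed=0):
    """Recover coordinates (up to rotation/reflection/translation) from distances."""
    n = len(dist)
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d2]
    all_mean = sum(row_mean) / n
    # Double-centred Gram matrix: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + all_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dim for _ in range(n)]
    for axis in range(dim):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):  # power iteration towards the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-9:  # no variance left to embed
            break
        for i in range(n):
            coords[i][axis] = math.sqrt(lam) * v[i]
        for i in range(n):  # deflate so the next pass finds the next eigenvector
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


points = [(0, 0, 0), (3, 0, 0), (0, 2, 0), (0, 0, 1), (3, 2, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedded = classical_mds(dist)
worst = max(abs(math.dist(embedded[i], embedded[j]) - dist[i][j])
            for i in range(len(points)) for j in range(len(points)))
print(f"largest distance error after embedding: {worst:.1e}")
```
|
| The recovered points differ from the originals by a rigid motion or
| mirror image, which is expected: distances alone cannot fix
| orientation or handedness.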
|
| Presumably, the signal they extract includes both rich
| local interactions (amino acids near in sequence) and
| distant ones inferred through sequence/structure
| relationships, and the constraints could in fact push a
| model towards a novel fold, presumably through some
| extremely subtle statistical relationships to other
| evolutionarily related proteins that adopt a different
| fold.
| ramraj07 wrote:
| Not sure if you should be reminded of how alpha fold started,
| it started by winning a competition thought unwinnable by
| academics. Top labs working in protein structure prediction
| have fundamentally changed direction after alpha fold and are
| working to do the same even better.
|
| This is not the first (or even tenth) time I'm seeing an
| academic trying to undermine genuine progress almost to the
| level of gaslighting. Comparing alphafold to conventional
| homology modeling is disingenuous at its most charitable
| interpretation.
|
| Not sure what else to say. Structural biology has always been
| the weirdest field I've seen, the way students are abused
| (crystallize and publish in nature or go bust), and how every
| nature issue will have three structure papers as if that cures
| cancer every day. I suppose it warps one's perception of
| outsiders after being in such a bubble?
|
| signed, someone with a PhD in biomedical engineering, did a ton
| of bio work.
| shpongled wrote:
| > Comparing alphafold to conventional homology modeling is
| disingenuous at its most charitable interpretation.
|
| It's really not - have you played around with AF at all? Made
| mutations to protein structures and asked it to model them?
| Go look up the crystal structures for important proteins like
| FOXA1 [1], AR [2], EWSR1 [3], etc (i.e. pretty much any
| protein target we really care about and haven't previously
| solved) and tell me with a straight face that AF has "solved"
| protein folding - it's just a fancy language model that's
| pattern matching to things it's already seen solved before.
|
| signed, someone with a PhD in biochemistry.
|
| [1] https://alphafold.ebi.ac.uk/entry/P55317
| [2] https://alphafold.ebi.ac.uk/entry/P10275
| [3] https://alphafold.ebi.ac.uk/entry/Q01844
| COGlory wrote:
| > Not sure if you should be reminded of how alpha fold
| started, it started by winning a competition thought
| unwinnable by academics. Top labs working in protein structure
| prediction have fundamentally changed direction after alpha
| fold and are working to do the same even better.
|
| Not sure what part of "it does homology modeling 2x better"
| you didn't see in my comment? AlphaFold scored something like
| 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%?
| So it's ~2x as good as I-TASSER which is exactly what I said
| in my comment.
|
| >This is not the first (or even tenth) time I'm seeing an
| academic trying to undermine genuine progress almost to the
| level of gaslighting. Comparing alphafold to conventional
| homology modeling is disingenuous at its most charitable
| interpretation.
|
| It literally is homology modeling. The deep learning aspect
| is to boost otherwise unnoticed signal that most homology
| modeling software couldn't tease out. Also, I don't think I'm
| gaslighting, but maybe I'm wrong? If anything, I felt gaslit
| by the language around AlphaFold.
|
| >Not sure what else to say. Structural biology has always
| been the weirdest field I've seen, the way students are
| abused (crystallize and publish in nature or go bust), and
| how every nature issue will have three structure papers as if
| that cures cancer every day. I suppose it warps one's
| perception of outsiders after being in such a bubble?
|
| What on earth are you even talking about? The vast, VAST
| majority of structures go unpublished ENTIRELY, let alone
| published in nature. There are almost 200,000 structures on
| deposit in the PDB.
| underdeserver wrote:
| > Not sure what part of "it does homology modeling 2x
| better" you didn't see in my comment? AlphaFold scored
| something like 85% in CASP in 2020, in CASP 2016, I-TASSER
| had I think 42%? So it's ~2x as good as I-TASSER which is
| exactly what I said in my comment.
|
| Wait, stop, I don't know anything about proteins but 84%
| success is not ~2x better than 42%.
|
| It doesn't really make sense to talk about 2x better in
| terms of success percentages, but if you want a feel, I
| would measure 1/error instead (a 99% correct system is 10
| times better than a 90% correct system), making AlphaFold
| around 3.6 times better.
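|
| The 1/error comparison, as a small sketch. The function name is
| illustrative, and the 42% and 84% figures are the loosely recalled
| scores from this thread, not official CASP numbers.
|
```python
def error_ratio(acc_old, acc_new):
    """How many times smaller the error rate of the new system is."""
    return (1.0 - acc_old) / (1.0 - acc_new)


print(f"90% -> 99%: {error_ratio(0.90, 0.99):.1f}x")
print(f"42% -> 84%: {error_ratio(0.42, 0.84):.1f}x")
print(f"42% -> 85%: {error_ratio(0.42, 0.85):.1f}x")
```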
| palmtree3000 wrote:
| I think odds ratio ( p/(1-p) ) is the thing I'd use here.
| It gives the right limiting behavior (at p ~= 0, doubling
| p is twice as good, and at p~=1, halving 1-p is twice as
| good) and it's the natural way to express Bayes rule,
| meaning you can say "I'm twice as sure (in odds ratio
| terms) based on this evidence" and have that be solely a
| property of the update, not the prior.
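|
| A minimal sketch of the odds view. Strictly, p/(1-p) is the odds
| and the comparison between two systems is a ratio of odds. The
| function names are illustrative; the 42% and 84% figures are the
| ones used upthread.
|
```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


def update(prior_p, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    posterior_odds = odds(prior_p) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


print(f"odds(0.84) / odds(0.42) = {odds(0.84) / odds(0.42):.2f}")

# Evidence with likelihood ratio 2 doubles the odds whatever the prior is.
for prior in (0.1, 0.5, 0.9):
    posterior = update(prior, 2.0)
    print(f"prior {prior:.1f} -> posterior {posterior:.3f}, "
          f"odds multiplied by {odds(posterior) / odds(prior):.2f}")
```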
| dekhn wrote:
| What ramraj is talking about: if you go into a competitive
| grad program to get a PhD in structural biology, your
| advisor will probably expect that in 3-4 years you will:
| crystallize a protein of interest, collect enough data to
| make a model, and publish that model in a major journal.
| Many people in my program could not graduate until they had
| a Nature or Science paper (my advisor was not an asshole, I
| graduated with just a paper in Biochemistry).
|
| In a sense both of you are right- DeepMind is massively
| overplaying the value of what they did, trying to expand
| its impact far beyond what they actually achieved (this is
| common in competitive biology), but what they did was such
| an improvement over the state of the art that it's
| considered a major accomplishment. It also achieved the
| target of CASP- which was to make predictions whose scores
| are indistinguishable from experimentally determined
| structures.
|
| I don't think academics thought CASP was unwinnable but
| most groups were very surprised that an industrial player
| using 5 year old tech did so well.
| flobosg wrote:
| Hear, hear. This is probably the best take.
| valarauko wrote:
| > What ramraj is talking about: if you go into a
| competitive grad program to get a PhD in structural
| biology, your advisor will probably expect that in 3-4
| years you will: crystallize a protein of interest,
| collect enough data to make a model, and publish that
| model in a major journal.
|
| All of that is generally applicable to molecular biology
| in general, and I don't see how the field of structural
| biology is especially egregious, the way ramraj is making
| it out to be.
| flobosg wrote:
| Protein crystallization can be very difficult and there
| is no general solution. Kits that screen for crystal
| growth conditions usually help but optimization is needed
| in most cases. Then, that crystal must have certain
| properties that allow for good data acquisition at the
| X-ray facility. That's another problem by itself and
| months or years can pass until you get a suitable protein
| crystal and X-ray diffraction dataset where you can model
| your structure.
| valarauko wrote:
| I'm familiar with protein crystallization and the
| difficulties associated with it. What I don't agree with
| is the characterization of the field as especially
| difficult, above and beyond modern biology in general.
| Nor can I support the assertion that structural biology
| students are subject to special abuse that regular grad
| students are not.
|
| > ... can be very difficult and there is no general
| solution
|
| This is true of pretty much any graduate work in
| molecular biology.
| dekhn wrote:
| I did rotations in multiple types of lab as part of my
| program and I can't say I ever found that students in
| regular molecular biology labs had nearly as hard a time
| as structural biologists; SB is its own class of hell.
| Given the number of papers published in molecular biology
| that turn out to be "gel was physically cut and
| reassembled to show the results the authors desired"
| (it's much harder to cheat on a protein structure)...
| lucidrains wrote:
| To add to this, the deep learning field has already moved
| on towards MSA-less structure prediction. None of this
| would be possible without building on top of the work
| open sourced by Deepmind.
|
| https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1
| https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1
|
| To be overly dismissive is to lack imagination.
| jakosz wrote:
| Now we can start guessing what futures they are betting on:
| these, in which open-sourcing the whole thing commoditises
| critical complements.
|
| ---
|
| https://www.gwern.net/Complement
| jarenmf wrote:
| This is probably one of the best applications of AI in science in
| terms of impact so far. I can't think of any other problem with
| the same potential impact.
|
| EDIT: grammar
| 323 wrote:
| AI translate has probably a bigger worldwide impact so far.
| jebarker wrote:
| jarenmf said "in science" - but it is an interesting question
| how much automated translation has helped scientists
| translate papers from other languages.
| hijodelsol wrote:
| It even goes both ways - it allows non-native English
| speakers to publish their work in correct
| technical/scientific English with far fewer barriers.
| calebkaiser wrote:
| This is definitely one of the most exciting spaces in AI right
| now. Another somewhat-related startup is PostEra (medicinal
| chemistry for drug discovery via AI) https://postera.ai/about/
| axg11 wrote:
| AlphaFold is the best counterpoint to tech cynics.
|
| One of the largest public tech companies in the world funded a
| multi-year scientific project, executed the research flawlessly
| and moved forward an entire scientific field. They then went on
| to openly release the code _and_ data, working with a publicly
| funded organization (EMBL-EBI) to ensure researchers across the
| globe can easily access the outputs.
|
| I'm not arguing that every tech company is a net positive for
| humanity. Google itself isn't perfect. Google + DeepMind is
| setting a bloody high bar though.
| t00 wrote:
| You are right and when thinking about it I can see 2 problems
| which I hope in the future can have even more impact:
|
| 1. Using AI to determine the most efficient methods of doing
| mathematical expressions, transformations and computation
| algorithms - division, square root, maybe traveling salesman -
| those which take a relatively high number of CPU cycles to
| compute and are used everywhere. If inputs and outputs can be
| assigned to it, AI can eventually build a transformation which
| can be reproduced in silicon.
|
| 2. Physics phenomena in general, not only organic proteins, can
| be measured. With sufficient ability to quantize them into
| inputs and experimentally obtained outputs to train the
| network, we could in theory establish new formulas or constants
| and progress the understanding of the Universe.
| lrhegeba wrote:
| the groundwork is, at least partially, being laid as you type
| this: https://www.nature.com/articles/d41586-021-01627-2
| dekhn wrote:
| Demis and John will probably win either the Chemistry or Physics
| Nobel Prize in the next couple of years.
| thomasahle wrote:
| Some people are using "AI wins a Nobel Prize" as the new Turing
| test. Maybe that is going to happen sooner than they expect. Or
| maybe the owners of the AI will always claim it on its behalf.
| dekhn wrote:
| there's no AI here. This is just ML. All deepmind did here
| was use multiple excellent resources- large numbers of
| protein sequences, and small numbers of protein structures,
| to create an approximation function of protein structure,
| without any of the deep understanding of "why".
| Interestingly, the technology they used to do this didn't
| exist 5 years ago!
| dalbasal wrote:
| Can someone put AlphaFold's problem space into perspective for
| me?
|
| Why is protein folding important? Theoretical importance? Can we
| do something with protein folding knowledge? If so, what?
|
| I've been hearing about AlphaFold from the CS side. There they
| seem to focus on protein folding primarily as an interesting
| space to apply their CS efforts.
| axg11 wrote:
| If we knew:
|
| (a) the structure of every protein (what DeepMind is doing
| here)
|
| (b) how different protein structures interact (i.e. protein
| complexes - DeepMind is working on this but not there yet)
|
| Then we could use those two building blocks to design new
| proteins (drugs) that do what we want. If we solve those two
| problems with very high accuracy, we can also reduce the time
| it takes to go from starting a drug discovery programme to
| approved medicine.
|
| Obtaining all protein structures and determining how they
| interact is a key step towards making biology more predictable.
| Previously, solving the structure of a protein was very time
| consuming. As a result, we didn't know the structure for a
| majority of proteins. Now that it's much faster, downstream
| research can move faster.
|
| Caveat: we should remember that these are all computational
| predictions. AlphaFold's predictions can be wrong and protein
| structures will still need to be validated. Having said that,
| lots of validation has already occurred and confidence in the
| predictions grows with every new iteration of AlphaFold.
| lamename wrote:
| How are the predictions validated? Waiting for the old
| fashioned way for... very difficult crystal structure
| experiments? Or something else?
| RandomLensman wrote:
| If the predictions are generally good enough, could also
| skip the validation and directly try to get a desired
| effect or reaction. That isn't strictly speaking validating
| the structure, but depending on the use case might be
| easier to just go for an outcome - really a question of
| application and cost efficiency.
| siver_john wrote:
| I mean, nothing stops you from skipping validation, even
| with pre-AlphaFold techniques, and, say, for drug discovery,
| going straight to drug screening using the predicted
| structure. It's just that the drug screening software is
| already error-prone, so you are still going to have to do
| some validation. However, having an idea of a potential
| structure means that you can validate it with other
| techniques that are simpler and less expensive/time-consuming
| (I'm thinking of things similar to FRET).
|
| Another idea is that these may come into play for
| anti-verification: if you are drug screening against a
| known structure, you could potentially use these more
| flawed structures of proteins you don't want to target
| but which may be similar, and try to reduce the drug's efficacy
| at binding them. Or something to that effect. All of these
| are fun ideas that are currently being explored in that
| space, but we'll see where it takes us.
| flobosg wrote:
| > very difficult crystal structure experiments?
|
| Apart from X-ray crystallography there are other methods
| for structure determination such as nuclear magnetic
| resonance (NMR) or cryo-electron microscopy (cryo-EM). The
| latter has seen a dramatic improvement in resolution over
| the last decade.
| misnome wrote:
| For a lot of X-ray crystallography cases, some of the
| difficulty is working out the actual structure from the
| collected data with no prior information. This makes a lot
| of that... much easier because with
| https://en.wikipedia.org/wiki/Molecular_replacement
| something that is "close, but not correct" can be used to
| bootstrap the actual structure from.
| f38zf5vdt wrote:
| Most of them are not, just estimations based on previous
| results given sequences with known structure.
|
| Every couple years there is a massive competition called
| CASP where labs submit previously unresolved protein
| structures derived from experimental EM, x-ray
| crystallography, or NMR studies and other labs attempt to
| predict these structures using their software. AlphaFold2
| absolutely destroyed the other labs in the main contest
| (regular monomeric targets, predominantly globular) for
| structure resolution two years ago, in CASP 14.
|
| https://predictioncenter.org/casp14/zscores_final.cgi
|
| The latest contest, CASP15, is currently underway and
| expected to end this year. As with all ML, the usual
| caveats apply to the models Google generated -- the dangers
| of overfitting to existing structures, artifacts based on
| the way the problem was modelled, etc
| jcranmer wrote:
| > Then we could use those two building blocks to design new
| proteins (drugs) that do what we want. If we solve those two
| problems with very high accuracy, we can also reduce the time
| it takes to go from starting a drug discovery programme to
| approved medicine.
|
| Drugs are usually not proteins, but instead small molecules
| that are designed to help or interfere with the operation of
| proteins instead.
| axg11 wrote:
| That is only true because of our current tools and
| capabilities. With improved manufacturing techniques and
| AlphaFold++ I think biologics will dominate. Even still,
| there are ~2000 approved biologics [0].
|
| [0] - https://purplebooksearch.fda.gov/advanced-search
| bckr wrote:
| Yep, proteins are so much more flexible / precise than
| small molecules. Also we can get the body to produce
| them. Think mRNA vaccines.
| empiricus wrote:
| The endgame of protein folding is nanotechnology. Life is also
| nanotechnology. So basically the end game is to take full
| control of life: change existing life forms, design new life
| forms, new ecosystems. Eat the earth first, then the universe.
| Not yet there, but getting much closer.
| green-eclipse wrote:
| There are a million articles and podcasts explaining exactly
| your question. Those will be better than HN responses. I
| suggest you take 15 seconds to Google it.
| dev_tty01 wrote:
| Yes there are a million articles. That is why asking a
| question here on HN is useful. The HN community more often
| than not offers intelligent insight as well as curated
| recommended links for learning more about a topic. Yes, the
| signal-to-noise ratio isn't perfect on HN, but it is a lot
| better than random Google searches. If nothing else, it often
| leads to enough basic understanding so that someone can then
| perform more refined and therefore more productive Google
| searches. I appreciate the basic questions and the generous
| nature of many HN commenters who offer thoughtful responses.
| feet wrote:
| As others have already mentioned, proteins are the machinery of
| the cell. They perform an immense array of functions and they
| must fold in a certain way to perform these functions. This is
| part of what's known as the structure-function relationship.
|
| Misfolded proteins are contributors to numerous pathological
| conditions and the more we can understand about how and why
| this folding happens, the better we can treat these conditions.
|
| Another aspect is that while we can at least partially
| determine the primary structure (the amino acid sequence) of
| proteins from DNA and RNA, we don't necessarily know their
| secondary or tertiary structures (3 dimensional conformation).
| This is a key piece of the puzzle for figuring out how these
| proteins do their proteiny things and how they interact with
| other proteins and even how they form quaternary structures
| with other proteins (an assembly of multiple proteins that
| perform some function, many pores are assemblies like this).
| Once we know these structures and understand how they work on a
| structural and chemical level, we can manipulate them far more
| easily.
|
| In order to do rational drug design, which is designing a drug
| for a specific target or active site on a protein, we need to
| understand these structures. Working to solve protein folding
| is a key step in treating disease states and understanding how
| cells work on a fundamental level. The impact is hard to
| overstate.
| turkeygizzard wrote:
| My understanding is that protein folding is a major cost
| bottleneck in drug design.
|
| Researchers can come up with candidate molecule formulas that
| might work as good drugs, but the problem is that these
| proteins organize/fold themselves physically in a hard-to-
| predict way. And how they fold directly affects their
| properties as drugs.
|
| If AlphaFold can accurately predict folding, it'll allow
| researchers to prioritize drug candidates more accurately which
| will reduce research time and costs. Supposedly the major
| pharmaceutical companies can spend up to billions when
| designing a single drug. Optimistically, predicting protein
| folding better will allow for much more rapid and cheaper drug
| development
| matheusmoreira wrote:
| Proteins are the molecular machines of all living beings on
| this planet. They do almost everything. We need to understand
| their structure and how that structure changes when they
| perform some function. Essentially we have to reverse engineer
| nature's machines.
|
| Biology has traditionally studied these things from the outside
| in, top down. Advances in biology are correlated with advances
| in microscope technology. We went from seeing with the naked
| eye to being able to magnify tissue to see cells and even
| what's inside of them.
|
| Studying biology from the inside out, bottom up is
| significantly harder. You have chemistry which is hard science.
| On top of that you have biochemistry which is equally rigorous.
| Above that level of abstraction things start to get a lot less
| well defined. We go from organic chemistry (carbon, hydrogen,
| oxygen, etc.) to more complex things found in life (amino acids,
| carbohydrates, lipids) to polymers of those things (proteins,
| polysaccharides, triglycerides) to a point where they get so
| big and complex that their three-dimensional structure starts
| influencing how they interact with other molecules. Then we
| start seeing aggregates (quaternary proteins) of these complex
| things!
|
| Important proteins like ATP synthase are well understood (it's
| a friggin' turbine like in a hydroelectric power plant!) but
| there are so many others that haven't received the same attention.
| Bioinformatics technology is changing that. The human genome's
| been mapped. Now it's time to map the proteome.
| Freebytes wrote:
| The applications and importance have been discussed, but let me
| explain why what we are doing right now does not work which
| will also emphasize the importance of this.
|
| At this time, we create drugs, test them on animals, and see
| what the side effects and results actually are. We are very
| limited in our capabilities and basically throw mud at the wall
| and see what sticks. This would allow us to try potential drug
| candidates without so much randomness.
| fabiospampinato wrote:
| You are basically made of proteins, which are basically folded
| sequences of amino acids. Proteins are molecular machines that
| are the fundamental building blocks of animals, plants,
| bacteria, fungi, viruses etc.
|
| So yeah the applications are enormous, from medicine to better
| industrial chemical processes, from warfare to food
| manufacturing.
| jebarker wrote:
| > proteins are molecular machines
|
| Does that imply proteins have some dynamics that need to be
| predicted too? I remember seeing animations of molecular
| machines that appeared to be "walking" inside the body - are
| those proteins or more complex structures?
| gilleain wrote:
| Yes, very much so. Even proteins that seem like they
| are just scaffolding for a catalytic centre can have
| important dynamics.
|
| A classic example is haemoglobin, that 'just' binds to
| oxygen at the iron in the middle of the haem. Other binding
| sites remote from the oxygen binding one can bind to other
| molecules - notably carbon dioxide. The 'Bohr effect'
| mechanism is outlined here:
| https://en.wikipedia.org/wiki/Bohr_effect#Allosteric_interac...
|
| Even at the lowest level, there is some evidence that
| ligand binding can affect the structure of the backbone of
| the protein. For example, peptide plane flipping
| https://en.wikipedia.org/wiki/Peptide_plane_flipping
| although I'm not sure where the research is on this
| nowadays.
| jebarker wrote:
| Fascinating. I'm a mathematician turned AI researcher by
| day. The more I learn about biology the more I feel that
| our human engineering (hard and soft) is trivially simple
| compared to what evolution has already created and left
| for us to study.
| gilleain wrote:
| Heh, I'm somewhat the opposite, as I went from a degree
| in biochemistry into bioinformatics (protein folds and
| structure) and have an amateur interest in maths.
|
| I often recommend the book 'Cats' Paws and Catapults' by
| Steven Vogel, which discusses the differences between
| human and biological engineering. There are very
| different constraints, as well as the more obvious
| difference between intelligently directed design (by
| humans) and evolutionary design.
|
| It's not totally true that our engineering is simpler
| than biological systems. After all, we can now manipulate
| matter down to the atomic level. There have been attempts
| to move towards self-assembling systems, or even self-
| repairing ones. Not quite on the same scale or
| sophistication as biological systems, of course.
|
| Is a modern computer chip more complex than a virus? Is a
| large connected system of computers more complex than a
| mycelial network? Are there more parts connected in more
| ways in a nuclear power station than there are in a simple
| bacterial cell?
|
| I think it is definitely worth considering the
| similarities and differences.
| jebarker wrote:
| Definitely interesting comparisons.
|
| I think my sense that biology is more complex than human
| engineering is that our engineering seems much more
| homogeneous. Computer chips are certainly very dense but
| the number of different component types is
| very small. Biology on the other hand seems to have a
| huge number of fundamental building blocks, e.g.
| proteins, with different behaviors. I suppose that maybe
| that's the wrong comparison though and software is where
| our engineering diversity is present.
|
| It may well just be the case that my lack of familiarity
| with biology makes it seem more complex to me.
| fabiospampinato wrote:
| Yes proteins can be dynamic, and multiple proteins can come
| together to form bigger structures (like a cell's membrane
| for example), and they can very much interact with other
| proteins and molecules in complex ways.
|
| I think AlphaFold gives us more of a static picture that
| doesn't tell us much about how the protein will interact
| with the world. Predicting these dynamic aspects, protein-
| to-protein interactions, and ultimately an entire cell, is
| probably what they'd like to do next.
|
| I don't know if the walking buddy is a single protein, it
| may be a handful of different ones joined together.
| andrepd wrote:
| https://en.m.wikipedia.org/wiki/Protein_folding
| epups wrote:
| Proteins are what makes everything in a cell work. They are
| produced as a "linear" structure that must fold into a proper
| shape to execute its function, such as acting as a pore that
| only lets a specific chemical through the cell membrane.
|
| The importance here is to figure out potential targets for
| treatments that take into account particularities of certain
| proteins. That could produce better drugs with fewer side
| effects.
| flobosg wrote:
| This might be an interesting resource for you:
| https://pdb101.rcsb.org/
| pbw wrote:
| The genome, all of our DNA combined, is just a bunch of 1D
| strings like "cgtattctgcttgta". Those strings encode proteins,
| which fold up into a 3D shape once created. This 3D shape is
| what determines what the protein actually does inside the cell.
| Without understanding protein folding we don't understand what
| the DNA actually does.
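The first step pbw describes, going from a 1D DNA string to a protein's amino acid sequence, can be sketched in a few lines. This is a minimal illustration, assuming the example string is a coding strand read in frame from its first letter and using the standard genetic code; `translate` is a hypothetical helper, and real genes involve start codons, introns and other details that are ignored here. It covers only the string-to-sequence part, not the 3D folding that AlphaFold predicts.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string, read in frame from position 0.

    Returns one-letter amino acid codes; '*' marks a stop codon.
    Trailing bases that do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# The example string from the comment above: five codons, five amino acids.
print(translate("cgtattctgcttgta"))  # RILLV
```

The output is still a 1D string; the hard part discussed in this thread is predicting the 3D shape that such a sequence folds into.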
| swayvil wrote:
| I had a dream about this a few days ago. About complexly
| wrinkled/crumpled/convolved things.
|
| Like a fresh crepe stuffed into the toe of a boot. Bewilderingly
| complex.
|
| But I have a question. Does such contortion work for 3d
| "membranes" in a 4d space? It's something I'm chewing on. Hard to
| casually visualize, obviously.
| gspr wrote:
| Of course! The term you might wanna start off googling is
| "curvature of manifolds". What's even neater than "3d thing
| curving in 4d space" is that these notions can be made precise
| also without the "in [whatever] space" part (see "intrinsic
| curvature" and "Riemannian manifold").
| swayvil wrote:
| Thank you very much.
| naves wrote:
| Just imagine if the tech world put all programmatic advertising
| development on hold for a year and the collective brain power were
| channeled to science instead...
| alphabetting wrote:
| _Obtaining this dataset prior to alphafold would have cost on the
| order of $200 trillion._
| https://twitter.com/wintonARK/status/1552653527670857729
|
| Anyone knowledgeable know if this estimate is accurate? Insane if
| true
| green-eclipse wrote:
| It's impossible to really put a number on it, because the task
| itself was impossible. PhDs and the field's top scientists
| simply couldn't figure out many complicated protein structures
| after years of attempts, and the fact that there are so many
| (200M+) means that the problem space is vast.
| dekhn wrote:
| It doesn't make any sense on multiple levels. This is a
| computational prediction and there was no computational
| alternative; many of these proteins would never have had a
| structure solved even if you spent the money. They are just
| taking $cost_per_structure_solved *
| number_of_remaining_structures and assuming that things scale
| linearly like that.
|
| Note that crystallographers are now using these predictions to
| bootstrap models of proteins they've struggled to work with,
| which indicates the level of trust in the structural community
| for these predictions is pretty high.
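The linear scaling dekhn describes can be run backwards to see what per-structure cost the headline figure implies. This is only a back-of-the-envelope sketch: both inputs are the numbers quoted in this thread (the "$200 trillion" claim and the "200M+" structure count), neither is verified here, and the variable names are made up for illustration.

```python
# Figures quoted in the thread; neither is independently verified here.
claimed_total_usd = 200e12   # "$200 trillion" from the linked tweet
structures = 200e6           # "200M+" structures mentioned above

# Linear model: total = cost_per_structure * number_of_structures,
# so the implied per-structure cost is simply the ratio of the two.
implied_cost_per_structure = claimed_total_usd / structures

print(f"${implied_cost_per_structure:,.0f} per structure")
```

Under those assumptions the estimate amounts to roughly a million dollars per structure multiplied straight through, which is exactly the kind of linear extrapolation being questioned.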
| shauryamanu wrote:
| Even if that's exaggerated, it might have taken significant
| time to reach this stage. Probably on the order of >50
| years.
| [deleted]
| crispyambulance wrote:
| I got a 5th grader question about how proteins are
| used/represented graphically that I've never been able to find a
| satisfying answer for.
|
| Basically, you see these 3D representations of specific proteins
| as a crumple of ribbons-- literally like someone ran
| multi-colored ribbons through scissors to make curls and dumped it on
| the floor (like a grade school craft project).
|
| So... I understand that proteins are huge organic molecules
| composed of thousands of atoms, right? Their special capabilities
| arise from their structure/shape. So basically the molecule
| contorts itself to a low energy state which could be very complex
| but which enables it to "bind?" to other molecules expressly
| because of this special shape and do the special things that
| proteins do-- that form the basis of living things. Hence the
| efforts, like Alphafold, to compute what these shapes are for any
| given protein molecule.
|
| But what does one "do" with such 3D shapes?
|
| They seem intractably complex. Are people just browsing these
| shapes and seeing patterns in them? What do the "ribbons"
| signify? Are they just some specific arrangement of C,H,O? Why
| are some ribbons different colors? Why are there also thread-like
| things instead of all ribbons?
|
| Also, is that what proteins would really look like if you could
| see at sub-optical wavelength resolutions? Are they really like
| that? I recall from school the equipartition theorem-- 1/2 KT of
| kinetic energy for each degree of freedom. These things obviously
| have many degrees of freedom. So wouldn't they be "thrashing
| around" like a rag doll in a blender at room temperature? It seems
| strange to me that something like that could be so central to
| life, but it is.
|
| Just trying to get myself a cartoonish mental model of how these
| shapes are used! Anyone?
| dekhn wrote:
| The ribbons and helices you see in those pictures are abstract
| representations of the underlying positions of specific
| arrangements of carbon atoms along the backbone.
|
| There are tools such as DSSP
| https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation...
| which will take the 3d structure determined by
| crystallography and spit out the ribbons and helices- for
| example, for helices, you can see a specific arrangement of
| carbons along the protein's backbone in 3d space (each carbon
| interacts with a carbon 4 amino acids down the chain).
|
| Protein motion at room temperature varies depending on the
| protein- some proteins are rocks that stay pretty much in the
| same single conformation forever once they fold, while others
| do thrash around wildly and others undergo complex, whole-
| structure rearrangements that almost seem magical if you try to
| think about them using normal physics/mechanical rules.
|
| Having a magical machine that could output the full manifold of
| a protein during the folding process at subatomic resolution
| would be really nice! but there would be a lot of data to
| process.
| crispyambulance wrote:
| Thanks, awesome! So what do molecular biologists do with
| these 3D representations once they have them? Do they
| literally just see how they fit to other proteins?
| buntsai wrote:
| Often the ribbons (alpha-helices and beta-sheets) form
| "protein domains". Canonically, these are stable, folded
| structures with conserved shapes and functions that serve
| as the building blocks of proteins, like lego pieces. These
| protein domains can be assembled in different ways to form
| proteins of different function. Different protein domains
| that have the same evolutionary origin have conserved
| structure even when the underlying amino acid sequence, or
| DNA sequence has changed beyond recognition over millions
| of years of evolution. In other words, molecular biologists
| use structure as a proxy for function. Looking at how the
| same protein domains work in different proteins in
| different species can give us clues as to how a protein
| might work in human biology or disease.
| dekhn wrote:
| There are many uses for structure. Personally, I find the
| 3d structures to be useful as a mental guide for picturing
| things, and certainly people do try to "dock" proteins that
| have complementary structures, but unfortunately, the
| biophysics of protein complexes suggests that the
| conformation change on binding is so large that the
| predicted structures aren't super-helpful.
|
| Certainly, in a corpo like mine (Genentech/Roche) protein
| structures have a long history of being used in drug
| discovery- not typically a simple "dock a ligand to a
| protein" but more for constructing lab experiments that
| help elucidate the actual mechanistic biology going on.
| That is only a tiny part of a much larger process to work
| on disease targets to come up with effective treatments.
| Genentech is different from most pharma in that their
| treatments are themselves typically proteins, rather than
| small molecules.
| cletus wrote:
| The ability for another molecule (probably another protein) to
| "react" or interact with the protein depends not only on the
| chemistry but also the shape. An otherwise compatible sequence
| of atoms might not be able to react because it and the binding
| site are just incompatibly shaped.
|
| This is hugely important for developing drugs and vaccines.
|
| To see the effect of this look no further than prions. Prions
| are the exact same proteins, only folded in weird ways.
| Worse, they can "transmit" this misfolded shape to other
| otherwise normal proteins. Prions behave differently just
| because of the different shape and can lead to disease. This is
| exactly what Mad Cow Disease (BSE) is.
|
| What we get taught in high school about chemistry is incredibly
| oversimplified.
|
| One example of this I like is the geometry of a water molecule.
| When we first learn about atoms, we learn the "solar system"
| model (aka Bohr). The reality is instead that we have 3D
| probability distributions of where electrons might be. These
| clouds are in pairs. I believe this is to do with the inverted
| wavefunction, but really we're getting beyond my knowledge of
| quantum mechanics here, so that's just a guess.
|
| Well those clouds additionally form valence shells. We learn
| about these and how atoms want to form complete valence
| shells. So Oxygen has 8 electrons ie 4 pairs of electrons. When
| bonding with 2 hydrogen atoms we end up with a weird geometry
| of ~104.5 degrees between the two hydrogen atoms because of how
| these pairs interact. The naive assumption might be that
| the two hydrogen atoms are 180 degrees apart.
|
| So back to proteins, you may have learned about hydrogen bonds.
| This affects molecular shape because when a hydrogen atom
| shares an electron, it is often positively charged. That
| positive charge pushes away other positive charges. This is the
| real difficulty in protein folding because with a molecule of
| thousands of atoms and weird geometry you may find distant
| parts of the molecule interacting with hydrogen bonds.
|
| So a single cell consists of thousands (IIRC) of different
| proteins. Figuring out those interactions is important but
| incredibly difficult.
| comicjk wrote:
| All of the loops and swirls are summary representations of
| known atomic positions: really, knowing a protein structure
| means knowing the position of every atomic nucleus, relative to
| the other nuclei, down to some small resolution, and assuming a low
| temperature.
|
| The atoms do wiggle around a bit at room temperature (and even
| more at body temperature), which means that simulating them
| usefully typically requires sampling from a probability
| distribution defined by the protein structure and some prior
| knowledge about how atoms move (often a potential energy
| surface fitted to match quantum mechanics).
|
| There are many applications of these simulations. One of the
| most important is drug design: knowing the structure of the
| protein, you can zoom in on a binding pocket and design a set
| of drug molecules which might disable it. Within the computer
| simulation, you can mutate a known molecule into each of your
| test molecules and measure the change in binding affinity,
| which tells you pretty accurately which ones will work. Each of
| these simulations requires tens of millions of samples from the
| atomic probability distribution, which typically takes a few
| hours on a GPU given a good molecular dynamics program.
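The idea of "sampling from a probability distribution defined by a potential energy surface" can be illustrated with a toy example. The sketch below is not molecular dynamics and not a real force field: it swaps in a Metropolis Monte Carlo sampler on a one-dimensional harmonic well, with made-up units where kT = 1, purely to show what drawing Boltzmann-distributed samples looks like. The function names and parameters are hypothetical.

```python
import math
import random


def metropolis_samples(energy, x0, kT, step, n_samples, rng):
    """Draw samples with probability proportional to exp(-energy(x) / kT)."""
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.uniform(-step, step)  # symmetric random trial move
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with
        # probability given by the Boltzmann factor.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
        samples.append(x)  # a rejected move repeats the current state
    return samples


def harmonic(x, k=1.0):
    """Toy potential energy surface: a single harmonic well."""
    return 0.5 * k * x * x


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = metropolis_samples(harmonic, x0=0.0, kT=1.0, step=1.0,
                        n_samples=200_000, rng=rng)
mean_potential = sum(harmonic(x) for x in xs) / len(xs)

# For a harmonic well the average potential energy should approach kT / 2.
print(mean_potential)
```

A real binding-affinity calculation does the same kind of averaging over a vastly larger set of coordinates, which is where the tens of millions of samples and the GPU hours come from.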
| bckr wrote:
| > Are people just browsing these shapes and seeing patterns in
| them
|
| That's one approach.
|
| The thing to understand is that proteins form "binding sites":
| areas that are more likely to attract other particular regions
| of proteins or other molecules, or even atoms. Think about
| hemoglobin. The reason it holds onto oxygen atoms is because it
| has binding sites.
|
| Binding sites are great because they represent more freedom to
| do things than molecules typically have. Normal chemistry
| consists of forming strong electronic bonds between atoms, or
| forming rigid lattices/crystals.
|
| Binding sites allow molecules to do things like temporarily
| attach to each other and let each other go under certain
| circumstances, for instance when another binding site is
| active/inactive. This can happen through "conformation change",
| where a molecule bound/unbound on some binding site makes the
| protein change shape slightly. This is how proteins can act
| like machines.
|
| > What do the "ribbons" signify
|
| Different regions of the protein have different sequences of
| amino acids. Amino acids have somewhat different shapes from
| each other. The ribbons are actually broader than the spindles
| (or threads), and less flexible. Not sure about the different
| colors, maybe someone else can fill in.
|
| > Also, is that what proteins would really look like if you
| could see at sub-optical wavelength resolutions?
|
| Not really, it's an abstraction. They're big molecules, so if
| you look closely they're made of atoms, which are (kinda, sorta
| not really, quantum stuff) spherical.
|
| > So wouldn't they be "thrashing around" like rag doll in
| blender at room temperature?
|
| Yes, but the attractions between the different parts of the
| molecule keeps it somewhat under control. So more like an
| undulating little creature, jellyfish perhaps.
|
| > It seems strange to me that something like that could be so
| central to life
|
| Yep, gotta remember that it's all statistical. These things are
| getting made, doing their job, breaking, and getting degraded some
| insane number of times per second. Swarm behavior, sort of.
| ghkbrew wrote:
| In addition to /u/dekhn 's excellent description, this
| phenomenon is referred to as a protein's "secondary structure"
| [0]
|
| [0] https://en.m.wikipedia.org/wiki/Protein_secondary_structure
| biofox wrote:
| Short answer is that the ribbon representation is a visual
| simplification based on known structures -- they are actually
| composed of atoms.
|
| They certainly do "thrash around", but that thrashing is
| constrained by the bonds that are formed, which greatly limits
| the degrees of freedom. Here's a short video of a simulation to
| demonstrate:
|
| https://www.youtube.com/watch?v=fggqPtaZj8g
| paraph1n wrote:
| > I recall from school the equipartition theorem-- 1/2 KT of
| kinetic energy for each degree of freedom. These things
| obviously have many degrees of freedom. So wouldn't they be
| "thrashing around" like rag doll in a blender at room
| temperature?
|
| It's funny you say that, because the first image on the English
| Wikipedia page for Equipartition Theorem[1] is an animation of
| the thermal motion of a peptide.
|
| [1]: https://en.wikipedia.org/wiki/Equipartition_theorem
| dekhn wrote:
| BTW, in terms of protein dynamics, before you even think
| about the thrashing around: 1/2 kT at room temperature is
| enough to form and break hydrogen bonds in real time (around
| 1-2 kcal/mol), so presumably protein H-bonds are breaking and
| reforming spontaneously at scale.
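|
| For a sense of scale, here is a rough sketch of the arithmetic
| (my own numbers, not a simulation): thermal energy kT per mole
| at an assumed room temperature of 298.15 K, and the Boltzmann
| factor exp(-E/kT) for bond energies in the 1-2 kcal/mol range
| mentioned above. The function names are made up for
| illustration.

```python
import math

# Gas constant in kcal/(mol*K); kT per mole of molecules is R*T.
R_KCAL_PER_MOL_K = 1.987204e-3
ROOM_TEMPERATURE_K = 298.15  # assumption: "room temperature" = 25 C


def thermal_energy_kcal(temperature_k=ROOM_TEMPERATURE_K):
    """Return kT in kcal/mol at the given temperature."""
    return R_KCAL_PER_MOL_K * temperature_k


def boltzmann_factor(energy_kcal, temperature_k=ROOM_TEMPERATURE_K):
    """Relative weight exp(-E/kT) of a state that costs energy_kcal."""
    return math.exp(-energy_kcal / thermal_energy_kcal(temperature_k))


kt = thermal_energy_kcal()
print(f"kT       = {kt:.3f} kcal/mol")
print(f"1/2 kT   = {kt / 2:.3f} kcal/mol per degree of freedom")
for bond_energy in (1.0, 2.0):
    weight = boltzmann_factor(bond_energy)
    print(f"exp(-{bond_energy:.0f}/kT) = {weight:.3f}")
```

| The factors come out well above zero, which fits the picture
| of bonds of this strength breaking and reforming all the time
| rather than being frozen in place.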
| sabujp wrote:
| Many thanks to DeepMind for releasing predicted structures of all
| known protein _monomers_. What I'd like next is for AlphaFold
| (or some other software) to be able to show us multimeric
| structures based on the single monomer/subunit predictions and
| protein-protein interactions (i.e. docking). For example the one
| I helped work on back in my structural biology days was the
| circadian clock protein KaiC :
| https://www.rcsb.org/structure/2GBL, that's the "complete"
| hexameric structure that shows how each of the subunits pack. The
| prediction for the single monomer that forms a hexamer is very
| close to the experimental structure
| https://alphafold.ebi.ac.uk/entry/Q79PF4 and in fact shows the
| correct structure of AA residues 500 - 519 which we were never
| able to validate until 12 years later
| (https://www.rcsb.org/structure/5C5E) when we expressed those
| residues along with another protein called KaiA which we knew
| binds to the "top" CII terminal (AAs 497-519) of KaiC. If we
| had had this data then, it would have allowed us to not
| only make better predictions about biological function and
| protein-protein interactions but would have helped better guide
| future experiments.
|
| What we can do with this data now is use methods such as cryo-EM
| to see the "big picture", i.e. multi-subunit protein-protein
| interactions where we can plug the AlphaFold predicted
| structure into the cryo-EM 3D density map and get predicted
| angstrom level views of what's happening without necessarily
| having to resort to slower methods such as NMR or x-ray
| crystallography to elucidate macromolecular interactions.
|
| A small gripe about the AlphaFold EBI website: it doesn't seem to
| show the known experimental structure, it just shows
| "Experimental structures: None available in PDB". For example the
| link to the AlphaFold structure above should link to 2GBL,
| 1TF7, or any of the other KaiC structures from organism PCC7942
| at RCSB. This would require merging/mapping data from RCSB with
| EBI and at least doing some string matching, hopefully they're
| working on it!
| inspirerhetoric wrote:
| Does anyone know what it would cost to download this whole
| dataset? Google Cloud Datasets only allow 1 TB/month for free to
| download, I believe
| [deleted]
| bifftastic wrote:
| How do they know their structures are correct?
| tomrod wrote:
| This is the right line of questioning.
|
| As we solve viewability into the complex coding of proteins, we
| need to be right. Next, hopefully, comes causal effect
| identification, then construction ability.
|
| If medicine can use broad capacity to create bespoke proteins,
| our world becomes both weird and wonderful.
| __rito__ wrote:
| They won a decades-old challenge by predicting the structures
| of a much smaller (yet still quite large) set of proteins
| using a model (AlphaFold).
|
| Then they use the model to predict more.
|
| Although we don't know if they are _correct_, these structures
| are the best (or the least bad) we have for now.
| ArnoVW wrote:
| We know the structure of some proteins. It's not that it's
| impossible to measure, it's just very expensive. This is why
| having a model that can "predict" it is so useful.
| luma wrote:
| Same as any other prediction I'd presume. Run it against a
| known protein and see how the answer lines up. Predict the
| structure of an unknown protein, then use traditional methods
| (x-ray crystallography, maybe STEM, etc) to verify.
| iandanforth wrote:
| "Verify" is almost correct. The crystallography data is taken
| to be "ground truth" and the predicted protein structure from
| AlphaFold is taken to be a good guess starting point. Then
| _other_ software can produce a model that is a best fit to
| the ground truth data starting from the good guess. So even
| if the guess is wrong in detail it's still useful to reduce
| the search space.
| christudor wrote:
| This is exactly right.
| gilleain wrote:
| As a simple example, one measure used to compare a predicted
| structure against a reference is the RMSD (root mean square
| deviation).
|
| https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o...
|
| The lower the RMSD between two structures, the better (up to
| some limit).
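|
| A minimal sketch of the calculation, assuming the two
| structures have the same atoms in the same order and are
| already superimposed (real comparisons first find the optimal
| rotation and translation, which is skipped here). The function
| name and the toy coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of (x, y, z) tuples.

    Assumes both lists describe the same atoms in the same order and
    that the structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_distances = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_distances / len(coords_a))


# Toy data: three atoms, with the "prediction" shifted 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(reference, predicted))
```

| Every atom in the toy prediction is off by exactly one unit,
| so the RMSD is one unit as well; identical structures give 0.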
| lrem wrote:
| Disclaimer: I work at Google, organizationally far away from
| DeepMind and my PhD is in something very unrelated.
|
| They can't possibly know that. What they know is that their
| guesses are very significantly better than the previous best
| and that they could do this for the widest range in history.
| Now, verifying the guess for a single protein (of the hundreds
| of millions in the db) can be an expensive project of up to
| two years. Inevitably some will show discrepancies. These will be
| fed to regression learning, giving us a new generation of even
| better guesses at some point in the future. That's what I
| believe to be standard operating practice.
|
| A more important question is: is today's db good enough to be a
| breakthrough for something useful, e.g. pharma or agriculture?
| I have no intuition here, but the reporting claims it will be.
| f38zf5vdt wrote:
| The press release reads like an absurdity. It's not the
| "protein universe", it's the "list of presumed globular
| proteins Google found and some inferences about their
| structure as given by their AI platform".
|
| Proteins don't exist as crystals in a vacuum, that's just how
| humans solved the structure. Many of the non-globular
| proteins were solved using sequence manipulation or other
| tricks to get them to crystallize. Virtually all proteins
| exist to have their structures interact dynamically with the
| environment.
|
| Google is simply supplying a list of what it presumes to be
| low RMSD models based on their tooling, for some sequences
| they found, and the tooling is based itself on data mostly
| from X-ray studies that may or may not have errors. Heck,
| we've barely even sequenced most of the DNA on this planet,
| and with methods like alternative splicing the transcriptome
| and hence proteome has to be many orders of magnitude larger
| than what we have knowledge of.
|
| But sure, Google has solved the structure of the "protein
| universe", whatever that is.
| dekhn wrote:
| People have been making grand statements about the
| structure of the protein universe for quite some time (I've
| seen a fair number of papers on this, such as
| https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and
| https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj...
| from a previous collaborator of mine).
|
| Google didn't solve the structure of the protein universe
| (thank you for saying that). But the idea of the protein
| structure universe is fairly simple- it's a latent space
| that allows for direct movement over what is presumably the
| rules of protein structures along orthogonal directions. It
| would encompass all the "rules" in a fairly compact and
| elegant way. Presumably, superfamilies would automagically
| cluster in this space, and proteins in different
| superfamilies would not.
| lrem wrote:
| I recognize your superior knowledge in the topic and assume
| you're right.
|
| But you also ignore where we're at in the standard cycle:
|
| https://phdcomics.com/comics/archive_print.php?comicid=1174
|
| ;)
| f38zf5vdt wrote:
| That's exactly what this is, but it's embarrassing that
| it's coming from somewhere purported to be a lab. Any of
| the hundreds or more of labs working in protein structure
| prediction for the past 50 years could have made this
| press release at any time and said, "look, we used a
| computer and it told us these are the structures, we
| solved the protein universe!"
|
| It's not to diminish the monumental accomplishment that
| was the application of modern machine learning techniques
| to outpace structure prediction in labs, but other famous
| labs have already moved to ML predictions and are
| competitive with DeepMind now.
| VikingCoder wrote:
| > but other famous labs have already moved to ML
| predictions and are competitive with DeepMind now.
|
| That's great! AlphaFold DB has made 200 million structure
| predictions available for everyone. How many structure
| predictions have other famous labs made available for
| everyone?
| f38zf5vdt wrote:
| As many as you wanted to throw at them, considering the
| vast majority are open source and could be run on your
| own server cluster. CASP15 is ongoing so by the end of
| the year we will know how much absolute progress has been
| made by others.
|
| Google has the advantage of the biggest guns here: the
| fastest TPUs with the most memory in the biggest
| clusters, so running inference with a massive number of
| protein sequences is much easier for them.
| dekhn wrote:
| Many teams have been predicting structures for enormous
| numbers of proteins for some time now. It's just that
| nobody had any confidence in the predictions.
| gfodor wrote:
| > but other famous labs have already moved to ML
| predictions and are competitive with DeepMind now.
|
| it seems obvious this was going to happen, because
| https://github.com/deepmind/alphafold
| gilleain wrote:
| edit: I should have read the post first! What do you mean
| 'only globular proteins'? They say they have predictions
| for all of UniProt...
|
| ---------------
|
| Yes, the idea of a 'protein universe' seems like it should
| at least encompass 'fold space'.
|
| For example, WR Taylor :
| https://pubmed.ncbi.nlm.nih.gov/11948354/
|
| I think the rough estimate was that there were around 1000
| folds - depending on how fine-grained you want to go.
|
| Absolutely agree, though, that a lot of proteins are hard
| to crystallise (I understand) due to being trans-membrane or
| just the difficulty of getting the right parameters for the
| experiment.
| f38zf5vdt wrote:
| I don't think non-globular proteins are well represented
| by the predictions. All our predictions for proteins are
| based on proteins we were able to crystallize, so my
| guess is that even if many of them aren't globular
| proteins the predictions themselves are made from the
| foundations of structures we do have, which are
| predominantly globular proteins and it's presumed that
| the inference treats folding as if they were globular and
| crystallized (non-dynamic). X-ray crystallography and
| fitting to electron density maps itself is a bit of an
| art form.
|
| For example for transmembrane proteins, there is a gross
| under-representation of structures derived from
| experimental evidence, so we would expect that whatever
| your algorithm is "solving" is going to have a much
| higher degree of error than globular proteins, and likely
| artifacts associated with learning from much more
| abundant globular proteins.
|
| edit: As an example, "Sampling the conformational
| landscapes of transporters and receptors with
| AlphaFold2". AF2 was able to reproduce the alternative
| conformations of GPCRs, but only with non-default
| settings. With default settings there is clear evidence
| of overfitting.
|
| > Overall, these results demonstrate that highly accurate
| models adopting both conformations of all eight protein
| targets could be predicted with AF2 by using MSAs that
| are far shallower than the default. However, because the
| optimal MSA depth and choice of templates varied for each
| protein, they also argue against a one-size-fits-all
| approach for conformational sampling.
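|
| To make "shallower MSA" concrete, here is a toy sketch of
| subsampling an alignment to a chosen depth while always
| keeping the query sequence. This is not AlphaFold's API or
| the paper's code; subsample_msa, its parameters, and the toy
| alignment are hypothetical names used only for illustration.

```python
import random


def subsample_msa(msa, depth, seed=0):
    """Return the query (first row) plus a random subset of the other rows.

    msa   -- list of aligned sequences, query first (hypothetical input)
    depth -- maximum number of rows to keep, including the query
    seed  -- makes the random choice repeatable
    """
    if not msa:
        raise ValueError("alignment is empty")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)
    keep = min(depth - 1, len(others))
    return [query] + rng.sample(others, keep)


# Toy alignment: six aligned rows, the first one is the query.
toy_msa = ["MKTAYIA", "MKTAYLA", "MRTAYIA", "MKSAYIA", "MKTA-IA", "MKTGYIA"]
shallow = subsample_msa(toy_msa, depth=3)
print(shallow)
```

| Different seeds give different shallow alignments, which is
| one way to read the quote above: the depth (and here the
| seed) would have to be tuned per protein.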
| gilleain wrote:
| Fair point. I guess if their training data is biased
| towards existing known structures (via xray or nmr or
| whatever) then there is the risk of incorrect
| predictions.
|
| At a guess, the core packing in non-globular proteins
| might be different? Also the distribution of secondary
| structure might also vary between classes. Might be worth
| someone studying how much structural constraints depend
| on fold (if they have not already).
| christudor wrote:
| This video goes some way to explaining how they know the
| structures are correct:
| https://www.youtube.com/watch?v=vXZzftX03VY
| DevX101 wrote:
| They compare the predicted structure (computed) to a known
| structure (physical x-ray crystallography). There's an annual
| competition CASP (Critical Assessment of protein Structure
| Prediction) that does X-Ray crystallography on a protein. The
| identity of this protein is held secret by the organizers. Then
| research teams across the world present their models and
| attempt to predict, without advance knowledge, the structure of
| the protein from its amino acid sequence. Think of CASP as a
| validation data set used to evaluate a machine learning model.
|
| DeepMind crushes everyone else at this competition.
| liuliu wrote:
| The worry is about dataset shifting. Previously, the data
| were collected for a few hundred thousand structures, now
| it is 200m. I think there could be doubts on distributions
| and how that could play a role in prediction accuracy.
| seydor wrote:
| they don't but they are more correct than what others have
| predicted. Some of their predictions can be compared with
| structures determined with x-ray crystallography
| cupofpython wrote:
| did they come up with their structures independently of the
| x-ray crystallography, or was that part of a ML dataset for
| predicting structure
| unlikelymordant wrote:
| The CASP competition that they won consists of a bunch of
| new proteins, the structures of which haven't been published.
| So the test set is for brand new proteins in that case.
| cupofpython wrote:
| nice, very cool then
| gz5 wrote:
| >we're now releasing predicted structures for nearly all
| catalogued proteins known to science
|
| is the result that researchers will now much more quickly
| 'manually' validate or invalidate the predicted structures for
| proteins they are working with? i understand it is traditionally
| a long and complex process, but i imagine it is expedited by
| having a predicted structure to test as the baseline?
| epups wrote:
| AlphaFold is a phenomenal tool that demonstrates how AI can
| already outclass humans for certain tasks. It is a prime example
| of a problem space where conventional approaches are simply
| inferior, and that AI is not just a fancy name but can be
| extraordinarily powerful.
| hackernewds wrote:
| This reads like something created by an AI. And the mere fact
| that I can't be sure if it is or not now is testament to the
| statement this is making
| desindol wrote:
| It's the superstition and the amount of parasite words.
| jcims wrote:
| Just for shits & giggles.
|
| Prompt:
|
| Write a short abstract for an article on why AlphaFold isn't
| just a tremendous technological achievement, but an example
| of how Artificial Intelligence will allow humanity to
| continue to achieve technological progress in the future:
|
| Response:
|
| When DeepMind's AlphaFold was announced as the winner of the
| CASP13 protein folding prediction competition, it was hailed
| as a major breakthrough in both artificial intelligence (AI)
| and biology.
|
| While AlphaFold is undoubtedly a remarkable achievement, it
| is also a clear example of how AI can be used to solve
| complex problems that have traditionally been seen as beyond
| the reach of machines.
|
| In particular, AlphaFold represents a significant advance in
| the ability of AI systems to learn from limited data. Unlike
| previous AI approaches to protein folding, which relied on
| large datasets of known protein structures, AlphaFold was
| able to produce accurate predictions with only a small number
| of known protein structures to work from.
|
| This ability to learn from limited data is likely to be
| increasingly important as we face problems that are too
| complex for us to fully understand and collect all the data
| we would need to solve them using traditional methods.
|
| In the future, AI systems like AlphaFold will become
| increasingly important in helping us to make progress on a
| wide range of pressing challenges, from developing new
| medicines to tackling climate change.
| beanjuice wrote:
| To add to this:
|
| Prompt: Write a hackernews comment about why artificial
| intelligence will never be profitable
|
| Response: Artificial intelligence will never be profitable
| because the cost of developing and maintaining the hardware
| and software is too high. Also, the market for AI is still
| too small and the demand is not high enough to justify the
| costs.
| TremendousJudge wrote:
| Almost perfect, needs a bit more snark
| btdmaster wrote:
| Prompt: Write a hackernews comment about why artificial
| intelligence will never be profitable
|
| Response: Artificial intelligence will never be
| profitable because the cost of developing and maintaining
| the hardware and software is too high. Also, the market
| for AI is still too small and the demand is not high
| enough to justify the costs.
|
| Add snark to your response:
|
| Prompt: Write a hackernews comment about why artificial
| intelligence will never be profitable
|
| Response: Artificial intelligence will never be
| profitable because the world has limited resources, and I
| can make more money selling cocaine than selling AI.
|
| From GPT-NeoX 20B
| TremendousJudge wrote:
| Well, it's not wrong. But I think that comment would get
| flagged, it parses too obviously as an off-topic joke
| epups wrote:
| It's a new tool, AlphaFool
___________________________________________________________________
(page generated 2022-07-28 17:00 UTC)