[HN Gopher] Chemical space is really big (2014)
___________________________________________________________________
Chemical space is really big (2014)
Author : optimalsolver
Score : 85 points
Date : 2021-06-25 18:26 UTC (4 hours ago)
(HTM) web link (www.chemistryworld.com)
(TXT) w3m dump (www.chemistryworld.com)
| xwdv wrote:
| What kind of chemicals are we currently searching for in the
| chemical space?
| d_silin wrote:
| Catalysts!
| optimalsolver wrote:
| https://opencatalystproject.org/
| hypertele-Xii wrote:
| Cure for cancer? Better batteries. Materials for fusion
| reactors. Replacement for plastic. All sorts of things.
| BeFlatXIII wrote:
| A better LSD
| [deleted]
| captainmuon wrote:
| This reminds me of something I always wanted to ask. Are the
| majority of molecules in the body "named" or "purposeful"
| molecules, like haemoglobin, vitamins, water, lipids, DNA, etc.,
| or is there a lot of random stuff, where just some atoms are
| arranged arbitrarily? Ignoring for a second the trivial thing that
| you can make really long polymers, you have mutations in DNA and
| so on - I would count those in the first case. What is the
| ratio of "encyclopaedic" molecules (discovered or not) to "random
| stuff" (useless or not)?
| correcthorse123 wrote:
| I couldn't give you a ratio but I'd think it's quite a high
| ratio. There probably aren't many molecules that don't have
| either a chemical (i.e. have some function in a pathway) or
| physicochemical influence.
| Judgmentality wrote:
| Not saying you're wrong, but why do you think that? I'd have
| thought the same thing about DNA, but I keep being told most
| DNA is junk (although I wouldn't be surprised to find out
| later we just don't know what it's for).
| malux85 wrote:
| http://biochemical-pathways.com/#/map/1
| frisco wrote:
| Everything has a name, but it is generally a "systematic"
| name[1] rather than a one-off descriptive name. Even DNA is a
| systematic name for the monomer (de-oxy-ribose-nucleic-acid is
| one of the defined nucleic acids bound to a ribose sugar
| missing an oxygen at the 2-position carbon).
|
| Biology uses an enormous space of small molecule structures (to
| say nothing of proteins, which have their own naming schemes)
| and few have names you might recognize generally, but all have
| useful systematic names that biologists and chemists can
| quickly parse.
|
| As a twist, most systematic naming schemes don't produce unique
| labels, so there's often multiple ways to say the same thing,
| and different discipline subcultures have different biases in
| this regard.
|
| Edit: re-reading OP, another interpretation is that they're
| asking what percent of molecules in the body aren't involved in
| biology. The answer to that is probably something that
| approximates 0%. At the end of the day, the combined
| interaction of all of this chemistry is what biology _is_, and
| everything is more or less everywhere. (...concentration is
| everything.)
|
| [1]
| https://en.m.wikipedia.org/wiki/Systematic_name#In_chemistry
| adt2bt wrote:
| I wonder how effective AI will be at enabling us to navigate
| chemical space for certain desired compounds. Knowing nothing
| about the problem, is it something akin to the protein folding
| challenge that AlphaFold[0] recently did well at?
|
| Side note: I love Derek Lowe's writings. I don't know what it is,
| but every time I see a chemistry related link bubble up in HN, I
| have a gut feeling it was written by him. And I'm usually
| impressed. His Things I Won't Work With[1] series is amazingly
| well written.
|
| [0] https://deepmind.com/blog/article/alphafold-a-solution-
| to-a-...
|
| [1]
| https://blogs.sciencemag.org/pipeline/search/Things+I+Wont+W...
| krab wrote:
| I think that the biggest issue isn't the chemical space but the
| complexity of biological systems. It's hard to tell what the
| molecule will do. We just don't have a good enough simulator.
| AlphaFold is definitely a helpful step but more are needed in
| the same direction.
|
| (In 2010 - 2012 I worked in a laboratory that did small
| compounds screening and I was building some tools to explore
| the chemical space)
| timr wrote:
| Derek is right about the vastness of chemical space, but I go
| back and forth on the claim (frequently made by those in drug
| discovery) that AI cannot possibly extrapolate to spaces of
| this size, for at least three reasons:
|
| * Image space and text space are also vast, and yet we've had
| good success applying AI in these areas. I have yet to see a
| convincing argument that these spaces aren't equally large.
|
| * It's a bit of a red herring: actual drug discovery programs
| are not exploring "all of chemical space". They're usually
| focused on "lead series" of much more constrained molecules.
|
| * There are actually context-independent signals that can be
| used to generalize AI methods. The generalization is far from
| perfect, but it's not like every one of those 10^60 molecules
| is entirely different from every other molecule in the set.
| There are clusters and patterns and trends that can be
| exploited for gain -- this is what makes "medicinal chemistry"
| an academic field, and not merely an exercise in fortune-
| telling.
|
| Personally, I think the bigger problem applying AI and ML to
| drug discovery is less the "vastness of chemical space" (a
| proposition that makes med-chemists feel secure about their
| jobs), and more that the datasets in drug discovery _suck_.
| There's tons of siloing of data, none of it is consistent, and
| you can't even depend on two assays for the same target,
| measured in the same lab, years apart, yielding consistent
| data. It's a total mess.
| dnautics wrote:
| So text space is trivially vectorizable, at the character
| level and even for difficult languages like Russian chunk-
| vectorisable with some care. How do you encode the difference
| between haouamine A and atrop-haouamine A, while keeping the
| similarities, without resorting to empirical measurements and
| classification, which could yield reasonable vectors, but
| will take 2-5 years of a highly trained grad student's labor
| to obtain and put into the training corpus?
| timr wrote:
| > How do you encode the difference between haouamine A and
| atrop-haouamine A,
|
| There are now lots of ways of encoding molecules. So many,
| in fact, that it's not really worth debating the merits of
| any particular method.
|
| ECFP fingerprints shoved into a fully connected NN work
| surprisingly well for a large class of problems. Molecular
| graph convolutions (of which there are now many flavors)
| also work well. The field is to the point where people are
| doing ensembles of different encodings, and seeing what
| works for any particular problem.
|
| > without resorting to empirical measurements and
| classification, which could yield reasonable vectors, but
| will take 2-5 years of a highly trained grad student's
| labor to obtain and put into the training corpus
|
| Well, you're sort of touching on my last paragraph with
| this. The classifier, featurization, etc., usually matters
| less (a lot less?) than the quality of the assay data. So I
| agree in that respect.
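|
| To make the "fingerprint as input vector" idea concrete, here is a
| minimal sketch of a circular, hashed fingerprint in the spirit of
| ECFP. It is a toy, not the ECFP algorithm itself: the function
| names, the hand-written atom/bond lists and the 1024-bit folding
| are all assumptions made for illustration, and real ECFP uses
| richer atom invariants plus bond information. The toy also ignores
| stereochemistry entirely, so it would give two atropisomers the
| same bits.

```python
import hashlib


def _h(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of 'on' bit positions for a molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs (bond orders are ignored in this toy)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Round 0: each atom's identifier depends only on its element.
    ids = [_h(symbol) for symbol in atoms]
    features = set(ids)

    # Each round folds the sorted neighbor identifiers into the atom's own,
    # so after r rounds an identifier describes a radius-r environment.
    for _ in range(radius):
        ids = [
            _h(f"{ids[i]}|" + ",".join(str(x) for x in sorted(ids[k] for k in neighbors[i])))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the large hash values down to a fixed-length bit vector.
    return {f % n_bits for f in features}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hand-written heavy-atom graphs (hypothetical inputs, hydrogens omitted).
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = toy_circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
methane = toy_circular_fingerprint(["C"], [])

print("ethanol vs propanol:", tanimoto(ethanol, propanol))
print("ethanol vs methane: ", tanimoto(ethanol, methane))
```

| The set of on-bits can be expanded into a 0/1 vector of length
| n_bits and fed to whatever model you like. In practice you would
| use a cheminformatics toolkit rather than a hand-rolled hash.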
| [deleted]
| jpollock wrote:
| What makes 1 billion rows a large search space?
|
| What makes 150 billion rows incomprehensibly large?
|
| With a molecular weight of 500 we're talking something on the
| order of terabytes of data (for 1b molecules)?
|
| It certainly sounds like a tractable amount of data.
|
| What computer problems are stopping us from generating a
| compound, computationally testing it for stability, and adding it
| to the list and then searching?
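|
| A back-of-envelope version of the storage estimate above, as a
| sketch. The per-record sizes are assumptions, not measurements:
| roughly 60 bytes for a compact line notation and roughly 1000
| bytes for a record that also carries coordinates.

```python
def dataset_bytes(n_molecules: int, bytes_per_record: int) -> int:
    """Raw storage for n_molecules records of a fixed size (no compression)."""
    return n_molecules * bytes_per_record


def human(n_bytes: float) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    for unit in ("B", "kB", "MB", "GB", "TB", "PB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} EB"


# Assumed record sizes (hypothetical): compact string vs. string + coordinates.
COMPACT_BYTES = 60
WITH_COORDS_BYTES = 1000

for rows in (1_000_000_000, 150_000_000_000):
    print(
        f"{rows:,} rows:",
        human(dataset_bytes(rows, COMPACT_BYTES)),
        "to",
        human(dataset_bytes(rows, WITH_COORDS_BYTES)),
    )
```

| Under these assumptions the storage itself is indeed tractable;
| the replies below are about how quickly the row count grows.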
| whatshisface wrote:
| I have learned over time that state space size has nothing to
| do with problem difficulty. Sorting finds an answer in a space
| with n! possibilities in n log(n) time.
|
| Chemistry is difficult not because of the large number of
| chemicals but because there hasn't been a lot of structure
| discovered in them to allow the sort of compounding subcase
| solving that makes searching a sorted list tractable. The
| structure that has been discovered can be found in chemistry
| textbooks and has names like "so-and-so's rule" which can be
| applied to boroalkanes with between 5 and 12 vertices excepting
| 6, unless the cage is charged in which case you should treat it
| like it has one fewer vertex, unless the charge is -2 and the
| original vertex count is between 8 and 13, in which case...
|
| Those rules are much better than a table as measured by
| information compression but you can't discover them unless you
| start with the table mostly filled out already.
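|
| The sorting comparison can be put in numbers with a short sketch.
| The function name is made up for illustration, and n*log2(n) is
| only a rough stand-in for the comparison count of a good sort.

```python
import math


def search_space_vs_work(n: int):
    """Return (number of possible orderings, rough comparison count) for n items."""
    orderings = math.factorial(n)   # size of the space of candidate answers
    comparisons = n * math.log2(n)  # rough cost of a comparison sort
    return orderings, comparisons


for n in (10, 20, 50):
    orderings, comparisons = search_space_vs_work(n)
    print(f"n={n}: {orderings:.3e} orderings, about {comparisons:.0f} comparisons")
```

| The gap exists because each comparison rules out a large share of
| the remaining orderings at once, which is the kind of exploitable
| structure the comment says chemistry mostly lacks.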
| rodrigosetti wrote:
| Chemical simulation is very hard (involves solving multiple
| NP-complete problems).
|
| But that is one of the expected applications of quantum
| computers (simulate quantum systems).
| _ihaque wrote:
| Even at 500Da or less, it's much, much larger than that.
|
| You may be interested in the work of Jean-Louis Reymond's
| group, who have done more or less exactly what you suggest:
| https://gdb.unibe.ch/downloads/
|
| GDB-11, with 11 heavy atoms, has 26.4M structures (110.9M
| stereoisomers -- molecules aren't 2D). Going up to 13 gives you
| 970M molecules. Going up to 17 (still mostly below 250Da) is
| 166,400M.
|
| There's a lot of space up there below 500Da.
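|
| As a rough sketch, the three counts quoted above imply the
| following average growth per added heavy atom. The dictionary and
| function names are illustrative, and the databases were not
| necessarily built with identical rules, so treat the factors as
| order-of-magnitude only.

```python
# Structure counts quoted in the comment, keyed by heavy-atom limit.
gdb_counts = {11: 26.4e6, 13: 970e6, 17: 166.4e9}


def growth_per_atom(n_small: int, n_large: int, counts=gdb_counts) -> float:
    """Geometric-mean growth factor per extra heavy atom between two sizes."""
    ratio = counts[n_large] / counts[n_small]
    return ratio ** (1 / (n_large - n_small))


print("11 -> 13 heavy atoms:", round(growth_per_atom(11, 13), 2))
print("13 -> 17 heavy atoms:", round(growth_per_atom(13, 17), 2))
```

| That works out to a factor of roughly 6 per atom between 11 and
| 13, and roughly 3.6 per atom between 13 and 17, so every extra
| atom multiplies the list rather than adding to it.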
| sseagull wrote:
| Also note that that only includes organic molecules. There's
| another 85-ish natural elements in the periodic table that
| could be important, but are much harder to synthesize or
| compute.
|
| Although including heavier elements can blow past 500 Da
| pretty easily.
| whatshisface wrote:
| The heavier elements start acting more like continuous
| systems and less like quanta legos, as you get more and
| more states per eV. Transition metals and lanthanides don't
| get their own combinatoric explosion until literal sticks
| are stuck on the smooth balls in coordination chemistry.
| Throw6away wrote:
| "There are, I think, two reactions to this. One is despair, of
| course, which is always an option in research, but not a very
| useful one."
|
| These are words to live by.
| j-wags wrote:
| If you'd like to check out what current chemical database files
| look like, "Enamine REAL" is a fairly widely-known one.[1] My
| understanding is that this file is a mix of their ACTUAL in-stock
| inventory, as well as the product of running a small number of
| high-reliability reactions on each compound in that inventory. So
| it serves as a "vendor catalog" file, where everything in here
| can be ordered from Enamine and synthesized+delivered to your
| door in a few weeks.
|
| Another approach I've heard of for iterating through every
| molecule in a large region of chemical space is to START with a
| large molecule dataset, then for each molecule, predict the
| result of performing simple reactions on it. For each reaction
| product, do your full analysis, and only store the result if the
| analysis indicates it is noteworthy. This, in effect, lets you
| scan over a larger region of chemical space than you can fit in
| memory.
|
| [1] https://enamine.net/compound-collections/real-
| compounds/real...
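|
| The second approach can be sketched as a streaming generator.
| Everything here is a placeholder: the "reactions" just append a
| character to a SMILES-like string and the "analysis" counts N and
| O characters. A real pipeline would swap in reaction prediction
| and a real scoring function; only the control flow is the point.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def scan_products(
    molecules: Iterable[str],
    reactions: Iterable[Callable[[str], Optional[str]]],
    score: Callable[[str], float],
    threshold: float,
) -> Iterator[Tuple[str, float]]:
    """Yield (product, score) for noteworthy products, one at a time.

    Products that score below the threshold are dropped immediately, so
    memory use does not grow with the number of products examined.
    """
    reactions = list(reactions)  # reused for every input molecule
    for mol in molecules:
        for react in reactions:
            product = react(mol)
            if product is None:  # reaction not applicable to this molecule
                continue
            s = score(product)
            if s >= threshold:
                yield product, s


# Toy stand-ins (hypothetical): string edits instead of chemistry.
seeds = ["CC", "CCC", "CCO"]
toy_reactions = [lambda m: m + "O", lambda m: m + "N"]
toy_score = lambda m: sum(ch in "NO" for ch in m)

hits = list(scan_products(iter(seeds), toy_reactions, toy_score, threshold=2))
print(hits)
```

| Because the seeds can themselves come from a file iterator, neither
| the input set nor the product set has to fit in memory at once.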
| jamestimmins wrote:
| Can anyone give an ELI5 for what the limitation is in terms of
| processing these computationally? Is the challenge that it's
| difficult to model how a molecule will interact with another
| molecule, so you have to do it with atoms and test the
| interaction across every other molecule in the search space?
|
| For context I got a B- in high school chemistry and haven't
| looked back.
|
| *Edit: "do it with atoms" is confusing in this context. I mean do
| it in the real world outside of bits.
| whatshisface wrote:
| Molecules obey known laws of physics and can in principle be
| simulated exactly. That is not practical with present-day
| computers because it's quantum-mechanical and has an
| exponentially large state space. Heuristic and approximate
| methods are used to pare this down, sacrificing absolute truth,
| leading to results that are not very reliable. That is why
| experiments are still done in chemistry labs even though
| everything that happens in a chemistry lab has been
| "understood" since the 1930s.
|
| Chemists focus in on the least simulatable problems because
| most interesting chemistry happens right on the border of not
| happening at all. Molecules that are very easy to calculate are
| ones that small energy errors don't matter for. That makes them
| either incredibly stable or incredibly unstable, but chemistry
| happens near the boundary.
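|
| To give a feel for "exponentially large", here is a sketch of the
| memory needed just to store a full quantum state vector. It
| assumes the system is modeled as n two-level degrees of freedom
| (for example, orbitals that are either occupied or empty) with one
| 16-byte complex amplitude per basis state. Real methods exploit
| symmetries and approximations, so this is an upper-bound picture,
| not how chemistry codes actually store things.

```python
def state_vector_bytes(n_two_level_systems: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed to store all 2**n complex amplitudes of a full state vector."""
    return (2 ** n_two_level_systems) * bytes_per_amplitude


for n in (10, 30, 50, 100):
    print(f"n={n}: {state_vector_bytes(n):.3e} bytes")
```

| Each additional two-level system doubles the requirement, which is
| why heuristic and approximate methods take over very quickly.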
| [deleted]
| ChrisArchitect wrote:
| Anything new on this since 2014?
| [deleted]
| euske wrote:
| I also want to add that the programming space, or software
| specification space, is mind-bogglingly big, if not bigger
| than the chemical space. People should know how many
| possibilities of small details exist for implementing a teeny
| trivial feature, because that's the way it is. Everything around
| us has a billion-gazillion parameter space, and all we're seeing
| is just a chance occurrence.
| Severian wrote:
| "I mean, you may think it's a long way down the road to the
| chemist's, but that's just peanuts to space."
| Y_Y wrote:
| There's a couple of xkcd comics relevant to this. Anyway the
| space of possible compounds is mind-bogglingly huge and that's
| impressive. At the same time it's countable, and as countable
| things go, it's not even so big. The kind of hugeness that keeps
| me up at night is the "long line" or the phase space of the
| cosmic fluid.
| carl_dr wrote:
| Genuine question: what are the "long line" or the phase space
| of the cosmic fluid?
|
| I found the Wikipedia page
| https://en.m.wikipedia.org/wiki/Long_line_(topology) but like a
| lot of such pages, they are opaque unless you are familiar with
| the topic. Consequently, I have no idea if this is the long
| line you are referring to.
|
| Oh, and which xkcd comics?
| gpcr1949 wrote:
| It's important to note that although chemical space is quite
| large, most of this space is not easy to synthesize and also is
| not chemically feasible, stable or desirable. Another interesting
| "small" subset of chemical space is ZINC [0] which is a database
| of about a billion commercially offered compounds, meaning that
| manufacturers at a minimum think they can easily make them (and
| effectively the fulfilment is quite high when random compounds
| are ordered, e.g. 95% in this paper where they did molecular
| docking simulations on the entirety of this database to find new
| melatonin receptor modulators [1]). Concerning exploration of
| chemical space, one area that might be of interest here is the
| quite effective smooth(ish) movement through structure-property
| space using VAEs.[2]
|
| [0] https://zinc.docking.org/
|
| [1] "Virtual discovery of melatonin receptor ligands to modulate
| circadian rhythms",
| https://www.nature.com/articles/s41586-020-2027-0.pdf
|
| [2] "Automatic Chemical Design Using a Data-Driven Continuous
| Representation of Molecules",
| https://arxiv.org/pdf/1610.02415.pdf
| jhirshman wrote:
| We've been working on these types of chemical search
| optimization problems across a variety of industries, and I'd
| like to echo this comment. Despite the fact that most of the
| space is unexplored, the act of exploring it for the sake of
| exploring it is often unwise. A vast, vast majority of the time
| a naive or even statistically driven search will fail if the
| goal is to find something "new." The reality is that the path
| to a truly new innovative chemical is hard to anticipate and
| even harder to optimize for, plus the curse of dimensionality
| means that our intuition for how hard that search really is is
| hopelessly misguided.
|
| If you're interested in related problems, my company,
| Uncountable, is looking for software engineers.
| https://www.uncountable.com/careers. We emphasize that the most
| important thing for organizations to do today is structure
| their data. It's the best chance to take specialized internal
| knowledge and put it to use to find new chemicals.
___________________________________________________________________
(page generated 2021-06-25 23:00 UTC)