[HN Gopher] FunSearch: Making new discoveries in mathematical sc...
       ___________________________________________________________________
        
       FunSearch: Making new discoveries in mathematical sciences using
       LLMs
        
       Author : reqo
       Score  : 197 points
       Date   : 2023-12-14 16:24 UTC (6 hours ago)
        
 (HTM) web link (deepmind.google)
 (TXT) w3m dump (deepmind.google)
        
       | mejutoco wrote:
       | I wonder how someone will integrate symbolic reasoning with LLMs,
       | or if it will be possible.
        
         | Malp wrote:
         | Ditto, this seems to have some parallels to the neurosymbolic
         | ideas being explored by Lab V2 at ASU
        
         | nimski wrote:
         | This is what we're doing. I think it's not only possible, but
         | also necessary for applications beyond trial-and-error
         | generation.
        
         | 1024core wrote:
         | LEAN
        
       | ignoramous wrote:
       | Related commentary on "self-play" by Subbarao:
       | https://twitter.com/rao2z/status/1728121216479949048
       | 
        | From the article:
        | 
        |     FunSearch uses an evolutionary method powered by LLMs, which
        |     promotes and develops the highest scoring ideas. These ideas
        |     are expressed as computer programs, so that they can be run
        |     and evaluated automatically.
        | 
        |     The user writes a description of the problem in the form of
        |     code. This description comprises a procedure to evaluate
        |     programs, and a seed program used to initialize a pool of
        |     programs.
        | 
        |     At each iteration, FunSearch selects some programs from the
        |     current pool. The LLM creatively builds upon these, and
        |     generates new programs, which are automatically evaluated.
        |     The best ones are added back to the pool of existing
        |     programs, creating a self-improving loop.
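        | 
        | In code, that loop is roughly the following. A minimal Python
        | sketch; the names (llm.generate, evaluate) are mine, not
        | DeepMind's actual implementation:
        | 
        |     import random
        | 
        |     def funsearch(seed_program, evaluate, llm, iterations=10000):
        |         # initialize the pool from the user-supplied seed program
        |         pool = [(seed_program, evaluate(seed_program))]
        |         for _ in range(iterations):
        |             # select a few programs from the current pool
        |             parents = random.sample(pool, k=min(2, len(pool)))
        |             # the LLM builds creatively on the selected programs
        |             candidate = llm.generate([p for p, _ in parents])
        |             # new programs are run and evaluated automatically
        |             try:
        |                 score = evaluate(candidate)
        |             except Exception:
        |                 continue  # discard programs that crash
        |             # the best ones go back into the pool: a self-
        |             # improving loop
        |             pool.append((candidate, score))
        |             pool.sort(key=lambda ps: ps[1], reverse=True)
        |             pool = pool[:100]  # keep the pool bounded
        |         return pool[0]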
       | 
        | For web search, I ( _evaluator_ ) use _pplx.ai_ and _phind.com_
        | in a similar manner. Ask it a question ( _seed_ ) and see what
        | references it brings up (web links). Refine my question or ask
        | follow-ups ( _iterate_ ) so it pulls up different or more in-
        | depth references ( _improve_ ). Works better at unearthing gems
        | than sifting through Reddit or Google.
       | 
       | Given _Tech Twitter_ has amazing content too, looking forward to
       | using Grok for research, now that it is open to all.
        
       | alphabetting wrote:
       | https://twitter.com/gfodor/status/1735348301812383906
       | 
       | >If DeepMind just definitively proved neural networks can
       | generate genuinely new knowledge then it's the most important
       | discovery since fire.
       | 
        | If this were actually the case, why wouldn't everyone be talking
        | about this? I am impressed it was done on PaLM 2, given that's
        | less advanced than GPT-4 and Gemini. Will be wild to see what the
       | next few generations of models can do utilizing methods like
       | this.
        
         | dougmwne wrote:
         | I said "WOW!" out loud.
         | 
         | An LLM can discover a new solution in high dimensional geometry
         | that hasn't advanced in 20 years!? That goes way beyond glueing
         | little bits of plagiarized training data together in a
         | plausible way.
         | 
          | This suggests that there are hidden depths to LLMs' capabilities
         | if we can just figure out how to prompt and evaluate them
         | correctly.
         | 
         | This significantly broke my expectations. Who knows what
         | discovery could be hiding behind the next prompt and random
         | seed.
        
           | bendergarcia wrote:
           | It's kind of like humans: seed of two people and a random
           | interest to pursue, what could they do?!? It makes poverty
           | and children dying unnecessarily even more depressing.
        
           | riku_iki wrote:
           | > An LLM can discover a new solution in high dimensional
           | geometry that hasn't advanced in 20 years!?
           | 
           | LLM + brute force coded by humans.
        
             | dougmwne wrote:
             | And that not only created a human interpretable method, but
             | beat out 20 years of mathematicians working on a high
             | profile open question.
             | 
             | Who's to say our brains themselves aren't brute force
             | solution searchers?
        
               | riku_iki wrote:
               | > Who's to say our brains themselves aren't brute force
               | solution searchers?
               | 
                | yes, it's just much slower (many trillions of times) at
                | number crunching.
        
           | nybsjytm wrote:
           | > An LLM can discover a new solution in high dimensional
           | geometry that hasn't advanced in 20 years!? That goes way
           | beyond glueing little bits of plagiarized training data
           | together in a plausible way.
           | 
           | The first advance in 20 years was by Fred Tyrrell last year
           | https://arxiv.org/abs/2209.10045, who showed that the
           | combinatorics quantity in question is between 2.218 and
           | 2.756, improving the previous lower bound of 2.2174. DeepMind
           | has now shown that the number is between 2.2202 and 2.756.
           | 
           | That's why the DeepMind authors describe it as "the largest
           | increase in the size of cap sets in the past 20 years," not
           | as the only increase. The arithmetic is that 2.2202-2.218 is
           | larger than 2.218-2.2174. (If you consider this a very
           | meaningful comparison to make, you might work for DeepMind.)
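            | 
            | Spelled out, since the deltas are easy to misread (numbers
            | as above; trivial Python):
            | 
            |     tyrrell  = round(2.218 - 2.2174, 4)   # 0.0006
            |     deepmind = round(2.2202 - 2.218, 4)   # 0.0022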
        
         | Imnimo wrote:
         | The heavy lifting here is done by the evolutionary algorithm.
         | The LLM is just being asked "propose some reasonable edits to
         | these 20 lines of python" to replace a random mutation
         | operator. It feels generous to credit the neural network with
         | the knowledge generation here.
         | 
         | It's also highly dependent on the nature of the problem, even
         | beyond the need to have the "hard to generate, easy to
         | evaluate" structure. You have to be able to decompose the
         | problem in such a way that a very short Python function is all
         | that you want to evolve.
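          | 
          | Schematically, the swap is just replacing one operator with
          | another. A toy sketch (the token pool and the llm handle are
          | made up, not from the paper):
          | 
          |     import random
          | 
          |     TOKEN_POOL = ["+", "-", "*", "0", "1", "a", "b"]
          | 
          |     def random_mutate(program):
          |         # classic operator: blindly perturb one random token
          |         tokens = program.split()
          |         i = random.randrange(len(tokens))
          |         tokens[i] = random.choice(TOKEN_POOL)
          |         return " ".join(tokens)
          | 
          |     def llm_mutate(parent_a, parent_b, llm):
          |         # FunSearch-style operator: show the LLM two parents
          |         # and ask it to write a plausible next version
          |         prompt = (parent_a + "\n\n" + parent_b +
          |                   "\n\ndef heuristic_v2(items):")
          |         return llm.complete(prompt)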
        
           | og_kalu wrote:
            | There's nothing generous about it. It wouldn't be possible
            | without the LLM. Traditional computational solvers fall well
            | short, as they say and demonstrate.
           | 
           | >The LLM is just being asked "propose some reasonable edits
           | to these 20 lines of python" to replace a random mutation
           | operator.
           | 
           | Hahaha "Just".
        
             | Imnimo wrote:
             | We can see from the ablations that simply asking the LLM to
             | generate a large number of potential solutions without the
             | evolutionary search performs terribly.
             | 
             | It's certainly true that the LLM mutator produces more
             | reasonable edits than a random mutator. But the value the
             | LLM is bringing is that it knows how to produce reasonable-
             | looking programs, not that it has some special
             | understanding of bin packing or capset finding.
             | 
             | The only thing the LLM sees is two previously generated
             | code samples, labeled <function>_v0 and <function>_v1 and
             | then a header for <function>_v2 which it fills in. Look at
             | Extended Data Fig. 1.
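              | 
              | i.e. the whole prompt is essentially this and nothing more
              | (a paraphrase of the figure, not a verbatim copy):
              | 
              |     def priority_v0(el, n):
              |         ...  # lower-scoring sampled program
              | 
              |     def priority_v1(el, n):
              |         ...  # higher-scoring sampled program
              | 
              |     def priority_v2(el, n):
              |         # <- the model completes this body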
        
               | og_kalu wrote:
                | With the current competency of today's models, both are
                | necessary; nobody is denying that.
                | 
                | It doesn't take a genius, however, to see who the more
                | valuable contributor to the system is. The authors say as
                | much when they anticipate that improvement will come more
                | from the capabilities of better models than from any
                | change to the search algorithm. GPT-4 may well have
                | reduced the search time by half, and so on.
               | 
               | >But the value the LLM is bringing is that it knows how
               | to produce reasonable-looking programs, not that it has
               | some special understanding of bin packing or capset
               | finding.
               | 
               | So it is a general problem solver then, great. That's
               | what we want.
        
               | Imnimo wrote:
               | >It doesn't take a genius however to see who the more
               | valuable contributor to the system is.
               | 
               | Indeed, we can look at the ablations and see that
               | swapping out a weaker code LLM (540B Codey to 15B
               | Starcoder) makes little difference:
               | 
               | "This illustrates that FunSearch is robust to the choice
               | of the model as long as it has been trained sufficiently
               | well to generate code."
               | 
               | >So it is a general problem solver then, great. That's
               | what we want.
               | 
               | In the sense that it does not rely on the LLM having any
               | understanding of the problem, yes. But not in the sense
               | that it can be applied to problems that are not naturally
               | decomposable into a single short Python function.
        
               | og_kalu wrote:
               | >Indeed, we can look at the ablations and see that
               | swapping out a weaker code LLM (540B Codey to 15B
               | Starcoder) makes little difference
               | 
               | Really stretching the meaning of "little difference" here
               | 
               | "While 'StarCoder' is not able to find the full-sized
               | admissible set in any of the five runs, it still finds
               | large admissible sets that improve upon the previous
               | state of the art lower bound on the cap set capacity."
               | 
               | https://imgur.com/a/xerOOLn
               | 
               | Literally telling you how much more valuable the LLM (any
               | coding LLM) is than alternatives, while demonstrating
               | prediction competence still matters.
        
               | Imnimo wrote:
               | I don't think five runs is sufficient to reach the
               | conclusion you're reaching here, especially when we can
               | see that one of the Codey runs is worse than all of the
               | StarCoder runs. If Codey is so much more valuable, why is
               | this happening?
               | 
               | It's certainly true that LLM's are more valuable than the
               | alternative tested - which is a set of hand-crafted code
               | mutation rules. But it's important to think about _why_
               | an LLM is better. There are two big pitfalls for the
               | hand-crafted rules. First, the hand-crafted rule has to
                | make local changes - it's vanishingly improbable to be
               | able to do something like "change < to <= in each of this
               | series of if-statements" or "increase all constants by
               | 1", whereas those are natural simple changes that an LLM
               | might make. The second is that the random mutations have
               | no concept of parsimonious code. There's nothing
               | preventing it from generating code that does stuff like
               | compute values and then neglect to return them, or
               | multiply a previous computation by zero, or any number of
               | other obviously useless variations.
               | 
               | What the LLM brings to the table here is the ability to
               | avoid pitfalls like the above. It writes code that is
               | shaped sensibly - it can make a more natural set of edits
               | than "just randomly delete an operator". But that's all
               | it's doing. That's not, like the tweet quoted above
               | called it, "generating genuinely new knowledge".
               | 
               | Put it this way - assuming for a moment that Codey ending
               | up with a higher score than StarCoder is not random
               | chance, do you think it's because Codey has some greater
               | understanding of the admissible-set problem, or because
               | Codey generates a different set of minor code edits?
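                | 
                | Concretely, the contrast looks something like this (a
                | toy example, not from the paper):
                | 
                |     # parent
                |     def priority_v1(item, cap):
                |         return cap - item
                | 
                |     # typical random mutation: valid but obviously
                |     # useless (multiplies the computation by zero)
                |     def priority_rand(item, cap):
                |         return (cap - item) * 0
                | 
                |     # typical LLM edit: small, but shaped like code a
                |     # person would actually write
                |     def priority_llm(item, cap):
                |         return (cap - item) ** 2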
        
             | peddling-brink wrote:
             | Agreed. If the phrase "yeah well you could also do that
             | with a group of moderately trained people" applies, you
             | have a use case for an LLM.
        
           | colechristensen wrote:
            | People might argue about how to apportion credit to the
            | various parts. LLMs by themselves are quite limited in their
            | creative potential to do _new_ things or generally make
            | progress, but if you couple their capabilities (as is done
            | here) with other tech, then their powers really shine.
            | 
            | Likewise when that coupled "tech" is a person. LLMs don't do
            | my job or further my projects by replacing me, but I can use
            | them to generate examples in seconds that would take me
            | minutes or hours, and they can provide ideas for options
            | moving forward.
        
           | nopinsight wrote:
           | Having the right hunches on what to try based on accumulated
           | knowledge & experiences is a key thing that distinguishes
           | masters from apprentices.
           | 
           | A fun story from a UCLA math PhD: "terry tao was on both my
           | and my brother's committee.
           | 
           | he solved both our dissertation problems before we were done
           | talking, each of us got "wouldn't it have been easier
           | to...outline of entire proof""
           | 
           | https://twitter.com/AAAzzam/status/1735070386792825334
           | 
           | Current LLMs are far from Terence Tao but Tao himself wrote
           | this:
           | 
           | "The 2023-level AI can already generate suggestive hints and
           | promising leads to a working mathematician and participate
           | actively in the decision-making process. When integrated with
           | tools such as formal proof verifiers, internet search, and
           | symbolic math packages, I expect, say, 2026-level AI, when
           | used properly, will be a trustworthy co-author in
           | mathematical research, and in many other fields as well."
           | 
           | https://unlocked.microsoft.com/ai-anthology/terence-tao/
        
             | Imnimo wrote:
             | Do you think the role the LLM plays in this system is
             | analogous to what Tao is talking about?
        
               | nopinsight wrote:
               | What Tao does when proposing an idea most likely
               | encapsulates much more than what a current LLM does. I'm
               | no Terence Tao but I sometimes come up with useful ideas.
               | In a more complex case, I revise those ideas in my head
               | and sometimes on paper several times before they become
               | useful (analogous to using evolutionary algorithms).
               | 
               | However, it is impractical to think consciously of all
               | possible variations. So the brain only surfaces ones
               | likely to be useful. This is the role an LLM plays here.
               | 
               | An expert or an LLM with more relevant experiences would
               | be better at suggesting these variations to try. Chess
               | grandmasters often don't consciously simulate more
               | possibilities than novices.
        
               | Imnimo wrote:
               | Critically, though, the LLM is not acting as a domain
               | expert here. It's acting as a code mutator. The expertise
               | it brings to the table is not mathematical - Codey
               | doesn't have any special insight into bin packing
               | heuristics. It's not generating "suggestive hints and
                | promising leads", it's just helping the evolutionary
                | search avoid nonsense code mutations.
        
               | nopinsight wrote:
               | An LLM serves as a sort of "expert" programmer here.
               | Programming itself is a pretty complex domain in which a
               | purely evolutionary algorithm isn't efficient.
               | 
               | We can imagine LLMs trained in mathematical programming
               | or a different domain playing the same role more
               | effectively in this framework.
        
               | Imnimo wrote:
               | I disagree that the type of coding the LLM is doing here
               | is "expert" level in any meaningful sense. Look for
               | example at the code for the bin-packing heuristics:
               | 
                | https://github.com/google-deepmind/funsearch/blob/main/bin_p...
               | 
               | These aren't complex or insightful programs - they're
               | pretty short simple functions, of the sort you typically
               | get from program evolution. The LLM's role here is just
               | proposing edits, not leveraging specialized knowledge or
               | even really exercising the limits of existing LLMs'
               | coding capabilities.
        
               | nopinsight wrote:
               | To be clearer, the word "expert" in my comment above is
               | in quotation marks because an LLM has more programming
               | expertise than a non-programmer or an evolutionary
               | algorithm, but not anywhere near a true expert
               | programmer.
        
         | og_kalu wrote:
         | Neural networks have been able to "generate new knowledge" for
         | a long time.
         | 
         | So have LLMs,
         | https://www.nature.com/articles/s41587-022-01618-2
        
         | empath-nirvana wrote:
         | In some sense, it's not especially interesting because millions
         | of programmers are doing this with copilot every day. Like I
         | get that in a lot of cases it's _just_ applying common
         | knowledge to new domains, but it is novel code for the most
         | part.
        
         | fatherzine wrote:
         | Don't Look Up
        
         | lossolo wrote:
         | From the paper:
         | 
         | We note that FunSearch currently works best for problems having
         | the following characteristics: a) availability of an efficient
         | evaluator; b) a "rich" scoring feedback quantifying the
         | improvements (as opposed to a binary signal); c) ability to
         | provide a skeleton with an isolated part to be evolved. For
         | example, the problem of generating proofs for theorems [52-54]
         | falls outside this scope, since it is unclear how to provide a
         | rich enough scoring signal.
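          | 
          | Characteristic (c) in practice: the user fixes a skeleton and
          | only one function inside it is evolved. A rough Python sketch
          | along the lines of the paper's bin-packing setup (names and
          | details are mine, not the paper's):
          | 
          |     def priority(item, load):    # the only evolved part
          |         return load - item       # seed: prefer fuller bins
          | 
          |     def solve(items, capacity):  # fixed skeleton
          |         bins = []
          |         for item in items:
          |             ok = [i for i, b in enumerate(bins)
          |                   if b + item <= capacity]
          |             if ok:
          |                 j = max(ok, key=lambda i: priority(item, bins[i]))
          |                 bins[j] += item
          |             else:
          |                 bins.append(item)
          |         return bins
          | 
          |     def evaluate(items, capacity=100):
          |         # rich scoring feedback (b): fewer bins is better
          |         return -len(solve(items, capacity))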
        
       | karencarits wrote:
       | One of the problems they were approaching was the cap set problem
       | 
       | https://en.m.wikipedia.org/wiki/Cap_set
       | 
       | > The problem consists of finding the largest set of points
       | (called a cap set) in a high-dimensional grid, where no three
       | points lie on a line. This problem is important because it serves
       | as a model for other problems in extremal combinatorics - the
       | study of how large or small a collection of numbers, graphs or
       | other objects could be. Brute-force computing approaches to this
       | problem don't work - the number of possibilities to consider
       | quickly becomes greater than the number of atoms in the universe.
       | 
       | > FunSearch generated solutions - in the form of programs - that
       | in some settings discovered the largest cap sets ever found. This
       | represents the largest increase in the size of cap sets in the
       | past 20 years. Moreover, FunSearch outperformed state-of-the-art
       | computational solvers, as this problem scales well beyond their
       | current capabilities.
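        | 
        | For the curious: in the {0,1,2}^n grid, three distinct points
        | lie on a line exactly when their coordinate-wise sum is 0 mod 3,
        | so the property is trivial to verify but brutal to search for. A
        | brute-force checker in Python (fine for small n, hopeless at
        | scale, which is the point):
        | 
        |     from itertools import combinations
        | 
        |     def is_cap_set(points):
        |         # no three distinct points may sum to 0 (mod 3)
        |         # in every coordinate
        |         for a, b, c in combinations(points, 3):
        |             if all((x + y + z) % 3 == 0
        |                    for x, y, z in zip(a, b, c)):
        |                 return False
        |         return True
        | 
        |     # a maximal cap set for n=2 has 4 points:
        |     print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)]))  # True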
        
       | aconz2 wrote:
        | summary: given a program template/skeleton and a fitness function
        | (# correct results, shorter programs, etc.), generate a
        | population of programs with an LLM, use a prompt that generates a
        | new program from k other versions (they found k=2 is good, kinda
        | biological eh), run the programs on inputs and score them with
        | the fitness function; it uses the island model for evolution.
       | 
        | I think the prompt looks in principle something like
        | 
        |     def foo_v1(a, b): ...
        |     def foo_v2(a, b): ...
        |     # generate me a new function using foo_v1 and foo_v2. You
        |     # can only change things inside two double curly braces
        |     # like THIS in {{ THIS }}
        |     # idk not a "prompt engineer"
        |     def foo(a, b): return a + {{}}
       | 
       | They achieved the new results with only ~1e6 LLM calls (I think
       | I'm reading that right) which seems impressively low. They talk
       | about evaluating/scoring taking minutes. Interesting to think
       | about the depth vs breadth tradeoff here which is tied to the
       | latency vs throughput of scoring an individual vs population.
        | What if you memoize across all programs? Can you keep the loss
        | function multidimensional (1d per input or input bucket) so that
        | you might find a population of programs that do well in different
        | areas first, and then work on combining them?
       | 
       | Did we have any prior on how rare the cap set thing is? Had there
       | been previous computational efforts at this to no avail? Cool
       | nonetheless
        
       | stevenhuang wrote:
       | Wonder where the goalposts will be moved now by the "stochastic
       | parrot" parroters.
        
         | lgessler wrote:
         | Idk if this work really bears on that argument. I can imagine
         | Bender reacting to this by observing that this is a clever way
          | of finding the needle (a good heuristic algorithm for an NP-
          | hard problem) in a haystack (millions of other heuristic
         | algorithms that are bad, very bad, or maybe not even
         | syntactically well-formed), and then saying that the fact that
         | the model still produced so much garbage is proof that the LLM
         | still doesn't really know how to reason or access meaning.
         | 
         | And I think that'd be a good argument. OTOH, this system is not
         | just an LLM, but an LLM and a bunch of additional components on
         | top of it, and there might be different things to say about
         | this system considered as a whole.
        
           | airstrike wrote:
           | We don't need LLMs to reason or to access meaning for them to
           | be useful
        
             | lgessler wrote:
             | Sure, but I don't think even Bender or Marcus would deny
             | their potential for practical utility, right? I think they
             | mean to say that LLMs are not exactly as miraculous as they
             | might seem, and very capable of misbehavior.
        
               | cbb330 wrote:
                | Your mind also conjectures many unreasonable and
                | illogical solutions to things. Then, separately, your
                | mind has an "evaluator" to narrow the set of solutions
                | down to things that make sense.
               | 
               | Is anyone saying LLMs in isolation and without contextual
               | tooling are miraculous? Even the text box for chatGPT is
               | a form of wrapping LLM in a UX that is incrementally more
               | miraculous.
        
           | empath-nirvana wrote:
           | I'm not sure why people always want to consider the LLM in
           | isolation. Like if someone built an apparently fully sentient
           | android which completely mimics the behavior of a sentient
           | human in every respect, but it was comprised of a bunch of
           | LLMs and other systems wired together, would you say that
           | it's not intelligent because the LLMs aren't fully sentient?
        
         | marmakoide wrote:
          | A stochastic parrot is used to generate variations of a
          | program, to do hill climbing on programs.
          | 
          | The magician loses some of his charm when you know his trick.
          | The Earth used to be the center of the universe; now it's
          | circling an average star. Intelligence used to be a sacred,
          | mystical thing; we might be on our way to making it less
          | sacred.
        
       | zackmorris wrote:
       | FunSearch is more along the lines of how I wanted AI to evolve
       | over the last 20 years or so, after reading Genetic Programming
       | III by John Koza:
       | 
       | https://www.amazon.com/Genetic-Programming-III-Darwinian-Inv...
       | 
       | I wanted to use genetic algorithms (GAs) to come up with random
       | programs run against unit tests that specify expected behavior.
       | It sounds like they are doing something similar, finding
       | potential solutions with neural nets (NNs)/LLMs and grading them
       | against an "evaluator" (wish they added more details about how it
       | works).
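        | 
        | The unit-tests-as-fitness idea is simple to state; a toy sketch
        | (mine, not Koza's actual setup):
        | 
        |     def fitness(source, tests):
        |         # score a candidate by how many expected behaviors it
        |         # satisfies; crashes simply score zero
        |         env = {}
        |         try:
        |             exec(source, env)  # generated code; sandbox in practice
        |         except Exception:
        |             return 0
        |         f = env.get("f")
        |         if f is None:
        |             return 0
        |         passed = 0
        |         for args, expected in tests:
        |             try:
        |                 passed += (f(*args) == expected)
        |             except Exception:
        |                 pass
        |         return passed
        | 
        |     tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
        |     print(fitness("def f(a, b): return a + b", tests))  # 3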
       | 
       | What the article didn't mention is that above a certain level of
       | complexity, this method begins to pull away from human
       | supervisors to create and verify programs faster than we can
       | review them. When they were playing with Lisp GAs back in the
       | 1990s on Beowulf clusters, they found that the technique works
       | extremely well, but it's difficult to tune GA parameters to
       | evolve the best solutions reliably in the fastest time. So volume
       | III was about re-running those experiments multiple times on
       | clusters about 1000 times faster in the 2000s, to find
       | correlations between parameters and outcomes. Something similar
       | was also needed to understand how tuning NN parameters affects
       | outcomes, but I haven't seen a good paper on whether that
       | relationship is understood any better today.
       | 
       | Also GPU/SIMD hardware isn't good for GAs, since video cards are
       | designed to run one wide algorithm instead of thousands or
       | millions of narrow ones with subtle differences like on a cluster
       | of CPUs. So I feel that progress on that front has been hindered
       | for about 25 years, since I first started looking at programming
       | FPGAs to run thousands of MIPS cores (probably ARM or RISC-V
       | today). In other words, the perpetual AI winter we've been in for
       | 50 years is more about poor hardware decisions and socioeconomic
       | factors than technical challenges with the algorithms.
       | 
       | So I'm certain now that some combination of these old approaches
       | will deliver AGI within 10 years. I'm just frustrated with myself
       | that I never got to participate, since I spent all of those years
       | writing CRUD apps or otherwise hustling in the struggle to make
       | rent, with nothing to show for it except a roof over my head. And
       | I'm disappointed in the wealthy for hoarding their money and not
       | seeing the potential of the countless millions of other people as
       | smart as they are who are trapped in wage slavery. IMHO this is
       | the great problem of our time (explained by the pumping gas scene
       | in Fight Club), although since AGI is the last problem in
       | computer science, we might even see wealth inequality defeated
       | sometime in the 2030s. Either that or we become Borg!
        
         | wolverine876 wrote:
         | > not seeing the potential of the countless millions of other
         | people as smart as they are who are trapped in wage slavery.
         | 
         | Think of the change from what built Silicon Valley, that anyone
         | with a good idea could work hard and change the world. Many of
         | SV's wealthy began that way.
         | 
         | But for you: They say the best time to plant a tree is 20 years
         | ago; the second best time is now.
        
       | lgessler wrote:
       | Tl;dr in my words after a skim: this is a method for using LLMs
       | to find new, SOTA (or at least very good?) heuristic algorithms
        | for NP-hard combinatorial optimization problems. They achieve
       | this not by making the LLM itself smarter but by finding a way to
       | keep only its best ideas. (Haven't had a chance to read carefully
       | yet so caveat lector.)
       | 
       | Pretty impressive, but don't panic--we're not at the singularity
       | yet.
        
       | nybsjytm wrote:
       | Some important context: the discovery is that a certain number in
       | combinatorics is now known to be between 2.2202 and 2.756, not
       | just between 2.218 and 2.756 as discovered last year. The
       | improvement is by finding some particular sequences of numbers
       | which have some special properties, not by a logic-heavy
       | mathematical proof. (That doesn't mean it's unrigorous.)
       | 
       | But it's an interesting and possibly useful method of coming up
       | with examples, pretty much a genetic algorithm with LLMs.
        
       | mmaunder wrote:
       | Regardless of whether this is verifiably new knowledge, it's an
       | interesting case study when you consider limiting access to AI
       | based on model size or some other regulatory measure, and the
        | unfair advantage that confers on corporations that discover new
        | knowledge or laws of nature and can monetize them without
       | sharing.
        
       ___________________________________________________________________
       (page generated 2023-12-14 23:00 UTC)