[HN Gopher] FunSearch: Making new discoveries in mathematical sc...
___________________________________________________________________
FunSearch: Making new discoveries in mathematical sciences using
LLMs
Author : reqo
Score : 197 points
Date : 2023-12-14 16:24 UTC (6 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| mejutoco wrote:
| I wonder how someone will integrate symbolic reasoning with LLMs,
| or if it will be possible.
| Malp wrote:
| Ditto, this seems to have some parallels to the neurosymbolic
| ideas being explored by Lab V2 at ASU
| nimski wrote:
| This is what we're doing. I think it's not only possible, but
| also necessary for applications beyond trial-and-error
| generation.
| 1024core wrote:
| Lean
| ignoramous wrote:
| Related commentary on "self-play" by Subbarao:
| https://twitter.com/rao2z/status/1728121216479949048
|
| From the article: FunSearch uses an evolutionary
| method powered by LLMs, which promotes and develops the highest
| scoring ideas. These ideas are expressed as computer programs, so
| that they can be run and evaluated automatically. The
| user writes a description of the problem in the form of code.
| This description comprises a procedure to evaluate programs, and
| a seed program used to initialize a pool of programs.
| At each iteration, FunSearch selects some programs from the
| current pool. The LLM creatively builds upon these, and generates
| new programs, which are automatically evaluated. The best ones
| are added back to the pool of existing programs, creating a self-
| improving loop.
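|
| In code, that loop is roughly the following (a minimal
| sketch; llm.generate, the prompt format, and the pool
| management are illustrative assumptions, not the paper's
| exact implementation):
|
|       import random
|
|       def build_prompt(parents):
|           # Illustrative prompt: show the parent programs as
|           # prior versions and ask the model for a new one.
|           return "\n\n".join(parents) + "\n\n# Improved version:\n"
|
|       def funsearch_step(pool, llm, evaluate, k=2):
|           # pool: list of (score, program_src) pairs.
|           parents = [src for _, src in random.sample(pool, k)]
|           child = llm.generate(build_prompt(parents))
|           score = evaluate(child)  # user-supplied; None if invalid
|           if score is not None:
|               pool.append((score, child))
|               pool.sort(key=lambda p: p[0], reverse=True)
|               del pool[1000:]  # keep only the best programs
|           return pool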
|
| For websearch, I ( _evaluator_ ) use _pplx.ai_ and _phind.com_ in
| a similar manner. Ask it a question ( _seed_ ) and see what
| references it brings up (web links). Refine my question or ask
| follow-ups ( _iterate_ ) so it pulls up different or more in-
| depth references ( _improve_ ). Works better in unearthing gems
| than sifting through reddit or Google.
|
| Given that _Tech Twitter_ has amazing content too, I'm looking
| forward to using Grok for research, now that it is open to all.
| alphabetting wrote:
| https://twitter.com/gfodor/status/1735348301812383906
|
| >If DeepMind just definitively proved neural networks can
| generate genuinely new knowledge then it's the most important
| discovery since fire.
|
| If this were actually the case, why wouldn't everyone be talking
| about this? I am impressed it was done on Palm 2 given that's
| less advanced than GPT-4 and Gemini. Will be wild to see what the
| next few generations of models can do utilizing methods like
| this.
| dougmwne wrote:
| I said "WOW!" out loud.
|
| An LLM can discover a new solution in high dimensional geometry
| that hasn't advanced in 20 years!? That goes way beyond gluing
| little bits of plagiarized training data together in a
| plausible way.
|
| This suggests that there are hidden depths to LLMs' capabilities
| if we can just figure out how to prompt and evaluate them
| correctly.
|
| This significantly broke my expectations. Who knows what
| discovery could be hiding behind the next prompt and random
| seed.
| bendergarcia wrote:
| It's kind of like humans: seed of two people and a random
| interest to pursue, what could they do?!? It makes poverty
| and children dying unnecessarily even more depressing.
| riku_iki wrote:
| > An LLM can discover a new solution in high dimensional
| geometry that hasn't advanced in 20 years!?
|
| LLM + brute force coded by humans.
| dougmwne wrote:
| And that not only created a human-interpretable method, but
| beat out 20 years of mathematicians working on a high-profile
| open question.
|
| Who's to say our brains themselves aren't brute force
| solution searchers?
| riku_iki wrote:
| > Who's to say our brains themselves aren't brute force
| solution searchers?
|
| Yes, it's just much slower (many trillions of times) at
| number crunching.
| nybsjytm wrote:
| > An LLM can discover a new solution in high dimensional
| geometry that hasn't advanced in 20 years!? That goes way
| beyond gluing little bits of plagiarized training data
| together in a plausible way.
|
| The first advance in 20 years was by Fred Tyrrell last year
| https://arxiv.org/abs/2209.10045, who showed that the
| combinatorics quantity in question is between 2.218 and
| 2.756, improving the previous lower bound of 2.2174. DeepMind
| has now shown that the number is between 2.2202 and 2.756.
|
| That's why the DeepMind authors describe it as "the largest
| increase in the size of cap sets in the past 20 years," not
| as the only increase. The arithmetic is that 2.2202 - 2.218 =
| 0.0022 is larger than 2.218 - 2.2174 = 0.0006. (If you consider
| this a very meaningful comparison to make, you might work for
| DeepMind.)
| Imnimo wrote:
| The heavy lifting here is done by the evolutionary algorithm.
| The LLM is just being asked "propose some reasonable edits to
| these 20 lines of python" to replace a random mutation
| operator. It feels generous to credit the neural network with
| the knowledge generation here.
|
| It's also highly dependent on the nature of the problem, even
| beyond the need to have the "hard to generate, easy to
| evaluate" structure. You have to be able to decompose the
| problem in such a way that a very short Python function is all
| that you want to evolve.
| og_kalu wrote:
| There's nothing generous about it. It wouldn't be possible
| without the LLM. Traditional computational solvers fall well
| short, as they say and demonstrate.
|
| >The LLM is just being asked "propose some reasonable edits
| to these 20 lines of python" to replace a random mutation
| operator.
|
| Hahaha "Just".
| Imnimo wrote:
| We can see from the ablations that simply asking the LLM to
| generate a large number of potential solutions without the
| evolutionary search performs terribly.
|
| It's certainly true that the LLM mutator produces more
| reasonable edits than a random mutator. But the value the
| LLM is bringing is that it knows how to produce reasonable-
| looking programs, not that it has some special
| understanding of bin packing or capset finding.
|
| The only thing the LLM sees is two previously generated
| code samples, labeled <function>_v0 and <function>_v1 and
| then a header for <function>_v2 which it fills in. Look at
| Extended Data Fig. 1.
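|
| Concretely, that prompt can be sketched like this (names,
| signature, and docstring are illustrative; see Extended Data
| Fig. 1 for the real format):
|
|       def build_prompt(v0_src, v1_src, fn="priority"):
|           # Two prior samples renamed _v0 and _v1, then the
|           # header of _v2 for the model to fill in.
|           return (v0_src.replace(fn, fn + "_v0") + "\n\n"
|                   + v1_src.replace(fn, fn + "_v1") + "\n\n"
|                   + "def " + fn + "_v2(el, n):\n"
|                   + '    """Improved version of ' + fn + '_v1."""\n')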
| og_kalu wrote:
| With the current state of the models, both are necessary;
| nobody is denying that.
|
| It doesn't take a genius, however, to see who the more
| valuable contributor to the system is. The authors say as
| much when they anticipate that improvement will come more
| from the capabilities of better models than from any change
| to the search algorithm. GPT-4 may well have reduced the
| search time by half, and so on.
|
| >But the value the LLM is bringing is that it knows how
| to produce reasonable-looking programs, not that it has
| some special understanding of bin packing or capset
| finding.
|
| So it is a general problem solver then, great. That's
| what we want.
| Imnimo wrote:
| >It doesn't take a genius, however, to see who the more
| valuable contributor to the system is.
|
| Indeed, we can look at the ablations and see that
| swapping out a weaker code LLM (540B Codey to 15B
| Starcoder) makes little difference:
|
| "This illustrates that FunSearch is robust to the choice
| of the model as long as it has been trained sufficiently
| well to generate code."
|
| >So it is a general problem solver then, great. That's
| what we want.
|
| In the sense that it does not rely on the LLM having any
| understanding of the problem, yes. But not in the sense
| that it can be applied to problems that are not naturally
| decomposable into a single short Python function.
| og_kalu wrote:
| >Indeed, we can look at the ablations and see that
| swapping out a weaker code LLM (540B Codey to 15B
| Starcoder) makes little difference
|
| Really stretching the meaning of "little difference" here
|
| "While 'StarCoder' is not able to find the full-sized
| admissible set in any of the five runs, it still finds
| large admissible sets that improve upon the previous
| state of the art lower bound on the cap set capacity."
|
| https://imgur.com/a/xerOOLn
|
| Literally telling you how much more valuable the LLM (any
| coding LLM) is than alternatives, while demonstrating
| prediction competence still matters.
| Imnimo wrote:
| I don't think five runs is sufficient to reach the
| conclusion you're reaching here, especially when we can
| see that one of the Codey runs is worse than all of the
| StarCoder runs. If Codey is so much more valuable, why is
| this happening?
|
| It's certainly true that LLM's are more valuable than the
| alternative tested - which is a set of hand-crafted code
| mutation rules. But it's important to think about _why_
| an LLM is better. There are two big pitfalls for the
| hand-crafted rules. First, the hand-crafted rule has to
| make local changes - it's vanishingly improbable that it
| could do something like "change < to <= in each of a series
| of if-statements" or "increase all constants by 1", whereas
| those are natural, simple changes that an LLM might make.
| The second is that the random mutations have
| no concept of parsimonious code. There's nothing
| preventing it from generating code that does stuff like
| compute values and then neglect to return them, or
| multiply a previous computation by zero, or any number of
| other obviously useless variations.
|
| What the LLM brings to the table here is the ability to
| avoid pitfalls like the above. It writes code that is
| shaped sensibly - it can make a more natural set of edits
| than "just randomly delete an operator". But that's all
| it's doing. That's not, as the tweet quoted above put it,
| "generating genuinely new knowledge".
|
| Put it this way - assuming for a moment that Codey ending
| up with a higher score than StarCoder is not random
| chance, do you think it's because Codey has some greater
| understanding of the admissible-set problem, or because
| Codey generates a different set of minor code edits?
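|
| A toy illustration of the locality pitfall (hypothetical
| code, not from the paper): a hand-crafted mutator that swaps
| one operator token at a time can only reach a coordinated
| edit like the ones above through a long run of lucky draws.
|
|       import random
|
|       OPS = ["+", "-", "*", "<", "<="]
|
|       def random_mutate(src):
|           # Swap a single operator token per call - one
|           # local change at a time.
|           tokens = src.split(" ")
|           sites = [i for i, t in enumerate(tokens) if t in OPS]
|           if sites:
|               tokens[random.choice(sites)] = random.choice(OPS)
|           return " ".join(tokens)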
| peddling-brink wrote:
| Agreed. If the phrase "yeah well you could also do that
| with a group of moderately trained people" applies, you
| have a use case for an LLM.
| colechristensen wrote:
| People might argue about the weights of credit to the various
| parts. LLMs as they are by themselves are quite limited in
| their creative potential to do _new_ things or generally make
| progress, but if you couple their capabilities (as is done
| here) with other tech then their powers really shine.
|
| Likewise when that coupled "tech" is a person. LLMs don't do
| my job or further my projects by replacing me, but I can use
| them to generate examples in seconds that would take me
| minutes or hours, and they can provide ideas for options
| moving forward.
| nopinsight wrote:
| Having the right hunches on what to try based on accumulated
| knowledge & experiences is a key thing that distinguishes
| masters from apprentices.
|
| A fun story from a UCLA math PhD: "terry tao was on both my
| and my brother's committee.
|
| he solved both our dissertation problems before we were done
| talking, each of us got "wouldn't it have been easier
| to...outline of entire proof""
|
| https://twitter.com/AAAzzam/status/1735070386792825334
|
| Current LLMs are far from Terence Tao but Tao himself wrote
| this:
|
| "The 2023-level AI can already generate suggestive hints and
| promising leads to a working mathematician and participate
| actively in the decision-making process. When integrated with
| tools such as formal proof verifiers, internet search, and
| symbolic math packages, I expect, say, 2026-level AI, when
| used properly, will be a trustworthy co-author in
| mathematical research, and in many other fields as well."
|
| https://unlocked.microsoft.com/ai-anthology/terence-tao/
| Imnimo wrote:
| Do you think the role the LLM plays in this system is
| analogous to what Tao is talking about?
| nopinsight wrote:
| What Tao does when proposing an idea most likely
| encapsulates much more than what a current LLM does. I'm
| no Terence Tao but I sometimes come up with useful ideas.
| In a more complex case, I revise those ideas in my head
| and sometimes on paper several times before they become
| useful (analogous to using evolutionary algorithms).
|
| However, it is impractical to think consciously of all
| possible variations. So the brain only surfaces ones
| likely to be useful. This is the role an LLM plays here.
|
| An expert or an LLM with more relevant experiences would
| be better at suggesting these variations to try. Chess
| grandmasters often don't consciously simulate more
| possibilities than novices.
| Imnimo wrote:
| Critically, though, the LLM is not acting as a domain
| expert here. It's acting as a code mutator. The expertise
| it brings to the table is not mathematical - Codey
| doesn't have any special insight into bin packing
| heuristics. It's not generating "suggestive hints and
| promising leads", it's just help the evolutionary search
| avoid nonsense code mutations.
| nopinsight wrote:
| An LLM serves as a sort of "expert" programmer here.
| Programming itself is a pretty complex domain in which a
| purely evolutionary algorithm isn't efficient.
|
| We can imagine LLMs trained in mathematical programming
| or a different domain playing the same role more
| effectively in this framework.
| Imnimo wrote:
| I disagree that the type of coding the LLM is doing here
| is "expert" level in any meaningful sense. Look for
| example at the code for the bin-packing heuristics:
|
| https://github.com/google-
| deepmind/funsearch/blob/main/bin_p...
|
| These aren't complex or insightful programs - they're
| pretty short simple functions, of the sort you typically
| get from program evolution. The LLM's role here is just
| proposing edits, not leveraging specialized knowledge or
| even really exercising the limits of existing LLMs'
| coding capabilities.
| nopinsight wrote:
| To be clearer, the word "expert" in my comment above is
| in quotation marks because an LLM has more programming
| expertise than a non-programmer or an evolutionary
| algorithm, but not anywhere near a true expert
| programmer.
| og_kalu wrote:
| Neural networks have been able to "generate new knowledge" for
| a long time.
|
| So have LLMs,
| https://www.nature.com/articles/s41587-022-01618-2
| empath-nirvana wrote:
| In some sense, it's not especially interesting because millions
| of programmers are doing this with copilot every day. Like I
| get that in a lot of cases it's _just_ applying common
| knowledge to new domains, but it is novel code for the most
| part.
| fatherzine wrote:
| Don't Look Up
| lossolo wrote:
| From the paper:
|
| We note that FunSearch currently works best for problems having
| the following characteristics: a) availability of an efficient
| evaluator; b) a "rich" scoring feedback quantifying the
| improvements (as opposed to a binary signal); c) ability to
| provide a skeleton with an isolated part to be evolved. For
| example, the problem of generating proofs for theorems [52-54]
| falls outside this scope, since it is unclear how to provide a
| rich enough scoring signal.
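|
| A minimal sketch of those three ingredients for online bin
| packing, one of the problems in the paper (the seed heuristic
| and the scoring here are illustrative assumptions):
|
|       def priority(item, bins):
|           # (c) The isolated part to be evolved. Seed: best
|           # fit - prefer the bin with the least space left.
|           return [item - cap if cap >= item else float("-inf")
|                   for cap in bins]
|
|       def evaluate(instances, bin_size=100):
|           # (a) An efficient evaluator with (b) a rich score:
|           # total bins used, negated so higher is better.
|           used = 0
|           for items in instances:
|               bins = []  # remaining capacities
|               for item in items:
|                   scores = priority(item, bins)
|                   if bins and max(scores) > float("-inf"):
|                       bins[scores.index(max(scores))] -= item
|                   else:
|                       bins.append(bin_size - item)
|               used += len(bins)
|           return -used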
| karencarits wrote:
| One of the problems they were approaching was the cap set problem
|
| https://en.m.wikipedia.org/wiki/Cap_set
|
| > The problem consists of finding the largest set of points
| (called a cap set) in a high-dimensional grid, where no three
| points lie on a line. This problem is important because it serves
| as a model for other problems in extremal combinatorics - the
| study of how large or small a collection of numbers, graphs or
| other objects could be. Brute-force computing approaches to this
| problem don't work - the number of possibilities to consider
| quickly becomes greater than the number of atoms in the universe.
|
| > FunSearch generated solutions - in the form of programs - that
| in some settings discovered the largest cap sets ever found. This
| represents the largest increase in the size of cap sets in the
| past 20 years. Moreover, FunSearch outperformed state-of-the-art
| computational solvers, as this problem scales well beyond their
| current capabilities.
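|
| For concreteness, the defining check is easy to state in
| code (the example set is for n=2, where a maximum cap set
| has 4 points):
|
|       from itertools import combinations
|
|       def is_cap_set(points):
|           # In Z_3^n, three distinct points lie on a line
|           # exactly when they sum to zero mod 3 in every
|           # coordinate.
|           for a, b, c in combinations(points, 3):
|               if all((x + y + z) % 3 == 0
|                      for x, y, z in zip(a, b, c)):
|                   return False
|           return True
|
|       print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 2)]))  # True
|
| Enumerating subsets is hopeless because Z_3^n has 3**n points
| and therefore 2**(3**n) candidate sets, which is why the paper
| evolves a priority function for building sets greedily instead.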
| aconz2 wrote:
| summary: given a program template/skeleton and a fitness
| function (# correct results, shorter programs, etc.), generate
| a population of programs with an LLM, use a prompt that
| generates a new program from k other versions (they found k=2
| is good, kinda biological eh), run the programs on inputs and
| score them with the fitness function; it uses the island model
| for evolution.
|
| I think the prompt looks in principle something like
|
|       def foo_v1(a, b): ...
|
|       def foo_v2(a, b): ...
|
|       # generate me a new function using foo_v1 and foo_v2.
|       # You can only change things inside two double curly
|       # braces like THIS in {{ THIS }}
|       # idk not a "prompt engineer"
|       def foo(a, b):
|           return a + {{}}
|
| They achieved the new results with only ~1e6 LLM calls (I think
| I'm reading that right) which seems impressively low. They talk
| about evaluating/scoring taking minutes. Interesting to think
| about the depth vs breadth tradeoff here which is tied to the
| latency vs throughput of scoring an individual vs population.
| What if you memoized across all programs? Could you keep the
| loss function multidimensional (1d per input or input bucket)
| so that you might first find a population of programs that do
| well in different areas, and then have it work on combining
| them?
|
| Did we have any prior on how rare the cap set thing is? Had there
| been previous computational efforts at this to no avail? Cool
| nonetheless.
| stevenhuang wrote:
| Wonder where the goalposts will be moved now by the "stochastic
| parrot" parroters.
| lgessler wrote:
| Idk if this work really bears on that argument. I can imagine
| Bender reacting to this by observing that this is a clever way
| of finding the needle (a good heuristic algorithm for an NP
| hard problem) in a haystack (millions of other heuristic
| algorithms that are bad, very bad, or maybe not even
| syntactically well-formed), and then saying that the fact that
| the model still produced so much garbage is proof that the LLM
| still doesn't really know how to reason or access meaning.
|
| And I think that'd be a good argument. OTOH, this system is not
| just an LLM, but an LLM and a bunch of additional components on
| top of it, and there might be different things to say about
| this system considered as a whole.
| airstrike wrote:
| We don't need LLMs to reason or to access meaning for them to
| be useful
| lgessler wrote:
| Sure, but I don't think even Bender or Marcus would deny
| their potential for practical utility, right? I think they
| mean to say that LLMs are not exactly as miraculous as they
| might seem, and very capable of misbehavior.
| cbb330 wrote:
| your mind also conjectures many unreasonable and
| illogical solutions to things. then separately your mind
| has an "evaluator" to deduce a set of solutions down to
| things that make sense.
|
| Is anyone saying LLMs in isolation and without contextual
| tooling are miraculous? Even the text box for ChatGPT is
| a form of wrapping an LLM in a UX that is incrementally
| more miraculous.
| empath-nirvana wrote:
| I'm not sure why people always want to consider the LLM in
| isolation. Like if someone built an apparently fully sentient
| android which completely mimics the behavior of a sentient
| human in every respect, but it was composed of a bunch of
| LLMs and other systems wired together, would you say that
| it's not intelligent because the LLMs aren't fully sentient?
| marmakoide wrote:
| A stochastic parrot used to generate variations of a program,
| to do hill climbing on programs.
|
| The magician loses some of his charm when you know his trick.
| The Earth used to be the center of the universe; now it's
| circling an average star. Intelligence used to be a sacred,
| mystical thing; we might be on our way to making it less
| sacred.
| zackmorris wrote:
| FunSearch is more along the lines of how I wanted AI to evolve
| over the last 20 years or so, after reading Genetic Programming
| III by John Koza:
|
| https://www.amazon.com/Genetic-Programming-III-Darwinian-Inv...
|
| I wanted to use genetic algorithms (GAs) to come up with random
| programs run against unit tests that specify expected behavior.
| It sounds like they are doing something similar, finding
| potential solutions with neural nets (NNs)/LLMs and grading them
| against an "evaluator" (wish they added more details about how it
| works).
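|
| The unit-test version of that idea is easy to sketch (the
| "solve" entry point is hypothetical, and exec'ing candidates
| like this assumes a sandboxed environment):
|
|       def fitness(program_src, tests):
|           # Score a candidate program by how many unit tests
|           # it passes - the "expected behavior" spec above.
|           env = {}
|           try:
|               exec(program_src, env)
|           except Exception:
|               return 0
|           passed = 0
|           for args, expected in tests:
|               try:
|                   if env["solve"](*args) == expected:
|                       passed += 1
|               except Exception:
|                   pass
|           return passed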
|
| What the article didn't mention is that above a certain level of
| complexity, this method begins to pull away from human
| supervisors to create and verify programs faster than we can
| review them. When they were playing with Lisp GAs back in the
| 1990s on Beowulf clusters, they found that the technique works
| extremely well, but it's difficult to tune GA parameters to
| evolve the best solutions reliably in the fastest time. So volume
| III was about re-running those experiments multiple times on
| clusters about 1000 times faster in the 2000s, to find
| correlations between parameters and outcomes. Something similar
| was also needed to understand how tuning NN parameters affects
| outcomes, but I haven't seen a good paper on whether that
| relationship is understood any better today.
|
| Also GPU/SIMD hardware isn't good for GAs, since video cards are
| designed to run one wide algorithm instead of thousands or
| millions of narrow ones with subtle differences like on a cluster
| of CPUs. So I feel that progress on that front has been hindered
| for about 25 years, since I first started looking at programming
| FPGAs to run thousands of MIPS cores (probably ARM or RISC-V
| today). In other words, the perpetual AI winter we've been in for
| 50 years is more about poor hardware decisions and socioeconomic
| factors than technical challenges with the algorithms.
|
| So I'm certain now that some combination of these old approaches
| will deliver AGI within 10 years. I'm just frustrated with myself
| that I never got to participate, since I spent all of those years
| writing CRUD apps or otherwise hustling in the struggle to make
| rent, with nothing to show for it except a roof over my head. And
| I'm disappointed in the wealthy for hoarding their money and not
| seeing the potential of the countless millions of other people as
| smart as they are who are trapped in wage slavery. IMHO this is
| the great problem of our time (explained by the pumping gas scene
| in Fight Club), although since AGI is the last problem in
| computer science, we might even see wealth inequality defeated
| sometime in the 2030s. Either that or we become Borg!
| wolverine876 wrote:
| > not seeing the potential of the countless millions of other
| people as smart as they are who are trapped in wage slavery.
|
| Think of the change from what built Silicon Valley, that anyone
| with a good idea could work hard and change the world. Many of
| SV's wealthy began that way.
|
| But for you: They say the best time to plant a tree is 20 years
| ago; the second best time is now.
| lgessler wrote:
| Tl;dr in my words after a skim: this is a method for using LLMs
| to find new, SOTA (or at least very good?) heuristic algorithms
| for NP-hard combinatorial optimization problems. They achieve
| this not by making the LLM itself smarter but by finding a way to
| keep only its best ideas. (Haven't had a chance to read carefully
| yet so caveat lector.)
|
| Pretty impressive, but don't panic--we're not at the singularity
| yet.
| nybsjytm wrote:
| Some important context: the discovery is that a certain number in
| combinatorics is now known to be between 2.2202 and 2.756, not
| just between 2.218 and 2.756 as discovered last year. The
| improvement comes from finding particular sequences of numbers
| with special properties, not from a logic-heavy mathematical
| proof. (That doesn't mean it's unrigorous.)
|
| But it's an interesting and possibly useful method of coming up
| with examples, pretty much a genetic algorithm with LLMs.
| mmaunder wrote:
| Regardless of whether this is verifiably new knowledge, it's an
| interesting case study when you consider limiting access to AI
| based on model size or some other regulatory measure, and the
| unfair advantage that confers on corporations that can discover
| new knowledge or laws of nature and monetize them without
| sharing.
___________________________________________________________________
(page generated 2023-12-14 23:00 UTC)