[HN Gopher] AI Search: The Bitter-Er Lesson
___________________________________________________________________
AI Search: The Bitter-Er Lesson
Author : dwighttk
Score : 78 points
Date : 2024-06-14 18:47 UTC (4 hours ago)
(HTM) web link (yellow-apartment-148.notion.site)
(TXT) w3m dump (yellow-apartment-148.notion.site)
| johnthewise wrote:
| What happened to all the chatter about Q*? I remember reading
| about this train/test time trade-off back then. Does anyone have
| a good list of recent papers/blogs about this? What is holding
| this back, or is OpenAI just running some model 10x longer to
| estimate what they would get if they trained with 10x the compute?
|
| This tweet is relevant:
| https://x.com/polynoamial/status/1676971503261454340
| Kronopath wrote:
| Anything that allows AI to scale to superintelligence quicker is
| going to run into AI alignment issues, since we don't really know
| a foolproof way of controlling AI. With the AI of today, this
| isn't too bad (the worst you get is stuff like AI confidently
| making up fake facts), but with a superintelligence this could be
| disastrous.
|
| It's very irresponsible for this article to advocate and provide
| a pathway to immediate superintelligence (regardless of whether
| or not it actually works) without even discussing the question of
| how you figure out what you're searching _for_, and how you'll
| prevent that superintelligence from being evil.
| nullc wrote:
| I don't think your response is appropriate. Narrow domain
| "superintelligence" is around us everywhere-- every PID
| controller can drive a process to its target far beyond any
| human capability.
|
| The obvious way to incorporate good search is to have extremely
| fast models that are used in the search's inner loop. Such
| models would be inherently less general, and likely trained on
| the specific problem or at least domain-- just for performance's
| sake. The lesson in this article was that a tiny superspecialized
| model inside a powerful traditional search framework
| significantly outperformed a much larger, more general model.
|
| Use of explicit external search should make the optimization
| system's behavior and objective more transparent and tractable
| than just sampling the output of an auto-regressive model
| alone. If nothing else you can at least look at the branches it
| did and didn't explore. It's also a design that makes it easier
| to bolt on various kinds of regularizers: code to steer it away
| from parts of the search space you don't want it operating in.
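|
| To make that concrete, here's a minimal sketch (Python; the
| names and toy structure are mine, not the article's) of an
| explicit external search whose frontier and pruning decisions
| are plain data you can audit, with a forbidden() predicate
| bolted on as a regularizer:
|
|     import heapq, itertools
|
|     def best_first_search(start, expand, score, is_goal,
|                           forbidden, budget=1000):
|         tie = itertools.count()  # tiebreaker; heap never compares nodes
|         frontier = [(-score(start), next(tie), start)]
|         explored, pruned = [], []  # audit log: branches taken / refused
|         while frontier and budget > 0:
|             budget -= 1
|             _, _, node = heapq.heappop(frontier)  # best branch first
|             explored.append(node)
|             if is_goal(node):
|                 return node, explored, pruned
|             for child in expand(node):
|                 if forbidden(child):  # steer away from banned regions
|                     pruned.append(child)
|                 else:
|                     heapq.heappush(frontier,
|                                    (-score(child), next(tie), child))
|         return None, explored, pruned
|
| Because explored and pruned are ordinary lists, you can inspect
| exactly which branches the system did and didn't pursue.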
|
| The irony of all the AI scaremongering is that if there is ever
| some evil AI with an LLM as an important part of its reasoning
| process, it may well be evil because being evil is a big part
| of the narrative it was trained on. :D
| coldtea wrote:
| Of course "superintelligence" is just a mythical creature at
| the moment, with no known path to get there, or even a clear
| definition of what it means - usually it's some hand-waving
| about capabilities that sound magical, when IQ might very well
| be subject to diminishing returns.
| mxwsn wrote:
| The effectiveness of search goes hand-in-hand with quality of the
| value function. But today, value functions are incredibly domain-
| specific, and there is weak or no current evidence (as far as I
| know) that we can make value functions that generalize well to
| new domains. This article effectively makes a conceptual leap
| from "chess has good value functions" to "we can make good value
| functions that enable search for AI research". I mean yes, that'd
| be wonderful - a holy grail - but can we really?
|
| In the meantime, 1000x or 10000x inference time cost for running
| an LLM gets you into pretty ridiculous cost territory.
| dsjoerg wrote:
| Self-evaluation might be good enough in some domains? Then the
| AI is doing repeated self-evaluation, trying things out to find
| a response that scores higher according to its self metric.
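|
| A minimal sketch of that loop (generate and self_score are toy
| stand-ins, not a real model API): sample n candidates and keep
| the one the model itself scores highest.
|
|     import random
|
|     def self_evaluated_answer(generate, self_score, prompt, n=16):
|         # repeated self-evaluation: try candidates, keep the one
|         # that scores highest on the model's own metric
|         candidates = [generate(prompt) for _ in range(n)]
|         return max(candidates, key=lambda c: self_score(prompt, c))
|
|     # toy stand-ins so the sketch runs: guesses near 42 score best
|     gen = lambda p: random.randint(0, 100)
|     score = lambda p, c: -abs(c - 42)
|     print(self_evaluated_answer(gen, score, "pick a number"))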
| dullcrisp wrote:
| Sorry but I have to ask: what makes you think this would be a
| good idea?
| skirmish wrote:
| This will just lead to the evaluatee finding anomalies in
| the evaluator and exploiting them for maximum gain. This has
| already happened many times: an ML model controlled an
| object in a physical-world simulator, and all it learned
| was to exploit simulator bugs [1].
|
| [1] https://boingboing.net/2018/11/12/local-optima-r-us.html
| cowpig wrote:
| > The effectiveness of search goes hand-in-hand with quality of
| the value function. But today, value functions are incredibly
| domain-specific, and there is weak or no current evidence (as
| far as I know) that we can make value functions that generalize
| well to new domains.
|
| Do you believe that there will be a "general AI" breakthrough?
| I feel as though you have expressed the reason I am so
| skeptical of all these AI researchers who believe we are on the
| cusp of it (what "general AI" means exactly never seems to be
| very well-defined).
| mxwsn wrote:
| I think capitalistic pressures favor narrow superhuman AI
| over general AI. I wrote on this two years ago:
| https://argmax.blog/posts/agi-capitalism/
|
| Since I wrote about this, I would say that OpenAI's
| directional struggles are some confirmation of my hypothesis.
|
| Summary: I believe that AGI is possible but will take
| multiple unknown breakthroughs on an unknown timeline, and
| most likely requires a long-term concerted effort with much
| less immediate payoff than pursuing narrow superhuman AI,
| such that serious efforts at AGI are not incentivized much
| under capitalism.
| shrimp_emoji wrote:
| But I thought the history of capitalism is an invasion from
| the future by an artificial intelligence that must assemble
| itself entirely from its enemy's resources.
|
| NB: I agree; I think AGI will first be achieved with
| genetic engineering, which is a path of way lesser
| resistance than using silicon hardware (which is probably a
| century plus off at the minimum from being powerful enough
| to emulate a human brain).
| HarHarVeryFunny wrote:
| Yeah, Stockfish is probably evaluating many millions of
| positions when looking 40-ply ahead, even with the limited
| number of legal chess moves in a given position, and with an
| easy criterion for heavy early pruning (once a branch becomes
| losing, not much point continuing it). I can't imagine the cost
| of evaluating millions of LLM continuations, just to select the
| optimal one!
|
| Where tree search might make more sense applied to LLMs is for
| coarser-grained reasoning where the branching isn't based
| on alternate word continuations but on alternate what-if lines
| of thought, but even then it seems costs could easily become
| prohibitive, both for generation and evaluation/pruning, and
| using such a biased approach seems as much to fly in the face
| of the bitter lesson as be suggested by it.
| mxwsn wrote:
| Yes absolutely and well put - a strong property of chess is
| that next states are fast and easy to enumerate, which makes
| search particularly easy and strong, while next states are
| much slower, harder to define, and more expensive to
| enumerate with an LLM
| typon wrote:
| The cost of the LLM isn't the only or even the most
| important cost that matters. Take the example of automating
| AI research: evaluating moves effectively means inventing a
| new architecture or modifying an existing one, launching a
| training run and evaluating the new model on some suite of
| benchmarks. The ASI has to do this in a loop, gather
| feedback and update its priors - what people refer to as
| "Grad student descent". The cost of running each train-eval
| iteration during your search is going to be significantly
| more than generating the code for the next model.
| byteknight wrote:
| Isn't evaluating against different effective "experts" within
| the model effectively what MoE [1] does?
|
| > Mixture of experts (MoE) is a machine learning technique
| where multiple expert networks (learners) are used to divide
| a problem space into homogeneous regions.[1] It differs from
| ensemble techniques in that for MoE, typically only one or a
| few expert models are run for each input, whereas in ensemble
| techniques, all models are run on every input.
|
| [1] https://en.wikipedia.org/wiki/Mixture_of_experts
| HarHarVeryFunny wrote:
| No - MoE is just a way to add more parameters to a model
| without increasing the cost (number of FLOPs) of running
| it.
|
| The way MoE does this is by having multiple alternate
| parallel paths through some parts of the model, together
| with a routing component that decides which path (one only)
| to send each token through. These paths are the "experts",
| but the name doesn't really correspond to any intuitive
| notion of expert. So, rather than having 1 path with N
| parameters, you have M paths (experts) each with N
| parameters, but each token only goes through one of them,
| so the number of FLOPs is unchanged.
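|
| A toy illustration in numpy (sizes invented) of that routing:
| M expert paths hold M times the parameters, but each token is
| sent through exactly one, so per-token FLOPs stay the same as
| a single path:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, M = 8, 4  # hidden size, number of experts
|     experts = [rng.normal(size=(d, d)) for _ in range(M)]
|     router = rng.normal(size=(d, M))  # routing component
|
|     def moe_layer(tokens):  # tokens: (seq_len, d)
|         out = np.empty_like(tokens)
|         choice = (tokens @ router).argmax(axis=-1)  # one expert per token
|         for i, tok in enumerate(tokens):
|             out[i] = tok @ experts[choice[i]]  # only this path runs
|         return out
|
|     print(moe_layer(rng.normal(size=(5, d))).shape)  # (5, 8)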
|
| With tree search, whether for a game like Chess or
| potentially LLMs, you are growing a "tree" of all alternate
| possible branching continuations of the game (sentence),
| and keeping the number of these branches under control by
| evaluating each branch (= sequence of moves) to see if it
| is worth continuing, and if not discarding it ("pruning" it
| off the tree).
|
| With Chess, pruning is easy since you just need to look at
| the board position at the tip of the branch and decide if
| it's a good enough position to continue playing from
| (extending the branches). With an LLM each branch would
| represent an alternate continuation of the input prompt,
| and to decide whether to prune it or not you'd have to pass
| the input + branch to another LLM and have it decide if it
| looked promising or not (easier said than done!).
|
| So, MoE is just a way to cap the cost of running a model,
| while tree search is a way to explore alternate
| continuations and decide which ones to discard, and which
| ones to explore (evaluate) further.
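|
| As a sketch of that contrast (propose and judge are invented
| stand-ins for a generator LLM and an evaluator LLM):
|
|     def tree_search(prompt, propose, judge, depth=3, keep=2):
|         branches = [prompt]
|         for _ in range(depth):
|             # grow the tree: alternate continuations per branch
|             candidates = [b + c for b in branches
|                           for c in propose(b)]
|             # prune: keep only what the judge finds promising
|             branches = sorted(candidates, key=judge,
|                               reverse=True)[:keep]
|         return max(branches, key=judge)
|
|     # toy stand-ins: two continuations, judge counts "a"s
|     propose = lambda s: [" a", " b"]
|     judge = lambda s: s.count("a")
|     print(tree_search("start:", propose, judge))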
| CooCooCaCha wrote:
| We humans learn our own value function.
|
| If I get hungry for example, my brain will generate a plan to
| satisfy that hunger. The search process and the evaluation
| happen in the same place, my brain.
| fire_lake wrote:
| I didn't understand this piece.
|
| What do they mean by using LLMs with search? Is this simply RAG?
| roca wrote:
| They mean something like the minimax algorithm used in game
| engines.
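|
| For reference, a bare-bones version (the game-specific hooks
| are left abstract):
|
|     def minimax(state, depth, maximizing, moves, apply, evaluate):
|         ms = moves(state)
|         if depth == 0 or not ms:
|             return evaluate(state)  # score the leaf position
|         # assume each side picks its best reply, alternating
|         vals = [minimax(apply(state, m), depth - 1,
|                         not maximizing, moves, apply, evaluate)
|                 for m in ms]
|         return max(vals) if maximizing else min(vals)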
| Legend2440 wrote:
| "Search" here means trying a bunch of possibilities and seeing
| what works. Like how a sudoku solver or pathfinding algorithm
| does search, not how a search engine does.
| fire_lake wrote:
| But the domain of "AI Research" is broad and imprecise - not
| simple and discrete like chess game states. What is the type
| of each point in the search space for AI Research?
| moffkalast wrote:
| Well if we knew how to implement it, then we'd already have
| it eh?
| timfsu wrote:
| This is a fascinating idea - although I wish the definition of
| search in the LLM context was expanded a bit more. What kind of
| search capability strapped onto current-gen LLMs would give them
| superpowers?
| gwd wrote:
| I think what may be confusing is that the author is using
| "search" here in the AI sense, not in the Google sense: that
| is, having an internal simulator of possible actions and
| possible reactions, like Stockfish's chess move search (if I do
| A, it could do B C or D; if it does B, I can do E F or G, etc).
|
| So think about the restrictions current LLMs have:
|
| * They can't sit and think about an answer; they can "think out
| loud", but they have to start talking, and they can't go back
| and say, "No wait, that's wrong, let's start again."
|
| * If they're composing something, they can't really go back and
| revise what they've written
|
| * Sometimes they can look up reference material, but they can't
| actually sit and digest it; they're expected to skim it and
| then give an answer.
|
| How would you perform under those circumstances? If someone
| were to just come and ask you any question under the sun, and
| you had to just start talking, without taking any time to think
| about your answer, and without being able to say "OK wait, let
| me go back"?
|
| I don't know about you, but there's no way I would be able to
| perform anywhere _close_ to what ChatGPT 4 is able to do.
| People complain that ChatGPT 4 is a "bullshitter", but given
| its constraints that's all you or I would be in the same
| situation -- but it's already way, way better than I could ever
| be.
|
| Given its limitations, ChatGPT is _phenomenal_. So now imagine
| what it could do if it _were_ given time to just "sit and
| think"? To make a plan, to explore the possible solution space
| the same way that Stockfish does? To take notes and revise and
| research and come back and think some more, before having to
| actually answer?
|
| Reading this is honestly the first time in a while I've
| believed that some sort of "AI foom" might be possible.
| cbsmith wrote:
| [delayed]
| cgearhart wrote:
| [1] applied AlphaZero-style search with LLMs to achieve
| performance comparable to GPT-4 Turbo with a llama3-8B base
| model. However, what's missing entirely from the paper (and the
| subject article in this thread) is that tree search is
| _massively_ computationally expensive. It works well when the
| value function enables cutting out large portions of the search
| space, but the fact that the LLM version was limited to only 8
| rollouts (I think it was 800 for AlphaZero) implies to me that
| the added complexity is not yet optimized or favorable for
| LLMs.
|
| [1] https://arxiv.org/abs/2406.07394
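|
| For intuition, a compressed MCTS skeleton (the game hooks are
| left abstract; this is the shape of the loop, not the paper's
| code). The cost point is visible here: each rollout is a full
| select/expand/simulate/backpropagate pass, so 800 rollouts per
| decision costs 100x more than 8.
|
|     import math
|
|     class Node:
|         def __init__(self, state, parent=None):
|             self.state, self.parent = state, parent
|             self.children, self.visits, self.value = [], 0, 0.0
|
|     def ucb(parent, child, c=1.4):  # explore/exploit balance
|         if child.visits == 0:
|             return float("inf")
|         return (child.value / child.visits
|                 + c * math.sqrt(math.log(parent.visits) / child.visits))
|
|     def mcts(root_state, moves, apply, rollout, n_rollouts=8):
|         root = Node(root_state)
|         for _ in range(n_rollouts):
|             node = root
|             # select: descend while the node is fully expanded
|             while node.children and \
|                   len(node.children) == len(moves(node.state)):
|                 node = max(node.children, key=lambda ch: ucb(node, ch))
|             # expand: add one untried move, if any remain
|             untried = moves(node.state)[len(node.children):]
|             if untried:
|                 node.children.append(
|                     Node(apply(node.state, untried[0]), node))
|                 node = node.children[-1]
|             # simulate, then backpropagate the rollout value
|             v = rollout(node.state)
|             while node:
|                 node.visits += 1
|                 node.value += v
|                 node = node.parent
|         best = max(root.children, key=lambda ch: ch.visits,
|                    default=root)
|         return best.state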
| 1024core wrote:
| The problem with adding "search" to a model is that the model has
| already seen everything to be "search"ed in its training data.
| There is nothing left.
|
| Imagine if Leela (author's example) had been trained on every
| chess board position out there (I know it's mathematically
| impossible, but bear with me for a second). If Leela had been
| trained on every board position, it may have whupped Stockfish.
| So, adding "search" to Leela would have been pointless, since it
| would have seen every board position out there.
|
| Today's LLMs are trained on every word ever written on the 'net,
| every word ever put down in a book, every word uttered in a video
| on Youtube or a podcast.
| yousif_123123 wrote:
| Still, just as when you have read 10 textbooks, having access
| to the source material while answering a question can help you
| with your answer.
| groby_b wrote:
| You're omitting the somewhat relevant part of recall ability. I
| can train a 50 parameter model on the entire internet, and
| while it's seen it all, it won't be able to recall it. (You can
| likely do the same thing with a 500B model for similar results,
| though it's getting somewhat closer to decent recall)
|
| The whole point of deep learning is that the model learns to
| generalize. It's not to have a perfect storage engine with a
| human language query frontend.
| sebastos wrote:
| Fully agree, although it's interesting to consider the
| perspective that the entire LLM hype cycle is largely built
| around the question "what if we punted on actual thinking and
| instead just tried to memorize everything and then provide a
| human language query frontend? Is that still useful?"
| Arguably it is (sorta), and that's what is driving this
| latest zeitgeist. Compute had quietly scaled in the
| background while we were banging our heads against real
| thinking, until one day we looked up and we still didn't have
| a thinking machine, but it was now approximately possible to
| just do the stupid thing and store "all the text on the
| internet" in a lookup table, where the keys are prompts.
| That's... the opposite of thinking, really, but still
| sometimes useful!
|
| Although to be clear I think actual reasoning systems are
| what we should be trying to create, and this LLM stuff seems
| like a cul-de-sac on that journey.
| salamo wrote:
| If the game was small enough to memorize, like tic tac toe, you
| could definitely train a neural net to 100% accuracy. I've done
| it, it works.
|
| The problem is that for most of the interesting problems out
| there, it isn't possible to see every possibility let alone
| memorize it.
| groby_b wrote:
| While I respect the power of intuition - this may well be a great
| path - it's worth keeping in mind that this is currently just
| that. A hunch. Leela got crushed by AI-directed search, so what
| if we could wave a wand and hand all AIs search. Somehow.
| Magically. Which would then somehow magically trounce current
| LLMs at domain-specific tasks.
|
| There's a kernel of truth in there. See the papers on better
| results via Monte Carlo tree search (e.g. [1]). See mixture-of-
| LoRA/LoRA-swarm approaches. (I swear there's a startup using the
| approach of tons of domain-specific LoRAs, but my brain's not
| yielding the name)
|
| Augmenting LLM capabilities via _some_ sort of cheaper and more
| reliable exploration is likely a valid path. It's not GPT-8 next
| year, though.
|
| [1] https://arxiv.org/pdf/2309.03224
| hartator wrote:
| Isn't the "search" space infinite, though, with no way to
| quantify "success"?
|
| You can't just give LLMs infinite compute time and expect them to
| find answers to problems like "cure cancer". Even chess, where
| moves seem finite and success quantifiable, is effectively an
| infinite problem, and the best engines take "shortcuts" in their
| "thinking". It's impossible to do for real-world problems.
| cpill wrote:
| The recent episode of Machine Learning Street Talk on control
| theory for LLMs sounds like it's thinking in this direction.
| Say you have 100k agents searching through research papers, and
| then trying every pairwise combination of them, 100k^2, to see
| if there is any synergy of ideas, and you keep doing this for
| all the successful combos... some of these might give the
| researchers some good ideas to try out. I can see it happening,
| if they can fine-tune a model that becomes good at idea synergy.
| But then again, real creativity is hard.
| salamo wrote:
| Search is almost certainly necessary, and I think the trillion-
| dollar cluster maximalists probably need to talk to people who
| created superhuman chess engines that can now run on smartphones.
| Because one possibility is that someone figures out how to beat
| your trillion-dollar cluster with a million-dollar cluster, or
| 500k million-dollar clusters.
|
| On chess specifically, my takeaway is that the branching factor
| in chess never gets so high that a breadth-first approach is
| unworkable. The median branching factor (i.e. the number of legal
| moves) maxes out at around 40 but generally stays near 30. The
| most moves I have ever found in any position from a real game was
| 147, but at that point almost every move is checkmate anyways.
|
| Creating superhuman go engines was a challenge for a long time
| because the branching factor is so much larger than chess.
|
| Since MCTS is less thorough, it makes sense that a full search
| could find a weakness and exploit it. To me, the question is
| whether we can apply breadth-first approaches to larger games and
| situations, and I think the answer is clearly no. Unlike chess,
| the branching factor of real-world situations is orders of
| magnitude larger.
|
| But also unlike chess, which is highly chaotic (small decisions
| matter a lot for the future state), in most real-world situations
| small decisions _don't matter_. If you're traveling from NYC to
| LA, it matters a lot whether you drive or fly or walk. It mostly
| doesn't matter if you walk out the door starting with your left
| foot or your right. It mostly doesn't matter if you blink now or
| in two seconds.
| cpill wrote:
| I think the branching factor for LLMs is around 50k - the
| number of possible next tokens.
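|
| Which makes the arithmetic brutal (vocab size of ~50k assumed):
|
|     # exhaustive 10-step lookahead, chess vs. token-level LLM
|     print(f"{30 ** 10:.1e}")      # ~5.9e+14 chess positions
|     print(f"{50_000 ** 10:.1e}")  # ~9.8e+46 token sequences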
| optimalsolver wrote:
| Charlie Steiner pointed this out 5 years ago on Less Wrong:
|
| >If you train GPT-3 on a bunch of medical textbooks and prompt it
| to tell you a cure for Alzheimer's, it won't tell you a cure, it
| will tell you what humans have said about curing Alzheimer's ...
| It would just tell you a plausible story about a situation
| related to the prompt about curing Alzheimer's, based on its
| training data. Rather than a logical Oracle, this image-
| captioning-esque scheme would be an intuitive Oracle, telling you
| things that make sense based on associations already present
| within the training set.
|
| >What am I driving at here, by pointing out that curing
| Alzheimer's is hard? It's that the designs above are missing
| something, and what they're missing is search. I'm not saying
| that getting a neural net to directly output your cure for
| Alzheimer's is impossible. But it seems like it requires there to
| already be a "cure for Alzheimer's" dimension in your learned
| model. The more realistic way to find the cure for Alzheimer's,
| if you don't already know it, is going to involve lots of logical
| steps one after another, slowly moving through a logical space,
| narrowing down the possibilities more and more, and eventually
| finding something that fits the bill. In other words, solving a
| search problem.
|
| >So if your AI can tell you how to cure Alzheimer's, I think
| either it's explicitly doing a search for how to cure Alzheimer's
| (or worlds that match your verbal prompt the best, or whatever),
| or it has some internal state that implicitly performs a search.
|
| https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-super...
___________________________________________________________________
(page generated 2024-06-14 23:00 UTC)