[HN Gopher] AI Search: The Bitter-Er Lesson
       ___________________________________________________________________
        
       AI Search: The Bitter-Er Lesson
        
       Author : dwighttk
       Score  : 78 points
       Date   : 2024-06-14 18:47 UTC (4 hours ago)
        
 (HTM) web link (yellow-apartment-148.notion.site)
 (TXT) w3m dump (yellow-apartment-148.notion.site)
        
       | johnthewise wrote:
        | What happened to all the chatter about Q*? I remember reading
        | about this train/test time trade-off back then; does anyone have
        | a good list of recent papers/blogs about this? What is holding
        | this back, or is OpenAI just running some model 10x longer to
        | estimate what they would get if they trained with 10x the
        | compute?
       | 
       | This tweet is relevant:
       | https://x.com/polynoamial/status/1676971503261454340
        
       | Kronopath wrote:
        | Anything that allows AI to scale to superintelligence quicker is
       | going to run into AI alignment issues, since we don't really know
       | a foolproof way of controlling AI. With the AI of today, this
       | isn't too bad (the worst you get is stuff like AI confidently
       | making up fake facts), but with a superintelligence this could be
       | disastrous.
       | 
       | It's very irresponsible for this article to advocate and provide
       | a pathway to immediate superintelligence (regardless of whether
       | or not it actually works) without even discussing the question of
        | how you figure out what you're searching _for_, and how you'll
       | prevent that superintelligence from being evil.
        
         | nullc wrote:
         | I don't think your response is appropriate. Narrow domain
         | "superintelligence" is around us everywhere-- every PID
         | controller can drive a process to its target far beyond any
         | human capability.
         | 
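          | To make the PID point concrete, a minimal sketch (the gains
          | here are illustrative, not tuned): the whole controller is a
          | dozen lines, yet it holds a setpoint far more tightly than a
          | human at the dials could.
          | 
          |     class PID:
          |         def __init__(self, kp=1.0, ki=0.1, kd=0.05):
          |             # illustrative gains, not tuned for any plant
          |             self.kp, self.ki, self.kd = kp, ki, kd
          |             self.integral = 0.0
          |             self.prev_error = None
          | 
          |         def update(self, setpoint, measured, dt):
          |             # P, I, and D terms on the tracking error
          |             error = setpoint - measured
          |             self.integral += error * dt
          |             deriv = (0.0 if self.prev_error is None
          |                      else (error - self.prev_error) / dt)
          |             self.prev_error = error
          |             return (self.kp * error + self.ki * self.integral
          |                     + self.kd * deriv)
          | 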
          | The obvious way to incorporate good search is to have extremely
          | fast models that are used in the search's inner loop. Such
          | models would be inherently less general, and likely trained on
          | the specific problem or at least the domain-- just for
          | performance's sake. The lesson in this article was that a tiny,
          | superspecialized model inside a powerful traditional search
          | framework significantly outperformed a much larger, more
          | general model.
         | 
          | Use of explicit external search should make the optimization
          | system's behavior and objective more transparent and tractable
          | than just sampling the output of an auto-regressive model
          | alone. If nothing else, you can at least look at the branches
          | it did and didn't explore. It's also a design that makes it
          | easier to bolt in various kinds of regularizers-- code to
          | steer it away from parts of the search space you don't want it
          | operating in.
         | 
          | The irony of all the AI scaremongering is that if there is
          | ever some evil AI with an LLM as an important part of its
          | reasoning process, it may well be evil because being evil is a
          | big part of the narrative it was trained on. :D
        
         | coldtea wrote:
          | Of course "superintelligence" is just a mythical creature at
          | the moment, with no known path to get there, or even a precise
          | definition of what it would mean - usually it's some hand-
          | waving about capabilities that sound magical, when IQ might
          | very well be subject to diminishing returns.
        
       | mxwsn wrote:
       | The effectiveness of search goes hand-in-hand with quality of the
       | value function. But today, value functions are incredibly domain-
       | specific, and there is weak or no current evidence (as far as I
       | know) that we can make value functions that generalize well to
       | new domains. This article effectively makes a conceptual leap
       | from "chess has good value functions" to "we can make good value
       | functions that enable search for AI research". I mean yes, that'd
       | be wonderful - a holy grail - but can we really?
       | 
        | In the meantime, 1000x or 10000x inference-time cost for running
        | an LLM gets you into pretty ridiculous cost territory.
        
         | dsjoerg wrote:
          | Self-evaluation might be good enough in some domains? Then the
          | AI is doing repeated self-evaluation, trying things out to find
          | a response that scores higher according to its own metric.
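          | 
          | As a sketch of what that loop could look like (generate() and
          | self_score() are hypothetical wrappers around the same
          | underlying model):
          | 
          |     def best_of_n(prompt, n=16):
          |         # Repeated self-evaluation: sample candidates, keep the
          |         # one the model itself scores highest. generate() and
          |         # self_score() are hypothetical model wrappers.
          |         best, best_score = None, float("-inf")
          |         for _ in range(n):
          |             candidate = generate(prompt)
          |             score = self_score(prompt, candidate)
          |             if score > best_score:
          |                 best, best_score = candidate, score
          |         return best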
        
           | dullcrisp wrote:
           | Sorry but I have to ask: what makes you think this would be a
           | good idea?
        
             | skirmish wrote:
              | This will just lead to the evaluatee finding anomalies in
              | the evaluator and exploiting them for maximum gain. It has
              | happened many times already that an ML model controlled an
              | object in a physical-world simulator, and all it learned
              | was to exploit simulator bugs. [1]
              | 
              | [1] https://boingboing.net/2018/11/12/local-optima-r-us.html
        
         | cowpig wrote:
         | > The effectiveness of search goes hand-in-hand with quality of
         | the value function. But today, value functions are incredibly
         | domain-specific, and there is weak or no current evidence (as
         | far as I know) that we can make value functions that generalize
         | well to new domains.
         | 
         | Do you believe that there will be a "general AI" breakthrough?
         | I feel as though you have expressed the reason I am so
         | skeptical of all these AI researchers who believe we are on the
          | cusp of it (what "general AI" means exactly never seems to be
          | very well-defined).
        
           | mxwsn wrote:
           | I think capitalistic pressures favor narrow superhuman AI
           | over general AI. I wrote on this two years ago:
           | https://argmax.blog/posts/agi-capitalism/
           | 
           | Since I wrote about this, I would say that OpenAI's
           | directional struggles are some confirmation of my hypothesis.
           | 
            | summary: I believe that AGI is possible but will take
            | multiple unknown breakthroughs on an unknown timeline, and
            | most likely requires long-term concerted effort with much
            | less immediate payoff than pursuing narrow superhuman AI,
            | such that serious efforts at AGI are not incentivized much
            | under capitalism.
        
             | shrimp_emoji wrote:
             | But I thought the history of capitalism is an invasion from
             | the future by an artificial intelligence that must assemble
             | itself entirely from its enemy's resources.
             | 
              | NB: I agree; I think AGI will first be achieved with
              | genetic engineering, which is a path of far less
              | resistance than using silicon hardware (which is probably
              | a century-plus away, at minimum, from being powerful
              | enough to emulate a human brain).
        
         | HarHarVeryFunny wrote:
          | Yeah, Stockfish is probably evaluating many millions of
          | positions when looking 40-ply ahead, even with the limited
          | number of legal chess moves in a given position, and with an
          | easy criterion for heavy early pruning (once a branch becomes
          | losing, there's not much point continuing it). I can't imagine
          | the cost of evaluating millions of LLM continuations, just to
          | select the optimal one!
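          | 
          | (For anyone unfamiliar, the pruning logic is roughly this -- a
          | sketch of minimax with alpha-beta cutoffs, where legal_moves(),
          | apply(), and evaluate() are hypothetical game helpers:)
          | 
          |     def alphabeta(pos, depth, alpha, beta, maximizing):
          |         # legal_moves/apply/evaluate: hypothetical helpers
          |         if depth == 0:
          |             return evaluate(pos)  # static eval at the leaf
          |         best = float("-inf") if maximizing else float("inf")
          |         for move in legal_moves(pos):
          |             val = alphabeta(apply(pos, move), depth - 1,
          |                             alpha, beta, not maximizing)
          |             if maximizing:
          |                 best = max(best, val)
          |                 alpha = max(alpha, best)
          |             else:
          |                 best = min(best, val)
          |                 beta = min(beta, best)
          |             if alpha >= beta:
          |                 break  # prune: this branch can't affect the result
          |         return best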
         | 
          | Where tree search might make more sense applied to LLMs is for
          | coarser-grained reasoning, where the branching isn't based on
          | alternate word continuations but on alternate what-if lines of
          | thought. But even then it seems costs could easily become
          | prohibitive, both for generation and evaluation/pruning, and
          | using such a biased approach seems as much to fly in the face
          | of the bitter lesson as to be suggested by it.
        
           | mxwsn wrote:
            | Yes, absolutely, and well put - a strong property of chess
            | is that next states are fast and easy to enumerate, which
            | makes search particularly easy and strong, while with an LLM
            | next states are much slower, harder to define, and more
            | expensive to enumerate.
        
             | typon wrote:
             | The cost of the LLM isn't the only or even the most
             | important cost that matters. Take the example of automating
             | AI research: evaluating moves effectively means inventing a
             | new architecture or modifying an existing one, launching a
             | training run and evaluating the new model on some suite of
             | benchmarks. The ASI has to do this in a loop, gather
             | feedback and update its priors - what people refer to as
             | "Grad student descent". The cost of running each train-eval
             | iteration during your search is going to be significantly
             | more than generating the code for the next model.
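              | 
              | As a sketch of that loop (propose_architecture(), train(),
              | and evaluate() are all hypothetical placeholders -- and
              | train() hides an entire training run):
              | 
              |     def grad_student_descent(budget, benchmarks):
              |         # hypothetical helpers throughout; each iteration
              |         # is a full propose -> train -> eval cycle
              |         history = []
              |         for _ in range(budget):
              |             # generate/modify a design given past results
              |             arch = propose_architecture(history)
              |             model = train(arch)  # by far the dominant cost
              |             score = evaluate(model, benchmarks)
              |             history.append((arch, score))  # update priors
              |         return max(history, key=lambda h: h[1])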
        
           | byteknight wrote:
            | Isn't evaluating against different "experts" within the
            | model effectively what MoE [1] does?
           | 
           | > Mixture of experts (MoE) is a machine learning technique
           | where multiple expert networks (learners) are used to divide
           | a problem space into homogeneous regions.[1] It differs from
           | ensemble techniques in that for MoE, typically only one or a
           | few expert models are run for each input, whereas in ensemble
           | techniques, all models are run on every input.
           | 
           | [1] https://en.wikipedia.org/wiki/Mixture_of_experts
        
             | HarHarVeryFunny wrote:
             | No - MoE is just a way to add more parameters to a model
             | without increasing the cost (number of FLOPs) of running
             | it.
             | 
             | The way MoE does this is by having multiple alternate
             | parallel paths through some parts of the model, together
             | with a routing component that decides which path (one only)
             | to send each token through. These paths are the "experts",
             | but the name doesn't really correspond to any intuitive
             | notion of expert. So, rather than having 1 path with N
             | parameters, you have M paths (experts) each with N
             | parameters, but each token only goes through one of them,
             | so number of FLOPs is unchanged.
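              | 
              | (A toy sketch of top-1 routing, with made-up sizes -- M
              | expert weight matrices, but each token multiplies through
              | exactly one, so per-token FLOPs match a single expert:)
              | 
              |     import numpy as np
              | 
              |     M, d = 4, 8  # made-up: 4 experts, model width 8
              |     experts = [np.random.randn(d, d) for _ in range(M)]
              |     router = np.random.randn(d, M)
              | 
              |     def moe_layer(tokens):  # tokens: (n, d)
              |         out = np.empty_like(tokens)
              |         # router scores each token, picks one expert
              |         choice = (tokens @ router).argmax(axis=1)
              |         for i, tok in enumerate(tokens):
              |             # only the chosen expert's weights are used
              |             out[i] = tok @ experts[choice[i]]
              |         return out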
             | 
             | With tree search, whether for a game like Chess or
             | potentially LLMs, you are growing a "tree" of all alternate
             | possible branching continuations of the game (sentence),
             | and keeping the number of these branches under control by
             | evaluating each branch (= sequence of moves) to see if it
             | is worth continuing, and if not discarding it ("pruning" it
             | off the tree).
             | 
             | With Chess, pruning is easy since you just need to look at
             | the board position at the tip of the branch and decide if
             | it's a good enough position to continue playing from
             | (extending the branches). With an LLM each branch would
             | represent an alternate continuation of the input prompt,
             | and to decide whether to prune it or not you'd have to pass
             | the input + branch to another LLM and have it decide if it
             | looked promising or not (easier said than done!).
             | 
             | So, MoE is just a way to cap the cost of running a model,
             | while tree search is a way to explore alternate
             | continuations and decide which ones to discard, and which
             | ones to explore (evaluate) further.
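              | 
              | In pseudocode, the tree-search half might look like this
              | beam-style sketch, where continue_k() samples k
              | continuations and looks_promising() is the hypothetical
              | (and "easier said than done") evaluator LLM:
              | 
              |     def llm_tree_search(prompt, depth=3, k=4, keep=2):
              |         # continue_k / looks_promising: hypothetical LLM calls
              |         branches = [""]
              |         for _ in range(depth):
              |             # grow: every surviving branch gets k children
              |             grown = [b + c for b in branches
              |                      for c in continue_k(prompt + b, k)]
              |             # prune: keep branches the evaluator rates highest
              |             grown.sort(key=lambda b: looks_promising(prompt, b),
              |                        reverse=True)
              |             branches = grown[:keep]
              |         return branches[0]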
        
         | CooCooCaCha wrote:
         | We humans learn our own value function.
         | 
         | If I get hungry for example, my brain will generate a plan to
         | satisfy that hunger. The search process and the evaluation
         | happen in the same place, my brain.
        
       | fire_lake wrote:
       | I didn't understand this piece.
       | 
       | What do they mean by using LLMs with search? Is this simply RAG?
        
         | roca wrote:
          | They mean something like the minimax algorithm used in game
         | engines.
        
         | Legend2440 wrote:
         | "Search" here means trying a bunch of possibilities and seeing
         | what works. Like how a sudoku solver or pathfinding algorithm
         | does search, not how a search engine does.
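          | 
          | Roughly this backtracking skeleton, in other words (solved(),
          | candidates(), and apply() being hypothetical problem-specific
          | helpers):
          | 
          |     def search(state):
          |         # solved/candidates/apply: hypothetical helpers
          |         if solved(state):
          |             return state
          |         for move in candidates(state):
          |             result = search(apply(state, move))
          |             if result is not None:
          |                 return result
          |         return None  # dead end: back up, try another move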
        
           | fire_lake wrote:
           | But the domain of "AI Research" is broad and imprecise - not
           | simple and discrete like chess game states. What is the type
           | of each point in the search space for AI Research?
        
             | moffkalast wrote:
             | Well if we knew how to implement it, then we'd already have
             | it eh?
        
       | timfsu wrote:
       | This is a fascinating idea - although I wish the definition of
       | search in the LLM context was expanded a bit more. What kind of
       | search capability strapped onto current-gen LLMs would give them
       | superpowers?
        
         | gwd wrote:
         | I think what may be confusing is that the author is using
         | "search" here in the AI sense, not in the Google sense: that
         | is, having an internal simulator of possible actions and
         | possible reactions, like Stockfish's chess move search (if I do
         | A, it could do B C or D; if it does B, I can do E F or G, etc).
         | 
         | So think about the restrictions current LLMs have:
         | 
         | * They can't sit and think about an answer; they can "think out
         | loud", but they have to start talking, and they can't go back
         | and say, "No wait, that's wrong, let's start again."
         | 
         | * If they're composing something, they can't really go back and
         | revise what they've written
         | 
         | * Sometimes they can look up reference material, but they can't
         | actually sit and digest it; they're expected to skim it and
         | then give an answer.
         | 
         | How would you perform under those circumstances? If someone
         | were to just come and ask you any question under the sun, and
         | you had to just start talking, without taking any time to think
         | about your answer, and without being able to say "OK wait, let
         | me go back"?
         | 
          | I don't know about you, but there's no way I would be able to
          | perform anywhere _close_ to what ChatGPT 4 is able to do.
          | People complain that ChatGPT 4 is a "bullshitter", but given
          | its constraints that's all you or I would be in the same
          | situation -- and it's already way, way better than I could
          | ever be.
         | 
         | Given its limitations, ChatGPT is _phenomenal_. So now imagine
          | what it could do if it _were_ given time to just "sit and
         | think"? To make a plan, to explore the possible solution space
         | the same way that Stockfish does? To take notes and revise and
         | research and come back and think some more, before having to
         | actually answer?
         | 
         | Reading this is honestly the first time in a while I've
         | believed that some sort of "AI foom" might be possible.
        
         | cgearhart wrote:
          | [1] applied AlphaZero-style search with LLMs to achieve
         | performance comparable to GPT-4 Turbo with a llama3-8B base
         | model. However, what's missing entirely from the paper (and the
         | subject article in this thread) is that tree search is
         | _massively_ computationally expensive. It works well when the
         | value function enables cutting out large portions of the search
         | space, but the fact that the LLM version was limited to only 8
         | rollouts (I think it was 800 for AlphaZero) implies to me that
         | the added complexity is not yet optimized or favorable for
         | LLMs.
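          | 
          | For a sense of where the cost goes, each rollout is roughly
          | this select/expand/evaluate/backpropagate skeleton (children()
          | and value() are hypothetical -- with an LLM, both amount to
          | full forward passes, which is why rollout counts stay tiny):
          | 
          |     import math, random
          | 
          |     class Node:
          |         def __init__(self, state, parent=None):
          |             self.state, self.parent = state, parent
          |             self.kids, self.visits, self.total = [], 0, 0.0
          | 
          |     def ucb(node, c=1.4):
          |         if node.visits == 0:
          |             return float("inf")
          |         return (node.total / node.visits +
          |                 c * math.sqrt(math.log(node.parent.visits)
          |                               / node.visits))
          | 
          |     def rollout(root):
          |         # children() / value(): hypothetical expensive calls
          |         node = root
          |         while node.kids:                    # 1. select
          |             node = max(node.kids, key=ucb)
          |         for s in children(node.state):      # 2. expand
          |             node.kids.append(Node(s, node))
          |         leaf = random.choice(node.kids) if node.kids else node
          |         reward = value(leaf.state)          # 3. evaluate
          |         while leaf:                         # 4. backpropagate
          |             leaf.visits += 1
          |             leaf.total += reward
          |             leaf = leaf.parent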
         | 
         | [1] https://arxiv.org/abs/2406.07394
        
       | 1024core wrote:
       | The problem with adding "search" to a model is that the model has
       | already seen everything to be "search"ed in its training data.
       | There is nothing left.
       | 
       | Imagine if Leela (author's example) had been trained on every
       | chess board position out there (I know it's mathematically
       | impossible, but bear with me for a second). If Leela had been
       | trained on every board position, it may have whupped Stockfish.
       | So, adding "search" to Leela would have been pointless, since it
       | would have seen every board position out there.
       | 
       | Today's LLMs are trained on every word ever written on the 'net,
       | every word ever put down in a book, every word uttered in a video
       | on Youtube or a podcast.
        
         | yousif_123123 wrote:
          | Still, it's like having read 10 textbooks: if you are
          | answering a question and have access to the source material,
          | it can help you with your answer.
        
         | groby_b wrote:
         | You're omitting the somewhat relevant part of recall ability. I
         | can train a 50 parameter model on the entire internet, and
         | while it's seen it all, it won't be able to recall it. (You can
         | likely do the same thing with a 500B model for similar results,
         | though it's getting somewhat closer to decent recall)
         | 
         | The whole point of deep learning is that the model learns to
         | generalize. It's not to have a perfect storage engine with a
         | human language query frontend.
        
           | sebastos wrote:
           | Fully agree, although it's interesting to consider the
           | perspective that the entire LLM hype cycle is largely built
           | around the question "what if we punted on actual thinking and
           | instead just tried to memorize everything and then provide a
           | human language query frontend? Is that still useful?"
           | Arguably it is (sorta), and that's what is driving this
           | latest zeitgeist. Compute had quietly scaled in the
           | background while we were banging our heads against real
           | thinking, until one day we looked up and we still didn't have
           | a thinking machine, but it was now approximately possible to
           | just do the stupid thing and store "all the text on the
           | internet" in a lookup table, where the keys are prompts.
           | That's... the opposite of thinking, really, but still
           | sometimes useful!
           | 
           | Although to be clear I think actual reasoning systems are
           | what we should be trying to create, and this LLM stuff seems
           | like a cul-de-sac on that journey.
        
         | salamo wrote:
         | If the game was small enough to memorize, like tic tac toe, you
         | could definitely train a neural net to 100% accuracy. I've done
         | it, it works.
         | 
         | The problem is that for most of the interesting problems out
         | there, it isn't possible to see every possibility let alone
         | memorize it.
        
       | groby_b wrote:
        | While I respect the power of intuition - this may well be a great
        | path - it's worth keeping in mind that this is currently just
        | that: a hunch. Leela got crushed due to AI-directed search, so
        | what if we can wave a wand and hand all AIs search. Somehow.
        | Magically. Which will then somehow magically trounce current
        | LLMs at domain-specific tasks.
       | 
       | There's a kernel of truth in there. See the papers on better
       | results via monte carlo search trees (e.g. [1]). See mixture-of-
       | LoRA/LoRA-swarm approaches. (I swear there's a startup using the
       | approach of tons of domain-specific LoRAs, but my brain's not
       | yielding the name)
       | 
       | Augmenting LLM capabilities via _some_ sort of cheaper and more
       | reliable exploration is likely a valid path. It's not GPT-8 next
       | year, though.
       | 
       | [1] https://arxiv.org/pdf/2309.03224
        
       | hartator wrote:
        | Isn't the "search" space infinite, though, and "success"
        | impossible to quantify?
        | 
        | You can't just give LLMs infinite compute time and expect them
        | to find answers to things like "cure cancer". Even chess, where
        | moves seem finite and success quantifiable, is effectively an
        | infinite problem, and the best engines take "shortcuts" in their
        | "thinking". It's impossible to do for real-world problems.
        
         | cpill wrote:
          | The recent episode of Machine Learning Street Talk on control
          | theory for LLMs sounds like it's thinking in this direction.
          | Say you have 100k agents searching through research papers,
          | and then trying every pairwise combination of them, 100k^2, to
          | see if there is any synergy of ideas, and you keep doing this
          | for all the successful combos... some of these might give the
          | researchers some good ideas to try out. I can see it
          | happening, if they can fine-tune a model that becomes good at
          | idea synergy. But then again, real creativity is hard.
        
       | salamo wrote:
        | Search is almost certainly necessary, and I think the trillion-
        | dollar-cluster maximalists probably need to talk to the people
        | who created superhuman chess engines that can now run on
        | smartphones. Because one possibility is that someone figures out
        | how to beat your trillion-dollar cluster with a million-dollar
        | cluster, or 500k million-dollar clusters.
       | 
       | On chess specifically, my takeaway is that the branching factor
       | in chess never gets so high that a breadth-first approach is
       | unworkable. The median branching factor (i.e. the number of legal
       | moves) maxes out at around 40 but generally stays near 30. The
       | most moves I have ever found in any position from a real game was
       | 147, but at that point almost every move is checkmate anyways.
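        | 
        | Numbers like these are easy to check with python-chess, for
        | anyone curious (game_moves below is a hypothetical list of moves
        | from a real game):
        | 
        |     import chess  # pip install python-chess
        | 
        |     board = chess.Board()  # starting position
        |     print(board.legal_moves.count())  # 20 legal first moves
        | 
        |     branching = []
        |     for move in game_moves:  # hypothetical move list
        |         # legal-move count == branching factor at this position
        |         branching.append(board.legal_moves.count())
        |         board.push(move)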
       | 
        | Creating superhuman go engines was a challenge for a long time
        | because the branching factor is so much larger than in chess.
       | 
       | Since MCTS is less thorough, it makes sense that a full search
       | could find a weakness and exploit it. To me, the question is
       | whether we can apply breadth-first approaches to larger games and
       | situations, and I think the answer is clearly no. Unlike chess,
       | the branching factor of real-world situations is orders of
       | magnitude larger.
       | 
        | But also unlike chess, which is highly chaotic (small decisions
        | matter a lot for the future state), in the real world most small
        | decisions _don't matter_. If you're traveling from NYC to LA, it
        | matters a lot whether you drive or fly or walk. It mostly
        | doesn't matter if you walk out the door starting with your left
        | foot or your right. It mostly doesn't matter if you blink now or
        | in two seconds.
        
         | cpill wrote:
          | I think the branching factor for LLMs is around 50k - the
          | number of possible next tokens (the vocabulary size).
        
       | optimalsolver wrote:
       | Charlie Steiner pointed this out 5 years ago on Less Wrong:
       | 
       | >If you train GPT-3 on a bunch of medical textbooks and prompt it
       | to tell you a cure for Alzheimer's, it won't tell you a cure, it
       | will tell you what humans have said about curing Alzheimer's ...
       | It would just tell you a plausible story about a situation
       | related to the prompt about curing Alzheimer's, based on its
       | training data. Rather than a logical Oracle, this image-
       | captioning-esque scheme would be an intuitive Oracle, telling you
       | things that make sense based on associations already present
       | within the training set.
       | 
       | >What am I driving at here, by pointing out that curing
       | Alzheimer's is hard? It's that the designs above are missing
       | something, and what they're missing is search. I'm not saying
       | that getting a neural net to directly output your cure for
       | Alzheimer's is impossible. But it seems like it requires there to
       | already be a "cure for Alzheimer's" dimension in your learned
       | model. The more realistic way to find the cure for Alzheimer's,
       | if you don't already know it, is going to involve lots of logical
       | steps one after another, slowly moving through a logical space,
       | narrowing down the possibilities more and more, and eventually
       | finding something that fits the bill. In other words, solving a
       | search problem.
       | 
       | >So if your AI can tell you how to cure Alzheimer's, I think
       | either it's explicitly doing a search for how to cure Alzheimer's
       | (or worlds that match your verbal prompt the best, or whatever),
       | or it has some internal state that implicitly performs a search.
       | 
       | https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-super...
        
       ___________________________________________________________________
       (page generated 2024-06-14 23:00 UTC)