[HN Gopher] Open source inference time compute example from HuggingFace
       ___________________________________________________________________
        
       Open source inference time compute example from HuggingFace
        
       Author : burningion
       Score  : 71 points
       Date   : 2024-12-16 20:35 UTC (4 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mentalgear wrote:
        | Happy to see the "inference time compute" term being used
        | primarily nowadays - it's a much more precise and appropriate
        | term than the unwieldy "test-time compute" that OpenAI used to
        | call it when they thought they "invented" scaling inference
        | time.
        
       | srush wrote:
       | Full blog is here:
       | https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
       | 
       | Happy to answer any questions about these methods.
        
         | OakNinja wrote:
         | Excellent and interesting post!
         | 
          | Minor gripe - the best-of-n / beam search illustration is not
          | compatible with red-green color blindness. I literally cannot
          | see the difference between the Rejected and the Selected dots
          | even when I zoom in.
        
           | srush wrote:
           | Thanks for the feedback, and not minor. Sorry about that.
        
         | dinp wrote:
          | Great work! When I use models like o1, they work better than
          | Sonnet and 4o for tasks that require some thinking, but the
          | output is often very verbose. Is it possible to get the best
          | of both worlds - the thinking takes place, resulting in
          | better performance, but the output is as straightforward to
          | work with as Sonnet's and 4o's? Did you observe similar
          | behaviour with the 1B and 3B models? How does the model
          | behaviour change when used for normal tasks that don't
          | require thinking?
          | 
          | Also, how well do these models work for extracting structured
          | output? E.g., perform OCR on some handwritten text with math,
          | convert it to HTML, format the formulas correctly, etc.
          | Single-shot prompting doesn't work well with such problems,
          | but splitting the steps into consecutive API calls works
          | well.
        
           | srush wrote:
            | That's a good point. We don't see that in our experiments
            | because it's all in the math domain. However, for OpenAI
            | it's plausible that the training for o1 conflicts with
            | standard instruction training, leading to a less human-
            | preferred output style.
        
           | dimitry12 wrote:
            | In this paper and HF's replication, the model used to
            | produce solutions to MATH problems is off-the-shelf. It is
            | induced to produce step-by-step CoT-style solutions by
            | few-shot ICL prompts or by instructions.
            | 
            | Yes, the search process (beam search or best-of-N) does
            | produce verbose traces, because there is branching involved
            | when sampling "thoughts" from the base model. These
            | branched traces (including incomplete "abandoned" branches)
            | can be shown to the user or hidden, if the approach is
            | deployed as-is.
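            | 
            | Roughly, a sketch of that sampling step (the prompt is
            | illustrative; the 1B model id is the one HF's repo uses):
            | 
            |     from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |     model_id = "meta-llama/Llama-3.2-1B-Instruct"
            |     tok = AutoTokenizer.from_pretrained(model_id)
            |     model = AutoModelForCausalLM.from_pretrained(model_id)
            | 
            |     prompt = ("Solve step by step, one step per line.\n"
            |               "Problem: ...\nSolution:\n")
            |     inputs = tok(prompt, return_tensors="pt")
            | 
            |     # Sampling several sequences at temperature > 0 is what
            |     # creates the branching "thought" traces.
            |     outs = model.generate(**inputs, do_sample=True,
            |                           temperature=0.8,
            |                           num_return_sequences=4,
            |                           max_new_tokens=256)
            |     traces = [tok.decode(o, skip_special_tokens=True)
            |               for o in outs]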
        
         | mccoyb wrote:
         | In the blog post, learned verifiers are mentioned. Are these
         | learned offline using data, and is the intent to learn a
         | scoring heuristic to help the search?
        
           | dimitry12 wrote:
            | The verifier is trained with soft values of reward-to-go
            | for each solution prefix, obtained from Monte-Carlo
            | rollouts of step-by-step solutions sampled from the "base"
            | model.
            | 
            | In other words: 1) sample step-by-step solutions from the
            | "base" model; 2) do it at non-zero temperature so that you
            | can get multiple continuations from each solution prefix;
            | 3) use MATH labels to decide whether a full solution (a
            | leaf/terminal node in the MC rollout) has reward `1` or
            | `0`; 4) roll up these rewards to calculate the reward-to-go
            | for each intermediate step.
            | 
            | Yes, a verifier trained in this manner can be used to score
            | solution prefixes (as a process verifier) or a full
            | solution (as an outcome verifier).
            | 
            | In the original paper (https://arxiv.org/abs/2408.03314)
            | they fine-tune a fresh verifier. HF's replication uses an
            | off-the-shelf verifier based on another paper:
            | https://arxiv.org/abs/2312.08935
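            | 
            | A toy sketch of step 4, assuming rollouts are stored as
            | (steps, terminal_reward) pairs (the names are mine, not the
            | paper's):
            | 
            |     from collections import defaultdict
            | 
            |     def prefix_values(rollouts):
            |         # Value of a prefix = mean terminal reward (0/1
            |         # from MATH labels) over all rollouts sharing it.
            |         totals = defaultdict(float)
            |         counts = defaultdict(int)
            |         for steps, reward in rollouts:
            |             for i in range(1, len(steps) + 1):
            |                 totals[steps[:i]] += reward
            |                 counts[steps[:i]] += 1
            |         return {p: totals[p] / counts[p] for p in totals}
            | 
            |     # Two rollouts branching after the first step: the
            |     # shared prefix ("a=2",) gets soft value 0.5. These
            |     # (prefix, value) pairs are the verifier's targets.
            |     rollouts = [(("a=2", "answer=4"), 1.0),
            |                 (("a=2", "answer=5"), 0.0)]
            |     print(prefix_values(rollouts))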
        
       | bilsbie wrote:
       | Eli5?
        
         | srush wrote:
         | For problems that require multi-step reasoning, standard LLMs
         | seem to be stuck. The field is increasingly interested in
         | models like o1 that output many "guesses" to find the right
         | one. Currently open-source does not know how to do this, but we
         | are reimplementing several possible directions to try. This
         | replicates one important path using search and a verifier
         | model.
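          | 
          | The simplest version is best-of-N. A sketch (generate/score
          | stand in for the solver and verifier models; this is not our
          | actual code):
          | 
          |     import random
          | 
          |     def best_of_n(generate, score, problem, n=8):
          |         # Sample n candidate solutions; keep the one the
          |         # verifier scores highest.
          |         candidates = [generate(problem) for _ in range(n)]
          |         return max(candidates, key=score)
          | 
          |     # Toy demo: the "solver" guesses, the "verifier" prefers
          |     # guesses close to 42.
          |     print(best_of_n(lambda p: random.randint(0, 100),
          |                     lambda c: -abs(c - 42), "find 42", n=16))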
        
         | loudmax wrote:
          | If I understand correctly, Hugging Face is exploring
          | approaches to trading a given model's output quality against
          | how long you let it run.
          | 
          | Normally when you run an LLM, you set your prompt and
          | whatever tunable parameters, and the LLM software (e.g.
          | llama.cpp) spits out tokens at whatever rate it can. If you
          | want higher quality, you run a bigger model (though you're
          | limited by the amount of memory you have available). If you
          | want higher speed, you run a smaller model. Hugging Face
          | seems to be looking at ways to make this tradeoff without
          | switching between different models.
        
           | yeahwhatever10 wrote:
           | So I can get LLM results from an SLM if I run it long enough?
        
             | vlovich123 wrote:
              | They show a Llama 3.2 1B with chain-of-thought that
              | outperforms Llama 3.1 8B, and a 3.2 3B that outperforms
              | 3.1 70B. It's less clear whether inference time is
              | actually faster for the CoT 3B using 256 generations vs
              | the 70B, if you have enough RAM. Basically a classic
              | RAM/compute trade-off.
        
               | dimitry12 wrote:
                | From a practical standpoint, scaling test-time compute
                | does enable datacenter-scale performance on the edge.
                | I cannot feasibly run a 70B on my iPhone, but I can run
                | a 3B, even if it takes a lot of time for it to produce
                | a solution comparable to the 70B's 0-shot.
                | 
                | I think it *is* an unlock.
        
             | beardedwizard wrote:
              | I struggle with this idea of "run it long enough", or
              | another description I have heard, "give the model time to
              | think". It's not a thing - it takes as long as it takes.
              | What I'm taking away from this is two things:
              | 
              | 1. The reason for generalizations like "long enough" and
              | "think more" is apparently that the methods are somewhat
              | obscure.
              | 
              | 2. Those methods are being explored by Hugging Face to
              | make them less obscure.
              | 
              | Am I getting that right? I have been struggling to see
              | past the metaphors and understand exactly what additional
              | computation is being done - and here I read it's
              | something like multiple guesses being fed back in and
              | chosen among, which means it's just multiple inferences
              | in series that are all related to solving one problem.
        
         | dimitry12 wrote:
          | To spend more compute at inference time, at least two simple
          | approaches are readily available:
          | 
          | 1) Make the model output a full solution, step by step, then
          | induce it to revise the solution - repeat this as many times
          | as you have token budget for. You can do this via prompting
          | alone (see Reflexion for example), or you can fine-tune the
          | model to do that. The paper explores fine-tuning the base
          | model to turn it into a self-revision model.
          | 
          | 2) Sample step-by-step (one "thought"-sentence per line)
          | solutions from the model, and do it at non-zero temperature
          | to be able to sample multiple next steps. Then use a verifier
          | model to choose between next-step candidates and prefer to
          | continue the rollout of the more promising branches of
          | "thoughts". There are many, many methods for exploring such a
          | tree when you can score intermediate nodes (beam search is an
          | almost 50-year-old algorithm!).
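          | 
          | A rough sketch of option 2 (all names here are placeholders,
          | not the repo's API):
          | 
          |     def beam_search(problem, propose_steps, verifier_score,
          |                     is_done, beam_width=4, max_steps=10):
          |         # Grow step-by-step solutions as a tree, keeping only
          |         # the beam_width prefixes the verifier scores highest
          |         # at each depth. propose_steps samples next "thoughts"
          |         # at temperature; verifier_score is the process
          |         # reward model.
          |         beams = [[]]  # each beam is a list of step strings
          |         for _ in range(max_steps):
          |             candidates = []
          |             for prefix in beams:
          |                 if is_done(prefix):
          |                     candidates.append(prefix)
          |                     continue
          |                 for step in propose_steps(problem, prefix):
          |                     candidates.append(prefix + [step])
          |             candidates.sort(
          |                 key=lambda p: verifier_score(problem, p),
          |                 reverse=True)
          |             beams = candidates[:beam_width]
          |             if all(is_done(b) for b in beams):
          |                 break
          |         return beams[0]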
        
       | boroboro4 wrote:
        | What's the point of such inference time compute if the
        | verifier is an 8B model itself? Am I missing something?
        
         | dimitry12 wrote:
          | I believe this is a valid point: HF's replication indeed uses
          | a larger off-the-shelf model as the verifier.
          | 
          | In contrast, in the original paper, the verifier is a fine-
          | tune of the exact same base model that is used to sample
          | step-by-step solutions (= the "solver").
        
           | boroboro4 wrote:
            | Using a different 1B model as the verifier would make
            | sense, yes. Using a Llama 8B fine-tune as the verifier,
            | then comparing the inference-time-scaled 1B against the 8B,
            | makes little sense to me.
            | 
            | Using the 3B model with the 8B verifier against the 70B
            | model would make sense too. That said, their performance
            | barely crossed the 70B line with 256 samples. That is
            | 256*(8+3)/70 ~ 40 times more computationally expensive than
            | running the 70B model as is.
        
             | dimitry12 wrote:
             | "1B solver + 8B verifier + search" beating 0-shot 70B is
             | nice, agree.
             | 
             | "1B solver + 8B verifier + search" beating 1B-0-shot or
             | 1B-majority as baselines isn't illustrative imo. In other
             | words, by using larger verifier, HF's replication fails to
             | establish a "fair" baseline. Still an awesome blog and
             | release/repository from HF's group - I love it!
        
           | zackangelo wrote:
            | Where did you see that? I thought they used an 8B model for
            | their reward model?
           | 
           | > To guide our search strategies, we used
           | RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model
           | that has been trained using process supervision
        
             | dimitry12 wrote:
             | "Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model,
             | and they use 3B for another experiment), and verifier is
             | `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.
             | 
             | See https://github.com/huggingface/search-and-
             | learn/blob/b3375f8... and
             | https://github.com/huggingface/search-and-
             | learn/blob/b3375f8...
             | 
             | In the original paper, they use PaLM 2-S* as "solver" and
             | its fine-tune as "verifier".
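              | 
              | Loading that pair is two standard transformers calls (a
              | sketch; the PRM's step-scoring convention is
              | model-specific and omitted here):
              | 
              |     from transformers import (AutoModelForCausalLM,
              |                               AutoTokenizer)
              | 
              |     solver_id = "meta-llama/Llama-3.2-1B-Instruct"
              |     prm_id = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"
              | 
              |     solver_tok = AutoTokenizer.from_pretrained(solver_id)
              |     solver = AutoModelForCausalLM.from_pretrained(solver_id)
              | 
              |     # The PRM is itself a Llama-family causal LM; how its
              |     # per-step scores are read out follows its model card.
              |     prm_tok = AutoTokenizer.from_pretrained(prm_id)
              |     prm = AutoModelForCausalLM.from_pretrained(prm_id)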
        
       ___________________________________________________________________
       (page generated 2024-12-20 23:01 UTC)