[HN Gopher] Open source inference time compute example from Hugg...
___________________________________________________________________
Open source inference time compute example from HuggingFace
Author : burningion
Score : 71 points
Date : 2024-12-16 20:35 UTC (4 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mentalgear wrote:
| Happy to see the "inference time compute" term being used
| primarily nowadays - it's a much more precise and appropriate term
| than the unwieldy "test-time compute" that OpenAI used back when
| they thought they had "invented" scaling at inference time.
| srush wrote:
| Full blog is here:
| https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
|
| Happy to answer any questions about these methods.
| OakNinja wrote:
| Excellent and interesting post!
|
| Minor gripe - the best-of-n / beam search illustration is not
| compatible with red-green color blindness. I literally cannot see
| the difference between the Rejected and the Selected dots even
| when I zoom in.
| srush wrote:
| Thanks for the feedback, and not minor. Sorry about that.
| dinp wrote:
| Great work! When I use models like o1, they work better than
| Sonnet and 4o for tasks that require some thinking, but the output
| is often very verbose. Is it possible to get the best of both
| worlds - the thinking still happens, giving better performance,
| but the output stays straightforward to work with, like with
| Sonnet and 4o? Did you observe similar behaviour with the 1B and
| 3B models? How does the model's behaviour change when used for
| normal tasks that don't require thinking?
|
| Also, how well do these models work for extracting structured
| output? E.g. perform OCR on some handwritten text with math,
| convert it to HTML, format the formulas correctly, etc. Single-
| shot prompting doesn't work well for such problems, but splitting
| the steps into consecutive API calls works well.
| srush wrote:
| That's a good point. We don't see that in our experiments because
| they're all in the math domain. However, for OAI it's plausible
| that training for o1 conflicts with standard instruction training,
| leading to a less human-preferred output style.
| dimitry12 wrote:
| In this paper and HF's replication, the model used to produce
| solutions to MATH problems is off-the-shelf. It is induced to
| produce step-by-step CoT-style solutions by few-shot ICL prompts
| or by instructions.
|
| Yes, the search process (beam search or best-of-N) does produce
| verbose traces, because there is branching involved when sampling
| "thoughts" from the base model. These branched traces (including
| incomplete "abandoned" branches) can be shown to the user or
| hidden, if the approach is deployed as-is.
| mccoyb wrote:
| In the blog post, learned verifiers are mentioned. Are these
| learned offline using data, and is the intent to learn a
| scoring heuristic to help the search?
| dimitry12 wrote:
| The verifier is trained with soft values of reward-to-go for each
| solution prefix, obtained from Monte Carlo rollouts of step-by-
| step solutions sampled from the "base" model.
|
| In other words: 1) sample step-by-step solutions from the "base"
| model; 2) do it at non-zero temperature so that you can get
| multiple continuations from each solution prefix; 3) use the MATH
| labels to decide whether a full solution (leaf/terminal node in an
| MC rollout) gets reward `1` or `0`; 4) roll these rewards up to
| calculate the reward-to-go for each intermediate step.
|
| Yes, a verifier trained in this manner can be used to score
| solution prefixes (as a process verifier) or a full solution (as
| an outcome verifier).
|
| In the original paper (https://arxiv.org/abs/2408.03314) they
| fine-tune a fresh verifier. HF's replication uses an off-the-
| shelf verifier based on another paper:
| https://arxiv.org/abs/2312.08935
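|
| Concretely, the reward-to-go labeling (step 4 above) looks roughly
| like this in simplified Python (the names here are illustrative,
| not the actual paper/repo code):
|
|     from collections import defaultdict
|
|     def reward_to_go_labels(rollouts):
|         # rollouts: list of (steps, reward) pairs, where `steps`
|         # is the list of step-strings of one sampled solution and
|         # `reward` is 1.0 if the final answer matched the MATH
|         # label, else 0.0
|         totals = defaultdict(float)
|         counts = defaultdict(int)
|         for steps, reward in rollouts:
|             # every prefix of a sampled solution is credited with
|             # the final outcome of that rollout
|             for i in range(1, len(steps) + 1):
|                 prefix = tuple(steps[:i])
|                 totals[prefix] += reward
|                 counts[prefix] += 1
|         # soft value = empirical probability that continuing this
|         # prefix under the base model ends in a correct answer
|         return {p: totals[p] / counts[p] for p in totals}
|
| These prefix values are then used as training targets for the
| verifier.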
| bilsbie wrote:
| Eli5?
| srush wrote:
| For problems that require multi-step reasoning, standard LLMs
| seem to be stuck. The field is increasingly interested in
| models like o1 that output many "guesses" to find the right
| one. Currently open-source does not know how to do this, but we
| are reimplementing several possible directions to try. This
| replicates one important path using search and a verifier
| model.
| loudmax wrote:
| If I understand correctly, Hugging Face is exploring approaches to
| tuning the output quality of a given model by varying how long you
| let it run.
|
| Normally when you run an LLM, you set your prompt and whatever
| tunable parameters, and the LLM software (e.g. llama.cpp) spits
| out tokens at whatever rate it can. If you want higher quality,
| you run a bigger model (though you're limited by the amount of
| memory you have available). If you want higher speed, you run a
| smaller model. Hugging Face seems to be looking at ways to make
| this tradeoff without switching between different models.
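|
| The simplest version of this knob is "best-of-n": sample many
| answers from the same small model and let a verifier pick one. A
| toy sketch (illustrative Python only; the model/verifier
| interfaces are made up, not HF's actual API):
|
|     def best_of_n(problem, model, verifier, n=16):
|         # sample n independent candidate solutions from the same
|         # small model, at temperature > 0 so they differ
|         candidates = [model.generate(problem, temperature=0.8)
|                       for _ in range(n)]
|         # score each full candidate with a reward/verifier model
|         # and keep the best; larger n = more compute = better
|         # quality, with the same underlying model
|         return max(candidates,
|                    key=lambda c: verifier.score(problem, c))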
| yeahwhatever10 wrote:
| So I can get LLM results from an SLM if I run it long enough?
| vlovich123 wrote:
| They show a Llama 3.2 1B that, with chain-of-thought search,
| outperforms Llama 3.1 8B, and a 3.2 3B that outperforms 3.1 70B.
| It's less clear whether inference time is actually faster for the
| CoT 3B using 256 generations vs. the 70B if you have enough RAM.
| Basically a classic RAM/compute trade-off.
| dimitry12 wrote:
| From a practical standpoint, scaling test-time compute does enable
| datacenter-scale performance on the edge. I cannot feasibly run a
| 70B on my iPhone, but I can run a 3B, even if it takes a lot of
| time for it to produce a solution comparable to the 70B's 0-shot.
|
| I think it *is* an unlock.
| beardedwizard wrote:
| I struggle with this idea of "run it long enough", or another
| description I have heard, "give the model time to think" - it's
| not a thing; it takes as long as it takes. What I'm taking away
| from this is two things:
|
| 1. the reason for generalizations like "long enough" and "think
| more" is apparently that the methods are somewhat obscure
|
| 2. those methods are being explored by Hugging Face to make them
| less obscure
|
| Am I getting that right? I have been struggling to see past the
| metaphors and understand exactly what additional computation is
| being done - and here I read it's something like multiple guesses
| being fed back in and chosen among, which means it's just multiple
| inferences in series that are all related to solving one problem.
| dimitry12 wrote:
| To spend more compute at inference time, at least two simple
| approaches are readily available:
|
| 1) make the model output a full solution, step by step, then
| induce it to revise that solution - and repeat this as many times
| as you have token budget for. You can do this via prompting alone
| (see Reflexion, for example), or you can fine-tune the model to do
| it. The paper explores fine-tuning the base model to turn it into
| a self-revision model.
|
| 2) sample step-by-step (one "thought"-sentence per line) solutions
| from the model, and do it at non-zero temperature to be able to
| sample multiple next steps. Then use a verifier model to choose
| between next-step candidates and prefer to continue the rollout of
| the more promising branches of "thoughts" (rough sketch below).
| There are many, many methods for exploring such a tree when you
| can score intermediate nodes (beam search is an almost 50-year-old
| algorithm!).
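|
| A toy sketch of approach 2 (illustrative Python only; the model
| and verifier interfaces are made up, not the paper's or HF's
| code):
|
|     def beam_search_solution(problem, model, verifier,
|                              beam_width=4, samples_per_step=4,
|                              max_steps=40):
|         beams = [[]]  # each beam is the list of steps so far
|         for _ in range(max_steps):
|             candidates = []
|             for prefix in beams:
|                 # branch: sample several possible next "thoughts"
|                 # at temperature > 0 so they differ
|                 for _ in range(samples_per_step):
|                     step = model.sample_next_step(problem, prefix)
|                     candidates.append(prefix + [step])
|             # keep only the prefixes the verifier scores highest
|             candidates.sort(
|                 key=lambda p: verifier.score(problem, p),
|                 reverse=True)
|             beams = candidates[:beam_width]
|             # stop once every surviving beam has produced a final
|             # answer (this stopping criterion is illustrative)
|             if all(b and b[-1].startswith("Answer:") for b in beams):
|                 break
|         return max(beams, key=lambda p: verifier.score(problem, p))
|
| Approach 1 is even simpler in code: generate a solution, append a
| revision instruction, and generate again in a loop until the token
| budget runs out.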
| boroboro4 wrote:
| What's the point of such inference-time compute if the verifier is
| an 8B model itself? Am I missing something?
| dimitry12 wrote:
| I believe this is a valid point: HF's replication indeed uses a
| larger off-the-shelf model as the verifier.
|
| In contrast, in the original paper the verifier is a fine-tune of
| the exact same base model that is used to sample step-by-step
| solutions (= the "solver").
| boroboro4 wrote:
| Using a different 1B model as the verifier would make sense, yes.
| Using a Llama 8B fine-tune as the verifier while comparing
| inference-time-scaled 1B against 8B makes little sense to me.
|
| Using a 3B model with an 8B verifier against a 70B model would
| make sense too. That said, their performance barely crossed the
| 70B line with 256 samples. That is 256*(8+3)/70 ~ 40 times more
| computationally expensive than running the 70B model as-is.
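|
| (That factor assumes a naive cost model where compute per
| generated token scales with parameter count, generations are of
| comparable length, and the 8B verifier runs over each of the 256
| samples: 256 * (3 + 8) / 70 ≈ 40.2.)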
| dimitry12 wrote:
| "1B solver + 8B verifier + search" beating 0-shot 70B is
| nice, agree.
|
| "1B solver + 8B verifier + search" beating 1B-0-shot or
| 1B-majority as baselines isn't illustrative imo. In other
| words, by using larger verifier, HF's replication fails to
| establish a "fair" baseline. Still an awesome blog and
| release/repository from HF's group - I love it!
| zackangelo wrote:
| Where did you see that? I thought they used an 8b model for
| their reward model?
|
| > To guide our search strategies, we used
| RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model
| that has been trained using process supervision
| dimitry12 wrote:
| "Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model,
| and they use 3B for another experiment), and verifier is
| `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.
|
| See https://github.com/huggingface/search-and-
| learn/blob/b3375f8... and
| https://github.com/huggingface/search-and-
| learn/blob/b3375f8...
|
| In the original paper, they use PaLM 2-S* as "solver" and
| its fine-tune as "verifier".
___________________________________________________________________
(page generated 2024-12-20 23:01 UTC)