[HN Gopher] Reflection 70B, the top open-source model
___________________________________________________________________
Reflection 70B, the top open-source model
Author : GavCo
Score : 101 points
Date : 2024-09-05 19:39 UTC (3 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| GavCo wrote:
| Hugging Face: https://huggingface.co/mattshumer/Reflection-70B
|
| Playground: https://reflection-playground-
| production.up.railway.app/
| nsagent wrote:
| If this does indeed beat all the closed source models, then I'm
| flabbergasted. The amount of time and resources Google, OpenAI,
| and Anthropic have put into improving the models to only be
| beaten in a couple weeks by two people (who as far as I know do
| not have PhDs and years of research experience) would be a pretty
| crazy feat.
|
| That said, I'm withholding judgment on how likely the claims are.
| A friend who developed NoCha [1] is running the model on that
| benchmark, which will really stress test its ability to reason
| over full novels. I'll reserve judgement until then.
|
| [1]: https://novelchallenge.github.io/
| moralestapia wrote:
| >A friend who developed NoCha [1] is running the model on that
| benchmark [...]
|
| Please do update us on the result.
| winddude wrote:
| PhDs aren't relevant. It's more just a certificate that you can
| learn to learn and stay committed to hard and challenging
| things. It does give bonus points to VCs, because it's seems to
| be easier to market to other VCs, same applies for hedge funds.
|
| And with fine tuning, there's zero math needed, it's a bit of
| common sense, and a lot's of data optimization.
| yaj54 wrote:
| Anyone have or know of a list of LLM challenges like this?
| Targeted use cases with unpublished test data?
| m3kw9 wrote:
| Fine tuning needs $$$ and knowledge on how fine tuning works.
| jamesblonde wrote:
| Just tried this out for coding. I asked it to download weather
| data for Dublin into a Pandas Dataframe and write it to
| Hopsworks. Worked as good as GPT-4o - code ran correctly. The
| playground is fast. Impressed!
| JoshMandel wrote:
| I'm surprised this does so well in benchmarks, given the
| intuition I'm getting about its behavior from quick testing.
|
| I gave it a medium-complexity design problem: Design the
| typescript interface for the state of a react app that manages a
| tree of chat turns/responses and displays the current path
| through the tree. (In other words, the kind of state that sits
| logically behind the ChatGPT or Claude Web UI, where previous
| conversation turns can be edited and used as a branching off
| point for new turns.)
|
| Reflection-70B suffered from a bad initial idea, just as Llama
| 70B generally does (proposing to duplicate state between the
| "tree of all messages" and the "path to currently displayed
| message"), which is a very common error. The automated reflection
| process identified a whole bunch of nitpicks but missed the
| glaring logical bug. Furthermore the final output was missing
| many of the details included in the initial reflection / chain-
| of-thought scratchpad, even though the UI hides the scratchpad as
| though it's unimportant for the user to read.
| imjonse wrote:
| Wonder why no Llama-3.1-8B based variant if the new training
| method has such good results. UPDATE: didn't work well
| https://x.com/mattshumer_/status/1831775436420083753?t=flm41...
| viraptor wrote:
| It's answered on Twitter. Not much improvement over other
| similar models at that size.
| og_kalu wrote:
| He said it didn't improve as much
|
| https://x.com/mattshumer_/status/1831775436420083753
| safoex wrote:
| Imagine if it was the reason in big corporations to not to
| investigate further some similar technique :)
| juxtaposicion wrote:
| Like other comments, I was also initially surprised. But I think
| the gains are both real and easy to understand where the
| improvements are coming from.
|
| Under the hood Reflection 70B seems to be a Llama-3.1 finetune
| that encourages the model to add <think>, <reflection> and
| <output> tokens and corresponding phases. This is an evolution of
| Chain-of-Thought's "think step by step" -- but instead of being a
| prompting technique, this fine-tune bakes examples of these
| phases more directly into the model. So the model starts with an
| initial draft and 'reflects' on it before issuing a final output.
|
| The extra effort spent on tokens, which effectively let the model
| 'think more' appears to let it defeat prompts which other strong
| models (4o, 3.5 Sonnet) appear to fumble. So for example, when
| asked "which is greater 9.11 or 9.9" the Reflection 70b model
| initially gets the wrong answer, then <reflects> on it, then
| spits the right output.
|
| Personally, the comparison to Claude and 4o doesn't quite seem
| apples-to-apples. If you were to have 4o/Claude take multiple
| rounds to review and reflect on their initial drafts, would we
| see similar gains? I suspect they would improve massively as
| well.
| rgbrgb wrote:
| > Personally, the comparison to Claude and 4o doesn't quite
| seem apples-to-apples. If you were to have 4o/Claude take
| multiple rounds to review and reflect on their initial drafts,
| would we see similar gains? I suspect they would improve
| massively as well.
|
| They may already implement this technique, we can't know.
| kgeist wrote:
| I suspect GPT4o already has training for CoT. I've noticed it
| often responds by saying something like "let's break it down
| step by step". Or maybe it's the system prompt.
| astrange wrote:
| Claude 3.5 does have some "thinking" ability - I've seen it
| pause and even say it was thinking before. Presumably this is
| just some output it decides not to show you.
| Tiberium wrote:
| That's only in the web version, it's just that they prompt
| it to do some CoT in the antThinking XML tag, and hide the
| output from inside that tag in the UI.
| xianshou wrote:
| Crazy how simple the technique is if this holds up. Just <think>
| and <reflection> plus synthetic data, used to finetune Llama 3.1
| 70B.
|
| Note that there's a threshold for how smart the model has to be
| to take advantage of this flow
| (https://x.com/mattshumer_/status/1831775436420083753) - 8B is
| too dumb.
|
| In which case, what happens if you apply this to a GPT-4o
| finetune, or to Claude 3.5 Sonnet?
|
| What happens if you combine it with variants of tree-based
| reasoning? With AlphaProof
| (https://www.nature.com/articles/s41586-023-06747-5#Sec3)? With
| MCTSr (https://arxiv.org/abs/2406.07394)?
| jug wrote:
| I was just thinking - since GPT-4o and Sonnet are closed
| models, do we know that this method was not already used to
| train them? And that Reflection is simply finding a path for
| greater improvements than they did. Llama 3.1 apparently didn't
| improve as much. It's just a thought though.
| angoragoats wrote:
| Can we please stop allowing links to Twitter? Rationale: the
| artificial limitations on that site around post size mean that
| most announcements (such as this one) are multiple posts. This,
| combined with the questionable design decision of hiding all
| reply tweets when a user is not logged in, means that many posts
| are completely missing crucial context for those of us who don't
| have Twitter accounts.
|
| Alternatively, Twitter links could be rewritten to redirect to
| one of the few Nitter instances that are still functional.
| miki123211 wrote:
| I believe this is against HN's values.
|
| HN allows, and has always allowed, links to paywalled sources,
| sources with geographic restrictions that refuse to display the
| content for some readers, and won't modify a posts URL due to
| the site being slashdotted / suffering from an HN hug of death.
| Twitter is no different, except maybe by being more
| ideologically polarizing.
|
| The place for alternative URLs is, and has always been, the
| comments.
| angoragoats wrote:
| Yeah, I understand this has been the case, but I guess I
| don't understand why it can't be changed, or why it's even a
| good thing.
|
| Seems like most others disagree with me though, so I guess
| I'll just skip over anything posted on Twitter.
| handzhiev wrote:
| Here's an unroll
|
| https://threadreaderapp.com/thread/1831767014341538166.html
| astrange wrote:
| > Rationale: the artificial limitations on that site around
| post size mean that most announcements (such as this one) are
| multiple posts.
|
| That limit actually doesn't apply to premium users/bluechecks,
| and he's using the other features like bold text.
|
| The problem with long posts like that is one, they're annoying
| to read because when you open one up you don't know how much of
| a time commitment they will be, and two, you can't reply to
| just part of them.
| RobotToaster wrote:
| At the risk of sounding like a stuck LLM, it's under the Llama
| licence, which isn't an open source licence because of the
| restrictions on fields of endeavour.
| rspoerri wrote:
| i hope the quantized version doesnt loose to much of it's
| quality.
| YetAnotherNick wrote:
| > we expect it to be the best model in the world.
|
| Have they trained/benchmarked the model? If not why not just
| release the 70B after a week if it is expected to only take a
| week. If it doesn't perform as well as expected, it would damage
| the reputation of 70B model.
| smusamashah wrote:
| We need results from these harder/different benchmarks which give
| pretty bad scores to current top LLMs.
|
| https://www.wolfram.com/llm-benchmarking-project/
|
| https://help.kagi.com/kagi/ai/llm-benchmark.html
|
| Edit : There are few other benchmarks that give pretty low scores
| (<20%) to top LLMs. Can't find them atm. There was a benchmark
| with common sense easy looking questions.
|
| Edit: found two more papers
|
| https://arxiv.org/html/2405.19616
|
| https://arxiv.org/html/2406.02061v1
|
| Edit: How about Wordle?
|
| https://www.strangeloopcanon.com/p/what-can-llms-never-do
|
| https://news.ycombinator.com/item?id=40179232
| freediver wrote:
| I am happy to run the tests on Kagi LLM benchmark. Is there an
| API endpoint for this model anywhere?
| botro wrote:
| "The task consists of going from English-language
| specifications to Wolfram Language code. The test cases are
| exercises from Stephen Wolfram's An Elementary Introduction to
| the Wolfram Language."
|
| I think this benchmark would really only tell me whether
| Wolframs book was in the training data.
| smusamashah wrote:
| Yeah, may be should skip that benchmark.
| spencerchubb wrote:
| I wonder how good it is with multi-turn conversations
| rwl4 wrote:
| Interesting idea!
|
| You can somewhat recreate the essence of this using a system
| prompt with any sufficiently sized model. Here's the prompt I
| tried for anybody who's interested: You are an AI
| assistant designed to provide detailed, step-by-step responses.
| Your outputs should follow this structure: 1. Begin
| with a <thinking> section. Everything in this section is
| invisible to the user. 2. Inside the thinking section:
| a. Briefly analyze the question and outline your approach.
| b. Present a clear plan of steps to solve the problem.
| c. Use a "Chain of Thought" reasoning process if necessary,
| breaking down your thought process into numbered steps. 3.
| Include a <reflection> section for each idea where you:
| a. Review your reasoning. b. Check for potential errors
| or oversights. c. Confirm or adjust your conclusion if
| necessary. 4. Be sure to close all reflection sections.
| 5. Close the thinking section with </thinking>. 6. Provide
| your final answer in an <output> section. Always use
| these tags in your responses. Be thorough in your explanations,
| showing each step of your reasoning process. Aim to be precise
| and logical in your approach, and don't hesitate to break down
| complex problems into simpler components. Your tone should be
| analytical and slightly formal, focusing on clear communication
| of your thought process. Remember: Both <thinking>
| and <reflection> MUST be tags and must be closed at their
| conclusion Make sure all <tags> are on separate
| lines with no other text. Do not include other text on a line
| containing a tag.
| m3kw9 wrote:
| This thing would be overly verbose
| striking wrote:
| You'd hide the contents of the tags in whatever presentation
| layer you're using. It's known that allowing the model to be
| verbose gives it more opportunities to perform computation,
| which may allow it to perform better.
| agucova wrote:
| I mean, this is how the Reflection model works. It's just
| hiding that from you in an interface.
| d_sc wrote:
| Any way to have this work in LM Studio? Not showing up in search
| results.
| m3kw9 wrote:
| May need an update from LM plus someone converting it to gguf
| format
| Bjorkbat wrote:
| Worth mentioning that LlaMa 70b already had pretty high benchmark
| scores to begin with https://ai.meta.com/blog/meta-llama-3-1/
|
| Still impressive that it can beat top models with fine-tuning,
| but now I'm mostly impressed by the fact that the 70b model was
| so good to begin with.
___________________________________________________________________
(page generated 2024-09-05 23:00 UTC)