[HN Gopher] Reflection 70B, the top open-source model
       ___________________________________________________________________
        
       Reflection 70B, the top open-source model
        
       Author : GavCo
       Score  : 101 points
       Date   : 2024-09-05 19:39 UTC (3 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | GavCo wrote:
       | Hugging Face: https://huggingface.co/mattshumer/Reflection-70B
       | 
       | Playground: https://reflection-playground-
       | production.up.railway.app/
        
       | nsagent wrote:
       | If this does indeed beat all the closed source models, then I'm
       | flabbergasted. The amount of time and resources Google, OpenAI,
       | and Anthropic have put into improving the models to only be
       | beaten in a couple weeks by two people (who as far as I know do
       | not have PhDs and years of research experience) would be a pretty
       | crazy feat.
       | 
       | That said, I'm withholding judgment on how likely the claims are.
       | A friend who developed NoCha [1] is running the model on that
       | benchmark, which will really stress test its ability to reason
       | over full novels. I'll reserve judgement until then.
       | 
       | [1]: https://novelchallenge.github.io/
        
         | moralestapia wrote:
         | >A friend who developed NoCha [1] is running the model on that
         | benchmark [...]
         | 
         | Please do update us on the result.
        
         | winddude wrote:
         | PhDs aren't relevant. It's more just a certificate that you can
         | learn to learn and stay committed to hard and challenging
         | things. It does give bonus points to VCs, because it's seems to
         | be easier to market to other VCs, same applies for hedge funds.
         | 
         | And with fine tuning, there's zero math needed, it's a bit of
         | common sense, and a lot's of data optimization.
        
         | yaj54 wrote:
         | Anyone have or know of a list of LLM challenges like this?
         | Targeted use cases with unpublished test data?
        
         | m3kw9 wrote:
         | Fine tuning needs $$$ and knowledge on how fine tuning works.
        
       | jamesblonde wrote:
       | Just tried this out for coding. I asked it to download weather
       | data for Dublin into a Pandas Dataframe and write it to
       | Hopsworks. Worked as good as GPT-4o - code ran correctly. The
       | playground is fast. Impressed!
        
       | JoshMandel wrote:
       | I'm surprised this does so well in benchmarks, given the
       | intuition I'm getting about its behavior from quick testing.
       | 
       | I gave it a medium-complexity design problem: Design the
       | typescript interface for the state of a react app that manages a
       | tree of chat turns/responses and displays the current path
       | through the tree. (In other words, the kind of state that sits
       | logically behind the ChatGPT or Claude Web UI, where previous
       | conversation turns can be edited and used as a branching off
       | point for new turns.)
       | 
       | Reflection-70B suffered from a bad initial idea, just as Llama
       | 70B generally does (proposing to duplicate state between the
       | "tree of all messages" and the "path to currently displayed
       | message"), which is a very common error. The automated reflection
       | process identified a whole bunch of nitpicks but missed the
       | glaring logical bug. Furthermore the final output was missing
       | many of the details included in the initial reflection / chain-
       | of-thought scratchpad, even though the UI hides the scratchpad as
       | though it's unimportant for the user to read.
        
       | imjonse wrote:
       | Wonder why no Llama-3.1-8B based variant if the new training
       | method has such good results. UPDATE: didn't work well
       | https://x.com/mattshumer_/status/1831775436420083753?t=flm41...
        
         | viraptor wrote:
         | It's answered on Twitter. Not much improvement over other
         | similar models at that size.
        
         | og_kalu wrote:
         | He said it didn't improve as much
         | 
         | https://x.com/mattshumer_/status/1831775436420083753
        
         | safoex wrote:
         | Imagine if it was the reason in big corporations to not to
         | investigate further some similar technique :)
        
       | juxtaposicion wrote:
       | Like other comments, I was also initially surprised. But I think
       | the gains are both real and easy to understand where the
       | improvements are coming from.
       | 
       | Under the hood Reflection 70B seems to be a Llama-3.1 finetune
       | that encourages the model to add <think>, <reflection> and
       | <output> tokens and corresponding phases. This is an evolution of
       | Chain-of-Thought's "think step by step" -- but instead of being a
       | prompting technique, this fine-tune bakes examples of these
       | phases more directly into the model. So the model starts with an
       | initial draft and 'reflects' on it before issuing a final output.
       | 
       | The extra effort spent on tokens, which effectively let the model
       | 'think more' appears to let it defeat prompts which other strong
       | models (4o, 3.5 Sonnet) appear to fumble. So for example, when
       | asked "which is greater 9.11 or 9.9" the Reflection 70b model
       | initially gets the wrong answer, then <reflects> on it, then
       | spits the right output.
       | 
       | Personally, the comparison to Claude and 4o doesn't quite seem
       | apples-to-apples. If you were to have 4o/Claude take multiple
       | rounds to review and reflect on their initial drafts, would we
       | see similar gains? I suspect they would improve massively as
       | well.
        
         | rgbrgb wrote:
         | > Personally, the comparison to Claude and 4o doesn't quite
         | seem apples-to-apples. If you were to have 4o/Claude take
         | multiple rounds to review and reflect on their initial drafts,
         | would we see similar gains? I suspect they would improve
         | massively as well.
         | 
         | They may already implement this technique, we can't know.
        
           | kgeist wrote:
           | I suspect GPT4o already has training for CoT. I've noticed it
           | often responds by saying something like "let's break it down
           | step by step". Or maybe it's the system prompt.
        
           | astrange wrote:
           | Claude 3.5 does have some "thinking" ability - I've seen it
           | pause and even say it was thinking before. Presumably this is
           | just some output it decides not to show you.
        
             | Tiberium wrote:
             | That's only in the web version, it's just that they prompt
             | it to do some CoT in the antThinking XML tag, and hide the
             | output from inside that tag in the UI.
        
       | xianshou wrote:
       | Crazy how simple the technique is if this holds up. Just <think>
       | and <reflection> plus synthetic data, used to finetune Llama 3.1
       | 70B.
       | 
       | Note that there's a threshold for how smart the model has to be
       | to take advantage of this flow
       | (https://x.com/mattshumer_/status/1831775436420083753) - 8B is
       | too dumb.
       | 
       | In which case, what happens if you apply this to a GPT-4o
       | finetune, or to Claude 3.5 Sonnet?
       | 
       | What happens if you combine it with variants of tree-based
       | reasoning? With AlphaProof
       | (https://www.nature.com/articles/s41586-023-06747-5#Sec3)? With
       | MCTSr (https://arxiv.org/abs/2406.07394)?
        
         | jug wrote:
         | I was just thinking - since GPT-4o and Sonnet are closed
         | models, do we know that this method was not already used to
         | train them? And that Reflection is simply finding a path for
         | greater improvements than they did. Llama 3.1 apparently didn't
         | improve as much. It's just a thought though.
        
       | angoragoats wrote:
       | Can we please stop allowing links to Twitter? Rationale: the
       | artificial limitations on that site around post size mean that
       | most announcements (such as this one) are multiple posts. This,
       | combined with the questionable design decision of hiding all
       | reply tweets when a user is not logged in, means that many posts
       | are completely missing crucial context for those of us who don't
       | have Twitter accounts.
       | 
       | Alternatively, Twitter links could be rewritten to redirect to
       | one of the few Nitter instances that are still functional.
        
         | miki123211 wrote:
         | I believe this is against HN's values.
         | 
         | HN allows, and has always allowed, links to paywalled sources,
         | sources with geographic restrictions that refuse to display the
         | content for some readers, and won't modify a posts URL due to
         | the site being slashdotted / suffering from an HN hug of death.
         | Twitter is no different, except maybe by being more
         | ideologically polarizing.
         | 
         | The place for alternative URLs is, and has always been, the
         | comments.
        
           | angoragoats wrote:
           | Yeah, I understand this has been the case, but I guess I
           | don't understand why it can't be changed, or why it's even a
           | good thing.
           | 
           | Seems like most others disagree with me though, so I guess
           | I'll just skip over anything posted on Twitter.
        
         | handzhiev wrote:
         | Here's an unroll
         | 
         | https://threadreaderapp.com/thread/1831767014341538166.html
        
         | astrange wrote:
         | > Rationale: the artificial limitations on that site around
         | post size mean that most announcements (such as this one) are
         | multiple posts.
         | 
         | That limit actually doesn't apply to premium users/bluechecks,
         | and he's using the other features like bold text.
         | 
         | The problem with long posts like that is one, they're annoying
         | to read because when you open one up you don't know how much of
         | a time commitment they will be, and two, you can't reply to
         | just part of them.
        
       | RobotToaster wrote:
       | At the risk of sounding like a stuck LLM, it's under the Llama
       | licence, which isn't an open source licence because of the
       | restrictions on fields of endeavour.
        
       | rspoerri wrote:
       | i hope the quantized version doesnt loose to much of it's
       | quality.
        
       | YetAnotherNick wrote:
       | > we expect it to be the best model in the world.
       | 
       | Have they trained/benchmarked the model? If not why not just
       | release the 70B after a week if it is expected to only take a
       | week. If it doesn't perform as well as expected, it would damage
       | the reputation of 70B model.
        
       | smusamashah wrote:
       | We need results from these harder/different benchmarks which give
       | pretty bad scores to current top LLMs.
       | 
       | https://www.wolfram.com/llm-benchmarking-project/
       | 
       | https://help.kagi.com/kagi/ai/llm-benchmark.html
       | 
       | Edit : There are few other benchmarks that give pretty low scores
       | (<20%) to top LLMs. Can't find them atm. There was a benchmark
       | with common sense easy looking questions.
       | 
       | Edit: found two more papers
       | 
       | https://arxiv.org/html/2405.19616
       | 
       | https://arxiv.org/html/2406.02061v1
       | 
       | Edit: How about Wordle?
       | 
       | https://www.strangeloopcanon.com/p/what-can-llms-never-do
       | 
       | https://news.ycombinator.com/item?id=40179232
        
         | freediver wrote:
         | I am happy to run the tests on Kagi LLM benchmark. Is there an
         | API endpoint for this model anywhere?
        
         | botro wrote:
         | "The task consists of going from English-language
         | specifications to Wolfram Language code. The test cases are
         | exercises from Stephen Wolfram's An Elementary Introduction to
         | the Wolfram Language."
         | 
         | I think this benchmark would really only tell me whether
         | Wolframs book was in the training data.
        
           | smusamashah wrote:
           | Yeah, may be should skip that benchmark.
        
       | spencerchubb wrote:
       | I wonder how good it is with multi-turn conversations
        
       | rwl4 wrote:
       | Interesting idea!
       | 
       | You can somewhat recreate the essence of this using a system
       | prompt with any sufficiently sized model. Here's the prompt I
       | tried for anybody who's interested:                 You are an AI
       | assistant designed to provide detailed, step-by-step responses.
       | Your outputs should follow this structure:            1. Begin
       | with a <thinking> section. Everything in this section is
       | invisible to the user.       2. Inside the thinking section:
       | a. Briefly analyze the question and outline your approach.
       | b. Present a clear plan of steps to solve the problem.
       | c. Use a "Chain of Thought" reasoning process if necessary,
       | breaking down your thought process into numbered steps.       3.
       | Include a <reflection> section for each idea where you:
       | a. Review your reasoning.          b. Check for potential errors
       | or oversights.          c. Confirm or adjust your conclusion if
       | necessary.       4. Be sure to close all reflection sections.
       | 5. Close the thinking section with </thinking>.       6. Provide
       | your final answer in an <output> section.              Always use
       | these tags in your responses. Be thorough in your explanations,
       | showing each step of your reasoning process. Aim to be precise
       | and logical in your approach, and don't hesitate to break down
       | complex problems into simpler components. Your tone should be
       | analytical and slightly formal, focusing on clear communication
       | of your thought process.              Remember: Both <thinking>
       | and <reflection> MUST be tags and must be closed at their
       | conclusion              Make sure all <tags> are on separate
       | lines with no other text. Do not include other text on a line
       | containing a tag.
        
         | m3kw9 wrote:
         | This thing would be overly verbose
        
           | striking wrote:
           | You'd hide the contents of the tags in whatever presentation
           | layer you're using. It's known that allowing the model to be
           | verbose gives it more opportunities to perform computation,
           | which may allow it to perform better.
        
           | agucova wrote:
           | I mean, this is how the Reflection model works. It's just
           | hiding that from you in an interface.
        
       | d_sc wrote:
       | Any way to have this work in LM Studio? Not showing up in search
       | results.
        
         | m3kw9 wrote:
         | May need an update from LM plus someone converting it to gguf
         | format
        
       | Bjorkbat wrote:
       | Worth mentioning that LlaMa 70b already had pretty high benchmark
       | scores to begin with https://ai.meta.com/blog/meta-llama-3-1/
       | 
       | Still impressive that it can beat top models with fine-tuning,
       | but now I'm mostly impressed by the fact that the 70b model was
       | so good to begin with.
        
       ___________________________________________________________________
       (page generated 2024-09-05 23:00 UTC)