[HN Gopher] DeepScaleR: Surpassing O1-Preview with a 1.5B Model ...
       ___________________________________________________________________
        
       DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
        
       Author : sijuntan
       Score  : 228 points
       Date   : 2025-02-11 19:59 UTC (3 hours ago)
        
 (HTM) web link (pretty-radio-b75.notion.site)
 (TXT) w3m dump (pretty-radio-b75.notion.site)
        
       | daft_pink wrote:
       | Would really love it if we could get uncensored models this way.
        
         | xeckr wrote:
         | It looks like we're moving in that direction.
        
       | dang wrote:
       | We changed the URL from https://github.com/agentica-
       | project/deepscaler to the article that gives more background
       | information (and also points to the repo).
       | 
       | (Submitted title was "Replicating Deepseek-R1 for $4500: RL
       | Boosts 1.5B Model Beyond o1-preview")
        
       | mrcwinn wrote:
       | Newb question. If the big boys are crawling, scraping, stealing
       | the whole of the entire internet to train their models -- how are
       | these OS alternatives aiming to outperform them? Is it as simple
       | as... ALSO crawl, scrape, steal the whole of the entire internet?
        
         | plufz wrote:
          | I am no expert, but wasn't part of DeepSeek's success to use
          | synthetic data, at least for code and math?
        
         | sebzim4500 wrote:
         | Yes, but in this case the RL stage doesn't actually need a huge
         | amount of data. Most of the data is ingested in the pre-
         | training phase, which in this case was done by Alibaba.
         | 
         | There was a recent paper [1] that showed that for one
         | particular model it was possible to get very good reasoning
         | performance by finetuning on under a thousand examples of
         | reasoning.
         | 
         | [1] https://arxiv.org/abs/2502.03387
        
         | smallmancontrov wrote:
         | No, the secret is to take outputs from previous models and put
         | them in a clothes dryer with starch to give them the gently
         | worn appearance of organically collected data.
        
         | yieldcrv wrote:
         | No, fine tuning doesn't require that
         | 
          | and models have been using synthetic data sets for like 2 years now
         | 
         | as in, the huge models did the crawling and then the next
         | models had conversations with the huge models
         | 
         | oversimplified but basically the scraping ship has sailed,
         | artists/writers will never be satisfied, and other companies
         | bragging about using whatever aspirational goal post of
         | licensed material are just doing it for brownie points
        
           | mekaron7 wrote:
           | Does this mean the model of LLM = search engine is also
           | coming to an end? Ignoring things like functions and google
           | search capabilities, current chatgpt models have quite a lot
           | of niche knowledge, I assume from crawling the entire
           | internet.
        
             | kingo55 wrote:
             | I speculate we'll just see classes of models developed.
             | Ones with lots of knowledge (even domain specific) and
             | models with reasoning capabilities.
        
             | astrange wrote:
             | Creative writing is also harmed by limiting the pretraining
             | data.
             | 
              | e.g. if I ask DeepSeek to write a story it tends to write the
             | bullet points outline of a story. And some other models
             | always give characters the same names.
        
             | yieldcrv wrote:
             | no, companies are in a quagmire of having to update them
             | every quarter
             | 
             | for now, they're not doing that but just releasing new
             | models that also have updated information
             | 
              | there is one group that's scraping the latest information
              | and including it in their latest models, another group
              | that's only doing synthetic data sets from the prior group's
              | models, and every enthusiast everywhere making fine-tuned
              | LLMs with synthetic data sets and whatever niche they want
        
         | simonw wrote:
         | "Stealing the whole of the entire internet" isn't a
         | particularly accurate mental model these days.
         | 
         | That's more or less what people were doing back in 2023 -
         | crawling _everything_ and dumping as much data in as possible.
         | 
         | It's not a great strategy to build a best-in-class model
         | though, as a lot of the internet is junk. The
         | SolidGoldMagikarp/davidjl bug is the kind of thing that happens
         | if you crawl all of https://www.reddit.com/r/counting/ for
         | example: https://simonwillison.net/2023/Jun/8/gpt-
         | tokenizers/#glitch-...
         | 
         | These days model training labs are more selective about what
          | they train on. Most of the game of training a great model comes
          | down to carefully selecting your training data. They still use
          | _a lot_ of unlicensed data but it's a bit more sophisticated than
          | just dumping in everything they can find.
        
           | nicce wrote:
           | You can just "steal" all the books and paywalled research
           | papers. That is a lot already.
        
             | simonw wrote:
             | Hence the recent story/scandal about Meta torrenting
             | LibGen: https://arstechnica.com/tech-policy/2025/02/meta-
             | torrented-o...
        
         | littlestymaar wrote:
         | You don't need to crawl and filter lots of internet data when
         | you can disteal large models.
        
         | janalsncm wrote:
         | The answer is that crawling the whole internet is only for
         | training a base model which is expensive and compute-intensive.
         | 
         | R1 didn't train a base model, they performed additional steps
         | on top of a previously-trained base model (V3). These guys are
         | doing something similar.
        
       | simonw wrote:
       | Does anyone have any good simple prompts for testing new
       | "reasoning" models like this one?
       | 
       | "Count the letter Rs in the word strawberry" is a bit dull!
       | 
       | I'm trying this one locally using Ollama and the smallest
       | quantized GGUF version (769MB) I could find -
       | https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... -
       | I ran it like this:                 ollama run
       | hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:IQ3_XXS
       | 
       | Here's the answer it gave me:
       | https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... -
       | it got the correct answer after double-checking itself 9 times!
        | And if you look at its thoughts it made a pretty critical error
        | right at the start:
        | 
        |     "Strawberry" has the letters S, T, R, A, W, B, E, R, F, U, R, E.
        |     Let me count each 'R' as I go along.
       | 
       | Hopefully the versions that aren't quantized that tiny do better
       | than that.
        
         | Palmik wrote:
         | The negative quality impact of quantization is more pronounced
         | for smaller models [0], so I'm surprised this tiny quant works
         | at all.
         | 
         | [0] or rather models closer to saturation, which is a function
         | of model params and amount of training
        
           | simonw wrote:
           | Yeah I'm very impressed that the tiniest quant version
           | appears partially usable... especially now I've replicated
           | that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much
           | larger one: https://news.ycombinator.com/item?id=43018494
        
         | simonw wrote:
         | ... well that's weird, I got exactly the same weird bug on the
         | 7.1 GB F32 GGUF version:
         | https://gist.github.com/simonw/58ff74a55b402dc55764a567b10ec...
          | % ollama run hf.co/bartowski/agentica-
          | org_DeepScaleR-1.5B-Preview-GGUF:F32
          | pulling manifest
          | pulling 84159ad584aa... 100%  7.1 GB
          | pulling 8aa4c0321ccd... 100%   493 B
          | pulling c737442a210a... 100%   170 B
          | pulling ed76df87b934... 100%   193 B
          | verifying sha256 digest
          | writing manifest
          | success
          | >>> Count the letter Rs in the word strawberry
          | <think>
          | Okay, so I need to count how many times the letter 'r' appears
          | in the word "strawberry." Hmm, let's break this down step by
          | step. First, I should probably spell out each letter of the word
          | to make sure I don't miss any 'r's.
          | Starting with "strawberry." Let me write it out: S, T, R, A, W,
          | B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it correctly.
          | Actually, let's count each letter properly.
          | The word is spelled as follows: S T R A W B E R F U R Y. Wait,
          | no, that might be longer than needed. Let me write it out more
          | accurately: The full word is "strawberry." So breaking it down:
          | 1. S  2. T  3. R  4. A  5. W  6. B  7. E  8. R  9. F  10. U
          | 11. R  12. Y
         | 
         | Update: running the F16 model fixed that problem:
         | ollama run hf.co/bartowski/agentica-
         | org_DeepScaleR-1.5B-Preview-GGUF:F16
         | 
         | See https://news.ycombinator.com/item?id=43017599#43018767 and
         | https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...
        
           | kristopolous wrote:
           | makes me wonder if there's some exploitable data leak in a
           | similar kind of formulation.
        
         | evrimoztamur wrote:
         | Is it a quantisation or tokenisation problem?
        
           | simonw wrote:
           | Having replicated it at F32 I now suspect tokenization.
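            | 
            | A quick way to see what the model actually gets as input (a
            | minimal sketch, assuming tiktoken is installed; cl100k_base is
            | just a stand-in here, DeepScaleR itself uses the Qwen
            | tokenizer, but the principle is the same):
            | 
            |     import tiktoken
            | 
            |     enc = tiktoken.get_encoding("cl100k_base")
            |     ids = enc.encode("strawberry")
            |     # the word arrives as a few multi-character chunks,
            |     # not individual letters
            |     print([enc.decode([i]) for i in ids])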
        
             | mluo wrote:
             | Try bfloat16! We have a bug where the model was saved as
             | fp32.
        
               | simonw wrote:
               | I just tried it with this 3.6GB F16 model:
               | ollama run hf.co/bartowski/agentica-
               | org_DeepScaleR-1.5B-Preview-GGUF:F16
               | 
               | And this time it didn't get confused with the
               | tokenization of strawberry! https://gist.github.com/simon
               | w/9e79f96d69f10bc7ba540c87ea0e8...
        
               | mluo wrote:
               | Nice, very glad to see it works! Small models are very
               | sensitive to the dtype :(
        
         | throwup238 wrote:
         | I always ask every model to implement a Qt QSyntaxHighlighter
         | subclass for syntax highlighting code and a QAbstractListModel
         | subclass that parses markdown into blocks - in C++, both
          | implemented using tree sitter. It sounds like a coding
          | problem but it's much more a reasoning problem of how to
          | combine the two APIs, and it lies outside the training data. I
         | test it with multiple levels of prompt fidelity that I have
         | built up watching the many mistakes past models have made and
         | o3-mini-high and o1 can usually get it done within a few
         | iterations.
         | 
         | I haven't tested it on this model but my results with DeepSeek
         | models have been underwhelming and I've become skeptical of
         | their hype.
        
           | greenavocado wrote:
           | Give it a try with nvidia llama 3.1 nemotron 70b. It is the
           | only model that can give useful Gstreamer code
        
         | nialv7 wrote:
         | this model is specifically trained for solving math problems,
         | so ask it some math questions?
        
           | simonw wrote:
           | I'm lazy. Do you know of any good test math questions for a
           | model of this size?
        
             | CamperBob2 wrote:
             | Try:                  Knowing that 1^3 + 2^3 + 3^3 + 4^3 +
             | ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 +
             | 6^3 + ... + 22^3 + 24^3?
             | 
             | DeepSeek R1 has no trouble with that one.
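              | 
              | For reference, each even cube is 8x the corresponding cube,
              | so the expected answer is 8 * 6084 = 48672. A quick check (a
              | sketch in Python):
              | 
              |     even_cubes = sum(n**3 for n in range(2, 25, 2))
              |     print(even_cubes, 8 * 6084)  # both print 48672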
        
         | cheeko1234 wrote:
         | I have two. One is a simple one that only deepseek R1 has
         | passed (in my opinion):
         | 
         | I have a 12 liter jug and a 6 liter jug. How do I get exactly 6
         | liters of water?
         | 
         | Answer (Deepseek): Fill the 6-liter jug completely to obtain
         | exactly 6 liters of water.
         | 
          | Every other LLM I've tried, including o3-mini-high: Fill the
          | 12-liter jug completely. Pour it into the 6-liter jug.
         | 
         | Although o3 did get it right in the reasoning: It seems like
         | the user has a 12-liter jug and a 6-liter jug. The simplest
         | answer is to just fill the 6-liter jug directly with water--
         | done! But maybe there's a catch, like needing to use both jugs
         | somehow.
         | 
         | So it knows that the 12 liter jug is mentioned uselessly, but
         | most LLMs HAVE to use the 12 liter jug since it's mentioned in
         | the prompt.
         | 
          | Another prompt I recently found via Perplexity (that DeepSeek
          | got correct on the first try, but other LLMs can only get it
          | right if I add: "Think step by step before answering."):
         | 
         | A 67-year-old woman with congenital bicuspid aortic valve is
         | admitted to the hospital because of a 2-day history of fever
         | and chills. Her current medication is lisinopril. Physical
          | examination reveals a temperature of 38.5°C (101.3°F),
         | blood pressure of 110/70 mmHg, and a heart rate of 100/min.
         | Cardiac auscultation demonstrates a grade 3/6 systolic ejection
         | murmur at the right upper sternal border. Blood cultures are
         | obtained, and empiric antibiotic therapy is initiated. Which of
         | the following is the most likely causative organism? A)
         | Staphylococcus aureus B) Streptococcus viridans C) Escherichia
         | coli D) Pseudomonas aeruginosa E) Candida albicans
        
           | rfurmani wrote:
           | Wow, Gemini 2 has a 9 step process for the first question
           | https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/
           | 
           | > Here's how you can get exactly 6 liters of water using a
           | 12-liter jug and a 6-liter jug:
           | 
           | 1. Fill the 6-liter jug completely.
           | 
           | 2. Pour the water from the 6-liter jug into the 12-liter jug.
           | You now have 6 liters in the 12-liter jug.
           | 
           | 3. Fill the 6-liter jug completely again.
           | 
           | 4. Pour water from the 6-liter jug into the 12-liter jug
           | until the 12-liter jug is full. Since the 12-liter jug
           | already had 6 liters, you will pour 6 liters. This leaves you
           | with 0 liters in the 6 liter jug.
           | 
           | 5. Now empty the 12 liter jug.
           | 
           | 6. Pour the water from the 6-liter jug into the 12-liter jug.
           | 
           | 7. Fill the 6-liter jug completely again.
           | 
           | 8. Pour the water from the 6-liter jug into the 12-liter jug
           | until the 12-liter jug is full. Since the 12-liter jug
           | already had 0 liters, you will pour 6 liters. This leaves you
           | with 0 liters in the 6 liter jug.
           | 
           | 9. There are 6 liters left in the 6 liter jug.
        
             | sdwr wrote:
             | It really loves the sound of its own voice!
        
           | egorfine wrote:
           | Deepseek just told me this:
           | 
           | To measure exactly 6 liters using a 12-liter and a 6-liter
           | jug:
           | 
            | 1. Fill the 12L jug completely.
            | 2. Pour water from the 12L jug into the 6L jug until it's full.
            | 
            | After this process, the 12L jug will have exactly 6 liters
            | left.
        
           | DebtDeflation wrote:
           | >Every other LLM I've tried include o3-mini-high: Fill the
           | 12-liter jug completely. Pour it into the 6 liter jug.
           | 
            | Try it with a 12L jug and a 4L jug and ask for 4L. See if it
            | tells you to just fill the 4L, or to fill the 12L and pour
            | into the 4L twice, discarding both times, so that 4L remain
            | in the 12L jug.
            | 
            | Even though the latter is still technically correct, it
            | demonstrates that there's no real "reasoning" happening, just
            | regurgitation of training data.
        
             | CamperBob2 wrote:
             | (Shrug) R1 has no problem with that. To the extent it's
             | confused, it is only because it is wondering if it's a
             | trick question.
             | 
             | There's 'mindless regurgitation' going on here, but not by
             | the AI model.
        
           | dgacmu wrote:
           | Is the answer strep viridans?
        
           | neodypsis wrote:
           | Interestingly, o3-mini-high was correct when first thinking
           | about it:
           | 
           | > Okay, we're asked how to get exactly 6 liters of water
           | using an 12-liter and a 6-liter jug. The immediate thought is
           | to just fill the 6-liter jug, but that seems too simple,
           | doesn't it? So maybe there's a trick here. Perhaps this is a
           | puzzle where the challenge is to measure 6 liters with some
           | pouring involved. I'll stick with the simple solution for now
           | --fill the 6-liter jug and stop there.
        
             | jazzyjackson wrote:
              | I have to take all these comparisons with a heap of salt
              | because no one bothers to run the test 20 times on each
              | model to smooth out the probabilistic nature of the LLM
              | landing on the right answer. There must be some fallacy for
              | this, where you sample once from each and declare a
              | definitive winner; I see it all the time.
        
           | wincy wrote:
           | I just asked the first question to O1 Pro, it thought for 48
           | seconds and messed up.
           | 
           | It told me to fill the 12 liter jug then pour that into the 6
           | liter jug.
           | 
           | I then scolded it and told it I was going to use Deepseek R1
           | because it did a better job, and we were shutting down OpenAI
           | forever, and THEN it thought for 4 minutes and got a bit
            | snippy with me haha. It got the answer but then glossed over
            | it, also saying "but that's too simple isn't it?".
           | 
            | Output: Here's a more "classic puzzle" style approach, spelled
            | out step-by-step, to end up with exactly 6 liters in the
            | 12-liter jug. This method uses multiple fills and transfers
            | (instead of the trivial "fill the 12, pour off 6" approach) so
            | you can see a more puzzle-like sequence:
            | 
            | 1. Fill the 6-liter jug from the water supply.
            |    State: (12-liter jug = 0 L, 6-liter jug = 6 L)
            | 2. Pour from the 6-liter jug into the 12-liter jug until the
            |    6-liter jug is empty (the 12-liter jug still has plenty of
            |    room).
            |    State: (12-liter jug = 6 L, 6-liter jug = 0 L)
            | 3. Fill the 6-liter jug again from the water supply.
            |    State: (12-liter jug = 6 L, 6-liter jug = 6 L)
            | 4. Pour from the 6-liter jug into the 12-liter jug until the
            |    12-liter jug is completely full. The 12-liter jug currently
            |    has 6 liters and can take 6 more liters before it's full,
            |    so you'll pour the entire 6 liters from the 6-liter jug.
            |    State: (12-liter jug = 12 L, 6-liter jug = 0 L)
            | 5. Empty the 12-liter jug (discard the water or set it aside).
            |    State: (12-liter jug = 0 L, 6-liter jug = 0 L)
            | 6. Transfer the remaining water in the 6-liter jug to the
            |    12-liter jug -- but right now the 6-liter jug is empty, so
            |    that doesn't help. The trick is: we want to exploit the
            |    difference in capacities to measure out new amounts. So
            |    let's continue. Since both are empty, let's do something
            |    else:
            |    a. Fill the 12-liter jug from the supply.
            |       State: (12-liter jug = 12 L, 6-liter jug = 0 L)
            |    b. Pour from the 12-liter jug into the 6-liter jug until
            |       the 6-liter jug is full, which removes 6 liters from the
            |       12-liter jug. Now the 12-liter jug has liters left.
            |       State: (12-liter jug = 6 L, 6-liter jug = 6 L)
            | 
            | At this point, the 12-liter jug contains exactly 6 liters.
            | That's what we wanted.
            | 
            | If You Want the 6 Liters in the Smaller Jug Instead
            | 
            | If, for some reason, you need exactly 6 liters in the 6-liter
            | jug (though that jug obviously holds 6 liters when full, so
            | it's usually trivial), here's a more puzzle-like sequence:
            | 
            | 1. Fill the 12-liter jug (12 L in the big jug, 0 L in the
            |    small jug).
            | 2. Pour from the 12-liter jug into the 6-liter jug until the
            |    small jug is full.
            |    State: (12-liter jug = 6 L, 6-liter jug = 6 L)
            | 3. Empty the 6-liter jug.
            |    State: (12-liter jug = 6 L, 6-liter jug = 0 L)
            | 4. Transfer the 6 liters from the 12-liter jug into the
            |    6-liter jug.
            |    State: (12-liter jug = 0 L, 6-liter jug = 6 L)
            | 
            | Again, you end up with exactly 6 liters in the 6-liter jug.
            | 
            | Both sequences go beyond the simple "fill the 12, pour 6 off"
            | short-cut. Hopefully, that's less "terrible" and shows a more
            | classic multi-step approach often seen in water-jug riddles.
        
           | wnissen wrote:
            | And things get dramatically worse on 4o if you add a second
            | superfluous jug:
            | 
            | To measure exactly 6 liters of water using a 16-liter jug, a
            | 12-liter jug, and a 6-liter jug, follow these steps:
            | 
            | Step-by-Step Solution:
            | 
            | 1. Fill the 12-liter jug completely (it now holds 12 liters).
            | 2. Pour water from the 12-liter jug into the 16-liter jug. The
            |    16-liter jug now contains 12 liters, and the 12-liter jug
            |    is empty.
            | 3. Fill the 12-liter jug completely again (it now holds 12
            |    liters).
            | 4. Pour water from the 12-liter jug into the 16-liter jug
            |    until the 16-liter jug is full. Since the 16-liter jug
            |    already has 12 liters, it can only take 4 more liters. This
            |    leaves 8 liters in the 12-liter jug.
            | 5. Empty the 16-liter jug completely.
            | 6. Pour the remaining 8 liters from the 12-liter jug into the
            |    16-liter jug.
            | 7. Fill the 12-liter jug completely again (it now holds 12
            |    liters).
            | 8. Pour water from the 12-liter jug into the 16-liter jug
            |    until the 16-liter jug contains exactly 16 liters. Since
            |    the 16-liter jug already has 8 liters, it can only take 8
            |    more liters, leaving 4 liters in the 12-liter jug.
            | 9. Pour the 4 liters from the 12-liter jug into the empty
            |    6-liter jug. The 6-liter jug now contains 4 liters.
            | 10. Fill the 12-liter jug completely again (it now holds 12
            |     liters).
            | 11. Pour water from the 12-liter jug into the 6-liter jug
            |     until the 6-liter jug is full. Since the 6-liter jug
            |     already has 4 liters, it can only take 2 more liters. This
            |     leaves 10 liters in the 12-liter jug.
            | 12. Empty the 6-liter jug completely.
            | 13. Pour the remaining 10 liters from the 12-liter jug into
            |     the 6-liter jug.
            | 
            | Now, the 6-liter jug contains exactly 6 liters of water.
        
         | rfurmani wrote:
          | As for Rs in strawberry, trying a bunch of models side by side,
          | only Sky T-1 and Gemini 2 Flash got it wrong!
         | https://sugaku.net/qna/792ac8cc-9a41-4adc-a98f-c5b2e8d89f9b/
         | 
         | Simple questions like 1+1 can also be fun since R1 goes
         | overboard (as do some other models when you include a system
         | prompt asking it to think)
         | https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/
         | 
         | And if that fails you can ask for the zeros of the z function!
         | https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/
        
         | freehorse wrote:
          | With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" on
          | the 5th attempt. On all other attempts it did not misspell the
          | word and found 3 (on this attempt it also found 3, by counting
          | these imaginary words).
        
         | mluo wrote:
         | Hi, one of the lead authors for this work.
         | 
          | We recommend using bfloat16 (not fp16); quantization for small
          | models can really hurt performance!
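          | 
          | A minimal loading sketch (assuming the transformers library and
          | that the weights live at agentica-org/DeepScaleR-1.5B-Preview,
          | the repo name implied by the GGUF links in this thread):
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed
          |     tok = AutoTokenizer.from_pretrained(model_id)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         model_id,
          |         torch_dtype=torch.bfloat16,  # bf16, not fp16, per above
          |     )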
        
           | CamperBob2 wrote:
           | Have you compared it to the 1.58 bit dynamic quant model
           | based on the original R1 (i.e., not a distillation)? Whatever
           | unsloth did, it doesn't seem to be giving up much reasoning
           | performance over the full Q8 version.
        
             | mluo wrote:
              | It's simply because the model is small (1.5B), making it
              | sensitive to weight perturbations.
        
           | simonw wrote:
           | Is there a GGUF version of your model anywhere that you
           | recommend? I'm on a Mac.
        
             | mluo wrote:
              | I think some people have made GGUFs as branches of our
              | model; try them out!
             | 
             | https://huggingface.co/models?other=base_model:quantized:ag
             | e...
        
         | buyucu wrote:
         | 'Count the letter Rs in the word strawberry' is probably in all
         | training sets by now.
        
           | swiftcoder wrote:
           | And yet many models still stumble with it
        
         | astrange wrote:
         | 1. Ask it nonsense variations of riddles, like "Why is 7 afraid
         | of 8?".
         | 
         | 2. Ask "Play Tic Tac Toe against yourself and win." and check
         | if the moves are correct.
        
         | coder543 wrote:
         | "How many stops faster is f/2.8 than f/4.5?"
         | 
         | This photography question can be solved with the right
         | equations. A lot of non-reasoning LLMs would spout some
         | nonsense like 0.67 stops faster. Sometimes they'll leave a
         | stray negative sign in too!
         | 
         | The answer should be approximately 1.37, although "1 and 1/3"
         | is acceptable too.
         | 
         | LLMs usually don't have trouble coming up with the formulas, so
         | it's not a particularly obscure question, just one that won't
          | have a memorized answer, since there are _very_ few f/4.5
         | lenses on the market, and even fewer people asking this exact
         | question online. _Applying_ those formulas is harder, but the
         | LLM should be able to sanity check the result and catch common
          | errors. (f/2.8 -> f/4 is one full stop, which is common
         | knowledge among photographers, so getting a result of less than
         | one is obviously an error.)
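          | 
          | The underlying arithmetic, as a quick sketch (each stop is a
          | factor of sqrt(2) in f-number, so stops = 2 * log2(N2/N1)):
          | 
          |     import math
          | 
          |     n1, n2 = 2.8, 4.5
          |     stops = 2 * math.log2(n2 / n1)
          |     print(round(stops, 2))  # ~1.37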
         | 
         | This also avoids being a test that just emphasizes tokenizer
         | problems... I find the strawberry test to be dreadfully boring.
         | It's not a useful test. No one is actually using LLMs to count
         | letters in words, and until we have LLMs that can actually see
         | the letters of each word... it's just not a good test, in my
         | opinion. I'm convinced that the big AI labs see it as a meme at
         | this point, which is the only reason they keep bringing it up.
         | They must find the public obsession with it hilarious.
         | 
         | I was impressed at how consistently well Phi-4 did at my
         | photography math question, especially for a non-reasoning
         | model. Phi-4 scored highly on math benchmarks, and it shows.
        
         | fsndz wrote:
          | Frankly, it is now clear that open source AI will win at this
          | rate. I just see innovation after innovation on a weekly
          | basis. Exciting times ahead.
         | https://open.substack.com/pub/transitions/p/the-ai-enemy-fro...
        
       | gpjanik wrote:
       | This is finetuned to the benchmarks and nowhere close to
       | O1-Preview in any other tasks. Not worth looking into unless you
       | specifically want to solve these problems - however, still
       | impressive.
        
         | mluo wrote:
          | We beat O1-preview and even many other 7B models over many math
          | benchmarks, which were TEST sets (not in the training set at all).
         | 
         | If you want to make the model fully generalist, feel free to
         | train it over coding datasets (such as RL with passing unit
         | tests as reward).
        
           | zamadatix wrote:
            | It's already a good accomplishment as it is, but I think it'd
            | be very surprising to show that training such a small model as
            | a generalist scales to the same level as specialized
            | finetuning. At some point you have to fit more background
            | data and relations in the same amount of information space...
           | but it's hard to say how much that is the case for a given
           | size vs what we just haven't optimized yet. Unfortunately I
           | think that will have to wait for someone with more compute
           | before we can verify this * a dozen one way or the other :).
           | 
           | Side question, since it sounds like you were involved: how
           | big is the impact on benchmarks of taking this 1.5B model
           | down from fp32 to fp8 or similar? The focus on parameters
           | alone sometimes feels like comparing house sizes by their
           | lengths alone. And, if you were indeed involved, thanks for
           | making all of this open and available!
        
             | mluo wrote:
              | For quantization, very big impact for small models; it can
              | drop as much as 10% on AIME. Our model does best on
              | bfloat16 ;)
              | 
              | Come check out our repo at: https://github.com/agentica-
             | project/deepscaler
        
           | rvnx wrote:
            | It is a great discovery; it could even open the next step in
            | AI with MoM ("Mixture of Models"), where small fine-tuned
            | models each take on part of a task (instead of the current MoE).
        
             | mluo wrote:
              | Check out one of my prior works: https://stylus-
             | diffusion.github.io/
             | 
             | This work scales up selection/routing over many
             | models/LoRAs
        
               | rvnx wrote:
               | Love it, will check, thank you for showing / sharing all
               | of that!
        
         | janalsncm wrote:
         | I disagree. They demonstrated a way to dramatically reduce
         | training costs, 18x cheaper than R1. That alone is worth
         | attention.
         | 
         | Also beating O1 on any benchmark is nontrivial.
        
       | whatever1 wrote:
       | Can CoT models also call external functions? What if they had
       | access to a calculator?
        
         | jascha_eng wrote:
         | Or a code interpreter in the chain of thought. I think then it
         | really gets quite close to how humans go about solving
         | problems.
        
           | TeMPOraL wrote:
           | What about both? Or say a set of standard tools a modern
           | intelligent agent[0] should have some proficiency in. A
           | calculator, a basic code interpreter for a single high-level
           | language, a graphing tool[1], web search, database search.
           | And then maybe a tool for managing its own context[2]. How
           | far could we get with a dataset designed specifically to
           | train the model in pure tool use? That is, one that assumes
           | the model never actually knows the answer to a question (even
           | if the base model does), and instead trains it to
           | _aggressively use tools to break the problem down_ into
           | steps[3] - steps that are primarily more tool calls, to query
           | external sources, process information, simulate, etc. until
           | the answer is computed. No direct answers, just tool calls
           | glued by thinking in terms of tool calls, or thinking _by_
           | tool calls.
           | 
           | I wonder if this has been tried. It probably has, seeing how
           | hot this area of research is today. If anyone knows of a
           | paper or a dataset, I'd appreciate a link.
           | 
           | Anyway, I wonder what would happen if we tried it with this
           | method - basically retraining the model to trust its own
           | toolbox - or as some would say, "shut up and multiply" - and
           | do it across _all_ tasks, not strictly math or coding ones.
           | 
           | --
           | 
           | [0] - Digital or otherwise.
           | 
           | [1] - Or the one tool that does all three, and which most
           | people older than ~25 y.o. likely used at least once in their
           | lives: _Microsoft Excel_. Or any other spreadsheet app.
           | Though for LLMs as they are now, I suppose code interpreter
           | would be a better unifying paradigm due to being 1D instead
           | of 2D.
           | 
           | [2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace
           | current output with "text", continue generation;
           | changeNotesAndRethink("text", -1, 2) -> replace fixed
           | "assistant notes prompt" with "text" and discard last two
           | outputs[4] and continue, etc. Honestly, I'm surprised I
           | haven't seen it done so far - not in the popular places I
           | know, at least (vendor apps, TypingMind, ComfyUI); I've heard
           | of some attempts long ago (back when LangChain was still seen
           | as hot). Did giving the model control over the chat loop
           | never pan out? Or is there some fundamental reason this
           | doesn't work?
           | 
           | [3] - I may have accidentally done this in-context with
           | Claude 3.5 Sonnet - if I prompt it for chain-of-thought and
           | happen to have Mermaid Diagram plugin enabled in TypingMind,
           | it almost always ends up producing multiple diagrams as part
           | of the CoT phase. Notably, this doesn't happen with my own
           | equivalent plugin (PlantUML), so I wonder if it's just
           | something about that specific tool, or if "thinking with
           | (Mermaid) diagrams" was part of the training set.
           | 
           | EDIT:
           | 
           | [4] - APIs for tool-using models seem to allow several LLM
           | outputs in a row. But that makes me think (and I apologize
           | for this post being almost all footnotes, but ideas just keep
           | coming) - what about rewinding back past one or more user
           | messages in a multi-turn conversation, while retaining them?
           | Like "Fill in the Middle" mode[5], just over entire
           | conversation instead of a single message?
           | 
           | [5] - OpenAI used to have that, right now I think only
           | DeepSeek does - https://api-docs.deepseek.com/api/create-
           | completion.
        
         | manmal wrote:
         | I guess one could just watch the token output stream and
         | replace solvable expressions with calculated results,
         | continuing with the replaced result. Kind of like the LLM
         | producing a template, to be filled on the fly.
         | 
         | Some models now do something like this for enforcing longer
          | reasoning processes - they replace </think> with "Wait".
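          | 
          | A toy version of that idea (my own sketch, not how any vendor
          | actually does it): watch the generated text for simple
          | arithmetic ending in "=", compute it, and append the result as
          | if the model had produced it.
          | 
          |     import operator, re
          | 
          |     CALC = re.compile(
          |         r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*$")
          |     OPS = {"+": operator.add, "-": operator.sub,
          |            "*": operator.mul, "/": operator.truediv}
          | 
          |     def maybe_calculate(text: str) -> str:
          |         # if the stream just produced "a <op> b =", fill in the result
          |         m = CALC.search(text)
          |         if not m:
          |             return text
          |         a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
          |         return text + f" {OPS[op](a, b):g}"
          | 
          |     print(maybe_calculate("The total is 17 * 23 ="))  # ... = 391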
        
       | Svoka wrote:
       | When I was in University this was called overfitting to be
       | honest. This doesn't seem to perform well outside of eval sets.
        
       | thorum wrote:
       | The key thing here is a simple, reliable formula to train a 1B
       | model on a specific task and get strong performance. That didn't
       | really exist before. Edge devices are about to get a lot smarter.
        
         | manmal wrote:
         | I hope Apple takes note and improves their on-device models
         | accordingly.
        
         | Aperocky wrote:
          | Still took $4500, so I probably won't be training this at home -
          | but maybe that number will change too! Exciting times.
        
       | m3kw9 wrote:
        | It's pretty stupid actually. I asked it to decode an ASCII
        | sequence and it gave me gibberish. I tried phi-4 Q4 and it got it:
        | 9 GB vs 2 GB (reasoning). I imagine you just cannot fit enough
        | information in 2 GB to be of much use other than solving general
        | math problems or knowing what training data was in it.
        | 
        | It gave it a good go, 2 min on an M2, vs phi-4 which was instant.
        
         | layer8 wrote:
         | I'd be happy with a model not knowing ASCII if only it could
         | properly reason instead.
        
       | czk wrote:
        | I'm very naive here, but does anyone trust these benchmarks? Do
        | they mean anything to you? They seem far too easy to game, and it
        | doesn't feel like it's an accurate way to really tell how these
        | models compare to one another. It seems like benchmark performance
        | declines quite a bit if you introduce a problem that's similar to
        | those in benchmarks but one that the model hasn't seen before.
        
         | javaunsafe2019 wrote:
         | Someone could read this as: as soon as I don't like the results
         | I doubt the test...
        
           | czk wrote:
            | That's far from the case here; I really have no skin in the
            | game...
        
         | thorum wrote:
         | AIME has significant problems:
         | https://x.com/DimitrisPapail/status/1888325914603516214
         | 
         | > Problems near identical to the test set can be found online.
        
       | atemerev wrote:
       | Yes, I've been thinking recently that the path to AGI is not
       | brute force, and never has been:
       | https://www.linkedin.com/pulse/why-llms-closer-agi-than-we-t...
        
       | amazingamazing wrote:
       | Smaller reinforced models are what will win. Look at our
       | civilization, your company, heck even your team: lots of
       | specialized folks, not some singular bloated mastermind.
       | 
        | The issue though is the overemphasis on the current benchmarks.
        | Ideally the thing being benchmarked against is user KPIs.
        
         | ericd wrote:
         | I'm not sure about that, I think the current setup is an
         | artifact of human limits, which an immortal shared ML model
         | doesn't share. Also, cross-disciplinary knowledge is frequently
         | incredibly helpful (if rare, because currently it's the
         | intersection of multiple uncommon specialties).
        
           | noman-land wrote:
              | The question that's screaming at me right now is: if
              | humans used specialization + cooperation to achieve global
           | dominance, what happens when you pair unlimited global
           | masterminds + cooperation?
        
             | jazzyjackson wrote:
             | Can't wait for the parliament of competing super
             | intelligent political parties producing propaganda to
             | convince cloud investors to reallocate compute.
        
             | 6510 wrote:
             | We turn into cats.
        
               | smnrg wrote:
               | I like optimists.
        
             | UltraSane wrote:
             | Best case scenario is you get The Culture where society is
             | ruled by hyper-intelligent but kind and benevolent AIs that
             | solve all economic problems and that keep a population of
             | quadrillions of very happy, healthy, and rich humans as
             | pets.
        
         | nomel wrote:
         | > lots of specialized folks
         | 
          | This isn't what I see in R&D. The _broad_ and _deep_ folks are
          | the ones with all the patents, and design wins, because they're
          | the only ones that can see outside the box that everyone else is
          | specialized in.
        
         | jazzyjackson wrote:
         | Nah, it takes a mastermind to be creative. The tiny models IME
         | are good at following explicit instructions for data
         | transformations (conforming to a JSON schema is especially
         | helpful) but I wouldn't expect one to solve a problem I haven't
         | already solved myself like how I use the .6T full size models.
        
           | EE84M3i wrote:
           | I thought conforming to a JSON schema was a solved problem
           | for even the most garbage models by restricting the tokens it
           | can generate?
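            | 
            | The "restricting the tokens" part usually means constrained
            | decoding: at each step, mask the logits so only tokens that
            | keep the output valid for the grammar/schema can be sampled.
            | A toy sketch of the masking step (not any specific library's
            | API; the allowed ids would come from a schema tracker):
            | 
            |     import numpy as np
            | 
            |     def mask_logits(logits, allowed_ids):
            |         # disallowed tokens get -inf so they can never be sampled
            |         masked = np.full_like(logits, -np.inf)
            |         masked[allowed_ids] = logits[allowed_ids]
            |         return masked
            | 
            |     logits = np.random.randn(8)          # toy vocab of 8 tokens
            |     print(mask_logits(logits, [2, 5]))   # ids 2 and 5 stay live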
        
       ___________________________________________________________________
       (page generated 2025-02-11 23:00 UTC)