[HN Gopher] DeepScaleR: Surpassing O1-Preview with a 1.5B Model ...
___________________________________________________________________
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
Author : sijuntan
Score : 228 points
Date : 2025-02-11 19:59 UTC (3 hours ago)
(HTM) web link (pretty-radio-b75.notion.site)
(TXT) w3m dump (pretty-radio-b75.notion.site)
| daft_pink wrote:
| Would really love it if we could get uncensored models this way.
| xeckr wrote:
| It looks like we're moving in that direction.
| dang wrote:
| We changed the URL from https://github.com/agentica-
| project/deepscaler to the article that gives more background
| information (and also points to the repo).
|
| (Submitted title was "Replicating Deepseek-R1 for $4500: RL
| Boosts 1.5B Model Beyond o1-preview")
| mrcwinn wrote:
| Newb question. If the big boys are crawling, scraping, stealing
| the whole of the entire internet to train their models -- how are
| these OS alternatives aiming to outperform them? Is it as simple
| as... ALSO crawl, scrape, steal the whole of the entire internet?
| plufz wrote:
| I am no expert, but wasn't part of DeepSeek's success to use
| synthetic data, at least for code and math?
| sebzim4500 wrote:
| Yes, but in this case the RL stage doesn't actually need a huge
| amount of data. Most of the data is ingested in the pre-
| training phase, which in this case was done by Alibaba.
|
| There was a recent paper [1] that showed that for one
| particular model it was possible to get very good reasoning
| performance by finetuning on under a thousand examples of
| reasoning.
|
| [1] https://arxiv.org/abs/2502.03387
| smallmancontrov wrote:
| No, the secret is to take outputs from previous models and put
| them in a clothes dryer with starch to give them the gently
| worn appearance of organically collected data.
| yieldcrv wrote:
| No, fine tuning doesn't require that
|
| and models have been using synthetic data sets for like 2 years now
|
| as in, the huge models did the crawling and then the next
| models had conversations with the huge models
|
| oversimplified but basically the scraping ship has sailed,
| artists/writers will never be satisfied, and other companies
| bragging about using whatever aspirational goal post of
| licensed material are just doing it for brownie points
| mekaron7 wrote:
| Does this mean the model of LLM = search engine is also
| coming to an end? Ignoring things like functions and Google
| search capabilities, current ChatGPT models have quite a lot
| of niche knowledge, I assume from crawling the entire
| internet.
| kingo55 wrote:
| I speculate we'll just see classes of models developed.
| Ones with lots of knowledge (even domain specific) and
| models with reasoning capabilities.
| astrange wrote:
| Creative writing is also harmed by limiting the pretraining
| data.
|
| eg if I ask DeepSeek to write a story it tends to write the
| bullet points outline of a story. And some other models
| always give characters the same names.
| yieldcrv wrote:
| no, companies are in a quagmire of having to update them
| every quarter
|
| for now, they're not doing that but just releasing new
| models that also have updated information
|
| there is one group that's scraping the latest information
| and including it in their latest models, another group
| that's only doing synthetic data sets from the prior group's
| models, and every enthusiast everywhere making fine tuned
| LLMs with synthetic data sets and whatever niche they want
| simonw wrote:
| "Stealing the whole of the entire internet" isn't a
| particularly accurate mental model these days.
|
| That's more or less what people were doing back in 2023 -
| crawling _everything_ and dumping as much data in as possible.
|
| It's not a great strategy to build a best-in-class model
| though, as a lot of the internet is junk. The
| SolidGoldMagikarp/davidjl bug is the kind of thing that happens
| if you crawl all of https://www.reddit.com/r/counting/ for
| example: https://simonwillison.net/2023/Jun/8/gpt-
| tokenizers/#glitch-...
|
| These days model training labs are more selective about what
| they train on. Most of the game of training a great model comes
| down to selecting your training data. They still use _a lot_
| of unlicensed data but it's a bit more sophisticated than just
| dumping in everything they can find.
| nicce wrote:
| You can just "steal" all the books and paywalled research
| papers. That is a lot already.
| simonw wrote:
| Hence the recent story/scandal about Meta torrenting
| LibGen: https://arstechnica.com/tech-policy/2025/02/meta-
| torrented-o...
| littlestymaar wrote:
| You don't need to crawl and filter lots of internet data when
| you can disteal large models.
| janalsncm wrote:
| The answer is that crawling the whole internet is only for
| training a base model which is expensive and compute-intensive.
|
| R1 didn't train a base model, they performed additional steps
| on top of a previously-trained base model (V3). These guys are
| doing something similar.
| simonw wrote:
| Does anyone have any good simple prompts for testing new
| "reasoning" models like this one?
|
| "Count the letter Rs in the word strawberry" is a bit dull!
|
| I'm trying this one locally using Ollama and the smallest
| quantized GGUF version (769MB) I could find -
| https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... -
| I ran it like this: ollama run
| hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:IQ3_XXS
|
| Here's the answer it gave me:
| https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... -
| it got the correct answer after double-checking itself 9 times!
| And if you look at its thinking, it made a pretty critical error
| right at the start: "Strawberry" has the letters
| S, T, R, A, W, B, E, R, F, U, R, E. Let me count each 'R'
| as I go along.
|
| Hopefully the versions that aren't quantized that tiny do better
| than that.
| Palmik wrote:
| The negative quality impact of quantization is more pronounced
| for smaller models [0], so I'm surprised this tiny quant works
| at all.
|
| [0] or rather models closer to saturation, which is a function
| of model params and amount of training
| simonw wrote:
| Yeah I'm very impressed that the tiniest quant version
| appears partially usable... especially now I've replicated
| that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much
| larger one: https://news.ycombinator.com/item?id=43018494
| simonw wrote:
| ... well that's weird, I got exactly the same weird bug on the
| 7.1 GB F32 GGUF version:
| https://gist.github.com/simonw/58ff74a55b402dc55764a567b10ec...
| % ollama run hf.co/bartowski/agentica-
| org_DeepScaleR-1.5B-Preview-GGUF:F32 pulling manifest
| pulling 84159ad584aa... 100% 7.1 GB
| pulling 8aa4c0321ccd... 100% 493 B
| pulling c737442a210a... 100% 170 B
| pulling ed76df87b934... 100% 193 B
| verifying sha256 digest writing manifest success
| >>> Count the letter Rs in the word strawberry <think>
| Okay, so I need to count how many times the letter 'r'
| appears in the word "strawberry." Hmm, let's break this down
| step by step. First, I should probably spell out each letter
| of the word to make sure I don't miss any 'r's.
| Starting with "strawberry." Let me write it out: S, T, R, A,
| W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
| correctly. Actually, let's count each letter properly.
| The word is spelled as follows: S T R A W B E R F U R Y.
| Wait, no, that might be longer than needed. Let me write
| it out more accurately: The full word is
| "strawberry." So breaking it down: 1. S 2. T
| 3. R 4. A 5. W 6. B 7. E 8. R
| 9. F 10. U 11. R 12. Y
|
| Update: running the F16 model fixed that problem:
| ollama run hf.co/bartowski/agentica-
| org_DeepScaleR-1.5B-Preview-GGUF:F16
|
| See https://news.ycombinator.com/item?id=43017599#43018767 and
| https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...
| kristopolous wrote:
| makes me wonder if there's some exploitable data leak in a
| similar kind of formulation.
| evrimoztamur wrote:
| Is it a quantisation or tokenisation problem?
| simonw wrote:
| Having replicated it at F32 I now suspect tokenization.
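| A quick way to check that hypothesis (a minimal sketch, assuming
| the tokenizer is published alongside the base model at
| agentica-org/DeepScaleR-1.5B-Preview on Hugging Face):
|     from transformers import AutoTokenizer
|     tok = AutoTokenizer.from_pretrained(
|         "agentica-org/DeepScaleR-1.5B-Preview")
|     # see how the word splits into tokens, with and without
|     # a leading space (the usual in-sentence form)
|     print(tok.tokenize("strawberry"))
|     print(tok.tokenize(" strawberry"))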
| mluo wrote:
| Try bfloat16! We have a bug where the model was saved as
| fp32.
| simonw wrote:
| I just tried it with this 3.6GB F16 model:
| ollama run hf.co/bartowski/agentica-
| org_DeepScaleR-1.5B-Preview-GGUF:F16
|
| And this time it didn't get confused with the
| tokenization of strawberry! https://gist.github.com/simon
| w/9e79f96d69f10bc7ba540c87ea0e8...
| mluo wrote:
| Nice, very glad to see it works! Small models are very
| sensitive to the dtype :(
| throwup238 wrote:
| I always ask every model to implement a Qt QSyntaxHighlighter
| subclass for syntax highlighting code and a QAbstractListModel
| subclass that parses markdown into blocks - in C++, both
| implemented using tree sitter. It sounds like a coding
| problem but it's much more a reasoning problem of how to
| combine the two APIs and is out of band of the training data. I
| test it with multiple levels of prompt fidelity that I have
| built up watching the many mistakes past models have made and
| o3-mini-high and o1 can usually get it done within a few
| iterations.
|
| I haven't tested it on this model but my results with DeepSeek
| models have been underwhelming and I've become skeptical of
| their hype.
| greenavocado wrote:
| Give it a try with nvidia llama 3.1 nemotron 70b. It is the
| only model that can give useful Gstreamer code
| nialv7 wrote:
| this model is specifically trained for solving math problems,
| so ask it some math questions?
| simonw wrote:
| I'm lazy. Do you know of any good test math questions for a
| model of this size?
| CamperBob2 wrote:
| Try: Knowing that 1^3 + 2^3 + 3^3 + 4^3 +
| ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 +
| 6^3 + ... + 22^3 + 24^3?
|
| DeepSeek R1 has no trouble with that one.
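| For reference, the intended answer just factors 2^3 = 8 out of
| every term, since (2k)^3 = 8 * k^3:
|     2^3 + 4^3 + ... + 24^3 = 8 * (1^3 + 2^3 + ... + 12^3)
|                            = 8 * 6084
|                            = 48672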
| cheeko1234 wrote:
| I have two. One is a simple one that only deepseek R1 has
| passed (in my opinion):
|
| I have a 12 liter jug and a 6 liter jug. How do I get exactly 6
| liters of water?
|
| Answer (Deepseek): Fill the 6-liter jug completely to obtain
| exactly 6 liters of water.
|
| Every other LLM I've tried, including o3-mini-high: Fill the
| 12-liter jug completely. Pour it into the 6 liter jug.
|
| Although o3 did get it right in the reasoning: It seems like
| the user has a 12-liter jug and a 6-liter jug. The simplest
| answer is to just fill the 6-liter jug directly with water--
| done! But maybe there's a catch, like needing to use both jugs
| somehow.
|
| So it knows that the 12 liter jug is mentioned uselessly, but
| most LLMs HAVE to use the 12 liter jug since it's mentioned in
| the prompt.
|
| Another prompt I recently found via Perplexity (that DeepSeek
| got correct on the first try, but other LLMs can only get it
| right if I add: "Think step by step before answering.")
|
| A 67-year-old woman with congenital bicuspid aortic valve is
| admitted to the hospital because of a 2-day history of fever
| and chills. Her current medication is lisinopril. Physical
| examination reveals a temperature of 38.5°C (101.3°F),
| blood pressure of 110/70 mmHg, and a heart rate of 100/min.
| Cardiac auscultation demonstrates a grade 3/6 systolic ejection
| murmur at the right upper sternal border. Blood cultures are
| obtained, and empiric antibiotic therapy is initiated. Which of
| the following is the most likely causative organism? A)
| Staphylococcus aureus B) Streptococcus viridans C) Escherichia
| coli D) Pseudomonas aeruginosa E) Candida albicans
| rfurmani wrote:
| Wow, Gemini 2 has a 9 step process for the first question
| https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/
|
| > Here's how you can get exactly 6 liters of water using a
| 12-liter jug and a 6-liter jug:
|
| 1. Fill the 6-liter jug completely.
|
| 2. Pour the water from the 6-liter jug into the 12-liter jug.
| You now have 6 liters in the 12-liter jug.
|
| 3. Fill the 6-liter jug completely again.
|
| 4. Pour water from the 6-liter jug into the 12-liter jug
| until the 12-liter jug is full. Since the 12-liter jug
| already had 6 liters, you will pour 6 liters. This leaves you
| with 0 liters in the 6 liter jug.
|
| 5. Now empty the 12 liter jug.
|
| 6. Pour the water from the 6-liter jug into the 12-liter jug.
|
| 7. Fill the 6-liter jug completely again.
|
| 8. Pour the water from the 6-liter jug into the 12-liter jug
| until the 12-liter jug is full. Since the 12-liter jug
| already had 0 liters, you will pour 6 liters. This leaves you
| with 0 liters in the 6 liter jug.
|
| 9. There are 6 liters left in the 6 liter jug.
| sdwr wrote:
| It really loves the sound of its own voice!
| egorfine wrote:
| Deepseek just told me this:
|
| To measure exactly 6 liters using a 12-liter and a 6-liter
| jug:
|
| 1. *Fill the 12L jug completely.*
| 2. *Pour water from the 12L jug into the 6L jug until it's full.*
|
| After this process, the 12L jug will have exactly 6 liters
| left.
| DebtDeflation wrote:
| >Every other LLM I've tried, including o3-mini-high: Fill the
| 12-liter jug completely. Pour it into the 6 liter jug.
|
| Try it with a 12L jug and a 4L jug and ask for 4L. See if it
| tells you to just fill the 4L or to fill the 12L and pour
| into the 4L twice discarding both times and there will be 4L
| remaining in the 12L jug.
|
| Even though it's still technically correct, it demonstrates
| that there's no real "reasoning" happening, just regurgitation
| of training data.
| CamperBob2 wrote:
| (Shrug) R1 has no problem with that. To the extent it's
| confused, it is only because it is wondering if it's a
| trick question.
|
| There's 'mindless regurgitation' going on here, but not by
| the AI model.
| dgacmu wrote:
| Is the answer strep viridans?
| neodypsis wrote:
| Interestingly, o3-mini-high was correct when first thinking
| about it:
|
| > Okay, we're asked how to get exactly 6 liters of water
| using a 12-liter and a 6-liter jug. The immediate thought is
| to just fill the 6-liter jug, but that seems too simple,
| doesn't it? So maybe there's a trick here. Perhaps this is a
| puzzle where the challenge is to measure 6 liters with some
| pouring involved. I'll stick with the simple solution for now
| --fill the 6-liter jug and stop there.
| jazzyjackson wrote:
| I have to take all these comparisons with a heap of salt
| because no one bothers to run the test 20 times on each
| model to smooth out the probabilistic nature of the LLM
| landing on the right answer. There must be some fallacy for
| this, that you would sample once from each and declare a
| definitive winner, I see it all the time.
| wincy wrote:
| I just asked the first question to O1 Pro, it thought for 48
| seconds and messed up.
|
| It told me to fill the 12 liter jug then pour that into the 6
| liter jug.
|
| I then scolded it and told it I was going to use Deepseek R1
| because it did a better job, and we were shutting down OpenAI
| forever, and THEN it thought for 4 minutes and got a bit
| snippy with me haha, it got the answer but then glossed over
| it also said "but that's too simple isn't it?".
|
| Output: Here's a more "classic puzzle" style approach, spelled
| out step-by-step, to end up with exactly 6 liters in the
| 12-liter jug. This method uses multiple fills and transfers
| (instead of the trivial "fill the 12, pour off 6" approach) so
| you can see a more puzzle-like sequence:
| 1. Fill the 6-liter jug from the water supply. * State:
| (12-liter jug = 0 L, 6-liter jug = 6 L)
| 2. Pour from the 6-liter jug into the 12-liter jug until the
| 6-liter jug is empty (the 12-liter jug still has plenty of
| room). * State: (12-liter jug = 6 L, 6-liter jug = 0 L)
| 3. Fill the 6-liter jug again from the water supply. * State:
| (12-liter jug = 6 L, 6-liter jug = 6 L)
| 4. Pour from the 6-liter jug into the 12-liter jug until the
| 12-liter jug is completely full. * The 12-liter jug currently
| has 6 liters and can take 6 more liters before it's full, so
| you'll pour the entire 6 liters from the 6-liter jug. * State:
| (12-liter jug = 12 L, 6-liter jug = 0 L)
| 5. Empty the 12-liter jug (discard the water or set it aside).
| * State: (12-liter jug = 0 L, 6-liter jug = 0 L)
| 6. Transfer the remaining water in the 6-liter jug to the
| 12-liter jug-- but right now the 6-liter jug is empty, so that
| doesn't help. The trick is: we want to exploit the difference
| in capacities to measure out new amounts. So let's continue.
| Since both are empty, let's do something else:
| a. Fill the 12-liter jug from the supply. * State: (12-liter
| jug = 12 L, 6-liter jug = 0 L)
| b. Pour from the 12-liter jug into the 6-liter jug until the
| 6-liter jug is full, which removes 6 liters from the 12-liter
| jug. * Now the 12-liter jug has liters left. * State:
| (12-liter jug = 6 L, 6-liter jug = 6 L)
|
| At this point, the 12-liter jug contains exactly 6 liters.
| That's what we wanted.
|
| If You Want the 6 Liters in the Smaller Jug Instead
|
| If, for some reason, you need exactly 6 liters in the 6-liter
| jug (though that jug obviously holds 6 liters when full, so
| it's usually trivial), here's a more puzzle-like sequence:
| 1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small
| jug).
| 2. Pour from the 12-liter jug into the 6-liter jug until the
| small jug is full. * State: (12-liter jug = 6 L, 6-liter jug =
| 6 L)
| 3. Empty the 6-liter jug. * State: (12-liter jug = 6 L,
| 6-liter jug = 0 L)
| 4. Transfer the 6 liters from the 12-liter jug into the
| 6-liter jug. * State: (12-liter jug = 0 L, 6-liter jug = 6 L)
|
| Again, you end up with exactly 6 liters in the 6-liter jug.
|
| Both sequences go beyond the simple "fill the 12, pour 6 off"
| short-cut. Hopefully, that's less "terrible" and shows a more
| classic multi-step approach often seen in water-jug riddles.
| wnissen wrote:
| And things get dramatically worse on 4o if you add a second
| superfluous jug: To measure exactly 6
| liters of water using a 16-liter jug, a 12-liter jug, and a
| 6-liter jug, follow these steps: Step-by-Step
| Solution: Fill the 12-liter jug completely
| (it now holds 12 liters). Pour water from the
| 12-liter jug into the 16-liter jug. The 16-liter jug now
| contains 12 liters, and the 12-liter jug is empty.
| Fill the 12-liter jug completely again (it now holds 12
| liters). Pour water from the 12-liter jug into
| the 16-liter jug until the 16-liter jug is full. Since the
| 16-liter jug already has 12 liters, it can only take 4 more
| liters. This leaves 8 liters in the 12-liter jug.
| Empty the 16-liter jug completely. Pour the
| remaining 8 liters from the 12-liter jug into the 16-liter
| jug. Fill the 12-liter jug completely again (it
| now holds 12 liters). Pour water from the
| 12-liter jug into the 16-liter jug until the 16-liter jug
| contains exactly 16 liters. Since the 16-liter jug already
| has 8 liters, it can only take 8 more liters, leaving 4
| liters in the 12-liter jug. Pour the 4 liters
| from the 12-liter jug into the empty 6-liter jug. The 6-liter
| jug now contains 4 liters. Fill the 12-liter jug
| completely again (it now holds 12 liters). Pour
| water from the 12-liter jug into the 6-liter jug until the
| 6-liter jug is full. Since the 6-liter jug already has 4
| liters, it can only take 2 more liters. This leaves 10 liters
| in the 12-liter jug. Empty the 6-liter jug
| completely. Pour the remaining 10 liters from the
| 12-liter jug into the 6-liter jug. Now, the
| 6-liter jug contains exactly 6 liters of water.
| rfurmani wrote:
| As for Rs in strawberry, trying a bunch of models side by side,
| only Sky T-1 and Gemini 2 Flash got it wrong!
| https://sugaku.net/qna/792ac8cc-9a41-4adc-a98f-c5b2e8d89f9b/
|
| Simple questions like 1+1 can also be fun since R1 goes
| overboard (as do some other models when you include a system
| prompt asking it to think)
| https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/
|
| And if that fails you can ask for the zeros of the z function!
| https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/
| freehorse wrote:
| With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" on
| the 5th attempt. On all other attempts it did not misspell the
| word and found 3 (on that attempt it also found 3, by counting
| these imaginary words).
| mluo wrote:
| Hi, one of the lead authors for this work.
|
| We recommend using bfloat16 (not fp16); quantization for small
| models can really hurt performance!
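| For reference, loading it in bfloat16 with the Hugging Face
| transformers library looks roughly like this (a minimal sketch;
| the model id is assumed, adjust to whatever checkpoint you run):
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|     model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed id
|     tokenizer = AutoTokenizer.from_pretrained(model_id)
|     model = AutoModelForCausalLM.from_pretrained(
|         model_id, torch_dtype=torch.bfloat16)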
| CamperBob2 wrote:
| Have you compared it to the 1.58 bit dynamic quant model
| based on the original R1 (i.e., not a distillation)? Whatever
| unsloth did, it doesn't seem to be giving up much reasoning
| performance over the full Q8 version.
| mluo wrote:
| It's simply because the model is small (1.5B), making it
| sensitive to weight perturbations
| simonw wrote:
| Is there a GGUF version of your model anywhere that you
| recommend? I'm on a Mac.
| mluo wrote:
| Think there are some people who made GGUFs as branches of
| our model, try it out!
|
| https://huggingface.co/models?other=base_model:quantized:ag
| e...
| buyucu wrote:
| 'Count the letter Rs in the word strawberry' is probably in all
| training sets by now.
| swiftcoder wrote:
| And yet many models still stumble with it
| astrange wrote:
| 1. Ask it nonsense variations of riddles, like "Why is 7 afraid
| of 8?".
|
| 2. Ask "Play Tic Tac Toe against yourself and win." and check
| if the moves are correct.
| coder543 wrote:
| "How many stops faster is f/2.8 than f/4.5?"
|
| This photography question can be solved with the right
| equations. A lot of non-reasoning LLMs would spout some
| nonsense like 0.67 stops faster. Sometimes they'll leave a
| stray negative sign in too!
|
| The answer should be approximately 1.37, although "1 and 1/3"
| is acceptable too.
|
| LLMs usually don't have trouble coming up with the formulas, so
| it's not a particularly obscure question, just one that won't
| have a memorized answer, since there are _very_ few f/4.5
| lenses on the market, and even fewer people asking this exact
| question online. _Applying_ those formulas is harder, but the
| LLM should be able to sanity check the result and catch common
| errors. (f/2.8 -> f/4 is one full stop, which is common
| knowledge among photographers, so getting a result of less than
| one is obviously an error.)
|
| This also avoids being a test that just emphasizes tokenizer
| problems... I find the strawberry test to be dreadfully boring.
| It's not a useful test. No one is actually using LLMs to count
| letters in words, and until we have LLMs that can actually see
| the letters of each word... it's just not a good test, in my
| opinion. I'm convinced that the big AI labs see it as a meme at
| this point, which is the only reason they keep bringing it up.
| They must find the public obsession with it hilarious.
|
| I was impressed at how consistently well Phi-4 did at my
| photography math question, especially for a non-reasoning
| model. Phi-4 scored highly on math benchmarks, and it shows.
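| For anyone checking the arithmetic, it's a two-liner: light
| gathered scales with 1/N^2, so the stop difference between two
| f-numbers is log2((N2/N1)^2). A small sketch:
|     import math
|     def stops_faster(n1: float, n2: float) -> float:
|         # each stop doubles the light; light ~ 1/N^2
|         return math.log2((n2 / n1) ** 2)
|     print(stops_faster(2.8, 4.5))  # ~1.37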
| fsndz wrote:
| Frankly, it is now clear that open source AI will win at this
| rate. I just see innovations after innovations on a weekly
| basis. Exciting times ahead.
| https://open.substack.com/pub/transitions/p/the-ai-enemy-fro...
| gpjanik wrote:
| This is finetuned to the benchmarks and nowhere close to
| O1-Preview on any other tasks. Not worth looking into unless you
| specifically want to solve these problems - however, still
| impressive.
| mluo wrote:
| We beat O1-preview and even many other 7B models across many math
| benchmarks, which are test sets (not in the training set at all).
|
| If you want to make the model fully generalist, feel free to
| train it over coding datasets (such as RL with passing unit
| tests as reward).
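| A toy version of that reward (purely illustrative, not the
| actual code in our repo) just runs the candidate against its
| tests and scores pass/fail:
|     import os, subprocess, tempfile
|     def unit_test_reward(solution: str, tests: str) -> float:
|         with tempfile.TemporaryDirectory() as d:
|             path = os.path.join(d, "candidate.py")
|             with open(path, "w") as f:
|                 f.write(solution + "\n\n" + tests)
|             try:
|                 proc = subprocess.run(["python", path],
|                     capture_output=True, timeout=10)
|                 return 1.0 if proc.returncode == 0 else 0.0
|             except subprocess.TimeoutExpired:
|                 return 0.0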
| zamadatix wrote:
| It's already a good accomplishment as it is, but I think it'd be
| very surprising to show that training such a small model as a
| generalist scales to the same magnitude as specialized
| finetuning. At some point you have to fit more background
| data and relations in the same amount of information space...
| but it's hard to say how much that is the case for a given
| size vs what we just haven't optimized yet. Unfortunately I
| think that will have to wait for someone with more compute
| before we can verify this * a dozen one way or the other :).
|
| Side question, since it sounds like you were involved: how
| big is the impact on benchmarks of taking this 1.5B model
| down from fp32 to fp8 or similar? The focus on parameters
| alone sometimes feels like comparing house sizes by their
| lengths alone. And, if you were indeed involved, thanks for
| making all of this open and available!
| mluo wrote:
| For quantization, very big impact for small models, can
| drop as much as 10% on AIME. Our model does best on
| bfloat16 ;)
|
| Come check out our repo at: https://github.com/agentica-
| project/deepscaler
| rvnx wrote:
| It is a great discovery; it could even open up the next step in
| AI with MoM "Mixture of Models", where small fine-tuned models
| each take on part of a task (instead of the current MoE)
| mluo wrote:
| Check out one of my prior works: https://stylus-
| diffusion.github.io/
|
| This work scales up selection/routing over many
| models/LoRAs
| rvnx wrote:
| Love it, will check, thank you for showing / sharing all
| of that!
| janalsncm wrote:
| I disagree. They demonstrated a way to dramatically reduce
| training costs, 18x cheaper than R1. That alone is worth
| attention.
|
| Also beating O1 on any benchmark is nontrivial.
| whatever1 wrote:
| Can CoT models also call external functions? What if they had
| access to a calculator?
| jascha_eng wrote:
| Or a code interpreter in the chain of thought. I think then it
| really gets quite close to how humans go about solving
| problems.
| TeMPOraL wrote:
| What about both? Or say a set of standard tools a modern
| intelligent agent[0] should have some proficiency in. A
| calculator, a basic code interpreter for a single high-level
| language, a graphing tool[1], web search, database search.
| And then maybe a tool for managing its own context[2]. How
| far could we get with a dataset designed specifically to
| train the model in pure tool use? That is, one that assumes
| the model never actually knows the answer to a question (even
| if the base model does), and instead trains it to
| _aggressively use tools to break the problem down_ into
| steps[3] - steps that are primarily more tool calls, to query
| external sources, process information, simulate, etc. until
| the answer is computed. No direct answers, just tool calls
| glued by thinking in terms of tool calls, or thinking _by_
| tool calls.
|
| I wonder if this has been tried. It probably has, seeing how
| hot this area of research is today. If anyone knows of a
| paper or a dataset, I'd appreciate a link.
|
| Anyway, I wonder what would happen if we tried it with this
| method - basically retraining the model to trust its own
| toolbox - or as some would say, "shut up and multiply" - and
| do it across _all_ tasks, not strictly math or coding ones.
|
| --
|
| [0] - Digital or otherwise.
|
| [1] - Or the one tool that does all three, and which most
| people older than ~25 y.o. likely used at least once in their
| lives: _Microsoft Excel_. Or any other spreadsheet app.
| Though for LLMs as they are now, I suppose code interpreter
| would be a better unifying paradigm due to being 1D instead
| of 2D.
|
| [2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace
| current output with "text", continue generation;
| changeNotesAndRethink("text", -1, 2) -> replace fixed
| "assistant notes prompt" with "text" and discard last two
| outputs[4] and continue, etc. Honestly, I'm surprised I
| haven't seen it done so far - not in the popular places I
| know, at least (vendor apps, TypingMind, ComfyUI); I've heard
| of some attempts long ago (back when LangChain was still seen
| as hot). Did giving the model control over the chat loop
| never pan out? Or is there some fundamental reason this
| doesn't work?
|
| [3] - I may have accidentally done this in-context with
| Claude 3.5 Sonnet - if I prompt it for chain-of-thought and
| happen to have Mermaid Diagram plugin enabled in TypingMind,
| it almost always ends up producing multiple diagrams as part
| of the CoT phase. Notably, this doesn't happen with my own
| equivalent plugin (PlantUML), so I wonder if it's just
| something about that specific tool, or if "thinking with
| (Mermaid) diagrams" was part of the training set.
|
| EDIT:
|
| [4] - APIs for tool-using models seem to allow several LLM
| outputs in a row. But that makes me think (and I apologize
| for this post being almost all footnotes, but ideas just keep
| coming) - what about rewinding back past one or more user
| messages in a multi-turn conversation, while retaining them?
| Like "Fill in the Middle" mode[5], just over entire
| conversation instead of a single message?
|
| [5] - OpenAI used to have that, right now I think only
| DeepSeek does - https://api-docs.deepseek.com/api/create-
| completion.
| manmal wrote:
| I guess one could just watch the token output stream and
| replace solvable expressions with calculated results,
| continuing with the replaced result. Kind of like the LLM
| producing a template, to be filled on the fly.
|
| Some models now do something like this for enforcing longer
| reasoning processes - they replace </think> with "Wait".
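| A toy sketch of the first idea (purely illustrative): scan the
| streamed text for simple arithmetic like "23 * 17 =" and splice
| the computed result in before generation continues:
|     import re
|     def patch_arithmetic(text: str) -> str:
|         def solve(m):
|             a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
|             result = {"+": a + b, "-": a - b, "*": a * b}[op]
|             return f"{a} {op} {b} = {result}"
|         return re.sub(r"(\d+)\s*([+\-*])\s*(\d+)\s*=", solve, text)
|     print(patch_arithmetic("so 23 * 17 = which gives"))  # -> 391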
| Svoka wrote:
| When I was in University this was called overfitting to be
| honest. This doesn't seem to perform well outside of eval sets.
| thorum wrote:
| The key thing here is a simple, reliable formula to train a 1B
| model on a specific task and get strong performance. That didn't
| really exist before. Edge devices are about to get a lot smarter.
| manmal wrote:
| I hope Apple takes note and improves their on-device models
| accordingly.
| Aperocky wrote:
| Still took $4500, so I'll probably not be training this at home -
| but maybe that number will change too! Exciting times.
| m3kw9 wrote:
| It's pretty stupid actually, I asked it to decode an ASCII
| sequence and it gave me gibberish. I tried Phi-4 Q4 and it got it.
| 9 GB vs 2 GB (reasoning). I imagine you just cannot fit enough
| information in 2 GB to be of much use other than solving general
| math problems or knowing what training data was in it.
|
| It gave it a good go, 2 min on an M2, vs Phi-4 which was instant.
| layer8 wrote:
| I'd be happy with a model not knowing ASCII if only it could
| properly reason instead.
| czk wrote:
| I'm very naive here, but does anyone trust these benchmarks? Do
| they mean anything to you? They seem far too easy to game and it
| doesn't feel like it's an accurate way to really tell how these
| models compare to one another. Seems like benchmark performance
| declines quite a bit if you introduce a problem that's similar to
| those in benchmarks but one that the model hasn't seen before.
| javaunsafe2019 wrote:
| Someone could read this as: as soon as I don't like the results
| I doubt the test...
| czk wrote:
| That's far from the case here, I really have no skin in the
| game...
| thorum wrote:
| AIME has significant problems:
| https://x.com/DimitrisPapail/status/1888325914603516214
|
| > Problems near identical to the test set can be found online.
| atemerev wrote:
| Yes, I've been thinking recently that the path to AGI is not
| brute force, and never has been:
| https://www.linkedin.com/pulse/why-llms-closer-agi-than-we-t...
| amazingamazing wrote:
| Smaller reinforced models are what will win. Look at our
| civilization, your company, heck even your team: lots of
| specialized folks, not some singular bloated mastermind.
|
| The issue though is the overemphasis on the current benchmarks.
| Ideally the benchmarking would be against user KPIs.
| ericd wrote:
| I'm not sure about that, I think the current setup is an
| artifact of human limits, which an immortal shared ML model
| doesn't share. Also, cross-disciplinary knowledge is frequently
| incredibly helpful (if rare, because currently it's the
| intersection of multiple uncommon specialties).
| noman-land wrote:
| The question that's screaming at me right now is that if
| humans used specialization + cooperation to achieve global
| dominance, what happens when you pair unlimited global
| masterminds + cooperation?
| jazzyjackson wrote:
| Can't wait for the parliament of competing super
| intelligent political parties producing propaganda to
| convince cloud investors to reallocate compute.
| 6510 wrote:
| We turn into cats.
| smnrg wrote:
| I like optimists.
| UltraSane wrote:
| Best case scenario is you get The Culture where society is
| ruled by hyper-intelligent but kind and benevolent AIs that
| solve all economic problems and that keep a population of
| quadrillions of very happy, healthy, and rich humans as
| pets.
| nomel wrote:
| > lots of specialized folks
|
| This isn't what I see in R&D. The _broad_ and _deep_ folks are
| the ones with all the patents, and design wins, because they're
| the only ones that can see outside the box that everyone
| else is specialized in.
| jazzyjackson wrote:
| Nah, it takes a mastermind to be creative. The tiny models IME
| are good at following explicit instructions for data
| transformations (conforming to a JSON schema is especially
| helpful) but I wouldn't expect one to solve a problem I haven't
| already solved myself like how I use the .6T full size models.
| EE84M3i wrote:
| I thought conforming to a JSON schema was a solved problem
| for even the most garbage models by restricting the tokens it
| can generate?
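| The usual mechanism, as I understand it: at each decoding step,
| mask the logits so only tokens that keep the output valid under
| the grammar/schema can be sampled; llama.cpp grammars and
| libraries like Outlines work this way. A toy illustration:
|     import torch
|     def constrained_step(logits: torch.Tensor,
|                          allowed_ids: list[int]) -> int:
|         mask = torch.full_like(logits, float("-inf"))
|         mask[allowed_ids] = 0.0
|         # greedy pick among the allowed tokens only
|         return int(torch.argmax(logits + mask))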
___________________________________________________________________
(page generated 2025-02-11 23:00 UTC)