[HN Gopher] Skywork-OR1: new SOTA 32B thinking model with open w...
___________________________________________________________________
Skywork-OR1: new SOTA 32B thinking model with open weight
Author : naomiclarkson
Score : 134 points
Date : 2025-04-13 14:42 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| naomiclarkson wrote:
| github repo: https://github.com/SkyworkAI/Skywork-OR1
|
| blog: https://capricious-hydrogen-41c.notion.site/Skywork-Open-
| Rea...
|
| huggingface: https://huggingface.co/collections/Skywork/skywork-
| or1-67fa1...
| scribu wrote:
| From their Notion page:
|
| > Skywork-OR1-32B-Preview delivers the 671B-parameter Deepseek-R1
| performance on math tasks (AIME24 and AIME25) and coding tasks
| (LiveCodeBench).
|
| Impressive, if true: much better performance than the vanilla
| distills of R1.
|
| Plus it's a fully open-source release (including data selection
| and training code).
| byefruit wrote:
| > Both of our models are trained on top of DeepSeek-R1-Distill-
| Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.
|
| Not to take away from their work but this shouldn't be buried at
| the bottom of the page - there's a gulf between completely new
| models and fine-tuning.
| adamkochanowicz wrote:
| Also, am I reading that right? They trained it not only on top
| of another model, and not only on one that is already distilled
| from another model, but on one that is much smaller (7B)?
| rahimnathwani wrote:
| They took the best available models for the architecture they
| chose (in two sizes), and fine tuned those models with
| additional training data. They don't say where they got that
| training data, or what combo of SFT and/or RLHF they used.
| It's likely that the training data was generated by larger
| models.
| israrkhan wrote:
| Agreed. Also, their name makes it seem like it is a totally new
| model.
|
| If they needed to assign their own name to it, they could at
| least have included the parent (and grandparent) model names in
| the name.
|
| Just like the name DeepSeek-R1-Distill-Qwen-7B clearly says
| that it is a distilled Qwen model.
| qeternity wrote:
| DeepSeek probably would have done this anyway, but they did
| release a Llama 8B distillation and the Meta terms of use
| require any derivative works to have Llama in the name. So it
| also might have just made sense to do for all of them.
|
| Otoh, there aren't many frontier labs that have actually done
| finetunes.
| diggan wrote:
| > the Meta terms of use require any derivative works to
| have Llama in the name
|
| Technically it requires the derivatives to _begin_ with
| "llama". So "DeepSeek-R1-Distill-Llama-8B" isn't OK by the
| license, while "Llama-3_1-Nemotron-Ultra-253B-v1" would be
| OK.
|
| > [...] If you use the Llama Materials or any outputs or
| results of the Llama Materials to create, train, fine tune,
| or otherwise improve an AI model, which is distributed or
| made available, you shall also include "Llama" at the
| beginning of any such AI model name.
|
| I've previously written a summary that includes all parts
| of the license that I think others are likely to have
| missed: https://notes.victor.earth/youre-probably-breaking-
| the-llama...
| GodelNumbering wrote:
| This has been happening a lot on r/localLlama for a few months
| now: big headline claims followed by "oh yeah, it's a finetune"
| lumost wrote:
| I suspect we'll see a lot of variations on this: with the open
| models catching up to SOTA and the foundation models being
| relatively static, there will be many new SOTAs built off of
| existing foundation models.
|
| How many of the latest databases are Postgres forks?
| rubymamis wrote:
| I tend to prefer running non-thinking models locally, since
| they output the result significantly faster.
| nico wrote:
| Any specific model recommendations for running locally?
|
| Also, what tasks are you using them for?
| genewitch wrote:
| Phi 4. It's fast and reasonable enough, but with local models
| you have to know what you want to do. If you want a chatbot,
| you use something with Hermes tunes; if you want code, you want
| a coder model - a lot of people like the DeepSeek distill of
| Qwen instruct for coding.
|
| There's no local equivalent to "does everything kinda well"
| like ChatGPT or Gemini, except maybe the 70B models and larger,
| but those are slow without datacenter cards with enough RAM to
| hold them.
|
| I asked this very question just a day or two ago, because I put
| a machine with a 3060 12GB _back_ together and wondered what
| SOTA was at that amount of RAM.
|
| If you use LM Studio, it will auto-pick which quantized version
| to download, but you can pick a larger quant if you want: you
| choose a model and a parameter size, and it will generally
| choose the "best" quantization for your hardware.
| nico wrote:
| Thank you for the insightful reply
|
| > There's no equivalent to "does everything kinda well"
| like chatgpt or Gemini on local, except maybe the 70B and
| larger, but those are slow
|
| Is there something like a "prompt router" that can
| automatically decide which model to use based on the type of
| prompt/task?
| tough wrote:
| there's RouteLLM: https://github.com/lm-sys/RouteLLM
|
| nvidia has LLMRouter https://build.nvidia.com/nvidia/llm-
| router
|
| llama-index also supports routing https://docs.llamaindex
| .ai/en/stable/examples/low_level/rout...
|
| semantic router seems interesting
| https://github.com/aurelio-labs/semantic-router/
|
| you could also just use langchain to route https://jimmy-
| wang-gen-ai.medium.com/llm-router-in-langchain...
|
| interesting paper PickLLM: Context-Aware RL-Assisted
| Large Language Model Routing
|
| https://arxiv.org/abs/2412.12170
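|
| For a rough idea, a minimal keyword router (my own sketch, not
| from any of the projects above) could look like this, assuming
| a local OpenAI-compatible endpoint such as the ones llama.cpp's
| server, Ollama and LM Studio expose; the model names are
| placeholders:
|
|   import json
|   import urllib.request
|
|   # Placeholder model names; use whatever you have locally.
|   ROUTES = {
|       "code": "qwen2.5-coder-32b-instruct",
|       "think": "deepseek-r1-distill-qwen-32b",
|       "chat": "phi-4",
|   }
|
|   def pick_model(prompt: str) -> str:
|       """Crude heuristic routing: keywords, no classifier."""
|       p = prompt.lower()
|       code_words = ("def ", "class ", "bug", "compile")
|       think_words = ("prove", "step by step", "how many")
|       if any(w in p for w in code_words):
|           return ROUTES["code"]
|       if any(w in p for w in think_words):
|           return ROUTES["think"]
|       return ROUTES["chat"]
|
|   def ask(prompt, base="http://localhost:11434/v1"):
|       """POST to an OpenAI-compatible local server."""
|       body = json.dumps({
|           "model": pick_model(prompt),
|           "messages": [{"role": "user", "content": prompt}],
|       }).encode()
|       req = urllib.request.Request(
|           base + "/chat/completions", data=body,
|           headers={"Content-Type": "application/json"})
|       with urllib.request.urlopen(req) as resp:
|           return json.load(resp)["choices"][0]["message"]["content"]
|
|   print(ask("Why does this Python function return None?"))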
| nyrikki wrote:
| I have a machine with 3x 1080 Tis that I do batch runs
| with: sending the question first to an LLM and then to an
| LRM, returning to review the faster results, and killing
| the job if they are acceptable. Ollama or just llama.cpp
| on podman makes this trivial.
|
| But _knowing_ which model will be better is impossible;
| only broad heuristics that may or may not be correct for
| any individual prompt can be used.
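|
| In code, that batch pattern is roughly the following (my
| own simplified sketch, assuming an OpenAI-compatible local
| endpoint and placeholder model names): fire both, take the
| fast answer if it is acceptable, otherwise wait for the
| reasoning model.
|
|   import json
|   import urllib.request
|   from concurrent.futures import ThreadPoolExecutor
|
|   def ask(model, prompt, base="http://localhost:11434/v1"):
|       body = json.dumps({"model": model, "messages": [
|           {"role": "user", "content": prompt}]}).encode()
|       req = urllib.request.Request(
|           base + "/chat/completions", data=body,
|           headers={"Content-Type": "application/json"})
|       with urllib.request.urlopen(req) as r:
|           return json.load(r)["choices"][0]["message"]["content"]
|
|   def answer(prompt, acceptable):
|       # Placeholder names: a plain LLM plus a slower LRM.
|       pool = ThreadPoolExecutor(max_workers=2)
|       fast = pool.submit(ask, "gemma-3-27b-it", prompt)
|       slow = pool.submit(ask, "deepseek-r1-distill-qwen-32b",
|                          prompt)
|       quick = fast.result()
|       if acceptable(quick):
|           # Best-effort "kill": a request already in flight
|           # keeps running; we just stop waiting for it.
|           pool.shutdown(wait=False, cancel_futures=True)
|           return quick
|       result = slow.result()
|       pool.shutdown()
|       return result
|
|   print(answer("Plan a 3-step refactor of a 2k-line module.",
|                acceptable=lambda t: "step" in t.lower()))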
|
| While there are better options if you were buying today,
| an old, out-of-date system with out-of-date GPUs works
| well in this batch model.
|
| gemma-3-27b-it-Q6_K_L works fine on these, and that, mixed
| with an additional submission to DeepSeek-R1-Distill-
| Qwen-32B, is absolutely fine on a system that would
| otherwise just be shut down.
|
| I have a very bright line about preventing inter-customer
| leakage risk, which may be irrational, but with that
| mixture I find I am better off looking at scholarly papers
| than trying the commercial models.
|
| My primary task is FP64-throughput limited, and the Titan
| V being ~6 times faster than the 4090 and ~5 times faster
| than the 5090 at FP64 is the only reason I don't have
| newer GPUs.
|
| You can add 4x 1080 Ti at a 200W power limit with common
| PSUs and get the memory, but beyond 3x 1080 Ti performance
| is limited by the PCIe bus.
|
| As they seem to sell for the same price, I would probably
| buy the Titan V today. But the point is that if you are
| fine with the even smaller models, you can run the queries
| in parallel or even cross-verify, which dramatically helps
| with planning tasks even with the foundational models.
|
| Series/parallel runs do a lot, and if you are using them
| for code, running a linter etc. on the structured output
| saves a lot of time when evaluating the multiple
| responses.
|
| No connection to them at all, but bartowski on Hugging
| Face puts a massive amount of time and effort into re-
| quantizing models.
|
| If you don't have a restriction like my FP64 need, you can
| get 70B models running on two 24GB GPUs without much
| 'cost' to accuracy.
|
| That would be preferable to a _router_, IMHO.
| genewitch wrote:
| > My primary task is FP64-throughput limited, and the
| Titan V being ~6 times faster than the 4090 and ~5 times
| faster than the 5090 at FP64 is the only reason I don't
| have newer GPUs.
|
| interesting. Very interesting. Why fp64 as opposed to
| BF16? different sort of model? i don't even know where to
| find fp64 models (not that i've looked).
|
| also Bartowski may be on huggingface but they're also
| part of the LM Studio group, and frequently chat on that
| discord. actually, at least 3 of the main model converter
| / quant people are on that discord.
|
| I haven't got two 24GB cards, yet, but maybe soon, with
| the way people are hogging the 5000 series.
|
| edit: i realize that they're increasing the marketing
| FLOPS by halving the precision; the current gen stuff is
| all "fast" at FP16 (or BF16, brain float 16-bit). So when
| nvidia finishes and releases a card with double the FLOPS
| at 8-bit, will that card be 8 times slower at fp64?
| nyrikki wrote:
| My primary task isn't ML, and 64bit is needed for
| numerical stability.
|
| For the Titan V, FP64 ran at 1/2 the FP32 rate; it was the
| only and last _consumer_ generation to have that.
|
| For the Titan RTX and newer NVIDIA cards the ratio drops
| sharply: 1/32 of FP32 on the Titan RTX, and typically 1/64
| on newer cards.
|
| So the Titan RTX, with 16 FP32 TFLOP/s, drops to 0.5 FP64
| TFLOP/s, while the Titan V, starting at 15 FP32 TFLOP/s,
| still has 7.5 FP64 TFLOP/s.
|
| The 5090 has 104.9 FP16/FP32 TFLOP/s, but only 1.64 FP64
| TFLOP/s.
|
| Basically Nvidia decided most people didn't need FP64,
| and chose to improve quantized performance instead.
|
| If you can run on a GPU, that Titan V has more FP64 FLOPS
| than even an AMD Ryzen Threadripper PRO 7995WX.
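|
| The same arithmetic in a couple of lines, using the
| figures quoted above:
|
|   # FP64 throughput = FP32 throughput x (FP64:FP32 ratio)
|   cards = {                 # (FP32 TFLOP/s, FP64 ratio)
|       "Titan V": (15.0, 1 / 2),
|       "Titan RTX": (16.0, 1 / 32),
|       "RTX 5090": (104.9, 1 / 64),
|   }
|   for name, (fp32, ratio) in cards.items():
|       print(f"{name:10s} ~{fp32 * ratio:5.2f} FP64 TFLOP/s")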
| rubymamis wrote:
| I mostly like to evaluate them whenever I ask a remote model
| (Claude 3.7, ChatGPT 4.5) something, to see how far they
| have progressed. From my tests, Qwen 2.5 Coder 32B is still
| the best local model for coding tasks. I've also tried Phi
| 4, Nemotron, Mistral Small, and QwQ 32B. I'm using a MacBook
| Pro M4 with 46GB of RAM.
| y2236li wrote:
| Interesting - matching the 671B-parameter model's performance
| with a 32B model feels like a significant step. It's a
| compelling contrast to the previous models and sets a strong
| benchmark. It's great that they're embracing open weights and
| data too - that's a crucial aspect for innovation.
| CharlesW wrote:
| > _It's great that they're embracing open [...] data too..._
|
| It could be, but as I type this it's currently vaporware:
| https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data
| chvid wrote:
| How is the score on AIME2024 relevant if AIME2024 has been used
| to train the model?
| nyrikki wrote:
| That is pretty much a universal problem. If you look at the
| problems anyone's model has solved, they are all well
| represented in the corpus.
|
| Remember that AIME is intended for high schoolers to solve in
| 3 hours with just pencils, erasers, rulers, and compasses.
| There is an entire industry providing supplementary material
| to prepare students for concepts that are not directly covered
| in typical school material.
|
| Since blogs and practice tests that pull from previous years
| make it into all the common sources like Stack Overflow/Stack
| Exchange, Reddit, etc., their explicitly stating that they
| trained on _AIME problems prior to 2024_ isn't much different.
|
| Basically expect any model to train on all AIME problems
| available before their knowledge cutoff date.
|
| To me, "How is the score on AIME2024 relevant" is because it is
| still not that high (from a practical consideration) despite
| directly training on it.
|
| Mixed in with all the models success falling dramatically with
| AIME2025 demonstrates the above, and hints that Rao's claim
| that _compiling_ in the verifier in training /scratch-
| space/prompt/fine-tuning etc... in a way the model can reliably
| access is what matters.
| ipsum2 wrote:
| Google Gemini (2.5 Pro) made the same "mistake": their data
| cutoff is January 2025, and AIME 2024 was in February 2024.
| qwertox wrote:
| I know one can rent consumer GPUs on the internet, where people
| like you and me offer their idle GPU time, for a price, to
| people who need it. They basically get a GPU-enabled VM on your
| machine.
|
| But is there something like a distributed network akin to
| SETI@home and the like which is free for training models? Where
| a consensus is made on which model is trained, and any
| derivative works must be open source, including all the tooling
| and the hosting platform? Would this even be possible, given
| that the latency between nodes is very high and the bandwidth
| limited?
| qeternity wrote:
| > Would this even be possible to do, given that the latency
| between nodes is very high and the bandwidth limited?
|
| Yes, it's possible. But no, it would not be remotely sensible
| given the performance implications. There is a reason why
| Nvidia is a multi trillion dollar company, and it's as much
| about networking as it is about GPUs.
| kmeisthax wrote:
| Back in the early days of AI art, before AI became way too
| cringe to think about, I wondered about this exact thing[0].
| The problem, I learned later, is that most AI training (and
| inference) is not limited so much by GPU _compute_ as by memory
| bandwidth and communication. A huge chunk of AI training is
| just figuring out how to minimize or hide the bottleneck that
| the inter-GPU interconnect imposes so you can scale to multiple
| cards.
|
| The BOINC model of distributed computing is to separate
| everything into little work units that can be sent out to
| multiple machines who then return a result that can be
| integrated back into the whole. If you were to train foundation
| models this way, you'd be packaging up the current model state
| n and a certain amount of trainset items into a work unit, and
| the result would be model weight offsets to be added back into
| model state n+1. But you wouldn't be able to benefit from any
| of the gradients calculated by other users until they submitted
| their work units and n+1 got calculated. So there'd be a lot of
| redundant work and training progress would slow down, versus a
| closely-coupled set of GPUs where they have enough bandwidth to
| exchange gradients every batch.
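|
| Schematically, that loop would look something like the toy
| sketch below (mine, not a real system; a linear model stands in
| for the network), with the coordinator stuck waiting on
| stragglers before it can produce state n+1:
|
|   import numpy as np
|
|   params = np.zeros(1_000)    # toy "model": a flat weight vector
|
|   def make_work_unit(state, shard):
|       # Package model state n plus a slice of the training set.
|       return {"state": state.copy(), "data": shard}
|
|   def volunteer_compute(unit, lr=0.01):
|       # Runs on a volunteer box; returns a weight *offset*.
|       x, y = unit["data"]
|       grad = x.T @ (x @ unit["state"] - y) / len(y)
|       return -lr * grad
|
|   def aggregate(state, deltas):
|       # Coordinator folds all returned offsets into state n+1.
|       return state + np.mean(deltas, axis=0)
|
|   rng = np.random.default_rng(0)
|   shards = [(rng.normal(size=(64, 1_000)), rng.normal(size=64))
|             for _ in range(8)]
|
|   for _ in range(3):          # each round = one global sync point
|       units = [make_work_unit(params, s) for s in shards]
|       deltas = [volunteer_compute(u) for u in units]  # slow, async
|       params = aggregate(params, deltas)
|       # Nobody saw anyone else's gradients until this point.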
|
| For the record, I never actually built a distributed training
| cluster. But once I learned what AI actually needs in order to
| go fast, I realized distributed training probably couldn't
| compete with just renting big GPUs.
|
| Most people do not have GPUs with enough RAM to do meaningful
| AI work. Generative AI models work autoregressively: that is,
| all of their weights are repeatedly used in a tight loop. In
| order for a GPU to provide a meaningful speedup it needs to
| have the whole model in GPU memory, because PCIe is slow (high
| latency) and also slow (low bandwidth). Nvidia knows this and
| that's why they are very stingy with GPU VRAM. Furthermore,
| training a model takes more memory than merely running it; I
| _believe_ gradient memory is something like the number of
| weights times your batch size. There are two ways I could see
| around this, both of which are going to cause further problems:
|
| - You could make 'mini' workunits where certain specific layers
| of the model are frozen and do not generate gradients. So you'd
| only train, say, 10% of the model at any one time. This is how
| you train very large models in centralized computing; you put a
| slice of the model on each GPU and exchange activations and
| gradients each batch. But we're on a distributed computer, so
| we don't have that kind of tight coupling, and we converge
| slower or not at all if we do this.
|
| - You can change the model architecture to load specific chunks
| of weights at each layer, with another neural network to decide
| what chunks to load for each token. This is known as a "Mixture
| of Experts" model and it's the most efficient way we know of to
| stream weights in and out of a GPU, but training has to be
| aware of it and you can't change the size of the chunks to fit
| the current GPU. MoE lets a model have access to a lot of
| weights, but the scaling is worse. e.g. an 8x44B parameter MoE
| model is NOT equivalent to a 352B non-MoE model. It also causes
| problems with training that you have to solve for: very common
| bits of knowledge will be replicated across chunks, and certain
| chunks can become favored by the model because they're getting
| more gradients, which causes them to be favored more, so they
| get more gradients.
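|
| A bare-bones illustration of that gating step (my sketch; real
| MoE layers use learned routers, per-token top-k selection and
| load-balancing losses to fight exactly the favoritism described
| above):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   d_model, n_experts, top_k = 64, 8, 2
|
|   # The "another neural network" picking chunks: a router matrix.
|   router_w = rng.normal(size=(d_model, n_experts)) * 0.02
|   # Each expert is a weight chunk only touched if the router
|   # picks it for this token.
|   experts = [rng.normal(size=(d_model, d_model)) * 0.02
|              for _ in range(n_experts)]
|
|   def moe_layer(token_vec):
|       scores = token_vec @ router_w          # router logits
|       chosen = np.argsort(scores)[-top_k:]   # top-k experts
|       gate = np.exp(scores[chosen])
|       gate /= gate.sum()                     # softmax over chosen
|       # Only top_k of the n_experts weight chunks are read here.
|       return sum(g * (token_vec @ experts[i])
|                  for g, i in zip(gate, chosen))
|
|   out = moe_layer(rng.normal(size=d_model))
|   print(out.shape)   # (64,): same width, only 2/8 chunks touched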
|
| [0] My specific goal was to train a txt2img model purely on
| public domain Wikimedia Commons data, which failed for
| different reasons having to do with the fact that most of AI is
| just dataset sorting.
___________________________________________________________________
(page generated 2025-04-13 23:00 UTC)