[HN Gopher] Skywork-OR1: new SOTA 32B thinking model with open w...
       ___________________________________________________________________
        
       Skywork-OR1: new SOTA 32B thinking model with open weight
        
       Author : naomiclarkson
       Score  : 134 points
       Date   : 2025-04-13 14:42 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | naomiclarkson wrote:
       | github repo: https://github.com/SkyworkAI/Skywork-OR1
       | 
       | blog: https://capricious-hydrogen-41c.notion.site/Skywork-Open-
       | Rea...
       | 
       | huggingface: https://huggingface.co/collections/Skywork/skywork-
       | or1-67fa1...
        
       | scribu wrote:
       | From their Notion page:
       | 
       | > Skywork-OR1-32B-Preview delivers the 671B-parameter Deepseek-R1
       | performance on math tasks (AIME24 and AIME25) and coding tasks
       | (LiveCodeBench).
       | 
       | Impressive, if true: much better performance than the vanilla
       | distills of R1.
       | 
       | Plus it's a fully open-source release (including data selection
       | and training code).
        
       | byefruit wrote:
       | > Both of our models are trained on top of DeepSeek-R1-Distill-
       | Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.
       | 
       | Not to take away from their work but this shouldn't be buried at
       | the bottom of the page - there's a gulf between completely new
       | models and fine-tuning.
        
         | adamkochanowicz wrote:
          | Also, am I reading that right? They trained it not only on
          | top of another model, not only one that is already distilled
          | from another model, but one that is much lower in parameter
          | count (7B)?
        
           | rahimnathwani wrote:
           | They took the best available models for the architecture they
           | chose (in two sizes), and fine tuned those models with
           | additional training data. They don't say where they got that
           | training data, or what combo of SFT and/or RLHF they used.
           | It's likely that the training data was generated by larger
           | models.
        
         | israrkhan wrote:
          | Agreed. Also, their name makes it seem like it is a totally
          | new model.
          | 
          | If they needed to assign their own name to it, at least they
          | could have included the parent (and grandparent) model names
          | in the name.
         | 
         | Just like the name DeepSeek-R1-Distill-Qwen-7B clearly says
         | that it is a distilled Qwen model.
        
           | qeternity wrote:
           | DeepSeek probably would have done this anyway, but they did
           | release a Llama 8B distillation and the Meta terms of use
           | require any derivative works to have Llama in the name. So it
            | also might have just made sense to do it for all of them.
           | 
           | Otoh, there aren't many frontier labs that have actually done
           | finetunes.
        
             | diggan wrote:
             | > the Meta terms of use require any derivative works to
             | have Llama in the name
             | 
             | Technically it requires the derivatives to _begin_ with
             | "llama". So "DeepSeek-R1-Distill-Llama-8B" isn't OK by the
             | license, while "Llama-3_1-Nemotron-Ultra-253B-v1" would be
             | OK.
             | 
             | > [...] If you use the Llama Materials or any outputs or
             | results of the Llama Materials to create, train, fine tune,
             | or otherwise improve an AI model, which is distributed or
             | made available, you shall also include "Llama" at the
             | beginning of any such AI model name.
             | 
             | I've previously written a summary that includes all parts
             | of the license that I think others are likely to have
             | missed: https://notes.victor.earth/youre-probably-breaking-
             | the-llama...
        
         | GodelNumbering wrote:
          | This has been happening a lot on r/localLlama for the past
          | few months. Big headline claims followed by "oh yeah, it's a
          | finetune."
        
         | lumost wrote:
          | I suspect we'll see a lot of variations on this. With the
          | open models catching up to SOTA and the foundation models
          | being relatively static, there will be many new SOTAs built
          | off of existing foundation models.
          | 
          | How many of the latest databases are Postgres forks?
        
       | rubymamis wrote:
        | I tend to prefer running non-thinking models locally, since
        | they output the result significantly faster.
        
         | nico wrote:
         | Any specific model recommendations for running locally?
         | 
         | Also, what tasks are you using them for?
        
           | genewitch wrote:
            | Phi 4. It's fast and reasonable enough, but with local
            | models you have to know what you want to do. If you want a
            | chat bot you use something with Hermes tunes; if you want
            | code you want a coder - a lot of people like the deepseek
            | distill qwen instruct for coding.
           | 
           | There's no equivalent to "does everything kinda well" like
           | chatgpt or Gemini on local, except maybe the 70B and larger,
           | but those are slow without datacenter cards with enough RAM
           | to hold them.
           | 
           | I just asked your very question a day or two ago because I
           | put _back_ together a machine with a 3060 12GB and wondered
           | what sota was on that amount of RAM.
           | 
            | If you use LM Studio it will auto-pick which of the
            | quantized models to get, but you can pick a larger quant if
            | you want. You pick a model and a parameter size and it will
            | choose the "best" quantization for your hardware. Generally.
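            | 
            | Roughly, that auto-pick only has to do back-of-envelope
            | math like this (a toy sketch, not LM Studio's actual
            | logic; the 20% overhead factor is a guess):
            | 
            |   def fits_in_vram(params_b, bits_per_weight,
            |                    vram_gb, overhead=1.2):
            |       # weights + a rough margin for KV cache/context
            |       need = params_b * bits_per_weight / 8 * overhead
            |       return need <= vram_gb
            | 
            |   # on a 12 GB 3060: a 7B model at ~4.5 bpw fits,
            |   # a 32B at the same quant does not
            |   print(fits_in_vram(7, 4.5, 12))    # True
            |   print(fits_in_vram(32, 4.5, 12))   # False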
        
             | nico wrote:
             | Thank you for the insightful reply
             | 
             | > There's no equivalent to "does everything kinda well"
             | like chatgpt or Gemini on local, except maybe the 70B and
             | larger, but those are slow
             | 
             | Is there something like a "prompt router", that can
             | automatically decide what model to use based on the type of
             | prompt/task?
        
               | tough wrote:
               | there's RouteLLM: https://github.com/lm-sys/RouteLLM
               | 
               | nvidia has LLMRouter https://build.nvidia.com/nvidia/llm-
               | router
               | 
               | llama-index also supports routing https://docs.llamaindex
               | .ai/en/stable/examples/low_level/rout...
               | 
               | semantic router seems interesting
               | https://github.com/aurelio-labs/semantic-router/
               | 
               | you could also just use langchain to route https://jimmy-
               | wang-gen-ai.medium.com/llm-router-in-langchain...
               | 
               | interesting paper PickLLM: Context-Aware RL-Assisted
               | Large Language Model Routing
               | 
               | https://arxiv.org/abs/2412.12170
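                | 
                | if you just want something homegrown, a router is
                | basically a classifier in front of a model picker; a
                | minimal keyword-based sketch (the model names and
                | rules here are placeholders):
                | 
                |   # toy prompt router
                |   RULES = [
                |       (("traceback", "refactor", "unit test"),
                |        "qwen2.5-coder-32b"),
                |       (("prove", "integral", "solve for"),
                |        "skywork-or1-7b"),
                |   ]
                |   DEFAULT = "phi-4"
                | 
                |   def route(prompt: str) -> str:
                |       p = prompt.lower()
                |       for keywords, model in RULES:
                |           if any(k in p for k in keywords):
                |               return model
                |       return DEFAULT
                | 
                |   print(route("why does this traceback happen?"))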
        
               | nyrikki wrote:
                | I have a machine with 3x 1080 Ti's that I do a batch
                | with, sending the question first to an LLM and then an
                | LRM, returning to review the faster results and killing
                | the LRM job if they are acceptable. Ollama or just
                | llama.cpp on podman makes this trivial.
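                | 
                | Roughly like this; both servers speak the OpenAI-
                | compatible chat API (llama.cpp server or Ollama expose
                | one), and the ports, model names, and acceptance check
                | below are placeholders:
                | 
                |   import requests
                |   from concurrent.futures import ThreadPoolExecutor
                | 
                |   def ask(port, model, prompt):
                |       r = requests.post(
                |           f"http://localhost:{port}"
                |           "/v1/chat/completions",
                |           json={"model": model,
                |                 "messages": [{"role": "user",
                |                               "content": prompt}]},
                |           timeout=600)
                |       msg = r.json()["choices"][0]["message"]
                |       return msg["content"]
                | 
                |   pool = ThreadPoolExecutor(max_workers=2)
                | 
                |   def answer(prompt, acceptable=lambda a: bool(a)):
                |       llm = pool.submit(ask, 8080, "fast-model",
                |                         prompt)
                |       lrm = pool.submit(ask, 8081, "thinking-model",
                |                         prompt)
                |       quick = llm.result()
                |       if acceptable(quick):
                |           # a real setup would kill the LRM job here;
                |           # this sketch just ignores its result
                |           return quick
                |       return lrm.result()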
               | 
                | But _knowing_ which model will be better is impossible;
                | only broad heuristics, which may or may not be correct
                | for any individual prompt, can be used.
               | 
                | While there are better options if you were buying today,
                | an old, out-of-date system with out-of-date GPUs works
                | well in this batch model.
                | 
                | gemma-3-27b-it-Q6_K_L works fine with these, and that,
                | mixed with an additional submit to DeepSeek-R1-Distill-
                | Qwen-32B, is absolutely fine on a system that would
                | otherwise just be shut down.
               | 
                | I have a very bright line about preventing inter-customer
                | leakage risk that may be irrational, but with that
                | mixture I find that I am better off looking at scholarly
                | papers than trying the commercial models.
               | 
                | My primary task is FP64-throughput limited, and thus I am
                | stuck on the Titan V; that it is ~6 times faster than the
                | 4090 and ~5 times faster than the 5090 at FP64 is the
                | only reason I don't have newer GPUs.
                | 
                | You can add 4x 1080 Ti's at a 200 W power limit with
                | common PSUs and get the memory, but performance is
                | limited by the PCIe bus once you are at 3x 1080 Ti's.
               | 
                | As they seem to sell for the same price, I would probably
                | buy the Titan V today, but the point is that if you are
                | fine with the even smaller models, you can run the
                | queries in parallel or even cross-verify, which
                | dramatically helps with planning tasks even with the
                | foundational models.
                | 
                | But series/parallel runs do a lot, and if you are using
                | them for code, running a linter etc. on the structured
                | output saves a lot of time evaluating the multiple
                | responses.
               | 
               | No connection to them at all, but bartowski on hugging
               | face puts a massive amount of time and effort into re-
               | quantizing models.
               | 
                | If you don't have a restriction like my FP64 need, you
                | can get 70B models running on two 24 GB GPUs without much
                | 'cost' to accuracy.
               | 
               | That would be preferable to a _router_ IMHO.
        
               | genewitch wrote:
               | > My primary task is FP64 throughput limited, and thus I
               | am stuck on Titan V as it is ~6 times faster than the
               | 4090 and 5 times faster than the 5090 is the only reason
               | I don't have newer GPUS.
               | 
               | interesting. Very interesting. Why fp64 as opposed to
               | BF16? different sort of model? i don't even know where to
               | find fp64 models (not that i've looked).
               | 
               | also Bartowski may be on huggingface but they're also
               | part of the LM Studio group, and frequently chat on that
               | discord. actually, at least 3 of the main model converter
               | / quant people are on that discord.
               | 
               | I haven't got two 24GB cards, yet, but maybe soon, with
               | the way people are hogging the 5000 series.
               | 
                | edit: i realize that they're increasing the marketing
                | FLOPS by halving the precision; the current gen stuff is
                | all "fast" at FP16 (or BF16 - brainfloat 16 bit). So when
                | nvidia finishes and releases a card with double the FLOPS
                | at 8 bit, will that card be 8 times slower at FP64?
        
               | nyrikki wrote:
               | My primary task isn't ML, and 64bit is needed for
               | numerical stability.
               | 
                | For the Titan V, FP64 was 1/2 of FP32; it was the only
                | and last _consumer_ generation to have that.
                | 
                | For the Titan RTX and newer NVIDIA cards, the FP64 rate
                | is typically a much smaller fraction of FP32 (1/32 to
                | 1/64).
                | 
                | So the Titan RTX, with 16 FP32 TFLOP/s, drops to 0.5 FP64
                | TFLOP/s.
                | 
                | While the Titan V, starting at 15 FP32 TFLOP/s, still has
                | 7.5 FP64 TFLOP/s.
                | 
                | The 5090 has 104.9 FP16/32 TFLOP/s, but only 1.64 FP64
                | TFLOP/s.
               | 
               | Basically Nvidia decided most people didn't need FP64,
               | and chose to improve quantized performance instead.
               | 
               | If you can run on a GPU, that Titan V has more 64bit
               | Flops than even an AMD Ryzen Threadripper PRO 7995WX.
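                | 
                | Same arithmetic in one place, using the peak numbers
                | quoted above (so treat them as approximate):
                | 
                |   # peak TFLOP/s quoted above: (FP32, FP64 ratio)
                |   cards = {
                |       "Titan V":   (15.0, 1 / 2),
                |       "Titan RTX": (16.0, 1 / 32),
                |       "RTX 5090":  (104.9, 1 / 64),
                |   }
                |   for name, (fp32, ratio) in cards.items():
                |       print(f"{name}: {fp32*ratio:.2f} FP64 TFLOP/s")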
        
           | rubymamis wrote:
            | I mostly like to evaluate them whenever I ask a remote model
            | (Claude 3.7, ChatGPT 4.5), to see how far they have
            | progressed. From my tests, qwen 2.5 coder 32b is still the
            | best local model for coding tasks. I've also tried Phi 4,
            | nemotron, mistral-small, and qwq 32b. I'm using a MacBook Pro
            | M4 with 46GB RAM.
        
       | y2236li wrote:
        | Interesting - matching the 671B-parameter model's performance
        | with a 32B model feels like a significant step. It's a
        | compelling contrast to the previous models and sets a strong
        | benchmark. It's great that they're
       | embracing open weights and data too - that's a crucial aspect for
       | innovation.
        
         | CharlesW wrote:
         | > _It's great that they're embracing open [...] data too..._
         | 
         | It could be, but as I type this it's currently vaporware:
         | https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data
        
       | chvid wrote:
       | How is the score on AIME2024 relevant if AIME2024 has been used
       | to train the model?
        
         | nyrikki wrote:
          | That is pretty much a universal problem. If you look at the
          | problems anyone's model has solved, they are all well
          | represented in the corpus.
          | 
          | Remember that AIME is intended for high schoolers with just
          | pencils, erasers, rulers, and compasses to solve in 3 hours.
          | There is an entire industry providing supplementary material to
          | prepare students for concepts that are not directly covered in
          | typical school material.
          | 
          | As the various blogs and tests that pull from previous years
          | make it into all the common sources like Stack
          | Overflow/Exchange, Reddit, etc., them explicitly stating that
          | they trained on _AIME problems prior to 2024_ isn't much
          | different.
         | 
         | Basically expect any model to train on all AIME problems
         | available before their knowledge cutoff date.
         | 
         | To me, "How is the score on AIME2024 relevant" is because it is
         | still not that high (from a practical consideration) despite
         | directly training on it.
         | 
         | Mixed in with all the models success falling dramatically with
         | AIME2025 demonstrates the above, and hints that Rao's claim
         | that _compiling_ in the verifier in training /scratch-
         | space/prompt/fine-tuning etc... in a way the model can reliably
         | access is what matters.
        
         | ipsum2 wrote:
          | Google Gemini (2.5 Pro) made the same "mistake": their data
          | cutoff is January 2025, and AIME 2024 was in February 2024.
        
       | qwertox wrote:
        | I know one can rent consumer GPUs on the internet, where people
        | like you and me offer their spare GPU time to people who need
        | it, for a price. They basically get a GPU-enabled VM on your
        | machine.
        | 
        | But is there something like a distributed network akin to
        | SETI@home and the like which is free for training models? Where
        | consensus is reached on which model gets trained, and any
        | derivative works must be open source, including all the tooling
        | and the hosting platform? Would this even be possible to do,
        | given that the latency between nodes is very high and the
        | bandwidth limited?
        
         | qeternity wrote:
         | > Would this even be possible to do, given that the latency
         | between nodes is very high and the bandwidth limited?
         | 
         | Yes, it's possible. But no, it would not be remotely sensible
         | given the performance implications. There is a reason why
         | Nvidia is a multi trillion dollar company, and it's as much
         | about networking as it is about GPUs.
        
         | kmeisthax wrote:
         | Back in the early days of AI art, before AI became way too
         | cringe to think about, I wondered about this exact thing[0].
          | The problem, I learned later, is that most AI training (and
          | inference) is not so much dependent on GPU _compute_ as on
          | memory bandwidth and communication. A huge chunk of AI
         | training is just figuring out how to minimize or hide the
         | bottleneck the inter-GPU interconnect imposes so you can scale
         | to multiple cards.
         | 
         | The BOINC model of distributed computing is to separate
         | everything into little work units that can be sent out to
         | multiple machines who then return a result that can be
         | integrated back into the whole. If you were to train foundation
         | models this way, you'd be packaging up the current model state
         | n and a certain amount of trainset items into a work unit, and
         | the result would be model weight offsets to be added back into
         | model state n+1. But you wouldn't be able to benefit from any
         | of the gradients calculated by other users until they submitted
         | their work units and n+1 got calculated. So there'd be a lot of
         | redundant work and training progress would slow down, versus a
         | closely-coupled set of GPUs where they have enough bandwidth to
         | exchange gradients every batch.
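          | 
          | In code, one round of that loop would look roughly like
          | this (a toy least-squares "model" standing in for a real
          | network; just a sketch of the work-unit idea, not a real
          | trainer):
          | 
          |   import numpy as np
          | 
          |   def work_unit(weights, batch):
          |       # what one volunteer computes: a gradient step on
          |       # its slice of the trainset, returned as an offset
          |       X, y = batch
          |       grad = 2 * X.T @ (X @ weights - y) / len(y)
          |       return -0.01 * grad
          | 
          |   def server_round(weights, batches):
          |       # model state n -> n+1; nobody can train against
          |       # n+1 until every work unit has come back
          |       offsets = [work_unit(weights, b) for b in batches]
          |       return weights + np.mean(offsets, axis=0)
          | 
          |   rng = np.random.default_rng(0)
          |   w = np.zeros(4)
          |   batches = [(rng.normal(size=(32, 4)),
          |               rng.normal(size=32)) for _ in range(8)]
          |   for _ in range(5):
          |       w = server_round(w, batches)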
         | 
         | For the record, I never actually built a distributed training
         | cluster. But when I learned what AI actually wants to go fast,
         | I realized distributed training probably couldn't work over
         | just renting big GPUs.
         | 
         | Most people do not have GPUs with enough RAM to do meaningful
         | AI work. Generative AI models work autoregressively: that is,
         | all of their weights are repeatedly used in a tight loop. In
         | order for a GPU to provide a meaningful speedup it needs to
         | have the whole model in GPU memory, because PCIe is slow (high
         | latency) and also slow (low bandwidth). Nvidia knows this and
         | that's why they are very stingy on GPU VRAM. Furthermore,
         | training a model takes more memory than merely running it; I
         | _believe_ gradients are something like the number of weights
          | times your batch size in terms of memory usage. There are two
          | ways I could see around this, both of which are going to cause
          | further problems:
         | 
         | - You could make 'mini' workunits where certain specific layers
         | of the model are frozen and do not generate gradients. So you'd
         | only train, say, 10% of the model at any one time. This is how
         | you train very large models in centralized computing; you put a
         | slice of the model on each GPU and exchange activations and
         | gradients each batch. But we're on a distributed computer, so
         | we don't have that kind of tight coupling, and we converge
         | slower or not at all if we do this.
         | 
         | - You can change the model architecture to load specific chunks
         | of weights at each layer, with another neural network to decide
         | what chunks to load for each token. This is known as a "Mixture
         | of Experts" model and it's the most efficient way we know of to
         | stream weights in and out of a GPU, but training has to be
         | aware of it and you can't change the size of the chunks to fit
         | the current GPU. MoE lets a model have access to a lot of
         | weights, but the scaling is worse. e.g. an 8x44B parameter MoE
         | model is NOT equivalent to a 352B non-MoE model. It also causes
         | problems with training that you have to solve for: very common
         | bits of knowledge will be replicated across chunks, and certain
         | chunks can become favored by the model because they're getting
         | more gradients, which causes them to be favored more, so they
         | get more gradients.
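          | 
          | The gating idea in that second option, stripped down to a
          | toy numpy sketch (shapes and the top-k softmax are just
          | illustrative, not any particular model's design):
          | 
          |   import numpy as np
          | 
          |   def moe_layer(x, router_w, experts, k=2):
          |       # a tiny "router" scores every expert for this
          |       # token; only the top-k experts' weights ever
          |       # need to be resident on the GPU
          |       scores = x @ router_w
          |       top = np.argsort(scores)[-k:]
          |       gates = np.exp(scores[top])
          |       gates /= gates.sum()
          |       return sum(g * (x @ experts[i])
          |                  for g, i in zip(gates, top))
          | 
          |   rng = np.random.default_rng(0)
          |   d, n_experts = 16, 8
          |   router_w = rng.normal(size=(d, n_experts))
          |   experts = [rng.normal(size=(d, d))
          |              for _ in range(n_experts)]
          |   y = moe_layer(rng.normal(size=d), router_w, experts)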
         | 
         | [0] My specific goal was to train a txt2img model purely on
         | public domain Wikimedia Commons data, which failed for
         | different reasons having to do with the fact that most of AI is
         | just dataset sorting.
        
       ___________________________________________________________________
       (page generated 2025-04-13 23:00 UTC)