[HN Gopher] DeepSeek-Prover-V2
       ___________________________________________________________________
        
       DeepSeek-Prover-V2
        
       Author : meetpateltech
       Score  : 270 points
       Date   : 2025-04-30 16:23 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | simianwords wrote:
        | related: I imagine in the future we might have several
        | "expert" LLMs, and a wrapper could delegate tasks as needed
        | as if each were a "tool". That way we can have segregation
        | of expertise - each individual model can excel at one single
        | thing.
        | 
        | A prover model might be used as such a tool in the near
        | future.
        
         | koakuma-chan wrote:
          | > related: I imagine in the future we might have several
          | "expert" LLMs, and a wrapper could delegate tasks as
          | needed as if each were a "tool". That way we can have
          | segregation of expertise - each individual model can excel
          | at one single thing.
         | 
         | In the future? I'm pretty sure people do that already.
        
           | energy123 wrote:
           | It's crudely done though.
        
           | kratom_sandwich wrote:
            | Mistral's model is a mixture-of-experts model.
        
           | simianwords wrote:
            | No, I disagree. I would want ChatGPT to abstract away
            | expert models - biochemistry model, coding model,
            | physics model - and maybe o3 would use these models as
            | tools to come up with an answer.
            | 
            | The point being that a separate expert model would be
            | better at its own field than a single model that tries
            | to be good at everything. Intuitively it makes sense; in
            | practice I have seen anecdotes where finetuning a small
            | model on domain data makes the model lose coherence on
            | other topics.
        
             | koakuma-chan wrote:
             | > have seen anecdotes where finetuning a small model on
             | domain data makes the model lose coherence on other topics
             | 
             | This is expected behaviour.
        
               | simianwords wrote:
                | I know. So why don't we have domain-specific models
                | as tools in consumer LLM products?
        
         | Arcuru wrote:
         | For a concrete example today, see
         | https://openrouter.ai/openrouter/auto
        
           | simianwords wrote:
            | That's nice, but imagine first having models that are
            | experts in specific domains. Routing seems to be the
            | easy part (just feed the available models as tools to
            | your wrapper LLM).
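            | 
            | Roughly, something like this (an untested sketch using
            | OpenAI-style tool calling; the expert model names are
            | made up):
            | 
            |   from openai import OpenAI
            |   
            |   client = OpenAI()
            |   
            |   EXPERTS = ["math_expert", "code_expert"]  # hypothetical
            |   
            |   tools = [{
            |       "type": "function",
            |       "function": {
            |           "name": name,
            |           "description": f"Delegate the question to {name}.",
            |           "parameters": {
            |               "type": "object",
            |               "properties": {"question": {"type": "string"}},
            |               "required": ["question"],
            |           },
            |       },
            |   } for name in EXPERTS]
            |   
            |   resp = client.chat.completions.create(
            |       model="gpt-4o",  # the wrapper model does the routing
            |       messages=[{"role": "user",
            |                  "content": "Prove sqrt(2) is irrational."}],
            |       tools=tools,
            |   )
            |   # assumes the router chose a tool rather than answering
            |   call = resp.choices[0].message.tool_calls[0]
            |   print(call.function.name, call.function.arguments)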
        
         | someguy101010 wrote:
         | The No Free Lunch Theorem implies that something like this is
         | inevitable
         | https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...
        
           | repsilat wrote:
           | A system of n experts is no different to a single expert wrt
           | the NFLT. The theorem is entirely indifferent to (ie "equally
           | skeptical of") the idea.
        
         | samvaran wrote:
         | Is that not what MoE models already do?
        
           | oofbaroomf wrote:
              | No. Each expert is not separately trained, and while
              | they may store different concepts, they are not meant
              | to be experts in specific domains. However, there are
              | tools that route requests to different domain-expert
              | LLMs, or even to fine-tuned adapters, such as
              | RouteLLM.
        
             | retinaros wrote:
              | That might already happen behind what they call
              | test-time compute.
        
             | woah wrote:
             | Why do you think that a hand-configured selection between
             | "different domains" is better than the training-based
             | approach in MoE?
        
           | AlexCoventry wrote:
            | MoE models route _each token_, in every transformer
            | layer, to a set of specialized feed-forward networks
            | (fully-connected perceptrons, basically), based on a
            | score derived from the token's current representation.
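            | 
            | In code, the gating step looks roughly like this (a
            | minimal top-k routing sketch in PyTorch; illustrative,
            | not any particular model's implementation):
            | 
            |   import torch
            |   import torch.nn as nn
            |   import torch.nn.functional as F
            |   
            |   class MoELayer(nn.Module):
            |       def __init__(self, d=64, n_experts=8, k=2):
            |           super().__init__()
            |           # router scores each token against each expert
            |           self.router = nn.Linear(d, n_experts)
            |           self.experts = nn.ModuleList(
            |               nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
            |                             nn.Linear(4 * d, d))
            |               for _ in range(n_experts))
            |           self.k = k
            |   
            |       def forward(self, x):  # x: (num_tokens, d)
            |           gates = F.softmax(self.router(x), dim=-1)
            |           w, idx = gates.topk(self.k, dim=-1)
            |           out = torch.zeros_like(x)
            |           # loops for clarity; real kernels batch this
            |           for t in range(x.size(0)):
            |               for g, e in zip(w[t], idx[t]):
            |                   out[t] += g * self.experts[e](x[t])
            |           return out
            |   
            |   y = MoELayer()(torch.randn(5, 64))  # route 5 tokens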
        
       | smusamashah wrote:
        | That PutnamBench graph (the middle one) shows a 49/658
        | solve rate.
       | 
       | > The resulting model, DeepSeek-Prover-V2-671B, achieves state-
       | of-the-art performance in neural theorem proving, reaching 88.9%
       | pass ratio on the MiniF2F-test and solving 49 out of 658 problems
       | from PutnamBench.
       | 
       | Which is 0.07% (edit: 7%) for PutnamBench
        
         | darkmighty wrote:
         | 49/658 is 7%
        
           | smusamashah wrote:
            | Sorry, forgot to multiply by 100.
        
             | booi wrote:
             | I bet DeepSeek-Prover-V2 wouldn't have made that mistake
        
             | gallerdude wrote:
             | classic human hallucination
        
         | HappyPanacea wrote:
         | How likely is it that Putnam answers were in DeepSeek's
         | training data?
        
           | EvgeniyZh wrote:
            | The solutions weren't published anywhere. There is also
            | no good automatic way to generate solutions as far as I
            | know, even an expensive one (the previous SOTA was 10
            | solutions, and the one before that was 8, using
            | pass@3200 for a 7B model). Potentially the developers
            | could've paid people who are good at Putnam-level math
            | problems and Lean to write solutions for LLMs. It is
            | hard to estimate the likelihood of that, but it sounds
            | like a waste of money given the relatively marginal
            | problem/benchmark.
        
             | HappyPanacea wrote:
             | AoPS seems to have a forum dedicated to Putnam (including
             | 2024):
             | https://artofproblemsolving.com/community/c3249_putnam and
             | here is a pdf with solutions to Putnam 2023:
             | https://kskedlaya.org/putnam-archive/2023s.pdf
        
               | EvgeniyZh wrote:
                | These still need to be formalized in Lean, which can
                | sometimes be harder than solving the problem.
        
       | islewis wrote:
       | > The cold-start training procedure begins by prompting
       | DeepSeek-V3 to decompose complex problems into a series of
       | subgoals
       | 
        | It feels pretty intuitive to me that the ability of an LLM
        | to break a complex problem down into smaller, more easily
        | solvable pieces will unlock the next level of complexity.
        | 
        | This pattern feels like a technique often taught to junior
        | engineers - how to break up a multi-week project into
        | bite-sized tasks. This model is obviously math focused, but
        | I see no reason why this wouldn't be incredibly powerful
        | for code-based problem solving.
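        | 
        | A toy version of that decompose-then-solve loop (a sketch
        | only; complete() stands in for whatever chat API you use):
        | 
        |   def complete(prompt: str) -> str:
        |       raise NotImplementedError("wrap your LLM API here")
        |   
        |   def solve(problem: str) -> str:
        |       plan = complete("Decompose this problem into small,"
        |                       " independent subgoals, one per"
        |                       " line:\n" + problem)
        |       subgoals = [g for g in plan.splitlines() if g.strip()]
        |       partial = [complete("Solve this subgoal: " + g)
        |                  for g in subgoals]
        |       return complete("Combine these partial solutions"
        |                       " into a full solution of "
        |                       + repr(problem) + ":\n"
        |                       + "\n".join(partial))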
        
         | cadamsdotcom wrote:
         | And it should be powerful for breaking down reasoning chains of
         | thought too.
        
         | criley2 wrote:
          | Imo current models can already break things up into
          | bite-sized pieces. The limiter I've seen is twofold:
         | 
         | 1) Maintaining context of the overall project and goals while
         | working in the weeds on a subtask of a task on an epic (so to
         | speak) both in terms of what has been accomplished already and
         | what still needs to be accomplished
         | 
          | and 2) Getting an agentic coding tool that can actually
          | handle the scale of doing 50 small projects back to back.
          | With these agentic tools I find they start projects off
          | really strong, but by task #5 they're just making a mess
          | with every change.
         | 
          | I've played with keeping basically a dev-progress.md file
          | and an implementation-plan.md file that I keep in context
          | for every request, ending each task by updating the
          | files. But my manually keeping all this context isn't
          | solving all my problems.
         | 
         | And all the while, tools like Cline are gobbling up 2M tokens
         | to make small changes.
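          | 
          | What the manual version amounts to, roughly (a sketch;
          | complete() stands in for the agentic tool's API):
          | 
          |   from pathlib import Path
          |   
          |   def complete(prompt: str) -> str:
          |       raise NotImplementedError("wrap your agent here")
          |   
          |   CONTEXT = ("dev-progress.md", "implementation-plan.md")
          |   
          |   def run_task(task: str) -> str:
          |       ctx = "\n\n".join(Path(f).read_text()
          |                         for f in CONTEXT)
          |       result = complete(ctx + "\n\nTask: " + task)
          |       # end each task by updating the progress file
          |       Path("dev-progress.md").write_text(
          |           complete(ctx + "\n\nRewrite the progress log"
          |                    " to reflect this completed work:\n"
          |                    + result))
          |       return result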
        
           | jhrmnn wrote:
           | > Maintaining context of the overall project and goals while
           | working in the weeds on a subtask of a task on an epic (so to
           | speak) both in terms of what has been accomplished already
           | and what still needs to be accomplished
           | 
           | This is a struggle for every human I've ever worked with
        
             | pertymcpert wrote:
             | Yes. I wonder if the path forward will be to create systems
             | of agents that work as a team, with an "architect" or
             | "technical lead" AI directing the work of more specialized
             | execution AIs. This could alleviate the issue of context
             | pollution as the technical lead doesn't have to hold all of
             | the context when working on a small problem, and vice
             | versa.
             | 
             | Shit. Do we need agile AI now?
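              | 
              | Sketched crudely (complete() is a placeholder; the
              | point is that each worker gets a fresh,
              | self-contained brief rather than the lead's full
              | context):
              | 
              |   def complete(prompt: str) -> str:
              |       raise NotImplementedError("wrap an LLM here")
              |   
              |   def lead(project: str) -> list[str]:
              |       tasks = complete("As tech lead, list the"
              |                        " tasks, one per line, for:\n"
              |                        + project).splitlines()
              |       results = []
              |       for task in filter(str.strip, tasks):
              |           # the worker never sees the lead's
              |           # accumulated context, only the brief
              |           brief = complete("Write a self-contained"
              |                            " brief for: " + task)
              |           results.append(complete(
              |               "You are a specialist engineer."
              |               " Execute:\n" + brief))
              |       return results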
        
               | Rudybega wrote:
                | This is kind of what the modes in roo code do now.
                | I'm having great success with them, and they just
                | rolled out as defaults a couple of days ago.
               | 
               | There are a default set of modes (orchestrator, code,
               | architect, debug, and ask) and you can create your own
               | custom ones (or have roo do it for you, which is kind of
               | a fun meta play).
               | 
                | Orchestrator basically consults the others and uses
                | them when appropriate, feeding a sensible amount of
                | task definition and context into the subtask. You
                | can use different LLMs for different modes as well
                | (I like Gemini 2.5 Pro for most of the
                | thinking-style ones and o4-mini for the coding).
               | 
               | I've done some reasonably complicated things and haven't
               | really had an orchestrator task creep past ~400k tokens
               | before I was finished and able to start a new task.
               | 
               | There are some people out there who do really cool stuff
               | with memory banks (basically logging and progress
               | tracking), but I haven't played a ton with that yet.
               | 
               | Basic overview:
               | https://docs.roocode.com/features/boomerang-tasks
               | 
                | Custom Modes:
                | https://docs.roocode.com/features/custom-modes
        
         | bearjaws wrote:
          | It's actually pretty hilarious how deep into the details
          | they can go.
          | 
          | For example, I made a bot that you could give a problem
          | statement, and it would return an array of steps to
          | accomplish it.
          | 
          | Then you could take the steps and click on them to break
          | them down and add them to the list (a toy sketch of the
          | loop follows the step list below). If you just kept
          | clicking you would get to excruciating detail.
          | 
          | For example, taking out the trash can become ~70
          | individual steps if you really drill into the details.
         | 
         | Some of the steps:
         | 
         | Stand close to the trash can - Position yourself so you have
         | stable footing and easy access.
         | 
         | Place one hand on the rim of the can - Use your non-dominant
         | hand to press lightly on the edge of the trash can to hold it
         | in place.
         | 
         | Grip the top edge of the bag with your other hand - Find the
         | part of the bag that extends past the rim.
         | 
         | Gently lift the bag upward - While your one hand stabilizes the
         | can, slowly pull the bag up with the other.
         | 
         | Tilt the can slightly if needed - If the bag sticks or creates
         | suction, rock or tilt the can slightly while continuing to
         | lift.
         | 
         | Avoid jerking motions - Move steadily to prevent tears or
         | spills
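          | 
          | The whole bot is basically this loop (toy sketch;
          | complete() wraps the LLM call):
          | 
          |   def complete(prompt: str) -> str:
          |       raise NotImplementedError("wrap your LLM API here")
          |   
          |   def expand(step: str) -> list[str]:
          |       text = complete("Break this step into smaller"
          |                       " steps, one per line:\n" + step)
          |       return [s.strip() for s in text.splitlines()
          |               if s.strip()]
          |   
          |   steps = expand("Take out the trash")
          |   # "clicking" a step splices its sub-steps into the list
          |   steps[2:3] = expand(steps[2])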
        
           | eightysixfour wrote:
           | This used to be part of one of the intro to engineering
           | courses at my school - write an XX page document describing
           | how to make a peanut butter and jelly sandwich.
        
           | lugu wrote:
            | This is how I imagine LLMs being used in robotics, with
            | one or two more levels of description.
        
           | thrance wrote:
           | This feels like a manual for infiltrated aliens: "How to pass
           | as humans, Vol. I"
        
             | roywiggins wrote:
             | or for goblins:
             | 
             | https://goblin.tools/
        
       | jasonjmcghee wrote:
        | Super interesting that they chose 671B and 7B, and nothing
        | like a 32B, which feels like the "sweet spot".
        
         | bredren wrote:
          | Also notable: the earliest planning for a
          | positively-received release of a new model might include
          | market segmentation by both parameter count and skill
          | type.
          | 
          | --> "In an increasingly crowded field of LLMs, how will
          | our (costly to produce) model stand out?"
        
         | SweetSoftPillow wrote:
          | I feel like this is a very logical way to do things. Test
          | hypotheses on the small model, play around, get it
          | working, then apply the findings to the big model.
        
       | qoez wrote:
        | The best part about these is that I know the weights are
        | static, so I won't suddenly have to deal with a sassy,
        | unusable update for a week.
        
         | Implicated wrote:
          | Or, like with Claude, it being effectively lobotomized
          | during North American 'business' hours. 3am PST? Cracked.
          | 8am PST? ... mentally challenged.
        
           | navanchauhan wrote:
           | This is pretty interesting. Do you have more information
           | about this?
        
             | devoutsalsa wrote:
             | I'm pretty sure the parent comment is referring to capacity
             | constraints. When the Americans come online in the morning,
             | Claude frequently can't keep up with demand and error
             | messages saying the system is at capacity are common.
        
       | ekez wrote:
        | I wonder if the authors have tried incorporating error
        | feedback from Lean into their models.
        | 
        | Work from 2023 [1] showed general-purpose models did better
        | when they were able to incorporate error feedback; humans
        | incorporate error feedback, but none of the SOTA models on
        | MiniF2F seem to.
       | 
       | [1]: https://arxiv.org/abs/2310.04353
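        | 
        | A toy version of such a loop (my own sketch, not the
        | paper's setup; assumes a Lean 4 project managed by lake and
        | a complete() wrapper around the prover model):
        | 
        |   import pathlib
        |   import subprocess
        |   
        |   def complete(prompt: str) -> str:
        |       raise NotImplementedError("wrap the prover model")
        |   
        |   def prove(theorem: str, tries: int = 5) -> str | None:
        |       proof = complete("Prove in Lean 4:\n" + theorem)
        |       for _ in range(tries):
        |           pathlib.Path("Attempt.lean").write_text(proof)
        |           r = subprocess.run(
        |               ["lake", "env", "lean", "Attempt.lean"],
        |               capture_output=True, text=True)
        |           if r.returncode == 0:
        |               return proof  # Lean accepted the proof
        |           proof = complete("Lean rejected this proof:\n"
        |                            + proof + "\nErrors:\n"
        |                            + r.stdout + r.stderr
        |                            + "\nGive a corrected proof.")
        |       return None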
        
       | revskill wrote:
        | The way intelligence works, to me, is more about:
        | 
        | - Making correct, smart assumptions. Currently all LLM bots
        | are too stupid to make good assumptions. I don't want to
        | explicitly repeat my own assumptions again and again when
        | the context is clear enough. Hey bots, try harder.
        | 
        | - The LLM bot needs to bring its own secondary and
        | contextual memory into its reasoning; I don't want to do it
        | for you, OK? You're the bot.
        | 
        | - Thinking outside the box. This is the final stage of
        | intelligence: adapting old techniques into your own to
        | solve problems that didn't exist before.
        
         | nthingtohide wrote:
          | I propose that human-AI interaction data must be made
          | public. This is our collective Wikipedia of the AI era.
          | Otherwise our progress will be a blank line after 2022,
          | just as the Egyptians didn't write down the process for
          | moving giant rocks.
        
       | whatshisface wrote:
       | How much education would a human need to perform at this level on
       | the benchmarks?
        
         | pama wrote:
          | Learning to formalize math in order to prove Putnam
          | competition problems rigorously in Lean would require a
          | mid-to-advanced college-level math and CS background.
          | (Learning to do a small fraction of the Putnam
          | competition without using Lean probably only needs strong
          | high school math and early undergrad math, with
          | competition training a strong bonus.)
        
       | Fokamul wrote:
       | >"an open-source large language model"
       | 
        | Is it really open source? Did something change?
        
         | eightysixfour wrote:
          | Everyone calls "open weight" models "open source" at this
          | point. It is wrong, but we'll have to find another way to
          | fight this fight. Maybe "open data and pipeline" or
          | something.
        
       | mcshicks wrote:
        | You can run the model for free on openrouter.ai. I have
        | played around with Lean, slowly working my way through
        | Mathematics in Lean, and I was stuck on this problem in
        | section 3.6:
        | 
        | theorem convergesTo_unique {s : ℕ → ℝ} {a b : ℝ}
        |     (sa : ConvergesTo s a) (sb : ConvergesTo s b) :
        |     a = b := by
       | 
       | For fun I tried it on the free model on openrouter.ai. Got the
       | answer the first time.
       | 
       | https://leanprover-community.github.io/mathematics_in_lean/m...
       | 
       | Here's the answer just to give you a feel.
        |   by_contra h
        |   have h1 : a ≠ b := h
        |   have h2 : |a - b| > 0 := by
        |     apply abs_pos.mpr
        |     exact sub_ne_zero.mpr h1
        |   -- Use the definition of convergence to find N1 and N2
        |   have h3 := sa (|a - b| / 2) (by linarith)
        |   have h4 := sb (|a - b| / 2) (by linarith)
        |   cases' h3 with N1 h3
        |   cases' h4 with N2 h4
        |   -- Choose N to be the maximum of N1 and N2
        |   let N := max N1 N2
        |   have h5 := h3 N (by simp [N, le_max_left])
        |   have h6 := h4 N (by simp [N, le_max_right])
        |   -- Derive a contradiction using the triangle inequality
        |   have h7 : |s N - a| < |a - b| / 2 := by simpa using h5
        |   have h8 : |s N - b| < |a - b| / 2 := by simpa using h6
        |   have h9 : |a - b| < |a - b| := by
        |     calc
        |       |a - b| = |a - s N + (s N - b)| := by ring_nf
        |       _ ≤ |a - s N| + |s N - b| := by
        |         apply abs_add
        |       _ = |s N - a| + |s N - b| := by
        |         rw [abs_sub_comm]
        |       _ < |a - b| / 2 + |a - b| / 2 := by
        |         linarith
        |       _ = |a - b| := by ring
        |   linarith
        
       ___________________________________________________________________
       (page generated 2025-04-30 23:00 UTC)