[HN Gopher] DeepSeek-Prover-V2
___________________________________________________________________
DeepSeek-Prover-V2
Author : meetpateltech
Score : 270 points
Date : 2025-04-30 16:23 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| simianwords wrote:
| related: I imagine in the future we might have several "expert"
| LLMs, and a wrapper could delegate tasks to them as needed, as
| if each were a "tool". That way we get segregation of expertise
| - each individual model can excel at one single thing.
|
| A prover model might be used as such a tool in the future.
| koakuma-chan wrote:
| > related: I imagine in the future we might have several
| "expert" LLMs, and a wrapper could delegate tasks to them as
| needed, as if each were a "tool". That way we get segregation of
| expertise - each individual model can excel at one single thing.
|
| In the future? I'm pretty sure people do that already.
| energy123 wrote:
| It's crudely done though.
| kratom_sandwich wrote:
| Mistral's model is a mixture-of-experts model.
| simianwords wrote:
| No, I disagree. I would want ChatGPT to abstract away expert
| models - a biochemistry model, a coding model, a physics model -
| and maybe o3 would use these models as tools to come up with an
| answer.
|
| The point being that a separate expert model would be better
| at its own field than a single model that tries to be good at
| everything. Intuitively it makes sense, in practice I have
| seen anecdotes where finetuning a small model on domain data
| makes the model lose coherence on other topics.
| koakuma-chan wrote:
| > have seen anecdotes where finetuning a small model on
| domain data makes the model lose coherence on other topics
|
| This is expected behaviour.
| simianwords wrote:
| I know. So why don't we have domain-specific models as
| tools in consumer LLM products?
| Arcuru wrote:
| For a concrete example today, see
| https://openrouter.ai/openrouter/auto
| simianwords wrote:
| That's nice, but imagine first having models that are experts in
| specific domains. Routing seems to be the easy part (just
| feed the available models as tools to your wrapper LLM).
| someguy101010 wrote:
| The No Free Lunch Theorem implies that something like this is
| inevitable
| https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...
| repsilat wrote:
| A system of n experts is no different to a single expert wrt
| the NFLT. The theorem is entirely indifferent to (ie "equally
| skeptical of") the idea.
| samvaran wrote:
| Is that not what MoE models already do?
| oofbaroomf wrote:
| No. Each expert is not separately trained, and while they may
| store different concepts, they are not meant to be experts in
| specific domains. However, there are routing systems, such as
| RouteLLM, that dispatch requests to different domain-expert
| LLMs or even to fine-tuned adapters.
| retinaros wrote:
| That might already happen behind what they call test-time
| compute.
| woah wrote:
| Why do you think that a hand-configured selection between
| "different domains" is better than the training-based
| approach in MoE?
| AlexCoventry wrote:
| MoE models route _each token_, in every transformer layer,
| to a set of specialized feed-forward networks (fully
| connected perceptrons, basically), based on a score derived
| from the token's current representation.
| smusamashah wrote:
| That PutnamBench graph (the middle one) shows a 49/658 solve
| rate.
|
| > The resulting model, DeepSeek-Prover-V2-671B, achieves state-
| of-the-art performance in neural theorem proving, reaching 88.9%
| pass ratio on the MiniF2F-test and solving 49 out of 658 problems
| from PutnamBench.
|
| Which is 0.07% (edit: 7%) for PutnamBench
| darkmighty wrote:
| 49/658 is 7%
| smusamashah wrote:
| Sorry, forgot to multiply by 100
| booi wrote:
| I bet DeepSeek-Prover-V2 wouldn't have made that mistake
| gallerdude wrote:
| classic human hallucination
| HappyPanacea wrote:
| How likely is it that Putnam answers were in DeepSeek's
| training data?
| EvgeniyZh wrote:
| The solutions weren't published anywhere. There is also no
| good automatic way to generate solutions as far as I know,
| even an expensive one (the previous SOTA was 10 solutions, and
| the one before that was 8, using pass@3200 for a 7B model).
| Potentially the developers could have paid people who are good
| at Putnam-level math problems and Lean to write solutions for
| LLMs. It is hard to estimate the likelihood of that, but it
| sounds like a waste of money for a relatively marginal
| benchmark.
| HappyPanacea wrote:
| AoPS seems to have a forum dedicated to Putnam (including
| 2024):
| https://artofproblemsolving.com/community/c3249_putnam and
| here is a pdf with solutions to Putnam 2023:
| https://kskedlaya.org/putnam-archive/2023s.pdf
| EvgeniyZh wrote:
| These still need to be formalized in Lean, which can
| sometimes be harder than solving the problem.
| islewis wrote:
| > The cold-start training procedure begins by prompting
| DeepSeek-V3 to decompose complex problems into a series of
| subgoals
|
| It feels pretty intuitive to me that the ability for an LLM to
| break a complex problem down into smaller, more easily solvable
| pieces will unlock the next level of complexity.
|
| This pattern feels like a technique often taught to junior
| engineers - how to break up a multi-week project into bite-sized
| tasks. This model is obviously math-focused, but I see no reason
| why this wouldn't be incredibly powerful for code-based problem
| solving.
| cadamsdotcom wrote:
| And it should be powerful for breaking down reasoning chains of
| thought too.
| criley2 wrote:
| Imo current models can already break things up into bite-sized
| pieces. The limiters I've seen are twofold:
|
| 1) Maintaining context of the overall project and goals while
| working in the weeds on a subtask of a task on an epic (so to
| speak) both in terms of what has been accomplished already and
| what still needs to be accomplished
|
| and 2) Getting an agentic coding tool which can actually handle
| the scale of doing 50 small projects back to back. With these
| agentic tools I find they start projects off really strong but
| by task #5 they're just making a mess with every change.
|
| I've played with keeping basically a dev-progress.md file and
| implementation-plan.md file that I keep in context for every
| request and end each task by updating files. But me manually
| keeping all this context isn't solving all my problems.
|
| And all the while, tools like Cline are gobbling up 2M tokens
| to make small changes.
| jhrmnn wrote:
| > Maintaining context of the overall project and goals while
| working in the weeds on a subtask of a task on an epic (so to
| speak) both in terms of what has been accomplished already
| and what still needs to be accomplished
|
| This is a struggle for every human I've ever worked with
| pertymcpert wrote:
| Yes. I wonder if the path forward will be to create systems
| of agents that work as a team, with an "architect" or
| "technical lead" AI directing the work of more specialized
| execution AIs. This could alleviate the issue of context
| pollution as the technical lead doesn't have to hold all of
| the context when working on a small problem, and vice
| versa.
|
| Shit. Do we need agile AI now?
| Rudybega wrote:
| This is kind of what the modes in roo code do now. I'm
| having great success with them and having them as a
| default just rolled out a couple days ago.
|
| There are a default set of modes (orchestrator, code,
| architect, debug, and ask) and you can create your own
| custom ones (or have roo do it for you, which is kind of
| a fun meta play).
|
| Orchestrator basically consults the others and uses them
| when appropriate, feeding in a sensible amount of task
| definition and context into the sub task. You can use
| different LLMs for different modes as well (I like Gemini
| 2.5 Pro for most of the thinking style ones and gpt
| o4-mini for the coding).
|
| I've done some reasonably complicated things and haven't
| really had an orchestrator task creep past ~400k tokens
| before I was finished and able to start a new task.
|
| There are some people out there who do really cool stuff
| with memory banks (basically logging and progress
| tracking), but I haven't played a ton with that yet.
|
| Basic overview:
| https://docs.roocode.com/features/boomerang-tasks
|
| Custom Modes: https://docs.roocode.com/features/custom-
| modes
| bearjaws wrote:
| It's actually pretty hilarious how far into detail they can go.
|
| For example, I made a bot you could give a problem
| statement, and it would return an array of steps to
| accomplish it.
|
| Then you could take the steps, and click on them to break them
| down and add them to the list. If you just kept clicking you
| would get to excruciating detail.
|
| For example, taking out the trash can turn into ~70 individual
| steps if you really drill into the details.
|
| Some of the steps:
|
| Stand close to the trash can - Position yourself so you have
| stable footing and easy access.
|
| Place one hand on the rim of the can - Use your non-dominant
| hand to press lightly on the edge of the trash can to hold it
| in place.
|
| Grip the top edge of the bag with your other hand - Find the
| part of the bag that extends past the rim.
|
| Gently lift the bag upward - While your one hand stabilizes the
| can, slowly pull the bag up with the other.
|
| Tilt the can slightly if needed - If the bag sticks or creates
| suction, rock or tilt the can slightly while continuing to
| lift.
|
| Avoid jerking motions - Move steadily to prevent tears or
| spills
| eightysixfour wrote:
| This used to be part of one of the intro to engineering
| courses at my school - write an XX page document describing
| how to make a peanut butter and jelly sandwich.
| lugu wrote:
| This is how I imagine llms are used in robotics, with one or
| two more levels of description.
| thrance wrote:
| This feels like a manual for infiltrated aliens: "How to pass
| as humans, Vol. I"
| roywiggins wrote:
| or for goblins:
|
| https://goblin.tools/
| jasonjmcghee wrote:
| Super interesting that they chose 671B and 7B - nothing like a
| 32B, which feels like a "sweet spot".
| bredren wrote:
| Also notable: the earliest planning for a well-received
| release of a new model might include both parameter-based and
| skill-type market segmentation.
|
| --> "In an increasingly crowded field of LLMs, how will our
| (costly to produce) model stand out?"
| SweetSoftPillow wrote:
| I feel like this is a very logical way to do things. Test the
| hypothesis on a small model, play around, get it working, then
| apply the findings to the big model.
| qoez wrote:
| The best part about these is that I know the weights are static,
| so I know I won't suddenly have to deal with a sassy, unusable
| update for a week.
| Implicated wrote:
| Or, like with Claude, it being effectively lobotomized during
| North American "business" hours. 3am PST? Cracked. 8am PST? ...
| mentally challenged.
| navanchauhan wrote:
| This is pretty interesting. Do you have more information
| about this?
| devoutsalsa wrote:
| I'm pretty sure the parent comment is referring to capacity
| constraints. When the Americans come online in the morning,
| Claude frequently can't keep up with demand and error
| messages saying the system is at capacity are common.
| ekez wrote:
| I wonder if the authors have tried incorporating error feedback
| from Lean into their models.
|
| Work from 2023 [1] showed general-purpose models did better when
| they were able to incorporate error feedback - humans incorporate
| error feedback - but none of the SOTA models on MiniF2F seem to.
|
| [1]: https://arxiv.org/abs/2310.04353
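The feedback loop the comment has in mind is a generate-check-repair cycle: propose a proof, run it through Lean, and feed any compiler error back into the next attempt. The sketch below mocks out both the model and the Lean checker; the function names and the canned proofs are invented for illustration.

```python
# Sketch of proof search with error feedback from a checker.

def generate_proof(statement, feedback=None):
    # Stand-in for a model call; a real system would append the
    # checker's error message to the prompt on retry.
    return "by simp" if feedback else "by sorry"

def lean_check(proof):
    # Mock checker: returns an error string, or None on success.
    if proof == "by simp":
        return None
    return "error: declaration uses 'sorry'"

def prove_with_feedback(statement, max_rounds=3):
    """Retry up to max_rounds, feeding each error back to the model."""
    feedback = None
    for _ in range(max_rounds):
        proof = generate_proof(statement, feedback)
        feedback = lean_check(proof)
        if feedback is None:
            return proof  # verified by the checker
    return None

print(prove_with_feedback("theorem t : 1 + 1 = 2"))  # by simp
```

In the mocked run, the first attempt fails, the error is fed back, and the second attempt passes - which is exactly the behavior the cited 2023 paper found helpful for general-purpose models.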
| revskill wrote:
| The way intelligence works, to me, is more about:
|
| - Making correct and smart assumptions. Currently all LLM bots
| are too stupid at making good assumptions. I don't want to
| explicitly repeat my own assumptions again and again when the
| context is clear enough. Hey bots, try harder.
|
| - The LLM bot needs to bring its own secondary and contextual
| memory into its reasoning; I don't want to do it for you, OK?
| You're the bot.
|
| - Thinking outside the box. This is the final stage of
| intelligence: adapting old techniques into your own technique to
| solve problems that didn't exist before.
| nthingtohide wrote:
| I propose that human-AI interaction data be made public. This
| is our collective Wikipedia of the AI era. Otherwise our progress
| will be a blank line after 2022 - just as the Egyptians didn't
| write down their process for moving giant rocks.
| whatshisface wrote:
| How much education would a human need to perform at this level on
| the benchmarks?
| pama wrote:
| Learning to formalize math and then prove Putnam competition
| problems rigorously in Lean would require a mid-to-advanced
| college-level math and CS background. (Learning to do a
| small fraction of the Putnam competition without using Lean
| probably only needs strong high-school math and early undergrad
| math, with competition training a strong bonus.)
| Fokamul wrote:
| >"an open-source large language model"
|
| Is it really open source? Did something change?
| eightysixfour wrote:
| Everyone calls "open weight" models "open source" at this
| point. It is wrong, but we'll have to find another way to fight
| this fight. Maybe "open data and pipeline" or something.
| mcshicks wrote:
| You can run the model for free on openrouter.ai. I have played
| around with Lean, slowly working my way through Mathematics in
| Lean. I was stuck on this problem in section 3.6:
|
| theorem convergesTo_unique {s : ℕ → ℝ} {a b : ℝ} (sa :
| ConvergesTo s a) (sb : ConvergesTo s b) :
|
| For fun I tried it on the free model on openrouter.ai. Got the
| answer the first time.
|
| https://leanprover-community.github.io/mathematics_in_lean/m...
|
| Here's the answer just to give you a feel.
| by_contra h
| have h1 : a ≠ b := h
| have h2 : |a - b| > 0 := by
|   apply abs_pos.mpr
|   exact sub_ne_zero.mpr h1
| -- Use the definition of convergence to find N1 and N2
| have h3 := sa (|a - b| / 2) (by linarith)
| have h4 := sb (|a - b| / 2) (by linarith)
| cases' h3 with N1 h3
| cases' h4 with N2 h4
| -- Choose N to be the maximum of N1 and N2
| let N := max N1 N2
| have h5 := h3 N (by simp [N, le_max_left])
| have h6 := h4 N (by simp [N, le_max_right])
| -- Derive a contradiction using the triangle inequality
| have h7 : |s N - a| < |a - b| / 2 := by simpa using h5
| have h8 : |s N - b| < |a - b| / 2 := by simpa using h6
| have h9 : |a - b| < |a - b| := by
|   calc |a - b| = |a - s N + (s N - b)| := by ring_nf
|     _ ≤ |a - s N| + |s N - b| := by apply abs_add
|     _ = |s N - a| + |s N - b| := by rw [abs_sub_comm]
|     _ < |a - b| / 2 + |a - b| / 2 := by linarith
|     _ = |a - b| := by ring
| linarith
___________________________________________________________________
(page generated 2025-04-30 23:00 UTC)