[HN Gopher] ProofOfThought: LLM-based reasoning using Z3 theorem...
___________________________________________________________________
ProofOfThought: LLM-based reasoning using Z3 theorem proving
https://arxiv.org/abs/2409.17270
Author : barthelomew
Score : 143 points
Date : 2025-10-04 18:34 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| LASR wrote:
| This is an interesting approach.
|
| My team has been prototyping something very similar: encoding
| business operations policies in Lean. We have some internal
| knowledge bases (Google Docs / wiki pages) that we first convert
| to Lean using LLMs.
|
| Then we run the solver to verify consistency.
|
| When a wiki page is changed, the process is run again and it's
| essentially a linter for process.
|
| Can't say it moved beyond the prototyping stage though, since the
| Lean conversion still requires at least some engineers to look
| through it.
|
| But a promising approach indeed, especially when you have a
| domain that requires tight legal / financial compliance.
| barthelomew wrote:
| The autoformalization gap is pretty difficult to bridge indeed.
| We explored uncertainty quantification of autoformalization on
| well-defined grammars in our NeurIPS 2025 paper:
| https://arxiv.org/abs/2505.20047.
|
| If you ever feel like discussing more details, happy to chat!
| viraptor wrote:
| Could you share an example of such policy? I'm struggling to
| think of something defined well enough in the real world to
| apply in Lean.
| sigmoid10 wrote:
| I always find it amazing how many people seem to fail to use
| current LLMs to the fullest, even though they apparently work
| with them in research settings. This benchmark pipeline simply
| calls the OpenAI API and then painstakingly tries to parse the
| raw text output into a structured json format, when in reality
| the OpenAI API has supported structured outputs for ages now.
| That already ensures your model generates schema compliant output
| without hallucinating keys at the inference level. Today all the
| major providers support this feature either directly or at least
| indirectly via function calling. And if you run open models, you
| can literally write arbitrary schema (i.e. not limited to json
| behind the scenes) adhering inference engines yourself with
| rather manageable effort. I'm constantly using this in my daily
| work and I'm always baffled when people tell me about their
| hallucination problems, because so many of them can be fixed
| trivially these days.
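|
| For reference, a minimal sketch of what I mean, using pydantic
| and the openai SDK's parse helper (illustrative model and field
| names, not this repo's actual schema):
|
|   from openai import OpenAI
|   from pydantic import BaseModel
|
|   # The decoder is constrained to emit exactly this shape, so no
|   # missing or hallucinated keys and no hand-rolled JSON parsing.
|   class Verdict(BaseModel):
|       premises: list[str]
|       conclusion: str
|       entailed: bool
|
|   client = OpenAI()
|   question = ("All men are mortal. Socrates is a man. "
|               "Is Socrates mortal?")
|   completion = client.beta.chat.completions.parse(
|       model="gpt-4o",
|       messages=[{"role": "user", "content": question}],
|       response_format=Verdict,
|   )
|   verdict = completion.choices[0].message.parsed  # a Verdict
|   print(verdict.entailed)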
| atrus wrote:
| I wouldn't find it amazing: there are so many new models,
| features, and ways to use models that the minute you pause to
| take a deep dive into something specific, 43 other things have
| already passed you by.
| sigmoid10 wrote:
| I would agree if you are a normal dev who doesn't work in the
| field. But even then reading the documentation once a year
| would have brought you insane benefits regarding this
| particular issue. And for ML researchers there is no excuse
| for stuff like that at this point.
| retinaros wrote:
| Yes, this can also improve said reasoning.
| sigmoid10 wrote:
| The secret the big companies don't want to tell you is that
| you can turn all their models into reasoning models that way.
| You even have full control over the reasoning process and can
| make it adhere to a specific format, e.g. the ones used in
| legal settings. I've built stuff like that using plain old
| gpt-4o and it was even better than the o series.
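|
| Roughly, the trick is just a schema with an explicit reasoning
| trace in it (hypothetical field names, sketched with pydantic):
|
|   from pydantic import BaseModel
|
|   # Constrained decoding forces the model to fill these fields
|   # in order, so you get a fixed-format reasoning trace for free.
|   class LegalAnalysis(BaseModel):
|       issues: list[str]
|       reasoning_steps: list[str]
|       conclusion: str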
| jssmith wrote:
| I see JSON parse errors on occasion when using OpenAI
| structured outputs that resolve upon retry. It seems the API
| gives instructions to the LLM, but validation is still up to
| the caller. Wondering if others see this too.
| barthelomew wrote:
| Hey, yes! This is because the DSL (Domain Specific Language)
| is pretty complex, and the LLM finds it hard. We prototyped a
| much more effective version using SMT in our NeurIPS 2025
| paper (https://arxiv.org/abs/2505.20047). We shall soon open
| source that code!
| sigmoid10 wrote:
| Depends on how strictly you define your types. Are you using
| pydantic to pass the information to the API? There are a few
| pitfalls with this, because not everything is fully supported
| and it gets turned into json behind the scenes. But in
| principle, the autoregressive engine will simply not allow
| tokens that break the supplied schema.
| striking wrote:
| Not sure if I've been using it wrong, but I've tried using
| the Zod-to-structured-output helper with GPT-5 and often
| gotten weird stuff like trailing commas that break the parse,
| or multiple JSON objects in the same response.
|
| Ultimately there are still going to be bugs. For this
| reason and several others you'll still need it wrapped in a
| retry.
| sigmoid10 wrote:
| Yeah, that sounds 100% like a user or middleware issue.
| Don't bother with these wrappers; they are always
| outdated anyway. Learn how to use the API directly; it
| will save you a ton of headaches. And it's really not
| that hard.
| striking wrote:
| No, we're using the OpenAI vendored version of zod-to-
| json-schema via https://github.com/transitive-
| bullshit/openai-zod-to-json-sc..., and applying it
| directly to the `json_schema` field of the OpenAI API.
| Maybe we have a subtle bug somewhere but I'd expect a 400
| response if we were genuinely sending a malformed
| request.
| eric-burel wrote:
| Yep from time to time.
| barthelomew wrote:
| Hey there! I designed and wrote most of the actual
| interpreter during my internship at Microsoft Research last
| summer. Constrained decoding for GPT-4 wasn't available when we
| started designing the DSL, and besides, creating a regex to
| constrain this specific DSL is quite challenging.
|
| When the grammar of the language is better defined, like SMT
| (https://arxiv.org/abs/2505.20047) - we are able to do this
| with open source LLMs.
| sigmoid10 wrote:
| What are you talking about? OpenAI has supported structured
| JSON output in the API since 2023. The current structured-
| output API was only introduced in summer 2024, but it was
| primarily a usability improvement that still runs JSON behind
| the scenes.
| dang wrote:
| > _What are you talking about?_
|
| Please edit out swipes like this from your HN comments--
| this is in the site guidelines:
| https://news.ycombinator.com/newsguidelines.html. It comes
| across as aggressive, and we want _curious_ conversation
| here.
|
| Your comment would be fine without that bit.
| sigmoid10 wrote:
| This is not meant as a snide remark; I'm genuinely confused
| whether I might have misunderstood the problem here, because
| the solution would be so obvious.
| dang wrote:
| I believe you! but when an internet reply leads with
| "what are you talking about?", it's likely to pattern-
| match this way for many readers. If that's not your
| intent, it's best to use an alternate wording.
| barthelomew wrote:
| You're right about the 2023 JSON mode, but our project
| required enforcing a much more complex DSL grammar (see the
| Appendix for details), not just ensuring a *valid JSON
| object*. The newer structured output APIs are a significant
| improvement, but the earlier tools weren't a fit for the
| specific constraints we were working under at the time.
| IanCal wrote:
| I'd also be surprised if the models are better at writing code
| in some custom schema (assuming that's not Z3's native
| structure) than writing code in something else. Decent models
| can write pretty good code and can fix a lot of their
| mistakes, plus you get testing/etc setups for free.
| eric-burel wrote:
| It's a relatively new feature. Also, people need actual
| professional training to become true LLM developers who use
| these models to their fullest, not just developers who happen
| to call an LLM API here and there. That takes a lot of time
| and effort.
| ivanbakel wrote:
| The repo is sparse on the details unless you go digging, which
| perhaps makes sense if this is just meant as the artifact for the
| mentioned paper.
|
| Unless I'm wrong, this is mainly an API for trying to get an LLM
| to generate a Z3 program which "logically" represents a real
| query, including known facts, inference rules, and goals. The
| "oversight" this introduces is in the ability to literally read
| the logical statement being evaluated to an answer, and running
| the solver to see if it holds or not.
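|
| Concretely, the generated artifact boils down to something like
| this z3py sketch (a toy of my own, not the repo's actual DSL):
|
|   from z3 import Bool, Implies, Not, Solver, unsat
|
|   # "Facts" and "rules" the LLM is supposed to extract from the
|   # natural-language query.
|   is_man = Bool("socrates_is_man")
|   is_mortal = Bool("socrates_is_mortal")
|
|   s = Solver()
|   s.add(is_man)                        # fact
|   s.add(Implies(is_man, is_mortal))    # inference rule
|   s.add(Not(is_mortal))                # negation of the goal
|
|   # unsat: the goal is entailed; sat: Z3 returns a counterexample
|   if s.check() == unsat:
|       print("goal entailed")
|   else:
|       print("counterexample:", s.model())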
|
| The natural source of doubt is: who's going to read a bunch of
| SMT rules manually and be able to accurately double-check them
| against real-world understanding? Who double-checks the
| constants? What stops the LLM from accidentally (or deliberately,
| for achieving the goal) adding facts or rules that are unsound
| (both logically and from a real-world perspective)?
|
| The paper reports a *51%* false positive rate on a logic
| benchmark! That's shockingly high, and suggests the LLM is
| either bad at building logical models or keeps introducing
| unsoundness. Sadly, the evaluation is a bit thin on how this
| happens and what causes it to fall short.
| barthelomew wrote:
| Yep. The paper was written last year with GPT-4o. Things have
| become a lot better since then with newer models.
|
| E.g. in Table 1 of https://arxiv.org/pdf/2505.20047 we compare
| performance on text-only vs SMT-only: o3-mini does pretty well
| at mirroring its text reasoning in its SMT, vs Gemini Flash
| 2.0.
|
| An illustration of this can be seen in Figs. 14 and 15 on page
| 29.
|
| In commercially available products like AWS Automated Reasoning
| Checks, you build a model from your domain (e.g. from a PDF
| policy document), cross-verify it for correctness, and during
| answer generation you only cross-check, using a solver with
| guarantees, whether your Q/A pairs from the LLM comply with the
| policy.
|
| This lets them give you a 99%+ soundness guarantee: if the
| service says a Q/A pair is valid or guaranteed w.r.t. the
| policy, it is right more than 99% of the time.
|
| https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
| measurablefunc wrote:
| This is proof of verifiable logic. Computers cannot think, so
| calling it proof of thought misrepresents what's actually
| happening.
| aSanchezStern wrote:
| I agree that "proof of thought" is a misleading name, but this
| whole "computers can't think" thing is making LLM skepticism
| seem very unscientific. There is no universally agreed upon
| objective definition of what it means to be able to "think" or
| how you would measure such a thing. The definition that these
| types of positions seem to rely upon is "a thing that only
| humans can do", which is obviously a circular one that isn't
| useful.
| measurablefunc wrote:
| If you believe computers can think then you must be able to
| explain why a chain of dominoes is also thinking when I
| convert an LLM from transistor relay switches into the domino
| equivalent. If you don't fall for the marketing hype & study
| both the philosophical & mathematical literature on
| computation then it is obvious that computers (or any
| mechanical gadget for that matter) can not qualify for any
| reasonable definition of "thinking" unless you agree that all
| functionally equivalent manifestations of arithmetic must be
| considered "thinking", including cascading dominoes that
| correspond to the arithmetic operations in an LLM.
| bobxmax wrote:
| And your definition of thinking is?
| measurablefunc wrote:
| Not arithmetic or boolean algebra. What's your
| definition?
| supern0va wrote:
| >If you believe computers can think then you must be able
| to explain why a chain of dominoes is also thinking when I
| convert an LLM from transistor relay switches into the
| domino equivalent.
|
| Sure, but if you assume that physical reality can be
| simulated by a Turing machine, then (computational
| practicality aside) one could do the same thing with a
| human brain.
|
| Unless you buy into some notion of magical thinking as
| pertains to human consciousness.
| measurablefunc wrote:
| No magic is necessary to understand that carbon & silicon
| are not equivalent. The burden of proof is on those who
| think silicon can be a substitute for carbon & all that
| it entails. I don't buy into magical thinking like Turing
| machines being physically realizable b/c I have studied
| enough math & computer science to not be confused by
| abstractions & their physical realizations.
| Terr_ wrote:
| > this whole "computers can't think" thing is making LLM
| skepticism seem very unscientific.
|
| It's just shorthand for "that's an extraordinary claim and
| nobody has provided any remotely extraordinary evidence to
| support it."
| chpatrick wrote:
| Do you understand human thinking well enough to determine what
| can think and what can't? We have next to no idea how an
| organic brain works.
| measurablefunc wrote:
| I understand computers, software, & the theory of computation
| well enough to know that there is no algorithm or even a
| theoretical algorithmic construction that can be considered
| thought. Unless you are willing to concede that thinking is
| nothing more than any number of models equivalent to a Turing
| machine, e.g. lambda calculus, Post systems, context aware
| grammars, carefully laid out dominoes, permutations of bit
| strings, etc. then you must admit that computers are not
| thinking. If you believe computers are thinking then you must
| also admit dominoes are thinking when falling in a cascading
| chain.
| nextos wrote:
| This is a very interesting area of research. I did something
| similar a couple of years ago using logic and probabilistic logic
| inference engines to make sure conclusions followed from
| premises.
|
| I also used agents to synthesize, formalize, and criticize domain
| knowledge. Obviously, it is not a silver bullet, but it does
| ensure some degree of correctness.
|
| I think introducing some degree of symbolism and agents-as-a-
| judge is a promising way ahead, see e.g.:
| https://arxiv.org/abs/2410.10934
| barthelomew wrote:
| Yep! I have read your work! Pretty cool! I also worked on a
| similar deep research agent for autoformalization this summer
| at AWS ARChecks, building on similar patterns.
|
| Although that work is not public, you can play with the
| generally available product here!
|
| [1] https://aws.amazon.com/blogs/aws/minimize-ai-
| hallucinations-...
| CuriouslyC wrote:
| Agent/LLM as a judge is biased and only good for bootstrapping.
| As capabilities get better, LLM-as-a-judge will artificially
| cap your performance; you need to graduate to either expert
| human judges or deterministic oracles.
| jebarker wrote:
| Why does this have to be true? For example, if you have a
| different LLM that is judging than the one being judged then
| their biases could at least be different. Also, as their
| reasoning abilities improve wouldn't LLM judges approach the
| abilities of human judges?
| CuriouslyC wrote:
| LLMs have positional, response-length, and hedge-word biases
| (and that's just what's rigorously demonstrated in papers)
| that wash out differences between high-performing answers
| as you approach the limit of your objective. Imagine trying
| to optimize a function where the measurement function emits
| biased random noise; at some point you wouldn't be able to
| accurately identify the impact of your changes.
| sdenton4 wrote:
| Indeed - human judges suck on average. And you can prompt
| an llm judge to look for particular kinds of problems, then
| throw the ensemble of judges at an output to nitpick.
| (Essentially, bake in a diversity of biases through a
| collection of prompts.)
| nakamoto_damacy wrote:
| LLMs lack logical constraints in the generative process; they
| only learn probabilistic constraints. If you apply logic
| verification post hoc, you're not "ensuring the correctness of
| your LLM's reasoning" (I went down this path a year ago); you're
| classifying whether the LLM's statistically driven pattern
| generation happens to correspond to correct logic or not. The
| LLM's output may be wrong 100% of the time, and your theorem
| prover simply acts as a classifier, ensuring nothing at all.
| barthelomew wrote:
| Yep, this is a genuine problem, and it is what we term the
| autoformalization gap in our follow-up paper
| (https://arxiv.org/abs/2505.20047).
|
| Some LLMs are more consistent between text and SMT, while
| others are not (Table 1, Figs. 14 and 15).
|
| You can do uncertainty quantification with selective
| verification to reduce the "risk", e.g. shown as the Area
| Under the Risk-Coverage Curve in Table 4.
| avmich wrote:
| Probabilistic constraints are all around us. You learn that the
| sine function is the ratio of the length of the side of the
| right triangle opposite to the angle to the length of the side
| opposite to the right angle, so obviously the sine is always
| positive. Yet your thinking should be flexible enough to allow
| changing the definition to the ordinate of the point on the
| unit circle where the line corresponding to the given angle and
| drawn from the origin intersects that circle. So your knowledge - the
| symbolic one - can also be probabilistic.
| nakamoto_damacy wrote:
| You're thinking along the right track but without
| formalization it goes nowhere fast. By layering
| differential geometry on top of probability and then maybe
| category theoretic logic on top of that, each layer
| constraining the one below it, and all layers cohering, you
| get somewhere... There is work that's been done in this area,
| and I was recently interviewed by a journalist who published
| a high level article on it on Forbes (Why LLMs are failing)
| and it links to the actual technical work (at first to my
| high level presentation then Prof. L. Thorne McCarty's work):
| https://www.forbes.com/sites/hessiejones/2025/09/30/llms-
| are...
| zwnow wrote:
| Reasoning? LLMs cannot reason, so why is it always assumed they
| reason? They mimic reasoning.
| elcomet wrote:
| How can you know?
| moffkalast wrote:
| It's so funny to me that people are still adamant about this
| like two years after it's become a completely moot point.
| zwnow wrote:
| Experts are adamant about this. Just take a look at
| https://youtu.be/iRqpsCHqLUI
| moffkalast wrote:
| No such thing as an expert consensus on anything about LLMs
| these days, just different forms of grift.
|
| My point is, the question of whether an LLM reasons the same
| way a human does is about as useful as "does a submarine swim"
| or "can a telephone talk". The results speak for themselves.
| zwnow wrote:
| LLMs do not reason. Not hard to understand.
| moffkalast wrote:
| Idk, before this people from your camp were saying LLMs
| can't even understand anything. Always moving the
| goalposts. Then it'll be they can't feel or can't
| something else. Anyway, it won't matter.
| Terr_ wrote:
| The normative importance of a fact may _increase_ when more
| people start willfully ignoring it for shorter-term
| profit.
|
| Imagine somebody in 2007: "It's so funny to me that people
| are still adamant about _mortgage default risk_ after it's
| become a completely moot point because nobody cares in this
| housing market."
| dehsge wrote:
| LLMs and their outputs are bounded by Rice's theorem. This is
| not going to ensure correctness; it's just going to validate
| that the model can produce an undecidable result.
| everdrive wrote:
| I'm honestly confused why we can't determine how LLMs come to
| their decisions in the general sense. Is it not possible to log
| every step as the neural network / vector db / magic happens? Is
| it merely impractical, or is it actually something that's
| genuinely difficult to do?
| chpatrick wrote:
| Everything happens in an opaque super-high-dimensional
| numerical space that was "organically grown" not engineered, so
| we don't really understand what's going on.
| NotGMan wrote:
| GPT-4 allegedly has 1.8 trillion parameters.
|
| Imagine having a bunch of 2D matrices with a combined 1.8
| trillion total numbers, from which you pick out blocks of
| numbers in a loop and finally merge and combine them to
| form a token.
|
| Good luck figuring out what number represents what.
| everdrive wrote:
| Wouldn't that mean it's totally impractical for day-to-day
| usage, but a researcher or team of researchers could solve
| this?
| moffkalast wrote:
| It would be like logging a bunch of random noise from anyone's
| perspective except the LLM's.
| konmok wrote:
| My understanding is that it's neither impractical nor genuinely
| difficult, it's just that the "logging every step" approach
| provides explanations of their "reasoning" that are completely
| meaningless to us, as humans. It's like trying to understand
| why a person likes the color red, but not the color blue, using
| a database recording the position, makeup, and velocity of
| every atom in their brain. Theoretically, yes, that should be
| sufficient to explain their color preferences, in that it fully
| models their brain. But practically, the explanation would be
| phrased in terms of atomic configurations in a way that makes
| much less sense to us than "oh, this person likes red because
| they like roses".
| tonerow wrote:
| Cool research! I went to the repo to see what the DSL looked like
| but it was hard to find a clear example. It would be cool if you
| added a snippet to the README.
| barthelomew wrote:
| Hey! Thank you for the interest! I shall do that. Meanwhile,
| check out Page 11 onwards. We describe a lot of situations!
| (https://arxiv.org/pdf/2409.17270)
| tannhaeuser wrote:
| LLMs are statistical language models (d'uh) not reasoners after
| all. I found generating logic programs, and Prolog source
| specifically, to work unreasonably well, though [1], maybe
| because Prolog was introduced for symbolic natural language
| processing and there's a wealth of translation examples in the
| training set. Might be worth checking out Z3's alternative
| Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.
|
| [1]: https://quantumprolog.sgml.net/llm-demo/part1.html
|
| [2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
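|
| For the curious, the Datalog-flavoured engine is also reachable
| from z3py; a toy transitive-closure example of my own (not from
| the linked demo):
|
|   from z3 import (BitVecSort, BitVecVal, BoolSort, Const,
|                   Fixedpoint, Function)
|
|   fp = Fixedpoint()
|   fp.set(engine="datalog")
|
|   s = BitVecSort(3)                       # finite domain 0..7
|   edge = Function("edge", s, s, BoolSort())
|   path = Function("path", s, s, BoolSort())
|   a, b, c = (Const(n, s) for n in "abc")
|
|   fp.register_relation(edge, path)
|   fp.declare_var(a, b, c)
|   fp.rule(path(a, b), edge(a, b))                 # base case
|   fp.rule(path(a, c), [edge(a, b), path(b, c)])   # transitivity
|
|   n1, n2, n3 = (BitVecVal(i, s) for i in (1, 2, 3))
|   fp.fact(edge(n1, n2))
|   fp.fact(edge(n2, n3))
|   print(fp.query(path(n1, n3)))           # sat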
| barthelomew wrote:
| Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in
| our grammars paper because it allowed the most interoperability
| with solvers, but our technique also works with Prolog, as we
| tested out at the behest of reviewers at NeurIPS. I would
| assume that this should also work with Datalog [2].
|
| [1] https://arxiv.org/abs/2505.20047 [2]
| https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
| chrchr wrote:
| I had a surprising interaction with Gemini 2.5 Pro that this
| project reminds me of. I was asking the LLM for help using an
| online CAS system to solve a system of equations, and the CAS
| system wasn't working as I expected. After a couple back and
| forths with Gemini about the CAS system, Gemini just gave me the
| solution. I was surprised because it's the kind of thing I don't
| expect LLMs to be good at. It said it used Python's sympy
| symbolic computation package to arrive at the solution. So, yes,
| the marriage of fuzzy LLMs with more rigorous tools can have
| powerful effects.
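|
| For anyone curious, that kind of thing is a few lines of sympy
| (toy system here, not the one from my CAS session):
|
|   from sympy import Eq, solve, symbols
|
|   x, y = symbols("x y")
|   # solve the linear system 2x + y = 7, x - y = -1
|   print(solve([Eq(2*x + y, 7), Eq(x - y, -1)], [x, y]))
|   # -> {x: 2, y: 3}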
| TrainedMonkey wrote:
| Just like humans... we are not so good at hard number
| crunching, but we can invent computers that are amazing at it.
| And with a lot of effort we can make a program that uses a
| whole lot of number crunching to be ok at predicting text but
| kind of bad at crunching hard numbers. And then that program
| can predict how to create and use programs which are good at
| number crunching.
| jonplackett wrote:
| Maybe the number crunching program the text generation
| program creates will, with enough effort, become good at
| generating text, and will in turn make another number
| crunching computer, and then...
| patcon wrote:
| I love this kind of thought. Thanks.
| anotherpaulg wrote:
| I really like LLM+sympy for math. I have the LLM write me a
| sympy program, so I can trust that the symbolic manipulation is
| done correctly.
|
| The code is also a useful artifact that can be iteratively
| edited and improved by both the human and LLM, with git
| history, etc. Running and passing tests/assertions helps to
| build and maintain confidence that the math remains correct.
|
| I use helper functions to easily render from the sympy code to
| latex, etc.
|
| A lot of the math behind this quantum eraser experiment was
| done this way.
|
| https://github.com/paul-gauthier/entangled-pair-quantum-eras...
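|
| The shape of it is roughly this (illustrative snippet, not from
| the linked repo):
|
|   from sympy import cos, diff, latex, simplify, sin, symbols
|
|   x = symbols("x")
|   expr = sin(x) ** 2
|   d = diff(expr, x)              # LLM-written symbolic step
|
|   # assertion that keeps the math honest as the code evolves
|   assert simplify(d - 2 * sin(x) * cos(x)) == 0
|
|   print(latex(d))                # paste-able into the writeup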
| DrewADesign wrote:
| I get having it walk you through figuring out a problem with a
| tool: seems like a good idea and it clearly worked even better
| than expected. But deliberately coaxing an LLM into doing math
| correctly instead of a CAS because you've got one handy seems
| like moving apartments with dozens of bus trips rather than
| taking the bus to a truck rental place, just because you've
| already got a bus pass.
| 0xWTF wrote:
| Am I reading this right? Statistical LLM outputs pushed through a
| formal logic model? Wouldn't that be a case of "crap in, crap
| out"?
| varispeed wrote:
| That's subjective. One could argue all the things we invented
| in the past few thousand years were crap. Life would have been
| much easier in the caves, albeit shorter.
| avmich wrote:
| Formal logic serves as a useful filter. In other words, "crap
| in, filtered crap out" - remember, evolution works with
| absolutely random, "crap" mutations, which then are "filtered"
| by the environment.
| Yoric wrote:
| That is exactly the kind of thing that I hope LLMs will help us
| achieve before the next AI winter.
| Western0 wrote:
| I need this same thing with Mizar: https://wiki.mizar.org/
| sytse wrote:
| So the core idea is to use an LLM to draft reasoning as a
| structured, JSON domain-specific language (DSL), then
| deterministically translate that into first-order logic and
| verify it with a theorem prover (Z3).
|
| Interesting that the final answer is provably entailed (or you
| get a counterexample), instead of merely being a persuasive
| chain of thought.
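|
| A minimal sketch of that pipeline shape (hypothetical mini-DSL,
| much simpler than the paper's actual one):
|
|   import json
|   from z3 import Bool, Implies, Not, Solver, unsat
|
|   # Step 1: the LLM emits a structured program like this.
|   program = json.loads("""
|   {"facts": ["rains"],
|    "rules": [{"if": "rains", "then": "ground_wet"}],
|    "goal": "ground_wet"}
|   """)
|
|   # Step 2: deterministic translation into Z3 assertions.
|   atoms = {}
|   atom = lambda n: atoms.setdefault(n, Bool(n))
|   s = Solver()
|   for f in program["facts"]:
|       s.add(atom(f))
|   for r in program["rules"]:
|       s.add(Implies(atom(r["if"]), atom(r["then"])))
|
|   # Step 3: the goal is entailed iff facts + rules + not(goal)
|   # is unsatisfiable; otherwise Z3 hands back a counterexample.
|   s.add(Not(atom(program["goal"])))
|   print("entailed" if s.check() == unsat else s.model())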
___________________________________________________________________
(page generated 2025-10-04 23:00 UTC)