[HN Gopher] ProofOfThought: LLM-based reasoning using Z3 theorem...
       ___________________________________________________________________
        
       ProofOfThought: LLM-based reasoning using Z3 theorem proving
        
       https://arxiv.org/abs/2409.17270
        
       Author : barthelomew
       Score  : 143 points
       Date   : 2025-10-04 18:34 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | LASR wrote:
       | This is an interesting approach.
       | 
        | My team has been prototyping something very similar: encoding
        | business operations policies in LEAN. We have some internal
       | knowledge bases (google docs / wiki pages) that we first convert
       | to LEAN using LLMs.
       | 
       | Then we run the solver to verify consistency.
       | 
        | When a wiki page is changed, the process is run again, so it's
        | essentially a linter for the process.
       | 
       | Can't say it moved beyond the prototyping stage though, since the
       | LEAN conversion does require some engineers to look through it at
       | least.
       | 
       | But a promising approach indeed, especially when you have a
       | domain that requires tight legal / financial compliance.
        
         | barthelomew wrote:
         | The autoformalization gap is pretty difficult to bridge indeed.
         | We explored uncertainty quantification of autoformalization on
          | well-defined grammars in our NeurIPS 2025 paper:
          | https://arxiv.org/abs/2505.20047
         | 
          | If you ever feel like discussing more details, I'm happy to
          | chat!
        
         | viraptor wrote:
         | Could you share an example of such policy? I'm struggling to
         | think of something defined well enough in the real world to
         | apply in Lean.
        
       | sigmoid10 wrote:
       | I always find it amazing how many people seem to fail to use
       | current LLMs to the fullest, even though they apparently work
       | with them in research settings. This benchmark pipeline simply
       | calls the OpenAI API and then painstakingly tries to parse the
       | raw text output into a structured json format, when in reality
       | the OpenAI API has supported structured outputs for ages now.
       | That already ensures your model generates schema compliant output
       | without hallucinating keys at the inference level. Today all the
       | major providers support this feature either directly or at least
        | indirectly via function calling. And if you run open models, you
        | can literally write inference engines yourself that adhere to
        | arbitrary schemas (i.e. not limited to json behind the scenes)
        | with rather manageable effort. I'm constantly using this in my
        | daily work, and I'm always baffled when people tell me about
        | their hallucination problems, because so many of them can be
        | fixed trivially these days.
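        | 
        | For example, instead of parsing raw text you can have the API
        | enforce a schema at decoding time. A minimal sketch with the
        | OpenAI Python SDK (model and field names are placeholders):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     schema = {
        |         "type": "object",
        |         "properties": {
        |             "answer": {"type": "boolean"},
        |             "justification": {"type": "string"},
        |         },
        |         "required": ["answer", "justification"],
        |         "additionalProperties": False,
        |     }
        |     resp = client.chat.completions.create(
        |         model="gpt-4o",
        |         messages=[{"role": "user", "content": "Is 17 prime?"}],
        |         response_format={
        |             "type": "json_schema",
        |             "json_schema": {"name": "verdict", "strict": True,
        |                             "schema": schema},
        |         },
        |     )
        |     # already schema-compliant JSON, no hand-rolled parsing
        |     print(resp.choices[0].message.content)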
        
         | atrus wrote:
          | I wouldn't find it amazing: there are so many new models,
          | features, and ways to use models that the minute you pause to
          | take a deep dive into something specific, 43 other things have
          | already passed you by.
        
           | sigmoid10 wrote:
           | I would agree if you are a normal dev who doesn't work in the
           | field. But even then reading the documentation once a year
           | would have brought you insane benefits regarding this
           | particular issue. And for ML researchers there is no excuse
           | for stuff like that at this point.
        
         | retinaros wrote:
          | Yes, this can also improve said reasoning.
        
           | sigmoid10 wrote:
           | The secret the big companies don't want to tell you is that
           | you can turn all their models into reasoning models that way.
           | You even have full control over the reasoning process and can
           | make it adhere to a specific format, e.g. the ones used in
           | legal settings. I've built stuff like that using plain old
           | gpt-4o and it was even better than the o series.
        
         | jssmith wrote:
          | I see JSON parse errors on occasion when using OpenAI
         | structured outputs that resolve upon retry. It seems it's
         | giving instructions to the LLM but validation is still up to
         | the caller. Wondering if others see this too.
        
           | barthelomew wrote:
           | Hey, yes! This is because the DSL (Domain Specific Language)
            | is pretty complex, and the LLM finds it hard. We prototyped a
           | much more effective version using SMT in our NeurIPS 2025
           | paper (https://arxiv.org/abs/2505.20047). We shall soon open
           | source that code!
        
           | sigmoid10 wrote:
           | Depends on how strictly you define your types. Are you using
           | pydantic to pass the information to the API? There are a few
           | pitfalls with this, because not everything is fully supported
           | and it gets turned into json behind the scenes. But in
           | principle, the autoregressive engine will simply not allow
           | tokens that break the supplied schema.
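            | 
            | A minimal sketch of the pydantic route (model and field names
            | are placeholders; the parse helper lives in the SDK's beta
            | namespace):
            | 
            |     from pydantic import BaseModel
            |     from openai import OpenAI
            | 
            |     class Verdict(BaseModel):
            |         reasoning_steps: list[str]
            |         answer: bool
            | 
            |     client = OpenAI()
            |     completion = client.beta.chat.completions.parse(
            |         model="gpt-4o",
            |         messages=[{"role": "user",
            |                    "content": "Is 51 prime?"}],
            |         response_format=Verdict,
            |     )
            |     # parsed into a validated Verdict instance
            |     verdict = completion.choices[0].message.parsed
            |     print(verdict.answer, verdict.reasoning_steps)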
        
             | striking wrote:
             | Not sure if I've been using it wrong but I've tried using
             | the Zod-to-structured-output helper with GPT-5 and often
             | gotten weird stuff like trailing commas that break a parse
              | or multiple JSON objects in the same response.
             | 
             | Ultimately there are still going to be bugs. For this
             | reason and several others you'll still need it wrapped in a
             | retry.
        
               | sigmoid10 wrote:
               | Yeah that sounds 100% like a user or middleware issue.
               | Don't bother with these wrappers, they are always
               | outdated anyways. Learn how to use the API directly, it
               | will save you a ton of headaches. And it's really not
               | that hard.
        
               | striking wrote:
               | No, we're using the OpenAI vendored version of zod-to-
               | json-schema via https://github.com/transitive-
               | bullshit/openai-zod-to-json-sc..., and applying it
               | directly to the `json_schema` field of the OpenAI API.
               | Maybe we have a subtle bug somewhere but I'd expect a 400
               | response if we were genuinely sending a malformed
               | request.
        
           | eric-burel wrote:
           | Yep from time to time.
        
         | barthelomew wrote:
          | Hey there! I designed and wrote most of the actual
         | interpreter during my internship at Microsoft Research last
         | summer. Constrained decoding for GPT-4 wasn't available when we
         | started designing the DSL, and besides, creating a regex to
         | constrain this specific DSL is quite challenging.
         | 
         | When the grammar of the language is better defined, like SMT
            | (https://arxiv.org/abs/2505.20047), we are able to do this
         | with open source LLMs.
        
           | sigmoid10 wrote:
           | What are you talking about? OpenAI has supported structured
            | json output in the API since 2023. The current structured
            | output API was only introduced in summer 2024, but it was
            | primarily a usability improvement that still runs json
            | behind the scenes.
        
             | dang wrote:
             | > _What are you talking about?_
             | 
             | Please edit out swipes like this from your HN comments--
             | this is in the site guidelines:
             | https://news.ycombinator.com/newsguidelines.html. It comes
             | across as aggressive, and we want _curious_ conversation
             | here.
             | 
             | Your comment would be fine without that bit.
        
               | sigmoid10 wrote:
                | This wasn't meant as a snide remark; I'm genuinely
                | confused about whether I might have misunderstood the
                | problem here, because the solution would be so obvious.
        
               | dang wrote:
               | I believe you! but when an internet reply leads with
               | "what are you talking about?", it's likely to pattern-
               | match this way for many readers. If that's not your
               | intent, it's best to use an alternate wording.
        
             | barthelomew wrote:
             | You're right about the 2023 JSON mode, but our project
              | required enforcing a much more complex DSL grammar (see
              | the Appendix for details), not just ensuring a *valid JSON
             | object*. The newer structured output APIs are a significant
             | improvement, but the earlier tools weren't a fit for the
             | specific constraints we were working under at the time.
        
         | IanCal wrote:
         | I'd also be surprised if the models are better at writing code
          | in some custom schema (assuming that's not z3's native
         | structure) than writing code in something else. Decent models
         | can write pretty good code and for a lot of mistakes can fix
         | them, plus you get testing/etc setups for free.
        
         | eric-burel wrote:
          | It's a relatively new feature. Also, people need actual
          | professional training to become true LLM developers who use
          | these models to their fullest, not just developers who happen
          | to call an LLM API here and there. That takes a lot of time
          | and effort.
        
       | ivanbakel wrote:
       | The repo is sparse on the details unless you go digging, which
       | perhaps makes sense if this is just meant as the artifact for the
       | mentioned paper.
       | 
       | Unless I'm wrong, this is mainly an API for trying to get an LLM
       | to generate a Z3 program which "logically" represents a real
       | query, including known facts, inference rules, and goals. The
       | "oversight" this introduces is in the ability to literally read
       | the logical statement being evaluated to an answer, and running
       | the solver to see if it holds or not.
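        | 
        | A toy example of the kind of program that gets generated (my own
        | z3py sketch, not taken from the repo):
        | 
        |     from z3 import Bool, Implies, Not, Solver, unsat
        | 
        |     rains, wet = Bool("rains"), Bool("ground_wet")
        |     s = Solver()
        |     s.add(Implies(rains, wet))  # inference rule
        |     s.add(rains)                # known fact
        |     # the goal holds iff its negation is unsatisfiable
        |     print(s.check(Not(wet)) == unsat)  # True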
       | 
       | The natural source of doubt is: who's going to read a bunch of
       | SMT rules manually and be able to accurately double-check them
       | against real-world understanding? Who double checks the
       | constants? What stops the LLM from accidentally (or deliberately,
       | for achieving the goal) adding facts or rules that are unsound
       | (both logically and from a real-world perspective)?
       | 
       | The paper reports a *51%* false positive rate on a logic
       | benchmark! That's shockingly high, and suggests the LLM is either
       | bad at logical models or keeps creating unsoundnesses. Sadly, the
       | evaluation is a bit thin on the ground about how this stacks up,
       | and what causes it to fall short.
        
         | barthelomew wrote:
         | Yep. The paper was written last year with GPT-4o. Things have
         | become a lot better since then with newer models.
         | 
          | E.g. in Tab 1 of https://arxiv.org/pdf/2505.20047 we compare
          | performance on text-only vs SMT-only. o3-mini does pretty well
          | at mirroring its text reasoning in its SMT, vs Gemini Flash
          | 2.0.
          | 
          | Illustrations of this can be seen in Figs 14 and 15 on page 29.
         | 
         | In commercially available products like AWS Automated Reasoning
         | Checks, you build a model from your domain (e.g. from a PDF
         | policy document), cross verify it for correctness, and during
         | answer generation, you only cross check whether your Q/A pairs
         | from the LLM comply with the policy using a solver with
         | guarantees.
         | 
          | This lets them give you a 99%+ soundness guarantee, which
          | basically means that if the service says the Q/A pair is
          | valid or guaranteed w.r.t. the policy, it is right more than
          | 99% of the time.
         | 
         | https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
        
       | measurablefunc wrote:
        | This is proof of verifiable logic. Computers cannot think, so
        | calling it proof of thought misrepresents what's actually
        | happening.
        
         | aSanchezStern wrote:
         | I agree that "proof of thought" is a misleading name, but this
         | whole "computers can't think" thing is making LLM skepticism
         | seem very unscientific. There is no universally agreed upon
         | objective definition of what it means to be able to "think" or
         | how you would measure such a thing. The definition that these
         | types of positions seem to rely upon is "a thing that only
         | humans can do", which is obviously a circular one that isn't
         | useful.
        
           | measurablefunc wrote:
           | If you believe computers can think then you must be able to
           | explain why a chain of dominoes is also thinking when I
           | convert an LLM from transistor relay switches into the domino
           | equivalent. If you don't fall for the marketing hype & study
           | both the philosophical & mathematical literature on
           | computation then it is obvious that computers (or any
           | mechanical gadget for that matter) can not qualify for any
           | reasonable definition of "thinking" unless you agree that all
           | functionally equivalent manifestations of arithmetic must be
           | considered "thinking", including cascading dominoes that
           | correspond to the arithmetic operations in an LLM.
        
             | bobxmax wrote:
             | And your definition of thinking is?
        
               | measurablefunc wrote:
               | Not arithmetic or boolean algebra. What's your
               | definition?
        
             | supern0va wrote:
             | >If you believe computers can think then you must be able
             | to explain why a chain of dominoes is also thinking when I
             | convert an LLM from transistor relay switches into the
             | domino equivalent.
             | 
             | Sure, but if you assume that physical reality can be
             | simulated by a Turing machine, then (computational
             | practicality aside) one could do the same thing with a
             | human brain.
             | 
             | Unless you buy into some notion of magical thinking as
             | pertains to human consciousness.
        
               | measurablefunc wrote:
               | No magic is necessary to understand that carbon & silicon
               | are not equivalent. The burden of proof is on those who
               | think silicon can be a substitute for carbon & all that
               | it entails. I don't buy into magical thinking like Turing
               | machines being physically realizable b/c I have studied
               | enough math & computer science to not be confused by
               | abstractions & their physical realizations.
        
           | Terr_ wrote:
           | > this whole "computers can't think" thing is making LLM
           | skepticism seem very unscientific.
           | 
           | It's just shorthand for "that's an extraordinary claim and
           | nobody has provided any remotely extraordinary evidence to
           | support it."
        
         | chpatrick wrote:
         | Do you understand human thinking well enough to determine what
         | can think and what can't? We have next to no idea how an
         | organic brain works.
        
           | measurablefunc wrote:
           | I understand computers, software, & the theory of computation
           | well enough to know that there is no algorithm or even a
           | theoretical algorithmic construction that can be considered
           | thought. Unless you are willing to concede that thinking is
           | nothing more than any number of models equivalent to a Turing
           | machine, e.g. lambda calculus, Post systems, context aware
           | grammars, carefully laid out dominoes, permutations of bit
           | strings, etc. then you must admit that computers are not
           | thinking. If you believe computers are thinking then you must
           | also admit dominoes are thinking when falling in a cascading
           | chain.
        
       | nextos wrote:
       | This is a very interesting area of research. I did something
       | similar a couple of years ago using logic and probabilistic logic
       | inference engines to make sure conclusions followed from
       | premises.
       | 
       | I also used agents to synthesize, formalize, and criticize domain
       | knowledge. Obviously, it is not a silver bullet, but it does
       | ensure some degree of correctness.
       | 
       | I think introducing some degree of symbolism and agents-as-a-
       | judge is a promising way ahead, see e.g.:
       | https://arxiv.org/abs/2410.10934
        
         | barthelomew wrote:
          | Yep! I have read your work! Pretty cool! I also worked on a
          | deep research agent for autoformalization this summer at AWS
          | ARChecks, building on similar patterns.
         | 
         | Although that work is not public, you can play with the
         | generally available product here!
         | 
         | [1] https://aws.amazon.com/blogs/aws/minimize-ai-
         | hallucinations-...
        
         | CuriouslyC wrote:
         | Agent/LLM as a judge is biased and only good for bootstrapping.
          | As capabilities get better, LLM-as-a-judge will artificially
          | cap your performance; you need to graduate to either expert
          | human judges or deterministic oracles.
        
           | jebarker wrote:
           | Why does this have to be true? For example, if you have a
           | different LLM that is judging than the one being judged then
           | their biases could at least be different. Also, as their
           | reasoning abilities improve wouldn't LLM judges approach the
           | abilities of human judges?
        
             | CuriouslyC wrote:
             | LLMs have positional, response length and hedge word biases
             | (and that's just what's rigorously demonstrated in papers)
             | that wash out differences between high performing answers
             | as you approach the limit of your objective. Imagine if you
             | were trying to optimize a function and the measurement
              | function emitted random biased noise: at some point you
             | wouldn't be able to accurately identify the impact of your
             | changes.
        
             | sdenton4 wrote:
             | Indeed - human judges suck on average. And you can prompt
             | an llm judge to look for particular kinds of problems, then
             | throw the ensemble of judges at an output to nitpick.
             | (Essentially, bake in a diversity of biases through a
             | collection of prompts.)
        
       | nakamoto_damacy wrote:
       | LLMs lack logical constraints in the generative process; they
       | only learn probabilistic constraints. If you apply logic
       | verification post-hoc, you're not "ensuring the correctness of
        | your LLM's reasoning" (I went down this path a year ago); you're
        | classifying whether the LLM's statistically driven pattern
        | generation happens to correspond to correct logic or not, where
        | the LLM's output may be wrong 100% of the time, and your theorem
       | prover simply acts as a classifier, ensuring nothing at all.
        
         | barthelomew wrote:
          | Yep, this is a genuine problem, and this is what we term the
          | autoformalization gap in our follow-up paper.
         | (https://arxiv.org/abs/2505.20047)
         | 
         | Some LLMs are more consistent between text and SMT, while
         | others are not. (Tab 1, Fig 14,15)
         | 
         | You can do uncertainty quantification with selective
          | verification to reduce the "risk", e.g. shown as the Area
         | Under the Risk Coverage Curve in Tab 4.
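          | 
          | The bookkeeping behind that is easy to sketch (this is just the
          | generic selective-prediction recipe, not our exact code):
          | 
          |     def risk_coverage(confidences, correct):
          |         """Sort by confidence; at each cutoff report
          |         (coverage, risk among accepted answers)."""
          |         order = sorted(range(len(correct)),
          |                        key=lambda i: -confidences[i])
          |         errors, curve = 0, []
          |         for k, i in enumerate(order, start=1):
          |             errors += 0 if correct[i] else 1
          |             curve.append((k / len(order), errors / k))
          |         return curve  # AURC = area under this curve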
        
         | avmich wrote:
         | Probabilistic constraints are all around us. You learn that the
         | sine function is the ratio of the length of the side of the
         | right triangle opposite to the angle to the length of the side
         | opposite to the right angle, so obviously the sine is always
         | positive. Yet your thinking should be flexible enough to allow
         | changing the definition to the ordinate of the point on the
         | unit circle where the line corresponding to the given angle and
         | drawn from zero intersects that circle. So your knowledge - the
         | symbolic one - can also be probabilistic.
        
           | nakamoto_damacy wrote:
           | You're thinking along the right track but without
            | formalization it goes nowhere fast. By layering differential
            | geometry on top of probability, and then maybe category-
            | theoretic logic on top of that, each layer constraining the
            | one below it and all layers cohering, you get somewhere...
            | There is work that's been done in this area; I was recently
            | interviewed by a journalist who published a high-level
            | article on it in Forbes ("Why LLMs are failing"), which
            | links to the actual technical work (first my high-level
            | presentation, then Prof. L. Thorne McCarty's work):
           | https://www.forbes.com/sites/hessiejones/2025/09/30/llms-
           | are...
        
       | zwnow wrote:
       | Reasoning? LLMs can not reason, why is it always assumed they
       | reason? They mimic reasoning.
        
         | elcomet wrote:
         | How can you know?
        
         | moffkalast wrote:
         | It's so funny to me that people are still adamant about this
         | like two years after it's become a completely moot point.
        
           | zwnow wrote:
           | Experts are adamant about this. Just take a look at
           | https://youtu.be/iRqpsCHqLUI
        
             | moffkalast wrote:
             | No such thing as an expert consensus on anything about LLMs
             | these days, just different forms of grift.
             | 
             | My point is, the question if an LLM reasons the same way a
             | human does is about as useful as "does a submarine swim" or
             | "can a telephone talk". The results speak for themselves.
        
               | zwnow wrote:
               | LLMs do not reason. Not hard to understand.
        
               | moffkalast wrote:
               | Idk, before this people from your camp were saying LLMs
               | can't even understand anything. Always moving the
               | goalposts. Then it'll be they can't feel or can't
               | something else. Anyway, it won't matter.
        
           | Terr_ wrote:
            | The normative importance of a fact may _increase_ when more
            | people start willfully ignoring it for shorter-term profit.
           | 
           | Imagine somebody in 2007: "It's so funny to me that people
            | are still adamant about _mortgage default risk_ after it's
           | become a completely moot point because nobody cares in this
           | housing market."
        
       | dehsge wrote:
        | LLMs and their output are bounded by Rice's theorem. This is not
        | going to ensure correctness; it's just going to validate that the
        | model can produce an undecidable result.
        
       | everdrive wrote:
       | I'm honestly confused why we can't determine how LLMs come to
       | their decisions in the general sense. Is it not possible to log
       | every step as the neural network / vector db / magic happens? Is
       | it merely impractical, or is it actually something that's
       | genuinely difficult to do?
        
         | chpatrick wrote:
         | Everything happens in an opaque super-high-dimensional
         | numerical space that was "organically grown" not engineered, so
         | we don't really understand what's going on.
        
         | NotGMan wrote:
          | GPT-4 allegedly has 1.8 trillion parameters.
          | 
          | Imagine having a bunch of 2D matrices with a combined 1.8
          | trillion total numbers, from which you pick out blocks of
          | numbers in a loop and finally merge and combine them to
          | form a token.
         | 
         | Good luck figuring out what number represents what.
        
           | everdrive wrote:
           | Wouldn't that mean it's totally impractical for day-to-day
           | usage, but a researcher or team of researchers could solve
           | this?
        
         | moffkalast wrote:
         | It would be like logging a bunch of random noise from anyone's
         | perspective except the LLM's.
        
         | konmok wrote:
         | My understanding is that it's neither impractical nor genuinely
         | difficult, it's just that the "logging every step" approach
         | provides explanations of their "reasoning" that are completely
         | meaningless to us, as humans. It's like trying to understand
         | why a person likes the color red, but not the color blue, using
         | a database recording the position, makeup, and velocity of
         | every atom in their brain. Theoretically, yes, that should be
         | sufficient to explain their color preferences, in that it fully
         | models their brain. But practically, the explanation would be
         | phrased in terms of atomic configurations in a way that makes
         | much less sense to us than "oh, this person likes red because
         | they like roses".
        
       | tonerow wrote:
       | Cool research! I went to the repo to see what the DSL looked like
       | but it was hard to find a clear example. It would be cool if you
       | added a snippet to the README.
        
         | barthelomew wrote:
         | Hey! Thank you for the interest! I shall do that. Meanwhile,
         | check out Page 11 onwards. We describe a lot of situations!
         | (https://arxiv.org/pdf/2409.17270)
        
       | tannhaeuser wrote:
       | LLMs are statistical language models (d'uh) not reasoners after
       | all. I found generating logic programs, and Prolog source
       | specifically, to work unreasonably well, though [1], maybe
       | because Prolog was introduced for symbolic natural language
       | processing and there's a wealth of translation examples in the
       | training set. Might be worth checking out Z3's alternative
       | Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.
       | 
       | [1]: https://quantumprolog.sgml.net/llm-demo/part1.html
       | 
       | [2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
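        | 
        | For a taste of [2] from Python, z3py exposes the same engine
        | through its Fixedpoint API (a transitive-closure toy along the
        | lines of the Z3 guide; not production code):
        | 
        |     from z3 import *
        | 
        |     fp = Fixedpoint()
        |     fp.set(engine='datalog')
        |     s = BitVecSort(8)
        |     edge = Function('edge', s, s, BoolSort())
        |     path = Function('path', s, s, BoolSort())
        |     a, b, c = Consts('a b c', s)
        |     fp.register_relation(edge, path)
        |     fp.declare_var(a, b, c)
        |     fp.rule(path(a, b), edge(a, b))                # base case
        |     fp.rule(path(a, c), [edge(a, b), path(b, c)])  # recursion
        |     n1, n2, n3 = [BitVecVal(i, s) for i in (1, 2, 3)]
        |     fp.fact(edge(n1, n2))
        |     fp.fact(edge(n2, n3))
        |     print(fp.query(path(n1, n3)))  # sat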
        
         | barthelomew wrote:
          | Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in
          | our grammars paper because it allowed the most interoperability
          | with solvers, but our technique also works with Prolog, as
          | tested out at the behest of reviewers at NeurIPS. I would
          | assume that this should also work with Datalog [2].
         | 
         | [1] https://arxiv.org/abs/2505.20047 [2]
         | https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
        
       | chrchr wrote:
       | I had a surprising interaction with Gemini 2.5 Pro that this
       | project reminds me of. I was asking the LLM for help using an
       | online CAS system to solve a system of equations, and the CAS
       | system wasn't working as I expected. After a couple back and
       | forths with Gemini about the CAS system, Gemini just gave me the
       | solution. I was surprised because it's the kind of thing I don't
       | expect LLMs to be good at. It said it used Python's sympy
       | symbolic computation package to arrive at the solution. So, yes,
       | the marriage of fuzzy LLMs with more rigorous tools can have
       | powerful effects.
        
         | TrainedMonkey wrote:
         | Just like humans... we are not so good at hard number
         | crunching, but we can invent computers that are amazing at it.
         | And with a lot of effort we can make a program that uses a
         | whole lot of number crunching to be ok at predicting text but
         | kind of bad at crunching hard numbers. And then that program
         | can predict how to create and use programs which are good at
         | number crunching.
        
           | jonplackett wrote:
            | Maybe the number crunching program the text generation
            | program creates will, with enough effort, become good at
            | generating text, and will in turn make another number
            | crunching computer, and then...
        
           | patcon wrote:
           | I love this kind of thought. Thanks.
        
         | anotherpaulg wrote:
         | I really like LLM+sympy for math. I have the LLM write me a
         | sympy program, so I can trust that the symbolic manipulation is
         | done correctly.
         | 
         | The code is also a useful artifact that can be iteratively
         | edited and improved by both the human and LLM, with git
         | history, etc. Running and passing tests/assertions helps to
         | build and maintain confidence that the math remains correct.
         | 
         | I use helper functions to easily render from the sympy code to
         | latex, etc.
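          | 
          | In practice it's as simple as this kind of snippet (a trivial
          | sketch; the LLM writes something like it and I run it):
          | 
          |     import sympy as sp
          | 
          |     x, y = sp.symbols("x y")
          |     sol = sp.solve([sp.Eq(2*x + y, 7), sp.Eq(x - y, -1)], [x, y])
          |     print(sol)               # {x: 2, y: 3}
          |     print(sp.latex(sol[x]))  # render back to LaTeX for writeups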
         | 
         | A lot of the math behind this quantum eraser experiment was
         | done this way.
         | 
         | https://github.com/paul-gauthier/entangled-pair-quantum-eras...
        
         | DrewADesign wrote:
         | I get having it walk you through figuring out a problem with a
         | tool: seems like a good idea and it clearly worked even better
         | than expected. But deliberately coaxing an LLM into doing math
         | correctly instead of a CAS because you've got one handy seems
         | like moving apartments with dozens of bus trips rather than
         | taking the bus to a truck rental place, just because you've
         | already got a bus pass.
        
       | 0xWTF wrote:
       | Am I reading this right? Statistical LLM outputs pushed through a
       | formal logic model? Wouldn't that be a case of "crap in, crap
       | out"?
        
         | varispeed wrote:
         | That's subjective. One could argue all the things we invented
          | in the past few thousand years were crap. Life would have been
         | much easier in the caves, albeit shorter.
        
         | avmich wrote:
         | Formal logic serves as a useful filter. In other words, "crap
         | in, filtered crap out" - remember, evolution works with
         | absolutely random, "crap" mutations, which then are "filtered"
         | by the environment.
        
       | Yoric wrote:
        | That is exactly the kind of thing that I hope LLMs will help us
       | achieve before the next AI winter.
        
       | Western0 wrote:
        | I need the same thing with Mizar: https://wiki.mizar.org/
        
       | sytse wrote:
       | So the core idea is to use an LLM to draft reasoning as a
       | structured, JSON domain-specific language (DSL), then
       | deterministically translate that into first-order logic and
       | verify it with a theorem prover (Z3).
       | 
       | Interesting that the final answer is provably entailed (or you
       | get a counterexample), instead of being merely persuasive chain-
       | of-thought.
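        | 
        | Conceptually the deterministic half is a small compiler. A rough
        | sketch (the field names here are made up, not the repo's actual
        | DSL):
        | 
        |     from z3 import Bool, Implies, Not, Solver, unsat
        | 
        |     dsl = {"facts": ["socrates_is_man"],
        |            "rules": [["socrates_is_man", "socrates_is_mortal"]],
        |            "goal": "socrates_is_mortal"}
        | 
        |     atoms = {}
        |     def atom(name):
        |         return atoms.setdefault(name, Bool(name))
        | 
        |     s = Solver()
        |     for fact in dsl["facts"]:
        |         s.add(atom(fact))
        |     for premise, conclusion in dsl["rules"]:
        |         s.add(Implies(atom(premise), atom(conclusion)))
        |     # entailed iff the goal's negation is unsatisfiable
        |     print(s.check(Not(atom(dsl["goal"]))) == unsat)  # True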
        
       ___________________________________________________________________
       (page generated 2025-10-04 23:00 UTC)