[HN Gopher] Claude Code Can Debug Low-Level Cryptography
       ___________________________________________________________________
        
       Claude Code Can Debug Low-Level Cryptography
        
       Author : Bogdanp
       Score  : 128 points
       Date   : 2025-11-01 18:41 UTC (4 hours ago)
        
 (HTM) web link (words.filippo.io)
 (TXT) w3m dump (words.filippo.io)
        
       | qsort wrote:
       | This resonates with me a lot:
       | 
       | > As ever, I wish we had better tooling for using LLMs which
       | didn't look like chat or autocomplete
       | 
       | I think part of the reason why I was initially more skeptical
       | than I ought to have been is because chat is such a garbage
       | modality. LLMs started to "click" for me with Claude Code/Codex.
       | 
       | A "continuously running" mode that would ping me would be
       | interesting to try.
        
         | cmrdporcupine wrote:
         | I absolutely agree with this sentiment as well and keep coming
         | back to it. What I want is more of an _actual_ copilot which
         | works in a more paired way and forces me to interact with each
         | of its changes and also involves me more directly in them, and
          | teaches me about what it's doing along the way, and asks for
         | more input.
         | 
         | A more socratic method, and more augmentic than "agentic".
         | 
          | Hell, if anybody has investment money and energy and shares
          | this vision, I'd love to work on creating this tool with
          | you. I think these models are being _misused_ right now in
          | an attempt to automate us out of work, when their real
          | amazing latent power is the intuition that we're talking
          | about on this thread.
         | 
          | Misused, they have the power to worsen codebases by making
          | developers illiterate about the very thing they're working
          | on, because it's all magic behind the scenes. Uncorked, they
          | could enhance understanding and help better realize the
          | potential of computing technology.
        
           | mccoyb wrote:
           | I'm working on such a thing, but I'm not interested in money,
           | nor do I have money to offer - I'm interested in a system
           | which I'm proud of.
           | 
           | What are your motivations?
           | 
           | Interested in your work: from your public GitHub repos, I'm
           | perhaps most interested in `moor` -- as it shares many design
           | inclinations that I've leaned towards in thinking about this
           | problem.
        
             | cmrdporcupine wrote:
             | Unfortunately... mooR is my passion project, but I also
             | need to get paid, and nobody is paying me for that.
             | 
             | I'm off work right now, between jobs and have been working
             | 10, 12 hours a day on it. That will shortly have to end. I
             | applied for a grant and got turned down.
             | 
             | My motivations come down to making a living doing the
             | things I love. That is increasingly hard.
        
         | imiric wrote:
         | On the one hand, I agree with this. The chat UI is very slow
         | and inefficient.
         | 
         | But on the other, given what I know about these tools and how
         | error-prone they are, I simply refuse to give them access to my
          | system, to run commands, or do any action for me. Partly due
          | to security concerns, partly due to privacy, but mostly due
          | to distrust that they will do the right thing. When they
          | screw up in a chat, I can clean up the context and try again.
          | Reverting a removed file or a messed-up Git repo is much more
          | difficult. This
         | is how you get a dropped database during code freeze...
         | 
         | The idea of giving any of these corporations such privileges is
         | unthinkable for me. It seems that most people either don't care
         | about this, or are willing to accept it as the price of
         | admission.
         | 
         | I experimented with Aider and a self-hosted model a few months
         | ago, and wasn't impressed. I imagine the experience with SOTA
         | hosted models is much better, but I'll probably use a sandbox
         | next time I look into this.
        
       | gdevenyi wrote:
       | Coming soon, adversarial attacks on LLM training to ensure
       | cryptographic mistakes.
        
       | Frannky wrote:
       | CLI terminals are incredibly powerful. They are also free if you
        | use Gemini CLI or Qwen Code. Plus, you can access any OpenAI-
        | compatible API (2k TPS via Cerebras at $2/M, or local models).
        | And you can use them in IDEs like Zed with ACP mode.
       | 
       | All the simple stuff (creating a repo, pushing, frontend edits,
       | testing, Docker images, deployment, etc.) is automated. For the
       | difficult parts, you can just use free Grok to one-shot small
       | code files. It works great if you force yourself to keep the
       | amount of code minimal and modular. Also, they are great UIs--you
       | can create smart programs just with CLI + MCP servers + MD files.
       | Truly amazing tech.
        
         | BrokenCogs wrote:
          | How good is Gemini CLI compared to Claude Code and OpenAI
          | Codex?
        
           | Frannky wrote:
           | I started with Claude Code, realized it was too much money
           | for every message, then switched to Gemini CLI, then Qwen.
           | Probably Claude Code is better, but I don't need it since I
           | can solve my problems without it.
        
             | cmrdporcupine wrote:
             | Try what I've done: use the Claude Code tool but point your
             | ANTHROPIC_URL at a DeepSeek API membership. It's like
             | 1/10th the cost, and about 2/3rds the intelligence.
             | 
             | Sometimes I can't really tell.
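              | 
              | For reference, the setup is roughly this -- a sketch, not
              | gospel: the documented variable names may differ slightly
              | from what I wrote above (last I checked they were
              | ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN), and the
              | endpoint URL is whatever DeepSeek's docs give for their
              | Anthropic-compatible API:
              | 
              |   # sketch: point Claude Code at an Anthropic-compatible
              |   # third-party endpoint instead of Anthropic's own API
              |   export ANTHROPIC_BASE_URL="<deepseek endpoint>"
              |   export ANTHROPIC_AUTH_TOKEN="<deepseek api key>"
              |   claude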
        
               | jen729w wrote:
               | Or a 3rd party service like https://synthetic.new, of
               | which I am an unaffiliated user.
        
             | luxuryballs wrote:
              | Yeah, I was using OpenRouter for Claude Code and burned
              | through $30 in credits on things that would have cost
              | maybe $1.50 if I had just used the OpenRouter chat. I
              | decided it was better for now to do the extra "secretary
              | work" of manual entry, context management of the chat,
              | and the pain of attaching files. It was pretty
              | disappointing, because at first I had assumed the price
              | would not be much different at all.
        
             | distances wrote:
              | I've found the regular Claude Pro subscription quite
              | enough for coding tasks when you have a bunch of other
              | things to do anyway, like code reviews, and won't spend
              | the whole day running it.
        
           | wdfx wrote:
            | Gemini and its tooling is absolute shit. The LLM itself is
            | barely usable and needs so much supervision you might as
            | well do the work yourself. Then couple that with an awful
            | CLI and VS Code interface and you'll find that it's just a
            | complete waste of time.
            | 
            | Compared to the Anthropic offering it's night and day.
            | Claude gets on with the job and makes me way more
            | productive.
        
             | SamInTheShell wrote:
              | > Gemini and its tooling is absolute shit.
             | 
             | Which model were you using? In my experience Gemini 2.5 Pro
             | is just as good as Claude Sonnet 4 and 4.5. It's literally
             | what I use as a fallback to wrap something up if I hit the
             | 5 hour limit on Claude and want to just push past some
             | incomplete work.
             | 
             | I'm just going to throw this out there. I get good results
             | from a truly trash model like gpt-oss-20b (quantized at
             | 4bits). The reason I can literally use this model is
             | because I know my shit and have spent time learning how
             | much instruction each model I use needs.
             | 
             | Would be curious what you're actually having issues with if
             | you're willing to share.
        
               | sega_sai wrote:
                | I share the same opinion on Gemini CLI. Other than for
                | the simplest tasks it is just not usable: it gets stuck
                | in loops, ignores instructions, and fails to edit
                | files. Plus it just has plenty of bugs in the CLI that
                | you occasionally hit. I wish I could use it rather than
                | pay for an extra subscription for Claude Code, but it
                | is just in a different league (at least as of a couple
                | of weeks ago).
        
               | SamInTheShell wrote:
               | Which model are you using though? When I run out of
               | Gemini 2.5 Pro and it falls back to the Flash model, the
               | Flash model is absolute trash for sure. I have to prompt
               | it like I do local models. Gemini 2.5 Pro has shown me
               | good results though. Nothing like "ignores instructions"
               | has really occurred for me with the Pro model.
        
               | sega_sai wrote:
               | I get that even with the 2.5 pro
        
             | Frannky wrote:
             | It's probably a mix of what you're working on and how
             | you're using the tool. If you can't get it done for free or
             | cheaply, it makes sense to pay. I first design the
             | architecture in my mind, then use Grok 4 fast (free) for
             | single-shot generation of main files. This forces me to
             | think first, and read the generated code to double-check.
             | Then, the CLI is mostly for editing, clerical work,
             | testing, etc. That said, I do try to avoid coding
             | altogether if the CLI + MCP servers + MD files can solve
             | the problem.
        
         | idiotsecant wrote:
          | Claude Code's UX is really good, though.
        
       | delaminator wrote:
       | > For example, how nice would it be if every time tests fail, an
       | LLM agent was kicked off with the task of figuring out why, and
       | only notified us if it did before we fixed it?
       | 
        | You can use Git hooks to do that. If you have tests and one
        | fails, spawn an instance of claude with a prompt: claude -p
        | 'tests/test4.sh failed, look in src/ and try and work out why'
        | 
        |   $ claude -p 'hello, just tell me a joke about databases'
        |   A SQL query walks into a bar, walks up to two tables and
        |   asks, "Can I JOIN you?"
        |   $
       | 
        | Or, if you use Gogs locally, you can add a Gogs hook to do the
       | same on pre-push
       | 
       | > An example hook script to verify what is about to be pushed.
       | Called by "git push" after it has checked the remote status, but
       | before anything has been pushed. If this script exits with a non-
       | zero status nothing will be pushed.
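        | 
        | Roughly something like this, I imagine -- an untested sketch,
        | where the test runner path, log locations, and prompt wording
        | are all placeholders:
        | 
        |   #!/bin/sh
        |   # .git/hooks/pre-push (sketch): run the tests and, on
        |   # failure, kick off claude in the background to investigate
        |   # while blocking the push
        |   log=/tmp/tests.log
        |   if ! ./tests/run_all.sh > "$log" 2>&1; then
        |       prompt="tests failed, read $log, look in src/"
        |       prompt="$prompt and try to work out why"
        |       claude -p "$prompt" > /tmp/claude-triage.md 2>&1 &
        |       echo "tests failed; see /tmp/claude-triage.md"
        |       exit 1
        |   fi
        |   exit 0
        | 
        | (The exit 1 blocks the push; drop it if you just want the
        | report and would rather let the push go through.)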
       | 
       | I like this idea. I think I shall get Claude to work out the
       | mechanism itself :)
       | 
        | It is even a suggestion on this Claude cheat sheet:
       | 
       | https://www.howtouselinux.com/post/the-complete-claude-code-...
        
         | jamesponddotco wrote:
         | This could probably be implemented as a simple Bash script, if
         | the user wants to run everything manually. I might just do that
         | to burn some time.
        
           | delaminator wrote:
            | sure, there are multiple ways of spawning an instance
            | 
            | the only thing I imagine might be a problem is claude
            | demanding a login token, as happens quite regularly
        
       | simonw wrote:
       | Using coding agents to track down the root cause of bugs like
       | this works really well:
       | 
       | > Three out of three one-shot debugging hits with no help is
       | extremely impressive. Importantly, there is no need to trust the
       | LLM or review its output when its job is just saving me an hour
       | or two by telling me where the bug is, for me to reason about it
       | and fix it.
       | 
       | The approach described here could also be a good way for LLM-
       | skeptics to start exploring how these tools can help them without
        | feeling like they're cheating, ripping off the work of everyone
        | whose code was used to train the model, or taking away the most
        | fun part of their job (writing code).
       | 
       | Have the coding agents do the work of digging around hunting down
       | those frustratingly difficult bugs - don't have it write code on
       | your behalf.
        
         | jack_tripper wrote:
         | _> Have the coding agents do the work of digging around hunting
         | down those frustratingly difficult bugs - don't have it write
         | code on your behalf._
         | 
          | Why? Bug hunting is more challenging and cognitively
          | intensive than writing code.
        
           | simonw wrote:
           | Sometimes it's the end of the day and you've been crunching
           | for hours already and you hit one gnarly bug and you just
           | want to go and make a cup of tea and come back to some useful
           | hints as to the resolution.
        
           | theptip wrote:
           | Bug hunting tends to be interpolation, which LLMs are really
           | good at. Writing code is often some extrapolation (or
           | interpolating at a much more abstract level).
        
         | qa34514324 wrote:
         | I have tested the AI SAST tools that were hyped after a curl
         | article on several C code bases and they found nothing.
         | 
         | Which low level code base have you tried this latest tool on?
         | Official Anthropic commercials do not count.
        
           | simonw wrote:
           | You're posting this comment on a thread attached to an
           | article where Filippo Valsorda - a noted cryptography expert
           | - used these tools to track down gnarly bugs in Go
           | cryptography code.
        
             | tptacek wrote:
             | They're also using "AI SAST tools", which: I would not
             | expect anything branded as a "SAST" tool to find
             | interesting bugs. SAST is a term of art for "pattern
             | matching to a grocery list of specific bugs".
        
             | delusional wrote:
             | These are not "gnarly bugs".
        
         | mschulkind wrote:
          | One of my favorite ways to use LLM agents for coding is to have
          | them write extensive documentation on whatever code I'm about
          | to dig into. Pretty low stakes if the LLM makes a few
         | mistakes. It's perhaps even a better place to start for
         | skeptics.
        
           | dboreham wrote:
           | Same. Initially surprised how good it was. Now routinely do
           | this on every new codebase. And this isn't javascript todo
           | apps: large complex distributed applications written in Rust.
        
         | teaearlgraycold wrote:
         | I'm a bit of an LLM hater because they're overhyped. But in
         | these situations they can be pretty nice if you can quickly
          | evaluate correctness. If evaluating correctness is harder than
          | searching on your own, then they're a net negative. I've found
          | with my debugging it's really hard to know which will be the
          | case. And since it's my responsibility to build a "Do I give
          | the LLM a shot?" heuristic, that's very frustrating.
        
       | lordnacho wrote:
       | I'm not surprised it worked.
       | 
       | Before I used Claude, I would be surprised.
       | 
       | I think it works because Claude takes some standard coding issues
       | and systematizes them. The list is long, but Claude doesn't run
       | out of patience like a human being does. Or at least it has some
       | credulity left after trying a few initial failed hypotheses. This
       | being a cryptography problem helps a little bit, in that there
       | are very specific keywords that might hint at a solution, but
       | from my skim of the article, it seems like it was mostly a good
       | old coding error, taking the high bits twice.
       | 
       | The standard issues are just a vague laundry list:
       | 
       | - Are you using the data you think you're using? (Bingo for this
       | one)
       | 
       | - Could it be an overflow?
       | 
       | - Are the types right?
       | 
       | - Are you calling the function you think you're calling? Check
       | internal, then external dependencies
       | 
       | - Is there some parameter you didn't consider?
       | 
        | And a bunch of others. When I ask Claude for a debug, it's always
        | something that makes sense as a checklist item, but I'm often
        | impressed by how diligently it follows the path set by the
        | results of the investigation. It's a great donkey, really takes
       | the drudgery out of my work, even if it sometimes takes just as
       | long.
        
         | ay wrote:
         | > Claude doesn't run out of patience like a human being does.
         | 
         | It very much does! I had a debugging session with Claude Code
         | today, and it was about to give up with the message along the
         | lines of "I am sorry I was not able to help you find the
         | problem".
         | 
         | It took some gentle cheering (pretty easy, just saying "you are
         | doing an excellent job, don't give up!") and encouragement, and
         | a couple of suggestions from me on how to approach the debug
         | process for it to continue and finally "we" (I am using plural
         | here because some information that Claude "volunteered" was
         | essential to my understanding of the problem) were able to
         | figure out the root cause and the fix.
        
           | lordnacho wrote:
           | That's interesting, that only happened to me on GPT models in
           | Cursor. It would apologize profusely.
        
         | vidarh wrote:
         | > The list is long, but Claude doesn't run out of patience like
         | a human being does
         | 
          | I've flat out had Claude tell me its task was getting tedious,
          | and it will often grasp at straws to use as excuses for
          | stopping a repetitive task and moving on to something else.
          | 
          | Keeping it on task when something keeps moving forward is
          | easy, but when it gets repetitive it takes a lot of effort to
          | make it stick to it.
        
       | rvz wrote:
        | As declared by an expert in cryptography who knows how to guide
        | the LLM into debugging low-level cryptography, which is good.
        | 
        | It's quite different if you are not a cryptographer or a domain
        | expert.
        
         | tptacek wrote:
         | Even the title of the post makes this clear: it's about
          | _debugging_ low-level cryptography. He didn't vibe code ML-
         | DSA. You have to be building a low-level implementation in the
         | first place for anything in this post to apply to it.
        
       | XenophileJKO wrote:
       | Personally my biggest piece of advice is: AI First.
       | 
       | If you really want to understand what the limitations are of the
       | current frontier models (and also really learn how to use them),
       | ask the AI first.
       | 
       | By throwing things over the wall to the AI first, you learn what
       | it can do at the same time as you learn how to structure your
       | requests. The newer models are quite capable and in my experience
       | can largely be treated like a co-worker for "most" problems. That
       | being said.. you also need to understand how they fail and build
       | an intuition for why they fail.
       | 
       | Every time a new model generation comes up, I also recommend
       | throwing away your process (outside of things like lint, etc.)
        | and see how the model does without it. I work with people that
        | have elaborate context setups they crafted for less capable
        | models; these are largely unnecessary with GPT-5-Codex and
        | Sonnet 4.5.
        
         | imiric wrote:
         | > By throwing things over the wall to the AI first, you learn
         | what it can do at the same time as you learn how to structure
         | your requests.
         | 
         | Unfortunately, it doesn't quite work out that way.
         | 
         | Yes, you will get better at using these tools the more you use
         | them, which is the case with any tool. But you will not learn
         | what they can do as easily, or at all.
         | 
         | The main problem with them is the same one they've had since
         | the beginning. If the user is a domain expert, then they will
         | be able to quickly spot the inaccuracies and hallucinations in
         | the seemingly accurate generated content, and, with some
         | effort, coax the LLM into producing correct output.
         | 
         | Otherwise, the user can be easily misled by the confident and
         | sycophantic tone, and waste potentially hours troubleshooting,
         | without being able to tell if the error is on the LLM side or
         | their own. In most of these situations, they would've probably
         | been better off reading the human-written documentation and
         | code, and doing the work manually. Perhaps with minor
         | assistance from LLMs, but never relying on them entirely.
         | 
         | This is why these tools are most useful to people who are
         | already experts in their field, such as Filippo. For everyone
         | else who isn't, and actually cares about the quality of their
         | work, the experience is very hit or miss.
         | 
         | > That being said.. you also need to understand how they fail
         | and build an intuition for why they fail.
         | 
         | I've been using these tools for years now. The only intuition I
         | have for how and why they fail is when I'm familiar with the
         | domain. But I had that without LLMs as well, whenever someone
         | is talking about a subject I know. It's impossible to build
         | that intuition with domains you have little familiarity with.
         | You can certainly do that by traditional learning, and LLMs can
         | help with that, but most people use them for what you suggest:
         | throwing things over the wall and running with it, which is a
         | shame.
         | 
          | > I work with people that have elaborate context setups they
          | crafted for less capable models; these are largely unnecessary
          | with GPT-5-Codex and Sonnet 4.5.
         | 
         | I haven't used GPT-5-Codex, but have experience with Sonnet
         | 4.5, and it's only marginally better than the previous versions
         | IME. It still often wastes my time, no matter the quality or
         | amount of context I feed it.
        
           | XenophileJKO wrote:
           | I guess there are several unsaid assumptions here. The
           | article is by a domain expert working on their domain. Throw
           | work you understand at it, see what it does. Do it before you
           | even work on it. I kind of assumed based on the audience that
           | most people here would be domain experts.
           | 
            | As for building intuition, perhaps I am overestimating what
            | most people are capable of.
           | 
           | Working with and building systems using LLMs over the last
           | few years has helped me build a pretty good intuition about
           | what is breaking down when the model fails at a task. While
           | having an ML background is useful in some very narrow cases
           | (like: 'why does an LLM suck at ranking...'), I "think" a
           | person can get a pretty good intuition purely based on
           | observational outcomes.
           | 
           | I've been wrong before though. When we first started building
           | LLM products, I thought, "Anyone can prompt, there is no
           | barrier for this skill." That was not the case at all. Most
            | people don't do well trying to quantify ambiguity,
            | specificity, and logical contradiction when writing a
            | process or set of instructions. I was REALLY surprised at
            | how I became a "go-to" person to "fix" prompt systems, all
            | based on linguistics and systematic process decomposition.
            | Some of this was understanding how the auto-regressive
            | attention system benefits from breaking the work down into
            | steps, but really most of it was just "don't contradict
            | yourself and be clear".
           | 
            | Working with them extensively has also helped me home in on
            | how the models get "better" with each release. Though most of
            | my expertise is with the OpenAI and Anthropic model families.
           | 
           | I still think most engineers "should" be able to build
           | intuition generally on what works well with LLMs and how to
           | interact with them, but you are probably right. It will be
           | just like most ML engineers where they see something work in
           | a paper and then just paste it onto their model with no
           | intuition about what systemically that structure changes in
           | the model dynamics.
        
       | cluckindan wrote:
       | So the "fix" includes a completely new function? In a
       | cryptography implementation?
       | 
       | I feel like the article is giving out very bad advice which is
       | going to end up shooting someone in the foot.
        
         | OneDeuxTriSeiGo wrote:
         | The article even states that the solution claude proposed
         | wasn't the point. The point was finding the bug.
         | 
          | AIs are very capable heuristic tools. Being able to "sniff
          | test" things blind is their specialty.
         | 
         | i.e. Treat them like an extremely capable gas detector that can
         | tell you there is a leak and where in the plumbing it is, not a
         | plumber who can fix the leak for you.
        
         | thadt wrote:
         | Can you expand on what you find to be 'bad advice'?
         | 
          | The author uses an LLM to find _bugs_, then throws away the
          | fix and instead writes the code he would have written anyway.
          | This seems like a rather conservative application of LLMs.
          | Using the 'shooting someone in the foot' analogy - this
         | article is an illustration of professional and responsible
         | firearm handling.
        
       | didibus wrote:
       | This is basically the ideal scenario for coding agents. Easily
       | verifiable through running tests, pure logic, algorithmic
       | problem. It's the case that has worked the best for me with LLMs.
        
       | pton_xd wrote:
       | > Full disclosure: Anthropic gave me a few months of Claude Max
       | for free. They reached out one day and told me they were giving
       | it away to some open source maintainers.
       | 
       | Related, lately I've been getting tons of Anthropic Instagram
       | ads; they must be near a quarter of all the sponsored content I
       | see for the last month or so. Various people vibe coding random
       | apps and whatnot using different incarnations of Claude. Or just
       | direct adverts to "Install Claude Code." I really have no idea
       | why I've been targeted so hard, on Instagram of all places. Their
       | marketing team must be working overtime.
        
         | simonw wrote:
         | I think it might be that they've hit product-market fit.
         | 
         | Developers find Claude Code extremely useful (once they figure
         | out how to use it). Many developers subscribe to their
         | $200/month plan. Assuming that's profitable (and I expect it
         | is, since even for that much money it cuts off at a certain
         | point to avoid over-use) Anthropic would be wise to spend a
         | _lot_ of money on marketing to try and grow their paying
         | subscriber base for it.
        
       | phendrenad2 wrote:
       | A whole class of tedious problems have been eliminated by LLMs
       | because they are able to look at code in a "fuzzy" way. But this
       | can be a liability, too. I have a codebase that "looks kinda"
       | like a nodejs project, so AI agents usually assume it is one,
       | even if I rename the package.json, it will inspect the contents
       | and immediately clock it as "node-like".
        
       ___________________________________________________________________
       (page generated 2025-11-01 23:00 UTC)