[HN Gopher] Claude Code Can Debug Low-Level Cryptography
___________________________________________________________________
Claude Code Can Debug Low-Level Cryptography
Author : Bogdanp
Score : 128 points
Date : 2025-11-01 18:41 UTC (4 hours ago)
(HTM) web link (words.filippo.io)
(TXT) w3m dump (words.filippo.io)
| qsort wrote:
| This resonates with me a lot:
|
| > As ever, I wish we had better tooling for using LLMs which
| didn't look like chat or autocomplete
|
| I think part of the reason why I was initially more skeptical
| than I ought to have been is because chat is such a garbage
| modality. LLMs started to "click" for me with Claude Code/Codex.
|
| A "continuously running" mode that would ping me would be
| interesting to try.
| cmrdporcupine wrote:
| I absolutely agree with this sentiment as well and keep coming
| back to it. What I want is more of an _actual_ copilot which
| works in a more paired way and forces me to interact with each
| of its changes and also involves me more directly in them, and
| teaches me about what it's doing along the way, and asks for
| more input.
|
| A more Socratic method, and more augmentic than "agentic".
|
| Hell, if anybody has investment money and energy and shares
| this vision I'd love to work on creating this tool with you. I
| think these models are being _misused_ right now in an attempt
| to automate us out of work when their real amazing latent power
| is the intuition that we're talking about on this thread.
|
| Misused they have the power to worsen codebases by making
| developers illiterate about the very thing they're working on
| because it's all magic behind the scenes. Uncorked they could
| enhance understanding and help better realize the potential of
| computing technology.
| mccoyb wrote:
| I'm working on such a thing, but I'm not interested in money,
| nor do I have money to offer - I'm interested in a system
| which I'm proud of.
|
| What are your motivations?
|
| Interested in your work: from your public GitHub repos, I'm
| perhaps most interested in `moor` -- as it shares many design
| inclinations that I've leaned towards in thinking about this
| problem.
| cmrdporcupine wrote:
| Unfortunately... mooR is my passion project, but I also
| need to get paid, and nobody is paying me for that.
|
| I'm off work right now, between jobs and have been working
| 10, 12 hours a day on it. That will shortly have to end. I
| applied for a grant and got turned down.
|
| My motivations come down to making a living doing the
| things I love. That is increasingly hard.
| imiric wrote:
| On the one hand, I agree with this. The chat UI is very slow
| and inefficient.
|
| But on the other, given what I know about these tools and how
| error-prone they are, I simply refuse to give them access to my
| system, to run commands, or do any action for me. Partly due to
| security concerns, partly due to privacy, but mostly due to
| distrust that they will do the right thing. When they screw up
| in a
| chat, I can clean up the context and try again. Reverting a
| removed file or messed up Git repo is much more difficult. This
| is how you get a dropped database during code freeze...
|
| The idea of giving any of these corporations such privileges is
| unthinkable for me. It seems that most people either don't care
| about this, or are willing to accept it as the price of
| admission.
|
| I experimented with Aider and a self-hosted model a few months
| ago, and wasn't impressed. I imagine the experience with SOTA
| hosted models is much better, but I'll probably use a sandbox
| next time I look into this.
| gdevenyi wrote:
| Coming soon, adversarial attacks on LLM training to ensure
| cryptographic mistakes.
| Frannky wrote:
| CLI terminals are incredibly powerful. They are also free if you
| use Gemini CLI or Qwen Code. Plus, you can point them at any
| OpenAI-compatible API (2k TPS via Cerebras at $2/M, or local
| models). And you can use them in IDEs like Zed via ACP mode.
|
| All the simple stuff (creating a repo, pushing, frontend edits,
| testing, Docker images, deployment, etc.) is automated. For the
| difficult parts, you can just use free Grok to one-shot small
| code files. It works great if you force yourself to keep the
| amount of code minimal and modular. They also make great UIs--you
| can create smart programs just with CLI + MCP servers + MD files.
| Truly amazing tech.
| BrokenCogs wrote:
| How good is Gemini CLI compared to Claude Code and OpenAI
| Codex?
| Frannky wrote:
| I started with Claude Code, realized it was too much money
| for every message, then switched to Gemini CLI, then Qwen.
| Probably Claude Code is better, but I don't need it since I
| can solve my problems without it.
| cmrdporcupine wrote:
| Try what I've done: use the Claude Code tool but point your
| ANTHROPIC_URL at a DeepSeek API membership. It's like
| 1/10th the cost, and about 2/3rds the intelligence.
|
| Sometimes I can't really tell.
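|
| Roughly something like this (the variable names and endpoint
| are from memory, so check the current docs before relying on
| them):
|     # point Claude Code at an Anthropic-compatible endpoint
|     export ANTHROPIC_BASE_URL='https://api.deepseek.com/anthropic'
|     export ANTHROPIC_AUTH_TOKEN='<your DeepSeek API key>'
|     claude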
| jen729w wrote:
| Or a 3rd party service like https://synthetic.new, of
| which I am an unaffiliated user.
| luxuryballs wrote:
| Yeah, I was using OpenRouter for Claude Code and burned
| through $30 in credits to do things that would have cost
| more like $1.50 if I had just used the OpenRouter chat, so
| I decided it was better for now to do the extra "secretary
| work" of manual entry and context management in the chat,
| plus the pain of attaching files. It was pretty
| disappointing because at first I had assumed the price
| would not be much different at all.
| distances wrote:
| I've found the regular Claude Pro subscription quite enough
| for coding tasks when you have a bunch of other things to
| do anyway, like code reviews, and won't spend the whole day
| running it.
| wdfx wrote:
| Gemini and its tooling is absolute shit. The LLM itself is
| barely usable and needs so much supervision you might as well
| do the work yourself. Then couple that with an awful cli and
| vscode interface and you'll find that it's just a complete
| waste of time.
|
| Compared to the Anthropic offering it's night and day. Claude
| gets on with the job and makes me way more productive.
| SamInTheShell wrote:
| > Gemini and its tooling is absolute shit.
|
| Which model were you using? In my experience Gemini 2.5 Pro
| is just as good as Claude Sonnet 4 and 4.5. It's literally
| what I use as a fallback to wrap something up if I hit the
| 5 hour limit on Claude and want to just push past some
| incomplete work.
|
| I'm just going to throw this out there. I get good results
| from a truly trash model like gpt-oss-20b (quantized at
| 4bits). The reason I can literally use this model is
| because I know my shit and have spent time learning how
| much instruction each model I use needs.
|
| Would be curious what you're actually having issues with if
| you're willing to share.
| sega_sai wrote:
| I share the same opinion of the Gemini CLI. Other than for
| the simplest tasks it is just not usable: it gets stuck in
| loops, ignores instructions, and fails to edit files. Plus
| it just has plenty of bugs in the CLI itself that you
| occasionally hit. I wish I could use it rather than pay
| an extra subscription for Claude Code, but it is just in
| a different league (at least as of a couple of weeks ago).
| SamInTheShell wrote:
| Which model are you using though? When I run out of
| Gemini 2.5 Pro and it falls back to the Flash model, the
| Flash model is absolute trash for sure. I have to prompt
| it like I do local models. Gemini 2.5 Pro has shown me
| good results though. Nothing like "ignores instructions"
| has really occurred for me with the Pro model.
| sega_sai wrote:
| I get that even with 2.5 Pro.
| Frannky wrote:
| It's probably a mix of what you're working on and how
| you're using the tool. If you can't get it done for free or
| cheaply, it makes sense to pay. I first design the
| architecture in my mind, then use Grok 4 fast (free) for
| single-shot generation of main files. This forces me to
| think first, and read the generated code to double-check.
| Then, the CLI is mostly for editing, clerical work,
| testing, etc. That said, I do try to avoid coding
| altogether if the CLI + MCP servers + MD files can solve
| the problem.
| idiotsecant wrote:
| Claude Code's UX is really good, though.
| delaminator wrote:
| > For example, how nice would it be if every time tests fail, an
| LLM agent was kicked off with the task of figuring out why, and
| only notified us if it did before we fixed it?
|
| You can use Git hooks to do that. If you have tests and one
| fails, spawn an instance of claude with the prompt -p
| 'tests/test4.sh failed, look in src/ and try and work out why'
|     $ claude -p 'hello, just tell me a joke about databases'
|     A SQL query walks into a bar, walks up to two tables and
|     asks, "Can I JOIN you?"
|     $
|
| Or, if you use Gogs locally, you can add a Gogs hook to do the
| same on pre-push
|
| > An example hook script to verify what is about to be pushed.
| Called by "git push" after it has checked the remote status, but
| before anything has been pushed. If this script exits with a non-
| zero status nothing will be pushed.
|
| I like this idea. I think I shall get Claude to work out the
| mechanism itself :)
|
| It is even a suggestion on this Claude cheat sheet:
|
| https://www.howtouselinux.com/post/the-complete-claude-code-...
| jamesponddotco wrote:
| This could probably be implemented as a simple Bash script, if
| the user wants to run everything manually. I might just do that
| to burn some time.
| delaminator wrote:
| Sure, there are multiple ways of spawning an instance.
|
| The only thing I imagine might be a problem is claude
| demanding a login token, which happens quite regularly.
| simonw wrote:
| Using coding agents to track down the root cause of bugs like
| this works really well:
|
| > Three out of three one-shot debugging hits with no help is
| extremely impressive. Importantly, there is no need to trust the
| LLM or review its output when its job is just saving me an hour
| or two by telling me where the bug is, for me to reason about it
| and fix it.
|
| The approach described here could also be a good way for LLM-
| skeptics to start exploring how these tools can help them without
| feeling like they're cheating, ripping off the work of everyone
| whose code was used to train the model, or taking away the most
| fun part of their job (writing code).
|
| Have the coding agents do the work of digging around hunting down
| those frustratingly difficult bugs - don't have them write code on
| your behalf.
| jack_tripper wrote:
| _> Have the coding agents do the work of digging around hunting
| down those frustratingly difficult bugs - don't have it write
| code on your behalf._
|
| Why? Bug hunting is more challenging and cognitively intensive
| than writing code.
| simonw wrote:
| Sometimes it's the end of the day and you've been crunching
| for hours already and you hit one gnarly bug and you just
| want to go and make a cup of tea and come back to some useful
| hints as to the resolution.
| theptip wrote:
| Bug hunting tends to be interpolation, which LLMs are really
| good at. Writing code is often some extrapolation (or
| interpolating at a much more abstract level).
| qa34514324 wrote:
| I have tested, on several C code bases, the AI SAST tools
| that were hyped after a curl article, and they found nothing.
|
| Which low-level code base have you tried this latest tool on?
| Official Anthropic commercials do not count.
| simonw wrote:
| You're posting this comment on a thread attached to an
| article where Filippo Valsorda - a noted cryptography expert
| - used these tools to track down gnarly bugs in Go
| cryptography code.
| tptacek wrote:
| They're also using "AI SAST tools", and I would not
| expect anything branded as a "SAST" tool to find
| interesting bugs. SAST is a term of art for "pattern
| matching to a grocery list of specific bugs".
| delusional wrote:
| These are not "gnarly bugs".
| mschulkind wrote:
| One of my favorite ways to use LLM agents for coding is to have
| them write extensive documentation on whatever code I'm about
| to dig into. The stakes are pretty low if the LLM makes a few
| mistakes. It's perhaps even a better place to start for
| skeptics.
| dboreham wrote:
| Same. Initially surprised how good it was. Now I routinely do
| this on every new codebase. And these aren't JavaScript todo
| apps: they're large, complex distributed applications written
| in Rust.
| teaearlgraycold wrote:
| I'm a bit of an LLM hater because they're overhyped. But in
| these situations they can be pretty nice if you can quickly
| evaluate correctness. If evaluating correctness is harder than
| searching on your own, then they're a net negative. I've found
| that with my debugging it's really hard to know which will be
| the case. And since it's my responsibility to build a "Do I
| give the LLM a shot?" heuristic, that's very frustrating.
| lordnacho wrote:
| I'm not surprised it worked.
|
| Before I used Claude, I would be surprised.
|
| I think it works because Claude takes some standard coding issues
| and systematizes them. The list is long, but Claude doesn't run
| out of patience like a human being does. Or at least it has some
| credulity left after trying a few initial failed hypotheses. This
| being a cryptography problem helps a little bit, in that there
| are very specific keywords that might hint at a solution, but
| from my skim of the article, it seems like it was mostly a good
| old coding error, taking the high bits twice.
|
| The standard issues are just a vague laundry list:
|
| - Are you using the data you think you're using? (Bingo for this
| one)
|
| - Could it be an overflow?
|
| - Are the types right?
|
| - Are you calling the function you think you're calling? Check
| internal, then external dependencies
|
| - Is there some parameter you didn't consider?
|
| And a bunch of others. When I ask Claude for a debug, it's always
| something that makes sense as a checklist item, but I'm often
| impressed by how it diligently followed the path set by the
| results of the investigation. It's a great donkey, really takes
| the drudgery out of my work, even if it sometimes takes just as
| long.
| ay wrote:
| > Claude doesn't run out of patience like a human being does.
|
| It very much does! I had a debugging session with Claude Code
| today, and it was about to give up with a message along the
| lines of "I am sorry I was not able to help you find the
| problem".
|
| It took some gentle cheering (pretty easy, just saying "you are
| doing an excellent job, don't give up!") and encouragement, and
| a couple of suggestions from me on how to approach the debug
| process for it to continue and finally "we" (I am using plural
| here because some information that Claude "volunteered" was
| essential to my understanding of the problem) were able to
| figure out the root cause and the fix.
| lordnacho wrote:
| That's interesting, that only happened to me on GPT models in
| Cursor. It would apologize profusely.
| vidarh wrote:
| > The list is long, but Claude doesn't run out of patience like
| a human being does
|
| I've flat out had Claude tell me its task was getting tedious,
| and it will often grasp at straws to use as excuses for
| stopping a repetitive task and moving on to something else.
|
| Keeping it on task when something keeps moving forward is
| easy, but when it gets repetitive it takes a lot of effort to
| make it stick with it.
| rvz wrote:
| As declared by an expert in cryptography who knows how to guide
| the LLM into debugging low-level cryptography, which is good.
|
| Quite different if you are not a cryptographer or a domain
| expert.
| tptacek wrote:
| Even the title of the post makes this clear: it's about
| _debugging_ low-level cryptography. He didn't vibe code ML-
| DSA. You have to be building a low-level implementation in the
| first place for anything in this post to apply to it.
| XenophileJKO wrote:
| Personally my biggest piece of advice is: AI First.
|
| If you really want to understand what the limitations are of the
| current frontier models (and also really learn how to use them),
| ask the AI first.
|
| By throwing things over the wall to the AI first, you learn what
| it can do at the same time as you learn how to structure your
| requests. The newer models are quite capable and in my experience
| can largely be treated like a co-worker for "most" problems. That
| being said.. you also need to understand how they fail and build
| an intuition for why they fail.
|
| Every time a new model generation comes up, I also recommend
| throwing away your process (outside of things like lint, etc.)
| and see how the model does without it. I work with people that
| have elaborate context setups they crafted for less capable
| models; they are largely unnecessary with GPT-5-Codex and
| Sonnet 4.5.
| imiric wrote:
| > By throwing things over the wall to the AI first, you learn
| what it can do at the same time as you learn how to structure
| your requests.
|
| Unfortunately, it doesn't quite work out that way.
|
| Yes, you will get better at using these tools the more you use
| them, which is the case with any tool. But you will not learn
| what they can do as easily, or at all.
|
| The main problem with them is the same one they've had since
| the beginning. If the user is a domain expert, then they will
| be able to quickly spot the inaccuracies and hallucinations in
| the seemingly accurate generated content, and, with some
| effort, coax the LLM into producing correct output.
|
| Otherwise, the user can be easily misled by the confident and
| sycophantic tone, and waste potentially hours troubleshooting,
| without being able to tell if the error is on the LLM side or
| their own. In most of these situations, they would've probably
| been better off reading the human-written documentation and
| code, and doing the work manually. Perhaps with minor
| assistance from LLMs, but never relying on them entirely.
|
| This is why these tools are most useful to people who are
| already experts in their field, such as Filippo. For everyone
| else who isn't, and actually cares about the quality of their
| work, the experience is very hit or miss.
|
| > That being said.. you also need to understand how they fail
| and build an intuition for why they fail.
|
| I've been using these tools for years now. The only intuition I
| have for how and why they fail is when I'm familiar with the
| domain. But I had that without LLMs as well, whenever someone
| is talking about a subject I know. It's impossible to build
| that intuition with domains you have little familiarity with.
| You can certainly do that by traditional learning, and LLMs can
| help with that, but most people use them for what you suggest:
| throwing things over the wall and running with it, which is a
| shame.
|
| > I work with people that have elaborate context setups they
| crafted for less capable models; they are largely unnecessary
| with GPT-5-Codex and Sonnet 4.5.
|
| I haven't used GPT-5-Codex, but have experience with Sonnet
| 4.5, and it's only marginally better than the previous versions
| IME. It still often wastes my time, no matter the quality or
| amount of context I feed it.
| XenophileJKO wrote:
| I guess there are several unsaid assumptions here. The
| article is by a domain expert working on their domain. Throw
| work you understand at it, see what it does. Do it before you
| even work on it. I kind of assumed based on the audience that
| most people here would be domain experts.
|
| As for building intuition, perhaps I am overestimating
| what most people are capable of.
|
| Working with and building systems using LLMs over the last
| few years has helped me build a pretty good intuition about
| what is breaking down when the model fails at a task. While
| having an ML background is useful in some very narrow cases
| (like: 'why does an LLM suck at ranking...'), I "think" a
| person can get a pretty good intuition purely based on
| observational outcomes.
|
| I've been wrong before though. When we first started building
| LLM products, I thought, "Anyone can prompt, there is no
| barrier for this skill." That was not the case at all. Most
| people don't do well trying to quantify ambiguity,
| specificity, and logical contradiction when writing a process
| or set of instructions. I was REALLY surprised at how I became
| a "go-to" person to "fix" prompt systems, all based on
| linguistics and systematic process decomposition. Some of
| this was understanding how the auto-regressive attention system
| benefits from breaking the work down into steps, but really
| most of it was just "don't contradict yourself and be clear".
|
| Working with them extensively has also helped me home in on
| how the models get "better" with each release, though most of
| my expertise is with the OpenAI and Anthropic model families.
|
| I still think most engineers "should" be able to build
| intuition generally on what works well with LLMs and how to
| interact with them, but you are probably right. It will be
| just like most ML engineers, who see something work in
| a paper and then just paste it onto their model with no
| intuition about what that structure systemically changes in
| the model dynamics.
| cluckindan wrote:
| So the "fix" includes a completely new function? In a
| cryptography implementation?
|
| I feel like the article is giving out very bad advice which is
| going to end up shooting someone in the foot.
| OneDeuxTriSeiGo wrote:
| The article even states that the solution claude proposed
| wasn't the point. The point was finding the bug.
|
| AIs are very capable heuristic tools. Being able to "sniff
| test" things blind is their specialty.
|
| i.e. Treat them like an extremely capable gas detector that can
| tell you there is a leak and where in the plumbing it is, not a
| plumber who can fix the leak for you.
| thadt wrote:
| Can you expand on what you find to be 'bad advice'?
|
| The author uses an LLM to find _bugs_ and then throw away the
| fix and instead write the code he would have written anyway.
| This seems like a rather conservative application of LLMs.
| Using the 'shooting someone in the foot' analogy - this
| article is an illustration of professional and responsible
| firearm handling.
| didibus wrote:
| This is basically the ideal scenario for coding agents: a pure
| logic, algorithmic problem that's easily verifiable by running
| tests. It's the case that has worked best for me with LLMs.
| pton_xd wrote:
| > Full disclosure: Anthropic gave me a few months of Claude Max
| for free. They reached out one day and told me they were giving
| it away to some open source maintainers.
|
| Related, lately I've been getting tons of Anthropic Instagram
| ads; they must be near a quarter of all the sponsored content I
| see for the last month or so. Various people vibe coding random
| apps and whatnot using different incarnations of Claude. Or just
| direct adverts to "Install Claude Code." I really have no idea
| why I've been targeted so hard, on Instagram of all places. Their
| marketing team must be working overtime.
| simonw wrote:
| I think it might be that they've hit product-market fit.
|
| Developers find Claude Code extremely useful (once they figure
| out how to use it). Many developers subscribe to their
| $200/month plan. Assuming that's profitable (and I expect it
| is, since even for that much money it cuts off at a certain
| point to avoid over-use) Anthropic would be wise to spend a
| _lot_ of money on marketing to try and grow their paying
| subscriber base for it.
| phendrenad2 wrote:
| A whole class of tedious problems has been eliminated by LLMs
| because they are able to look at code in a "fuzzy" way. But this
| can be a liability, too. I have a codebase that "looks kinda"
| like a Node.js project, so AI agents usually assume it is one;
| even if I rename the package.json, they will inspect the
| contents and immediately clock it as "node-like".
___________________________________________________________________
(page generated 2025-11-01 23:00 UTC)