[HN Gopher] The unreasonable effectiveness of fuzzing for portin...
___________________________________________________________________
The unreasonable effectiveness of fuzzing for porting programs
Author : Bogdanp
Score : 148 points
Date : 2025-06-18 16:26 UTC (6 hours ago)
(HTM) web link (rjp.io)
(TXT) w3m dump (rjp.io)
| oasisaimlessly wrote:
| Author FYI: The "You can see the session log here." link to [1]
| is broken.
|
| [1]: https://rjp.io/blog/claude-rust-port-conversation
| rjpower9000 wrote:
| Fixed, thanks!
| nyanpasu64 wrote:
| > Most code doesn't express subtle logic paths. If I test if a
| million inputs are correctly sorted, I've probably implemented
| the sorter correctly.
|
| I don't know if this was referring to Zopfli's sorter or sorting
| in general, but I _have_ heard of a subtle sorting bug in
| Timsort:
| https://web.archive.org/web/20150316113638/http://envisage-p...
| rjpower9000 wrote:
| Thanks for sharing, I did not know about that!
|
| Indeed, this is exactly the type of subtle case you'd worry
| about when porting. Fuzzing would be unlikely to discover a bug
| that only occurs on giant inputs or needs a special
| configuration of lists.
|
| In practice I think it works out okay because most of the time
| the LLM has written correct code, and when it doesn't it's
| introduced a dumb bug that's quickly fixed.
|
| Of course, if the LLM introduces subtle bugs, that's even
| harder to deal with...
| awesome_dude wrote:
| > Fuzzing would be unlikely to discover a bug that only
| occurs on giant inputs or needs a special configuration of
| lists.
|
| I have a concern about people's overconfidence in fuzz
| testing.
|
| It's a great tool, sure, but all it does is select (and try)
| inputs at random from the set of all possible inputs that can
| be generated for the API.
|
| For a strongly typed system that means randomly selecting
| ints from all the possible ints for an API that only accepts
| ints.
|
| If the API accepts any group of bytes possible, fuzz testing
| is going to randomly generate groups of bytes to try.
|
| The only advantage this has over other forms of testing is
| that it's not constrained by people thinking "Oh these are
| the likely inputs to deal with"
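|
| To be concrete, a byte-oriented fuzz target is usually just a
| function over an arbitrary byte slice. A minimal cargo-fuzz /
| libfuzzer-sys style sketch (the from_utf8 call is only a
| stand-in for whatever API is under test):
|
|     #![no_main]
|     use libfuzzer_sys::fuzz_target;
|
|     fuzz_target!(|data: &[u8]| {
|         // The harness receives an arbitrary group of bytes and
|         // only has to avoid crashing or failing its assertions.
|         let _ = std::str::from_utf8(data);
|     });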
| amw-zero wrote:
| There are 2 main problems in generative testing:
|
| - Input data generation (how do you explore enough of the
| program's behavior to have confidence that your test is a good
| proxy for total correctness)
|
| - Correctness statements (how do you express whether or not the
| program is correct for an arbitrary input)
|
| When you are porting a program, you have a built-in correctness
| statement: the port should behave exactly as the source program
| does. This greatly simplifies the testing process.
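|
| As a rough sketch of that built-in correctness statement (both
| functions below are stand-ins; in practice the reference side
| would call the original C code through FFI):
|
|     // Differential check: the port must agree with the original
|     // on every generated input.
|     fn reference_sum(data: &[u8]) -> u32 {
|         data.iter().fold(0u32, |acc, &b| {
|             acc.wrapping_mul(31).wrapping_add(b as u32)
|         })
|     }
|
|     fn ported_sum(data: &[u8]) -> u32 {
|         let mut acc = 0u32;
|         for &b in data {
|             acc = acc.wrapping_mul(31).wrapping_add(b as u32);
|         }
|         acc
|     }
|
|     fn main() {
|         // Tiny xorshift generator so the example needs no crates;
|         // a real harness would use cargo-fuzz or proptest.
|         let mut state: u64 = 0x9e3779b97f4a7c15;
|         let mut next = move || {
|             state ^= state << 13;
|             state ^= state >> 7;
|             state ^= state << 17;
|             state
|         };
|         for _ in 0..1_000_000 {
|             let len = (next() % 64) as usize;
|             let input: Vec<u8> =
|                 (0..len).map(|_| next() as u8).collect();
|             assert_eq!(reference_sum(&input), ported_sum(&input));
|         }
|     }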
| bluGill wrote:
| Several times I've been involved in porting code. Eventually we
| reach a point where we get a lot of bug reports of the form
| "didn't work, and didn't work with the old system either",
| which is to say we ported correctly, but the old system wasn't
| right either; we just hadn't tested that situation until the
| new system had the budget for exhaustive testing. (Normally it
| worked at some point on the old system and got broken in a
| later update.)
| lhmiles wrote:
| Are you the author? You can speed things up and get better
| results sometimes by retrying the initial generation step many
| times in parallel, instead of the interactive rewrite thing.
| rjpower9000 wrote:
| I'm the author. That's a great idea. I didn't explore that for
| this session but it's worth trying.
|
| I didn't measure consistently, but I would guess 60-70% of the
| symbols ported easily, with either one-shot or trivial edits;
| for 20%, Gemini managed to get there but ended up using most of
| its attempts; and with the last 10% it just struggled.
|
| The 20% would be good candidates for multiple generations &
| certainly consumed more than 20% of the porting time.
| rcthompson wrote:
| The author notes that the resulting Rust port is not very
| "rusty", but I wonder if this could also be solved through
| further application of the same principle. Something like telling
| the AI to minimize the use of unsafe etc., while enforcing that
| the result should compile and produce identical outputs to the
| original.
| rjpower9000 wrote:
| It seems feasible, but I haven't thought enough about it. One
| challenge is that as you Rustify the code, it's harder to keep
| the 1-1 mapping with C interfaces. Sometimes to make it more
| Rust-y, you might want an internal function or structure to
| change. You then lose your low-level fuzz tests.
|
| That said, you could have the LLM write equivalence tests, and
| you'd still have the top-level fuzz tests for validation.
|
| So I wouldn't say it's impossible, just a bit harder to
| mechanize directly.
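|
| Roughly what I mean by an equivalence test, as a proptest-style
| sketch (both sort functions are just stand-ins for the C-shaped
| routine and its rustified rewrite):
|
|     use proptest::prelude::*;
|
|     // Stand-in for the mechanically ported, C-shaped routine.
|     fn c_style_sort(v: &mut Vec<u32>) {
|         v.sort_unstable();
|     }
|
|     // Stand-in for the idiomatic rewrite we want to trust.
|     fn idiomatic_sort(v: &[u32]) -> Vec<u32> {
|         let mut out = v.to_vec();
|         out.sort();
|         out
|     }
|
|     proptest! {
|         #[test]
|         fn rustified_matches_port(
|             v in prop::collection::vec(any::<u32>(), 0..256)
|         ) {
|             let mut expected = v.clone();
|             c_style_sort(&mut expected);
|             prop_assert_eq!(idiomatic_sort(&v), expected);
|         }
|     }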
| DrNosferatu wrote:
| It seems inevitable that this will generalize.
| DrNosferatu wrote:
| Why not use the same approach to port the _full set_ of Matlab
| libraries to Octave?
|
| (or an open-source language of your choice)
|
| Matlab manuals are public: it would be clean-room reverse
| engineering.
|
| (and many times the manual page lists the bibliography for the
| underlying definitions of what is being implemented)
| e28eta wrote:
| > LLMs open up the door to performing radical updates that we'd
| never really consider in the past. We can port our libraries from
| one language to another. We can change our APIs to fix issues,
| and give downstream users an LLM prompt to migrate over to the
| new version automatically, instead of rewriting their code
| themselves. We can make massive internal refactorings. These are
| types of tasks that in the past, rightly, a senior engineer
| would reject in a project until it's the last possible option.
| Breaking customers almost never pays off, and it's hard to
| justify refactoring on a "maintenance mode" project.
|
| > But if it's more about finding the right prompt and letting an
| LLM do the work, maybe that changes our decision process.
|
| I don't see much difference between documenting any breaking
| changes in sufficient detail for your library consumers to
| understand them vs "writing an LLM prompt for migrating
| automatically", but if that's what it takes for maintainers to
| communicate the changes, okay!
|
| Just as long as it doesn't become "use this LLM which we've
| already trained on the changes to the library, and you just need
| to feed us your codebase and we'll fix it. PS: sorry, no
| documentation."
| marxism wrote:
| There's a huge difference between documentation and prompts.
| Let me give you a concrete example.
|
| I get requests to "make your research code available on Hugging
| Face for inference" with a link to their integration guide.
| That guide is 80% marketing copy about Git-based repositories,
| collaboration features, and TensorBoard integration. The actual
| implementation details are mixed in throughout.
|
| A prompt would be much more compact.
|
| The difference: I can read a prompt in 30 seconds and decide
| "yes, this is reasonable" or "no, I don't want this change."
| With documentation, I have to reverse-engineer the narrow
| bucket that applies to my specific scenario from a
| one-size-drowns-all ocean.
|
| The person making the request has the clearest picture of what
| they want to happen. They're closest to the problem and most
| likely to understand the nuances. They should pack that
| knowledge densely instead of making me extract it from
| documentation links and back and forth.
|
| Documentation says "here's everything now possible, you can do
| it all!" A prompt says "here's the specific facts you need."
|
| Prompts are a shared social convention now. We all have a rough
| feel for what information you need to provide - you have to be
| matter-of-fact, specific, can't be vague. When I ask someone to
| "write me a prompt," that puts them in a completely different
| mindset than just asking me to "support X".
|
| Everyone has experience writing prompts now. I want to leverage
| that experience to get cooperative dividends. It's division of
| labor - you write the initial draft, I edit it with special
| knowledge about my codebase, then apply it. Now we're sharing
| the work instead of dumping it entirely on the maintainer.
|
| [1] https://peoplesgrocers.com/en/writing/write-prompts-not-
| guid...
| rjpower9000 wrote:
| I was pretty hand-wavy when I made the original comment. I
| was thinking implicitly of things like the Python sub-
| interpreter proposal, which had strong pushback from the
| Numpy engineers at the time (I don't know the current status,
| whether it's a good idea, etc, just something that came to
| mind).
|
| https://lwn.net/Articles/820424/
|
| The objections are of course reasonable, but I kept thinking
| this shouldn't be as big a problem in the future. A lot of
| times we want to make some changes that aren't _quite_
| mechanical, and if they hit a large part of the code base,
| it's hard to justify. But if we're able to defer these types
| of cleanups to LLMs, it seems like this could change.
|
| I don't want a world with no API stability of course, and you
| still have to design for compatibility windows, but it seems
| like we should be able to do better in the future. (More so
| in mono-repos, where you can hit everything at once).
|
| Exactly as you write, the idea with prompts is that they're
| directly actionable. If I want to make a change to API X, I
| can test the prompt against some projects to validate agents
| handle it well, even doing direct prompt optimization, and then
| share it with end users.
| e28eta wrote:
| Yes, there's a difference between "all documentation for a
| project" and "prompt for specific task".
|
| I don't think there should be a big difference between
| "documentation of specific breaking changes in a library and
| how consumers should handle them" and "LLM prompt to change a
| code base for those changes".
|
| You might call it a migration guide. Or it might be in the
| release notes, in a special section for Breaking Changes. It
| might show up in log messages ("you're using this API wrong,
| or it's deprecated").
|
| Why would describing the changes to an LLM be easier than
| explaining them to the engineer on the other end of your API
| change?
| gaogao wrote:
| Domains where fuzzing is useful are generally good candidates for
| formal verification, which I'm pretty bullish about in concert
| with LLMs. This is in part because you can just formally verify
| by exhaustiveness for many problems, but the enhancement is being
| able to prove that you don't need to test certain combinations
| through inductive reasoning and such.
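|
| As a toy illustration of verification by exhaustiveness for a
| port (both functions are made-up stand-ins; the point is that a
| small input domain can simply be enumerated):
|
|     // When the domain is small enough (here, every u16), checking
|     // equivalence exhaustively is a proof, not a sample.
|     fn original(x: u16) -> u16 {
|         x.rotate_left(3) ^ 0x5a5a
|     }
|
|     fn ported(x: u16) -> u16 {
|         ((x << 3) | (x >> 13)) ^ 0x5a5a
|     }
|
|     fn main() {
|         for x in 0..=u16::MAX {
|             assert_eq!(original(x), ported(x), "mismatch at {x}");
|         }
|         println!("equivalent on all 65536 inputs");
|     }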
| rjpower9000 wrote:
| That's an interesting idea. I hadn't thought about it, but it
| would be interesting to consider doing something similar for
| the porting task. I don't know enough about the space: could
| you have an LLM write a formal spec for a C function and then
| validate that the translated function has the same properties?
|
| I guess I worry it would be hard to separate out the "noise",
| e.g. the C code touches some memory on each call so now the
| Rust version has to as well.
| zie1ony wrote:
| I find it amazing that the same ideas pop up in the same period
| of time. For example, I work on test generation and went down
| the same path. I tried to find bugs by prompting "Find bugs in
| this code and implement tests to show it.", but this didn't get
| me far. Then I switched to property (invariant) testing, like
| you, but in my case I ask the AI: "Based on the whole codebase,
| make the property tests." and then I fuzz some random actions
| on the stateful objects and run the property tests over and
| over again.
|
| At first I also wanted to automate everything, but over time I
| realized that the best split is roughly 10% human to 90% AI.
|
| Another idea I'm exploring is AI + Mutation Tests
| (https://en.wikipedia.org/wiki/Mutation_testing). It should help
| the AI generate tests with full coverage.
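|
| Roughly the shape of that stateful loop, as a dependency-free
| sketch (the Account type and its invariant are made up purely
| for illustration):
|
|     // Drive a stateful object with random actions and re-check a
|     // property (invariant) after every step.
|     #[derive(Default)]
|     struct Account {
|         balance: i64,
|     }
|
|     impl Account {
|         fn deposit(&mut self, amount: u32) {
|             self.balance += amount as i64;
|         }
|         fn withdraw(&mut self, amount: u32) {
|             if amount as i64 <= self.balance {
|                 self.balance -= amount as i64;
|             }
|         }
|     }
|
|     fn main() {
|         // Simple LCG so the sketch needs no crates.
|         let mut seed: u64 = 42;
|         let mut next = move || {
|             seed = seed
|                 .wrapping_mul(6364136223846793005)
|                 .wrapping_add(1);
|             seed >> 33
|         };
|         for _ in 0..100_000 {
|             let mut acct = Account::default();
|             for _ in 0..32 {
|                 let amount = (next() % 1_000) as u32;
|                 if next() % 2 == 0 {
|                     acct.deposit(amount);
|                 } else {
|                     acct.withdraw(amount);
|                 }
|                 // The property: the balance never goes negative.
|                 assert!(acct.balance >= 0);
|             }
|         }
|     }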
| wahnfrieden wrote:
| An under-explored approach is to collect data on human usage of
| the app (from production and from internal testers) and feed
| that into your generative inputs.
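|
| Concretely, that can be as simple as replaying a directory of
| recorded inputs as seed cases before any random generation (the
| corpus/ directory and the check itself are hypothetical):
|
|     use std::fs;
|
|     // Stand-in for the real property, e.g. "port output matches
|     // the original program's output".
|     fn check(input: &[u8]) {
|         assert!(input.len() < 10_000_000);
|     }
|
|     fn main() -> std::io::Result<()> {
|         // Each file in corpus/ is assumed to hold one captured
|         // production input; a fuzzer would then mutate these
|         // seeds instead of starting from purely random bytes.
|         for entry in fs::read_dir("corpus")? {
|             let recorded = fs::read(entry?.path())?;
|             check(&recorded);
|         }
|         Ok(())
|     }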
| LAC-Tech wrote:
| I'd have much more confidence in an AI codebase where the human
| has chosen the property tests, than a human codebase where the
| AI has chosen the property tests.
|
| Tests are executable specs. That is the last thing you should
| offload to an LLM.
| koakuma-chan wrote:
| How about an LRM?
| LAC-Tech wrote:
| I do not know this term; could you give a concise
| explanation?
| koakuma-chan wrote:
| LRM is a new term for reasoning LLMs. In my experience,
| either I am bad at prompting, or LRMs are vastly better
| than LLMs at instruction following.
| bccdee wrote:
| Also, a poorly designed test suite makes your code base
| extremely painful to change. A well-designed test suite with
| good abstractions makes it easy to change code, on top of
| which, it makes tests extremely fast to write.
|
| I think the whole idea of getting LLMs to write the tests
| comes from a pandemic of under-abstracted, labour-intensive
| test suites. And that just makes the problem worse.
| LAC-Tech wrote:
| Perhaps it comes from the viewpoint that tests are a chore or
| grunt work; something you have to do but don't really view as
| interesting or important.
|
| (like how I describe what git should do and I get the LLM
| to give me the magic commands with all the confusing nouns
| and verbs and dashes in the right place).
| punnerud wrote:
| Reading that TensorFlow is not used much anymore (outside of
| Google) felt good. I had to check Google Trends:
| https://trends.google.com/trends/explore?date=all&q=%2Fg%2F1...
|
| I started using TensorFlow years ago and switched to PyTorch.
| I hope ML will make switches like TensorFlow to PyTorch faster
| and easier, instead of just the biggest companies eating the
| open source community, as it has been for years.
| screye wrote:
| Google has moved to JAX. I know many people who prefer it over
| pytorch.
| leoh wrote:
| It's okay. The complaints are documentation and limited
| community support (all kinds of architectures are much more
| DIY for it vs PyTorch).
|
| Unrelated gripe: they architected it really poorly from a
| pure sw pov imo. Specifically it's all about Python bindings
| for C++ so the py/c++ layer is tightly coupled both in code
| and in the build system.
|
| They have a huge opportunity to fix this so that, for example,
| Rust bindings could be (reasonably trivially) generated, to say
| nothing of other languages.
| comex wrote:
| Interesting! But there's a gap between aspirations and what was
| accomplished here.
|
| Early on in the blog post, the author mentions that "c2rust can
| produce a mechanical translation of C code to Rust, though the
| result is intentionally 'C in Rust syntax'". The flow of the post
| seems to suggest that LLMs can do better. But later on, they say
| that their final LLM approach produces Rust code which "is very
| 'C-like'" because "we use the same unsafe C interface for each
| symbol we port". Which sounds like they achieved roughly the same
| result as c2rust, but with a slower and less reliable process.
|
| It's true that, as the author says, "because our end result has
| end-to-end fuzz tests and tests for every symbol, it's now much
| easier to 'rustify' the code with confidence". But it would have
| been possible to use c2rust for the actual port, and separately
| use an LLM to write fuzz tests.
|
| I'm not criticizing the approach. There's clearly a lot of
| promise in LLM-based code porting. I took a look at the earlier,
| non-fuzz-based Claude port mentioned in the post, and it reads
| like idiomatic Rust code. It would be a perfect proof of concept,
| if only it weren't (according to the author) subtly buggy.
| Perhaps there's a way to use fuzzing to remove the bugs while
| keeping the benefits compared to mechanical translation.
| Unfortunately, the author's specific approach to fuzzing seems to
| have removed both the bugs and the benefits. Still, it's a good
| base for future work to build on.
| maxjustus wrote:
| I used this general approach to port the ClickHouse-specific
| version of cityHash64 to JS from an existing Golang
| implementation https://github.com/maxjustus/node-ch-
| city/blob/main/ch64.js. I think it works particularly well when
| porting pure functions.
___________________________________________________________________
(page generated 2025-06-18 23:00 UTC)