[HN Gopher] Ask HN: How are you using LLMs for traversing decomp...
___________________________________________________________________
Ask HN: How are you using LLMs for traversing decompiler output?
I need to reverse a binary made years ago, and I have zero
experience with cpp, so I think it would be a good experiment to
get an LLM to help me in any way.
Author : mjbale116
Score : 44 points
Date : 2025-01-02 10:16 UTC (2 days ago)
| carom wrote:
| Binary Ninja has an AI integration called Sidekick. It has a
| free trial, but I'm not sure it can be used in the free web
| version. [1]
|
| In my experience, the off-the-shelf LLMs (e.g. ChatGPT) do a
| pretty poor job with assembly; they cannot reason about the
| stack or stack frames well.
|
| I think your job will be the same with or without AI: figuring
| out the data structures and data types a function is operating
| on, and naming variables.
|
| What are you reverse engineering for? For example, getting a full
| compilable decompilation has different goals than finding
| vulnerabilities or patching a bug.
|
| 1. https://sidekick.binary.ninja/
| th0ma5 wrote:
| This is what I gather from reverse engineering material I've
| read and groups I've been around. Hidden state, hidden data
| structures, hidden automations all abound, and there simply
| isn't enough detail in the assembler itself to bridge the
| hardware's internal conceptualization and processes.
| aidanhs wrote:
| Out of curiosity, what would you say the current state of the
| art is for full compilable decompilation? This is something I
| have a vague interest in but I'm not involved enough in the
| space to be on top of the latest and greatest tooling.
| Retr0id wrote:
| Looking at an individual function, IDA Hex-Rays output is
| often recompilable as-is (or with minor modifications), but
| it won't necessarily be idiomatic, especially if you don't
| have symbol information.
| feznyng wrote:
| Echoing IDA, but its pricing is a huge PITA if you're using it
| in a hobbyist capacity, i.e. you don't have an employer
| willing to pay for it. You could opt for the home version, but
| that's a yearly cost and you have to use their cloud
| decompiler. Ghidra's your best bet if you want something FOSS
| and community-driven, although it's not as great at decompilation.
| carom wrote:
| Most decompilers do not strive for recompilability. [1] I
| believe there are (or were) some academic projects that aimed
| for recompilation as a core feature, but it is a hard
| problem.
|
| On the commercial side, IDA / Hex-Rays [2] is very strong for
| C-like decompilation. If you're looking at Go, Rust, or even
| C++ it is going to be a little bit more messy. As other
| commenters have said, you'll work function-by-function and it
| is expensive, though the free version does have decompilation
| (F5) for x86 and x64 (IIRC).
|
| Binary Ninja [3] (no affiliation) is the coolest IMO. They
| have multiple intermediate representations that they lift the
| assembly through, so you get assembly -> low level IL ->
| medium level IL -> high level IL. There are also SSA forms
| (static single assignment) that can aid in programmatic
| analyses. The high level IL is very readable but makes no
| effort to be compilable as a programming language. That being
| said, Binary Ninja has implemented different "views" on the
| HLIL so you can show it as pseudo-C, Rust, etc. There is a
| free online version and the commercial version is cheaper
| than IDA but still expensive. Good Python API, good UI.
|
| Ghidra [4] is the RE framework released by NSA. It is free
| and open source. It supports a ton of niche architectures.
| This is what most people use. I think the UI is awful,
| personally. It has a decompiler, and the results are OK. They
| have an intermediate representation (P-Code) and plugins are
| in Java (since it is written in Java). I haven't worked much
| with it.
|
| Most online decompilations you see for old games are likely
| using Ghidra, some might be using IDA. This is largely a
| manual process of doing a function at a time and building up
| the mental map of the program and how things interact.
|
| Also worth mentioning are lifters. There were a few projects
| that aimed to lift assembly to LLVM IR (the LLVM compiler
| framework's intermediate representation), with the idea being
| that all your analyses could then be written over LLVM IR as a
| lingua franca. Since it is in LLVM IR, it would also be
| recompilable and retargetable. [5][6]
|
| 1. https://reverseengineering.stackexchange.com/questions/2603/...
|
| 2. https://hex-rays.com/ida-free
|
| 3. https://binary.ninja/free/
|
| 4. https://ghidra-sre.org/
|
| 5. https://github.com/avast/retdec
|
| 6. https://github.com/lifting-bits/mcsema
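The SSA forms mentioned in the comment above can be illustrated with a small sketch. This is only a toy: the tuple-based instruction format is hypothetical and is not the IL of Binary Ninja, Ghidra, or any other tool, and it handles straight-line code only, so no phi nodes are needed. It shows the core idea that every assignment gets a fresh name, which makes "which definition does this use read?" trivial to answer programmatically.

```python
# Toy SSA renaming for straight-line code (no branches, so no phi nodes).
# The (dest, op, lhs, rhs) instruction format is hypothetical.

def to_ssa(instructions):
    """Rename each assignment target to a fresh version (x -> x_1, x_2, ...)."""
    version = {}   # variable name -> latest version number
    result = []

    def use(operand):
        # Constants and never-assigned inputs are left unchanged.
        if operand in version:
            return f"{operand}_{version[operand]}"
        return operand

    for dest, op, lhs, rhs in instructions:
        # Rewrite the operands first, so "x = x * 2" reads the old x.
        lhs, rhs = use(lhs), use(rhs)
        version[dest] = version.get(dest, 0) + 1
        result.append((f"{dest}_{version[dest]}", op, lhs, rhs))
    return result


program = [
    ("x", "add", "a", "1"),
    ("x", "mul", "x", "2"),
    ("y", "sub", "x", "a"),
]
ssa = to_ssa(program)
for dest, op, lhs, rhs in ssa:
    print(f"{dest} = {op} {lhs}, {rhs}")
```

Real SSA construction also has to place phi nodes where control flow merges, which this sketch deliberately leaves out.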
| netsec_burn wrote:
| I made a site to use LLMs to help me with reverse engineering.
| The output is surprisingly readable, even with C++ classes. Let
| me know any feedback you might have:
| https://decompiler.zeroday.engineering/
| btown wrote:
| What kind of file should be uploaded?
| netsec_burn wrote:
| The allowed types are a bit misleading. Any binary is
| accepted, any architecture. You can upload shared objects,
| ELF executables, PE binaries, etc.
| ianhawes wrote:
| Highly recommend it. I reversed an app with o1 Pro Mode and the
| analysis of the obfuscated C# code matched up accurately with
| what I eventually discovered by manually reversing.
| chc4 wrote:
| Reverse engineering C# is extremely different from C++
| binaries.
| flashgordon wrote:
| Interesting. Wouldn't this actually be a deterministic problem
| based on graph analysis? I'd have thought LLMs would be more
| effective taking the output of some graph recognizer and then
| identifying what those higher-level constructs map to.
| warkdarrior wrote:
| Deterministic maybe, but surely undecidable in the general case
| since you need whole program analysis to understand, for
| example, the purpose of a memory location. ML may help
| approximate this undecidable problem.
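As a rough illustration of the deterministic graph-analysis side of this exchange, here is a minimal sketch that finds loop candidates in a control-flow graph by looking for back edges during a depth-first search. The CFG and its block names are made up for the example, and this is one simple technique, not a description of what any particular decompiler does. It covers only the structural part; the harder question raised in the reply (what a memory location is for) is not addressed by it.

```python
# Find back edges in a control-flow graph with a depth-first search.
# An edge (src, dst) is a back edge when dst is still on the DFS stack,
# which is a common way to spot loop candidates.

def find_back_edges(cfg, entry):
    """Return edges (src, dst) where dst is an ancestor on the DFS stack."""
    state = {}        # node -> "active" (on the stack) or "done"
    back_edges = []

    def visit(node):
        state[node] = "active"
        for succ in cfg.get(node, []):
            if state.get(succ) == "active":
                back_edges.append((node, succ))
            elif succ not in state:
                visit(succ)
        state[node] = "done"

    visit(entry)
    return back_edges


# Hypothetical CFG: entry -> head -> body -> head (a loop), head -> exit
cfg = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}
loops = find_back_edges(cfg, "entry")
print(loops)
```

The recursive search is fine for small graphs; a large function would need an explicit stack to stay under Python's recursion limit.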
| lumb63 wrote:
| It has nothing to do with LLMs, but Ghidra is a wonderful tool.
| jkstill wrote:
| I've only played with this, but it was impressive.
|
| https://ghidra-sre.org/
| feznyng wrote:
| You could use the LLM to help you write utility scripts for
| whatever disassembler you're using, e.g. Python for IDA. That
| might work better than feeding it raw assembly.
|
| Game RE communities also have all sorts of neat utilities for
| decompiling large cpp binaries. Skyrim's community is pretty
| active with Ghidra/IDA.
|
| Guessing you're not lucky enough to have a PDB?
| tonetegeatinst wrote:
| PDB?
| feznyng wrote:
| Program database file. It's only relevant if the binary is
| for Windows, but it makes decomp an order of magnitude easier.
| I'd be surprised if OP had one though.
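A small example of the kind of utility script suggested in this subthread: applying a table of suggested names (for instance, ones proposed by an LLM) to exported decompiler text. This sketch works on plain text rather than through a disassembler's scripting API, so nothing here is an IDA or Ghidra call. The placeholder names such as sub_401000 and the rename table are assumptions made up for the example.

```python
import re

def apply_renames(pseudo_c, renames):
    """Replace whole identifiers in decompiler text using a rename table."""
    if not renames:
        return pseudo_c
    # Longest names first, and \b on both sides, so that renaming
    # sub_401000 never touches a longer name such as sub_4010000.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    # A single pass means a new name is never itself renamed again.
    return pattern.sub(lambda m: renames[m.group(1)], pseudo_c)


# Hypothetical decompiler output and hypothetical suggested names.
snippet = "int sub_401000(int a1) { return sub_4010000(a1) + v3; }"
renames = {"sub_401000": "parse_header", "v3": "offset"}
renamed = apply_renames(snippet, renames)
print(renamed)
```

Inside a real tool you would normally rename through its own API so that cross-references stay consistent; the text-level version is just the easiest thing to try first.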
| klmitchell2 wrote:
| https://github.com/radareorg/r2ai
| Dwedit wrote:
| Have you tried Ghidra yet? If you still have your debug symbols,
| then it can do a really good job.
| JosephRedfern wrote:
| These guys are building foundational models for this purpose:
| https://reveng.ai/. The results are quite compelling, and they
| have plugins for your favourite reverse engineering tools.
___________________________________________________________________
(page generated 2025-01-04 23:00 UTC)