[HN Gopher] Ask HN: How are you using LLMs for traversing decomp...
       ___________________________________________________________________
        
       Ask HN: How are you using LLMs for traversing decompiler output?
        
       I need to reverse a binary made years ago, and I have zero
       experience with C++, so I think it would be a good experiment to
       get an LLM to help me in any way.
        
       Author : mjbale116
       Score  : 44 points
       Date   : 2025-01-02 10:16 UTC (2 days ago)
        
       | carom wrote:
       | Binary Ninja has an AI integration called Sidekick. It has a
       | free trial, but I'm not sure it can be used in the free web
       | version. [1]
       | 
       | In my experience, the off-the-shelf LLMs (e.g. ChatGPT) do a
       | pretty poor job with assembly; they cannot reason about the
       | stack or stack frames well.
       | 
       | I think your job will be the same with or without AI: figuring
       | out the data structures and data types a function is operating on
       | and naming variables.
       | 
       | What are you reverse engineering for? For example, getting a full
       | compilable decompilation has different goals than finding
       | vulnerabilities or patching a bug.
       | 
       | 1. https://sidekick.binary.ninja/
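
The "naming variables" part of the job described above can be done on exported decompiler text as well as inside the tool. A minimal sketch, assuming you have pasted pseudo-C into a string and built a name map by hand (or asked an LLM to propose one). The function name `rename_identifiers`, the sample snippet, and the mapping are all hypothetical; this is plain text substitution, not any decompiler's API.

```python
import re


def rename_identifiers(pseudo_c: str, names: dict[str, str]) -> str:
    """Replace whole identifiers only, so renaming 'v1' leaves 'v10' alone."""
    if not names:
        return pseudo_c
    # \b anchors restrict each match to a complete identifier.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    # A function replacement avoids backslash-escape surprises in new names.
    return pattern.sub(lambda m: names[m.group(1)], pseudo_c)


# Hypothetical decompiler output with auto-generated names.
snippet = "int sub_401000(int a1) { int v1 = a1 * 2; int v10 = v1 + 1; return v10; }"
mapping = {
    "sub_401000": "double_plus_one",
    "a1": "value",
    "v1": "doubled",
    "v10": "result",
}
renamed = rename_identifiers(snippet, mapping)
print(renamed)
```

Being purely textual, this would also rewrite matching words inside string literals and comments, so renaming inside the disassembler itself is the safer route when that is available.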
        
         | th0ma5 wrote:
         | This is what I gather from reverse engineering material I've
         | read and groups I've been around. Hidden state, hidden data
         | structures, hidden automations all abound, and there simply
         | isn't enough detail in the assembler itself to bridge the
         | hardware's internal conceptualization and processes.
        
         | aidanhs wrote:
         | Out of curiosity, what would you say the current state of the
         | art is for full compilable decompilation? This is something I
         | have a vague interest in but I'm not involved enough in the
         | space to be on top of the latest and greatest tooling.
        
           | Retr0id wrote:
           | Looking at an individual function, IDA hex-rays output is
           | often recompilable as-is (or with minor modifications), but
           | it won't necessarily be idiomatic, especially if you don't
           | have symbol information.
        
           | feznyng wrote:
           | Echoing IDA but its pricing is a huge PITA if you're using it
           | in a hobbyist capacity i.e. you don't have an employer
           | willing to pay for it. Could opt for the home version but
           | that's a yearly cost and you have to use their cloud
           | decompiler. Ghidra's your best bet if you want something FOSS
           | and community-driven although not as great at decompilation.
        
           | carom wrote:
           | Most decompilers do not strive for recompilability. [1] I
           | believe there are (or were) some academic projects that aimed
           | for recompilation as a core feature, but it is a hard
           | problem.
           | 
           | On the commercial side, IDA / HexRays [2] is very strong for
           | C-like decompilation. If you're looking at Go, Rust, or even
           | C++ it is going to be a little bit more messy. As other
           | commenters have said, you'll work function-by-function and it
           | is expensive, though the free version does have decompilation
           | (F5) for x86 and x64 (IIRC).
           | 
           | Binary Ninja [3] (no affiliation) is the coolest IMO, they
           | have multiple intermediate representations they lift the
           | assembly through. So you get like assembly -> low level IL ->
           | medium level IL -> high level IL. There are also SSA forms
           | (static single assignment) that can aid in programmatic
           | analyses. The high level IL is very readable but makes no
           | effort to be compilable as a programming language. That being
           | said, Binary Ninja has implemented different "views" on the
           | HLIL so you can show it as pseudo-C, Rust, etc. There is a
           | free online version and the commercial version is cheaper
           | than IDA but still expensive. Good Python API, good UI.
           | 
           | Ghidra [4] is the RE framework released by NSA. It is free
           | and open source. It supports a ton of niche architectures.
           | This is what most people use. I think the UI is awful,
           | personally. It has a decompiler, the results are OK. They
           | have an intermediate representation (P-Code) and plugins are
           | in Java (since it is written in Java). I haven't worked much
           | with it.
           | 
           | Most online decompilations you see for old games are likely
           | using Ghidra, some might be using IDA. This is largely a
           | manual process of doing a function at a time and building up
           | the mental map of the program and how things interact.
           | 
           | Also worth mentioning are lifters. There were a few projects
           | that aimed to lift assembly to LLVM IR (compiler framework's
           | intermediate representation), with the idea being that then
           | all your analyses could be written over LLVM IR as a lingua
           | franca. Since it is in LLVM IR, it would also be recompilable
           | and retargetable. [5][6]
           | 
           | 1. https://reverseengineering.stackexchange.com/questions/260
           | 3/...
           | 
           | 2. https://hex-rays.com/ida-free
           | 
           | 3. https://binary.ninja/free/
           | 
           | 4. https://ghidra-sre.org/
           | 
           | 5. https://github.com/avast/retdec
           | 
           | 6. https://github.com/lifting-bits/mcsema
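
For readers unfamiliar with the SSA forms mentioned above, here is a toy illustration of the idea: every assignment gets a fresh version of its destination, so each name is defined exactly once. This is a sketch for a single straight-line block only (no branches, so no phi nodes), using a made-up tuple format for instructions. It is not Binary Ninja's IL or API.

```python
def to_ssa(instructions):
    """Rename straight-line (dest, op, args) instructions into SSA form.

    Handles one basic block only: no branches, hence no phi nodes.
    """
    version = {}  # variable name -> latest version number
    ssa = []

    def use(arg):
        # Integer constants pass through unchanged. A variable read before
        # any definition is treated as version 0 (an input to the block).
        if isinstance(arg, int):
            return arg
        return f"{arg}_{version.get(arg, 0)}"

    for dest, op, args in instructions:
        new_args = tuple(use(a) for a in args)    # read operands first
        version[dest] = version.get(dest, 0) + 1  # then create a new version
        ssa.append((f"{dest}_{version[dest]}", op, new_args))
    return ssa


# Hypothetical three-address code: x = a + 1; x = x * 2; y = x - a
block = [
    ("x", "add", ("a", 1)),
    ("x", "mul", ("x", 2)),
    ("y", "sub", ("x", "a")),
]
ssa_block = to_ssa(block)
for dest, op, args in ssa_block:
    print(dest, "=", op, *args)
```

After renaming, the two writes to `x` become `x_1` and `x_2`, which is what makes "where did this value come from" a simple lookup in programmatic analyses.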
        
       | netsec_burn wrote:
       | I made a site to use LLMs to help me with reverse engineering.
       | The output is surprisingly readable, even with C++ classes. Let
       | me know any feedback you might have:
       | https://decompiler.zeroday.engineering/
        
         | btown wrote:
         | What kind of file should be uploaded?
        
           | netsec_burn wrote:
           | The allowed types are a bit misleading. Any binary is
           | accepted, any architecture. You can upload shared objects,
           | ELF executables, PE binaries, etc.
        
       | ianhawes wrote:
       | Highly recommend it. I reversed an app with o1 Pro Mode and the
       | analysis of the obfuscated C# code matched up accurately with
       | what I eventually discovered by manually reversing.
        
         | chc4 wrote:
         | Reverse engineering C# is extremely different from C++
         | binaries.
        
       | flashgordon wrote:
       | Interesting. Wouldn't this actually be a deterministic problem
       | based on graph analysis? I'd have thought LLMs would be more
       | effective taking the output of some graph recognizer and then
       | identifying what those higher-level constructs map to.
        
         | warkdarrior wrote:
         | Deterministic maybe, but surely undecidable in the general case
         | since you need whole program analysis to understand, for
         | example, the purpose of a memory location. ML may help
         | approximate this undecidable problem.
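
As a small illustration of the deterministic, graph-based part of the problem discussed above: recognizing loops in a control-flow graph can be done with a depth-first search that records edges pointing back to a block still on the DFS stack. This is a sketch under assumptions: the CFG is a plain dict of hypothetical block names, and the back-edge test corresponds to real loops only for reducible control flow, which is what compilers typically emit. Working out what a loop is for is the part this does not address.

```python
def find_back_edges(cfg, entry):
    """Return edges (src, dst) where dst is still on the DFS stack."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the stack, finished
    color = {node: WHITE for node in cfg}
    back_edges = []

    def visit(node):
        color[node] = GREY
        for succ in cfg.get(node, []):
            state = color.get(succ, WHITE)
            if state == GREY:
                # Successor is an ancestor in the DFS tree: a loop edge.
                back_edges.append((node, succ))
            elif state == WHITE:
                visit(succ)
        color[node] = BLACK

    visit(entry)
    return back_edges


# Hypothetical CFG of a simple while loop.
cfg = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["header"],
    "exit": [],
}
loops = find_back_edges(cfg, "entry")
print(loops)
```

The recursive visit is fine for small functions; a very large CFG would need an explicit stack to avoid Python's recursion limit.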
        
       | lumb63 wrote:
       | It has nothing to do with LLMs, but Ghidra is a wonderful tool.
        
       | jkstill wrote:
       | I've only played a bit with this, but it was impressive.
       | 
       | https://ghidra-sre.org/
        
       | feznyng wrote:
       | You could use the LLM to help you write utility scripts for
       | whatever disassembler you're using, e.g. Python for IDA. That
       | might work better than feeding it raw assembly.
       | 
       | Game RE communities also have all sorts of neat utilities for
       | decompiling large C++ binaries. Skyrim's community is pretty
       | active with Ghidra/IDA.
       | 
       | Guessing you're not lucky enough to have a PDB?
        
         | tonetegeatinst wrote:
         | PDB?
        
           | feznyng wrote:
           | Program database file - only relevant if the binary is for
           | Windows. But it makes decomp an order of magnitude easier.
           | I'd be surprised if OP had one though.
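
A quick way to see whether a Windows binary was built with a PDB is to look for the CodeView debug record it usually carries: the 4-byte signature `RSDS`, a 16-byte GUID, a 4-byte age, then the null-terminated path of the PDB. The sketch below does a brute-force byte scan rather than parsing the PE debug directory properly, and the `.pdb` suffix filter is an added heuristic. The function name and sample bytes are hypothetical. Finding a path only tells you which PDB the build produced, not that you have the file.

```python
def find_pdb_paths(data: bytes) -> list[str]:
    """Scan raw bytes for CodeView 'RSDS' records and return PDB paths."""
    paths = []
    start = 0
    while True:
        idx = data.find(b"RSDS", start)
        if idx == -1:
            break
        path_start = idx + 4 + 16 + 4  # signature + GUID + age
        path_end = data.find(b"\x00", path_start)
        if path_end > path_start:
            candidate = data[path_start:path_end]
            # Heuristic (an assumption): keep only names ending in .pdb
            # to filter out chance occurrences of the signature.
            if candidate.lower().endswith(b".pdb"):
                paths.append(candidate.decode("utf-8", errors="replace"))
        start = idx + 4
    return paths


# Synthetic bytes standing in for a real file; with a real binary you would
# use something like: data = open("target.exe", "rb").read()
fake = (
    b"\x00" * 8
    + b"RSDS"
    + bytes(16)                   # GUID (zeros here)
    + (1).to_bytes(4, "little")   # age
    + b"C:\\build\\app.pdb\x00"
    + b"tail"
)
pdb_paths = find_pdb_paths(fake)
print(pdb_paths)
```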
        
       | klmitchell2 wrote:
       | https://github.com/radareorg/r2ai
        
       | Dwedit wrote:
       | Have you tried Ghidra yet? If you still have your debug symbols,
       | then it can do a really good job.
        
       | JosephRedfern wrote:
       | These guys are building foundation models for this purpose:
       | https://reveng.ai/. The results are quite compelling, and they
       | have plugins for your favourite reverse engineering tools.
        
       ___________________________________________________________________
       (page generated 2025-01-04 23:00 UTC)