[HN Gopher] LLM4Decompile: Decompiling Binary Code with LLM
       ___________________________________________________________________
        
       LLM4Decompile: Decompiling Binary Code with LLM
        
       Author : Davidbrcz
       Score  : 303 points
       Date   : 2024-03-17 10:15 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | potatoman22 wrote:
        | It's interesting that the 6b model outperforms the 33b model. I
        | wonder if that means the 33b model needs more training data? It
        | was pretrained on ~1 million C programs, compared to
        | DeepSeek-Coder, which was trained on 2 trillion tokens, a few
        | orders of magnitude more data.
       | 
       | I'm also curious about how this compares to non-LLM solutions.
        
         | mattashii wrote:
         | > on ~1 million C programs, compared to [...] 2 trillion
         | tokens, which is a few orders of magnitude more data.
         | 
          | Are those directly comparable? That would require the average
          | C program in the set to be orders (plural) of magnitude
          | smaller than 2m tokens, which could indeed be true but sounds
          | like an optimistic assumption.
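          | 
          | Quick back-of-envelope (a sketch; per-program token counts
          | aren't published, so the numbers are only illustrative):
          | 
          |     total_tokens = 2_000_000_000_000  # DeepSeek-Coder corpus
          |     num_programs = 1_000_000          # LLM4Decompile C programs
          |     # break-even average size for the two datasets to match:
          |     print(total_tokens / num_programs)  # 2,000,000 tokens/program
          | 
          | For the C corpus to be plural orders of magnitude smaller, the
          | average program would have to be well under ~20k tokens.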
        
         | Der_Einzige wrote:
          | This has been the dynamic with LLMs for a while. The majority
          | of LLMs are massively _undertrained_. 7b models are the least
          | "undertrained" mainstream models we have, which is why they
          | have proliferated so much among the LLM fine-tuning community.
        
       | maCDzP wrote:
        | Can this be used for deobfuscation of code? I really hadn't
        | thought about LLMs being a tool during reverse engineering.
        
         | Tiberium wrote:
         | Big LLMs like GPT-4 (and even GPT 3.5 Turbo) can be directly
         | used to beautify obfuscated/minified JS, see e.g.
         | https://thejunkland.com/blog/using-llms-to-reverse-javascrip...
         | and https://news.ycombinator.com/item?id=34503233
        
         | Eager wrote:
         | I have tried feeding some of the foundation models obfuscated
         | code from some of the competitions.
         | 
         | People might think that the answers would be in the training
         | data already, but I didn't find that to be the case. At least
         | in my small experiments.
         | 
          | The models did try to guess what the code does. They would say
          | things like, "It seems to be trying to print some message to
          | the console". I wasn't able to get full solutions.
          | 
          | It's definitely worth more research, not just as a curiosity:
          | these kinds of problems are good proxies for other tasks and
          | also make excellent benchmarks for LLMs in particular.
        
         | evmar wrote:
         | I did a little experiment with this here:
         | 
         | https://neugierig.org/software/blog/2023/01/compiling-advent...
        
       | kken wrote:
       | Pretty wild how well GPT4 is still doing in comparison. It's
       | significantly better than their model at creating compilable
       | code, but is less accurate at recreating functional code. Still
       | quite impressive.
        
       | nebula8804 wrote:
        | It will be interesting to see if there is some way to train a
        | decompilation model based on who we know developed the
        | application, using their previous code as training data. For
       | example: Super Mario 64 and Zelda 64 were fully decompiled and a
       | handful of other N64 games are in the process. I wonder if we
       | could map which developers worked on these two games (maybe even
       | guess who did what module) and then use that to more easily
       | decompile any other game that had those developers working on it.
       | 
       | If this gets really good, maybe we can dream of having a fully
       | de-obfuscated and open source life. All the layers of binary
       | blobs in a PC can finally be decoded. All the drivers can be
       | open. Why not do the OS as well! We don't have to settle for
       | Linux, we can bring back Windows XP and back port modern security
       | and app compatibility into the OS and Microsoft can keep their
       | Windows 11 junk...at least one can dream! :D
        
         | ZitchDog wrote:
         | I doubt the code would be identifiable. It wouldn't be the
         | actual code written, but it would be very similar. But I assume
         | many elements of code style would be lost, and any semblance of
         | code style would be more or less hallucinated.
        
           | K0IN wrote:
            | if it can make tests from the decompiled code, we could
            | reimplement it in our own code style. might be cool to have
            | a bunch of llms working together with feedback loops.
        
         | coddle-hark wrote:
         | I wrote my bachelor thesis on something tangential --
         | basically, some researchers found that it was possible _in some
         | very specific circumstances_ to train a classifier to do author
         | attribution (i.e. figure out who wrote the program) based just
         | on the compiled binaries they produced. I don't think the
         | technique has been used for anything actually useful, but it's
         | cool to see that individual coding style survives the
         | compilation process, so much so that you can tell one person's
         | compiled programs apart from another's.
        
         | userbinator wrote:
         | _If this gets really good, maybe we can dream of having a fully
         | de-obfuscated and open source life. All the layers of binary
         | blobs in a PC can finally be decoded. All the drivers can be
         | open. Why not do the OS as well!_
         | 
          | Decompilers already exist and are really good. If an LLM can do
          | the same as these existing decompilers, you can bet the lawyers
          | will consider it an equivalent process. The main problem is
          | legal/political, not technical.
        
       | kukas wrote:
       | Hey, I am working on my own LLM-based decompiler for Python
       | bytecode (https://github.com/kukas/deepcompyle). I feel there are
       | not many people working on this research direction but I think it
       | could be quite interesting, especially now that longer attention
       | contexts are becoming feasible. If anyone knows a team that is
       | working on this, I would be quite interested in cooperation.
        
         | ok123456 wrote:
          | Is there a benefit to using an LLM for Python bytecode? In my
          | experience, Python bytecode is high-level enough that it's
          | possible to translate it directly back to source code.
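          | 
          | The dis module makes the point; a toy illustration, not from
          | any actual decompiler:
          | 
          |     import dis
          | 
          |     def add(a, b):
          |         return a + b
          | 
          |     # the printed opcodes map almost one-to-one back to
          |     # "return a + b", which is why rule-based Python
          |     # decompilers are feasible at all
          |     dis.dis(add)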
        
           | kukas wrote:
            | My motivation is that the existing decompilers only work for
            | Python versions up to ~3.8. Having a model that could be
            | finetuned on every new Python version release might remove
            | the need for a highly specialized programmer to update the
            | decompiler for compatibility with each new version.
           | 
           | It is also a toy example for me to set up a working pipeline
           | and then try to decompile more interesting targets.
        
         | a2code wrote:
          | Why Python? First, Python is a language with a large body of
          | open-source code. Second, I do not think it is used much for
          | software that is distributed as binaries?
        
       | jagrsw wrote:
        | Decompilation is somewhat of a default choice for ML in the world
        | of comp-sec.
       | 
       | Searching for vulns and producing patches in source code is a bit
       | problematic, as the databases of vulnerable source code examples
       | and their corresponding patches are neither well-structured nor
       | comprehensive, and sometimes very, very specific to the analyzed
       | code (for higher abstraction type of problems). So, it's not easy
       | to train something usable beyond standard mem safety problems and
       | use of unsafe APIs.
       | 
        | The area of fuzzing is somewhat messy, with sporadic efforts
        | undertaken here and there, but it also requires a lot of
        | preparatory work, and the results might not be groundbreaking
        | until we reach a point where we can feed an ML model the entire
        | source code of a project and have it analyze and identify all the
        | bugs, produce fixes, and provide offending inputs. In other
        | words: not yet.
       | 
        | Decompilation, by contrast, is a fairly standard problem: it is
        | possible to produce input-output pairs more or less at will from
        | existing source code, using various compiler switches, CPU
        | architectures, ABIs, obfuscations, and syscall calling
        | conventions, and then train models on those pairs (i.e. with the
        | compilation direction reversed).
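        | 
        | A minimal sketch of that pair generation (assuming gcc/clang on
        | PATH; the corpus directory and flag matrix are made up):
        | 
        |     import itertools
        |     import pathlib
        |     import subprocess
        | 
        |     sources = list(pathlib.Path("corpus").glob("*.c"))
        |     compilers = ["gcc", "clang"]
        |     opt_levels = ["-O0", "-O1", "-O2", "-O3"]
        | 
        |     pairs = []
        |     for src, cc, opt in itertools.product(sources, compilers,
        |                                           opt_levels):
        |         asm = src.parent / f"{src.stem}.{cc}{opt}.s"
        |         res = subprocess.run([cc, opt, "-S", str(src),
        |                               "-o", str(asm)])
        |         if res.returncode == 0:
        |             # training example: assembly in, original C out
        |             pairs.append((asm.read_text(), src.read_text()))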
        
       | a2code wrote:
       | The problem is interesting in at least two aspects. First, an
       | ideal decompiler would eliminate proprietary source code. Second,
       | the abundant publicly available C code allows you to simply make
       | a dataset of paired ASM and source code. There is also a lot of
       | variety with optimization level, compiler choice, and platform.
       | 
       | What is unclear to me is: why did the authors fine-tune the
       | DeepSeek-Coder model? Can you train an LLM from zero with a
       | similar dataset? How big does the LLM need to be? Can it run
       | locally?
        
         | 3abiton wrote:
         | I assume it's related to the cost of training vs fine-tuning.
         | It could be also a starting point to validate an idea.
        
         | mike_hearn wrote:
         | Most proprietary code runs behind firewalls and won't be
         | affected by this one way or another.
         | 
         | It's basically always better to start training with a pre-
         | trained model rather than random, even if what you want isn't
         | that close to what you start with.
        
       | madisonmay wrote:
       | This is an excellent use case for LLM fine-tuning, purely because
       | of the ease of generating a massive dataset of input / output
       | pairs from public C code
        
         | bt1a wrote:
         | I would also think that generating a very large amount of C
         | code using coding LLMs (using deepseek, for example, +
         | verifying that the output compiles) as synthetic training data
         | would be quite beneficial in this situation. Generally the
         | quality of synthetic training data is one of the main concerns,
         | but in this case, the ability for the code to compile is the
         | crux.
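          | 
          | A rough sketch of that compile filter (assumes a Unix-like
          | system with gcc installed):
          | 
          |     import os
          |     import subprocess
          |     import tempfile
          | 
          |     def compiles(c_code: str) -> bool:
          |         # keep an LLM-generated sample only if gcc accepts it
          |         with tempfile.NamedTemporaryFile(
          |                 "w", suffix=".c", delete=False) as f:
          |             f.write(c_code)
          |             path = f.name
          |         try:
          |             res = subprocess.run(
          |                 ["gcc", "-c", path, "-o", os.devnull],
          |                 capture_output=True)
          |             return res.returncode == 0
          |         finally:
          |             os.remove(path)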
        
       | klik99 wrote:
       | This is a fascinating idea, but (honest question, not a
       | judgement) would the output be reliable? It would be hard to
       | identify hallucinations since recompiling could produce different
       | machine code. Particularly if there is some novel construct that
        | could be a key part of the code. Are there ways of also reporting
        | the LLM's confidence in sections like this when running
        | generatively? It's an amazing idea but I worry it would stumble
       | invisibly on the parts that are most critical. I suppose it would
       | just need human confirmation on the output
        
         | Eager wrote:
         | This is why round-tripping the code is important.
         | 
         | If you decompile the binary to source, then compile the source
         | back to binary you should get the original binary.
         | 
         | You just need to do this enough times until the loss drops to
         | some acceptable amount.
         | 
         | It's a great task for reinforcement learning, which is known to
         | be unreasonably effective for these types of problems.
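          | 
          | Concretely, the check might look like this (a sketch; gcc and
          | objdump are assumed, and exact byte-for-byte equality is the
          | idealized target, not something you'd usually hit):
          | 
          |     import difflib
          |     import subprocess
          | 
          |     def dump(path):
          |         return subprocess.run(["objdump", "-d", path],
          |                               capture_output=True,
          |                               text=True).stdout
          | 
          |     def round_trip_score(original_obj, decompiled_c):
          |         with open("candidate.c", "w") as f:
          |             f.write(decompiled_c)
          |         subprocess.run(["gcc", "-O2", "-c", "candidate.c",
          |                         "-o", "candidate.o"], check=True)
          |         # 1.0 = identical disassembly; the training signal is
          |         # how close you get
          |         return difflib.SequenceMatcher(
          |             None, dump(original_obj),
          |             dump("candidate.o")).ratio()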
        
           | thfuran wrote:
           | >If you decompile the binary to source, then compile the
           | source back to binary you should get the original binary.
           | 
           | You really can't expect that if you're not using exactly the
           | same version of exactly the same compiler with exactly the
           | same flags, and often not even then.
        
             | Eager wrote:
             | You try your best, and if you provide enough examples, it
             | will undoubtedly get figured out.
        
               | thfuran wrote:
               | What exactly are you suggesting will get figured out?
        
               | spqrr wrote:
               | The mapping from binary to source code.
        
               | layer8 wrote:
               | The question was about the reverse mapping.
        
               | thfuran wrote:
               | Even ignoring all sources of irreproducibility, there
               | does not exist a bijection between source and binary
               | artifact irrespective of tool chain. Two different
               | toolchains could compile the same source to different
               | binaries or different sources to the same binary. And you
               | absolutely shouldn't be ignoring sources of
               | irreproducibility in this context, since they'll cause
               | even the same toolchain to keep producing different
               | binaries given the same source.
        
               | achrono wrote:
               | Exactly, but neither the source nor the binary is what's
               | truly important here. The real question is: can the LLM
               | generate the _functionally valid_ source equivalent of
               | the binary at hand? If I disassemble Microsoft Paint, can
               | I get code that will result in a mostly functional
               | version of Microsoft Paint, or will I just get 515
               | compile errors instead?
        
               | Brian_K_White wrote:
               | This is what I thought the question was really about.
               | 
                | I assume that an llm will simply see patterns that look
                | similar to other patterns and make associations and
                | assume equivalences on that level, whereas real code is
                | full of things where the programmer, especially an
                | assembly programmer, modifies something by a single
                | instruction or offset value etc to get a very specific
                | and functionally important result.
                | 
                | Often the result is code that not only isn't obvious,
                | it's nominally flat-out wrong, violating standards,
                | specs, intended function, datasheet docs, etc. If all you
                | knew were the rules written in the docs, the code would
                | look broken and invalid.
               | 
               | Is the llm really going to see or understand the intent
               | of that?
               | 
               | They find matching patterns in other existing stuff, and
               | to the user who can not see the infinite body of that
               | other stuff the llm pulled from, it looks like the llm
               | understood the intent of a question, but I say it just
               | found the prior work of some human who understood a
               | similar intent somewhere else.
               | 
                | Maybe an llm or some other flavor of ai can operate some
                | other way, like actually playing out the binary as if
                | executing it in a debugger and mapping out the results,
                | not just looking at the code as fuzzy pattern matching.
                | Can that take the place of understanding the intent the
                | way a human would by reading the decompiled assembly?
               | 
                | Guess we'll be finding out sooner or later since of
                | course it will all be tried.
        
               | fao_ wrote:
               | Except LLMs cannot reason.
        
               | lolinder wrote:
               | I think you're misunderstanding OP's objection. It's not
               | simply a matter of going back and forth with the LLM
               | until eventually (infinite monkeys on typewriters style)
               | it gets the same binary as before: Even if you got the
                | _exact same source code_ as the original there's still
               | no automated way to tell that you're done because the
               | bits you get back out of the recompile step will almost
               | certainly not be the same, even if your decompiled source
               | were identical in every way. They might even vary quite
               | substantially depending on a lot of different
               | environmental factors.
               | 
               | Reproducible builds are hard to pull off cooperatively,
               | when you control the pipeline that built the original
               | binary and can work to eliminate all sources of
               | variation. It's simply not going to happen in a
               | decompiler like this.
        
               | blagie wrote:
               | Well, no, but yes.
               | 
                | The critical piece is that this can be done in training.
                | If I collect a large number of C programs from github
                | and compile them (in a deterministic fashion), I can use
                | that as a training, test, and validation set. The output
                | of the ML ought to compile the same way given the same
                | environment.
               | 
               | Indeed, I can train over multiple deterministic build
               | environments (e.g. different compilers, different
               | compiler flags) to be even more robust.
               | 
               | The second critical piece is that for something like a
               | GAN, it doesn't need to be identical. You have two ML
               | algorithms competing:
               | 
               | - One is trying to identify generated versus ground-truth
               | source code
               | 
               | - One is trying to generate source code
               | 
               | Virtually all ML tasks are trained this way, and it
               | doesn't matter. I have images and descriptions, and all
               | the ML needs to do is generate an indistinguishable
               | description.
               | 
               | So if I give the poster a lot more benefit of the doubt
               | on what they wanted to say, it can make sense.
        
               | lolinder wrote:
               | Oh, I was assuming that Eager was responding to klik99's
               | question about how we could identify hallucinations in
               | the output--round tripping doesn't help with that.
               | 
               | If what they're actually saying is that it's possible to
               | train a model to low loss and then you just have to trust
               | the results, yes, what you say makes sense.
        
               | blagie wrote:
               | I haven't found many places where I trust the results of
               | an ML algorithm. I've found many places where they work
               | astonishingly well 30-95% of the time, which is to say,
               | save me or others a bunch of time.
               | 
               | It's been years, but I'm thinking back through things
               | I've reverse-engineered before, and having something
               | which kinda works most of the time would be super-useful
               | still as a starting point.
        
               | incrudible wrote:
               | Have you ever trained a GAN?
        
               | blagie wrote:
               | Technically, yes!
               | 
               | A more reasonable answer, though, is "no."
               | 
               | I've technically gone through random tutorials and
               | trained various toy networks, including a GAN at some
               | point, but I don't think that should really count. I also
               | have a ton of experience with neural networks that's
               | decades out-of-date (HUNDREDS of nodes, doing things like
               | OCR). And I've read a bunch of modern papers and used a
               | bunch of Hugging Face models.
               | 
               | Which is to say, I'm not completely ignorant, but I do
               | not have credible experience training GANs.
        
             | dheera wrote:
             | Maybe we then need an LLM to tell us if two pieces of
             | compiled code are equivalent in an input-output mapping
             | sense (ignoring execution time).
             | 
             | I'm actually serious; it would be exceedingly easy to get
             | training data for this just by running the same source code
             | through a bunch of different compiler versions and
             | optimization flags.
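              | 
              | A sketch of that data generation (the compiler/flag
              | matrix is just an example):
              | 
              |     import itertools
              |     import subprocess
              | 
              |     configs = [("gcc", "-O0"), ("gcc", "-O3"),
              |                ("clang", "-O2")]
              |     pairs = []
              |     for cfg_a, cfg_b in itertools.combinations(configs, 2):
              |         (c1, o1), (c2, o2) = cfg_a, cfg_b
              |         subprocess.run([c1, o1, "-c", "sample.c",
              |                         "-o", "a.o"], check=True)
              |         subprocess.run([c2, o2, "-c", "sample.c",
              |                         "-o", "b.o"], check=True)
              |         # same source, different toolchains: label the
              |         # pair equivalent; pairs from different sources
              |         # would get the negative label
              |         pairs.append((open("a.o", "rb").read(),
              |                       open("b.o", "rb").read(), True))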
        
               | thfuran wrote:
               | Why would an llm be the tool for that job?
        
               | dheera wrote:
                | Without analytical thinking, how else would you become
                | convinced that two functions are identical across a
                | computationally infeasible number of possible inputs?
        
           | codethief wrote:
           | > you should get the original binary
           | 
           | According to the project's README, they only seem to be
           | checking mere "re-compilability" and "re-executability" of
           | the decompiled code, though.
        
           | 1024core wrote:
           | > If you decompile the binary to source, then compile the
           | source back to binary you should get the original binary.
           | 
           | Doesn't that depend on the compiler's version though? Or, for
           | that matter, even the sub-version. Every compiler does things
           | differently.
        
         | sebastianconcpt wrote:
         | Generators' nature is to hallucinate.
        
           | DougBTX wrote:
           | One man's hallucination is another's creativity.
        
             | sebastianconcpt wrote:
              | Well, we need to remember that "hallucination" here is not
              | a concept but a figure of speech for the output of a
              | stochastic parroting machine. So what you mentioned would
              | be a digitally induced hallucination out of some dancing
              | matrix multiplications / electrons on silicon.
        
         | riedel wrote:
         | One could as well use differential fuzzing.
        
           | klik99 wrote:
            | I'm amazed that there are so many good responses above and
            | only this one mentions fuzzing. In the context of security,
            | inputs might be non-linear things like adjacent memory, so I
            | don't see any way to be confident about equivalency without
            | substantial fuzzing.
            | 
            | Honestly I just don't see a way to formally verify this at
            | all. It sounds like it could be a very useful tool, but I
            | don't see a way for it to be fully confident. But, heck, just
            | getting you 90% of the way towards understanding it with LLMs
            | is still amazing and useful in real life.
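            | 
            | The crude version of that fuzzing is just this (a sketch;
            | binary paths and input format are placeholders):
            | 
            |     import random
            |     import subprocess
            | 
            |     def find_divergence(orig="./original",
            |                         decomp="./recompiled",
            |                         trials=10_000):
            |         # throw random stdin at both binaries and compare
            |         # observable behaviour
            |         for _ in range(trials):
            |             data = bytes(random.randrange(256)
            |                          for _ in range(64))
            |             a = subprocess.run([orig], input=data,
            |                                capture_output=True)
            |             b = subprocess.run([decomp], input=data,
            |                                capture_output=True)
            |             if (a.returncode, a.stdout) != \
            |                     (b.returncode, b.stdout):
            |                 return data  # counterexample found
            |         return None  # no divergence seen; evidence, not proof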
        
         | layer8 wrote:
         | The way to do this is to have a formal verification tool that
         | takes the input, the output, and a formal proof that the input
         | matches the semantics of the output, and have the LLM create
         | the formal proof alongside the output. Then you can run the
         | verification tool to check if the LLM's output is correct
         | according to the proof that it also provided.
         | 
          | Of course, building and training an LLM that can provide such
          | proofs will be the bigger challenge, but it would be a safe
          | way to detect hallucinations.
        
           | thfuran wrote:
           | Good luck formally proving Linux.
        
             | layer8 wrote:
             | The goal is to prove that the source code matches the
             | machine code, not to prove that the code implements some
             | intended higher-level semantics. This has nothing to do
             | with formally proving the correctness of the Linux kernel.
        
           | djinnandtonic wrote:
           | What if there are hallucinations in the verification tool?
        
             | thfuran wrote:
             | Then it's not a formal verification tool. Generative models
             | are profoundly unfit for that purpose.
        
             | layer8 wrote:
              | There may be bugs, but not hallucinations. Bugs are at
              | least reproducible, and the source code of the verification
              | tool is much, much smaller than an LLM, so there is a much
              | higher chance of its finite number of bugs being found,
              | whereas with an LLM it is probably impossible to remove all
              | hallucinations.
             | 
             | To turn your question around: What if the compiler that
             | compiles your LLM implementation "hallucinates"? That would
             | be the closer parallel.
        
             | smellf wrote:
             | I think the idea is that you'd have two independently-
              | developed systems, one LLM decompiling the binary and the
             | other LLM formally verifying. If the verifier disagrees
             | with the decompiler you won't know which tool is right and
             | which is wrong, but if they agree then you'll know the
             | decompiled result is correct, since both tools are unlikely
             | to hallucinate the same thing.
        
               | layer8 wrote:
               | No, the idea is that the verifier is a human-written
               | program, like the many formal-verification tools that
               | already exist, not an LLM. There is zero reason to make
               | this an LLM.
               | 
               | It makes sense to use LLMs for the decompilation and the
               | proof generation, because both arguably require
               | creativity, but a mere proof verifier requires zero
               | creativity, only correctness.
        
           | natsch wrote:
           | That would require the tool to prove the equivalence of the
           | two programs, which is generally undecidable. Maybe this
           | could be weakened to preserving some properties of the
           | program.
        
             | ngruhn wrote:
              | That doesn't mean that it's impossible, right? Just that no
              | tool is guaranteed to give an answer in every case. And the
              | cases it handles might be 90%, 10%, or
              | it-doesn't-matter-in-practice %.
        
             | layer8 wrote:
             | No, it would not. It would require the LLM to provide a
             | proof for the program that it outputs, which seems
             | reasonable in the same way that a human decompiling a
             | program would be able to provide a record of his/her
             | reasoning.
             | 
             | The formal verifier would then merely check the provided
             | proof, which is a simple mechanical process.
             | 
             | This is analogous to a mathematician providing a detailed
             | proof and a computer checking it.
             | 
              | What is impossible due to undecidability is to prove or
              | disprove the equivalence of two _arbitrary_ programs.
              | However, the two programs we are talking about
             | are highly correlated, and thus not arbitrary at all with
             | respect to each other. If an LLM is able to provide a
             | correct decompilation, then in principle it should also be
             | able to provide a proof of the correctness of that
             | decompilation.
        
         | afro88 wrote:
          | They detail how they measure this in the README. This is
          | directed at all the sibling comments as well!
          | 
          | TLDR they recompile and then re-execute (including test
          | suites). From the results table it looks like GPT4 still
          | "outperforms" their model in recompilation, but their
          | recompiled code has a much better re-execution success rate
          | (fewer hallucinations). But that re-execution rate is still
          | pretty lacking (around 14%), even if better than GPT4.
        
         | londons_explore wrote:
         | Even if it isn't fully reliable, often it's only necessary to
         | modify a few functions for most changes one wants to make to a
         | binary.
         | 
         | You'd therefore only need to recompile those few functions.
        
         | userbinator wrote:
         | LLMs are by nature probabilistic, which is why they work
         | reasonably well for "imprecise" domains like natural language
         | processing. Expecting one to do decompilation, or disassembly
         | for that matter, is IMHO very much a "wrong tool for the job"
         | --- but perhaps it's just an exploratory exercise for the "just
         | use an LLM" meme that seems to be a common trend these days.
         | 
         | The bigger argument against the effectiveness of this approach
         | is that existing decompilers can already do a much better job
         | with far less processing power.
        
           | czl wrote:
            | In the future, efficient rule-based compilers and decompilers
            | may be generated by AI systems trained on the inputs and
            | outputs of what we use today.
           | 
           | This effort is an exploration to find a radically different
           | AI way that may give superior results.
           | 
           | Yes. For all the reasons you give above, AI for this job is
           | not practical today.
        
       | ReptileMan wrote:
       | Let's hope it kills Denuvo ...
        
       | AndrewKemendo wrote:
        | If successful, wouldn't you be replicating the compiler's machine
        | code 1:1?
       | 
       | In which case that means fully complete code can live in the
       | "latent space" but is distributed as probabilities
       | 
       | Or perhaps more likely would it be replicating the logic only,
       | which can then be translated into the target language
       | 
       | I would guess that any binary that requires a non-deterministic
       | input (key, hash etc...) to compile would break this
       | 
       | Fascinating
        
       | m3kw9 wrote:
       | Basically predicting code token by token except now you don't
       | even have a large enough context size and worse, you are using
       | RAG
        
       | xorvoid wrote:
       | As someone who is actively developing a decompiler to reverse
       | engineer old DOS 8086 video games, I'd have a hard time trusting
       | an LLM to do this correctly. My standard is accurate semantics
       | lifting from Machine Code to C. Reversing assembly to C is very
       | delicate. There are many patterns that tend to _usually_ map to
        | obvious C constructs... except when they don't. And that assumes
       | the original source was C. Once you bump into routines that were
       | hand-coded assembly and break every established rule in the
       | calling conventions, all bets are off. I'm somewhat convinced
       | that decompilation cannot be made fully-automatic. Instead a good
       | decompiler is just a lever-arm on the manual work a reverser
       | would otherwise be doing. Corollary: I'm also somewhat convinced
       | that only the decompiler's developers can really use it most
       | effectively because they know where the "bodies are buried" and
       | where different heuristics and assumptions were made. Decompilers
       | are compilers with all the usual engineering challenges, plus a
       | hard inference problem tacked on top.
       | 
       | All that said, I'm not a pessimist on this idea. I think it has
       | pretty great promise as a technique for general reversing
       | security analysis where the reversing is done mostly for
       | "discovery" and "understanding" rather than for perfect semantic
       | lifting to a high-level language. In that world, you can afford
       | to develop "hypotheses" and then drill down to validate if you
       | think you've discovered something big.
       | 
       | Compiling and testing the resulting decompilation is a great
       | idea. I do that as well. The limitation here is TEST SUITE. Some
       | random binary doesn't typically come with a high-coverage test
       | suite, so you have to develop your own acceptance criterion as
       | you go along. In other words: write tests for a function whose
       | computation you don't understand (ha). I suppose a form of
       | static-analysis / symbolic-computation might be handy here (I
       | haven't explored that). Here you're also beset with challenges of
       | specifying which machine state changes are important and which
       | are superfluous (e.g. is it okay if the x86 FLAGS register isn't
       | modified in the decompiled version, probably yes, but sometimes
       | no).
       | 
       | In my case I don't have access to the original compiler and even
       | if I did, I'm not sure I could convince it to reproduce the same
       | code. Maybe this is more feasible for more modern binaries where
       | you can assume GCC, Clang, MSVC, or ICC.
       | 
       | At any rate: crazy hard, crazy fun problem. I'm sure LLMs have a
       | role somewhere, but I'm not sure exactly where: the future will
       | tell. My guess is some kind of "copilot" / "assistant" type role
       | rather than directly making the decisions.
       | 
       | (If this is your kind of thing... I'll be writing more about it
       | on my blog soonish...)
        
         | a2code wrote:
          | I would devise a somewhat loose metric. Consider assigning a
          | percentage for how much of a binary has been decompiled: 0%
          | means the binary is still in assembly and 100% means the whole
          | binary is now C code. The ideal decompiler would result in 100%
          | for any binary.
         | 
         | My prediction is that this percentage will increase with time.
         | It would be interesting to construct data for this metric.
         | 
         | It is important to define the limitations of using LLMs for
         | this endeavor. I would like to emphasize your subtle point. The
         | compiler used for the original binary may not be the same as
         | the one you use. The probability of this increases with time,
         | as compilers improve or the platform on which the binary runs
         | becomes obsolete. This is a problem for validation, as in you
         | cannot directly compare original assembly code with assembly
         | after compiling C code (that came from decompiling).
         | 
         | Perhaps assembly routines could be given a likelihood, as in
         | how sure the LLM is that some C code maps to assembly. Then,
         | routines with hand-coded assembly would have a lower
         | likelihood.
        
       | mdaniel wrote:
       | relevant: https://news.ycombinator.com/item?id=34250872 ( _G-3PO:
       | A protocol droid for Ghidra, or GPT-3 for reverse-engineering_ <h
       | ttps://github.com/tenable/ghidra_tools/blob/main/g3po/g3po....>;
       | Jan, 2023; 44 comments)
       | 
       |  _ed_ : seems they have this, too, which may value your
       | submission: https://github.com/tenable/awesome-llm-cybersecurity-
       | tools#a...
        
       | sinuhe69 wrote:
        | For me the huge difference between the re-compilability and
        | re-executability scores is very interesting. GPT4 achieved 8x%
        | on re-compilability (syntactically correct) but an abysmal 1x%
        | in re-executability (semantically correct), demonstrating once
        | again its overgrown mimicry capacity.
        
         | sitkack wrote:
         | > overgrown mimicry
         | 
         | I don't think it shows that. GPT4 was not trained on
         | decompiling binaries back into C. Amazing result for an
         | untrained task.
         | 
         | We are soon going to have robust toolchain detection from
         | binaries, and source recovery with variable and function names.
        
       | speedylight wrote:
       | I have thought about doing something similar for heavily
       | obfuscated JavaScript. Very useful for security research I
       | imagine!
        
       | quantum_state wrote:
        | It seems the next logical step would be LLMAssistedHacking to
        | turn things upside down...
        
       | mahaloz wrote:
        | It's always cool to see different approaches in this area, but I
        | worry its benchmarks are meaningless without a comparison with
        | non-AI-based approaches (like IDA Pro). It would be interesting
        | to see how this model holds up on metrics from previous papers
        | in security.
        
       | YeGoblynQueenne wrote:
       | If I read the "re-executability" results in the Results figure
       | right then that's a great idea but it doesn't really work:
       | 
       | https://raw.githubusercontent.com/albertan017/LLM4Decompile/...
       | 
       | To clarify:
       | 
       | >> Re-executability provides this critical measure of semantic
       | correctness. By re-compiling the decompiled output and running
       | the test cases, we assess if the decompilation preserved the
       | program logic and behavior. Together, re-compilability and re-
       | executability indicate syntax recovery and semantic preservation
       | - both essential for usable and robust decompilation.
        
       ___________________________________________________________________
       (page generated 2024-03-17 23:00 UTC)