# LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models.

For more details, check out the paper.

## 0. Updates

* 2024.03.16: Added the llm4decompile-6.7b-uo model, which is trained without prior knowledge of the optimization levels (O0–O3); its average re-executability is around 0.21.

## 1. Introduction of LLM4Decompile and Decompile-Eval

Our objective is to create and release the first open-source LLM dedicated to decompilation, and to assess its capabilities by constructing the first decompilation benchmark focused on re-compilability and re-executability. We start by compiling a million C code samples from AnghaBench into assembly code using GCC with different configurations, forming a dataset of assembly-source pairs totaling 4 billion tokens. We then fine-tune DeepSeek-Coder, a leading-edge code LLM, on this dataset, and construct the evaluation benchmark, Decompile-Eval, from HumanEval questions and test samples.
Specifically, we formulate the evaluation from two perspectives: whether the decompiled code can be recompiled successfully, and whether it passes all assertions in the test cases.

Figure 1 presents the steps involved in our decompilation evaluation. First, the source code (denoted as src) is compiled by GCC with specific parameters, such as an optimization level, to produce the executable binary. The binary is then disassembled into assembly language (asm) using objdump. The assembly instructions are subsequently decompiled to reconstruct the source code in a human-readable format (denoted as src'). To assess the quality of the decompiled code (src'), we test whether it can be recompiled with the original GCC compiler (re-compilability) and whether it behaves correctly under the test assertions (re-executability).

*(Figure 1: the decompilation evaluation pipeline.)*

## 2. Evaluation Results

### Metrics

Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity: the code is not just readable, it also adheres to the structural and syntactic standards expected by the compiler. However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness: by re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program's logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation, both essential for usable and robust decompilation.
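In practice, these two checks amount to invoking GCC on the decompiled source and then executing the original test assertions against it. The sketch below illustrates one way to do this; it is not part of the repository's evaluation code, and the file names (`func_decompile.c` for src' and `test_assertions.c` for a `main()` holding the test cases) are hypothetical.

```python
import subprocess

decompiled_src = "func_decompile.c"   # src' produced by the model (hypothetical name)
test_src = "test_assertions.c"        # main() with the original test assertions (hypothetical name)

# Re-compilability: does src' compile on its own?
recompile = subprocess.run(["gcc", "-c", decompiled_src, "-o", "func_decompile.o"],
                           capture_output=True)
re_compilable = recompile.returncode == 0

# Re-executability: link src' against the test assertions and run them.
re_executable = False
if re_compilable:
    build = subprocess.run(["gcc", decompiled_src, test_src, "-o", "test_bin", "-lm"],
                           capture_output=True)
    if build.returncode == 0:
        # The test binary is assumed to exit non-zero when an assertion fails.
        run = subprocess.run(["./test_bin"], capture_output=True, timeout=10)
        re_executable = run.returncode == 0

print(f"re-compilable: {re_compilable}, re-executable: {re_executable}")
```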
### Results

*(Figure: re-compilability and re-executability results on Decompile-Eval.)*

## 3. How to Use the Model

Our LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face:

* llm4decompile-1.3b
* llm4decompile-6.7b
* llm4decompile-33b
* llm4decompile-6.7b-nsp
* llm4decompile-6.7b-uo

Note: The NSP model is trained with assembly code; its average re-executability is around 0.17.

Note: The unified optimization (UO) model is trained without prior knowledge of the optimization levels (O0–O3); its average re-executability is around 0.21. The pre-processing for the UO model is slightly different (no prior knowledge of the optimization level); please check the model page.

Here is an example of how to use our model.

Preprocessing: compile the C code into a binary, then disassemble the binary into assembly instructions.

```python
import subprocess
import os
import re

digit_pattern = r'\b0x[a-fA-F0-9]+\b'  # hexadecimal binary codes (not used below)
zeros_pattern = r'^0+\s'               # leading zeros
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'path/to/file'

with open(fileName + '.c', 'r') as f:  # original C file
    c_func = f.read()

for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    compile_command = f'gcc -c -o {output_file}.o {input_file} -{opt_state} -lm'  # compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'  # disassemble the binary into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # assembly file
        asm = f.read()
    asm = asm.split('Disassembly of section .text:')[-1].strip()
    for tmp in asm.split('\n'):
        tmp_asm = tmp.split('\t')[-1]          # remove the binary code
        tmp_asm = tmp_asm.split('#')[0].strip()  # remove the comments
        input_asm += tmp_asm + '\n'
    input_asm = re.sub(zeros_pattern, '', input_asm)

    before = f"# This is the assembly code with {opt_state} optimization:\n"  # prompt
    after = "\n# What is the source code?\n"                                  # prompt
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)
```

Decompilation: use LLM4Decompile to translate the assembly instructions into C:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'arise-sustech/llm4decompile-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + '_' + opt_state + '.asm', 'r') as f:  # assembly prompt file from the preprocessing step
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=500)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
```

## 4. How to Use Decompile-Eval

Data are stored in `llm4decompile/decompile-eval/decompile-eval.json`, using the JSON list format. There are 164*4 (O0, O1, O2, O3) samples, each with five keys:

* `task_id`: the ID of the problem.
* `type`: the optimization stage, one of [O0, O1, O2, O3].
* `c_func`: the C solution for the HumanEval problem.
* `c_test`: the C test assertions.
* `input_asm_prompt`: assembly instructions with prompts, which can be derived as in our preprocessing example.

To run the evaluation on a single GPU and a single process:

```bash
cd LLM4Decompile
python ./evaluation/run_evaluation_llm4decompile_singleGPU.py
```

To run the evaluation using TGI (10x faster, supports multiple GPUs and multiple processes), first install text-generation-inference following the official link, then:

```bash
git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
pip install -r requirements.txt

# Before running the evaluation script, please update model_path to your local model path.
bash ./scripts/run_evaluation_llm4decompile.sh
```
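Outside the provided evaluation scripts, a single benchmark sample can also be loaded and decompiled directly. The sketch below is illustrative only: it reuses the model-loading code from Section 3 and assumes the JSON layout and field names listed above.

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = 'arise-sustech/llm4decompile-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

# Load the benchmark (path as described above) and pick one O0 sample.
with open('decompile-eval/decompile-eval.json') as f:
    samples = json.load(f)
sample = next(s for s in samples if s['type'] == 'O0')

# Decompile the assembly prompt back into C, stripping the prompt tokens from the output.
inputs = tokenizer(sample['input_asm_prompt'], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=500)
c_func_decompile = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:],
                                    skip_special_tokens=True)

# sample['c_func'] holds the reference solution and sample['c_test'] the assertions,
# which can be compiled together with c_func_decompile to check re-executability.
print(sample['task_id'], sample['type'])
print(c_func_decompile)
```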
## 5. Ongoing Work

* LLM4Binary: We plan to include a larger dataset to pre-train the model on assembly code and C code.
* Decompiler-ALL: Support more languages/platforms and settings (e.g., decompiling multiple functions).

## 6. License

This code repository is licensed under the MIT License.

## 7. Contact

If you have any questions, please raise an issue.

## 8. Thoughts

The conversation about a language-model decompiler that took place on Reddit roughly a year ago was quite fascinating to us.

## 9. Citation

```
@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models},
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}
```