[HN Gopher] The rev.ng decompiler goes open source
       ___________________________________________________________________
        
       The rev.ng decompiler goes open source
        
       Author : quic_bcain
       Score  : 243 points
       Date   : 2024-03-29 00:01 UTC (23 hours ago)
        
 (HTM) web link (rev.ng)
 (TXT) w3m dump (rev.ng)
        
       | Fnoord wrote:
       | Price model:
       | 
       | > Very briefly:
       | 
       | > The rev.ng framework is fully open source. You can decompile
       | anything you want from the CLI.
       | 
       | > The UI will be available in the following forms:
       | 
       | > free to use in the cloud for public projects;
       | 
       | > available through a subscription in the cloud for private
       | projects;
       | 
       | > available at a cost as a fully standalone, fully offline
       | application.
       | 
       | In comparison, Hopper costs 100 USD with one year of updates [1].
       | Ghidra and Radare2 are FOSS and completely free to use, IDA Pro
       | costs a fortune.
       | 
       | [1] https://www.hopperapp.com/index.html
        
         | eyegor wrote:
         | Binary ninja is another good option. In my experience it's
         | pretty similar to ida but I find it more user friendly. It just
         | has a lot of well thought out features that make me more
         | productive. I haven't tried hopper but ghidra and radare2 both
         | had a bad dev experience and produced c that didn't "read
         | well". Granted it's been a couple of years since I tried
         | either.
         | 
         | Binja is $300 (or $1500 for commercial, both cheaper for
         | students).
         | 
         | https://binary.ninja/features
        
           | aleclm wrote:
           | Students shouldn't pay a dime. They are poor.
           | 
           | Our view is: the engine is 100% open source. The UI is
           | available for free in the cloud for anyone experimenting,
           | which we define as "I'm OK with leaving the project public".
           | 
           | Basically, the decompiler engine is Free Software, extensible
           | and available for automation/scripting, while the UI is
           | available for free for students/researchers and we can make a
           | living out of professionals (i.e., when your company is
           | paying for it).
        
           | halayli wrote:
           | They are now offering a free version:
           | https://binary.ninja/2024/02/28/4.0-dorsai.html
        
             | Nereuxofficial wrote:
             | Oh that is awesome! I've used the cloud version previously
             | but now that the desktop version is free with some small
             | limitations, I think I'll probably use it instead of Ghidra.
        
         | felipefar wrote:
         | I really like licensing models of one-time payments with a pre-
         | defined duration of updates. But I wonder how they enforce it
         | while not making internet access a requirement for the app.
        
           | 8organicbits wrote:
           | I've been planning to use a non-enforcement model for a
           | future project. Some users will always pay, because of
           | corporate policy or ethics. Some will never pay and will
           | reverse engineer out any software license checks. Asking the
           | user if they have a license keeps the honest ones honest and
           | permits ad-hoc free trials, emergency use, and other
           | reasonable "unlicensed use".
        
             | userbinator wrote:
             | _Some will never pay and will reverse engineer out any
             | software license checks._
             | 
             | For a long time (and might still be; not paying much
             | attention anymore), it was a "rite of passage" in the scene
             | to crack IDA... using itself.
        
               | mrexodia wrote:
               | It never was a "rite of passage", because removing IDA's
               | license checks has always been trivial...
        
         | dvzk wrote:
         | Decompilation is often the least important (and least reliable)
         | part of IDA/Ghidra, so comparing the two is unfair. That said,
         | the scene is perpetually starved for good C decompilers, so
         | more attempts are always exciting.
        
           | saagarjha wrote:
           | I hear this a lot, and in my experience people who use Ghidra
           | or IDA and don't use the decompiler are exceptionally rare. Why
           | would you suffer that when you can use something else for
           | what you actually want?
        
             | dvzk wrote:
             | I didn't say I never use it, just that it's not always the
             | core feature. This will depend heavily on your field, but
             | in my past work, the features that were way more essential
             | are: scripting (+ IR lifting), xrefs, CFGs, labels/notes
             | (in a persistent DB).
             | 
             | In my experience decompilers will totally ignore or fail on
             | certain types of malicious code, so they mainly exist to
             | assist disassembly analysis. And for that purpose, they
             | save us an incredible amount of human hours.
        
               | aleclm wrote:
               | For scripting, our approach is to give you access to the
               | project file (just a YAML file), and you can make changes
               | from any scripting language you want. Everything the user
               | can customize is in there, all the rest is
               | deterministically produced from that file.
               | 
               | I really disliked the fact that you usually need to buy
               | into the version of Python that $TOOL requires you to
               | use, or the fact itself that you need to use a specific
               | language.
               | 
               | Can parse YAML? You're mostly done.
               | 
               | The "project file" is what we call the model:
               | https://docs.rev.ng/user-manual/model-tutorial/
               | 
               | For xrefs, CFG and the rest: we have all of that in the
               | UI, but we also produce them in a rich way. For instance,
               | when we emit disassembly and decompiled code, we actually
               | emit plain text + HTML-like markup to provide
               | metainformation for navigation (basically, xrefs) and
               | highlighting. So you can use all that from any language
               | that can parse HTML/XML. It's called PTML:
               | https://docs.rev.ng/references/ptml/
               | 
               | For lifting: we use LLVM IR as our internal
               | representation. This means that: 1) you don't have to
               | learn an IR that no one else uses, 2) you can use off the
               | shelf tools (e.g., KLEE for symbolic execution) but you
               | can also use all the standard LLVM optimizations and
               | analyses and 3) you can recompile it, but we're not into
               | the binary translation business anymore.
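
To illustrate the PTML point above (plain text plus HTML-like markup that any HTML/XML parser can consume), here is a minimal Python sketch using only the standard library. The sample markup and the attribute names `data-kind` and `data-ref` are invented for this example and are not the real PTML vocabulary; the PTML reference linked above documents the actual attributes.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Collects the plain text and the attributes attached to each span."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # plain text with the markup removed
        self.annotations = []  # (attributes dict, enclosed text) pairs
        self._stack = []       # currently open tags: (attributes, text chunks)

    def handle_starttag(self, tag, attrs):
        self._stack.append((dict(attrs), []))

    def handle_data(self, data):
        self.text_parts.append(data)
        # The text belongs to every element that is currently open.
        for _, chunks in self._stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        if self._stack:
            attrs, chunks = self._stack.pop()
            self.annotations.append((attrs, "".join(chunks)))


# Hypothetical sample, NOT real PTML attribute names.
SAMPLE = (
    '<span data-kind="type">int</span> '
    '<span data-kind="function" data-ref="/function/0x401000">main</span>'
    '(<span data-kind="type">void</span>)'
)

parser = MarkupStripper()
parser.feed(SAMPLE)
parser.close()  # flushes any buffered trailing text

plain_text = "".join(parser.text_parts)
xrefs = [(a["data-ref"], t) for a, t in parser.annotations if "data-ref" in a]
print(plain_text)
print(xrefs)
```

The same split (text for reading, attributes for navigation) would work from any language with an HTML or XML parser, which is the portability argument made in the comment.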
        
               | znpy wrote:
               | > 3) you can recompile it, but we're not into the binary
               | translation business anymore
               | 
               | How come?
        
               | aleclm wrote:
               | Short answer: if you want to execute a program (maybe
               | with some instrumentation, for fuzzing purposes) it's
               | much easier to adopt a dynamic approach (i.e., emulation
               | or virtualization). With static binary translation you
               | can get better performance, but there's a lot of other
               | things you need to get 100% right and that with a dynamic
               | approach are a given (e.g., the CFG).
               | 
               | There's much more room for improvement in the field of
               | analyzing code (as opposed to running it), so we're
               | investing our energies there.
               | 
               | Then we're strong believers in integrating dynamic and
               | static information, for instance see PageBuster:
               | https://rev.ng/blog/pagebuster
               | 
               | But other than that, static binary translation is a
               | feature of rev.ng in maintenance mode.
        
           | vient wrote:
           | Huh, for me as a malware analyst previously and a reverse
           | engineer in general, decompilation is the most important part
           | of such tools. It's all about speed, pseudo-C of some kind
           | lets you roughly understand what's going on in a function in
           | seconds. I guess you can become pretty fast with assembly
           | too, but C is just a lot more dense.
           | 
           | Regarding reliability, I would say that Hex-Rays is pretty
           | reliable (at least for x86) if you know its limitations, like
           | throwing away all code in catch blocks. Usually wrong
           | decompilation is caused by either wrong section permissions,
           | or wrong function signature, both of them can be fixed. It
           | can have bad time when stack frame size goes "negative" or
           | some complex dynamic stack array logic is involved, which are
           | usually signs of obfuscation anyway.
           | 
           | It was less reliable 10 years ago, though. Also, even now
           | Hex-Rays weirdly does not support some simple instructions like
           | movbe.
        
           | aleclm wrote:
           | > Decompilation is often the least important (and least
           | reliable) part of IDA/Ghidra
           | 
           | This is something all people using decompilers say, and it sort
           | of shows how low trust in decompilers is. Expectations
           | have always been rather low.
           | 
           | I've been there, but this does not have to be the case, the
           | whole reason why we started rev.ng is to prove that
           | expectations can be raised.
           | 
           | Apart from accuracy, which is difficult but engineering work,
           | why don't decompilers emit syntactically valid C? Have you
           | ever tried to re-compile code from _any_ decompiler? It's a
           | terrible experience.
           | 
           | rev.ng only emits valid C code, and we test it with a bunch
           | of -Wall -Wextra:
           | 
           | https://github.com/revng/revng-c/blob/develop/share/revng-c/.
           | ..
           | 
           | Other key topic: data structures. When reversing I spend half
           | of the time renaming things and half of the time detecting
           | data structures. The help I get from decompilers in the latter is
           | basically none.
           | 
           | rev.ng, by default, detects data structures on the whole
           | binary, interprocedurally, including arrays. See the linked
           | list example in the blog post. We also have plans to detect
           | enums and other stuff.
           | 
           | Clearly we're not there yet, we still need to work on
           | robustness, but our goal is to increase the confidence in
           | decompilers and actually offer features that save time.
           | Certain tools have made progress in improving the UI and the
           | scripting experience, but there's other things to do beyond
           | that.
           | 
           | I see this a bit like the transition from the phase in which
           | C developers were using macros to ensure things were being
           | inlined/unrolled to the phase where they stopped doing that
           | because compilers got smart enough to do the right thing and to
           | do it much more effectively.
        
             | saagarjha wrote:
             | Curious what you do when you encounter an instruction you
             | don't model
        
               | aleclm wrote:
               | That's unlikely, since we use QEMU as a lifter, which
               | sometimes supports new instructions before they hit
               | silicon.
               | 
               | However, I think we'll emit a call to some `noreturn`
               | function. Basically we emit a call to `abort`.
        
               | saagarjha wrote:
               | Right but you do see how this means that you need to lift
               | code that has semantics that cannot be modeled in C?
        
               | aleclm wrote:
               | Sure, in those cases we emit calls to C functions. The
               | only thing we need to know is what registers are taken as
               | input, what registers are output and what registers are
               | preserved.
               | 
               | In QEMU parlance, these are helper functions, and they
               | have actual implementations. But for decompilation
               | purposes, you don't need to implement them. You just need
               | to know how they interact with the registers.
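
A small Python sketch of the idea above: an instruction that is not modeled directly can be treated as an opaque call described only by the registers it reads, writes and preserves. The `HelperSignature` type and the function names are hypothetical and are not rev.ng or QEMU APIs; the CPUID register usage is the documented x86 behavior, while the preserved set here is only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HelperSignature:
    """Register-level interface of an opaque helper (hypothetical type)."""
    name: str
    inputs: frozenset
    outputs: frozenset
    preserved: frozenset


def live_before_call(sig, live_after):
    """Standard backward liveness step across an opaque call.

    Registers written by the helper stop being live above the call,
    and registers it reads become live.
    """
    return (frozenset(live_after) - sig.outputs) | sig.inputs


def survives_call(sig, register):
    """A value kept in `register` is still valid after the call only if
    the helper is known to preserve that register."""
    return register in sig.preserved


# x86 CPUID reads EAX/ECX and writes EAX/EBX/ECX/EDX.
CPUID = HelperSignature(
    name="helper_cpuid",  # illustrative name
    inputs=frozenset({"eax", "ecx"}),
    outputs=frozenset({"eax", "ebx", "ecx", "edx"}),
    preserved=frozenset({"esi", "edi", "ebp", "esp"}),  # illustrative set
)

live = sorted(live_before_call(CPUID, {"ebx", "esi"}))
print(live)
print(survives_call(CPUID, "esi"), survives_call(CPUID, "ebx"))
```

Nothing here needs the helper's implementation, which matches the claim that the register interface alone is enough for decompilation purposes.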
        
             | j-krieger wrote:
             | What happens if you put in a binary which outputs C-like
             | machine code, like Rust (llvm) or zig?
        
               | aleclm wrote:
               | Languages with a rich standard library that generate a
               | lot of code for you usually need some love to get rid of
               | (or idiomatically represent) common patterns and to detect
               | common data structures.
               | 
               | We haven't looked into it yet, but the automatic data
               | structure recognition might help.
               | 
               | Frankly, Rust looks particularly scary:
               | https://media.ccc.de/v/37c3-11684-rust_binary_analysis_featu...
        
               | tux3 wrote:
               | Oh, very nice! I've dealt with forsaken deeply abstract
               | vtable mazes of hell, but the idea of using a ton of sum
               | types, dynamic dispatch, async everywhere, and long
               | iterator chains would make for some deliciously
               | unreadable binaries!
        
             | Sesse__ wrote:
             | > Other key topic: data structures. When reversing I spend
             | half of the time renaming things and half of the time
             | detecting data structures. The help I get from decompilers
             | in latter is basically none.
             | 
             | That's funny, because I've used both Hex-Rays and Ghidra,
             | and gotten lots of help with data structures. The
             | interactivity really helps a bunch with filling in the
             | blanks.
        
               | aleclm wrote:
               | In IDA you basically have only detection of stack frame
               | layout (in a quite confusing fashion) and "create struct
               | out of this pointer", which is something you have to do
               | manually and it's intraprocedural.
               | 
               | Imagine this being done automatically, across all of the
               | binary. If you pass a pointer to another function the
               | type is correct and you build the type from all the
               | functions using it.
               | 
               | Then obviously the user needs to fix things, but
               | bootstrapping can definitely be hugely improved.
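
A toy Python sketch of what "build the type from all the functions using it" could look like. This is not rev.ng's algorithm; it only shows the general idea of unifying pointers that flow between functions and merging the field accesses observed through them. All function names, offsets and the input format are made up for the example.

```python
from collections import defaultdict

# Per-function observations: (function, pointer) -> set of (offset, size)
# accesses seen through that pointer. Hypothetical input data.
accesses = {
    ("list_sum", "p"): {(0, 4)},             # reads the value field
    ("list_next", "n"): {(8, 8)},            # reads the next pointer
    ("list_push", "head"): {(0, 4), (8, 8)},
}

# Pointer flows: a caller's pointer is passed as a callee's argument.
flows = [
    (("list_push", "head"), ("list_next", "n")),
    (("list_sum", "p"), ("list_next", "n")),
]


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def infer_layouts(accesses, flows):
    """Merge accesses of all pointers that are connected by a flow."""
    parent = {key: key for key in accesses}
    for a, b in flows:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)
    merged = defaultdict(set)
    for key, fields in accesses.items():
        merged[find(parent, key)] |= fields
    return [sorted(fields) for fields in merged.values()]


layouts = infer_layouts(accesses, flows)
print(layouts)
```

In this example the three pointers end up in one group, so a single layout with a 4-byte field at offset 0 and an 8-byte field at offset 8 is recovered even though no single function touches both fields except `list_push`.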
        
               | Sesse__ wrote:
               | I'm sure user-defined structs can benefit from combining
               | information from multiple functions, but saying that what
               | you get today is "basically none" is a bit of an
               | overstatement. Also, the special (and important!) case of
               | operating system ABI structs is great, and that
               | information propagates throughout function calls.
        
             | jcranmer wrote:
             | Here's my issue with decompilers:
             | 
             | I don't want to look at assembly code. I'd rather see
             | expression trees, expressed in C-like syntax, than trying
             | to piece together variables from two-address or three-
             | address instructions. Looking at assembly tends to lead to
             | brain farts like "wait, was the first or second operand the
             | output operand?" (really, fuck AT&T syntax) or "wait, does
             | ja implement ugt or sgt?"
             | 
             | So that means I want to look at something vaguely C-like.
             | But the problem is that the C type system is too powerful
             | for decompilers to robustly lift to, and the resulting code
             | is generally at best filled with distractions of wait-I-
             | can-fix-this excessive casting and at worst just wrong. And
             | when it's wrong, I have to resort to staring at the
             | assembly, which (for Ghidra at least) means throwing away a
             | lot of the notes I've accumulated because they don't
             | correlate back to underlying assembly.
             | 
             | So what I really want isn't something that can emit
             | recompilable C code, that's optimizing for something that
             | doesn't help me in the end. What I want is robust
             | decompilation to something that lets me ignore the assembly
             | entirely. I'm a compiler writer, I can handle a language
             | where integers aren't signed but the operands are.
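
For the "does ja implement ugt or sgt?" question above, a short Python sketch that models the flags of a 32-bit `cmp` and the documented x86 conditions for `ja` and `jg`. The helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags set by a 32-bit `cmp a, b`, which computes a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "ZF": result == 0,
        "CF": a < b,  # unsigned borrow
        "SF": bool(result >> 31),
        # Signed overflow: operands differ in sign and the result's sign
        # differs from the sign of a.
        "OF": bool(((a ^ b) & (a ^ result)) >> 31),
    }


def ja(flags):
    """Jump if above: unsigned greater-than (ugt)."""
    return not flags["CF"] and not flags["ZF"]


def jg(flags):
    """Jump if greater: signed greater-than (sgt)."""
    return not flags["ZF"] and flags["SF"] == flags["OF"]


flags = cmp_flags(0xFFFFFFFF, 1)  # -1 versus 1 when read as signed
print(ja(flags), jg(flags))
```

So `ja` is the unsigned comparison and `jg` the signed one: 0xFFFFFFFF is above 1 as an unsigned value but not greater than 1 as a signed value.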
        
               | aleclm wrote:
               | I 120% agree with what you're saying, but emitting valid
               | C is kinda part of what you're asking, in design terms.
               | 
               | Our goal is: omit all the casts that can be omitted
               | without changing the semantics according to C. In fact,
               | we have a PR doing exactly this (still on the old repo,
               | hopefully it will go in soon).
               | 
               | But, how can you expect to be able to be strict with what
               | C allows you to do implicitly, if you're not even
               | emitting valid C? For instance, thanks to the fact that
               | we emit valid C, we could test if the assembly emitted by
               | a compiler is the same before and after removing
               | redundant casts.
               | 
               | My point is that emitting valid C is kind of a
               | prerequisite for what you're asking, a rather low bar to
               | pass, but that, in practice, no mainstream decompiler
               | passes. It's pretty obvious the decompiled code will
               | often be redundant and outright wrong if you don't even
               | guarantee it's syntactically valid. Then clearly it's not
               | a panacea, but it's an important design criterion and
               | shows the direction we want to go.
               | 
               | As for comments: we still haven't implemented inline
               | comments, but they will be attached to program addresses,
               | so they will be available both in disassembly and
               | decompiled C. It's not very hard to do, but that needs
               | some love.
        
               | jcranmer wrote:
               | One of the blog posts I keep meaning to write but never
               | quite get around to is a post that C is not portable
               | assembly. What is necessary is decompilation to a
               | portable C-like assembly, but that target is not C, and I
               | think focusing on creating valid C tends to drag you
               | towards suboptimal decisions, even leaving aside issues
               | like "should SLL decompile to x << y or x << (y % 32)?"
               | 
               | In my experience with Ghidra, I've just seen far too many
               | times where Ghidra starts with wrong types for something
               | and the result becomes gibberish--even just plain
               | _dropping_ stuff altogether. There are some cases where
               | it's clear it's just poor analysis on Ghidra's part
               | (e.g., it doesn't seem to understand stack slot reuse,
               | and memcpy-via-xmm is very confusing to it). And Ghidra's
               | type system lacks function pointer types, which is very
               | annoying when you're doing vtable-heavy C++ code.
               | 
               | I do like the appeal of a recompileable target language.
               | But that language need not be C--in fact, I'm actually
               | sketching out the design of such a language for my own
               | purposes in being able to read LLVM IR without going
               | crazy (which means I need to distinguish between, e.g.,
               | add nuw and just plain add).
               | 
               | Analysis necessarily involves multiple levels. Given that
               | a lot of the type analysis today tends to be crap, I'd
               | rather prefer to have the ability to see a more solid
               | first-level analysis that does variable recovery and
               | works out function calling conventions so that it can
               | inform my ability to reverse engineer structures or
               | things like "does this C++ method return a non-trivial
               | struct that is an implicit first parameter?"
               | 
               | (Also, since I'm largely looking at C++ code in practice,
               | I'd absolutely love to be able to import C++ header files
               | to fill in known structure types.)
        
               | aleclm wrote:
               | > should SLL decompile to x << y or x << (y % 32)?
               | 
               | I think this is a bit of a misguided question. The hardware
               | usually has precisely defined semantics. QEMU's <<
               | behaves similarly to C (undefined behavior for rhs >= 32),
               | but this means that the lifter (still QEMU) will account
               | for this and emit code preserving the semantics.
               | 
               | tl;dr: the code we emit should do the right thing
               | depending on what the original instruction did, without
               | making assumptions on what happens in case of C undefined
               | behaviors.
               | 
               | > Ghidra's type system lacks function pointer types
               | 
               | Weird limitation, we support those.
               | 
               | > it doesn't seem to understand stack slot reuse
               | 
               | That's a tricky one. We're now re-designing certain parts
               | of the pipeline to enable LLVM to promote stack accesses
               | to SSA values, which basically solves the stack slot
               | reuse. This is probably one of the most important
               | features experienced reversers ask for.
               | 
               | > that language need not be C--
               | 
               | Making up your own language is a temptation one should
               | resist.
               | 
               | Anyway, we're rewriting our backend using an MLIR dialect
               | (we call it clift) which targets C but should be good
               | enough to emit something "similar to C but slightly
               | different". It might make sense to have a different
               | backend there. But a "standard C" backend has to be the
               | first use case.
               | 
               | We thought about emitting C++, it would make our life
               | simpler. But I think targeting non-C as the first and
               | foremost backend would be a mistake.
               | 
               | Also, a Python backend would be cool.
               | 
               | > Analysis necessarily involves...
               | 
               | I would be interested in discussing more what exactly you
               | mean here. Why don't you join our discord server?
               | 
               | > I'd absolutely love to be able to import C++ header
               | files to fill in known structure types
               | 
               | We have a project for importing from header files.
               | Basically we want to use a compiler to turn them into DWARF
               | debug symbols and then import those. Not too hard.
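
On the "x << y or x << (y % 32)" question discussed above, a small Python sketch of why the answer depends on the original instruction rather than on C. It models two documented hardware behaviors for 32-bit left shifts; the function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def shl_x86_32(value, count):
    """32-bit x86 SHL: the CPU masks the count to its low 5 bits."""
    return (value << (count & 31)) & MASK32


def lsl_arm32_reg(value, count):
    """A32 LSL by register: only the low byte of the count is used,
    and a count of 32 or more yields 0."""
    count &= 0xFF
    if count >= 32:
        return 0
    return (value << count) & MASK32


x86_result = shl_x86_32(1, 33)
arm_result = lsl_arm32_reg(1, 33)
print(hex(x86_result), hex(arm_result))
```

The same source-level shift by 33 gives 2 on x86 and 0 on 32-bit ARM, so emitted C has to spell out whichever behavior the lifted instruction had instead of relying on a plain `<<` with an out-of-range count.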
        
       | nextos wrote:
       | A cool company fueled by one of the best PLT books out there:
       | https://link.springer.com/book/10.1007/978-3-662-03811-6
       | 
       |  _" He also met a partner in crime, Pietro. Romantically enough,
       | he met him thanks to a book which will turn out to be
       | foundational for company."_
       | 
       | https://rev.ng/about
       | 
       | Congrats on the launch.
        
         | aleclm wrote:
         | About the book, here's the full story: I was getting into
         | compilers, but I was really struggling with the theory, the
         | most famous books weren't doing it for me, and I felt really
         | down.
         | 
         | Then I find this book, which seems very dense, but clear. So I
         | ask my advisor if I could buy it and he goes like "well, first
         | check out the university library". I check it out and there's a
         | copy, but... it's taken.
         | 
         | Working in the only group that was doing research on compilers
         | I'm like "who dares do compilers stuff out of our group!?".
         | 
         | I go to the library:
         | 
         | Me: who has the book?
         | 
         | Library guy: can't tell you, privacy reasons.
         | 
         | Me: what's the third letter of its surname?
         | 
         | Library guy: Z
         | 
         | Me: what's the second letter of its name?
         | 
         | Library guy: I
         | 
         | Me: thanks.
         | 
         | I go here:
         | https://www.deib.polimi.it/ita/personale-lista-alfabetica
         | I found him.
         | 
         | Fast forward, we become friends and we start the company
         | together.
         | 
         | > Congrats on the launch.
         | 
         | Thanks! It was a lot of work.
        
       | albertzeyer wrote:
       | Checking the team about: https://rev.ng/about
       | 
       | And looking at the code contributions:
       | https://github.com/revng/revng/graphs/contributors
       | 
       | Isn't it a bit weird that the CEO (aleclearmind) has most
       | commits, even much more than the CTO (pfez)? I often hear the
       | complaints from other CEOs that they don't really find any time
       | anymore to code... Even the CTO usually is more on the managing
       | side and less active in actual coding.
       | 
       | Anyway, if this works, then I guess it's a lot of fun for them.
       | 
       |  _Edit_ Ah right, I didn't check the timeline.
        
         | zote wrote:
         | The CTO has more recent commits, aleclearmind's commits drop to
         | 0 after 2020 so maybe they also have a hard time getting to
         | code.
        
         | aleclm wrote:
         | The CTO mostly works on the backend of the decompiler, revng-c,
         | which we just released:
         | 
         | https://github.com/revng/revng-c/commits/develop/
         | 
         | Eventually we'll merge the two repos.
         | 
         | Also, I develop stuff every day. For some reason GitHub is not
         | picking up my user correctly.
         | 
         | > Anyway, if this works, then I guess it's a lot of fun for
         | them.
         | 
         | It is!
        
         | albertzeyer wrote:
         | I wonder a bit about the downvotes. I didn't mean this as a
         | criticism or so in any way. In fact, I like this very much. I
         | just found this interesting and unlike what I saw elsewhere.
         | 
         | So the downvotes are because this is not interesting or not
         | unusual?
        
           | halayli wrote:
           | your observation was spot on and your question was answered
           | by the ceo. People on hn can be oversensitive.
        
       | londons_explore wrote:
       | Idea: automatically name variables and members of structs based
       | on how code interacts with them.
       | 
       | Eg. The next pointer in a linked list should be easy to identify
       | as 'next'.
       | 
       | That would be done by downloading all of GitHub, then seeing what
       | variables in GitHub code have the most similar layouts and
       | interactions, and then if the confidence is high enough, using
       | those names.
        
         | qweqwe14 wrote:
         | Sort of like GitHub Copilot but for reversing?
        
         | aleclm wrote:
         | In the past we were thinking of doing something like this by
         | hand. For instance, we detect induction variables, so we could
         | rename them to `i`.
         | 
         | However, nowadays, it seems pretty obvious that the right way
         | to do these things is using LLMs.
         | 
         | This said, at this stage, we see ourselves as people building
         | robust infrastructure. Once the infrastructure is there, using
         | some off the shelf model to rename things or add comments is
         | relatively easy.
         | 
         | Basically: we do the hard decompilation work that needs 100%
         | accuracy, and then we can adopt LLMs for things that are OK to
         | be approximate such as names, comments and the like.
         | 
         | Anyway, writing a script that renames stuff is pretty easy.
         | Check out the docs:
         | https://docs.rev.ng/user-manual/model-tutorial/
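
A rough Python sketch of such a renaming script. It assumes the project file has already been parsed into plain dicts and lists with any YAML parser (for example PyYAML's `yaml.safe_load`). The keys `Functions`, `Entry` and `CustomName` are placeholders chosen for this example and may not match the real model schema; the model tutorial linked above is the reference.

```python
def rename_functions(model, renames):
    """Rename functions in place and return how many were changed.

    `model` is the parsed project file as plain dicts and lists;
    `renames` maps an entry address string to the new name.
    The key names used here are hypothetical.
    """
    changed = 0
    for function in model.get("Functions", []):
        new_name = renames.get(function.get("Entry"))
        if new_name and function.get("CustomName") != new_name:
            function["CustomName"] = new_name
            changed += 1
    return changed


# Hypothetical stand-in for a model loaded from YAML.
model = {
    "Functions": [
        {"Entry": "0x401000", "CustomName": ""},
        {"Entry": "0x401040", "CustomName": "helper"},
    ]
}

count = rename_functions(model, {"0x401000": "main", "0x401040": "helper"})
names = [f["CustomName"] for f in model["Functions"]]
print(count, names)
```

After the edit, the dict would be serialized back to YAML and handed to the tool, which is all the "scripting API" amounts to in this approach.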
        
           | londons_explore wrote:
           | If an LLM is used, it's unclear how to best do it.
           | 
           | One could try to train one's own LLM from scratch, using an
           | encoder-decoder (translation - aka seq2seq) architecture
           | trying to predict the correct variable name given the
           | decompiled output.
           | 
           | One could try to use something like GPT-4 with a carefully
           | designed prompt "Given this datastructure, what might be the
           | name for this field?"
           | 
           | One could try to use something pretrained like llama, but
           | then finetune it based on hundreds of thousands of compiled
           | and decompiled programs.
        
             | Eisenstein wrote:
             | Option 4:
             | 
             | One could take a pretrained model like llama, train it on
             | only a few thousands of compiled and decompiled programs,
             | then feed it compiled programs and have it decompile them
             | and evaluate that output to make a new dataset and fine
             | tune it again. Repeat until satisfactory.
        
         | 19h wrote:
         | Sounds like sidekick for binary ninja
        
         | diggan wrote:
         | Would be very cool indeed, something like http://jsnice.org/
         | 
         | Paper that describes what JSNice is doing behind the scenes:
         | https://files.sri.inf.ethz.ch/website/papers/jsnice15.pdf
        
       | yakkityyak wrote:
       | I hope collaborative workflows get a lot of attention. I haven't
       | used IDA teams or anything, but a reverse engineering experience
       | that felt as frictionless as Google Docs would be amazing.
        
         | aleclm wrote:
         | That's our goal. We used to use QtCreator as a basis for the
         | UI, terrible idea.
         | 
         | Then we switched to VSCode, which happens to be able to run in
         | the browser. So we added some magic kubernetes sauce and voila,
         | you got the cloud decompiler with exactly the same user
         | experience as the fully standalone one.
         | 
         | We still need to perform some QA on collaboration, but it
         | basically works. One daemon, many clients. Very simple
         | architecture.
         | 
         | I think we got inspiration to do this from a CTF where we were
         | doing "collaboration" using IDA with multiple windows on an X
         | session on a server with multiple cursors. Very cursed, but
         | effective.
        
       | fwr00t wrote:
       | Seems exciting. I'm keen to try the fully standalone version. Is
       | there any news about tentative pricing? Hopefully it's affordable
       | enough for hobbyists as well.
        
       | JonChesterfield wrote:
       | Always pleased to see more binary hacking tools. A load of
       | overly-precise suggestions on the chosen packaging format follows
       | because I might want to use this tool myself :)
       | 
       | > `source ./environment`
       | 
       | That's a bad omen. I downloaded the tar to find it does indeed
       | set a bunch of environment variables including PATH, though
       | thankfully not LD_LIBRARY_PATH. Mostly prefixed "HARD_" which is
       | maybe unique (REVNG would be a more obvious choice, colliding
       | with existing environment variables is a bad thing).
       | 
       | It sets `AWS_EC2_METADATA_DISABLED="true"` which won't break me
       | (I don't use AWS) but in general seems dubious.
       | 
       |     export RPATH_PLACEHOLDER="////////////////////////////////////////////////$ORCHESTRA_ROOT"
       |     export HARD_FLAGS_CXX_CLANG="-stdlib=libc++"
       |     ...
       |     "-Wl,-rpath,$RPATH_PLACEHOLDER/lib ...
       | 
       | This is suboptimal. The very long PATH setting with mingw32 and
       | gentoo and mips strings in it also looks very fragile.
       | 
       | I usually bail when the running instructions include "now mangle
       | your environment variables" because that step is really strongly
       | correlated with programs that don't work properly on my non-
       | ubuntu system. Wiring your application control flow through the
       | launching environment introduces a lot of failure modes - it's
       | not as convenient as it first appears. Very like global
       | variables.
       | 
       | Clang will burn a lot of this stuff in as defaults when you build
       | it if you ask, e.g. `-DCLANG_DEFAULT_CXX_STDLIB=libc++` would
       | remove the stdlib setting environment variable. DEFAULT_SYSROOT
       | is useful too.
       | 
       | Using rpath means you're vulnerable to someone running this
       | script with LD_LIBRARY_PATH set, as the environment variable will
       | override your DT_RUNPATH setting in the binaries. The background
       | on this is aggravating. Abbreviating here, `-Wl,-rpath` no longer
       | means rpath, it means 'runpath', which is a similar but much less
       | useful construct. The badly documented invocation you probably
       | want is `-Wl,-rpath,<dir> -Wl,--disable-new-dtags` to set rpath
       | instead of runpath, at which point the loader will search your
       | rpath before it looks at LD_LIBRARY_PATH.
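
The difference between the two tags comes down to where they sit in the loader's search order. The sketch below models the order documented in the glibc `ld.so(8)` manual page; it is a simplified illustration and ignores secure-execution mode, in which LD_LIBRARY_PATH is not used. The directory names in the usage lines are made up.

```python
from typing import List, Optional


def library_search_order(
    rpath: Optional[str],
    runpath: Optional[str],
    ld_library_path: Optional[str],
) -> List[str]:
    """Return the places glibc's loader tries, in order, for a needed library."""
    order = []
    # DT_RPATH is honoured only when the object has no DT_RUNPATH.
    if rpath and not runpath:
        order.append("DT_RPATH:" + rpath)
    # The environment variable comes next (outside secure-execution mode).
    if ld_library_path:
        order.append("LD_LIBRARY_PATH:" + ld_library_path)
    # DT_RUNPATH is consulted only after the environment variable.
    if runpath:
        order.append("DT_RUNPATH:" + runpath)
    order.append("ld.so.cache")
    order.append("default system directories")
    return order


with_runpath = library_search_order(None, "$ORIGIN/../lib", "/opt/hpc/lib")
with_rpath = library_search_order("$ORIGIN/../lib", None, "/opt/hpc/lib")
print(with_runpath[0])
print(with_rpath[0])
```

With a runpath, a stray LD_LIBRARY_PATH gets the first chance to supply a library; with an rpath, the path baked into the binary is tried first.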
       | 
       | There's a good chance you can completely remove the environment
       | mangling through a combination of setting different flags when
       | building clang, static linking and embedding binaries in other
       | binaries.
       | 
       | Related, your clang-16 binary is dynamically linked. As in it
       | goes looking for things like libLLVMAArch64CodeGen.so.16 at
       | runtime. A lot of failure modes can be removed by
       | LLVM_BUILD_STATIC=ON. E.g. if I run your dynamically linked clang
       | with a module based HPC toolchain active, your compiler will pick
       | up the libraries from the HPC toolchain and it'll have a bad
       | time. The tools are all linked against glibc as well, pros and
       | cons to that.
       | 
       | Tools are also linked against libc++.so, which is linked against
       | libc++abi.so and so forth. Worth considering static libc++, but
       | even if you decline that, libc++abi and libunwind can and
       | probably should be statically linked into the libc++. The above
       | rpath rant? Runpath isn't transitive so dynamic libraries finding
       | other dynamic libraries using runpath (the one you get when you
       | ask for rpath) works really poorly.
       | 
       | Context for there being so many suggestions above - I am
       | completely out of patience with distributing dynamically linked
       | programs on Linux. I don't want a stray environment variable from
       | some program that had `source ourhack` in the readme or a "module
       | system" to reach into my application and rewire what libraries it
       | calls at runtime as the user experience and subsequent bug report
       | overhead is terrible. Static linking is really good in
       | comparison.
       | 
       | Thanks again for shipping, and I hope some of the above feedback
       | is helpful!
        
         | aleclm wrote:
         | I think most of your concerns about messing with the
         | environment are sensible only under the assumption that you
         | actually do `source environment`.
         | 
         | In truth, we suggest doing that only so you use the GCC we
         | distribute for the demo binary. The actual way this is intended
         | to be used is through the `./revng` script. In that way, the
         | environment changes only affect the invocation of `revng`.
         | 
         | This is documented here:
         | https://docs.rev.ng/user-manual/working-environment/
         | We should probably add a warning about `source ./environment`.
         | 
         | Now, let's get to each of your comments :D
         | 
         | > though thankfully not LD_LIBRARY_PATH
         | 
         | We spent a lot of time to have a completely self-contained set
         | of binaries where each ELF refers to its dependencies through
         | relative paths. LD_LIBRARY_PATH is evil.
         | 
         | > Mostly prefixed "HARD_"
         | 
         | Those are just used by our compiler wrappers, I don't think
         | those environment variables collide with anything in practice.
         | 
         | > It sets `AWS_EC2_METADATA_DISABLED="true"`
         | 
         | Original discussion:
         | https://github.com/revng/revng/pull/309#discussion_r12805759...
         | 
         | I guess we could patch the AWS SDK to avoid this. Anyway, it
         | only has an effect when rev.ng is running in the cloud.
         | 
         | > export RPATH_PLACEHOLDER=...
         | > export HARD_FLAGS_CXX_CLANG=...
         | 
         | Those are used when linking binaries translated by revng. If
         | you're not interested in end-to-end binary translation, they
         | don't matter.
         | 
         | > it means 'runpath' which is a similar but much less useful
         | construct
         | 
         | We specifically want DT_RUNPATH. DT_RPATH is deprecated and
         | there might be a use case for overriding our libraries with
         | LD_LIBRARY_PATH.
         | 
         | > There's a good chance you can completely remove the
         | environment mangling
         | 
         | I think your observations concerning "mangling the environment"
         | are only valid for non-private environment variables. The
         | following variables are private: RPATH_PLACEHOLDER, HARD_*,
         | REVNG_*. Also, they are all only for binary translation
         | purposes. We could push them down into some smaller-scoped
         | compiler wrappers, but those make sense only if we can get rid
         | of the environment entirely, which we can't because we ship
         | Python.
         | 
         | > a combination of setting different flags when building clang
         | 
         | No, the flags also affect the linker and there are some features
         | of our wrappers that cannot simply be burned in. We can push
         | them in more private places, though.
         | 
         | > a lot of failure modes can be removed
         | > libc++abi and libunwind can and probably should be statically
         | linked into the libc++
         | 
         | We no longer have issues with that, our build system is pretty
         | reliable in that regard. LLVM is just one of the components,
         | these things need to work robustly in general, and they do
         | (with quite some effort).
         | 
         | You seem to be wary of using dynamic linking. We put some
         | effort into it, and now it works pretty well and always looks up
         | things in the right place, and without ever hardcoding absolute
         | paths anywhere, nor any install phase that "patches" the
         | binaries. The unpacked directory can be moved wherever you
         | want.
         | 
         | > I am completely out of patience with distributing dynamically
         | linked programs on Linux
         | 
         | You're thinking of some other solution, our solution does not
         | use LD_LIBRARY_PATH and all the binaries reference each other
         | in a robust way using `$ORIGIN`. Try:
         | 
         |     ./root/bin/python ./root/bin/revng artifact --help
         | 
         | It works.
         | 
         | But again, doing `source environment` is mostly for demo
         | purposes, in the actual use case, you just do `./revng` and
         | your environment is untouched.
         | 
         | We ship our Python, but you don't have to use it: you're
         | supposed to just do ./revng (or interact over the network in
         | daemon mode).
         | 
         | Our approach is: use whatever tool you like for scripting as
         | long as it can parse our YAML project file, make changes to it,
         | and then invoke `./revng artifact` (or interact with the
         | daemon): https://docs.rev.ng/user-manual/model-tutorial/
         | 
         | Result: we get to use our Python version (the latest) and you
         | get to use whatever language you like. Then we'll provide
         | wrappers on PyPI that help you with that and are compatible
         | with a large set of Python versions.
         | 
         | tl;dr Don't `source ./environment`, use `./revng`.
         | 
         | > Thanks again for shipping, and I hope some of the above
         | feedback is helpful!
         | 
         | I'm happy there's someone that cares about this :D
         | 
         | Our next big iteration of this might involve simplifying things
         | a lot by adopting nix + mount namespace to make /nix/store
         | available without root.
         | 
         | Maybe this is not the right place for discussing this, we can
         | chat on our discord server if you'd like :)
        
           | JonChesterfield wrote:
           | Not setting environment variables is indeed solved by not
           | setting environment variables - but `source ./environment` is
           | what's written on the announcement page at the top of this
           | thread. './revng' doesn't appear anywhere on it.
           | 
           | You haven't set LD_LIBRARY_PATH but other people will.
           | Also LIBRARY_PATH, and put other stuff on PATH and so forth.
           | Module systems are especially prone to this, but ending up
           | with .bashrc doing it happens too.
           | 
           | You have granted the user the ability to override parts of
           | the toolchain with environment variables and moving files to
           | various different directories. That's nice. Some compiler
           | devs will appreciate it. Also it's doing the thing Linux
           | recommends for things installed globally so that's
           | defensible.
           | 
           | In exchange, you will get bug reports saying "your product
           | does not work", where the root cause eventually turns out to
           | be "my linker chose a different library to my loader for some
           | internal component". You also lose however many people try
           | the product once, see it immediately fall over and don't take
           | the time to tell you about the experience.
           | 
           | I think that's a bad trade-off. Static linking is my
           | preferred fix, but generally anything that stops forgotten
           | environment variables breaking your software in confusing
           | ways is worth considering.
        
             | aleclm wrote:
             | > `source ./environment` is what's written on the
             | announcement page at the top of this thread. './revng'
             | doesn't appear anywhere on it.
             | 
             | You're right, but after that there's a link to the docs
             | where we say to use `./revng`. The blog post is for the
             | impatient :) In the long run the docs are what most people
             | will look at.
             | 
             | I don't think we want to support use cases that might break
             | system packages too. If you set LD_LIBRARY_PATH to a
             | directory where you have an LLVM installation, that might
             | break any system program using LLVM too... Why should we
             | try to fix that using `DT_RPATH` (which is a deprecated way
             | of doing things) when system components don't do it?
             | 
             | We might scrub LD_LIBRARY_PATH and other stuff from the
             | environment, that might be a sensible default, yeah. Also
             | we might have some sanity check printing a warning if weird
             | libraries are pulled in.
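
A minimal sketch of what scrubbing the environment with a warning could look like in a launcher. The list of variables is an illustrative assumption rather than an exhaustive or official one, and the environment passed in is a made-up example.

```python
from typing import Dict, List, Mapping, Tuple

# Variables that can redirect the linker or loader to unexpected libraries.
# This selection is an assumption made for illustration.
RISKY_VARIABLES = ("LD_LIBRARY_PATH", "LD_PRELOAD", "LIBRARY_PATH")


def scrub_environment(env: Mapping[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return a copy of env without the risky variables, plus what was dropped."""
    clean = {key: value for key, value in env.items() if key not in RISKY_VARIABLES}
    removed = sorted(key for key in env if key in RISKY_VARIABLES)
    return clean, removed


clean_env, removed = scrub_environment(
    {"PATH": "/usr/bin", "LD_LIBRARY_PATH": "/opt/hpc/lib"}
)
for name in removed:
    print(f"warning: ignoring {name} for this invocation")
```

A wrapper would then start the real tool with something like `subprocess.run(command, env=clean_env)`, so the caller's shell is never modified.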
             | 
             | But it's hard to make a decision without a specific use
             | case in mind. If you have an example, bring it forward and
             | I'm happy to discuss what should be the right approach
             | there.
        
               | JonChesterfield wrote:
               | LLVM picking up the wrong libraries from the environment
               | has cost me at least a couple of months over the last
               | decade or so. Maybe twenty instances of customers being
               | broken, ten hours or so in meetings explaining the
               | problem and trying to persuade people that the right
               | thing really is different for the system compiler vs your
               | bespoke thing.
               | 
               | If you think it's better for your product to find
               | unrelated libraries with the same name at runtime, you go
               | for it.
               | 
               | Detecting that failure mode would be an interesting
               | exercise - you could crawl your own address space after
               | startup and try to guess whether the libraries you got
               | are the ones you wanted. Probably implementable.
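
One way to approximate that check on Linux is to read the process's own memory map and flag shared objects loaded from outside the installation directory. The sketch below parses text in the `/proc/self/maps` format; the sample mappings, paths and the installation root are made up for illustration.

```python
from typing import List, Set


def mapped_libraries(maps_text: str) -> Set[str]:
    """Return paths of file-backed mappings that look like shared objects."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, permissions, offset, device, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith("/") and ".so" in path.rsplit("/", 1)[-1]:
                paths.add(path)
    return paths


def unexpected_libraries(maps_text: str, root: str) -> List[str]:
    """List mapped shared objects that do not live under root."""
    prefix = root.rstrip("/") + "/"
    return sorted(p for p in mapped_libraries(maps_text) if not p.startswith(prefix))


# Made-up sample in the /proc/<pid>/maps layout.
SAMPLE = """\
7f1000000000-7f1000200000 r-xp 00000000 08:01 111 /opt/revng/root/lib/libLLVMCore.so.16
7f1000400000-7f1000600000 r-xp 00000000 08:01 222 /opt/hpc/lib/libc++.so.1
7f1000800000-7f1000a00000 rw-p 00000000 00:00 0 [heap]
7f1000c00000-7f1000e00000 rw-p 00000000 00:00 0
"""

strays = unexpected_libraries(SAMPLE, "/opt/revng/root")
print(strays)
```

In a real process the input would come from `open("/proc/self/maps").read()`, and libraries that are legitimately taken from the system (glibc, for instance) would need an allow-list.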
        
       | dark-star wrote:
       | It doesn't work with my ELF file:
       | 
       |     [orchestra] [darkstar@shiina revng]$ ./revng artifact --analyze --progress decompile-to-single-file ../maytag.ko
       |     [=======================================] 100% 0.57s Analysis list revng-initial-auto-analysis (5): import-binary
       |     [===================>                   ]  50% 0.57s Run analyses lists (2): revng-initial-auto-analysis
       |     [=========>                             ]  25% 0.57s revng-artifact (2): Run analyses
       |     Only ELF executables and ELF dynamic libraries are supported
       |     [orchestra] [darkstar@shiina revng]$ file ../maytag.ko
       |     ../maytag.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (FreeBSD), not stripped
       | 
       | Does it not support FreeBSD binaries?
       | 
       | Edit: Ah I missed that it doesn't support kernel modules,
       | probably has nothing to do with FreeBSD but the fact that this is
       | not a simple executable
        
         | aleclm wrote:
         | Can you open an issue on GitHub and attach the binary? I don't
         | think it should be too hard to load that.
        
       | costco wrote:
       | Congrats. Do you have any regrets about outsourcing lifting to
       | the QEMU TCG or has it worked well?
        
         | aleclm wrote:
         | Thanks!
         | 
         | It has been working very well. Two regrets:
         | 
         | 1. Not rebasing our fork of QEMU for years has put us in a bad
         | spot. But just today a member of our team managed to lift stuff
         | with the latest QEMU. And he has also been able to lift
         | Qualcomm Hexagon code, for which we helped to add support in
         | QEMU. Eventually we'll be the first proper Hexagon decompiler
         | :)
         | 
         | 2. Focusing too much on QEMU led our frontend to be tightly
         | coupled with QEMU. It will now take some effort to enable
         | support for additional frontends, non-QEMU based. But not
         | impossible: our idea is to let users add support for a new
         | architecture by defining, in C, a struct for the CPU state and
         | a bunch of functions acting on it. That's it. No need to learn
         | any internal representation.
         | 
         | tl;dr QEMU was a great choice, it worked so well that we didn't
         | work on that part of the codebase for too long and now
         | there's some technical debt there. But we're addressing it.
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:02 UTC)