[HN Gopher] The rev.ng decompiler goes open source
___________________________________________________________________
The rev.ng decompiler goes open source
Author : quic_bcain
Score : 243 points
Date : 2024-03-29 00:01 UTC (23 hours ago)
(HTM) web link (rev.ng)
(TXT) w3m dump (rev.ng)
| Fnoord wrote:
| Price model:
|
| > Very briefly:
|
| > The rev.ng framework is fully open source. You can decompile
| anything you want from the CLI.
| > The UI will be available in the following forms:
| > free to use in the cloud for public projects;
| > available through a subscription in the cloud for private
| projects;
| > available at a cost as a fully standalone, fully offline
| application.
|
| In comparison, Hopper costs 100 USD with one year of updates [1].
| Ghidra and Radare2 are FOSS and completely free to use, IDA Pro
| costs a fortune
|
| [1] https://www.hopperapp.com/index.html
| eyegor wrote:
| Binary ninja is another good option. In my experience it's
| pretty similar to ida but I find it more user friendly. It just
| has a lot of well thought out features that make me more
| productive. I haven't tried hopper but ghidra and radare2 both
| had a bad dev experience and produced c that didn't "read
| well". Granted it's been a couple of years since I tried
| either.
|
| Binja is $300 (or $1500 for commercial, both cheaper for
| students).
|
| https://binary.ninja/features
| aleclm wrote:
| Students shouldn't pay a dime. They are poor.
|
| Our view is: the engine is 100% open source. The UI is
| available for free in the cloud for anyone experimenting,
| which we define as "I'm OK with leaving the project public".
|
| Basically, the decompiler engine is Free Software, extensible
| and available for automation/scripting, while the UI is
| available for free for students/researchers and we can make a
| living out of professionals (i.e., when your company is
| paying for it).
| halayli wrote:
| They are now offering a free version:
| https://binary.ninja/2024/02/28/4.0-dorsai.html
| Nereuxofficial wrote:
| Oh that is awesome! I've used the cloud version previously
| but now that the desktop version is free with some small
| limitations i think I'll probably use it instead of Ghidra
| felipefar wrote:
| I really like licensing models of one-time payments with a pre-
| defined duration of updates. But I wonder how they enforce it
| while not making internet access a requirement for the app.
| 8organicbits wrote:
| I've been planning to use a non-enforcement model for a
| future project. Some users will always pay, because of
| corporate policy or ethics. Some will never pay and will
| reverse engineer out any software license checks. Asking the
| user if they have a license keeps the honest ones honest and
| permits ad-hoc free trials, emergency use, and other
| reasonable "unlicensed use".
| userbinator wrote:
| _Some will never pay and will reverse engineer out any
| software license checks._
|
| For a long time (and might still be; not paying much
| attention anymore), it was a "rite of passage" in the scene
| to crack IDA... using itself.
| mrexodia wrote:
| It never was a "rite of passage", because removing IDA's
| license checks has always been trivial...
| dvzk wrote:
| Decompilation is often the least important (and least reliable)
| part of IDA/Ghidra, so comparing the two is unfair. That said,
| the scene is perpetually starved for good C decompilers, so
| more attempts are always exciting.
| saagarjha wrote:
| I hear this a lot and in my experience people who use Ghidra or
| IDA and don't use the decompiler are exceptionally rare. Why
| would you suffer that when you can use something else for
| what you actually want?
| dvzk wrote:
| I didn't say I never use it, just that it's not always the
| core feature. This will depend heavily on your field, but
| in my past work, the features that were way more essential
| are: scripting (+ IR lifting), xrefs, CFGs, labels/notes
| (in a persistent DB).
|
| In my experience decompilers will totally ignore or fail on
| certain types of malicious code, so they mainly exist to
| assist disassembly analysis. And for that purpose, they
| save us an incredible amount of human hours.
| aleclm wrote:
| For scripting, our approach is to give you access to the
| project file (just a YAML file), and you can make changes
| from any scripting language you want. Everything the user
| can customize is in there, all the rest is
| deterministically produced from that file.
|
| I really disliked the fact that you usually need to buy
| into the version of Python that $TOOL requires you to
| use, or the fact itself that you need to use a specific
| language.
|
| Can parse YAML? You're mostly done.
|
| The "project file" is what we call the model:
| https://docs.rev.ng/user-manual/model-tutorial/
|
| For xrefs, CFG and the rest: we have all of that in the
| UI, but we also produce them in a rich way. For instance,
| when we emit disassembly and decompiled code, we actually
| emit plain text + HTML-like markup to provide
| metainformation for navigation (basically, xrefs) and
| highlighting. So you can use all that from any language
| that can parse HTML/XML. It's called PTML:
| https://docs.rev.ng/references/ptml/
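|
| A minimal sketch of the "any language that can parse HTML/XML"
| point, using only Python's standard html.parser. The sample markup
| and its attribute names (data-kind, data-target) are hypothetical
| placeholders, not taken from the PTML reference:

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Collect the plain text and, separately, every tagged span with its attributes."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # all character data, in document order
        self.spans = []        # (attributes, enclosed text) for each closed tag
        self._open = []        # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open.append((dict(attrs), []))

    def handle_data(self, data):
        self.text_parts.append(data)
        # Text belongs to every tag that is currently open.
        for _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        if self._open:
            attrs, chunks = self._open.pop()
            self.spans.append((attrs, "".join(chunks)))


def parse_markup(markup):
    """Return (plain_text, spans) for an HTML-like annotated string."""
    parser = MarkupStripper()
    parser.feed(markup)
    parser.close()  # flush any text buffered after the last tag
    return "".join(parser.text_parts), parser.spans


# Hypothetical annotated line: the attribute names are made up for illustration.
SAMPLE = (
    '<span data-kind="hypothetical-call" data-target="parse_header">'
    "parse_header</span>(buf);"
)

text, spans = parse_markup(SAMPLE)
print(text)
print(spans)
```

| The plain text is what a human reads; the collected spans are what
| a script would use for navigation or cross-references.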
|
| For lifting: we use LLVM IR as our internal
| representation. This means that: 1) you don't have to
| learn an IR that no one else uses, 2) you can use off the
| shelf tools (e.g., KLEE for symbolic execution) but you
| can also use all the standard LLVM optimizations and
| analyses and 3) you can recompile it, but we're not into
| the binary translation business anymore.
| znpy wrote:
| > 3) you can recompile it, but we're not into the binary
| translation business anymore
|
| How come?
| aleclm wrote:
| Short answer: if you want to execute a program (maybe
| with some instrumentation, for fuzzing purposes) it's
| much easier to adopt a dynamic approach (i.e., emulation
| or virtualization). With static binary translation you
| can get better performance, but there's a lot of other
| things you need to get 100% right and that with a dynamic
| approach are a given (e.g., the CFG).
|
| There's much more space of improvement in the field of
| analyzing code (as opposed to running it), so we're
| investing our energies there.
|
| Then we're strong believers in integrating dynamic and
| static information, for instance see PageBuster:
| https://rev.ng/blog/pagebuster
|
| But other than that, static binary translation is a
| feature of rev.ng in maintenance mode.
| vient wrote:
| Huh, for me as a malware analyst previously and a reverse
| engineer in general, decompilation is the most important part
| of such tools. It's all about speed, pseudo-C of some kind
| lets you roughly understand what's going on in a function in
| seconds. I guess you can become pretty fast with assembly
| too, but C is just a lot more dense.
|
| Regarding reliability, I would say that Hex-Rays is pretty
| reliable (at least for x86) if you know its limitations, like
| throwing away all code in catch blocks. Usually wrong
| decompilation is caused by either wrong section permissions,
| or wrong function signature, both of them can be fixed. It
| can have bad time when stack frame size goes "negative" or
| some complex dynamic stack array logic is involved, which are
| usually signs of obfuscation anyway.
|
| It was less reliable 10 years ago though.. Also even now hex-
| rays weirdly does not support some simple instructions like
| movbe.
| aleclm wrote:
| > Decompilation is often the least important (and least
| reliable) part of IDA/Ghidra
|
| This is something all people using decompilers say and sort
| of shows how low trust in decompilers is. Expectations
| have always been rather low.
|
| I've been there, but this does not have to be the case, the
| whole reason why we started rev.ng is to prove that
| expectations can be raised.
|
| Apart from accuracy, which is difficult but engineering work,
| why don't decompilers emit syntactically valid C? Have you
| ever tried to re-compile code from _any_ decompiler? It's a
| terrible experience.
|
| rev.ng only emits valid C code, and we test it with a bunch
| of -Wall -Wextra:
|
| https://github.com/revng/revng-c/blob/develop/share/revng-c/.
| ..
|
| Other key topic: data structures. When reversing I spend half
| of the time renaming things and half of the time detecting
| data structures. The help I get from decompilers in the latter is
| basically none.
|
| rev.ng, by default, detects data structures on the whole
| binary, interprocedurally, including arrays. See the linked
| list example in the blog post. We also have plans to detect
| enums and other stuff.
|
| Clearly we're not there yet, we still need to work on
| robustness, but our goal is to increase the confidence in
| decompilers and actually offer features that save time.
| Certain tools have made progress in improving the UI and the
| scripting experience, but there's other things to do beyond
| that.
|
| I see this a bit like the transition from the phase in which
| C developers were using macros to ensure things were being
| inlined/unrolled to the phase where they stopped doing that
| because compilers got smart enough to do the right thing and to
| do it much more effectively.
| saagarjha wrote:
| Curious what you do when you encounter an instruction you
| don't model
| aleclm wrote:
| That's unlikely, since we use QEMU as a lifter, which
| sometimes supports new instructions before they hit
| silicon.
|
| However, I think we'll emit a call to some `noreturn`
| function. Basically we emit a call to `abort`.
| saagarjha wrote:
| Right but you do see how this means that you need to lift
| code that has semantics that cannot be modeled in C?
| aleclm wrote:
| Sure, in those cases we emit calls to C functions. The
| only thing we need to know is what registers are taken as
| input, what registers are output and what registers are
| preserved.
|
| In QEMU parlance, these are helper functions, and they
| have actual implementations. But for decompilation
| purposes, you don't need to implement them. You just need
| to know how they interact with the registers.
| j-krieger wrote:
| What happens if you put in a binary which outputs C-like
| machine code, like Rust (llvm) or zig?
| aleclm wrote:
| Languages with a rich standard library and generating a
| lot of code for you usually need some love to get rid of (or
| idiomatically represent) common patterns and to detect
| common data structures.
|
| We haven't looked into it yet, but the automatic data
| structure recognition might help.
|
| Frankly, Rust looks particularly scary:
| https://media.ccc.de/v/37c3-11684-rust_binary_analysis_featu...
| tux3 wrote:
| Oh, very nice! I've dealt with forsaken deeply abstract
| vtable mazes of hell, but the idea of using a ton of sum
| types, dynamic dispatch, async everywhere, and long
| iterator chains would make for some deliciously
| unreadable binaries!
| Sesse__ wrote:
| > Other key topic: data structures. When reversing I spend
| half of the time renaming things and half of the time
| detecting data structures. The help I get from decompilers
| in latter is basically none.
|
| That's funny, because I've used both Hex-Rays and Ghidra,
| and gotten lots of help with data structures. The
| interactivity really helps a bunch with filling in the
| blanks.
| aleclm wrote:
| In IDA you basically have only detection of stack frame
| layout (in a quite confusing fashion) and "create struct
| out of this pointer", which is something you have to do
| manually and it's intraprocedural.
|
| Imagine this being done automatically, across all of the
| binary. If you pass a pointer to another function the
| type is correct and you build the type from all the
| functions using it.
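|
| A toy sketch of what "build the type from all the functions using
| it" could look like. This is an illustration under simplifying
| assumptions, not rev.ng's algorithm: field accesses are given as
| (offset, size) pairs per pointer, and pointers passed between
| functions are unified with a union-find. All function and variable
| names in the example are hypothetical.

```python
from collections import defaultdict


def merge_layouts(accesses, pointer_flows):
    """Merge field accesses across functions.

    accesses:      {(function, pointer): {(offset, size), ...}}
    pointer_flows: [((caller, pointer), (callee, parameter)), ...]
    Returns {(function, pointer): sorted list of (offset, size)} where every
    pointer in the same equivalence class shares the same merged layout.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for key in accesses:
        find(key)
    for src, dst in pointer_flows:
        union(src, dst)

    # Collect every access seen through any pointer of the same class.
    merged = defaultdict(set)
    for key, fields in accesses.items():
        merged[find(key)].update(fields)

    return {key: sorted(merged[find(key)]) for key in list(parent)}


# Hypothetical example: two callees each touch a different field of the
# same object, which main() passes to both of them.
accesses = {
    ("list_len", "node"): {(0, 8)},   # reads a pointer-sized field at offset 0
    ("list_sum", "node"): {(8, 4)},   # reads a 4-byte field at offset 8
    ("main", "head"): set(),          # main only forwards the pointer
}
pointer_flows = [
    (("main", "head"), ("list_len", "node")),
    (("main", "head"), ("list_sum", "node")),
]

layouts = merge_layouts(accesses, pointer_flows)
print(layouts[("main", "head")])
```

| An intraprocedural analysis would see one field per function here;
| unifying across calls yields a single two-field layout.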
|
| Then obviously the user needs to fix things, but
| bootstrapping can definitely be hugely improved.
| Sesse__ wrote:
| I'm sure user-defined structs can benefit from combining
| information from multiple functions, but saying that what
| you get today is "basically none" is a bit of an
| overstatement. Also, the special (and important!) case of
| operating system ABI structs is great, and that
| information propagates throughout function calls.
| jcranmer wrote:
| Here's my issue with decompilers:
|
| I don't want to look at assembly code. I'd rather see
| expression trees, expressed in C-like syntax, than trying
| to piece together variables from two-address or three-
| address instructions. Looking at assembly tends to lead to
| brain farts like "wait, was the first or second operand the
| output operand?" (really, fuck AT&T syntax) or "wait, does
| ja implement ugt or sgt?"
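|
| For reference, a small sketch of the x86 answer: after `cmp a, b`
| (Intel operand order, flags computed from a - b), `ja` is the
| unsigned comparison and `jg` is the signed one. The helper names
| below are made up for illustration; the flag conditions follow the
| documented Jcc definitions.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags produced by a 32-bit `cmp a, b`, i.e. by computing a - b."""
    a &= MASK32
    b &= MASK32
    res = (a - b) & MASK32
    cf = a < b                                  # unsigned borrow
    zf = res == 0
    sf = bool(res >> 31)                        # sign bit of the result
    of = bool(((a ^ b) & (a ^ res)) >> 31)      # signed overflow on subtraction
    return cf, zf, sf, of


def ja(a, b):
    """Jump if above: unsigned a > b (CF == 0 and ZF == 0)."""
    cf, zf, _, _ = cmp_flags(a, b)
    return not cf and not zf


def jg(a, b):
    """Jump if greater: signed a > b (ZF == 0 and SF == OF)."""
    _, zf, sf, of = cmp_flags(a, b)
    return not zf and sf == of


# 0xFFFFFFFF is 4294967295 unsigned but -1 signed.
print(ja(0xFFFFFFFF, 1), jg(0xFFFFFFFF, 1))
```

| So `ja` corresponds to ugt and `jg` to sgt, which is exactly the
| kind of detail a C-like expression makes explicit.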
|
| So that means I want to look at something vaguely C-like.
| But the problem is that the C type system is too powerful
| for decompilers to robustly lift to, and the resulting code
| is generally at best filled with distractions of wait-I-
| can-fix-this excessive casting and at worst just wrong. And
| when it's wrong, I have to resort to staring at the
| assembly, which (for Ghidra at least) means throwing away a
| lot of the notes I've accumulated because they don't
| correlate back to underlying assembly.
|
| So what I really want isn't something that can emit
| recompilable C code, that's optimizing for something that
| doesn't help me in the end. What I want is robust
| decompilation to something that lets me ignore the assembly
| entirely. I'm a compiler writer, I can handle a language
| where integers aren't signed but the operands are.
| aleclm wrote:
| I 120% agree with what you're saying, but emitting valid
| C is kinda part of what you're asking, in design terms.
|
| Our goal is: omit all the casts that can be omitted
| without changing the semantics according to C. In fact,
| we have a PR doing exactly this (still on the old repo,
| hopefully it will go in soon).
|
| But, how can you expect to be able to be strict with what
| C allows you to do implicitly, if you're not even
| emitting valid C? For instance, thanks to the fact that
| we emit valid C, we could test if the assembly emitted by
| a compiler is the same before and after removing
| redundant casts.
|
| My point is that emitting valid C is kind of a
| prerequisite for what you're asking, a rather low bar to
| pass, but that, in practice, no mainstream decompiler
| passes. It's pretty obvious the decompiled code will
| often be redundant and outright wrong if you don't even
| guarantee it's syntactically valid. Then clearly it's not
| a panacea, but it's an important design criterion and
| shows the direction we want to go.
|
| As for comments: we still haven't implemented inline
| comments, but they will be attached to program addresses,
| so they will be available both in disassembly and
| decompiled C. It's not very hard to do, but that needs
| some love.
| jcranmer wrote:
| One of the blog posts I keep meaning to write but never
| quite get around to is a post that C is not portable
| assembly. What is necessary is decompilation to a
| portable C-like assembly, but that target is not C, and I
| think focusing on creating valid C tends to drag you
| towards suboptimal decisions, even leaving aside issues
| like "should SLL decompile to x << y or x << (y % 32)?"
|
| In my experience with Ghidra, I've just seen far too many
| times where Ghidra starts with wrong types for something
| and the result becomes gibberish--even just plain
| _dropping_ stuff altogether. There are some cases where
| it's clear it's just poor analysis on Ghidra's part
| (e.g., it doesn't seem to understand stack slot reuse,
| and memcpy-via-xmm is very confusing to it). And Ghidra's
| type system lacks function pointer types, which is very
| annoying when you're doing vtable-heavy C++ code.
|
| I do like the appeal of a recompileable target language.
| But that language need not be C--in fact, I'm actually
| sketching out the design of such a language for my own
| purposes in being able to read LLVM IR without going
| crazy (which means I need to distinguish between, e.g.,
| add nuw and just plain add).
|
| Analysis necessarily involves multiple levels. Given that
| a lot of the type analysis today tends to be crap, I'd
| rather prefer to have the ability to see a more solid
| first-level analysis that does variable recovery and
| works out function calling conventions so that it can
| inform my ability to reverse engineer structures or
| things like "does this C++ method return a non-trivial
| struct that is an implicit first parameter?"
|
| (Also, since I'm largely looking at C++ code in practice,
| I'd absolutely love to be able to import C++ header files
| to fill in known structure types.)
| aleclm wrote:
| > should SLL decompile to x << y or x << (y % 32)?
|
| I think this is a bit of a misguided question. The hardware
| usually has precisely defined semantics. QEMU's <<
| behaves similarly to C (undefined behavior for rhs >= 32),
| but this means that the lifter (still QEMU) will account
| for this and emit code preserving the semantics.
|
| tl;dr: the code we emit should do the right thing
| depending on what the original instruction did, without
| making assumptions on what happens in case of C undefined
| behaviors.
|
| > Ghidra's type system lacks function pointer types
|
| Weird limitation, we support those.
|
| > it doesn't seem to understand stack slot reuse
|
| That's a tricky one. We're now re-designing certain parts
| of the pipeline to enable LLVM to promote stack accesses
| to SSA values, which basically solves the stack slot
| reuse. This is probably one of the most important
| features experienced reversers ask for.
|
| > that language need not be C--
|
| Making up your own language is a temptation one should
| resist.
|
| Anyway, we're rewriting our backend using an MLIR dialect
| (we call it clift) which targets C but should be good
| enough to emit something "similar to C but slightly
| different". It might make sense to have a different
| backend there. But a "standard C" backend has to be the
| first use case.
|
| We thought about emitting C++, it would make our life
| simpler. But I think targeting non-C as the first and
| foremost backend would be a mistake.
|
| Also, a Python backend would be cool.
|
| > Analysis necessarily involves...
|
| I would be interested in discussing more what exactly you
| mean here. Why don't you join our discord server?
|
| > I'd absolutely love to be able to import C++ header
| files to fill in known structure types
|
| We have a project for importing from header files.
| Basically we want to use a compiler to turn them into DWARF
| debug symbols and then import those. Not too hard.
| nextos wrote:
| A cool company fueled by one of the best PLT books out there:
| https://link.springer.com/book/10.1007/978-3-662-03811-6
|
| _" He also met a partner in crime, Pietro. Romantically enough,
| he met him thanks to a book which will turn out to be
| foundational for company."_
|
| https://rev.ng/about
|
| Congrats on the launch.
| aleclm wrote:
| About the book, here's the full story: I was getting into
| compilers, but I was really struggling with the theory, the
| most famous books weren't doing it for me, and I felt really
| down.
|
| Then I find this book, which seems very dense, but clear. So I
| ask my advisor if I could buy it and he goes like "well, first
| check out the university library". I check it out and there's a
| copy, but... it's taken.
|
| Working in the only group that was doing research on compilers
| I'm like "who dares do compilers stuff out of our group!?".
|
| I go to the library:
|
| Me: who has the book?
|
| Library guy: can't tell you, privacy reasons.
|
| Me: what's the third letter of its surname?
|
| Library guy: Z
|
| Me: what's the second letter of its name?
|
| Library guy: I
|
| Me: thanks.
|
| I go here:
| https://www.deib.polimi.it/ita/personale-lista-alfabetica
| I found him.
|
| Fast forward, we become friends and we start the company
| together.
|
| > Congrats on the launch.
|
| Thanks! It was a lot of work.
| albertzeyer wrote:
| Checking the team about: https://rev.ng/about
|
| And looking at the code contributions:
| https://github.com/revng/revng/graphs/contributors
|
| Isn't it a bit weird that the CEO (aleclearmind) has most
| commits, even much more than the CTO (pfez)? I often hear the
| complaints from other CEOs that they don't really find any time
| anymore to code... Even the CTO usually is more on the managing
| side and less active in actual coding.
|
| Anyway, if this works, then I guess it's a lot of fun for them.
|
| _Edit_ Ah right, I didn't check the timeline.
| zote wrote:
| The CTO has more recent commits, aleclearmind's commits drop to
| 0 after 2020 so maybe they also have a hard time getting to
| code.
| aleclm wrote:
| The CTO mostly works on the backend of the decompiler, revng-c,
| which we just released:
|
| https://github.com/revng/revng-c/commits/develop/
|
| Eventually we'll merge the two repos.
|
| Also, I develop stuff every day. For some reason GitHub is not
| picking up my user correctly.
|
| > Anyway, if this works, then I guess it's a lot of fun for
| them.
|
| It is!
| albertzeyer wrote:
| I wonder a bit about the downvotes. I didn't mean this as a
| criticism or so in any way. In fact, I like this very much. I
| just found this interesting and unlike what I saw elsewhere.
|
| So the downvotes are because this is not interesting or not
| unusual?
| halayli wrote:
| your observation was spot on and your question was answered
| by the ceo. People on hn can be oversensitive.
| londons_explore wrote:
| Idea: automatically name variables and members of structs based
| on how code interacts with them.
|
| Eg. The next pointer in a linked list should be easy to identify
| as 'next'.
|
| That would be done by downloading all of GitHub, then seeing what
| variables in GitHub code have the most similar layouts and
| interactions, and then if the confidence is high enough, using
| those names.
| qweqwe14 wrote:
| Sort of like GitHub Copilot but for reversing?
| aleclm wrote:
| In the past we were thinking to do something like this by hand.
| For instance, we detect induction variables, we could rename
| them into `i`.
|
| However, nowadays, it seems pretty obvious that the right way
| to do these things is using LLMs.
|
| This said, at this stage, we see ourselves as people building
| robust infrastructure. Once the infrastructure is there, using
| some off the shelf model to rename things or add comments is
| relatively easy.
|
| Basically: we do the hard decompilation work that needs 100%
| accuracy, and then we can adopt LLMs for things that are OK to
| be approximate such as names, comments and the like.
|
| Anyway, writing a script that renames stuff is pretty easy.
| Check out the docs:
| https://docs.rev.ng/user-manual/model-tutorial/
| londons_explore wrote:
| If an LLM is used, it's unclear how to best do it.
|
| One could try to train one's own LLM from scratch, using an
| encoder-decoder (translation - aka seq2seq) architecture
| trying to predict the correct variable name given the
| decompiled output.
|
| One could try to use something like GPT-4 with a carefully
| designed prompt "Given this datastructure, what might be the
| name for this field?"
|
| One could try to use something pretrained like llama, but
| then finetune it based on hundreds of thousands of compiled
| and decompiled programs.
| Eisenstein wrote:
| Option 4:
|
| One could take a pretrained model like llama, train it on
| only a few thousands of compiled and decompiled programs,
| then feed it compiled programs and have it decompile them
| and evaluate that output to make a new dataset and fine
| tune it again. Repeat until satisfactory.
| 19h wrote:
| Sounds like sidekick for binary ninja
| diggan wrote:
| Would be very cool indeed, something like http://jsnice.org/
|
| Paper that describes what JSNice is doing behind the scenes:
| https://files.sri.inf.ethz.ch/website/papers/jsnice15.pdf
| yakkityyak wrote:
| I hope collaborative workflows get a lot of attention. I haven't
| used IDA teams or anything, but a reverse engineering experience
| that felt as frictionless as Google Docs would be amazing.
| aleclm wrote:
| That's our goal. We used to use QtCreator as a basis for the
| UI, terrible idea.
|
| Then we switched to VSCode, which happens to be able to run in
| the browser. So we added some magic kubernetes sauce and voila,
| you got the cloud decompiler with exactly the same user
| experience as the fully standalone one.
|
| We still need to perform some QA on collaboration, but
| basically works. One daemon, many clients. Very simple
| architecture.
|
| I think we got inspiration to do this from a CTF where we were
| doing "collaboration" using IDA with multiple windows on a X
| session on a server with multiple cursors. Very cursed, but
| effective.
| fwr00t wrote:
| Seems exciting. I'm keen to try the fully standalone version. Is
| there any news about tentative pricing? Hopefully it's affordable
| enough for hobbyists as well.
| JonChesterfield wrote:
| Always pleased to see more binary hacking tools. A load of
| overly-precise suggestions on the chosen packaging format follows
| because I might want to use this tool myself :)
|
| > `source ./environment`
|
| That's a bad omen. I downloaded the tar to find it does indeed
| set a bunch of environment variables including PATH, though
| thankfully not LD_LIBRARY_PATH. Mostly prefixed "HARD_" which is
| maybe unique (REVNG would be a more obvious choice, colliding
| with existing environment variables is a bad thing).
|
| It sets `AWS_EC2_METADATA_DISABLED="true"` which won't break me
| (I don't use AWS) but in general seems dubious.
| export RPATH_PLACEHOLDER="////////////////////////////////////////////////$ORCHESTRA_ROOT"
| export HARD_FLAGS_CXX_CLANG="-stdlib=libc++" ... "-Wl,-rpath,$RPATH_PLACEHOLDER/lib ...
|
| This is suboptimal. The very long PATH setting with mingw32 and
| gentoo and mips strings in it also looks very fragile.
|
| I usually bail when the running instructions include "now mangle
| your environment variables" because that step is really strongly
| correlated with programs that don't work properly on my non-
| ubuntu system. Wiring your application control flow through the
| launching environment introduces a lot of failure modes - it's
| not as convenient as it first appears. Very like global
| variables.
|
| Clang will burn a lot of this stuff in as defaults when you build
| it if you ask, e.g. `-DCLANG_DEFAULT_CXX_STDLIB=libc++` would
| remove the stdlib setting environment variable. DEFAULT_SYSROOT
| is useful too.
|
| Using rpath means you're vulnerable to someone running this
| script with LD_LIBRARY_PATH set, as the environment variable will
| override your DT_RUNPATH setting in the binaries. The background
| on this is aggravating. Abbreviating here, '-Wl,-rpath' no longer
| means rpath, it means 'runpath', which is a similar but much less
| useful construct. The badly documented invocation you probably
| want is `-Wl,-rpath -Wl,--disable-new-dtags` to set rpath instead
| of runpath, at which point the loader will ignore
| LD_LIBRARY_PATH when looking for libraries.
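|
| A small model of the glibc dynamic loader's search order as
| described in the ld.so(8) manual page, to show where the two tags
| sit relative to LD_LIBRARY_PATH. The function name and its
| parameters are illustrative only:

```python
def library_search_order(has_rpath, has_runpath, ld_library_path_set,
                         secure_execution=False):
    """Return the places the loader consults, in order, for a needed library.

    Models the documented glibc behaviour:
      1. DT_RPATH, but only if the object has no DT_RUNPATH
      2. LD_LIBRARY_PATH, unless running in secure-execution mode
      3. DT_RUNPATH
      4. the ld.so cache
      5. the default directories
    """
    order = []
    if has_rpath and not has_runpath:
        order.append("DT_RPATH")
    if ld_library_path_set and not secure_execution:
        order.append("LD_LIBRARY_PATH")
    if has_runpath:
        order.append("DT_RUNPATH")
    order += ["ld.so.cache", "default directories"]
    return order


# Binary linked with new dtags (runpath): the environment wins.
print(library_search_order(False, True, True))
# Binary linked with --disable-new-dtags (rpath): the embedded path wins.
print(library_search_order(True, False, True))
```

| With DT_RUNPATH the environment variable is consulted first; with
| DT_RPATH the embedded path is, so a library found there is never
| looked up through LD_LIBRARY_PATH.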
|
| There's a good chance you can completely remove the environment
| mangling through a combination of setting different flags when
| building clang, static linking and embedding binaries in other
| binaries.
|
| Related, your clang-16 binary is dynamically linked. As in it
| goes looking for things like libLLVMAArch64CodeGen.so.16 at
| runtime. A lot of failure modes can be removed by
| LLVM_BUILD_STATIC=ON. E.g. if I run your dynamically linked clang
| with a module based HPC toolchain active, your compiler will pick
| up the libraries from the HPC toolchain and it'll have a bad
| time. The tools are all linked against glibc as well, pros and
| cons to that.
|
| Tools are also linked against libc++.so, which is linked against
| libc++abi.so and so forth. Worth considering static libc++, but
| even if you decline that, libc++abi and libunwind can and
| probably should be statically linked into the libc++. The above
| rpath rant? Runpath isn't transitive so dynamic libraries finding
| other dynamic libraries using runpath (the one you get when you
| ask for rpath) works really poorly.
|
| Context for there being so many suggestions above - I am
| completely out of patience with distributing dynamically linked
| programs on Linux. I don't want a stray environment variable from
| some program that had `source ourhack` in the readme or a "module
| system" to reach into my application and rewire what libraries it
| calls at runtime as the user experience and subsequent bug report
| overhead is terrible. Static linking is really good in
| comparison.
|
| Thanks again for shipping, and I hope some of the above feedback
| is helpful!
| aleclm wrote:
| I think most of your concerns about messing with the
| environment are sensible only under the assumption that you
| actually do `source environment`.
|
| In truth, we suggest to do that only so you use the GCC we
| distribute for the demo binary. The actual way this is intended
| to be used is through the `./revng` script. In that way, the
| environment changes only affect the invocation of `revng`.
|
| This is documented here:
| https://docs.rev.ng/user-manual/working-environment/
| We should probably add a warning about `source ./environment`.
|
| Now, let's get to each of your comments :D
|
| > though thankfully not LD_LIBRARY_PATH
|
| We spent a lot of time to have a completely self-contained set
| of binaries where each ELF refers to its dependencies through
| relative paths. LD_LIBRARY_PATH is evil.
|
| > Mostly prefixed "HARD_"
|
| Those are just used by our compiler wrappers, I don't think
| those environment variables collide with anything in practice.
|
| > It sets `AWS_EC2_METADATA_DISABLED="true"`
|
| Original discussion:
| https://github.com/revng/revng/pull/309#discussion_r12805759...
|
| I guess we could patch the AWS SDK to avoid this. Anyway, it
| affects only when rev.ng is running in the cloud.
|
| > export RPATH_PLACEHOLDER=...
| > export HARD_FLAGS_CXX_CLANG=...
|
| Those are used when linking binaries translated by revng. If
| you're not interested in end-to-end binary translation, they
| don't matter.
|
| > it means 'runpath' which is a similar but much less useful
| construct
|
| We specifically want DT_RUNPATH. DT_RPATH is deprecated and
| there might be a use case for overriding our libraries with
| LD_LIBRARY_PATH.
|
| > There's a good chance you can completely remove the
| environment mangling
|
| I think your observations concerning "mangling the environment"
| are only valid for non-private environment variables. The
| following variables are private: RPATH_PLACEHOLDER, HARD_*,
| REVNG_*. Also, they are all only for binary translation
| purposes. We could push them down into some smaller-scoped
| compiler wrappers, but those make sense only if we can get rid
| of the environment entirely, which we can't because we ship Python.
|
| > a combination of setting different flags when building clang
|
| No, the flags also affect the linker and there's some features
| of our wrappers that cannot simply be burned in. We can push
| them in more private places, though.
|
| > a lot of failure modes can be removed
| > libc++abi and libunwind can and probably should be statically
| linked into the libc++
|
| We no longer have issues with that, our build system is pretty
| reliable in that regard. LLVM is just one of the components,
| these things need to work robustly in general, and they do
| (with quite some effort).
|
| You seem to be wary of dynamic linking. We put some effort
| into it, and now it works pretty well: it always looks things
| up in the right place, without hardcoding absolute paths
| anywhere and without any install phase that "patches" the
| binaries. The unpacked directory can be moved wherever you
| want.
|
| > I am completely out of patience with distributing dynamically
| linked programs on Linux
|
| You're thinking of some other solution. Ours does not use
| LD_LIBRARY_PATH, and all the binaries reference each other in
| a robust way using `$ORIGIN`. Try:
|
|   ./root/bin/python ./root/bin/revng artifact --help
|
| It works.
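|
| To make the `$ORIGIN` mechanism concrete: the loader replaces
| it with the directory containing the binary being loaded, so a
| relative runpath entry follows the tree wherever it is
| unpacked. A minimal Python sketch of that expansion, assuming
| POSIX paths (the install locations and the `$ORIGIN/../lib`
| entry below are hypothetical examples, not the actual rev.ng
| layout):

```python
import os


def expand_origin(entry, binary_path):
    """Expand $ORIGIN in a runpath entry relative to binary_path's directory."""
    # $ORIGIN stands for the directory that contains the binary itself.
    origin = os.path.dirname(os.path.abspath(binary_path))
    # Both the ${ORIGIN} and $ORIGIN spellings are accepted by the loader.
    expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return os.path.normpath(expanded)


# The same entry resolves inside whichever directory the tree was moved to.
print(expand_origin("$ORIGIN/../lib", "/opt/revng/root/bin/revng"))
print(expand_origin("$ORIGIN/../lib", "/home/user/tools/root/bin/revng"))
```

| No absolute path is stored in the binary, which is why no
| post-install patching step is needed.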
|
| But again, doing `source environment` is mostly for demo
| purposes. In the actual use case, you just do `./revng` and
| your environment is untouched.
|
| We ship our Python, but you don't have to use it: you're
| supposed to just do ./revng (or interact over the network in
| daemon mode).
|
| Our approach is: use whatever tool you like for scripting as
| long as it can parse our YAML project file, make changes to it,
| and then invoke `./revng artifact` (or interact with the
| daemon): https://docs.rev.ng/user-manual/model-tutorial/
|
| Result: we get to use our Python version (the latest) and you
| get to use whatever language you like. We'll then provide
| wrappers on PyPI that help you with that and are compatible
| with a large set of Python versions.
|
| tl;dr Don't `source ./environment`, use `./revng`.
|
| > Thanks again for shipping, and I hope some of the above
| feedback is helpful!
|
| I'm happy there's someone that cares about this :D
|
| Our next big iteration of this might involve simplifying things
| a lot by adopting nix + mount namespace to make /nix/store
| available without root.
|
| Maybe this is not the right place for discussing this, we can
| chat on our discord server if you'd like :)
| JonChesterfield wrote:
| Not setting environment variables is indeed solved by not
| setting environment variables - but `source ./environment` is
| what's written on the announcement page at the top of this
| thread. './revng' doesn't appear anywhere on it.
|
| You haven't set LD_LIBRARY_PATH, but other people will. They
| will also set LIBRARY_PATH, put other stuff on PATH and so
| forth. Module systems are especially prone to this, but ending
| up with .bashrc doing it happens too.
|
| You have granted the user the ability to override parts of
| the toolchain with environment variables and moving files to
| various different directories. That's nice. Some compiler
| devs will appreciate it. Also it's doing the thing Linux
| recommends for things installed globally so that's
| defensible.
|
| In exchange, you will get bug reports saying "your product
| does not work", where the root cause eventually turns out to
| be "my linker chose a different library to my loader for some
| internal component". You also lose however many people try
| the product once, see it immediately fall over and don't take
| the time to tell you about the experience.
|
| I think that's a bad trade-off. Static linking is my
| preferred fix, but generally anything that stops forgotten
| environment variables breaking your software in confusing
| ways is worth considering.
| aleclm wrote:
| > `source ./environment` is what's written on the
| announcement page at the top of this thread. './revng'
| doesn't appear anywhere on it.
|
| You're right, but after that there's a link to the docs
| where we say to use `./revng`. The blog post is for the
| impatient :) In the long run the docs are what most people
| will look at.
|
| I don't think we want to support use cases that might break
| system packages too. If you set LD_LIBRARY_PATH to a
| directory where you have an LLVM installation, that might
| break any system program using LLVM too... Why should we
| try to fix that using `DT_RPATH` (which is a deprecated way
| of doing things) when system components don't do it?
|
| We might clean LD_LIBRARY_PATH and other stuff out of the
| environment; that might be a sensible default, yeah. We
| might also add a sanity check that prints a warning if weird
| libraries are pulled in.
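|
| A minimal Python sketch of what such an environment cleanup
| could look like in a launcher. The list of scrubbed variables
| is an assumption for illustration, and nothing here reflects
| what the real `./revng` wrapper does:

```python
import os

# Variables assumed, for illustration, to redirect library lookup.
SCRUBBED = ("LD_LIBRARY_PATH", "LD_PRELOAD", "LIBRARY_PATH")


def clean_environment(env=None):
    """Return a copy of env without the variables listed in SCRUBBED."""
    source = os.environ if env is None else env
    # Build a new dict so the caller's environment is left untouched.
    return {key: value for key, value in source.items() if key not in SCRUBBED}


# Possible use from a launcher script (not executed here):
#   import subprocess
#   subprocess.run(["./revng", "artifact", "--help"], env=clean_environment())
```

| Only the child process sees the cleaned environment; the
| user's shell keeps whatever it had set.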
|
| But it's hard to make a decision without a specific use
| case in mind. If you have an example, bring it forward and
| I'm happy to discuss what should be the right approach
| there.
| JonChesterfield wrote:
| LLVM picking up the wrong libraries from the environment
| has cost me at least a couple of months over the last
| decade or so. Maybe twenty instances of customers being
| broken, ten hours or so in meetings explaining the
| problem and trying to persuade people that the right
| thing really is different for the system compiler vs your
| bespoke thing.
|
| If you think it's better for your product to find
| unrelated libraries with the same name at runtime, you go
| for it.
|
| Detecting that failure mode would be an interesting
| exercise - you could crawl your own address space after
| startup and try to guess whether the libraries you got
| are the ones you wanted. Probably implementable.
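|
| On Linux, one way to sketch that check is to read the
| process's own memory map and flag shared objects mapped from
| outside the expected install tree. A minimal Python version,
| assuming the /proc/<pid>/maps text format (the function name,
| the install root and the sample mappings are hypothetical):

```python
def unexpected_libraries(maps_text, expected_root):
    """Return sorted paths of mapped .so files that live outside expected_root."""
    root = expected_root.rstrip("/") + "/"
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        # Skip pseudo-entries such as [stack] and files that are not shared objects.
        if not path.startswith("/") or ".so" not in path.rsplit("/", 1)[-1]:
            continue
        if not path.startswith(root):
            found.add(path)
    return sorted(found)


sample = (
    "7f0000000000-7f0000001000 r-xp 00000000 08:01 11 /opt/revng/root/lib/libLLVM.so.16\n"
    "7f0000002000-7f0000003000 r-xp 00000000 08:01 12 /usr/lib/libLLVM.so.16\n"
    "7f0000004000-7f0000005000 rw-p 00000000 00:00 0\n"
    "7ffd00000000-7ffd00001000 rw-p 00000000 00:00 0 [stack]\n"
)
print(unexpected_libraries(sample, "/opt/revng/root"))
```

| At startup this could be fed the contents of /proc/self/maps.
| A real check would need an allowlist, since libraries that are
| legitimately taken from the system would be flagged too.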
| dark-star wrote:
| It doesn't work with my ELF file:
|
|   [orchestra] [darkstar@shiina revng]$ ./revng artifact --analyze --progress decompile-to-single-file ../maytag.ko
|   [=======================================] 100% 0.57s Analysis list revng-initial-auto-analysis (5): import-binary
|   [===================> ] 50% 0.57s Run analyses lists (2): revng-initial-auto-analysis
|   [=========> ] 25% 0.57s revng-artifact (2): Run analyses
|   Only ELF executables and ELF dynamic libraries are supported
|   [orchestra] [darkstar@shiina revng]$ file ../maytag.ko
|   ../maytag.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (FreeBSD), not stripped
|
| Does it not support FreeBSD binaries?
|
| Edit: Ah, I missed that it doesn't support kernel modules. It
| probably has nothing to do with FreeBSD but with the fact that
| this is not a simple executable
| aleclm wrote:
| Can you open an issue on GitHub and attach the binary? I don't
| think it should be too hard to load that.
| costco wrote:
| Congrats. Do you have any regrets about outsourcing lifting to
| the QEMU TCG or has it worked well?
| aleclm wrote:
| Thanks!
|
| It has been working very well. Two regrets:
|
| 1. Not rebasing our fork of QEMU for years has put us in a bad
| spot. But just today a member of our team managed to lift stuff
| with the latest QEMU. And he has also been able to lift
| Qualcomm Hexagon code, for which we helped to add support in
| QEMU. Eventually we'll be the first proper Hexagon decompiler
| :)
|
| 2. Focusing too much on QEMU led our frontend to be tightly
| coupled with QEMU. It will now take some effort to enable
| support for additional, non-QEMU-based frontends. But it's not
| impossible: our idea is to let users add support for a new
| architecture by defining, in C, a struct for the CPU state and
| a bunch of functions acting on it. That's it. No need to learn
| any internal representation.
|
| tl;dr QEMU was a great choice; it worked so well that we didn't
| work on that part of the codebase for too long and now there's
| some technical debt there. But we're addressing it.
___________________________________________________________________
(page generated 2024-03-29 23:02 UTC)