[HN Gopher] Unstripping Stripped Binaries
       ___________________________________________________________________
        
       Unstripping Stripped Binaries
        
       Author : taviso
       Score  : 177 points
       Date   : 2022-08-26 04:30 UTC (18 hours ago)
        
 (HTM) web link (lock.cmpxchg8b.com)
 (TXT) w3m dump (lock.cmpxchg8b.com)
        
       | saagarjha wrote:
       | > What I really want is to take the names and types I've figured
       | out from my disassembler, and make them visible to gdb.
       | 
       | You may also want to check out
       | https://github.com/mahaloz/decomp2dbg. (Disclaimer: I will never
       | not shill Zion's code.)
        
         | arbruijn wrote:
         | And for Ghidra https://github.com/cesena/ghidra2dwarf
        
       | magic_hamster wrote:
       | You have a really great blog. Kudos. I just read through a bunch
       | of your adventures and it was great fun.
        
       | ntauthority wrote:
       | For Windows, there once was a similar tool that synthesizes a
       | .pdb file using the DIA API from a list of names: https://pefrm-
       | units.osdn.jp/pe_debug.html
        
       | david_draco wrote:
       | Maybe machine learning techniques for translating text could
       | produce unstripped binaries based on stripped binaries, given a
       | large training set?
        
       | 082349872349872 wrote:
       | worth reading to the end: I knew about various arcane symbol
       | formats, but TIL that cross-architecture CFG-based diffing is a
       | thing.
        
         | tdullien wrote:
         | Original author of the BinDiff tool (ca. 2004) here.
         | 
         | BinDiff uses a good number of different algorithms to narrow
         | the set of candidates for CFG-based matching -- so things like
         | identical string references may be used to match a large
         | function, and then basic blocks in that large function may be
         | matched, and then BinDiff may identify that the functions
         | called from these basic blocks will be identical, too.
         | 
         | The entire field has exploded, and there are dozens of tools
         | for function matching nowadays, some with very sophisticated ML
         | approaches etc. - it is somewhat surprising that BinDiff still
         | does it's job, considering that the core engine has not been
         | modified or improved since 2010.
         | 
         | I'd also like to extend kudos to
         | https://twitter.com/admvonschneider -- he is the sole Googler
         | sacrificing his 20% time to keep the tool alive...
        
           | saagarjha wrote:
           | Did Google ever end up open sourcing BinDiff? I've always
           | been curious at what goes under the hood.
        
             | tdullien wrote:
             | So that has been "in the works" for 7+ years, and I think
             | it repeatedly stalls because someone needs to really drive
             | it against the bureaucratic internal obstacles ...
             | 
             | There's no reason they aren't doing it, except "nobody has
             | time to do it".
        
               | stevemk14ebr wrote:
               | This would be kind of huge if this actually happened. As
               | a researcher in the field, I hope someone can finally do
               | this, please.
        
               | saagarjha wrote:
               | Do you think the result, if open sourced, would be worth
               | it? Or is the code not particularly amenable to
               | maintenance and integration?
        
               | tdullien wrote:
               | Not sure if it'd be worth it, but it's reasonably clean
               | C++ code, which is not terribly difficult to maintain &
               | integrate.
        
             | tdullien wrote:
             | Happy to answer any questions btw :)
        
               | saagarjha wrote:
               | Sure, I have a couple. I will warn you that I don't know
               | much about how BinDiff works, other than knowing it does
               | :)
               | 
               | The "Achilles heel" of binary differs is recompiling with
               | different flags or making other minor changes that alter
               | all basic blocks. Do you think the techniques BinDiff
               | uses could be extended to work on "semantics" at a higher
               | level rather than actual code? Like, doing the diff after
               | an optimizing decompiler has run over it and lifted it.
               | 
               | If you were writing BinDiff today, what new techniques
               | would you integrate? Or are you happy with the algorithms
               | as they are already?
               | 
               | What kinds of things is BinDiff good at identifying? It's
               | often difficult to know beforehand "how well" a diff will
               | work. How would you judge whether it'll be worth it?
        
               | tdullien wrote:
               | 1) You could try to lift BinDiff to operate on the
               | semantics more, but I am not entirely sure that it'd
               | affect diffing quality. "Semantics" are a difficult thing
               | to define at the assembly level: If your compiler
               | optimizes less, and uses a stack slot instead of a
               | register, the assembly-level semantics are actually
               | changed. If you recompile using frame pointer omission,
               | the assembly-level semantics are also changed. If you
               | compile code for ARM/Thumb and ARM normally, different
               | assembly-level semantics are emitted. So if you are
               | trying to do diffing on the semantic level, you have to
               | lift to high-level-language semantics, and account for
               | stuff like /GS stack cookies etc. etc. - this path,
               | madness lies. Even state-of-the-art decompilers tend not
               | to generate the same C output if you decompile
               | differently-compiled assembly. And all of this assumes
               | that no inlining changes happen...
               | 
               | It is more promising to use things that are "affected by
               | semantics, but not excessively so". CFGs fall into the
               | category, and a variety of other things are.
               | 
               | BinDiff can sometimes work around heavily changing CFGs
               | by using string references, relative positioning in the
               | call graph, and a variety of other tricks. In general, if
               | a function X that calls Y is properly matched, the odds
               | that Y will be properly matched increase greatly.
               | 
               | In a nutshell, BinDiffs algorithm is a "attempt to match
               | two nodes in a graph (callgraph of CFG) using a variety
               | of criteria (ordered from precise to loos) -- and each
               | time a match is generated, try to generate additional
               | matches in the neighborhood of the matched node by taking
               | the new match into account". So you try to get high-
               | fidelity matches first, if you get any, you propagate
               | that information to generate more high-fidelity matches,
               | and as time goes by your criteria for allowing two
               | functions to match get looser and looser ... but
               | hopefully you've already done good high-fidelity matches
               | so this counterbalances the later recklessness of the
               | algo.
               | 
               | If I was writing BinDiff today ... wow, there's a million
               | things I'd do differently. I last touched that engine in
               | 2009 or so, and things that I've learnt since that'd be
               | very helpful:
               | 
               | 1. There's fantastic use of locality sensitive hashing to
               | be applied here. There's a bit of work in the
               | functionsimsearch github repo around this, but BinDiffs
               | performance could be boosted significantly using these
               | techniques. 2. There's a ton of good academic research on
               | metric learning, and using ML to generate embeddings to
               | then do similarity comparisons. If you combine this with
               | (1), you will see a significant performance boost. 3. The
               | algorithm is pretty sequential, I think it could be
               | multi-cored reasonably easily. 4. I'd like to add a
               | little bit more semantic diffing 5. I think one could
               | very profitably use the distribution of constant offsets
               | in a function as additional matching criteria ("two
               | functions that have approximately similar data structure
               | access might be the same"). 6. ... lots more ...
               | 
               | Regarding "what is BinDiff good at" -- if you can get a
               | bunch of initial high-fidelity matches, the overall
               | success of the diffing process will be much higher than
               | if the initial matches are poor. You can use BinDiff
               | interactively and do incremental diffing -- look at the
               | BinDiff config files, you can selectively disable
               | individual matching criteria, and then start with an
               | essentially empty diff, match a few functions manually,
               | and then ask BinDiff to propagate them.
               | 
               | Come to think of this, BinDiff tries too hard to solve
               | vastly different use cases: A) Patch diffing, where you
               | want to be very sensitive to even small changes B) Symbol
               | porting, where you want to be very loose to small changes
               | The tool should probably be split into two -- one
               | matching phase, and one semantic analysis phase; and for
               | the matching phase, making the interactive diffing
               | experience better would be a great idea...
               | 
               | In general, it blows my mind that BinDiff is still in
               | use. I had expected a superior successor to arrive within
               | 2-3 years of Google acquiring zynamics, I would have
               | never expected that 10+ years on, without any change to
               | the core engine, BinDiff still produces reasonably-ish
               | results...
        
         | archi42 wrote:
         | I am only little surprised that it works; after all, compiling
         | a program for two platforms results in a mostly equivalent
         | control flow graphs, except for optimizations and different
         | code gen strategies (*); and of course platform specific stuff
         | (but that was probably already abstracted for such a project
         | back then).
         | 
         | BUT I was very (pleasantly) surprised someone did go through
         | the effort of implementing this and made it available for
         | everyone as FOSS <3
         | 
         | (*, rant) I think inlining and tail call optimizations
         | (especially calls into libc or the equivalent of the era) might
         | cause the most trouble. Also loop generation can have subtle
         | differences reflected in the CFG (not only loop unrolling, but
         | also transforming a while-loop into a do-while-loop causes
         | differences in the CFGs). Back then I'd also expect compilers
         | not always perform equally well for "simple" things like dead
         | code elimination. I wish I had the time to check out bindiff in
         | detail, it's really interesting (personal bias: In the past I
         | did work on static code analysis and had a supporting role in a
         | compiler dev team).
        
       ___________________________________________________________________
       (page generated 2022-08-26 23:02 UTC)