[HN Gopher] Exploiting Undefined Behavior in C/C++ Programs: The...
       ___________________________________________________________________
        
       Exploiting Undefined Behavior in C/C++ Programs: The Performance
       Impact [pdf]
        
       Author : luu
       Score  : 77 points
       Date   : 2025-04-22 21:01 UTC (3 days ago)
        
 (HTM) web link (web.ist.utl.pt)
 (TXT) w3m dump (web.ist.utl.pt)
        
       | jonstewart wrote:
       | > The results show that, in the cases we evaluated, the
       | performance gains from exploiting UB are minimal. Furthermore, in
       | the cases where performance regresses, it can often be recovered
       | by either small to moderate changes to the compiler or by using
       | link-time optimizations.
       | 
       | _THANK YOU._
        
         | ryao wrote:
         | > by using link-time optimizations
         | 
         | These are almost never used by software.
        
           | mgaunard wrote:
            | The only places where I've seen LTO not be used are places
            | with bad and unreliable build systems that systematically
            | introduce undefined behaviour by violating the ODR.
        
             | jeffbee wrote:
             | The only organization I've worked in that had comprehensive
             | LTO for C++ code was Google. I've worked at other orgs even
             | with 1000s of engineers where LTO, PGO, BOLT, and other
             | things you might consider standard techniques were
             | considered voodoo and too much trouble to bother with,
             | despite the obvious efficiency improvements being left on
             | the table.
        
               | com2kid wrote:
               | I helped with pgo work at Microsoft over 15 years ago,
               | back when it was a Microsoft Research project.
               | 
               | The issue with early pgo implementations was getting a
               | really good profile, as you had to have automation
               | capable of fully exercising code paths that you knew
               | would be hot in actual usage, and you needed good
               | instrumentation to know what code paths those were!
               | 
                | The same problem exists nowadays, but programs are
                | instrumented to hell and back to collect usage data.
        
               | jeffbee wrote:
               | I am willing to assume that organizations dedicated to
               | shipping software to customers like Microsoft or Autodesk
               | or somebody like that are almost certainly all in on
               | optimization techniques. The organizations where I worked
               | are ones that are operating first party or third party
               | software in the cloud where they're responsible for
               | building their own artifacts.
        
               | loeg wrote:
               | Facebook uses LTO/PGO for C++ pretty broadly.
        
               | jeffbee wrote:
               | Yeah they just never hired me. They also invented BOLT.
               | 
               | I think there is a valley in terms of organization size
               | where you have tons of engineers but not enough to
               | accomplish peak optimization of C++ projects. These are
               | the orgs that are spending millions to operate, for
               | example, the VERY not-optimized packages of postgresql
               | from Ubuntu, in AWS.
        
               | UncleMeat wrote:
               | Google doesn't have full-lto either, since binaries are
               | way too big. Thin-lto is vastly less powerful.
        
               | jeffbee wrote:
               | "Vastly" eh? I seem to recall that LLVM ThinLTO has
               | slight regressions compared to GCC LTO on specCPU but on
               | Google's own applications the superior whole-program
               | devirtualization offered only with ThinLTO is a net win.
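                | 
                | For illustration, a minimal sketch of the kind of call
                | whole-program devirtualization can remove, assuming the
                | link step sees every override of Base and the build uses
                | something like -flto=thin -fwhole-program-vtables:
                | 
                |     struct Base {
                |         virtual int run() const = 0;
                |         virtual ~Base() = default;
                |     };
                |     struct Impl final : Base {
                |         int run() const override { return 42; }
                |     };
                | 
                |     // If Impl is provably the only implementation, the
                |     // indirect call below can become a direct,
                |     // inlinable call to Impl::run.
                |     int invoke(const Base& b) { return b.run(); }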
        
               | UncleMeat wrote:
               | I'll adjust my phrasing.
               | 
               | As a user, building with thin-lto vs full-lto generally
               | produces pretty similar performance in no small part
               | because a huge amount of effort has gone into making the
               | summaries as effective as possible for key performance
               | needs.
               | 
               | As a compiler developer, especially when developing
               | static analysis warnings rather than optimization passes,
               | the number of cases where I've run into "this would be
               | viable if we had full-lto" has been pretty high.
        
               | pcwalton wrote:
               | Yeah, I would have liked to see the paper specify whether
               | the LTO they tried is fat LTO or ThinLTO.
        
               | mgaunard wrote:
                | In practice, the default ABI on Linux x86-64 still
                | limits you to binaries of roughly 4GB.
                | 
                | Not exactly a problem for LTO, since any reasonable
                | build machine will have 128GB of RAM.
        
               | astrange wrote:
                | PGO is pretty difficult. In my experience compilers don't
                | seem to know the difference between "this thing never
                | runs" and "we don't have any information about whether
                | this thing runs". Similarly, it might be more useful to
                | know "is this branch predictable" than just "what % of
                | the time is it taken".
               | 
               | CPUs are so dynamic anyway that there often isn't a way
               | to pass down the information you'd get from the profile.
               | eg I don't think Intel actually recommends any way of
               | hinting branch directions.
        
               | jeffbee wrote:
                | It's implied by the target offset: branches expected to
                | be taken jump backwards, unlikely branches jump forwards.
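                | 
                | As a rough sketch (GCC/Clang, hypothetical function
                | names), __builtin_expect is the usual way to feed that
                | layout decision by hand when no profile is available;
                | the compiler generally keeps the expected path as the
                | fall-through and moves the unlikely path out of line:
                | 
                |     // do_work is a stand-in for the hot path.
                |     int do_work(int fd, char *buf, int len);
                | 
                |     int process(int fd, char *buf, int len) {
                |         // Error branch marked unlikely, so it is laid
                |         // out away from the fall-through path.
                |         if (__builtin_expect(len <= 0, 0)) {
                |             return -1;   // cold error path
                |         }
                |         return do_work(fd, buf, len);
                |     }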
        
               | saagarjha wrote:
               | Surely you are not putting code behind an if/else
        
               | atq2119 wrote:
               | Not generally, no. This is true for some chips,
               | especially (very) old or simple cores, but it's not
               | something to lean on for modern high end cores.
        
             | tialaramex wrote:
              | Violating the ODR doesn't introduce UB; it's IFNDR (Ill-
              | Formed, No Diagnostic Required), which is much worse in
              | principle and, in such cases, probably also in practice.
              | 
              | UB is a runtime phenomenon: it happens or it doesn't, and
              | we may be able to ensure the case where it happens doesn't
              | occur through ordinary human controls.
              | 
              | But IFNDR is a property of the compiled program. If you
              | have IFNDR (by some estimates that's most C++ programs),
              | your program has no defined behaviour and never did, so
              | there is no possible countermeasure; too bad, game over.
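              | 
              | A minimal sketch of the kind of ODR violation meant here
              | (hypothetical file names): the same inline function gets
              | two different definitions in two translation units, the
              | program is IFNDR, and the linker/LTO is free to keep
              | either body:
              | 
              |     // a.cpp
              |     inline int flag() { return 1; }
              |     int use_a() { return flag(); }
              | 
              |     // b.cpp -- same inline function, different body: an
              |     // ODR violation, ill-formed, no diagnostic required
              |     inline int flag() { return 2; }
              |     int use_b() { return flag(); }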
        
           | jandrewrogers wrote:
            | LTO is heavily used in my experience. If it breaks something,
            | that is indicative of other issues that need to be addressed.
        
             | yxhuvud wrote:
              | The main issue isn't that it breaks stuff but that it
              | tends to make compilation pretty slow.
        
               | jorvi wrote:
               | .. that's why you compile without LTO during development
               | and do a final 'compile with LTO > profile > fix /
               | optimize > compile with LTO' pass.
               | 
                | Compilation happens once, but the result runs on hundreds
                | of thousands up to billions of devices. Respect your
                | users.
        
               | astrange wrote:
                | This assumes that LTO is strictly better than no-LTO,
                | i.e. it only gets faster, has the same optimization
                | hotspots, and doesn't break anything.
                | 
                | I would recommend only doing things that fit within the
                | 'build > test > fix' loop.
        
               | Sharlin wrote:
               | Which doesn't matter at all in a release build. And in a
               | dev build it's rarely necessary.
        
               | pcwalton wrote:
               | At FAANG scale the cost is prohibitive. Hence the
               | investment in ThinLTO.
        
               | KerrAvon wrote:
               | At FAANG scale, you absolutely want to have a pass before
               | deployment that does this or you're leaving money on the
               | table.
        
           | steveklabnik wrote:
           | It's on by default for Rust release builds, so at least the
           | codepaths in LLVM for it are well-exercised.
        
             | alpaca128 wrote:
             | That must have been changed sometime in the last year then.
             | When I enable LTO for one of my projects on a Rust compiler
             | from 2024 the compilation time more than doubles.
        
               | steveklabnik wrote:
                | I should have been more clear: thin LTO is on by default,
                | not full "fat" LTO, for exactly that reason.
        
             | vlovich123 wrote:
             | I don't think that's right unless the docs are stale:
              |     [profile.release]
              |     lto = false
              | 
              | https://doc.rust-lang.org/cargo/reference/profiles.html#rele...
        
               | steveklabnik wrote:
                | So the thing is that false means ThinLTO is still used,
                | depending on other settings; see
                | https://doc.rust-lang.org/cargo/reference/profiles.html#lto
               | 
               | > false: Performs "thin local LTO" which performs "thin"
               | LTO on the local crate only across its codegen units.
               | 
               | I think this is kind of confusing but whatever. I should
               | have been more clear.
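                | 
                | Roughly, the Cargo settings look like this (a sketch
                | based on the linked docs; exact behavior also depends on
                | codegen-units and the toolchain version):
                | 
                |     [profile.release]
                |     # lto = false   # default: "thin local LTO" within
                |     #               # each crate's codegen units
                |     lto = "thin"    # cross-crate ThinLTO
                |     # lto = "fat"   # full/fat LTO (same as lto = true)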
        
         | Rusky wrote:
         | It's worth noting (and the paper does go into this) that this
         | is limited to a very specific subset of UB, which they call
         | "guardable."
         | 
         | They are _not_ removing UB around things like out-of-bounds or
         | use-after-free, which would likely be more expensive.
        
         | jonstewart wrote:
         | I don't understand the down votes. Conducting empirical
         | research on the performance impact of undefined behavior is
         | fantastically needed, as the C++ committee's obsession with
         | undefined behavior strictness (in contrast with longstanding
         | semantics, e.g., uninitialized memory accesses being just fine)
         | has been justified largely by how they enable optimizing
         | compilers. This research shows that many types of UB have a
         | negligible impact on performance.
        
           | saagarjha wrote:
           | You're getting downvoted because you're looking for a
           | particular result ("UB optimizations don't help performance")
           | rather than actually evaluating the quality of this analysis
           | (which doesn't really support what you want anyway).
        
           | atq2119 wrote:
           | Possibly somebody downvoted because "thank you" in all caps
           | is not a substantial contribution to discussion. It feels
           | like the kind of low effort stuff you'd see on reddit.
           | 
           | Also, commenting on downvotes is generally frowned upon.
        
       | mwkaufma wrote:
       | Reading e.g. the 13% perf regression in simdjson from disabling
        | UB:
        | 
        | > A simpler alternative is to compile the program with LTO. We
        | > confirmed that LLVM's inter-procedural analyses can propagate
        | > both alignment and dereferenceability information for this
        | > function, which allows the LTO build to recover the
        | > performance loss.
       | 
       | "can" is doing a lot of heavy-lifting here. Guaranteeing expected
       | optimizations "will" be applied are hard-enough, without leaving
       | it entirely to an easily-derailed indirect side-effect.
        
         | UebVar wrote:
         | This is "can" has exactly the same meaning as in "UB can make
         | your programms faster". You could replace it with "it does, at
         | least with clang". LTO is, in this regard, the same as UB, and
         | unlike guaranteed optimizations, such as the single member
         | optimization, or the empty base optimization.
        
           | mwkaufma wrote:
            | Concretely, the UB exploitation in question here is assuming
            | that the "this" pointer in C++ is aligned and non-null,
            | meaning it's a pervasive annotation throughout C++
            | codebases, not an edge case.
           | 
           | Relying on LTO to "discover" this annotation through
           | interprocedural analysis -- based on my experience of looking
           | at LTO in practice -- will not be as comprehensive, and even
           | when it works it accomplishes its task in an achingly-slow
           | and expensive way.
           | 
           | This is a real devil-is-in-the-details case.
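            | 
            | For concreteness, a small sketch of the kind of check this
            | assumption lets the compiler delete (a classic, if dubious,
            | defensive pattern):
            | 
            |     struct Widget {
            |         int size = 0;
            |         int safe_size() const {
            |             // Calling a member function on a null pointer
            |             // is UB, so the optimizer may assume this test
            |             // is always false and remove it entirely.
            |             if (this == nullptr) return 0;
            |             return size;
            |         }
            |     };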
        
         | quotemstr wrote:
         | I love when papers disagree with their own abstracts.
        
       | gitroom wrote:
        | Perfect, this is right up my alley - honestly I keep wondering
        | if teams avoid optimizations like LTO just because build pain
        | sucks or if there's some deeper trust issue around letting the
        | toolchain be clever. Do you think people would deal with slow
        | builds if it bought way more speed for the final product?
        
       | pcwalton wrote:
       | I notice that the paper doesn't claim to eliminate _all_
       | reasoning about undefined behavior for optimizations. For
        | example:
        | 
        |     int f() {
        |         int arr[3], i = 0;
        |         arr[3] = 5;
        |         return i;
        |     }
       | 
       | Optimizing this to "return 0" is relying on UB, because it's
       | assuming that i wasn't laid out directly after arr in the stack
       | frame. I believe this is what the paper calls "non-guardable UB".
       | 
       | I don't agree with the claim in the paper that their semantics
       | offers a "flat memory model". A flat memory model would rule out
       | the optimization above. Rather, the memory model still has the
       | notion of object bounds; it's just simplified in some ways.
        
         | throwawayqqq11 wrote:
          | Sorry, I don't get why the memory layout should have any
          | effect, when it's clear in the AST that i=0 should be
          | returned.
        
           | bregma wrote:
           | It's clear in the AST that there is undefined behaviour and
           | it is malformed code. It is not valid C code, so what the
           | compiler chooses to do with it is not defined by the
           | language.
        
             | pcwalton wrote:
             | Note that if you change the code to this you have the same
              | issue:
              | 
              |     int g(int n) {
              |         int arr[3], i = 0;
              |         arr[n] = 5;
              |         return i;
              |     }
             | 
             | Without "exploiting UB" it's incorrect to optimize this to
             | "return 0", because of the possibility that i was allocated
             | right after arr and n == 3.
        
               | screcth wrote:
               | If we consider writing out of bounds to be legal, we make
               | it impossible to reason about the behavior of programs.
        
               | pcwalton wrote:
               | Hence why I (and many other compiler developers) are
               | inherently skeptical whenever anyone says "just stop
               | exploiting undefined behavior".
        
               | im3w1l wrote:
               | My thinking disagrees with yours, but I can't say I have
               | fully made up my mind yet. To me a compiler that is not
               | "exploiting UB" has several valid choices about how to
               | compile this. i may be stored in memory, in a register or
               | may be deleted entirely as it's known at compile time
               | that it will have value 0. The store to arr[n] may go
               | through or it may be deleted as it's a dead store.
               | 
               | You may say I'm "exploiting ub" when making these
               | deductions but I would disagree. My argument is not of
               | the form "x is ub so I can do whatever I want".
               | 
               | To elaborate on the example, if arr was volatile then I
               | would expect the write to always go through. And if i was
               | volatile then I would expect i to always be read. However
               | it's still not guaranteed that i is stored immediately
               | after arr, as the compiler has some discretion about
               | where to place variables afaik. But _if_ i is indeed put
               | immediately after, then the function should indeed return
                | 5 for n = 3. For n > 3 it should either return 0 (if writing
               | to unused stack memory), page fault (for small n outside
               | of the valid stack space), or stomp on random memory (for
               | unlucky and large n). For negative n, many bad things are
               | likely to happen.
               | 
               | Edit: I think I mixed up which way the stack grows but
               | yeah.
        
               | dgrunwald wrote:
               | > But if i is indeed put immediately after, then the
               | function should indeed return 5 for n=3.
               | 
               | That's not how compilers work. The optimization changing
               | `return i;` into `return 0;` happens long before the
               | compiler determines the stack layout.
               | 
               | In this case, because `return i;` was the only use of
               | `i`, the optimization allows deleting the variable `i`
               | altogether, so it doesn't end up anywhere on the stack.
               | This creates a situation where the optimization only
               | looks valid in the simple "flat memory model" because it
               | was performed; if the variable `i` hadn't been optimized
               | out, it would have been placed directly after `arr` (at
               | least in this case: https://godbolt.org/z/df4dhzT5a), so
               | the optimization would have been invalid.
               | 
               | There's no infrastructure in any compiler that I know of
               | that would track "an optimization assumed arr[3] does not
               | alias i, so a later stage must take care not to place i
                | at that specific point on the stack". Indeed, if the
                | array index were a runtime value, the compiler would be
                | prevented from ever spilling to the stack any variable
                | that was involved in any optimizations.
               | 
               | So I think your general idea "the allowable behaviors of
               | an out-of-bounds write is specified by the possible
               | actual behaviors in a simple flat memory model for
               | various different stack layouts" could work as a
               | mathematical model as an alternative to UB-based
               | specifications, but it would end up not being workable
               | for actual optimizing compiler implementations -- unless
               | the compiler could guarantee that a variable can always
                | stay in a register and will never be spilled (how would
                | the compiler do that for function calls?), it'd have to
               | essentially treat all variables as potentially-modified
               | by basically any store-via-pointer, which would
               | essentially disable all optimizations.
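                | 
                | A small sketch of the cost being described, assuming a
                | model where out-of-bounds stores are defined: the
                | compiler could no longer keep `i` folded to 0 across the
                | store, because `*p` might legally hit i's stack slot:
                | 
                |     int f(int *p) {
                |         int i = 0;
                |         *p = 5;      // under flat-memory rules this
                |                      // store could alias i
                |         return i;    // so i must live in memory and be
                |                      // reloaded, not folded to 0
                |     }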
        
           | maartenscholl wrote:
            | I think in the example the parent gave, `arr[3]` is one past
            | the end of the 3-element array, where `i` might reside,
            | potentially changing its value.
        
         | dooglius wrote:
         | >it's assuming that i wasn't laid out directly after arr in the
         | stack frame
         | 
         | The compiler isn't "assuming" that so much as choosing not to
         | put i in the stack frame at all. And I don't think it's correct
         | to view the lack of a register spill as an "optimization" per
         | se. It does remain true that code writing past the end of an
         | array will be UB in typical scenarios (i.e. when not using
         | asan/valgrind).
         | 
         | (Now, if the compiler also removed the store, that could
         | legitimately be called an optimization based on assuming no-UB)
        
           | pcwalton wrote:
           | "Exploiting undefined behavior" occurs when a simple
           | semantics (however one defines "simple") results in behavior
           | A, but the compiler chooses behavior B instead based on the
           | actual, more complex, language semantics. The code snippet in
           | question passes that test. If I flip the declaration order of
           | i and arr, then I get this [1] at -O0 (the "simple"
            | semantics):
            | 
            |     push    rbp
            |     mov     rbp, rsp
            |     mov     dword ptr [rbp - 4], 0
            |     mov     dword ptr [rbp - 4], 5
            |     mov     eax, dword ptr [rbp - 4]
            |     pop     rbp
            |     ret
            | 
            | Which indeed returns 5. But at -O2 clang optimizes it to
            | this:
            | 
            |     xor     eax, eax
            |     ret
           | 
           | Which returns 0. So the simple semantics produces one result,
           | and the complex semantics produces another. Hence, it's
           | exploiting undefined behavior.
           | 
           | [1]: https://godbolt.org/z/df4dhzT5a
        
             | dooglius wrote:
             | Maybe this is just arguing semantics, but I don't agree
             | with the definition you've given, and I don't think that
             | your definition is what TFA means. "Exploiting undefined
             | behavior" I think refers narrowly to the implementation
             | assuming that undefined behavior does not occur, and acting
             | on that assumption. Undefined behavior naturally resulting
             | in unpredictable behavior is not exploitation in the same
              | sense. For example:
              | 
              |     printf("A");
              |     bool x;
              |     if (x) { printf("B"); } else { printf("C"); }
              |     printf("\n");
             | 
             | If at -O0 "AB" is printed and at -O2 "AC" is printed (due
             | to the vagaries of whatever was left on the stack), then
             | that would meet your definition, but I would not regard
             | that as "exploiting undefined behavior", merely as the
             | manifestation of the inherent unpredictability of UB. If
             | the compiler didn't print anything (i.e. the whole block
             | was removed due to UB detection) then that _would_ be a
             | case of exploiting undefined behavior.
        
               | pcwalton wrote:
               | That example is an instance of unspecified vs. undefined
               | behavior, but the correctness of the pointer provenance-
               | based optimization example I gave doesn't depend on
               | whether writing to an out-of-bounds pointer is
               | unspecified or undefined.
        
           | artikae wrote:
           | What about this: https://godbolt.org/z/xP9xG3Ee3
           | 
           | Here the compiler "register allocates" i for some reads but
           | not for others.
           | 
           | i gets stack allocated, but some uses of it act as though
           | they were register allocated.
        
             | dooglius wrote:
             | I'm not sure quite what you're asking for exactly, given
             | the link is for clang trunk and doesn't have the
             | modifications discussed in TFA, and I don't dispute that
             | clang does UB-based reasoning at -O3. But, I will argue
             | that the assembly shown can be accomplished without
             | resorting to what I call "reasoning about UB", and within a
             | flat memory model, supporting the claim that these
             | sacrifices are often not necessary. I'm going to draw a
             | distinction between stack memory being "private" in the
             | sense that only the compiler is allowed to alter it, and
             | "public" where the address can be written to by something
             | else and the compiler needs to handle that. Local variables
             | at first are tracked privately. After the address of a
             | variable is taken with &x, or at the point in time when an
             | array variable is indexed, the associated memory is public.
             | Conceptually, the use of private memory can be indirect;
             | the compiler could encode/decode a stack variable x as (x
             | XOR 0xDEADBEEF) on the stack and it would be fine (but the
             | compiler does the simple thing in practice, naturally).
             | Note that this notion of "private"/"public" stack memory is
             | a property of addresses, not the provenance of the
             | accessing pointers, and so is fully compatible with a flat
             | memory model. The compiler's ability to use private memory
             | isn't a case of "reasoning around UB" in a meaningful sense
             | -- otherwise you could just as well argue that returning
              | from a function call is "reasoning about UB", because the
              | return address can't be clobbered.
             | 
             | In your provided snippet, the correctness argument for the
             | assembly in a non-UB-reasoning universe goes like this: at
             | first, i is stored privately on the stack with value zero,
             | and so as an optimization we can assume that value is still
             | zero without rereading. Only later, when &i is taken, is
             | that memory made public and the compiler has to worry about
             | something altering it. In actual execution, the problem is
             | that the write function alters compiler-private memory (and
             | note again, that being private is a property of the
             | underlying address, not the fact that it's accessed via an
             | out-of-bounds array indexing), and this is UB and so the
             | program breaks. But, the compiler didn't need to make
             | _assumptions_ around UB.
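              | 
              | A tiny sketch of that private-then-public transition
              | (publish is a placeholder for any code that stores the
              | address somewhere the compiler can't see):
              | 
              |     void publish(int *p);   // placeholder external sink
              | 
              |     int f(void) {
              |         int i = 0;     // private: the compiler may keep
              |                        // assuming i == 0 here
              |         publish(&i);   // address taken: i is now public
              |         return i;      // must be reread, not folded to 0
              |     }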
        
         | jcranmer wrote:
         | I've only briefly skimmed the paper, but on that glance, it
         | looks like what they did was (effectively) drop all the
         | attributes in LLVM that can indicate UB, e.g., the inbounds
         | flags on getelementptr instructions, or the nsw flags on
         | arithmetic operations.
         | 
         | As you note, it doesn't remove the more core UB behaviors in
         | LLVM, in particular LLVM's reliance on pointer provenance.
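          | 
          | For example, a sketch of what the nsw flag buys: because
          | signed overflow is UB in C, clang tags the addition with nsw
          | and can fold the comparison to a constant; drop the flag and
          | the fold is no longer valid:
          | 
          |     // With nsw, this can be folded to `return 1`; without
          |     // it, x == INT_MAX would wrap and the comparison would
          |     // be false.
          |     int always_true(int x) {
          |         return x + 1 > x;
          |     }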
        
           | nikic wrote:
           | They do a bit more than that. One of the options (-disable-
           | object-based-analysis under AA2) disables the assumption that
           | distinct identified objects do not alias, which is disabling
           | pointer provenance in at least one key place.
           | 
           | So I think this option very roughly approximates a kind of
           | "no provenance, but with address non-determinism" model,
           | which still permits optimizations like SROA on non-escaping
           | objects.
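            | 
            | For reference, a minimal example of the kind of non-escaping
            | object SROA handles: since the struct's address never leaves
            | the function, it can be split into scalars and the whole
            | body folded to a constant:
            | 
            |     struct Point { int x, y; };
            | 
            |     int sum(void) {
            |         struct Point p = {1, 2};   // address never escapes
            |         return p.x + p.y;          // SROA splits p into
            |                                    // scalars; folds to 3
            |     }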
        
             | jcranmer wrote:
             | That's what I get for relying on a skimming of the paper.
             | 
             | Also, hi, didn't know you commented on this site.
        
       | nikic wrote:
       | One peculiar thing about the benchmark results is that disabling
       | individual UB seems to fairly consistently reduce performance
       | without LTO, but improve it with LTO. I could see how the UB may
       | be less useful with LTO, but it's not obvious to me why reducing
       | UB would actually help LTO. As far as I can tell, the paper does
       | not attempt to explain this effect.
       | 
       | Another interesting thing is that there is clearly synergy
       | between different UB. For the LTO results, disabling each
       | individual UB seems to be either neutral or an improvement, but
       | if you disable all of them at once, then you get a significant
       | regression.
        
       ___________________________________________________________________
       (page generated 2025-04-25 23:01 UTC)