[HN Gopher] Exploiting Undefined Behavior in C/C++ Programs: The...
___________________________________________________________________
Exploiting Undefined Behavior in C/C++ Programs: The Performance
Impact [pdf]
Author : luu
Score : 77 points
Date : 2025-04-22 21:01 UTC (3 days ago)
(HTM) web link (web.ist.utl.pt)
(TXT) w3m dump (web.ist.utl.pt)
| jonstewart wrote:
| > The results show that, in the cases we evaluated, the
| performance gains from exploiting UB are minimal. Furthermore, in
| the cases where performance regresses, it can often be recovered
| by either small to moderate changes to the compiler or by using
| link-time optimizations.
|
| _THANK YOU._
| ryao wrote:
| > by using link-time optimizations
|
| These are almost never used by software.
| mgaunard wrote:
| Only places where I've seen LTO not be used are places with
| bad and unreliable build systems that systematically
| introduce undefined behaviour by violating the ODR.
| jeffbee wrote:
| The only organization I've worked in that had comprehensive
| LTO for C++ code was Google. I've worked at other orgs even
| with 1000s of engineers where LTO, PGO, BOLT, and other
| things you might consider standard techniques were
| considered voodoo and too much trouble to bother with,
| despite the obvious efficiency improvements being left on
| the table.
| com2kid wrote:
| I helped with pgo work at Microsoft over 15 years ago,
| back when it was a Microsoft Research project.
|
| The issue with early pgo implementations was getting a
| really good profile, as you had to have automation
| capable of fully exercising code paths that you knew
| would be hot in actual usage, and you needed good
| instrumentation to know what code paths those were!
|
| The same problem exists now days, but programs are
| instrumented to hell and back to collect usage data.
| jeffbee wrote:
| I am willing to assume that organizations dedicated to
| shipping software to customers like Microsoft or Autodesk
| or somebody like that are almost certainly all in on
| optimization techniques. The organizations where I worked
| are ones that are operating first party or third party
| software in the cloud where they're responsible for
| building their own artifacts.
| loeg wrote:
| Facebook uses LTO/PGO for C++ pretty broadly.
| jeffbee wrote:
| Yeah they just never hired me. They also invented BOLT.
|
| I think there is a valley in terms of organization size
| where you have tons of engineers but not enough to
| accomplish peak optimization of C++ projects. These are
| the orgs that are spending millions to operate, for
| example, the VERY not-optimized packages of postgresql
| from Ubuntu, in AWS.
| UncleMeat wrote:
| Google doesn't have full-lto either, since binaries are
| way too big. Thin-lto is vastly less powerful.
| jeffbee wrote:
| "Vastly" eh? I seem to recall that LLVM ThinLTO has
| slight regressions compared to GCC LTO on specCPU but on
| Google's own applications the superior whole-program
| devirtualization offered only with ThinLTO is a net win.
| UncleMeat wrote:
| I'll adjust my phrasing.
|
| As a user, building with thin-lto vs full-lto generally
| produces pretty similar performance in no small part
| because a huge amount of effort has gone into making the
| summaries as effective as possible for key performance
| needs.
|
| As a compiler developer, especially when developing
| static analysis warnings rather than optimization passes,
| the number of cases where I've run into "this would be
| viable if we had full-lto" has been pretty high.
| pcwalton wrote:
| Yeah, I would have liked to see the paper specify whether
| the LTO they tried is fat LTO or ThinLTO.
| mgaunard wrote:
| In practice the default ABI on linux x86-64 is still
| limiting you to binaries that are 4G or thereabout.
|
| Not exactly a problem for LTO since any reasonable build
| machine will have 128GB of ram.
| astrange wrote:
| PGO is pretty difficult. In my experience compilers don't
| seem to know the difference between "this thing never
| runs" and "we don't have any information about if this
| thing runs". Similarly it might be useful to know "is
| this branch predictable" more than just "what % is it
| taken".
|
| CPUs are so dynamic anyway that there often isn't a way
| to pass down the information you'd get from the profile.
| eg I don't think Intel actually recommends any way of
| hinting branch directions.
| jeffbee wrote:
| It's implied by the target offset. Taken branches jump
| backwards, unlikely branches jump forward.
| saagarjha wrote:
| Surely you are not putting code behind an if/else
| atq2119 wrote:
| Not generally, no. This is true for some chips,
| especially (very) old or simple cores, but it's not
| something to lean on for modern high end cores.
| tialaramex wrote:
| Violating ODR doesn't introduce UB it's IFNDR, Ill-formed
| No Diagnostic Required which is much worse in principle and
| in such cases probably also in practice.
|
| UB is a runtime phenemenon, it happens, or it doesn't, and
| we may be able to ensure the case where it happens doesn't
| occur with ordinary human controls.
|
| But IFNDR is a property of the compiled program, if you
| have IFNDR (by some estimates that's most C++ programs)
| your program has no defined behaviour and never did, so
| there is no possible countermeasure, too bad game over.
| jandrewrogers wrote:
| LTO is heavily used in my experience. If it breaks something
| that is indicative of other issues that need to be addressed.
| yxhuvud wrote:
| Main issue isn't that it break stuff but that it tend to be
| pretty slow to compile with it.
| jorvi wrote:
| .. that's why you compile without LTO during development
| and do a final 'compile with LTO > profile > fix /
| optimize > compile with LTO' pass.
|
| Compilation happens once and then runs on hundreds of
| thousands up to billions of devices. Respect your users.
| astrange wrote:
| This assumes that LTO is strictly better than no-LTO, ie
| only gets faster, has the same optimization hotspots, and
| doesn't break anything.
|
| I would recommend only doing things that fit within the
| 'build > text > fix' loop.
| Sharlin wrote:
| Which doesn't matter at all in a release build. And in a
| dev build it's rarely necessary.
| pcwalton wrote:
| At FAANG scale the cost is prohibitive. Hence the
| investment in ThinLTO.
| KerrAvon wrote:
| At FAANG scale, you absolutely want to have a pass before
| deployment that does this or you're leaving money on the
| table.
| steveklabnik wrote:
| It's on by default for Rust release builds, so at least the
| codepaths in LLVM for it are well-exercised.
| alpaca128 wrote:
| That must have been changed sometime in the last year then.
| When I enable LTO for one of my projects on a Rust compiler
| from 2024 the compilation time more than doubles.
| steveklabnik wrote:
| I should have been more clear: thin LTO is, not full
| "fat" LTO, for exactly that reason.
| vlovich123 wrote:
| I don't think that's right unless the docs are stale:
| [profile.release] lto = false
|
| https://doc.rust-
| lang.org/cargo/reference/profiles.html#rele...
| steveklabnik wrote:
| So the thing is that false means thinlto is used
| depending on other settings, see https://doc.rust-
| lang.org/cargo/reference/profiles.html#lto
|
| > false: Performs "thin local LTO" which performs "thin"
| LTO on the local crate only across its codegen units.
|
| I think this is kind of confusing but whatever. I should
| have been more clear.
| Rusky wrote:
| It's worth noting (and the paper does go into this) that this
| is limited to a very specific subset of UB, which they call
| "guardable."
|
| They are _not_ removing UB around things like out-of-bounds or
| use-after-free, which would likely be more expensive.
| jonstewart wrote:
| I don't understand the down votes. Conducting empirical
| research on the performance impact of undefined behavior is
| fantastically needed, as the C++ committee's obsession with
| undefined behavior strictness (in contrast with longstanding
| semantics, e.g., uninitialized memory accesses being just fine)
| has been justified largely by how they enable optimizing
| compilers. This research shows that many types of UB have a
| negligible impact on performance.
| saagarjha wrote:
| You're getting downvoted because you're looking for a
| particular result ("UB optimizations don't help performance")
| rather than actually evaluating the quality of this analysis
| (which doesn't really support what you want anyway).
| atq2119 wrote:
| Possibly somebody downvoted because "thank you" in all caps
| is not a substantial contribution to discussion. It feels
| like the kind of low effort stuff you'd see on reddit.
|
| Also, commenting on downvotes is generally frowned upon.
| mwkaufma wrote:
| Reading e.g. the 13% perf regression in simdjson from disabling
| UB: A simpler alternative is to compile the
| program with LTO. We confirmed that LLVM's inter-procedural
| analyses can propagate both alignment and dereferenceability
| information for this function, which allows the LTO build to
| recover the performance loss.
|
| "can" is doing a lot of heavy-lifting here. Guaranteeing expected
| optimizations "will" be applied are hard-enough, without leaving
| it entirely to an easily-derailed indirect side-effect.
| UebVar wrote:
| This is "can" has exactly the same meaning as in "UB can make
| your programms faster". You could replace it with "it does, at
| least with clang". LTO is, in this regard, the same as UB, and
| unlike guaranteed optimizations, such as the single member
| optimization, or the empty base optimization.
| mwkaufma wrote:
| Concretely, here, the UB-exploitation in question in this
| case is assuming that the "this" pointer in C++ is aligned
| and non-null, meaning it's a pervasive annotation throughout
| C++ codebases, not an edge-case.
|
| Relying on LTO to "discover" this annotation through
| interprocedural analysis -- based on my experience of looking
| at LTO in practice -- will not be as comprehensive, and even
| when it works it accomplishes its task in an achingly-slow
| and expensive way.
|
| This is a real devil-is-in-the-details case.
| quotemstr wrote:
| I love when papers disagree with their own abstracts.
| gitroom wrote:
| perfect, this is right up my alley - honestly i keep wondering if
| teams avoid optimizations like lto just because build pain sucks
| or if theres some deeper trust issues around letting the
| toolchain be clever. you think peopled deal with slow builds if
| it bought way more speed for the final product?
| pcwalton wrote:
| I notice that the paper doesn't claim to eliminate _all_
| reasoning about undefined behavior for optimizations. For
| example: int f() { int arr[3], i =
| 0; arr[3] = 5; return i; }
|
| Optimizing this to "return 0" is relying on UB, because it's
| assuming that i wasn't laid out directly after arr in the stack
| frame. I believe this is what the paper calls "non-guardable UB".
|
| I don't agree with the claim in the paper that their semantics
| offers a "flat memory model". A flat memory model would rule out
| the optimization above. Rather, the memory model still has the
| notion of object bounds; it's just simplified in some ways.
| throwawayqqq11 wrote:
| Sorry, i dont get why the memory layout should have any effect,
| when its clear in the AST that i=0 should be returned.
| bregma wrote:
| It's clear in the AST that there is undefined behaviour and
| it is malformed code. It is not valid C code, so what the
| compiler chooses to do with it is not defined by the
| language.
| pcwalton wrote:
| Note that if you change the code to this you have the same
| issue: int g(int n) { int
| arr[3], i = 0; arr[n] = 5; return
| i; }
|
| Without "exploiting UB" it's incorrect to optimize this to
| "return 0", because of the possibility that i was allocated
| right after arr and n == 3.
| screcth wrote:
| If we consider writing out of bounds to be legal, we make
| it impossible to reason about the behavior of programs.
| pcwalton wrote:
| Hence why I (and many other compiler developers) are
| inherently skeptical whenever anyone says "just stop
| exploiting undefined behavior".
| im3w1l wrote:
| My thinking disagrees with yours, but I can't say I have
| fully made up my mind yet. To me a compiler that is not
| "exploiting UB" has several valid choices about how to
| compile this. i may be stored in memory, in a register or
| may be deleted entirely as it's known at compile time
| that it will have value 0. The store to arr[n] may go
| through or it may be deleted as it's a dead store.
|
| You may say I'm "exploiting ub" when making these
| deductions but I would disagree. My argument is not of
| the form "x is ub so I can do whatever I want".
|
| To elaborate on the example, if arr was volatile then I
| would expect the write to always go through. And if i was
| volatile then I would expect i to always be read. However
| it's still not guaranteed that i is stored immediately
| after arr, as the compiler has some discretion about
| where to place variables afaik. But _if_ i is indeed put
| immediately after, then the function should indeed return
| 5 for n=3. For n >3 it should either return 0 (if writing
| to unused stack memory), page fault (for small n outside
| of the valid stack space), or stomp on random memory (for
| unlucky and large n). For negative n, many bad things are
| likely to happen.
|
| Edit: I think I mixed up which way the stack grows but
| yeah.
| dgrunwald wrote:
| > But if i is indeed put immediately after, then the
| function should indeed return 5 for n=3.
|
| That's not how compilers work. The optimization changing
| `return i;` into `return 0;` happens long before the
| compiler determines the stack layout.
|
| In this case, because `return i;` was the only use of
| `i`, the optimization allows deleting the variable `i`
| altogether, so it doesn't end up anywhere on the stack.
| This creates a situation where the optimization only
| looks valid in the simple "flat memory model" because it
| was performed; if the variable `i` hadn't been optimized
| out, it would have been placed directly after `arr` (at
| least in this case: https://godbolt.org/z/df4dhzT5a), so
| the optimization would have been invalid.
|
| There's no infrastructure in any compiler that I know of
| that would track "an optimization assumed arr[3] does not
| alias i, so a later stage must take care not to place i
| at that specific point on the stack". Indeed, if array
| index was a runtime value, the compiler would be
| prevented from ever spilling to the stack any variable
| that was involved in any optimizations.
|
| So I think your general idea "the allowable behaviors of
| an out-of-bounds write is specified by the possible
| actual behaviors in a simple flat memory model for
| various different stack layouts" could work as a
| mathematical model as an alternative to UB-based
| specifications, but it would end up not being workable
| for actual optimizing compiler implementations -- unless
| the compiler could guarantee that a variable can always
| stay in a register and will never be spilled (how would
| the compiler do that for functions calls?), it'd have to
| essentially treat all variables as potentially-modified
| by basically any store-via-pointer, which would
| essentially disable all optimizations.
| maartenscholl wrote:
| I think in the example the parent gave `arr[3]` is past the
| end of the 3 element array, where `i` might reside,
| potentially changing its value.
| dooglius wrote:
| >it's assuming that i wasn't laid out directly after arr in the
| stack frame
|
| The compiler isn't "assuming" that so much as choosing not to
| put i in the stack frame at all. And I don't think it's correct
| to view the lack of a register spill as an "optimization" per
| se. It does remain true that code writing past the end of an
| array will be UB in typical scenarios (i.e. when not using
| asan/valgrind).
|
| (Now, if the compiler also removed the store, that could
| legitimately be called an optimization based on assuming no-UB)
| pcwalton wrote:
| "Exploiting undefined behavior" occurs when a simple
| semantics (however one defines "simple") results in behavior
| A, but the compiler chooses behavior B instead based on the
| actual, more complex, language semantics. The code snippet in
| question passes that test. If I flip the declaration order of
| i and arr, then I get this [1] at -O0 (the "simple"
| semantics): push rbp mov
| rbp, rsp mov dword ptr [rbp - 4], 0
| mov dword ptr [rbp - 4], 5 mov eax, dword
| ptr [rbp - 4] pop rbp ret
|
| Which indeed returns 5. But at -O2 clang optimizes it to
| this: xor eax, eax ret
|
| Which returns 0. So the simple semantics produces one result,
| and the complex semantics produces another. Hence, it's
| exploiting undefined behavior.
|
| [1]: https://godbolt.org/z/df4dhzT5a
| dooglius wrote:
| Maybe this is just arguing semantics, but I don't agree
| with the definition you've given, and I don't think that
| your definition is what TFA means. "Exploiting undefined
| behavior" I think refers narrowly to the implementation
| assuming that undefined behavior does not occur, and acting
| on that assumption. Undefined behavior naturally resulting
| in unpredictable behavior is not exploitation in the same
| sense. For example, printf("A"); bool
| x; if ( x ) {printf("B");} else {printf("C");}
| printf("\n");
|
| If at -O0 "AB" is printed and at -O2 "AC" is printed (due
| to the vagaries of whatever was left on the stack), then
| that would meet your definition, but I would not regard
| that as "exploiting undefined behavior", merely as the
| manifestation of the inherent unpredictability of UB. If
| the compiler didn't print anything (i.e. the whole block
| was removed due to UB detection) then that _would_ be a
| case of exploiting undefined behavior.
| pcwalton wrote:
| That example is an instance of unspecified vs. undefined
| behavior, but the correctness of the pointer provenance-
| based optimization example I gave doesn't depend on
| whether writing to an out-of-bounds pointer is
| unspecified or undefined.
| artikae wrote:
| What about this: https://godbolt.org/z/xP9xG3Ee3
|
| Here the compiler "register allocates" i for some reads but
| not for others.
|
| i gets stack allocated, but some uses of it act as though
| they were register allocated.
| dooglius wrote:
| I'm not sure quite what you're asking for exactly, given
| the link is for clang trunk and doesn't have the
| modifications discussed in TFA, and I don't dispute that
| clang does UB-based reasoning at -O3. But, I will argue
| that the assembly shown can be accomplished without
| resorting to what I call "reasoning about UB", and within a
| flat memory model, supporting the claim that these
| sacrifices are often not necessary. I'm going to draw a
| distinction between stack memory being "private" in the
| sense that only the compiler is allowed to alter it, and
| "public" where the address can be written to by something
| else and the compiler needs to handle that. Local variables
| at first are tracked privately. After the address of a
| variable is taken with &x, or at the point in time when an
| array variable is indexed, the associated memory is public.
| Conceptually, the use of private memory can be indirect;
| the compiler could encode/decode a stack variable x as (x
| XOR 0xDEADBEEF) on the stack and it would be fine (but the
| compiler does the simple thing in practice, naturally).
| Note that this notion of "private"/"public" stack memory is
| a property of addresses, not the provenance of the
| accessing pointers, and so is fully compatible with a flat
| memory model. The compiler's ability to use private memory
| isn't a case of "reasoning around UB" in a meaningful sense
| -- otherwise you could just as well argue that returning
| from a function call is "reasoning about UB", because the
| the return address can't be clobbered.
|
| In your provided snippet, the correctness argument for the
| assembly in a non-UB-reasoning universe goes like this: at
| first, i is stored privately on the stack with value zero,
| and so as an optimization we can assume that value is still
| zero without rereading. Only later, when &i is taken, is
| that memory made public and the compiler has to worry about
| something altering it. In actual execution, the problem is
| that the write function alters compiler-private memory (and
| note again, that being private is a property of the
| underlying address, not the fact that it's accessed via an
| out-of-bounds array indexing), and this is UB and so the
| program breaks. But, the compiler didn't need to make
| _assumptions_ around UB.
| jcranmer wrote:
| I've only briefly skimmed the paper, but on that glance, it
| looks like what they did was (effectively) drop all the
| attributes in LLVM that can indicate UB, e.g., the inbounds
| flags on getelementptr instructions, or the nsw flags on
| arithmetic operations.
|
| As you note, it doesn't remove the more core UB behaviors in
| LLVM, in particular LLVM's reliance on pointer provenance.
| nikic wrote:
| They do a bit more than that. One of the options (-disable-
| object-based-analysis under AA2) disables the assumption that
| distinct identified objects do not alias, which is disabling
| pointer provenance in at least one key place.
|
| So I think this option very roughly approximates a kind of
| "no provenance, but with address non-determinism" model,
| which still permits optimizations like SROA on non-escaping
| objects.
| jcranmer wrote:
| That's what I get for relying on a skimming of the paper.
|
| Also, hi, didn't know you commented on this site.
| nikic wrote:
| One peculiar thing about the benchmark results is that disabling
| individual UB seems to fairly consistently reduce performance
| without LTO, but improve it with LTO. I could see how the UB may
| be less useful with LTO, but it's not obvious to me why reducing
| UB would actually help LTO. As far as I can tell, the paper does
| not attempt to explain this effect.
|
| Another interesting thing is that there is clearly synergy
| between different UB. For the LTO results, disabling each
| individual UB seems to be either neutral or an improvement, but
| if you disable all of them at once, then you get a significant
| regression.
___________________________________________________________________
(page generated 2025-04-25 23:01 UTC)