[HN Gopher] Do we really need undefined behavior?
       ___________________________________________________________________
        
       Do we really need undefined behavior?
        
       Author : ibraheemdev
       Score  : 49 points
       Date   : 2021-12-03 20:51 UTC (1 days ago)
        
 (HTM) web link (www.ralfj.de)
 (TXT) w3m dump (www.ralfj.de)
        
       | Ericson2314 wrote:
       | One thing I don't see remarked upon enough is that UB
       | fundamentally is about the _logical_ side of programming
       | languages, in particular
       | https://en.wikipedia.org/wiki/Principle_of_explosion . This
       | quote:
       | 
       | > Here I refer to the typical modern interpretation of UB:
       | assumptions the compiler may trust, without bounds on what
       | happens if they are violated.
       | 
       | Makes that abundantly clear.
       | 
       | This basic realization opens up many avenues for
        | interdisciplinary research. For example, maybe all the ink spilled
       | on "paraconsistent logics" will finally be useful for designing
       | an unsafe language like C.
       | (https://plato.stanford.edu/entries/logic-paraconsistent/
       | https://ncatlab.org/nlab/show/paraconsistent+logic .) Maybe
       | "unsafe" in Rust is best understood as a logical "mode" of some
        | sort. A lattice of allowed behaviors leads to a nice fine-grained
       | semantics for semi-defined behavior.
       | 
       | I honestly don't know whether no one talks about this because it
       | is very obvious to researchers in the field, or because no one
       | actually dove deep into this interpretation.
        
         | Ericson2314 wrote:
          | I got the first-comment upvotes, but no replies, and so the
         | mystery of whether this is obvious and I am preaching to the
         | choir, or this is a genuinely enlightening point, is not
         | resolved :)
        
       | OJFord wrote:
        | My reaction to the title was 'well sure, do we really need tax
        | or legal loopholes and grey areas either?', like the question
        | doesn't really make sense. But given the author's position, 'Do
        | we really _not_ need undefined behaviour?' seems like a better
        | fit. I.e. they're arguing that it's useful, in response to a
        | claim that it should be eliminated.
        
       | clktmr wrote:
       | "Do we really need UB?" is the wrong question asked. "How much UB
       | do we need to achieve the design goals of our language?".
       | 
        | I think introducing UB while planning tight integration with the
        | language's tooling beforehand has lots of upside. Even GCC's
        | ubsan does quite a good job, although it was created
        | retrospectively.
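        | 
        | As a rough illustration (assuming a recent GCC or Clang invoked
        | with -fsanitize=undefined), a trivially overflowing program like
        | the sketch below gets reported at runtime instead of silently
        | miscompiled:
        | 
        |     /* ub.c -- signed overflow is UB; UBSan flags it at runtime.
        |        Build and run: cc -fsanitize=undefined ub.c && ./a.out  */
        |     #include <limits.h>
        |     #include <stdio.h>
        | 
        |     int main(int argc, char **argv) {
        |         (void)argv;
        |         int x = INT_MAX;
        |         x += argc;   /* argc >= 1, so this overflows: reported */
        |         printf("%d\n", x);
        |         return 0;
        |     }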
        
       | larsrc wrote:
       | Godel says you do.
        
         | Jweb_Guru wrote:
         | Godel's incompleteness theorem does not say that you cannot
         | create a programming language (even a Turing-complete one)
         | without undefined behavior. I'd love to hear how you came to
         | that conclusion.
        
       | stncls wrote:
       | The arguments I've read for or against undefined behavior are
       | always:
       | 
       | 1. fully binary: "[all] UB is nonsense" vs. "[all] UB is
       | necessary for performance"
       | 
        | 2. qualitative / anecdotal: "look at this compiler doing
       | something unexpected" vs. "look at this optimizing compiler's
       | transformation that would be impossible without UB".
       | 
       | Instead I would love to see a more fine-grained and quantitative
        | study. There are _many_ different types of UB. I'm sure every
       | one of them is exploited by optimizing compilers. However, what
       | would be the quantitative impact of disabling each of them on,
       | say, SPECINT benchmarks? Of course some types of UB-related
       | assumptions are deeply embedded in compilers, so it's clearly a
       | question easier asked than answered. But I think simple things
       | like signed-integer overflow handling can be changed (in clang at
       | least?). Has anyone done benchmarking with this? Are you aware of
       | any papers on the topic?
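        | 
        | For what it's worth, both GCC and Clang accept -fwrapv, which
        | gives signed overflow wrapping two's-complement semantics, so
        | benchmarking at least that one class of UB is possible today.
        | A sketch of the kind of loop where the difference can show up
        | (exact codegen varies by compiler and target):
        | 
        |     /* With signed-overflow UB, the compiler may assume `i`
        |      * never wraps: `i <= n` must eventually fail, the trip
        |      * count is computable, and widening/vectorization become
        |      * easy.  Under -fwrapv it must also handle n == INT_MAX,
        |      * where this loop never terminates. */
        |     long sum_inclusive(const int *a, int n) {
        |         long s = 0;
        |         for (int i = 0; i <= n; i++)
        |             s += a[i];
        |         return s;
        |     }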
        
         | vlovich123 wrote:
         | People focus too much on SPECINT. What I'd like to see instead
         | is the companies that are already running lots of real world
         | benchmarks report behavior changes. Then you can get more
         | meaningful real world results. Now obviously this isn't
         | something you could develop locally so these companies would
         | have to develop tooling to be able to gather these kinds of
          | reports. It would be extremely beneficial for compiler
         | authors (eg "run experiment X"). Rust has the best setup I've
         | seen in that it has both microbenchmarks and non-
         | microbenchmarks and looks for statistical significance across
         | all workloads rather than hyper focusing on a small thing.
        
         | [deleted]
        
         | solmag wrote:
         | I think one of Jung's comments was that we don't really know
         | the performance benefits of UB in that blog post.
        
       | eximius wrote:
       | I don't follow. This demonstrates that we use UB, not that we
       | need it?
        
         | marginalia_nu wrote:
         | The is-ought problem is tricky like that.
        
         | masklinn wrote:
         | Indeed.
         | 
          | Though one of the confusing issues with UB is that the
         | _purpose_ of the UB varies.
         | 
          | Some UBs are really close cousins of implementation-defined
          | behavior (IB), like UB on signed overflow: it could have been
          | IB but was probably deemed not worth that status, so was left
          | undefined.
         | 
         | Others are about the intersection of efficiency and lack of
         | expressivity, like dereferencing null pointers: implicitly
         | checking each and every access was (likely) considered too
         | costly, but the language has no way of expressing nullability
         | statically.
         | 
         | So given a language with C-type semantics, you don't _need_ the
         | first UB (though it can certainly be leveraged by compilers),
          | while the second one is much harder to get rid of.
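          | 
          | A rough sketch of what a "defined" null dereference would
          | cost if the compiler had to insert the check itself
          | (hypothetical; real implementations mostly lean on the MMU
          | faulting on the unmapped zero page instead):
          | 
          |     #include <stdlib.h>
          | 
          |     int deref_defined(const int *p) {
          |         if (p == NULL)
          |             abort();   /* defined, predictable outcome */
          |         return *p;
          |     }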
        
         | Jweb_Guru wrote:
         | The point he is making is that trying to optimize C without any
         | UB at all is basically impossible, even basic stuff like
         | register allocation. If you're willing to forego that much
         | performance just for the nebulous goal of producing "platform-
         | specific UB" (which in practice can be just as damaging as the
         | "regular" kind of UB in most cases) to simplify reasoning about
         | your program, do you really need to be writing in C? Or would
         | another, memory-safe language suffice? For that matter, if you
         | really only care about what happens on the platform you're on,
         | why not write in assembler and avoid having to worry about an
         | optimizer at all? After all, "platform-specific" UB will
         | otherwise vary across systems, leading to the same kind of
         | unexpected behavior when you port your program (e.g. weak
         | memory bugs in code designed for x86-64 machines that only
         | surface on ARM).
         | 
         | (And to be clear--I think the article is being incredibly
         | diplomatic in saying "there isn't enough research" into
         | compilers that don't optimize based on UB. It would be
         | absolutely shocking if a non-UB-based compiler were able to
         | come anywhere close to modern C compilers on either real-world
         | programs or toy benchmarks, because almost every avenue of
         | optimization would be closed to them. When any out of bounds
         | write from anywhere in your code, including another thread, can
          | modify a variable on the stack, it's usually not even possible
         | to coalesce or rearrange writes, or remove redundant reads!
         | These sorts of things are the bread and butter of compiler
         | optimizations. Hell, are we even allowed to inline function
         | calls? It might break out of bounds writes to the saved stack
         | pointer!).
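          | 
          | A minimal sketch of the register-allocation point (the helper
          | `log_step` is hypothetical): if arbitrary out-of-bounds writes
          | were defined to hit whatever happens to be adjacent in memory,
          | the assumption in the comment below would not hold.
          | 
          |     void log_step(int v);   /* hypothetical external call */
          | 
          |     int sum3(const int *p) {
          |         int a = p[0];
          |         int b = p[1];
          |         int c = p[2];
          |         log_step(a);
          |         /* a, b and c may stay in registers (or in compiler-
          |          * chosen spill slots) across the call only because
          |          * no defined write, not even an out-of-bounds one
          |          * elsewhere in the program, is allowed to target
          |          * them. */
          |         return a + b + c;
          |     }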
        
           | eximius wrote:
           | UB that just says "this is an axiom the compiler makes" is
           | fine. Ignoring it as a possibility is fine. Lighting your
            | program on fire when it _knows_ you've done it instead of
           | just telling you or doing something consistent is not.
           | 
           | Anyway, if this is basically just "C is bad", whatever, I'm
           | not using C for anything. I'd use Zig or Rust.
        
             | Jweb_Guru wrote:
             | > UB that just says "this is an axiom the compiler makes"
             | is fine.
             | 
             | That's literally what UB is.
             | 
             | If the compiler makes any optimizations based on those
             | axioms, and the axioms are violated, you'll end up with
             | unexpected behaviors ("lighting your program on fire") that
             | are crazy hard to debug. So saying "the compiler can have
             | UB, but it just shouldn't make weird optimizations based on
             | that" is a little hard to justify, or else you get absurd
             | results like what I outlined in the post above. A lot of
              | beginning C programmers _do_ find it strange that they
              | can't index past the end of a variable and expect to
              | always hit the next variable on the stack, for example,
              | but experienced C programmers don't. Being able to
              | rearrange,
             | reuse, and register allocate variable slots is one of the
             | most fundamental optimizations out there.
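              | 
              | A minimal sketch of that expectation (nothing here is
              | guaranteed; `b` may not even have an address):
              | 
              |     int f(void) {
              |         int a[2] = {1, 2};
              |         int b = 3;
              |         /* UB: a[2] does NOT reliably read b.  The
              |          * compiler may keep b in a register, reorder
              |          * the stack slots, or assume this read never
              |          * happens at all. */
              |         return a[2] + b;
              |     }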
             | 
             | Now, imagine if compilers were never allowed to reorder
             | writes and loads because another thread might potentially
             | update them... pretty crazy, right? Well, a lot of the
             | "benign data races" people yell at compilers for screwing
             | with, are caused by exactly that kind of reasoning,
             | assuming that writes in one thread will always be seen in
             | the same order by other threads without any explicit
             | synchronization or ordering primitives. And that's not even
             | necessarily a "messing with what the hardware does" thing,
             | because there's hardware out there that will reorder
             | ordinary loads and stores for you!
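              | 
              | A sketch of that kind of "benign" race, assuming the two
              | functions run on different threads with no atomics,
              | locks, or fences anywhere:
              | 
              |     int data;
              |     int ready;
              | 
              |     void producer(void) {   /* thread A */
              |         data  = 42;
              |         ready = 1;   /* nothing orders this store
              |                         after the one to data */
              |     }
              | 
              |     int consumer(void) {    /* thread B */
              |         /* The compiler or a weakly ordered CPU may
              |          * make ready == 1 visible before data == 42,
              |          * so this can return 0.  C11 calls these
              |          * unsynchronized accesses a data race, which
              |          * is UB. */
              |         return ready ? data : -1;
              |     }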
        
           | throwuxiytayq wrote:
           | > It would be absolutely shocking if a non-UB-based compiler
           | were able to come anywhere close to modern C compilers on
           | either real-world programs or toy benchmarks, because almost
           | every avenue of optimization would be closed to them.
           | 
            | Modern C# has next to no undefined behavior; you can
            | disregard the managed features and write low-level code
            | with it, and it JIT-compiles nearly as efficiently as C.
           | 
           | Unity has an LLVM-based compiler for a subset of C# that
           | emits _extremely clean_ native assembly.
           | 
           | These days there's no real reason for languages to be as ugly
           | as C.
        
             | Jweb_Guru wrote:
             | > Modern C# has next to no undefined behavior, you can
             | disregard the managed features and write low level code
             | with it, and it JIT-compiles nearly as efficiently as C.
             | 
             | The calculus for memory-safe languages is completely
             | different. They are not allowed to perform arbitrary,
             | unchecked, out-of-bounds writes, so optimizations assuming
             | those writes don't exist are sound without UB. If you mix
             | them with an unsafe language or component that can use raw
             | pointers, that part of the language has a lot of the same
             | UB that C does, or else it would invalidate optimizations
             | in the managed part of the language. If you've ever seen
             | C++ code that needs to interact with a managed runtime,
             | you're well aware of how careful the unmanaged part has to
             | be to avoid treading on runtime invariants.
             | 
             | Anyway I'm not defending C here. I'm defending UB. It
             | exists in the language for a reason, which is that it's not
             | meaningfully possible to optimize programs without it. If
             | people don't want to deal with it, they should use another
              | language that isn't C. That doesn't mean that the _amount_
              | of UB in C couldn't be greatly reduced, but the article is
             | responding to people who claim that a modern optimizing C
             | compiler could reasonably avoid non-"platform specific" UB
             | altogether.
        
       | saurik wrote:
       | FWIW, some of the biggest cases of undefined behavior are ones
       | that, as far as I am concerned, are workarounds for legacy code:
       | things like how older code uses "int" as the counter for every
       | single for loop, despite this being an incredibly stupid type for
       | this purpose... compilers are then competing on how fast they
       | make the existing ecosystem of code and so add awkward detection
       | cases like "if it sort of looks like you only increment it I will
       | decide you didn't want overflow and that way I can avoid having
       | to simulate it now that int is 32-bits but the CPU is 64-bits".
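        | 
        | A sketch of the pattern in question, assuming a 64-bit target
        | where int is 32 bits:
        | 
        |     /* Because signed overflow is UB, the compiler may assume
        |      * `i` never wraps and keep one 64-bit induction variable
        |      * for `a[i]`, instead of re-sign-extending a possibly-
        |      * wrapped 32-bit `i` on every iteration.  A size_t
        |      * counter sidesteps the question entirely. */
        |     void scale(double *a, int n) {
        |         for (int i = 0; i < n; i++)
        |             a[i] *= 2.0;
        |     }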
       | 
       | But once compilers start adding these special cases then new code
       | has to be written to understand "if I add a bounds check on some
       | wrapping operation this compiler thinks I am insane and removes
       | it as it assumes I am an idiot using the wrong type". And once
       | compilers start doing that, particularly with incompatible
       | heuristics, the concepts of what the language is even
       | guaranteeing begin to erode and the standards adjust to define
       | "undefined" behavior.
       | 
       | But the correct solution wasn't to fix the compiler... it was to
       | use the correct data type! Whether that feels like it should be
       | size_t or uintptr_t I will refrain from providing an opinion on
       | in this comment (to avoid bike-shedding), but the premise that
       | the compiler is a billion little attempts to figure out how to
        | make my code faster--rules that change constantly and are under-
       | specified--when it is the code that is wrong is extremely
       | frustrating.
       | 
       | If you want to avoid some aliasing use case, or make math
       | operations faster, or avoid issues with the calling convention on
       | "this" pointers, or whatever it is... we should agree on what the
       | code _should_ look like to make those semantics clear and then
       | _forcibly deprecate as non-compliant_ any implementation that
       | uses strange quirks to auto-detect and  "improve" old code.
       | 
       | Alternatively, we should have a special mode the compiler should
       | support to "just do what the code says even if it is slow" and we
       | should then guide developers in the direction that it is a "code
       | smell" to stay in quirks mode: that if there is a way to make the
       | "strict" compiler mode understand your code correctly and still
       | be as fast as the one with tons of undefined behavior, it is your
       | code that is at fault.
       | 
       | But like, the market incentives will never support this: people
       | evaluating between two compilers are going to run benchmarks
       | against projects that they have been using for decades and choose
       | the compiler that "works better"; and meanwhile, convincing
       | developers to stop using "int" or explicitly call out aliasing or
       | whatever will never happen because people will benchmark the code
       | against a compiler with all these bells and whistles and decide
       | "the risk of making these modifications to every single place in
        | our code where it might matter is too great as we might break
       | something in the process and the compiler seems to compile our
       | code just as fast without it being explicit (due to UB)" with a
       | side of "the explicit syntax for that is caught up in committee
       | and isn't supported on one of our key targets yet".
       | 
       | So, I personally consider this all a "lost cause". But I also
       | really want to point out that the root cause of all of this
       | undefined behavior is due to an underlying incentive structure
       | that your pet language isn't immune to: you just got to start off
       | clean without a ton of legacy baggage code. You don't have a
       | ridiculously large amount of code written with silly assumptions
       | for 16-bit systems that has been carefully dragged through the
       | ages into 64-bit, and you have a ton of explicit syntax already
       | to prevent a lot of the other brutal cases in all new code... and
       | there is also (probably) a single implementation of the compiler.
       | 
       | But, one day, 20 years from now, after a billion lines of code
       | have been written in your language, there are four big
       | implementations of the compiler (only one of which still
        | supports legacy ARM CPUs, as everyone has since moved on to KR7
       | or whatever is hip in 2040), and you've made a ton of sad choices
       | in an attempt to hold your community together, someone is going
       | to weigh a cost/benefit analysis and decide "keeping existing
       | code working but making it faster is more important--and wins us
       | more marketshare--than telling people how to port their code to
       | the new era".
       | 
       | Now, I don't know what these cases are going to be--maybe
       | something super complex involving how massively parallel
       | execution environments handle order of execution for old code
       | designed to run on a classic computer core, or how to map
       | explicit integer representations into prime-order field elements
       | for zero-knowledge execution, or what to do with information
       | leakage during homomorphic encryption--but it is going to happen,
       | and we are going to repeat all of these "quirks mode vs. strict
       | mode" discussions each generation for every actually-important
       | piece of technology, just as we have for standards even so far
        | afield from "programming" as HTML and TCP.
        
         | sokoloff wrote:
         | > like how older code uses "int" as the counter for every
         | single for loop, despite this being an incredibly stupid type
         | for this purpose
         | 
         | Is it? I always internalized int as "the most natural sized
         | number for the platform". For a lot of uses, that's a pretty
          | reasonable choice, despite there being more specific/constrained
         | types available now.
        
           | seoaeu wrote:
           | The problem is that the "most natural sized number for the
           | platform" was 32-bits for so long on major platforms, that
           | now everyone assumes that changing that will break too much
           | for it to be worth switching
        
           | msbarnett wrote:
            | It's not about the size, it's about the signedness being
           | presumably not natural for an index, and the potential for
           | having very inconvenient consequences for performance on
           | modern CPU architectures if you were not able to assume the
           | index never goes negative due to a wrap.
           | 
           | Basically, to get decent perf out of C loops, arithmetic
           | overflow _has to_ be left as UB for int. If the standard
           | index variable type had historically been unsigned and
           | defined to terminate on overflow or something, you could make
           | wrap-on-overflow standard for int without killing the
           | performance of hot inner loops.
        
           | cpeterso wrote:
           | int is signed and signed int overflow is UB. If you use an
           | int as a loop counter, the compiler can optimize the loop
           | assuming the loop counter will never overflow because,
           | surely, a programmer would never invoke UB. :) If the loop
           | counter does overflow, then who knows what bad things will
           | happen. For example, a simple loop like "for (int i = 0; i >=
           | 0; i++) { ... }" might become an infinite loop, crash,
           | execute random code, or never execute in the first place.
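            | 
            | A hedged illustration of how that happens (exact behavior
            | varies by compiler and optimization level; `do_work` is a
            | hypothetical loop body):
            | 
            |     void do_work(int i);
            | 
            |     void spin(void) {
            |         /* i starts at 0 and is only incremented; since
            |          * signed overflow is UB, the compiler may assume
            |          * `i >= 0` never becomes false, drop the test,
            |          * and emit an unconditional infinite loop. */
            |         for (int i = 0; i >= 0; i++)
            |             do_work(i);
            |     }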
        
             | sokoloff wrote:
             | I think it's defined to execute at least until overflow, is
             | it not? (So, even assuming a 1-bit int, it's going to
             | execute at least once I think.)
        
       | Someone wrote:
       | I think the first question to ask is whether we need a language
       | that does away with array bounds checking.
       | 
       | If, like C, you think you do, you accept (on currently popular
       | CPUs) that code may read from and write to arbitrary memory, and
       | that means you have to define that as "anything can happen".
       | 
       | If you don't, I think you can get rid of the most common causes
       | of UB.
       | 
       | So, with modern compiler technology, are languages that check
        | array bounds necessarily slower than those that don't? If so, is
       | the difference large enough?
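        | 
        | For what it's worth, much of the cost disappears whenever the
        | compiler can prove a check redundant; a sketch of the common
        | case (the explicit test stands in for the check a bounds-
        | checked language would insert automatically):
        | 
        |     #include <stdlib.h>
        | 
        |     int sum_checked(const int *a, size_t len) {
        |         int s = 0;
        |         for (size_t i = 0; i < len; i++) {
        |             if (i >= len)   /* stand-in for the bounds check */
        |                 abort();    /* provably unreachable here, so
        |                                the check folds away */
        |             s += a[i];
        |         }
        |         return s;
        |     }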
        
         | spacechild1 wrote:
         | > So, with modern compiler technology, are languages that check
          | > array bounds necessarily slower than those that don't? If so,
         | is the difference large enough?
         | 
         | For performance sensitive tasks - definitely! I certainly don't
         | want array bound checking in my (soft)realtime audio processing
         | code.
         | 
         | As a side note: when I write an audio plugin, it's the
         | responsibility of the host application to tell me the correct
         | buffer sizes, so I basically deal with trusted input. It's
         | still possible to mess up, of course, but personally I found
         | writing DSP code to be pretty safe.
        
       | nayuki wrote:
       | A short note about the content - the step of creating an out-of-
       | bounds pointer is UB, and the step of dereferencing the pointer
        | is also UB:
        | 
        |     int *x = j + 32/4;
        |     *x = 40;
       | 
       | Thus, you can argue that if UB were to be banned, it would be
       | better to crash at the point of creating the out-of-bounds
       | pointer, instead of accommodating the out-of-bounds write by
       | always reloading every read from memory in case of aliasing.
        
         | Jweb_Guru wrote:
         | How do you know it's out of bounds, in general? C pointers
         | aren't required to store their size at runtime, and even if
         | they did and you were inside the extent the pointer might have
         | already been freed due to aliasing (which can be impossible to
         | track if your language allows integer-to-pointer casts). Of
         | course in some obvious cases, it would be good to detect this
          | (and many compilers try), but in cases where it's obvious,
          | isn't it better to error at compile time rather than at
          | runtime?
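          | 
          | A sketch of what a pointer would need to carry for the
          | "crash at creation" check to even be expressible at run
          | time (a hypothetical fat-pointer layout, not something C
          | implementations provide):
          | 
          |     #include <stddef.h>
          | 
          |     /* Hypothetical: every pointer drags its object's bounds
          |      * along, and every pointer-arithmetic step is checked
          |      * against them.  Round-tripping through an integer, or
          |      * receiving a plain pointer from foreign code, loses
          |      * this information. */
          |     typedef struct {
          |         int    *addr;    /* the pointer value itself       */
          |         int    *base;    /* start of the underlying object */
          |         size_t  count;   /* number of ints in the object   */
          |     } fat_int_ptr;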
        
       ___________________________________________________________________
       (page generated 2021-12-04 23:01 UTC)