[HN Gopher] Do we really need undefined behavior?
___________________________________________________________________
Do we really need undefined behavior?
Author : ibraheemdev
Score : 49 points
Date : 2021-12-03 20:51 UTC (1 day ago)
(HTM) web link (www.ralfj.de)
(TXT) w3m dump (www.ralfj.de)
| Ericson2314 wrote:
| One thing I don't see remarked upon enough is that UB
| fundamentally is about the _logical_ side of programming
| languages, in particular
| https://en.wikipedia.org/wiki/Principle_of_explosion . This
| quote:
|
| > Here I refer to the typical modern interpretation of UB:
| assumptions the compiler may trust, without bounds on what
| happens if they are violated.
|
| Makes that abundantly clear.
|
| This basic realization opens up many avenues for
| interdisciplinary research. For example, maybe all the ink spilled
| on "paraconsistent logics" will finally be useful for designing
| an unsafe language like C.
| (https://plato.stanford.edu/entries/logic-paraconsistent/
| https://ncatlab.org/nlab/show/paraconsistent+logic .) Maybe
| "unsafe" in Rust is best understood as a logical "mode" of some
| sort. A lattice of allowed behaviors leads to a nice fine-grained
| semantics for semi-defined behavior.
|
| I honestly don't know whether no one talks about this because it
| is very obvious to researchers in the field, or because no one
| actually dove deep into this interpretation.
| Ericson2314 wrote:
| I got those first comment upvotes, but no replies, and so the
| mystery of whether this is obvious and I am preaching to the
| choir, or this is a genuinely enlightening point, is not
| resolved :)
| OJFord wrote:
| My reaction to the title was 'well sure, do we really need tax or
| legal loopholes and grey areas either?', like the question doesn't
| really make sense. But actually, given the author's position, 'Do
| we really _not_ need undefined behaviour?' would be a better fit.
| I.e. they're arguing that it's useful, in response to a claim that
| it should be eliminated.
| clktmr wrote:
| "Do we really need UB?" is the wrong question asked. "How much UB
| do we need to achieve the design goals of our language?".
|
| I think introducing UB with tight integration into the language's
| tooling planned from the start has lots of upside. Even GCC's
| ubsan does quite a good job, although it was created
| retrospectively.
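|
| (For concreteness: UBSan is enabled with -fsanitize=undefined in
| both GCC and Clang. A minimal example, assuming either compiler:
|
|       /* overflow.c */
|       #include <limits.h>
|       int main(int argc, char **argv) {
|           (void)argv;
|           int x = INT_MAX;
|           return x + argc;  /* overflows at runtime when argc == 1;
|                                UBSan reports the file and line */
|       }
|
|       $ cc -fsanitize=undefined overflow.c && ./a.out
|
| The instrumentation has a runtime cost, which is part of why this
| lives in tooling rather than in the default language semantics.)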
| larsrc wrote:
| Godel says you do.
| Jweb_Guru wrote:
| Godel's incompleteness theorem does not say that you cannot
| create a programming language (even a Turing-complete one)
| without undefined behavior. I'd love to hear how you came to
| that conclusion.
| stncls wrote:
| The arguments I've read for or against undefined behavior are
| always:
|
| 1. fully binary: "[all] UB is nonsense" vs. "[all] UB is
| necessary for performance"
|
| 2. qualitative / anecdotal: "look at this compiler doing
| something unexpected" vs. "look at this optimizing compiler's
| transformation that would be impossible without UB".
|
| Instead I would love to see a more fine-grained and quantitative
| study. There are _many_ different types of UB. I'm sure every
| one of them is exploited by optimizing compilers. However, what
| would be the quantitative impact of disabling each of them on,
| say, SPECINT benchmarks? Of course some types of UB-related
| assumptions are deeply embedded in compilers, so it's clearly a
| question easier asked than answered. But I think simple things
| like signed-integer overflow handling can be changed (in clang at
| least?). Has anyone done benchmarking with this? Are you aware of
| any papers on the topic?
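|
| (One concrete experiment along these lines: GCC and Clang both
| accept -fwrapv, which makes signed overflow wrap instead of being
| UB. A toy case where the assumption matters, with the caveat that
| the effect depends heavily on compiler and target:
|
|       /* sum.c: i is a 32-bit int, the bound n is 64-bit, so the
|          compiler can only promote i to a 64-bit induction
|          variable if it may assume i never wraps. */
|       long sum(const int *a, long n) {
|           long s = 0;
|           for (int i = 0; i < n; i++)
|               s += a[i];
|           return s;
|       }
|
|       $ cc -O2 -S sum.c -o sum_ub.s
|       $ cc -O2 -fwrapv -S sum.c -o sum_wrap.s
|
| Comparing the two listings, or rebuilding a whole benchmark suite
| with and without -fwrapv, gives at least a rough answer for that
| one category of UB.)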
| vlovich123 wrote:
| People focus too much on SPECINT. What I'd like to see instead
| is the companies that are already running lots of real world
| benchmarks report behavior changes. Then you can get more
| meaningful real world results. Now obviously this isn't
| something you could develop locally so these companies would
| have to develop tooling to be able to gather these kinds of
| reports. It would be extremely beneficial for compiler
| authors (eg "run experiment X"). Rust has the best setup I've
| seen in that it has both microbenchmarks and non-
| microbenchmarks and looks for statistical significance across
| all workloads rather than hyper focusing on a small thing.
| [deleted]
| solmag wrote:
| I think one of Jung's comments was that we don't really know
| the performance benefits of UB in that blog post.
| eximius wrote:
| I don't follow. This demonstrates that we use UB, not that we
| need it?
| marginalia_nu wrote:
| The is-ought problem is tricky like that.
| masklinn wrote:
| Indeed.
|
| Though one of the confusing issues with UB is that the
| _purpose_ of the UB varies.
|
| Some UBs are really close cousins of implementation-defined
| behaviours (IBs), like UB on signed overflow: it could have been
| IB but was probably found not worth that status, so was left
| undefined.
|
| Others are about the intersection of efficiency and lack of
| expressivity, like dereferencing null pointers: implicitly
| checking each and every access was (likely) considered too
| costly, but the language has no way of expressing nullability
| statically.
|
| So given a language with C-type semantics, you don't _need_ the
| first UB (though it can certainly be leveraged by compilers),
| while the second one is much harder to get rid of.
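|
| (A minimal sketch of the cost in question: if dereferencing a null
| pointer had to be defined, a C-like compiler would conceptually
| have to emit a check like the one below on every access it cannot
| prove non-null (here the "defined" behaviour is simply to abort):
|
|       #include <stdlib.h>
|
|       int deref(int *p) {
|           if (p == 0)
|               abort();   /* the check the compiler would insert */
|           return *p;
|       }
|
| In practice many implementations instead lean on the MMU faulting
| on page zero, but that is platform behaviour, not something the
| language itself promises.)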
| Jweb_Guru wrote:
| The point he is making is that trying to optimize C without any
| UB at all is basically impossible, even basic stuff like
| register allocation. If you're willing to forego that much
| performance just for the nebulous goal of producing "platform-
| specific UB" (which in practice can be just as damaging as the
| "regular" kind of UB in most cases) to simplify reasoning about
| your program, do you really need to be writing in C? Or would
| another, memory-safe language suffice? For that matter, if you
| really only care about what happens on the platform you're on,
| why not write in assembler and avoid having to worry about an
| optimizer at all? After all, "platform-specific" UB will
| otherwise vary across systems, leading to the same kind of
| unexpected behavior when you port your program (e.g. weak
| memory bugs in code designed for x86-64 machines that only
| surface on ARM).
|
| (And to be clear--I think the article is being incredibly
| diplomatic in saying "there isn't enough research" into
| compilers that don't optimize based on UB. It would be
| absolutely shocking if a non-UB-based compiler were able to
| come anywhere close to modern C compilers on either real-world
| programs or toy benchmarks, because almost every avenue of
| optimization would be closed to them. When any out of bounds
| write from anywhere in your code, including another thread, can
| modify a variable on the stack, it's not even usually possible
| to coalesce or rearrange writes, or remove redundant reads!
| These sorts of things are the bread and butter of compiler
| optimizations. Hell, are we even allowed to inline function
| calls? It might break out of bounds writes to the saved stack
| pointer!).
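|
| (A minimal sketch of that problem: if an out-of-bounds store were
| defined to hit whatever happens to sit next to it in memory, the
| compiler could not keep x in a register or fold its uses:
|
|       int example(int *buf, int i) {
|           int x = 1;
|           buf[i] = 7;    /* with "no UB", this store could legally
|                             overwrite x... */
|           return x * 3;  /* ...so x would have to live in memory
|                             and be reloaded here, instead of this
|                             folding to the constant 3 */
|       }
|
| Today the compiler simply assumes buf[i] does not alias x, because
| an access that did so would be UB.)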
| eximius wrote:
| UB that just says "this is an axiom the compiler makes" is
| fine. Ignoring it as a possibility is fine. Lighting your
| program on fire when it _knows_ you've done it instead of
| just telling you or doing something consistent is not.
|
| Anyway, if this is basically just "C is bad", whatever, I'm
| not using C for anything. I'd use Zig or Rust.
| Jweb_Guru wrote:
| > UB that just says "this is an axiom the compiler makes"
| is fine.
|
| That's literally what UB is.
|
| If the compiler makes any optimizations based on those
| axioms, and the axioms are violated, you'll end up with
| unexpected behaviors ("lighting your program on fire") that
| are crazy hard to debug. So saying "the compiler can have
| UB, but it just shouldn't make weird optimizations based on
| that" is a little hard to justify, or else you get absurd
| results like what I outlined in the post above. A lot of
| beginning C programmers _do_ find it strange that they can't
| index past the end of a variable and expect to always hit the
| next variable on the stack, for example, but experienced C
| programmers don't. Being able to rearrange,
| reuse, and register allocate variable slots is one of the
| most fundamental optimizations out there.
|
| Now, imagine if compilers were never allowed to reorder
| writes and loads because another thread might potentially
| update them... pretty crazy, right? Well, a lot of the
| "benign data races" people yell at compilers for screwing
| with, are caused by exactly that kind of reasoning,
| assuming that writes in one thread will always be seen in
| the same order by other threads without any explicit
| synchronization or ordering primitives. And that's not even
| necessarily a "messing with what the hardware does" thing,
| because there's hardware out there that will reorder
| ordinary loads and stores for you!
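|
| (A minimal sketch of the "benign race" pattern in question, in
| plain C with no synchronization:
|
|       int data;       /* plain, non-atomic globals */
|       int ready;
|
|       void writer(void) {
|           data  = 42;
|           ready = 1;  /* nothing orders this after the store to
|                          data: the compiler, or a weakly-ordered
|                          CPU, may make it visible first */
|       }
|
|       int reader(void) {
|           if (ready)
|               return data;  /* may observe 0 */
|           return -1;
|       }
|
| The race-free version makes ready an _Atomic flag (or uses
| explicit release/acquire operations from <stdatomic.h>), which is
| exactly the "explicit synchronization or ordering primitives"
| mentioned above.)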
| throwuxiytayq wrote:
| > It would be absolutely shocking if a non-UB-based compiler
| were able to come anywhere close to modern C compilers on
| either real-world programs or toy benchmarks, because almost
| every avenue of optimization would be closed to them.
|
| Modern C# has next to no undefined behavior, you can
| disregard the managed features and write low level code with
| it, and it JIT-compiles nearly as efficiently as C.
|
| Unity has an LLVM-based compiler for a subset of C# that
| emits _extremely clean_ native assembly.
|
| These days there's no real reason for languages to be as ugly
| as C.
| Jweb_Guru wrote:
| > Modern C# has next to no undefined behavior, you can
| disregard the managed features and write low level code
| with it, and it JIT-compiles nearly as efficiently as C.
|
| The calculus for memory-safe languages is completely
| different. They are not allowed to perform arbitrary,
| unchecked, out-of-bounds writes, so optimizations assuming
| those writes don't exist are sound without UB. If you mix
| them with an unsafe language or component that can use raw
| pointers, that part of the language has a lot of the same
| UB that C does, or else it would invalidate optimizations
| in the managed part of the language. If you've ever seen
| C++ code that needs to interact with a managed runtime,
| you're well aware of how careful the unmanaged part has to
| be to avoid treading on runtime invariants.
|
| Anyway I'm not defending C here. I'm defending UB. It
| exists in the language for a reason, which is that it's not
| meaningfully possible to optimize programs without it. If
| people don't want to deal with it, they should use another
| language that isn't C. That doesn't mean that the _amount_
| of UB in C couldn't be greatly reduced, but the article is
| responding to people who claim that a modern optimizing C
| compiler could reasonably avoid non-"platform specific" UB
| altogether.
| saurik wrote:
| FWIW, some of the biggest cases of undefined behavior are ones
| that, as far as I am concerned, are workarounds for legacy code:
| things like how older code uses "int" as the counter for every
| single for loop, despite this being an incredibly stupid type for
| this purpose... compilers are then competing on how fast they
| make the existing ecosystem of code and so add awkward detection
| cases like "if it sort of looks like you only increment it I will
| decide you didn't want overflow and that way I can avoid having
| to simulate it now that int is 32-bits but the CPU is 64-bits".
|
| But once compilers start adding these special cases then new code
| has to be written to understand "if I add a bounds check on some
| wrapping operation this compiler thinks I am insane and removes
| it as it assumes I am an idiot using the wrong type". And once
| compilers start doing that, particularly with incompatible
| heuristics, the concepts of what the language is even
| guaranteeing begin to erode and the standards adjust to define
| "undefined" behavior.
|
| But the correct solution wasn't to fix the compiler... it was to
| use the correct data type! Whether that feels like it should be
| size_t or uintptr_t I will refrain from providing an opinion on
| in this comment (to avoid bike-shedding), but the premise that
| the compiler is a billion little attempts to figure out how to
| make my code faster--rules that change constantly and are under-
| specified--when it is the code that is wrong is extremely
| frustrating.
|
| If you want to avoid some aliasing use case, or make math
| operations faster, or avoid issues with the calling convention on
| "this" pointers, or whatever it is... we should agree on what the
| code _should_ look like to make those semantics clear and then
| _forcibly deprecate as non-compliant_ any implementation that
| uses strange quirks to auto-detect and "improve" old code.
|
| Alternatively, compilers should support a special mode to "just
| do what the code says even if it is slow", and we
| should then guide developers in the direction that it is a "code
| smell" to stay in quirks mode: that if there is a way to make the
| "strict" compiler mode understand your code correctly and still
| be as fast as the one with tons of undefined behavior, it is your
| code that is at fault.
|
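| (Pieces of such a mode already exist as opt-in flags in GCC and
| Clang, though each covers only one category of UB:
|
|       -fwrapv                          (signed overflow wraps)
|       -fno-strict-aliasing             (no type-based aliasing rules)
|       -fno-delete-null-pointer-checks  (keep "redundant" null checks)
|
| What is missing is a single, standardized "just do what the code
| says" switch, plus agreement on what that would even mean for
| every operation.)
|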
| But like, the market incentives will never support this: people
| evaluating between two compilers are going to run benchmarks
| against projects that they have been using for decades and choose
| the compiler that "works better"; and meanwhile, convincing
| developers to stop using "int" or explicitly call out aliasing or
| whatever will never happen because people will benchmark the code
| against a compiler with all these bells and whistles and decide
| "the risk of making these modifications to every single place in
| our code where it might matter is too great as we might break
| something in the process and the compiler seems to compile our
| code just as fast without it being explicit (due to UB)" with a
| side of "the explicit syntax for that is caught up in committee
| and isn't supported on one of our key targets yet".
|
| So, I personally consider this all a "lost cause". But I also
| really want to point out that the root cause of all of this
| undefined behavior is due to an underlying incentive structure
| that your pet language isn't immune to: you just got to start off
| clean without a ton of legacy baggage code. You don't have a
| ridiculously large amount of code written with silly assumptions
| for 16-bit systems that has been carefully dragged through the
| ages into 64-bit, and you have a ton of explicit syntax already
| to prevent a lot of the other brutal cases in all new code... and
| there is also (probably) a single implementation of the compiler.
|
| But, one day, 20 years from now--after a billion lines of code
| have been written in your language, there are four big
| implementations of the compiler (only one of which still supports
| legacy ARM CPUs, as everyone has since moved on to KR7 or
| whatever is hip in 2040), and you've made a ton of sad choices in
| an attempt to hold your community together--someone is going to
| weigh the costs and benefits and decide "keeping existing
| code working but making it faster is more important--and wins us
| more marketshare--than telling people how to port their code to
| the new era".
|
| Now, I don't know what these cases are going to be--maybe
| something super complex involving how massively parallel
| execution environments handle order of execution for old code
| designed to run on a classic computer core, or how to map
| explicit integer representations into prime-order field elements
| for zero-knowledge execution, or what to do with information
| leakage during homomorphic encryption--but it is going to happen,
| and we are going to repeat all of these "quirks mode vs. strict
| mode" discussions each generation for every actually-important
| piece of technology, just as we have for standards as far afield
| from "programming" as HTML and TCP.
| sokoloff wrote:
| > like how older code uses "int" as the counter for every
| single for loop, despite this being an incredibly stupid type
| for this purpose
|
| Is it? I always internalized int as "the most natural sized
| number for the platform". For a lot of uses, that's a pretty
| reasonable choice, despite there being more specific/constrained
| types available now.
| seoaeu wrote:
| The problem is that the "most natural sized number for the
| platform" was 32-bits for so long on major platforms, that
| now everyone assumes that changing that will break too much
| for it to be worth switching
| msbarnett wrote:
| It's not about the size, it's about the signedness being
| presumably not natural for an index, and the potential for
| having very inconvenient consequences for performance on
| modern CPU architectures if you were not able to assume the
| index never goes negative due to a wrap.
|
| Basically, to get decent perf out of C loops, arithmetic
| overflow _has to_ be left as UB for int. If the standard
| index variable type had historically been unsigned and
| defined to terminate on overflow or something, you could make
| wrap-on-overflow standard for int without killing the
| performance of hot inner loops.
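|
| (A minimal sketch of the claim, assuming a 64-bit target:
|
|       void scale(float *a, long n) {
|           for (int i = 0; i < n; i++)   /* 32-bit signed index */
|               a[i] *= 2.0f;
|       }
|
| Because overflow of i is UB, the compiler may assume i only ever
| grows, keep it in a single 64-bit register, and turn a[i] into a
| plain pointer increment. If signed overflow were defined to wrap,
| i could legally jump back to INT_MIN, so the index would have to
| be re-sign-extended every iteration, which is exactly the hot-
| inner-loop cost described above.)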
| cpeterso wrote:
| int is signed and signed int overflow is UB. If you use an
| int as a loop counter, the compiler can optimize the loop
| assuming the loop counter will never overflow because,
| surely, a programmer would never invoke UB. :) If the loop
| counter does overflow, then who knows what bad things will
| happen. For example, a simple loop like "for (int i = 0; i >=
| 0; i++) { ... }" might become an infinite loop, crash,
| execute random code, or never execute in the first place.
| sokoloff wrote:
| I think it's defined to execute at least until overflow, is
| it not? (So, even assuming a 1-bit int, it's going to
| execute at least once I think.)
| Someone wrote:
| I think the first question to ask is whether we need a language
| that does away with array bounds checking.
|
| If, like C, you think you do, you accept (on currently popular
| CPUs) that code may read from and write to arbitrary memory, and
| that means you have to define that as "anything can happen".
|
| If you don't, I think you can get rid of the most common causes
| of UB.
|
| So, with modern compiler technology, are languages that check
| array bounds necessarily slower than those that don't? If so, is
| the difference large enough?
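|
| (Part of the answer is that "checking array bounds" does not have
| to mean one branch per access: when the trip count is known up
| front, a single check can cover the whole loop. A hand-written
| sketch of what a bounds-checking compiler can often do
| automatically:
|
|       #include <stdlib.h>
|
|       void halve(float *buf, size_t len, size_t n) {
|           if (n > len)
|               abort();                    /* one check, hoisted */
|           for (size_t i = 0; i < n; i++)  /* body runs check-free */
|               buf[i] *= 0.5f;
|       }
|
| The expensive cases are the ones where the compiler cannot prove
| the index is in range, e.g. data-dependent or indirect indices.)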
| spacechild1 wrote:
| > So, with modern compiler technology, are languages that check
| array bounds necessarily slower than those that aren't? If so,
| is the difference large enough?
|
| For performance sensitive tasks - definitely! I certainly don't
| want array bound checking in my (soft)realtime audio processing
| code.
|
| As a side note: when I write an audio plugin, it's the
| responsibility of the host application to tell me the correct
| buffer sizes, so I basically deal with trusted input. It's
| still possible to mess up, of course, but personally I found
| writing DSP code to be pretty safe.
| nayuki wrote:
| A short note about the content - the step of creating an out-of-
| bounds pointer is UB, and the step of dereferencing the pointer
| is also UB:
|
|       int *x = j + 32/4;
|       *x = 40;
|
| Thus, you can argue that if UB were to be banned, it would be
| better to crash at the point of creating the out-of-bounds
| pointer, instead of accommodating the out-of-bounds write by
| always reloading every read from memory in case of aliasing.
| Jweb_Guru wrote:
| How do you know it's out of bounds, in general? C pointers
| aren't required to store their size at runtime, and even if
| they did and you were inside the extent the pointer might have
| already been freed due to aliasing (which can be impossible to
| track if your language allows integer-to-pointer casts). Of
| course in some obvious cases, it would be good to detect this
| (and many compilers try), but in cases where it's obvious isn't
| it better to error at compile time rather than runtime?
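|
| (For the obvious cases both compilers already warn at compile
| time, e.g. via -Warray-bounds, usually in combination with
| optimization:
|
|       int j[4];
|
|       int main(void) {
|           int *x = j + 8;   /* forming this pointer is already UB */
|           *x = 40;          /* as is the store; whether a warning
|                                fires here depends on the compiler
|                                and optimization level */
|           return 0;
|       }
|
| The hard cases are the ones where the bound, lifetime, or
| provenance of the pointer is only known at runtime.)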
___________________________________________________________________
(page generated 2021-12-04 23:01 UTC)