[HN Gopher] Undefined Behavior deserves a better reputation (2021)
       ___________________________________________________________________
        
       Undefined Behavior deserves a better reputation (2021)
        
       Author : rramadass
       Score  : 40 points
       Date   : 2022-09-20 07:25 UTC (3 days ago)
        
 (HTM) web link (blog.sigplan.org)
 (TXT) w3m dump (blog.sigplan.org)
        
       | DougBTX wrote:
       | FWIW, the original link in the article shows that the
       | panic_bounds_check check is present in optimised code in Rust
       | 1.55, but if you update to a more recent Rust, say 1.60, then it
       | is optimised away as expected:
       | 
       |     example::mid:
       |             test    rsi, rsi
       |             je      .LBB0_1
       |             and     rsi, -2
       |             mov     edx, dword ptr [rdi + 2*rsi]
       |             mov     eax, 1
       |             ret
       |     .LBB0_1:
       |             xor     eax, eax
       |             ret
       | 
       | https://rust.godbolt.org/z/Wz4Prjed4
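
For readers without the link open: the listing above is consistent with a function of roughly the following shape. This is a reconstruction from the assembly, not a copy of the linked code. The name `mid` comes from the `example::mid` symbol; the signature and body are assumptions.

```rust
/// Returns the middle element of `data`, or `None` for an empty slice.
/// Signature and body are assumptions inferred from the assembly listing.
pub fn mid(data: &[i32]) -> Option<i32> {
    if data.is_empty() {
        return None;
    }
    // For a non-empty slice, `len / 2` is always less than `len`, so the
    // bounds check on this index can never fail. A sufficiently smart
    // optimiser can drop the check (and the panic path) entirely.
    Some(data[data.len() / 2])
}

fn main() {
    println!("{:?}", mid(&[1, 2, 3]));
    println!("{:?}", mid(&[]));
}
```

The indexing here is ordinary safe Rust; whether the check is removed depends on the compiler version, as the comment above describes.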
        
         | tialaramex wrote:
         | Interesting, do you know how this is achieved?
         | 
         | Is there a Rust optimization that notices logically this type
         | of construction is always in-bounds and so the bounds check is
         | never emitted, or is the LLVM bounds check smarter now and
         | realises hey, my parameter is never out of bounds, so no need
         | to emit code here?
        
       | teddyh wrote:
       | > _This post is about defending and promoting UB as a concept,
       | not UB in C/C++._
        
         | zaphar wrote:
         | This is important. He is arguing specifically that when made
         | explicit and opted into by the coder that UB can be useful. The
         | issues that C and C++ have are that it's too easy for the
         | developer to get opted in to UB by the compiler without knowing
         | it.
        
           | megous wrote:
           | So you just remember the relevant UBs defined in the C
           | standard:
           | 
           | https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1
           | 
           | There are not that many UBs in the base language. And even
           | fewer are relevant for day to day coding. Most of it is in
           | the std library.
        
             | tialaramex wrote:
             | Well, it's countably many, but there are about 200 of them,
             | so we're not even talking US states or US presidents, but
             | more like UN member states. I remember Togo exists, but I
             | can't point to it on a map, and if you left it _off_ a map
             | I wouldn't notice.
             | 
             | So we're asking quite a lot to say people should just
             | remember all of them _and actually use this knowledge not
             | just recite the list_. And these aren't small things, they
             | include very broad ideas, like 'object is referred to
             | outside of its lifetime' and even just categories of
             | deviations from the standard like 'A "shall" or "shall not"
             | requirement that appears outside of a constraint is
             | violated'
             | 
             | It's impressive that they enumerated them, A+ gold stars.
             | As I understand it WG21 (C++) is still attempting to
             | produce a list of the Undefined Behaviours in their
             | language - but I don't think "memorize all these vague
             | ideas and use that knowledge when programming" is
             | practical.
        
       | kazinator wrote:
       | > _I have presented Undefined Behavior as a tool that enables the
       | programmer to write code that the compiler cannot check for
       | correctness, and argued that -- used responsibly -- it is a
       | useful component in a language designer's toolbox._
       | 
       | That is simply wrong. Code that the compiler cannot or does not
       | check for correctness isn't "undefined behavior". Traditionally
       | that is called "unsafe code": the programmer ensures safety
       | through analyzing all the cases that may occur and making sure
       | they have reliable consequences.
       | 
       | (The article talks about Rust a lot; maybe undefined and unsafe
       | are the same thing in Rust?)
       | 
       | Undefined behavior with ill consequence may co-occur with safety.
       | A case of undefined behavior may be flagged by the compiler by a
       | diagnostic "undefined behavior in line 42". Thus, safety was
       | ensured; the situation was diagnosed. Yet, an executable program
       | could be produced anyway, and if that is run, it may misbehave
       | due to that undefined behavior in line 42.
       | 
       | This is the case in C, when you assign incompatible types, and
       | use a compiler like GCC which only warns about that, by default,
       | and translates anyway.
       | 
       | Some undefined behavior leads to a documented extension, which
       | makes it defined for the given implementation, and safe (if used
       | as documented). If you try to perform arithmetic on a void *
       | pointer, that is not an ISO C feature; it requires a diagnostic.
       | GCC allows void * pointer arithmetic; it behaves like byte
       | addressing. Programs which use this feature are invoking
       | undefined behavior: they are violating an ISO C constraint rule,
       | yet being executed anyway. Furthermore, if the diagnostic isn't
       | issued, then GCC is being a non-conforming implementation; it's a
       | non-conforming extension.
        
       | overgard wrote:
       | Here's my issue with UB in C/C++ (and I think it agrees with the
       | article): I would much rather have _direct compiler directives_
       | over the compiler being clever in a quiet way. The first example
       | in this article is very good -- get_unchecked avoids undefined
       | behavior (well, outside of accessing out of bounds), but still
       | provides the wanted optimization. If something goes bad in your
       | program, one of the first things you're going to suspect is
       | functions with _unchecked at the end of them, so the developer
       | ergonomics are great.
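
A minimal sketch of the explicit opt-in described above, using the standard library's `slice::get_unchecked`. The helper name `mid_unchecked` is hypothetical and only serves to illustrate the pattern; it is not taken from the article.

```rust
/// Hypothetical "middle element" helper that skips the bounds check.
pub fn mid_unchecked(data: &[i32]) -> Option<i32> {
    if data.is_empty() {
        return None;
    }
    // SAFETY: the slice is non-empty, so `len / 2 < len` and the index is
    // in bounds. Passing an out-of-bounds index to `get_unchecked` would be
    // undefined behavior, which is why the call needs an `unsafe` block.
    Some(unsafe { *data.get_unchecked(data.len() / 2) })
}

fn main() {
    println!("{:?}", mid_unchecked(&[4, 5, 6]));
}
```

The `unsafe` block and the `_unchecked` suffix are what make the assumption visible at the call site, so it is easy to audit when something goes wrong.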
       | 
       | My main problem with undefined behavior is that it's rare to know
       | when it's happening. I would much rather give hints to the
       | compiler than having it "prove" something very subtle underneath
       | the hood using UB, because then I understand where those things
       | are happening. And it's not even like it's a rare thing! My C++
       | is littered with "const" and "[[nodiscard]]" and #pragma pack and
       | all sorts of other things that don't change the behavior of the
       | program but do indicate something important to the compiler. I
       | want more of that, and less subtlety.
        
       | somat wrote:
       | "all it does is perform optimizations that are correct under the
       | extra assumption that there is no Undefined Behavior."
       | 
       | How did the compiler writers reach the point where they were able
       | to read "undefined behavior" as "does not happen" rather than
       | "specification left this undefined so that the compiler can
       | define it"
       | 
       | I don't think there is anything wrong with undefined behavior, I
       | do however expect the compiler documentation to have an extensive
       | section on stuff undefined by the specification.
       | 
       | for example: when a signed integer would exceed the maximum
       | value this compiler just lets it go and it does whatever the
       | underlying hardware would do. on amd64 this will overflow to a
       | negative value.
        
         | kevincox wrote:
         | Rather than look at it as "does not happen" it may be helpful
         | to think of "if this does happen anything is correct". So you
         | can then split the codepaths into parts where the compiler is
         | strictly specified in what it has to do (defined behaviour) and
         | parts where anything is "correct". Of course the most optimal
         | solution is to just produce the best code you can for the first
         | category, because that code also happens to be "correct" for
         | the second category.
         | 
         | Of course this is equivalent to "does not happen" but may make
         | more sense as to why the compiler acts this way.
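
One way to see the "anything is correct on that path" view in code is Rust's `std::hint::unreachable_unchecked`, which lets the programmer mark a path as one that must never run. The sketch below is only an illustration; `div_nonzero` is a hypothetical helper, and whether the optimiser actually removes the zero check depends on the compiler.

```rust
use std::hint::unreachable_unchecked;

/// Hypothetical helper: divides, with the caller promising `divisor != 0`.
///
/// # Safety
/// Calling this with `divisor == 0` is undefined behavior.
pub unsafe fn div_nonzero(value: u32, divisor: u32) -> u32 {
    if divisor == 0 {
        // SAFETY: the caller guarantees this branch is never taken.
        // Because reaching this call is UB, any code is "correct" here,
        // so the compiler may generate code only for the other path.
        unsafe { unreachable_unchecked() }
    }
    value / divisor
}

fn main() {
    // SAFETY: the divisor is a non-zero constant.
    let q = unsafe { div_nonzero(10, 2) };
    println!("{q}");
}
```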
        
         | tialaramex wrote:
         | > How did the compiler writers reach the point where they were
         | able to read "undefined behavior" as "does not happen" rather
         | than "specification left this undefined so that the compiler
         | can define it"
         | 
         | If they want the _compiler_ to define it, the specification can
         | say so, and on several things it says exactly that. This is
         | what the phrase "implementation defined" is for.
         | 
         | The example of Undefined Behaviour which is most commonly given
         | in this sort of argument - and you chose the same - is integer
         | overflow. Why can't it just do what I meant? And you're right,
         | the language _could_ have chosen to do what you meant here and
         | it did not. There are lots of options and you don't like the
         | option C chose. I'll talk about that in a moment.
         | 
         | But more generally, Undefined Behaviour isn't at all like that.
         | What happens if I cast my local telephone number to an integer
         | pointer and then dereference the pointer? Today that's
         | Undefined Behaviour in C, but if we think that means "the
         | compiler documentation should say" then _what_ should it say?
         | 
         | "It does whatever the underlying hardware would do" is a
         | circular definition, this is a computer program, what the
         | hardware does is _whatever we told it to do_. So maybe you say
         | well, it emits this particular machine code. Congratulations
         | now your "compiler documentation" reads like the source code
         | for the compiler _and_ your users still don't know the answer
         | because guess what, the CPU vendor can't do any better with
         | that either.
         | 
         | Now, back to those integer overflows. We can do a whole bunch
         | of things here, and they all have different consequences.
         | 
         | WUFFS says this is forbidden, you get a compiler error. If your
         | code can overflow, that's a bad WUFFS function and it does not
         | compile.
         | 
         | Several languages including Python just don't have overflow.
         | Your integer types just get bigger, this may be annoyingly slow
         | in some cases, but like actual integers from grade school you
         | won't just accidentally make one that's too big by mistake and
         | something weird happens.
         | 
         | We could wrap the integers as you seem to prefer. This is what
         | Rust's Wrapping<> types always do and several languages provide
         | alternate arithmetic operators to request wrapping
         | 
         | We could _saturate_ the integers, which means they stop at the
         | edge of the overflow. This is what Rust's Saturating<> types
         | do, and again I believe some languages have saturating
         | operators (at least addition and multiplication anyway).
         | 
         | We could make arithmetic operations all "checked" so that they
         | can fail if there would be overflow. The check could be in the
         | form of a soft error, or it could cause something more dramatic
         | like Rust's panic.
         | 
         | Or like C we can just say we refuse to define this, don't do
         | it.
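
The wrapping, saturating and checked options from the list above can be tried side by side in Rust. This is a small sketch, not a recommendation of any one policy; `add_with_policies` is a hypothetical helper name, while `Wrapping`, `saturating_add` and `checked_add` are standard library items.

```rust
use std::num::Wrapping;

/// Hypothetical helper that applies three overflow policies to `a + b`.
/// Returns (wrapped, saturated, checked).
fn add_with_policies(a: i32, b: i32) -> (i32, i32, Option<i32>) {
    // Wrapping: overflow wraps around in two's complement.
    let wrapped = (Wrapping(a) + Wrapping(b)).0;
    // Saturating: the result stops at the edge of the range.
    let saturated = a.saturating_add(b);
    // Checked: overflow is reported as a soft error (`None`).
    let checked = a.checked_add(b);
    (wrapped, saturated, checked)
}

fn main() {
    println!("{:?}", add_with_policies(i32::MAX, 1));
    println!("{:?}", add_with_policies(1, 2));
}
```

For comparison, a plain `a + b` on Rust integers panics on overflow when overflow checks are enabled (the default in debug builds) and wraps otherwise; neither outcome is undefined behavior.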
        
         | jcranmer wrote:
         | There's a note in the C99 rationale explaining how compilers
         | can use undefined behavior (specifically, with regards to
         | signed overflow) to perform certain optimizations
         | (specifically, reassociation of integer addition). I'd go back
         | to C89, but the documents before the work effort to make C99
         | are not available on the WG14 website.
         | 
         | > I do however expect the compiler documentation to have an
         | extensive section on stuff undefined by the specification.
         | 
         | There is an entire section of the C specification that lists
         | every single undefined behavior. What more do you want?
        
           | somat wrote:
           | My thought process, note that I am not a compiler writer,
           | runs along these lines.
           | 
           | undefined behavior is valid code(it is not a syntax error),
           | it has to do some thing, that thing should be documented, if
           | the spec does not want to document it, then the
           | implementation should. don't make the assumption that because
           | it is undefined it will never happen.
        
             | jcranmer wrote:
             | It is not practical to constrain what happens in the case
             | of undefined behavior, even if you discard the potential of
             | the compiler to optimize assuming undefined behavior can't
             | happen.
             | 
             | For example, what memory has and hasn't been written when a
             | trap occurs isn't well defined. Hell, on some architectures
             | (hi Alpha!),
             | an unknown amount of code will continue executing _after_
             | the trapping instruction before the trap handler gets
             | around to being invoked. Similar insanity is also in play
             | when data races are involved (if data races are undefined,
             | you can pretend that all code is sequentially consistent
             | and there is a nice, simple, global total order of memory
             | accesses. Data races cause memory accesses to not even be a
             | consistent partial order.)
             | 
             | Or take pointer provenance. Writing to an unknown memory
             | location may cause printf in another thread to instead call
             | system("rm -rf /").
             | 
             | How the hell is one supposed to document the potential
             | behavior of undefined behavior when undefined behavior is
             | so inherently unconstrainable, and "undefined behavior can
             | do anything" is considered unacceptable documentation?
        
             | pklausler wrote:
             | What you're talking about is "implementation defined"
             | behavior, which is a distinct concept from "undefined
             | behavior".
        
           | kps wrote:
           | https://www.lysator.liu.se/c/rat/title.html has the C89
           | Rationale.
           | 
           | "The terms _unspecified behavior_ , _undefined behavior_ ,
           | and _implementation-defined behavior_ are used to categorize
           | the result of writing programs whose properties the Standard
           | does not, or cannot, completely describe. [...] _Undefined
           | behavior_ gives the implementor license not to catch certain
           | program errors that are difficult to diagnose."
        
             | jcranmer wrote:
             | For the commentary on undefined behavior, I find N790
             | (https://www9.open-std.org/jtc1/sc22/wg14/www/docs/n790.htm)
             | to be a better assessment of what the C committee thinks
             | about undefined behavior:
             | 
             | First consider the term "unspecified behavior". Most
             | commentators on the Standard are of the opinion that this
             | has the following properties:
             | 
             | (1) There are a number of possible courses of actions, or
             | the behavior is one that generates a result and then has a
             | number of possible results.
             | 
             | (2) The implementation can make any of the available
             | choices, and can make different choices at different places
             | or times.
             | 
             | (3) The implementation need not document its choices.
             | 
             | (4) No matter what choice the implementation makes, it
             | cannot affect anything outside the range of that choice. If
             | a value has to be chosen, it must be a valid value for that
             | type.
             | 
             | Property number 4 is the interesting one: it is usually
             | taken to mean that the implementation cannot generate a
             | spurious signal, branch to a random place in the code, or
             | choose a trap representation. All of these, of course, are
             | valid "undefined behavior".
        
           | fsckboy wrote:
           | >> _"specification left this undefined so that the compiler
           | can define it"_
           | 
           | >> _I do however expect the compiler documentation to have an
           | extensive section on stuff undefined by the specification_
           | and he should have said _which is defined by the compiler_
           | 
           | so your response is not to what he meant
           | 
           | > _There is an entire section of the C specification that
           | lists every single undefined behavior. What more do you
           | want?_
           | 
           | a similar section wrt the compiler he is using, because he's
           | talking about "behavior undefined by C but defined by
           | implementation"
        
             | jcranmer wrote:
             | > a similar section wrt the compiler he is using, because
             | he's talking about "behavior undefined by C but defined by
             | implementation"
             | 
             | That's called implementation-defined behavior. And there is
             | a section in C that lists all the implementation-defined
             | behavior, and all the compilers are supposed to document
             | what they define the implementation-defined behavior to be
             | (glares at Clang).
        
         | kps wrote:
         | I've said before[1] that I'm convinced that 'undefined behavior'
         | was an actual mistake by the C89 committee that would not have
         | been accepted if anyone at the time had realized the future
         | implications. It is precisely "a license for the compiler to
         | undertake aggressive optimizations that are completely legal by
         | the committee's rules, but make hash of apparently safe
         | programs" (as dmr said of `noalias`[2]).
         | 
         | [1] https://news.ycombinator.com/item?id=30024867
         | 
         | [2] https://www.lysator.liu.se/c/dmr-on-noalias.html
        
           | rramadass wrote:
           | > was an actual mistake by the C89 committee
           | 
           | Agreed. Apparently the first compiler written by Dennis
           | Ritchie had no UB (I have not been able to find more info on
           | this).
           | 
           | See also the article published in IEEE: _Dealing With C's
           | Original Sin by Chris Hathhorn and Grigore Rosu._
        
         | jessermeyer wrote:
         | Compiler, Language, and Software engineer people do not often
         | overlap, and so neither do their constraints. Over time you see
         | opposing drift occurring, where language people end up
         | designing languages nice for language development, software
         | engineers design software nice for software development, and so
         | on.
         | 
         | Consider what is nice for compiler development but is not nice
         | for software engineering.
        
         | mort96 wrote:
         | The specification has a separate concept for "specification
         | left this undefined so that the compiler can define it":
         | implementation-defined behaviour.
        
       | raphlinus wrote:
       | The main distinction I'd draw is whether you're doing formal or
       | informal reasoning. In the formal reasoning world, undefined
       | behavior is mostly a good thing. On one side of the contract,
       | it's a set of proof obligations for the code, and not
       | _especially_ onerous in the grand scheme of things - correct
       | programs won't do UB. On the other side, it's a clear statement
       | of what the compiler is and is not allowed to optimize. The more
       | UB, the more opportunities to optimize.
       | 
       | When you're doing informal reasoning, the calculus changes.
       | There's all kinds of stuff that can go wrong that is _not_
       | motivated by what the machine is actually doing. In fact, it's
       | something of a nightmare. Doing a memcpy of a struct that has
       | padding in it? What are the exact semantics of restrict? And
       | threading. Benign data races used to be a thing, but in an
       | undefined behavior world, it's game over. C makes things worse
       | than they need to be with its wonky integer rules (left shift of
       | a negative integer and wrapping multiplication of two unsigned
       | shorts are both UB), but a lot of that is potentially fixable and
       | those mistakes won't be repeated in new languages.
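
The unsigned short case is worth a worked example. In C, both operands are promoted to signed `int` before the multiplication, so the product can overflow `int` even though the source types are unsigned. The sketch below models that arithmetic in Rust, assuming a platform with 16-bit `unsigned short` and 32-bit `int`; `promoted_mul` is a hypothetical name.

```rust
/// Models C's `unsigned short * unsigned short` under the assumption of a
/// 16-bit `unsigned short` and a 32-bit `int`: both operands are promoted
/// to signed `int` before multiplying. Returns `None` where the C
/// expression would overflow `int`, which is undefined behavior in C.
fn promoted_mul(a: u16, b: u16) -> Option<i32> {
    i32::from(a).checked_mul(i32::from(b))
}

fn main() {
    // 300 * 300 = 90_000 fits comfortably in a 32-bit int.
    println!("{:?}", promoted_mul(300, 300));
    // 65_535 * 65_535 = 4_294_836_225 exceeds i32::MAX (2_147_483_647).
    println!("{:?}", promoted_mul(u16::MAX, u16::MAX));
}
```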
       | 
       | In the context of Rust, more undefined behavior makes sense, and
       | Ralf's work takes us much closer to a solid spec. But when you're
       | doing mostly informal reasoning, I can see why people are so
       | emotionally against it, and decisions such as turning off strict
       | aliasing might be justified.
        
         | AlotOfReading wrote:
         | I don't think that's a useful distinction here. For context, I
         | often work with high reliability software, including formal
         | methods.
         | 
         | What's needed for actual programs is the ability to say one of
         | two things:
         | 
         | A) There is no UB in program X or
         | 
         | B) UB in program X cannot lead to a violation of constraint Y
         | 
         | The current situation in the C family, whether you're using
         | formal methods or not, is that you _cannot_ generally prove (A)
         | and the time traveling, no-holds-barred results of UB in the
         | spec means that (B) is impossible.
         | 
         | While rust doesn't entirely solve this, the fact that those
         | statements are true everywhere except unsafe means that the
         | scope of code you have to manually review is limited to
         | something smaller than "everything".
        
           | tialaramex wrote:
           | > the fact that those statements are true everywhere except
           | unsafe
           | 
           | This is not only a technical feature of Rust's standard
           | library, but perhaps more importantly a _cultural_ feature of
           | Rust's ecosystem. The compiler has no technical problem with
           | your "safe" implementation of Index for your type actually
           | just doing unsafe pointer dereferences internally and
           | trusting users to always pick valid indices, just like C.
           | It's a bad idea, but the compiler is not a cop. However
           | Rust's _culture_ says if you're providing unsafe stuff that
           | must be marked unsafe so that other people don't cut
           | themselves on the sharp edges of your code by mistake.
           | 
           | You can imagine with a different culture, you'd end up with
           | popular code that's labelled "safe" but has UB all over the
           | place because the interfaces lie everywhere as an
           | "optimisation" and the community just puts up with it.
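
To make the Index example concrete, here is a sketch of both styles. The type names `Unsound` and `Sound` and the method `at_unchecked` are hypothetical. The first impl compiles without complaint even though safe callers can trigger UB through it; the second keeps the safe interface checked and puts the unchecked access behind an `unsafe fn`.

```rust
use std::ops::Index;

/// Hypothetical wrapper whose "safe" `Index` impl lies about its contract.
pub struct Unsound(pub Vec<i32>);

impl Index<usize> for Unsound {
    type Output = i32;
    fn index(&self, i: usize) -> &i32 {
        // The compiler accepts this, but an out-of-range `i` coming from
        // safe code is undefined behavior. Do not write interfaces like this.
        unsafe { self.0.get_unchecked(i) }
    }
}

/// Same storage, but the sharp edge is labelled.
pub struct Sound(pub Vec<i32>);

impl Index<usize> for Sound {
    type Output = i32;
    fn index(&self, i: usize) -> &i32 {
        // Bounds-checked: an invalid index panics instead of causing UB.
        &self.0[i]
    }
}

impl Sound {
    /// # Safety
    /// `i` must be less than the number of stored elements.
    pub unsafe fn at_unchecked(&self, i: usize) -> &i32 {
        // SAFETY: guaranteed by the caller, per the contract above.
        unsafe { self.0.get_unchecked(i) }
    }
}

fn main() {
    let lying = Unsound(vec![1, 2, 3]);
    let honest = Sound(vec![1, 2, 3]);
    // Only in-bounds indices are used here, so both are well defined.
    println!("{} {}", lying[1], honest[1]);
    // SAFETY: index 2 is within the three stored elements.
    println!("{}", unsafe { honest.at_unchecked(2) });
}
```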
        
         | jcranmer wrote:
         | While I do sympathize with some of the user complaints with UB,
         | and the issues with things like signed integer overflow and
         | strict aliasing seem entirely gratuitous, I think most users
         | complaining about UB fail to comprehend that the issue with UB
         | is that it's often really hard to constrain just what can
         | possibly go wrong--and that's even without compiler
         | optimizations kicking into play.
         | 
         | It should be pretty clear that memory unsafety produces all
         | sorts of crazy havoc--a write to errant memory could overwrite
         | stack return locations and then basically do whatever it wants
         | given the power of ROP gadgets. At first glance, it looks like
         | uninitialized memory is "merely" an issue of reading more or
         | less random data, but there are cases (e.g., MADV_FREE) where
         | it turns out that the value of uninitialized memory can change
         | underneath you. Traps cause lots of program state to become
         | rather indeterminate, simply because of what may or may not
         | live in a register or in memory, but on some architectures
         | (e.g., Alpha), code may keep running for a while after an
         | instruction traps, to the point that you're no longer even in
         | the same function. Sanely describing what happens in data races
         | is beyond the ken of formal semanticists (see the
         | still-unsettled discussions over the semantics of relaxed
         | atomics); what hope do programmers have of reasoning about these
         | memory semantics?
         | 
         | It also doesn't help that the distinction between undefined,
         | unspecified, and implementation-defined behavior is poorly
         | grasped by a large segment of the community.
        
       | blueflow wrote:
       | I thought Rust doesn't have a specification, just a reference? So
       | all of it is undefined behavior?
        
         | bruce343434 wrote:
         | I'd say all Rust is implementation defined
        
         | yccs27 wrote:
         | "Undefined behavior" is a term of art[0], with the specific
         | meaning as mentioned in the article: The compiler is allowed to
         | assume that UB never happens, and change the compiled code
         | based on that assumption. Not "no one has written down a
         | definition" or anything else.
         | 
         | As the sibling comment points out, a contrasting term is
         | "implementation-defined". Confusing the two is a common
         | misconception when learning C++: You might expect "overflow is
         | undefined behavior" to mean that there is no prescribed result
         | and every compiler might do it differently, but each will
         | produce results in its own consistent way. But that would be
         | implementation-defined behavior; instead by doing unchecked
         | addition, you tell the compiler that you _know_ the addition
         | will never overflow, and don't care at all about the
         | overflowing case.
         | 
         | [0] aka. "improper noun", see
         | https://news.ycombinator.com/item?id=32673100
        
           | yccs27 wrote:
           | Followup to be precise: Undefined behavior in C++, of course,
           | _can_ have a definition set by the implementation. With the
           | right switches, GCC will guarantee certain overflow
           | behaviors. But a priori, the compiler is not bound by any
           | guarantees.
        
         | planede wrote:
         | IMO differentiating a specification and reference this way is
         | just nitpicking.
         | 
         | However, it looks like at least unsafe Rust is underspecified
         | regarding aliasing rules, therefore a bunch of unsafe Rust is
         | undefined behavior by definition. That is, no authoritative
         | text (whether reference or specification) defines the behavior
         | of those programs.
         | 
         | Key paragraph from the article:
         | 
         | > Stacked Borrows is not part of the Rust spec, and is not the
         | final word for aliasing-related UB in Rust. So there is still
         | the chance that future revisions of this model can be made to
         | better align with programmer intuition. The above code might
         | get accepted because x2 is not actually being used to access
         | memory. Or maybe &mut expr should only make such promises when
         | used outside an unsafe block -- but then, should adding unsafe
         | really change the semantics of the program? As usual, language
         | design is a game of trade-offs.
        
       ___________________________________________________________________
       (page generated 2022-09-23 23:02 UTC)