[HN Gopher] Undefined Behavior deserves a better reputation (2021)
___________________________________________________________________
Undefined Behavior deserves a better reputation (2021)
Author : rramadass
Score : 40 points
Date : 2022-09-20 07:25 UTC (3 days ago)
(HTM) web link (blog.sigplan.org)
(TXT) w3m dump (blog.sigplan.org)
| DougBTX wrote:
| FWIW, the original link in the article shows that the
| panic_bounds_check check is present in optimised code in Rust
| 1.55, but if you update to a more recent Rust, say 1.60, then it
| is optimised away as expected:
|
|     example::mid:
|             test    rsi, rsi
|             je      .LBB0_1
|             and     rsi, -2
|             mov     edx, dword ptr [rdi + 2*rsi]
|             mov     eax, 1
|             ret
|     .LBB0_1:
|             xor     eax, eax
|             ret
|
| https://rust.godbolt.org/z/Wz4Prjed4
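For readers who want the source next to the assembly: the thread does not quote the Rust function itself, so the following is only a sketch of a function that would plausibly compile to the listing above. The signature is an assumption inferred from the assembly (slice pointer in rdi, length in rsi, the `Option` tag returned in eax and the value in edx).

```rust
/// Returns the middle element of `data`, or `None` if the slice is empty.
/// (Assumed shape of `example::mid`; the thread only shows its assembly.)
pub fn mid(data: &[i32]) -> Option<i32> {
    if data.is_empty() {
        return None;
    }
    // For a non-empty slice, `len / 2` is always smaller than `len`, so the
    // bounds check implied by this indexing expression can never fail.
    Some(data[data.len() / 2])
}
```

Calling `mid(&[1, 2, 3])` yields `Some(2)`. Whether the compiled code still contains a call to `panic_bounds_check` depends on whether the optimiser manages to prove the comment's claim for itself.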
| tialaramex wrote:
| Interesting, do you know how this is achieved?
|
| Is there a Rust optimization that notices logically this type
| of construction is always in-bounds and so the bounds check is
| never emitted, or is the LLVM bounds check smarter now and
| realises hey, my parameter is never out of bounds, so no need
| to emit code here?
| teddyh wrote:
| > _This post is about defending and promoting UB as a concept,
| not UB in C/C++._
| zaphar wrote:
| This is important. He is arguing specifically that when made
| explicit and opted into by the coder that UB can be useful. The
| issues that C and C++ have are that it's too easy for the
| developer to get opted in to UB by the compiler without knowing
| it.
| megous wrote:
| So you just remember the relevant UBs defined in the C
| standard:
|
| https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1
|
| There are not that many UBs in the base language. And even
| fewer are relevant for day to day coding. Most of it is in
| the std library.
| tialaramex wrote:
| Well, it's countably many, but there are about 200 of them,
| so we're not even talking US states or US presidents, but
| more like UN member states. I remember Togo exists, but I
| can't point to it on a map, and if you left it _off_ a map
| I wouldn't notice.
|
| So we're asking quite a lot to say people should just
| remember all of them _and actually use this knowledge not
| just recite the list_. And these aren't small things, they
| include very broad ideas, like 'object is referred to
| outside of its lifetime' and even just categories of
| deviations from the standard like 'A "shall" or "shall not"
| requirement that appears outside of a constraint is
| violated'
|
| It's impressive that they enumerated them, A+ gold stars.
| As I understand it WG21 (C++) is still attempting to
| produce a list of the Undefined Behaviours in their
| language - but I don't think "memorize all these vague
| ideas and use that knowledge when programming" is
| practical.
| kazinator wrote:
| > _I have presented Undefined Behavior as a tool that enables the
| programmer to write code that the compiler cannot check for
| correctness, and argued that -- used responsibly -- it is a
| useful component in a language designer's toolbox._
|
| That is simply wrong. Code that the compiler cannot or does not
| check for correctness isn't "undefined behavior". Traditionally
| that is called "unsafe code": the programmer ensures safety
| through analyzing all the cases that may occur and making sure
| they have reliable consequences.
|
| (The article talks about Rust a lot; maybe undefined and unsafe
| are the same thing in Rust?)
|
| Undefined behavior with ill consequence may co-occur with safety.
| A case of undefined behavior may be flagged by the compiler by a
| diagnostic "undefined behavior in line 42". Thus, safety was
| ensured; the situation was diagnosed. Yet, an executable program
| could be produced anyway, and if that is run, it may misbehave
| due to that undefined behavior in line 42.
|
| This is the case in C, when you assign incompatible types, and
| use a compiler like GCC which only warns about that, by default,
| and translates anyway.
|
| Some undefined behavior leads to a documented extension, which
| makes it defined for the given implementation, and safe (if used
| as documented). If you try to perform arithmetic on a void *
| pointer, that is not an ISO C feature; it requires a diagnostic.
| GCC allows void * pointer arithmetic; it behaves like byte
| addressing. Programs which use this feature are invoking
| undefined behavior: they are violating an ISO C constraint rule,
| yet being executed anyway. Furthermore, if the diagnostic isn't
| issued, then GCC is being a non-conforming implementation; it's a
| non-conforming extension.
| overgard wrote:
| Here's my issue with UB in C/C++ (and I think it agrees with the
| article): I would much rather have _direct compiler directives_
| over the compiler being clever in a quiet way. The first example
| in this article is very good -- get_unchecked avoids undefined
| behavior (well, outside of accessing out of bounds), but still
| provides the wanted optimization. If something goes bad in your
| program, one of the first things you're going to suspect is
| functions with _unchecked at the end of them, so the developer
| ergonomics are great.
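A minimal sketch of the opt-in style described here, using the standard library's `get_unchecked` on slices. The function name `mid_unchecked` is made up for illustration and is not taken from the article.

```rust
/// Middle element of a slice, with the bounds check removed by hand.
/// `mid_unchecked` is a hypothetical name used only for this sketch.
pub fn mid_unchecked(data: &[i32]) -> Option<i32> {
    if data.is_empty() {
        return None;
    }
    // SAFETY: the slice is non-empty, so `len / 2 < len` and the index is in
    // bounds. If that reasoning were wrong, this call would be Undefined
    // Behavior, and the `unsafe` keyword marks exactly where to look.
    Some(unsafe { *data.get_unchecked(data.len() / 2) })
}
```

The promise to the compiler is visible at the call site: a reviewer hunting for a memory bug can search for `unsafe` and `_unchecked` instead of auditing every indexing expression.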
|
| My main problem with undefined behavior is that it's rare to know
| when it's happening. I would much rather give hints to the
| compiler than having it "prove" something very subtle underneath
| the hood using UB, because then I understand where those things
| are happening. And it's not even like it's a rare thing! My C++
| is littered with "const" and "[[nodiscard]]" and #pragma pack and
| all sorts of other things that don't change the behavior of the
| program but do indicate something important to the compiler. I
| want more of that, and less subtlety.
| somat wrote:
| "all it does is perform optimizations that are correct under the
| extra assumption that there is no Undefined Behavior."
|
| How did the compiler writers reach the point where they were able
| to read "undefined behavior" as "does not happen" rather than
| "specification left this undefined so that the compiler can
| define it"
|
| I don't think there is anything wrong with undefined behavior, I
| do however expect the compiler documentation to have an extensive
| section on stuff undefined by the specification.
|
| for example: when a signed integer would exceed the maximum
| value this compiler just lets it go and it does whatever the
| underlying hardware would do. on amd64 this will overflow to a
| negative value.
| kevincox wrote:
| Rather than look at it as "does not happen" it may be helpful
| to think of "if this does happen anything is correct". So you
| can then split the codepaths into parts where the compiler is
| strictly specified in what it has to do (defined behaviour) and
| parts where anything is "correct". Of course the most optimal
| solution is to just produce the best code you can for the first
| category, because whatever code that is also happens to be
| "correct" for the second category.
|
| Of course this is equivalent to "does not happen" but may make
| more sense as to why the compiler acts this way.
| tialaramex wrote:
| > How did the compiler writers reach the point where they were
| able to read "undefined behavior" as "does not happen" rather
| than "specification left this undefined so that the compiler
| can define it"
|
| If they want the _compiler_ to define it, the specification can
| say so, and on several things it says exactly that. This is
| what the phrase "implementation defined" is for.
|
| The example of Undefined Behaviour which is most commonly given
| in this sort of argument - and you chose the same - is integer
| overflow. Why can't it just do what I meant? And you're right,
| the language _could_ have chosen to do what you meant here and
| it did not. There are lots of options and you don't like the
| option C chose. I'll talk about that in a moment.
|
| But more generally, Undefined Behaviour isn't at all like that.
| What happens if I cast my local telephone number to an integer
| pointer and then dereference the pointer? Today that's
| Undefined Behaviour in C, but if we think that means "the
| compiler documentation should say" then _what_ should it say?
|
| "It does whatever the underlying hardware would do" is a
| circular definition, this is a computer program, what the
| hardware does is _whatever we told it to do_. So maybe you say
| well, it emits this particular machine code. Congratulations
| now your "compiler documentation" reads like the source code
| for the compiler _and_ your users still don't know the answer
| because guess what, the CPU vendor can't do any better with
| that either.
|
| Now, back to those integer overflows. We can do a whole bunch
| of things here, and they all have different consequences.
|
| WUFFS says this is forbidden, you get a compiler error. If your
| code can overflow, that's a bad WUFFS function it does not
| compile.
|
| Several languages including Python just don't have overflow.
| Your integer types just get bigger, this may be annoyingly slow
| in some cases, but like actual integers from grade school you
| won't just accidentally make one that's too big by mistake and
| something weird happens.
|
| We could wrap the integers as you seem to prefer. This is what
| Rust's Wrapping<> types always do and several languages provide
| alternate arithmetic operators to request wrapping
|
| We could _saturate_ the integers, which means they stop at the
| edge of the overflow. This is what Rust's Saturating<> types
| do, and again I believe some languages have saturating
| operators (at least addition and multiplication anyway)
|
| We could make arithmetic operations all "checked" so that they
| can fail if there would be overflow. The check could be in the
| form of a soft error, or it could cause something more dramatic
| like Rust's panic.
|
| Or like C we can just say we refuse to define this, don't do
| it.
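Three of the options listed above (wrapping, saturating, checked) can be tried directly in Rust. This is a small sketch using `std::num::Wrapping` and the `saturating_add` / `checked_add` methods on `i32`; the function names are invented for illustration.

```rust
use std::num::Wrapping;

/// Wrapping: overflow wraps around modulo 2^32.
pub fn add_one_wrapping(x: i32) -> i32 {
    (Wrapping(x) + Wrapping(1)).0
}

/// Saturating: the result stops at the edge of the representable range.
pub fn add_one_saturating(x: i32) -> i32 {
    x.saturating_add(1)
}

/// Checked: overflow is reported as `None` (a "soft error").
pub fn add_one_checked(x: i32) -> Option<i32> {
    x.checked_add(1)
}
```

With `x = i32::MAX` these return `i32::MIN`, `i32::MAX` and `None` respectively. Plain `x + 1` in Rust is none of these by name: it panics on overflow in debug builds and, by default, wraps in release builds.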
| jcranmer wrote:
| There's a note in the C99 rationale explaining how compilers
| can use undefined behavior (specifically, with regards to
| signed overflow) to perform certain optimizations
| (specifically, reassociation of integer addition). I'd go back
| to C89, but the documents before the work effort to make C99
| are not available on the WG14 website.
|
| > I do however expect the compiler documentation to have an
| extensive section on stuff undefined by the specification.
|
| There is an entire section of the C specification that lists
| every single undefined behavior. What more do you want?
| [deleted]
| somat wrote:
| My thought process, note that I am not a compiler writer,
| runs along these lines.
|
| undefined behavior is valid code(it is not a syntax error),
| it has to do some thing, that thing should be documented, if
| the spec does not want to document it, then the
| implementation should. don't make the assumption that because
| it is undefined it will never happen.
| jcranmer wrote:
| It is not practical to constrain what happens in the case
| of undefined behavior, even if you discard the potential of
| the compiler to optimize assuming undefined behavior can't
| happen.
|
| For example, what memory has and hasn't been written when a
| trap occurs isn't well defined. Hell, on some architectures (hi Alpha!),
| an unknown amount of code will continue executing _after_
| the trapping instruction before the trap handler gets
| around to being invoked. Similar insanity is also in play
| when data races are involved (if data races are undefined,
| you can pretend that all code is sequentially consistent
| and there is a nice, simple, global total order of memory
| accesses. Data races cause memory accesses to not even be a
| consistent partial order.)
|
| Or take pointer provenance. Writing to an unknown memory
| location may cause printf in another thread to instead call
| system("rm -rf /").
|
| How the hell is one supposed to document the potential
| behavior of undefined behavior when undefined behavior is
| so inherently unconstrainable, and "undefined behavior can
| do anything" is considered unacceptable documentation?
| pklausler wrote:
| What you're talking about is "implementation defined"
| behavior, which is a distinct concept from "undefined
| behavior".
| kps wrote:
| https://www.lysator.liu.se/c/rat/title.html has the C89
| Rationale.
|
| "The terms _unspecified behavior_ , _undefined behavior_ ,
| and _implementation-defined behavior_ are used to categorize
| the result of writing programs whose properties the Standard
| does not, or cannot, completely describe. [...] _Undefined
| behavior_ gives the implementor license not to catch certain
| program errors that are difficult to diagnose."
| jcranmer wrote:
| For the commentary on undefined behavior, I find N790
| (https://www9.open-std.org/jtc1/sc22/wg14/www/docs/n790.htm) to be a better
| assessment of what the C committee thinks about undefined
| behavior:
|
| First consider the term "unspecified behavior". Most
| commentators on the Standard are of the opinion that this
| has the following properties:
|
| (1) There are a number of possible courses of actions, or
| the behavior is one that generates a result and then has a
| number of possible results.
|
| (2) The implementation can make any of the available
| choices, and can make different choices at different places
| or times.
|
| (3) The implementation need not document its choices.
|
| (4) No matter what choice the implementation makes, it
| cannot affect anything outside the range of that choice. If
| a value has to be chosen, it must be a valid value for that
| type.
|
| Property number 4 is the interesting one: it is usually
| taken to mean that the implementation cannot generate a
| spurious signal, branch to a random place in the code, or
| choose a trap representation. All of these, of course, are
| valid "undefined behavior".
| fsckboy wrote:
| >> _"specification left this undefined so that the compiler
| can define it"_
|
| >> _I do however expect the compiler documentation to have an
| extensive section on stuff undefined by the specification_
| and he should have said _which is defined by the compiler_
|
| so your response is not to what he meant
|
| > _There is an entire section of the C specification that
| lists every single undefined behavior. What more do you
| want?_
|
| a similar section wrt the compiler he is using, because he's
| talking about "behavior undefined by C but defined by
| implementation"
| jcranmer wrote:
| > a similar section wrt the compiler he is using, because
| he's talking about "behavior undefined by C but defined by
| implementation"
|
| That's called implementation-defined behavior. And there is
| a section in C that lists all the implementation-defined
| behavior, and all the compilers are supposed to document
| what they define the implementation-defined behavior to be
| (glares at Clang).
| [deleted]
| kps wrote:
| I've said before[1] that I'm convinced that 'undefined behavior'
| was an actual mistake by the C89 committee that would not have
| been accepted if anyone at the time had realized the future
| implications. It is precisely "a license for the compiler to
| undertake aggressive optimizations that are completely legal by
| the committee's rules, but make hash of apparently safe
| programs" (as dmr said of `noalias`[2]).
|
| [1] https://news.ycombinator.com/item?id=30024867
|
| [2] https://www.lysator.liu.se/c/dmr-on-noalias.html
| rramadass wrote:
| > was an actual mistake by the C89 committee
|
| Agreed. Apparently the first compiler written by Dennis
| Ritchie had no UB (I have not been able to find more info. on
| this).
|
| See also the article published in IEEE : _Dealing With C's
| Original Sin by Chris Hathhorn and Grigore Rosu._
| jessermeyer wrote:
| Compiler, Language, and Software engineer people do not often
| overlap, and so neither do their constraints. Over time you see
| opposing drift occurring, where language people end up
| designing languages nice for language development, software
| engineers design software nice for software development, and so
| on.
|
| Consider what is nice for compiler development but is not nice
| for software engineering.
| mort96 wrote:
| The specification has a separate concept for "specification
| left this undefined so that the compiler can define it":
| implementation-defined behaviour.
| raphlinus wrote:
| The main distinction I'd draw is whether you're doing formal or
| informal reasoning. In the formal reasoning world, undefined
| behavior is mostly a good thing. On one side of the contract,
| it's a set of proof obligations for the code, and not
| _especially_ onerous in the grand scheme of things - correct
| programs won't do UB. On the other side, it's a clear statement
| of what the compiler is and is not allowed to optimize. The more
| UB, the more opportunities to optimize.
|
| When you're doing informal reasoning, the calculus changes.
| There's all kinds of stuff that can go wrong that is _not_
| motivated by what the machine is actually doing. In fact, it's
| something of a nightmare. Doing a memcpy of a struct that has
| padding in it? What are the exact semantics of restrict? And
| threading. Benign data races used to be a thing, but in an
| undefined behavior world, it's game over. C makes things worse
| than they need to be with its wonky integer rules (left shift of
| a negative integer and wrapping multiplication of two unsigned
| shorts are both UB), but a lot of that is potentially fixable and
| those mistakes won't be repeated in new languages.
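On the "benign data race" point: in Rust, a racy statistics counter of the kind that used to be written with plain loads and stores is normally expressed with relaxed atomics, which keeps the cheap semantics but stays defined. A minimal sketch, with `count_hits` as a made-up name:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns `threads` workers that each bump a shared counter `per_thread`
/// times, then returns the final count.
pub fn count_hits(threads: usize, per_thread: usize) -> usize {
    let hits = Arc::new(AtomicUsize::new(0));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let hits = Arc::clone(&hits);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed imposes no ordering on other memory accesses,
                    // but the increment itself is atomic, so concurrent
                    // updates are not a data race and none are lost.
                    hits.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for worker in workers {
        // Joining a thread synchronizes with everything it did.
        worker.join().expect("worker thread panicked");
    }
    hits.load(Ordering::Relaxed)
}
```

`count_hits(4, 1000)` returns 4000. The same loop written with an unsynchronised non-atomic integer would be a data race, which safe Rust rejects at compile time and which is UB in C and C++.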
|
| In the context of Rust, more undefined behavior makes sense, and
| Ralf's work takes us much closer to a solid spec. But when you're
| doing mostly informal reasoning, I can see why people are so
| emotionally against it, and decisions such as turning off strict
| aliasing might be justified.
| AlotOfReading wrote:
| I don't think that's a useful distinction here. For context, I
| often work with high reliability software, including formal
| methods.
|
| What's needed for actual programs is the ability to say one of
| two things:
|
| A) There is no UB in program X or
|
| B) UB in program X cannot lead to a violation of constraint Y
|
| The current situation in the C family, whether you're using
| formal methods or not, is that you _cannot_ generally prove (A)
| and the time traveling, no-holds-barred results of UB in the
| spec means that (B) is impossible.
|
| While Rust doesn't entirely solve this, the fact that those
| statements are true everywhere except unsafe means that the
| scope of code you have to manually review is limited to
| something smaller than "everything".
| tialaramex wrote:
| > the fact that those statements are true everywhere except
| unsafe
|
| This is not only a technical feature of Rust's standard
| library, but perhaps more importantly a _cultural_ feature of
| Rust's ecosystem. The compiler has no technical problem with
| your "safe" implementation of Index for your type actually
| just doing unsafe pointer dereferences internally and
| trusting users to always pick valid indices, just like C.
| It's a bad idea, but the compiler is not a cop. However
| Rust's _culture_ says if you're providing unsafe stuff that
| must be marked unsafe so that other people don't cut
| themselves on the sharp edges of your code by mistake.
|
| You can imagine with a different culture, you'd end up with
| popular code that's labelled "safe" but has UB all over the
| place because the interfaces lie everywhere as an
| "optimisation" and the community just puts up with it.
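A sketch of the convention described above, using a hypothetical wrapper type `Samples`: the safe `Index` implementation keeps its bounds check, and the unchecked accessor is a separate function that is itself marked `unsafe`.

```rust
use std::ops::Index;

/// Hypothetical wrapper type, used only for illustration.
pub struct Samples(Vec<i32>);

impl Samples {
    pub fn new(values: Vec<i32>) -> Self {
        Samples(values)
    }

    /// Unchecked access, advertised as such.
    ///
    /// # Safety
    /// `i` must be less than the number of stored samples.
    pub unsafe fn get_unchecked(&self, i: usize) -> &i32 {
        // SAFETY: the caller promises that `i` is in bounds.
        unsafe { self.0.get_unchecked(i) }
    }
}

impl Index<usize> for Samples {
    type Output = i32;

    fn index(&self, i: usize) -> &i32 {
        // The safe interface keeps the bounds check: a bad index panics
        // instead of causing Undefined Behavior.
        &self.0[i]
    }
}
```

Nothing in the compiler would stop `index` from calling `self.0.get_unchecked(i)` inside an `unsafe` block without checking anything. The code would still compile and the interface would still look safe, which is why this is a matter of convention as much as of tooling.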
| jcranmer wrote:
| While I do sympathize with some of the user complaints with UB,
| and the issues with things like signed integer overflow and
| strict aliasing seem entirely gratuitous, I think most users
| complaining about UB fail to comprehend that the issue with UB
| is that it's often really hard to constrain just what can
| possibly go wrong--and that's even without compiler
| optimizations kicking into play.
|
| It should be pretty clear that memory unsafety produces all
| sorts of crazy havoc--a write to errant memory could overwrite
| stack return locations and then basically do whatever it wants
| given the power of ROP gadgets. At first glance, it looks like
| uninitialized memory is "merely" an issue of reading more or
| less random data, but there are cases (e.g., MADV_FREE) where
| it turns out that the value of uninitialized memory can change
| underneath you. Traps cause lots of program state to become
| rather indeterminate, simply because of what may or may not
| live in a register or in memory, but on some architectures
| (e.g., Alpha), code may keep running for a while after an
| instruction traps, to the point that you're no longer even in
| the same function. Sanely describing what happens in data races
| is beyond the ken of formal semanticists (see the still-unsettled
| discussions over the semantics of relaxed atomics);
| what hope do programmers have of reasoning about these memory
| semantics?
|
| It also doesn't help that the distinction between undefined,
| unspecified, and implementation-defined behavior is poorly
| grasped by a large segment of the community.
| blueflow wrote:
| I thought Rust doesn't have a specification, just a reference? So
| all of it is undefined behavior?
| bruce343434 wrote:
| I'd say all Rust is implementation defined
| yccs27 wrote:
| "Undefined behavior" is a term of art[0], with the specific
| meaning as mentioned in the article: The compiler is allowed to
| assume that UB never happens, and change the compiled code
| based on that assumption. Not "no one has written down a
| definition" or anything else.
|
| As the sibling comment points out, a contrasting term is
| "implementation-defined". Confusing the two is a common
| misconception when learning C++: You might expect "overflow is
| undefined behavior" to mean that there is no prescribed result
| and every compiler might do it differently, but each will
| produce results in its own consistent way. But that would be
| implementation-defined behavior; instead by doing unchecked
| addition, you tell the compiler that you _know_ the addition
| will never overflow, and don't care at all about the
| overflowing case.
|
| [0] aka. "improper noun", see
| https://news.ycombinator.com/item?id=32673100
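The "you tell the compiler that you know" reading can be made explicit in Rust with `std::hint::unreachable_unchecked`. The following is only a sketch of the idea, with `add_no_overflow` as a hypothetical helper, not an API quoted from the article or the thread.

```rust
use std::hint::unreachable_unchecked;

/// Adds two integers under the promise that the sum does not overflow.
///
/// # Safety
/// The caller must guarantee that `a + b` fits in an `i32`. Breaking that
/// promise is Undefined Behavior, not a wrapped or saturated result.
pub unsafe fn add_no_overflow(a: i32, b: i32) -> i32 {
    match a.checked_add(b) {
        Some(sum) => sum,
        // SAFETY: the caller promised that overflow cannot happen, so the
        // optimiser may treat this arm as dead code and delete the check.
        None => unsafe { unreachable_unchecked() },
    }
}
```

For inputs that honour the contract, such as `unsafe { add_no_overflow(2, 3) }`, the result is simply 5. For inputs that break it there is no result to document, which is the difference from implementation-defined behaviour.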
| yccs27 wrote:
| Followup to be precise: Undefined behavior in C++, of course,
| _can_ have a definition set by the implementation. With the
| right switches, GCC will guarantee certain overflow
| behaviors. But a priori, the compiler is not bound by any
| guarantees.
| planede wrote:
| IMO differentiating a specification and reference this way is
| just nitpicking.
|
| However, it looks like at least unsafe Rust is underspecified
| regarding aliasing rules, therefore a bunch of unsafe Rust is
| undefined behavior by definition. That is, no authoritative text
| (whether reference or specification) defines the behavior of
| those programs.
|
| Key paragraph from the article:
|
| > Stacked Borrows is not part of the Rust spec, and is not the
| final word for aliasing-related UB in Rust. So there is still
| the chance that future revisions of this model can be made to
| better align with programmer intuition. The above code might
| get accepted because x2 is not actually being used to access
| memory. Or maybe &mut expr should only make such promises when
| used outside an unsafe block -- but then, should adding unsafe
| really change the semantics of the program? As usual, language
| design is a game of trade-offs.
___________________________________________________________________
(page generated 2022-09-23 23:02 UTC)