[HN Gopher] Undefined behavior in C is a reading error
       ___________________________________________________________________
        
       Undefined behavior in C is a reading error
        
       Author : zdw
       Score  : 119 points
       Date   : 2021-05-20 14:29 UTC (8 hours ago)
        
 (HTM) web link (www.yodaiken.com)
 (TXT) w3m dump (www.yodaiken.com)
        
       | jart wrote:
       | > license for the kinds of dramatic and unintuitive
       | transformations we've seen from the compilers, and any indication
       | that undefined behavior should be a vehicle for permitting
       | optimizations.
       | 
       | Does anyone have an example of a time where Clang or GCC actually
       | did something bad upon witnessing undefined behavior, rather than
       | simply doing nothing, as the standard proposes? I ask because
       | every time I've seen people get mad about UB it's always because
       | they envisioned that the compiler would do something, but instead
       | the compiler did nothing.
        
         | Jiro wrote:
         | This may not necessarily count as an example in the wild, but
         | the 2013 Underhanded C contest at
         | http://www.underhanded-c.org/_page_id_25.html includes this
          | example:
          | 
          |       h = abs(h) % HASHSIZE;
          |       // Extra sanity check
          |       if (h < 0 || h >= HASHSIZE)
          |         h = 0;
          |       return h;
         | 
         | where h=INT_MIN causes the h to become negative and the sanity
         | check is optimized out because abs(INT_MIN) is UB.
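A minimal UB-free rewrite of the snippet above (the HASHSIZE value and the function name here are assumptions, not taken from the contest entry): reducing in unsigned arithmetic sidesteps abs() entirely, so abs(INT_MIN) can never be evaluated and there is nothing for the compiler to optimize away.

```c
#include <assert.h>
#include <limits.h>

#define HASHSIZE 101  /* hypothetical table size, for illustration */

/* Conversion of any int to unsigned is fully defined in ISO C
 * (it wraps modulo UINT_MAX + 1), so no value of h - including
 * INT_MIN - triggers undefined behavior here. */
static int safe_hash(int h)
{
    return (int)((unsigned)h % HASHSIZE);
}
```

The result is always in [0, HASHSIZE), so the "extra sanity check" becomes unnecessary rather than silently deleted.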
        
           | jart wrote:
           | That's such a good example for teaching purposes, because it
           | manages to combine the lack of understanding surrounding the
           | remainder operator (i.e. unlike python % in c is _not_
           | modulus) with the lack of understanding surrounding
            | 0x80000000 (two's complement bane) into a single example.
           | However it still makes my point that in this circumstance,
           | the compiler's strategy still is to do nothing, because it
           | can prove that the check could only be true under undefined
           | behavior circumstances, so doing nothing means not compiling
           | the check. I'm fine with that. Would anyone prefer that the
           | compiler's internal definition of logic assume that absolute
           | values can be negative? Must we throw out centuries of math
           | because we've optimized integers to be 32-bits?
           | 
           | The only thing that's problematic is we need better tools to
           | bring logic assumptions to our attention. Currently, UBSAN
           | can only warn us about that when two's bane actually gets
           | passed to the function at runtime. So the only way to spot
           | faulty logic we failed to consider is to both enable UBSAN
           | and be super methodical about unit testing.
           | 
           | Well, another thing I like to do is just read the assembly
           | output. Constantly. Whenever I write something like a parser
           | I've got a keyboard mapping that shows me the assembly output
           | in Emacs with UBSAN enabled. If I turn off the noisy ones
           | like pointer overflow then I can avoid these issues
           | altogether by writing my code so that no UBSAN assembly gets
           | generated. Since the compiler won't show that as a warning.
           | You literally have to read the -S assembly output to get the
           | compiler warnings that are actually meaningful.
        
         | anarazel wrote:
         | What do you define as "bad"?
         | 
         | Arguably several.
         | 
          | I've seen cases of overflow checks that were implemented
          | assuming signed overflow wraps (which all relevant platforms
          | implement!) getting optimized away. Correct, given that
          | "signed overflow is UB" and thus can be assumed not to
          | happen. Problematic given how widespread such checks are, and
          | given that there's no easy portable alternative.
         | 
          | Entire checks getting optimized away because of a presumed
          | strict aliasing violation. IIRC between structures with a
          | compatible layout. Pretty code? No. UB? Yes. Reasonable? IDK.
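The "no easy portable alternative" point can be illustrated with a sketch (the function name is illustrative): the GCC/Clang builtin is convenient but non-standard, while the portable ISO C fallback has to compare against the limits *before* adding, which is easy to get wrong.

```c
#include <assert.h>
#include <limits.h>

/* Overflow check that never executes UB itself. */
static int add_would_overflow(int a, int b)
{
#if defined(__GNUC__)
    /* GCC/Clang extension: reports overflow without UB. */
    int r;
    return __builtin_add_overflow(a, b, &r);
#else
    /* Portable ISO C: pre-check against the limits, never add. */
    return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
#endif
}
```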
        
         | ben0x539 wrote:
         | Sounds like you're interested specifically in
         | https://blog.tchatzigiannakis.com/undefined-behavior-can-lit...
        
         | bhk wrote:
         | I've seen a real-world example something like this:
          |       int a[32] = {...};
          |       int flag = 1 << index;
          |       if (index < ARRAYSIZE) {
          |           a[index] = x;
          |           return flag;
          |       } else {
          |           return 0;
          |       }
         | 
          | The "1 << index" operation is _undefined behavior_ (!) when
          | index is greater than or equal to 32 (on a platform with
          | 32-bit integers), even if the result is never used!
         | 
         | The compiler inferred that index must always be less than 32,
         | which allowed it to optimize out the array bounds check, which
         | turns the code into a write-anywhere gadget.
         | 
         | Note that if the standard had not declared "n << 32" to be all-
         | bets-are-off UB, but instead had said something like, "it
         | results in some implementation-specific value, or maybe traps"
         | -- as a rational person would presume -- then this would not
         | turn into a security problem.
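One way to see the fix for the snippet above: a hypothetical repaired version with the shift moved after the bounds check, so that no execution path contains UB and there is nothing for the compiler to infer from (set_flag is an illustrative name; 1u also sidesteps the separate UB of 1 << 31 on a signed 32-bit int).

```c
#include <assert.h>

#define ARRAYSIZE 32

static int a[ARRAYSIZE];

/* The shift is only evaluated once index is known to be in range,
 * so the bounds check cannot be "optimized out" via UB reasoning. */
static unsigned set_flag(int index, int x)
{
    if (index >= 0 && index < ARRAYSIZE) {
        a[index] = x;
        return 1u << index;   /* well defined: 0 <= index < 32 */
    }
    return 0;
}
```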
        
           | msbarnett wrote:
           | > Note that if the standard had not declared "n << 32" to be
           | all-bets-are-off UB, but instead had said something like, "it
           | results in some implementation-specific value, or maybe
           | traps" -- as a rational person would presume -- then this
           | would not turn into a security problem.
           | 
           | But also note that a lot of existing code doing bitshift-and-
           | index inside a hot loop that never went out of bounds would
           | now _get slower_ if it started having to run bounds checks it
           | had previously elided in an optimization pass.
           | 
           | Let's not pretend that "it results in some implementation-
           | specific value, or maybe traps" is a clear win with no
           | downsides that Standards Authors and Compiler Engineers are
           | ignoring out of some kind of malice - there are very real
           | performance tradeoffs here, and a new version of the standard
           | that makes a lot of existing real-world code slower isn't
           | going to be a popular one with many people.
        
             | bhk wrote:
             | It isn't clear to me precisely what example you have in
             | mind.
             | 
             | If you are saying that deleting array bounds checks might
             | have performance benefits that outweigh the security
             | concerns, then I disagree.
             | 
              | If you are saying that the compiler would have to
              | _insert_ bounds checks, I don't see how you arrive at
              | that.
             | 
             | I have seen claims that gratuitous UB is important for
             | enabling meaningful optimizations, but in every such case
             | the examples did not hold up to scrutiny. In the end, the
             | same optimization remains possible without the gratuitous
             | UB, although it might involve a little more work on the
             | part of the compiler engineer.
             | 
             | Regarding "malice": "Never attribute to malice..."
        
               | bhk wrote:
               | For some reason I couldn't 'Reply' to compiler-guy's
               | reply directly, so I'll try here.
               | 
               | I'm familiar with the Chris Lattner article. Most of it
               | (especially the second installment) shows bad outcomes
               | from UB optimization. When it comes to signed integer
               | overflow UB, I see two examples where performance is
               | cited as a motivation.
               | 
               | One mentions unrolling, and gives an example similar to
               | one elsewhere in this thread:
               | https://news.ycombinator.com/item?id=27223870 In my reply
               | to that I explain how unrolling is not actually enabled
               | by UB.
               | 
               | The other instance of integer overflow UB in the Lattner
               | article is optimizing X*2/2 to X. That's perhaps a
               | stronger case, but I haven't seen any numbers on the
               | real-world implications of this particular optimization.
        
               | steveklabnik wrote:
               | > For some reason I couldn't 'Reply' to compiler-guy's
               | reply directly, so I'll try here.
               | 
               | Hacker News has a timeout where if you try to reply to
               | someone too quickly, it will hide the reply button. This
               | timeout increases the deeper the comment tree gets.
        
               | msbarnett wrote:
               | > If you are saying that deleting array bounds checks
               | might have performance benefits that outweigh the
               | security concerns, then I disagree.
               | 
               | I'm saying that there is existing code in this world in
                | which some variation on
                | 
                |     /* insanely hot loop where ARRAYSIZE > 32 */
                |     while (true) {
                |         ...
                |         int x = 1 << index;
                |         if (index < ARRAYSIZE) {
                |             a[index] = x;
                |         } else {
                |             a[index] = 0;
                |         }
                |         ...
                |     }
               | 
               | exists that's currently compiling down to just "a[index]
               | = 1 << index", with everything working fine.
               | 
               | I'm saying that the authors and their customers are
               | unlikely to be excited when your new compiler release
               | stops assuming that index is < 32 (which it always was in
               | practice) and their program gets slower because there's
               | now extra tests in this part of the hot loop, which is
               | also consuming more icache, evicting other important bits
               | of the loop. "There's some work-around to win that
               | performance back, given enough effort by the compiler
               | author to give you some means to tell it what it had
               | previously assumed" isn't likely to sell people on your
               | patch, particularly if they'd have to make many such
               | annotations. "They could just remove the tests if they
               | know that index < 32" in this synthetic example, yes, but
               | there are cases when this is less obvious but nonetheless
               | true. And compiler updates that force you to go delete
               | working code, work out un-obvious deductions the compiler
               | had previously made, and re-validate just to regain the
               | status quo still aren't going to make anybody happy.
               | 
               | The point, broadly: People care a lot about performance.
               | These UB discussions in which people blithely assert that
               | compilers "should" do XYZ conservative assumption while
               | eliding any mention of the real-world performance impact
               | the changes they want would have on existing code are,
               | frankly, masturbatory.
               | 
               | Compiler engineers have to care when PostgreSQL and a
               | dozen other critical programs get 4x slower because they
               | stopped assuming that "1 << index" wouldn't happen with
                | index >= 32, or that loop bounds won't overflow. Like all
               | software engineering, decision making here has to be
               | driven by balancing tradeoffs, not by insisting that one
               | treatment of the spec is obviously "the best approach"
               | while ignoring any inconvenient consequences that change
               | would have vs the status quo.
        
               | [deleted]
        
               | compiler-guy wrote:
               | There are a half-dozen examples on this very thread and
               | in linked bug reports, with detailed explanations by
               | professional compiler writers.
               | 
               | If you think they don't hold up to scrutiny, then you
               | should get to work implementing these things, because you
               | are likely a better compiler writer than most others in
                | the world, including Chris Lattner of llvm fame, who
               | provides many examples here.
               | 
               | https://blog.llvm.org/2011/05/what-every-c-programmer-
               | should...
        
           | pklausler wrote:
           | I think the compiler that you were using is broken. One can't
           | infer "index < 32" unless on a code path to a _use_ of
           | "flag", and that inference can't override the predicates that
           | dominate that use.
        
             | bonzini wrote:
             | No, the initializer already counts as a use of the
             | undefined expression.
        
               | pklausler wrote:
               | You're right. Thanks!
        
         | steveklabnik wrote:
         | A fun historical example: https://feross.org/gcc-ownage/
         | 
         | But yes, generally, UB manifests as "the compiler is allowed to
         | assume this doesn't happen" and the bad stuff is a consequence
         | of doing things following those assumptions, not the compiler
         | going "oops well lol UB time to screw with this person."
        
         | SloopJon wrote:
         | The types of surprises that I've seen have to do with
         | inferences that the compiler draws: this program would be
         | undefined if _foo_ was negative, therefore _foo_ is not
         | negative, therefore I can optimize away this condition.
        
       | wallstprog wrote:
       | True story -- someone (not me) decided to initialize a C++
       | virtual class by defining a default ctor which does a
       | "memset(this, 0, sizeof(*this))", then uses placement new to re-
       | create the vptr.
       | 
       | Newer compilers complained more and more until gcc 8 which just
       | silently ignored this awful hack -- no diagnostic, no error, just
       | crickets.
       | 
       | Strictly speaking silently ignoring this atrocity is allowed by
       | the standard, but it sure took a while to figure out. So be
       | careful with code that "just works" even if it shouldn't.
        
       | zokier wrote:
       | Reading error or not, the ship has sailed long ago. As the
       | article notes, C99 formalized the current UB interpretation. What
       | C89 or K&R said is more just historical curiosity than of any
       | real relevance today. I guess you could construct an argument
       | that gcc should disable some optimizations when invoked with
       | -std=c89, but I doubt anyone really cares at this point enough to
       | justify the maintenance burden.
       | 
       | C is a minefield today, and that is the reality we must live
       | with. You can't turn back time 25 years and change what has
       | happened.
        
         | anarazel wrote:
         | > C is a minefield today, and that is the reality we must live
         | with. You can't turn back time 25 years and change what has
         | happened.
         | 
         | I think it's doable to make some of the UB issues considerably
         | less painful. E.g. define facilities to check for overflow
         | safely, add facilities for explicit wrapping operations.
        
       | Animats wrote:
       | Most of this discussion revolves around integer overflow.
       | 
        | Part of the problem is that most computer hardware now uses
        | two's-complement arithmetic. Programmers think of that as part of
       | the language. It's not, for C. C has run, in the past, on
       | 
       | - 36 bit ones complement machines (DEC and UNIVAC)
       | 
       | - Machines with 7-bit "char" (DEC)
       | 
       | - Machines with 9-bit "char" (UNIVAC, DEC)
       | 
       | - Machines where integer overflow yields a promotion to float
       | (Burroughs)
       | 
       | - Many GPUs provide saturating arithmetic, where INT_MAX + 1 ==
       | INT_MAX.
       | 
       | Go, Java and Rust have explicit byte-oriented models with defined
       | overflow semantics, but C does not. C has undefined overflow
       | semantics.
       | 
       | Many years ago, around the time Ada was being defined, I wrote a
       | note titled "Type Integer Considered Harmful". I was pushing the
       | idea that integer variables should all have explicit range
       | bounds, not type names. As in Ada, overflow would be checked.
       | Intermediate variables in expressions would have bounds such that
       | the intermediate value could not overflow without the final
       | result overflowing and being caught. Intermediate values often
       | have to be bigger than the operands for this to work.
       | 
       | This never caught on, partly because long arithmetic hardware was
       | rare back then, but it illustrates the problem. Numeric typing in
       | programming languages addresses a numeric problem with a
       | linguistic solution. Hence, trouble. Bounds are the right answer
       | numerically, but force people to think too hard about the limits
       | of their values.
        
       | goalieca wrote:
       | If C is just a portable assembler then what if the assembly
       | itself has undefined behaviour. :)
        
         | jart wrote:
         | I seem to recall that Intel and AMD CPUs will behave in strange
         | and unusual ways, particularly when it comes to things like
         | bitshift op flags, if you shift by out of range values, or by 0
         | or 1. So I guess undefined behaviors in C are somewhat
         | consistent with CPUs. But as other people mentioned Intel is
         | much more forgiving than X3J11. If you ever wanted to find all
         | the dirty corner cases that exist between ANSI and hardware, I
         | swear, try writing C functions that emulate the hardware, and
         | then fuzz the two lockstep style. It's harder than you'd think.
         | [don't click here if you intend to try that: https://github.com
         | /jart/cosmopolitan/blob/master/tool/build/...]
        
         | mhh__ wrote:
         | It's not though is it
        
         | kmeisthax wrote:
         | This exists, but the effect of undefined behavior in CPU
         | architectures is a little bit more forgiving than the
         | interpretation of UB in C to mean "literally the entire program
         | has no meaning". Instead, usually the program will execute
         | correctly _up to_ the invalid instruction, and then something
         | happens, and then the CPU will continue executing from that
          | state. It's actually fairly difficult to build an instruction
         | with undefined behavior that contaminates unrelated parts of
         | the program.
         | 
         | Though it HAS happened: notably, brucedawson explains here [1]
          | that the Xbox 360 has an instruction so badly thought out that
         | merely having it in an executable page is enough to make your
         | program otherwise meaningless due to speculative execution.
         | 
         | [1] https://randomascii.wordpress.com/2018/01/07/finding-a-
         | cpu-d...
        
           | dooglius wrote:
           | There actually is a fair amount of truly undefined behavior
           | for CPUs, but it's always at system/kernel mode rather than
           | userspace for security reasons. You can search an ARM ISA for
           | "UNPREDICTABLE" to see examples.
        
           | masklinn wrote:
           | > This exists, but the effect of undefined behavior in CPU
           | architectures is a little bit more forgiving than the
           | interpretation of UB in C to mean "literally the entire
           | program has no meaning".
           | 
           | That is not quite what the interpretation of UB in C is,
           | AFAIK. UB in C is generally interpreted as meaning that any
           | _path_ which would trigger an UB is invalid, because if it
           | will be invalid once the UB is reached, well, if we know for
            | sure we're going to UB for the instruction before it, we can
           | say that we're already in UB.
           | 
           | Whole-program invalidity can occur when the compiler manages
           | to infer that no execution path is UB-free, in which case yes
           | the program is meaningless. More generally, programs will go
           | off the rail as far ahead of the UB as the compiler managed
           | to determine that the UB would unfailingly be reached.
           | 
           | And it's usually because the compiler works backwards: if a
           | path would trigger an UB, that path can not be legally taken,
           | therefore it can be deleted. That's why e.g. `if (a > a + 1)`
           | gets deleted, that expression makes sense if you assume
           | signed integers can overflow, but the compiler assumes signed
           | integers _can 't_ overflow, therefore this expression can
           | never be true, therefore it's dead code.
           | 
           | This is important, because many such UBs get generated from
           | macro expansion and optimisations (mainly inlining), so the
           | assumption of UB-impossibility (and thus dead code) enables
           | not just specific optimisations but a fair amount of DCE,
           | which reduces function size, which triggers further inlining,
           | and thus the optimisations build upon one another.
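For a concrete illustration of the `if (a > a + 1)` case above: the intent ("is a at the top of its range?") can be expressed without performing the overflowing addition, leaving the compiler no UB path to delete (at_int_max is an illustrative name).

```c
#include <assert.h>
#include <limits.h>

/* No arithmetic, no overflow, no UB - and therefore not dead code
 * the compiler is entitled to remove. */
static int at_int_max(int a)
{
    return a == INT_MAX;
}
```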
        
           | coliveira wrote:
           | The situation is different, because a CPU is by definition an
            | interpreter. It doesn't perform code transformations, at
            | least not at the high level a compiler does. The CPU only
            | looks at the next few instructions and performs them. A
            | compiler, however, is responsible for taking a large coding
            | unit and producing a transformation that is efficient. That
            | process requires reasoning about what code is invalid and
            | operating on that.
        
           | infogulch wrote:
            | Wow! Interesting to see hints that Meltdown existed years
            | before it was officially published.
        
             | mhh__ wrote:
             | Skimmed the article and didn't see a reference to it, you
             | may be interested to know that our good friends and
             | protectors at the NSA may have stumbled on to Meltdown-like
             | issues in the mid 90s
             | 
             | https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabil
             | i...
        
               | infogulch wrote:
               | I see the NSA strategy for 'securing' the nation against
               | technology threats in their 'unique' way was going strong
               | back in 1995.
        
               | coliveira wrote:
               | They "secure" the country by exploiting vulnerabilities
               | and leaving everyone else in the dark. They see the world
               | as just a game between them and other foreign
               | surveillance institutions.
        
         | tlb wrote:
         | It does: reading uninitialized memory, simultaneous writes from
         | multiple threads, using memory below the stack pointer with
         | interrupts enabled, ...
         | 
         | Some of C's UB is due to this, some of it is due to the
         | compiler.
        
       | djoldman wrote:
       | Specifying that anything can be done in the presence of UB is a
       | poor specification. The word "specify" is pretty much the
       | opposite of "anything."
       | 
       | Perhaps compilers should delete all scopes with UB: much more UB
       | would be purged from code as a result (programmers would be
       | forced to enable compiler errors on UB).
        
       | tom_mellior wrote:
       | I'm not convinced. The argument seems to hinge to a very large
       | extent on the sentence:
       | 
       | > Permissible undefined behavior ranges from A, to B, to C.
       | 
       | The observation that "Permissible" has a specific meaning is
       | important and interesting. But what about "ranges from ... to
       | ..."? The author reads this as "Permissible undefined behavior is
       | either A or B or C.", but that seems like a stretch to me.
       | (Unless ISO defines "ranges" to mean just this.)
       | 
       | Also, in the above, A = "ignoring the situation completely with
       | unpredictable results". Both "ignoring" and "unpredictable" do a
       | lot of heavy lifting here. In the signed integer overflow case
       | discussed in the linked GCC bug report
       | (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475), one could
       | very well argue that "ignoring" a case where signed arithmetic
        | overflows is exactly what GCC is doing. It "ignores" the possibility
       | that a + 100 might overflow, hence obviously a + 100 > a is true,
       | leading to results that the reporter considers "unpredictable".
       | Somehow the author seems to think that this is not what the
       | wording intended, but they also fail to explain what they think
       | should happen here.
       | 
       | Sounds to me like the wording has always been very vague.
       | 
       | Also:
       | 
       | > Returning a pointer to indeterminate value data, surely a
       | "use", is not undefined behavior because the standard mandates
       | that malloc will do that.
       | 
       | Yes, malloc will do that, but using a _pointer to_ indeterminate
       | data is not the same as using _the indeterminate data itself_.
       | The author is doing themselves a disservice by misreading this.
        
         | hedora wrote:
         | The original intent was for signed overflow to be architecture-
         | specific (not everything was 2s complement back then).
         | 
         | On x86, the correct behavior was previously (and obviously)
         | "perform a 2s complement wraparound".
         | 
         | To "ignore" the situation used to mean "delegate to a lower
         | level". It now means "silently generate code that violates the
         | semantics of the target microarchitecture"
         | 
         | As the article argues, I think this has gone too far. For
         | example, people are starting to claim that asm blocks are
         | undefined behavior. They're clearly implementation specific (so
         | undefined by the spec), but also well defined by each
         | implementation.
         | 
         | In current readings of the spec, compilers are free to ignore
         | them completely. Doing so would break all known operating
         | systems, and many user space programs, so they have not managed
         | to do so yet.
         | 
         | Edit: for signed overflow, other architecture-specific behavior
         | (such as optionally trapping floating point exceptions) would
         | also have been permissible, assuming the architecture supported
         | it.
        
           | jcranmer wrote:
           | > The original intent was for signed overflow to be
           | architecture-specific (not everything was 2s complement back
           | then).
           | 
           | The term for that is implementation-defined, not undefined.
           | If it were to be architecture-specific, back in C89, they
           | would have used the term implementation-defined, as they do
           | for things like the size of pointers.
        
           | anarazel wrote:
           | There's "implementation defined" for the concept you
           | describe.
        
         | Jiro wrote:
         | >Somehow the author seems to think that this is not what the
         | wording intended, but they also fail to explain what they think
         | should happen here.
         | 
         | What "should" happen is that the implementation-defined
         | behavior in the assert matches the implementation-defined
         | behavior in the calculation. That is, assert(a+100>a) should
          | produce the same result as
          | 
          |       int x = a + 100;
          |       int y = a;
          |       if (x > y)
          |           ;
          |       else
          |           printf("assertion failed\n");
        
           | richardwhiuk wrote:
           | That behaviour is no less undefined in the process of
           | overflow.
        
             | Jiro wrote:
             | Okay, then let's rephrase without a code example: The
             | implementation-defined behavior in the assert should
              | produce "true" if and only if the number printed for
              | a+100 (also using implementation-defined behavior) is
              | larger than the number printed for a.
        
       | aw1621107 wrote:
       | A related post with a similar claim: "How One Word Broke C"
       | (https://news.quelsolaar.com/2020/03/16/how-one-word-broke-c/).
       | HN comments at https://news.ycombinator.com/item?id=22589657
        
       | anarazel wrote:
       | First: I do dislike how hard it is to avoid some UB / how
       | impractical some of the rules are.
       | 
        | But I also think a lot of discussions of this topic caricature
        | compiler writers to a ridiculous degree - almost describing
        | them as writing optimization passes that look for UB just so
        | they can over-optimize something, while cackling loudly in
        | glee about all the programs they can break.
       | 
        | The set of people doing so overlaps to a significant degree
        | with the set of people complaining that the compiler doesn't
        | optimize their code sufficiently.
       | 
       | Lots of compiler optimizations need to know the ranges of values,
       | hence logic to infer ranges. One of the sources for that is
        | "can't happen" style logic - which nearly all of the time
        | yields conclusions the code author would agree with if they
        | thought long and hard. Not just about the code as written, but
        | also about what the code looks like after inlining (across TUs
        | with LTO).
        
         | rectang wrote:
         | > _But I also think a lot of discussions of this topic
         | caricaturize compiler writers to a ridiculous degree._
         | 
         | The inflamed backlash should tell you just how damaging it is
         | to impose silent failure on meticulously written, previously
         | fine programs.
        
           | vinkelhake wrote:
           | If the program was previously "fine" on version x.y.z of some
           | compiler, then it is most likely still fine on it. That's the
           | target that the program was written for.
           | 
           | There's some disagreement on whether you can call a program
           | "fine" that breaks after switching to a newer version, or a
           | different compiler.
           | 
           | I see a lot of programmers out there that unfortunately use
           | the behavior of their code on whatever compiler they're using
           | at the moment as a proxy for what the language actually
           | guarantees.
        
           | jcranmer wrote:
           | > meticulously written, previously fine programs
           | 
           | With relatively few exceptions, if your program hits
           | undefined behavior, then your program was already doing
           | something pretty wrong to begin with. Signed overflow is a
           | poignant example: in how many contexts is INT_MAX + 1
           | overflowing to INT_MIN actually sane semantics? Unless you're
           | immediately attempting to check the result to see if it
           | overflowed (which is extremely rare in the code I see), this
           | overflow is almost certain to be unexpected, and a program
           | which would have permitted this was not "previously fine" nor
           | "meticulously written."
           | 
           | I feel compelled right now to point out that software
           | development is a field where it is routine to tell users that
           | it's their fault for expecting our products to work (that's
           | what the big all-caps block of every software license and
           | EULA says, translated into simple English).
        
             | rocqua wrote:
             | I recall a bug-report discussion that I sadly have never
             | been able to find. It contains a pretty bad side-effect of
             | this.
             | 
              | It had code like:
              | 
              |     int *p;
              |     // lots of code
              |     if (p == NULL)
              |         return 1;
              |     // use p
              | 
              | Then a later refactor wrongly added a single line before
              | the if statement:
              | 
              |     int *p;
              |     // lots of code
              |     int a = *p;
              |     if (p == NULL)
              |         return 1;
              |     // use p
             | 
              | This meant the null check was optimized away: since
              | dereferencing a null pointer is undefined behavior, the
              | if-statement can be assumed to be always false. This then
              | led to actual errors (perhaps even an exploit, I do not
              | recall) arising from the removed null check.
             | 
             | I think in general the "sanity check" cases are the worst.
             | It is hard to determine whether an expression causes
             | undefined behavior if you cannot try and evaluate it.
             | Perhaps a compiler intrinsic that checks (at runtime)
             | whether an expression causes undefined behavior could be
             | useful here. Though I can imagine such an intrinsic being
             | essentially impossible to implement.
        
               | tom_mellior wrote:
               | This was in the Linux kernel, which is compiled with
               | special kernel flags which make dereferencing null
               | pointers legal. In the context of that code,
               | dereferencing a pointer and later checking it for null
                | was absolutely meaningful. Optimizing away the later
                | null check was _a compiler bug that was acknowledged and
                | fixed_. The compiler here didn't respect the semantics
                | it had promised to kernel code. This was entirely
               | uncontroversial; it was _not_ a case of  "unwanted
               | optimization based on undefined behavior", it was a case
               | of "compiler bug breaking well-defined code". Again, all
               | this in a _kernel_ context.
               | 
               | In _user code_ GCC will still happily remove the null
                | check because in user code _this is an actual bug in the
                | user's code_.
        
               | nwallin wrote:
               | https://lwn.net/Articles/342330/
               | 
               | https://lwn.net/Articles/342420/
        
             | rectang wrote:
             | > _extremely rare in the code I see_
             | 
             | Well, I learned of the change in compiler behavior some
             | years back because I had written loop code with a sanity
             | check which depended on signed integer overflow wrapping,
             | along with a test case to prove that the sanity check
             | worked, and that test case started failing:
             | not ok 2 - catch overflow in token position calculation
             | Failed test 'catch overflow in token position calculation'
             | at t/152-inversion.t line 70.
             | 
             | To the extent I can be, I'm done with C. I leave it to
             | people who think that silently optimizing away previously
             | functional sanity checks is an acceptable engineering
             | tradeoff, and who disparage those of us who have been
             | bitten.
        
             | anarazel wrote:
             | > in how many contexts is INT_MAX + 1 overflowing to
             | INT_MIN actually sane semantics? Unless you're immediately
             | attempting to check the result to see if it overflowed
             | (which is extremely rare in the code I see)
             | 
             | It's not that rare - I know that postgres got bit by
              | particularly that issue, and several other projects as
              | well. Particularly painful because that obviously can
             | cause security issues.
        
             | lixtra wrote:
             | > With relatively few exceptions, if your program hits
             | undefined behavior, then your program was already doing
             | something pretty wrong to begin with.
             | 
             | The author claims: "We have the absurd situation that C,
             | specifically constructed to write the UNIX kernel, cannot
             | be used to write operating systems. In fact, Linux and
             | other operating systems are written in an unstable dialect
             | of C that is produced by using a number of special flags
             | that turn off compiler transformations based on undefined
             | behavior (with no guarantees about future "optimizations").
             | The Postgres database also needs some of these flags as
             | does the libsodium encryption library and even the machine
             | learning tensor-flow package."
             | 
              | So basically the programmers of the most used C programs
             | consider the C standard so broken that they force the
             | compiler to deviate from it. (Or they are not able to do
             | things right?)
        
         | mjw1007 wrote:
         | I agree.
         | 
         | I don't have much sympathy for people who were doing things
         | like writing multithreaded programs in the days before C
         | documented its memory model and then becoming unhappy because
         | new optimisations that legitimately help single-threaded code
         | broke their programs.
         | 
         | In my experience C compiler maintainers have generally been
         | open to the idea of offering guarantees beyond a narrow reading
         | of the standard, but they want to be able to clearly state what
         | it is that they're guaranteeing. "Keep old programs running"
         | isn't enough.
         | 
         | I think the "Prevailing interpretation" that Yodaiken complains
         | about is coming from the same place as suspicion of the "be
         | lenient in what you accept" IETF principle: that sort of thing
         | doesn't lead to robustness in the long run.
         | 
         | The way forward at this point is surely to define more things
         | that are currently undefined (whether in the standard or by
         | widely-implemented extensions).
        
         | vyodaiken wrote:
         | The current situation is not good for compiler writers either.
         | But nobody has ever shown that either C programmers want to
         | sacrifice safety for "optimizations", or that these UB
         | optimizations actually improve performance of anything.
        
           | ynik wrote:
           | What do you mean by "these UB optimizations"? C is a low-
           | level language; it's basically impossible for a compiler to
           | reason about the code unless it makes certain assumptions. It
           | needs to assume the code is not self-modifying to do pretty
           | much any code-generation more intelligent than a macro
           | assembler. It needs to assume the code isn't messing with the
           | stack frames/return addresses in order to inline functions.
           | It needs to assume the code isn't using an out-of-bounds
           | pointer to access a neighboring local variable, so that it
           | can move local variables into registers. "gcc -O0" is a good
           | approximation for the performance you get if the compiler
           | isn't allowed to optimize based on UB.
           | 
            | Yes, that means C without UB-based optimization is slower
            | than Java. Optimizations need some form of reasoning about
           | what's happening. For Java it's optimizing based on
           | guarantees provided by the language (there's no raw pointers
           | that could mess with the things listed above). But C doesn't
           | provide any hard guarantees, so instead it needs to blindly
           | assume that the code will behave sanely.
           | 
           | Also note that for many of the more manageable sources of UB,
           | most compilers provide a choice (-fwrapv, -fno-strict-
           | aliasing, ...). Yet few projects use these options, even when
           | they use other gcc/clang-specific features. Doesn't that
           | indicate that C programmers indeed want to sacrifice safety
           | for optimizations?
        
             | gpderetta wrote:
             | Exactly.
             | 
             | For example there were programs 30-40 years ago that relied
             | on exact stack layouts. These days everybody would agree
             | they are completely broken.
             | 
             | The issue of course is that it is extremely hard to write
             | programs that have no UB. It would be nice for compilers to
              | have an option to automatically introduce assertions
             | whenever they rely on some UB-derived axiom, basically as a
             | sort of lightweight sanitizer.
             | 
             | In fact if we had sanitizers 30-40 years ago probably
             | things would be better today.
        
               | MauranKilom wrote:
               | > It would be nice for compilers to have an option to
                | automatically introduce assertions whenever they rely on
               | some UB-derived axiom
               | 
               | Modifying a value from a different thread without
               | synchronization is UB. The compiler assumes this does not
               | happen in order to e.g. move things into registers. Could
               | you elaborate how (and how often) you would like to have
               | this kind of UB-derived axiom ("this value remains the
               | same from here to there") checked with assertions?
        
               | gpderetta wrote:
               | Obviously you wouldn't be able to catch many, or even
                | most cases. Use-after-free is another case that would be
                | very expensive to detect.
        
               | vyodaiken wrote:
                | That's a good example, because nobody would complain if
               | stack layouts changed and those programs failed. But if
               | the compiler chooses to "optimize away" checks on stack
               | layout, that's a different thing altogether. Also note
               | that if you use pthreads or Linux clone or you are
                | writing an operating system you may need to rely on exact
               | stack layouts even today.
        
             | vyodaiken wrote:
             | For your last point, the extent of UB driven changes to
             | semantics is still not widely known in the programmer
             | community. Programmers don't read the standard - they read
             | K&R, and K&R is right now describing a different language.
             | We've had 15 years of programmers repeatedly filing bug
             | reports to be told that the expected, tested, relied on,
             | behavior was ephemeral. Only very sophisticated projects
             | figure out about UB.
             | 
             | Of course compilers have to make assumptions. The debate is
             | (a) over what assumptions it is proper to make and (b) what
             | are the permissible behaviors. The false dichotomy: either
             | do without any optimizations at all or accept whatever UB
             | gives you, is not a useful approach.
        
               | MauranKilom wrote:
               | So what optimizations do you mean with "these UB
               | optimizations" then? And would it change your mind to see
               | a benchmark proving the usefulness of that particular UB
               | optimization?
        
               | vyodaiken wrote:
               | e.g. assuming UB can't happen: deleting overflow or null
               | pointer checks, deleting comparisons between pointers
               | that are assumed to point at different objects, ...
        
           | rndgermandude wrote:
           | To do UB "optimizations", the compiler first needs to figure
           | out that there is an UB it can "optimize" anyway. At this
           | point instead of "optimizing" it could, and in my humble
           | opinion absolutely should, blow up the compilation by
           | generating an UB error, so people can fix their stuff.
           | 
           | What about backwards compatibility in regards to a new
           | compiler version deciding to issue errors on UB now? You
           | don't have any guarantees about what happens with UB right
           | now, so if you upgrade to a new version compiler that
           | generates errors instead of "optimizations" everything would
           | be still as before: no guarantees. And it's frankly a lot
           | better to blow up the compilation with errors than to have
           | the compiler accept the UB code and roll a dice on how the
           | final binary will behave later. You can either fix the code
           | to make it compile again, or use an older "known good"
           | version of the compiler that you previously used as a stopgap
           | measure.
           | 
           | I fail to see any reason whatsoever why compilers are still
           | doing all kinds of stupid stuff with UB instead of doing the
           | right thing and issuing errors when they encounter UB.
           | 
           | I also fail to see why the C language designers still insist
           | on keeping so much of the legacy shit around.
        
             | gpderetta wrote:
             | > To do UB "optimizations", the compiler first needs to
             | figure out that there is an UB it can "optimize" anyway.
             | 
              | That's not how compilers work. In fact in the general case
             | it is impossible to figure out at compile time that "there
             | is an UB".
             | 
             | The compiler instead assumes as an axiom that no UB can
             | ever happen and uses the axioms to prove properties of the
             | code.
             | 
              | These days if you want to catch UB, compile with
              | -fsanitize=undefined. The program will then trap if UB is
              | actually detected at runtime.
        
             | nullc wrote:
             | > To do UB "optimizations", the compiler first needs to
             | figure out that there is an UB it can "optimize" anyway.
             | 
             | The compiler assumes UB will never happen and it makes
             | transformations that will be valid if there happens to be
             | no UB. This doesn't require any explicit detection of UB,
             | and in some cases UB or not is simply undecidable at
             | compile time (as in _no_ compiler could detect it without
             | incorrect results).
             | 
              | Without these assumptions the resulting compiled code
              | would be much slower, though optimizations differ in their
              | danger vs speed impact, and there is certainly a case that
              | some optimizations should be eschewed because they're a
              | poor trade-off.
             | 
             | There are many cases where current compilers will warn you
             | when you've done something that is UB. It's probably not
             | the case that they warn for every such detectable case and
             | if so it would be reasonable to ask them to warn about more
             | of them.
             | 
             | I think your irritation is just based on a misunderstanding
             | of the situation.
             | 
             | Compiler authors are C(++) programmers too, they also don't
             | like footguns. They're not trying to screw anyone over.
             | They don't waste their time adding optimizations that don't
             | make real performance improvements just to trip up invalid
             | code.
        
         | dooglius wrote:
          | The caricatures are somewhat accurate though: optimizations
          | that look at UB adversarially are never anywhere close to
          | justified.
         | 
         | > The set of people doing so overlaps with the set of people
         | complaining that the compiler doesn't optimize their code
         | sufficiently to a significant degree.
         | 
         | There's no contradiction here, and the overlap is generally
         | just "people who care". The optimizations that are not safe
         | shouldn't exist, and the optimizations that are safe should be
         | good.
         | 
         | > nearly all of the time are things the code author would agree
         | with if they thought long and hard
         | 
         | I highly doubt this is the case for even one situation.
        
           | MauranKilom wrote:
           | Name one such optimization. We'll be happy to refute your
           | points for that one.
        
           | pornel wrote:
           | > optimizations that look at UB adversarially
           | 
           | The whole point is that there isn't such adversarial thing
           | like "we're going to find the UB right there, won't even
           | print a warning about it, and mess up your crappy code,
           | haha!"
           | 
           | Optimizers aren't reasoning about code like people do (start
           | to finish with high-level understanding of the whole
           | function), but rather as series of mostly dumb, mostly
           | isolated small passes, each pass changing one little thing
           | about the code.
           | 
           | It just happens that one pass marks certain instructions as
           | "can't happen" (like the spec says), then another pass
           | simplifies expressions, and then another pass that removes
           | code that doesn't do anything, usually left over from the
           | previous steps. They sometimes combine in an "adversarial"
           | way, but individually each pass is justified and necessary.
           | 
           | Compilers already have lots of different passes. Splitting
           | optimizations into passes is a way to keep complexity closer
           | to O(n) rather than O(n^2), but this architecture makes
           | interactions between passes very delicate and difficult to
            | coordinate, so it's difficult to suppress only the annoying
            | UB-derived cases without pessimizing the cases that users
            | want optimized.
        
       | GuB-42 wrote:
       | The C standard is not the Bible, it is not written by an almighty
        | god. As respectable as K&R are, they are humans who wrote a
        | standard for their own needs, based on the state of the art of
        | that time. Sadly C is the work of mortals...
       | 
       | Reading the standard trying to understand the way of god like
       | religious scholars do is a pointless exercise. Modern compiler
        | developers found that exploiting undefined behavior the way we
        | do now leads to interesting optimizations; others found it
        | reasonable, so now it is the standard.
       | 
       | I think the issue most people have now is that compilers use
       | advanced solvers that are able to infer a lot from undefined
       | behavior and other things, so UB is no longer just "it works or
       | it crashes".
        
       | ChrisSD wrote:
       | I think the real problem stems from the mismatch between modern
       | processors and the processors C was originally designed for.
       | 
       | C programmers want their code to be fast. Vanilla C no longer
       | gives them the tools to do that on a modern processor. Either the
       | language needs to be extended or the compiler needs to get more
       | creative in interpreting the existing language. The latter is the
       | least disruptive and it doesn't stop judicious use of the former.
       | 
       | So, in short, UB is what gives room for the compiler to be more
       | creative without the programmer having to change their code. It
       | wasn't a reading error, it was an opportunity the compiler devs
       | enthusiastically embraced.
        
         | bhk wrote:
         | Even on modern processors, an ADD instruction does not corrupt
         | memory. The C standard, in declaring that an integer overflow
         | results in all-bets-are-off UB, is not enabling compilers to
         | provide valuable optimizations.
        
           | nullc wrote:
           | It would be nice if that were true, but it's not. The ability
           | to assume incrementing a counter won't overflow makes range
           | analysis possible in many cases where it otherwise wouldn't
            | be. Because of this the compiler can vectorize, since it can
            | assume it knows how many times the loop will run.
           | 
           | The performance differences are not small.
           | 
           | You can also tell some compilers to treat signed integer
           | overflow as wrapping-- but people don't usually do this
           | because the optimizations you lose are valuable.
        
             | vyodaiken wrote:
             | I bet the performance differences are minimal except
             | perhaps on some contrived cases. The compiler should not
             | assume things it doesn't know. But show me a benchmark.
        
         | rectang wrote:
          | When compiler writers get "creative" with C undefined
         | behavior, programming C no longer produces predictable results.
         | 
         | > least disruptive
         | 
         | Like starting to optimize away loop checks that can "never
         | happen" because signed integer overflow is UB, suddenly
         | changing the behavior of programs that were fine for years?
         | 
         | I wish I could just fence off this insanity by never starting
         | another project in C. Unfortunately, C is ubiquitous in the
         | ecosystem so all of us are stuck cleaning it up.
        
           | msbarnett wrote:
           | > Like starting to optimize away loop checks that can "never
           | happen" because signed integer overflow is UB, suddenly
           | changing the behavior of programs that were fine for years?
           | 
           | Yeah. Not doing that on modern processors is actually quite
           | disruptive.
           | 
            | Here:
            | 
            |     for (i = offset; i < (offset + 16); i++) {
            |         arr[i] = i + 32;
            |     }
           | 
           | What C compilers currently do is, in line with the standard,
           | ignore the case that offset + 16 might overflow. This makes
           | this eligible for loop unrolling, and depending on the
           | specifics of the math inside the loop, the compiler can do a
           | lot to pre-calculate things because it knows this is
           | happening 16 times exactly.
           | 
           | If, instead, we force compilers to think about the fact that
           | offset + 16 _could_ have some implementation-defined meaning
           | like wrapping, then all bets are off  & we have to throw a
           | bunch of optimization opportunities out the window.
           | 
            | Lots and lots of hot & tight loops which can currently be
            | compiled into something suitable for the preferences of
            | modern CPUs would instead have to be naively compiled,
            | because of the need to hold back due to the _possibility_ of
            | something that largely wasn't happening, happening.
           | 
           | Most people write most loops this way, never expecting or
           | intending to overflow anything. Most loops are benefitting
           | from this optimization. A lot of code would get slower, and
           | programmers would have to do a lot more fragile hand-
           | unrolling of operations to get that performance back. And
           | they'd need to update that more often, as whatever the
           | optimal "stride" of unrolling changes with the evolution of
           | CPU pipelines.
           | 
           | It's slower code and more work more often for more people, to
           | satisfy a minority use-case that should really just have its
           | own separate "please wrap this" construct.
        
             | rectang wrote:
             | > to satisfy a minority use-case
             | 
             | Every single C program is potentially in that "minority".
             | Nobody can tell when the compiler writers are going to
             | change up behavior on you.
             | 
             | It doesn't matter how carefully the codebase has been
             | written, whether you've had `-Wall -Wextra` enabled. What
             | was fine at one time is no longer fine today. Any C program
             | may suddenly start exhibiting misbehavior from innocuous to
             | catastrophic to horrendously insecure.
             | 
             | It's psycho, maddening, irresponsible. And the only way to
             | deal with it is to purge C programs compiled by these
             | psychotic compilers from our systems.
        
               | msbarnett wrote:
               | > Every single C program is potentially in that
               | "minority". Nobody can tell when the compiler writers are
               | going to change up behavior on you.
               | 
                | This is ridiculously hyperbolic, and bringing in
                | unthinking emotional responses like "psycho" and
                | "irresponsible" only obscures the fact that there are
                | very serious
               | engineering tradeoffs involved in trying to balance "not
               | surprising people whose code contains an assumption that
               | some case is going to behave a certain way when by-the-
               | standard-as-written that case can be ignored" and "not
               | making everything with a hot loop 8x slower because we
               | can't assume anything about loop bounds any more", and
               | that compilers that do the latter are unlikely to prove
               | popular with a lot of people either.
        
             | dooglius wrote:
             | > minority use-case
             | 
             | The amount of code that looks like this in a big enough hot
             | loop to make a difference is negligible. Can you provide
             | even one real-world example where this makes a difference,
             | i.e. not some microbenchmark? The amount of code that can
             | break as a result of signed overflows being UB, on the
             | other hand, is huge.
             | 
             | > programmers would have to do a lot more fragile hand-
             | unrolling of operations to get that performance back
             | 
             | Much easier ways to do this, e.g. by using an assert
             | wrapper around __builtin_unreachable. Alternatively, an
             | unsafe_int_t could be defined that gives the optimize-able
             | behavior. The important thing is to make it opt-in;
             | sensible defaults matter.
        
               | MauranKilom wrote:
               | > The amount of code that can break as a result of signed
               | overflows being UB, on the other hand, is huge
               | 
               | C++ recently decided to _not_ make signed overflow
               | defined, despite having the explicit opportunity to do
               | so. Here is the reasoning:
               | 
                | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p090...
               | 
               | > Performance concerns, whereby defining the behavior
               | prevents optimizers from assuming that overflow never
               | occurs;
               | 
               | > Implementation leeway for tools such as sanitizers;
               | 
               | > Data from Google suggesting that over 90% of all
               | overflow is a bug, and defining wrapping behavior would
               | not have solved the bug.
               | 
               | Presumably, data from Google disproves your assertion
               | that the amount of code that breaks due to signed
               | overflow being UB is huge.
        
               | msbarnett wrote:
               | > Can you provide even one real-world example where this
               | makes a difference, i.e. not some microbenchmark
               | 
               | Sure. I don't even have to leave this thread to find one:
               | https://news.ycombinator.com/item?id=27223954 reports a
               | measurable speed impact to PostgreSQL when compiled with
               | -fwrapv, which rules out the exact optimization in
               | question.
               | 
               | This shouldn't be surprising; loops are extremely common
               | and superscalar processors benefit enormously from almost
               | anything other than a naive translation of them.
               | 
               | Here's -fwrapv cutting performance of a function in half
               | in Cython vs the non-fwrapv compilation:
                | https://stackoverflow.com/questions/46496295/poor-performanc...
        
             | vyodaiken wrote:
             | This is a common argument, but it is flat out wrong. If, as
             | claimed, compilers have complete freedom of choice about
             | how to handle UB, they have the choice to e.g. make this
             | behavior depend on the processor architecture. Compiler
             | developers are _choosing_ to use UB to make C semantics
             | defective.
        
             | adwn wrote:
             | > _If, instead, we force compilers to think about the fact
             | that offset + 16 could have some implementation-defined
             | meaning like wrapping, then all bets are off & we have to
             | throw a bunch of optimization opportunities out the
             | window._
             | 
              | Uh huh. If `i` is declared as `unsigned int` instead of
              | `int`, then overflow _is_ defined and the compiler can't
              | apply those optimizations. And yet the world doesn't end
              | and the sun will still rise tomorrow...
        
               | tom_mellior wrote:
               | The world doesn't end, but in the "int" case you get nice
               | vector code and in the "unsigned int" case you get much
               | less nice scalar code:
               | https://gcc.godbolt.org/z/cje6naYP4
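[Editor's note: a minimal sketch of why the signed/unsigned distinction matters for loop analysis. This is illustrative only; the exact code at the godbolt link may differ.]

```c
/* With an unsigned bound, i <= n can wrap: if n == UINT_MAX the condition
   is always true and the loop never terminates, so the compiler cannot
   assume a fixed trip count. With a signed bound, overflow is UB, so the
   compiler may assume the loop runs exactly n + 1 times, which is what
   enables the nicer vector code. */
unsigned sum_to(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i <= n; i++)
        total += i;
    return total;
}
```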
        
               | adwn wrote:
               | Yes, that is true. The proper way for a compiler to
               | handle this, would be to add a single overflow check
               | before the loop, which branches to a scalar translation
               | of the loop. Most realistic code will need a scalar
               | version anyway, to deal with the prolog/epilog of the
               | unrolled loop for iteration counts that aren't multiples
               | of the unrolling factor.
               | 
                | Surely you agree that treating unsigned overflow
                | differently from signed does not make any sense
                | semantically? Why is signed overflow UB but unsigned
                | overflow wrapping, and not the other way around? The
                | terms 'signed' and 'unsigned' denote the value range,
                | not "operations on this type might overflow/will never
                | overflow".
        
               | rocqua wrote:
                | To a mathematician, wrapping 2^n - 1 back to 0 is a lot
                | more intuitive than wrapping 2^(n-1) - 1 to -2^(n-1).
                | Mathematically the two systems are largely equivalent:
                | under addition and multiplication they agree bit for
                | bit, since both implement arithmetic modulo 2^n.
                | 
                | However, the canonical representatives of this system
                | run from 0 to 2^n - 1. Hence, if you were going to make
                | one kind of integer overflow well-defined, and not the
                | other, C made the correct choice.
                | 
                | That leaves open the question of whether the difference
                | between the two cases is significant enough to justify a
                | difference in how overflow works.
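[Editor's note: the modular-equivalence claim above can be checked concretely. Interpreted modulo 2^n, signed and unsigned addition produce the same bit pattern; the helpers below (8-bit for readability) are an illustration, not from the thread. Arithmetic is done in int after the usual promotions, so nothing here is UB; only the final conversions truncate modulo 2^8.]

```c
#include <stdint.h>

/* Add two 8-bit values and truncate the result modulo 2^8 = 256. */
uint8_t add_u8(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }
uint8_t add_s8(int8_t a, int8_t b)   { return (uint8_t)(a + b); }
```

For example, 200 + 100 = 300 ≡ 44 (mod 256), and -56 + 100 = 44 as well, because -56 and 200 are the same two's-complement bit pattern.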
        
               | tom_mellior wrote:
               | > The proper way for a compiler to handle this, would be
               | to add a single overflow check before the loop, which
               | branches to a scalar translation of the loop. Most
               | realistic code will need a scalar version anyway, to deal
               | with the prolog/epilog of the unrolled loop for iteration
               | counts that aren't multiples of the unrolling factor.
               | 
                | That's true, I agree that that would be a clever way to
                | handle this particular case. It would still happily
                | invoke undefined behavior if the indices don't match the
                | array's length, of course. Many assumptions about the
                | programmer knowing what they are doing go into the
                | optimization of C code.
               | 
               | > Surely you agree that treating unsigned overflow
               | differently from signed does not make any sense
               | semantically?
               | 
               | Yes. Silently wrapping unsigned overflow is also very
               | often semantically meaningless.
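[Editor's note: a common concrete case of the point above - well-defined but semantically meaningless unsigned wrap - is a size computation feeding malloc. A sketch; the helper name `alloc_array` is made up for illustration.]

```c
#include <stdint.h>
#include <stdlib.h>

/* count * elem can wrap silently (perfectly well-defined for size_t!),
   yielding a tiny allocation that the caller then overruns. The check
   below rejects the wrap explicitly instead of letting it happen. */
void *alloc_array(size_t count, size_t elem) {
    if (elem != 0 && count > SIZE_MAX / elem)
        return NULL;               /* multiplication would wrap: refuse */
    return malloc(count * elem);
}
```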
        
               | msbarnett wrote:
               | > And yet the world doesn't end and the sun will still
               | rise tomorrow...
               | 
               | No, you just get much slower, non-vectorized code because
               | the compiler is forced to forgo an optimization if you
               | use unsigned int as the loop bound (EDIT: tom_mellior's
               | reply illustrates this extremely well:
               | https://gcc.godbolt.org/z/cje6naYP4)
               | 
               | Which is precisely the point: forcing a bunch of existing
               | code with int loop bounds, which currently enjoys
               | optimization, to take on the unsigned int semantics and
               | get slower, is just going to piss off a different (and
               | probably larger) set of people than the "compilers
                | shouldn't assume that undefined behaviour can't happen"
               | set of people.
               | 
               | It's a tradeoff with some big downsides; this isn't the
               | obvious win the anti-optimization crowd pretends it is.
        
             | bhk wrote:
             | Well-defined integer overflow would not preclude loop
             | unrolling in this case. One simple alternative would be for
             | the compiler to emit a guard, skipping unrolling in the
             | case that (offset+16) overflows. This guard would be
             | outside the unrolled loop. Furthermore, unsigned values are
             | often used for indices (the unsigned-ness of size_t pushes
             | programmers in that direction) and unsigned overflow _is_
             | well-defined, so any compiler engineer implementing
             | unrolling should be be able to emit such a guard so that
             | the optimization can be applied to loops with unsigned
             | indices.
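[Editor's note: the guard bhk describes can be written out by hand. The sketch below is hypothetical hand-written code, not actual compiler output; it versions the loop on whether offset + 16 can wrap, so only the guarded fast path needs to be unrolled or vectorized.]

```c
#include <limits.h>

void fill16(int *arr, unsigned offset) {
    if (offset <= UINT_MAX - 16) {
        /* fast path: no wrap possible, trip count is exactly 16,
           so unrolling/vectorization is safe */
        for (unsigned i = offset; i < offset + 16; i++)
            arr[i] = (int)(i + 32);
    } else {
        /* slow path: offset + 16 wraps to a small value, so the
           comparison i < offset + 16 is immediately false and the
           loop body runs zero times - preserve that behavior exactly */
        for (unsigned i = offset; i < offset + 16; i++)
            arr[i] = (int)(i + 32);
    }
}
```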
        
               | [deleted]
        
               | msbarnett wrote:
               | > Well-defined integer overflow would not preclude loop
               | unrolling in this case. One simple alternative would be
               | for the compiler to emit a guard, skipping unrolling in
               | the case that (offset+16) overflows.
               | 
                | To what end?
                |                       for (i = offset; i < (offset + 16); i++) {
                |                         arr[i] = i + 32;
                |                       }
               | 
                | like most loops in most programs isn't _designed_ to
                | overflow. The program isn't any more correct for
                | emitting two translations of the loop, one unrolled and
                | one which is purely a bugged case anyways.
               | 
               | Changing the way the UB manifests while altering the
               | nature of the optimization hasn't actually fixed anything
               | at all here. All this would seem to accomplish would be
               | to increase pressure on the icache.
        
         | tomcam wrote:
         | > Vanilla C no longer gives them the tools to do that on a
         | modern processor
         | 
         | Can you elaborate on this point?
        
           | I_Byte wrote:
            | C was created at a time when instructions were executed
            | linearly with no vectorization, memory was a flat space with
            | no CPU caches, and there was no branch predictor that might
            | or might not execute the correct program branch in advance.
            | The list goes on, but the rest is beyond my scope.
           | 
           | C was designed for a now obsolete computer architecture model
           | and over the years this old model has essentially become an
           | abstraction that sits between C and the CPU. As such, C
           | programmers aren't really programming in their CPU's domain
           | anymore and C, by default, lacks the commands necessary to
           | effectively utilize these new developments. It is left up to
           | the compiler to translate C code from the old architecture
           | abstraction into efficient machine code for our new machines.
           | 
           | For a more in depth look into this topic I recommend you
           | check out (0).
           | 
           | (0) - https://queue.acm.org/detail.cfm?id=3212479
        
       | david2ndaccount wrote:
       | _Compilers have become more powerful (opening up new ways to
       | exploit undefined behavior) and the primary C compilers are free
       | software with corporate sponsors, not programmer customers (or
       | else perhaps Andrew Pinski would not have been so blithe about
       | ignoring his customer Felix-gcc in the GCC bug report cited
       | above)._
       | 
       | This is the real problem. We have reached a situation where a
       | small number of compilers dominate the space, but yet do not
       | charge the users and so do not treat them as customers. The C
       | standard is a product of an era where you would pay for your
       | tools and so would demand a refund from any compiler vendor that
       | would treat undefined behavior in an absurd manner.
        
         | [deleted]
        
         | mhh__ wrote:
         | Wouldn't the largest customers (by far) still be the companies
         | funding development of the compilers already?
        
         | msbarnett wrote:
         | > The C standard is a product of an era where you would pay for
         | your tools and so would demand a refund from any compiler
         | vendor that would treat undefined behavior in an absurd manner.
         | 
         | Since current compilers aren't out to do anything malicious
         | with UB, but instead simply treat it as "assume this can't
         | happen and proceed accordingly", it's not clear at all to me
         | what you think paid compilers would do here instead: refuse to
         | compile vast swaths of code that currently compiles? Or compile
         | it very pessimistically, forgoing any optimization
         | opportunities by instead assuming it can happen and inserting a
         | bunch of runtime checks in order to catch it then?
         | 
         | In either case, I doubt there's any real market for "pay money
         | for this compiler and it either won't build your code, or it
         | will run more slowly". I'm just old enough to remember the paid
         | C compiler market and the thing that was driving everybody to
         | pay for the latest and greatest upgrade was "how much better is
         | it at optimization than before?"
        
           | vyodaiken wrote:
           | I see this argument a lot, but it's silly. I don't have to
           | care whether gcc changed the settled semantics without notice
           | or documentation out of sincere belief that they were helping
           | or out of a desire to beat a benchmark that doesn't matter to
           | me, or out of habit. It's not the intent of the compiler
           | authors that matters, but their disregard of the needs of
           | application programmers.
        
             | Quekid5 wrote:
             | What are they to do, exactly?
             | 
             | Optimization is a very 'generic' process and application
             | programmers want optimization. The only sensible thing to
             | do is to assume UB cannot occur and optimize accordingly.
             | 
             | What else is there?
             | 
             | I can already predict that whatever you suggest will very
             | shortly end up in Halting Problem territory _or_ will mean:
             | No optimization. There are a _lot_ of UBs that (if defined)
                | would require run-time checking to define. That wouldn't
                | inhibit optimization per se, but it would ultimately mean
                | slower execution.
        
               | vyodaiken wrote:
               | C programmers prefer control to "optimization". And if
               | you assume UB cannot occur, you should not generate code
               | that makes it happen. Radical UB is not required for
               | optimization: in fact it appears to mostly do nothing
                | positive. There is not a single paper or study showing
                | significantly better performance for substantial C code
                | that depends on assuming UB can't happen - just a bunch
                | of hand-waving.
        
         | swiley wrote:
         | Are you arguing that e.g. Turbo C and friends from the 80s were
         | higher quality than modern C compilers?
        
           | setr wrote:
           | I believe he's arguing that they were more sane/pragmatic
           | compilers -- they would be inherently less comfortable
           | exploiting UB to do anything other than what people expected
           | or were used to, because there is more real possibility of
           | retribution (GCC could get away with making demons fly out
           | your nose and just lose marketshare [that it doesn't directly
           | depend on anyways] where a commercial compiler would go out
           | of business)
        
             | msbarnett wrote:
             | > I believe he's arguing that they were more sane/pragmatic
             | compilers
             | 
             | I can't imagine they have any actual experience with
             | Borland or Symantec's C or C++ compilers, then. These
              | things had notoriously shaky standards compliance and
             | their own loopy implementations of certain things, along
             | with legions of bugs - it's not hard to find older C++ libs
             | with long conditional compilation workarounds for Borland's
             | brain damage. Microsoft's C++ compiler was for years out of
             | step with the standard in multiple ways.
             | 
             | Part of the reason GCC ate these compilers' markets for
             | lunch was the much more rigorous and reliable standards
             | adherence, not just the lack of cost.
             | 
             | This reads like nostalgia for an age that never was.
        
               | [deleted]
        
       | cperciva wrote:
       | The author seems to be missing this essential text: "the
       | implementor may augment the language by providing a definition of
       | the officially undefined behavior."
       | 
       | Making a system call is undefined behavior in the C standard, but
       | it's not undefined behavior in clang-on-FreeBSD, because the
       | implementors of clang on FreeBSD have defined what those system
       | calls do.
       | 
       | Ditto for "asm" (UD unless/until you're running on a compiler
       | which defines what that does), all of the tricks which make
       | "malloc" work, and all of his other examples of acceptable uses
       | of code which the C standard does not define.
        
         | bitwize wrote:
          | The thing about UB is that it tends to happen when the C
          | standard declines to specify whether a program segment is
          | erroneous or valid. Some C environments treat memory as a
          | large array of undifferentiated bytes or words, by design.
          | Other C
         | environments have tagged, bounds-checked regions of memory,
         | again by design. (For example, the C compiler for the Lisp
         | machine.) Usually, indirecting through a null pointer or
         | walking off the end of an array are erroneous, but sometimes
         | you _want_ to read from memory location 0, or scan through all
         | of available memory. The C standard allows for both kinds of
         | environments by stating that these behaviors are undefined,
         | allowing the implementation to error out or do something
         | sensible, depending on the environment.
         | 
         | The idea that UB is carte blanche for implementations to do
         | whatever is an unintended consequence of the vague language of
         | the standard. Maybe a future C standard should use "safe" and
         | "unsafe" instead of UB for some of these operations, and
         | clarify that unsafe code will be erroneous in a safe
         | environment and do something sensible but potentially dangerous
          | in an unsafe environment, so you must really know what you're
          | doing.
        
           | [deleted]
        
           | Rusky wrote:
           | > The idea that UB is carte blanche for implementations to do
           | whatever is an unintended consequence of the vague language
           | of the standard.
           | 
           | Whether or not this was originally intended, it's certainly
           | become the way the standard is written and used today, so
           | that's kind of beside the point.
           | 
           | Further, this is not some new idea that arose from the C
           | standard. It's a basic, core idea in both software
           | engineering and computer science! You define some meaning for
           | your input, which may or may not cover all possible inputs,
           | so that you can go on to process it without considering
           | inputs that don't make sense.
           | 
           | Now, to be fair, the "guardrail-free" approach where UB is
           | _silent_ is a bit out of the ordinary. A lot of software that
           | makes assumptions about its input will at least try to
           | validate them first, and a lot of programming language
           | research will avoid UB by construction. But C is in a unique
           | place where neither of those approaches fully work.
           | 
           | > The C standard allows for both kinds of environments by
           | stating that these behaviors are undefined, allowing the
           | implementation to error out or do something sensible,
           | depending on the environment.
           | 
            | This is true, but it doesn't mean that "something sensible"
            | is actually something the programmer should rely on! That's
            | just asking too much of UB - programmers need to work _with_
            | the semantics implemented by their toolchain, not make up an
            | intuitive/"sensible" meaning for their undefined program and
            | then get mad when it doesn't work.
           | 
            | For example, if you want to scan through a bunch of memory,
            | _tell the language that's what you're doing._ Is that memory
            | at a fixed address? Tell the linker about it so it can show
            | up as a normal global object in the program. Is it dynamic?
           | Memory allocators fabricate new objects in the abstract
           | machine all the time, perhaps your compiler supports an
           | attribute that means "this function returns a pointer to a
           | new object."
           | 
           | The solution is not just to shrug and say "do something
           | sensible but potentially dangerous." It's to precisely define
           | the operations available to the programmer, and then provide
           | tools to help them avoid misuse. If an operation isn't in the
           | language, we can add it! If it's too easy to mess up, we can
           | implement sanitizers and static analyzers, or provide
           | alternatives! Yelling about a supposed misreading of
           | "undefined behavior" is never going to be anywhere near as
           | effective.
        
             | vyodaiken wrote:
             | One issue is that under the prevailing interpretation, the
             | existing semantics is not reliable. You do not know when or
             | if the compilers will take advantage of UB to completely
             | change the semantics they are providing. That's not
             | tenable.
        
               | Rusky wrote:
               | That's not how it works. Taking advantage of UB doesn't
               | change the semantics, it just exposes which behaviors
               | were never in the semantics to begin with. Barring
               | compiler or spec bugs, we _do_ in principle know exactly
                | when the compiler may take advantage of UB. That's the
                | point of a document like the standard - it describes the
                | semantics in a precise way.
               | 
               | To be fair, the existing semantics are certainly complex
               | and often surprising, and people sometimes disagree over
               | what they are, perhaps even to an untenable degree, but
               | that's a very different thing from being unreliable.
        
               | vyodaiken wrote:
                | The net result of your argument is that the language has
                | no semantics. I write and test with -O0 and show that
                | f(k)=m. Then I run with -O3 and f(k)=random. Am I
               | required to be an expert on C Standard and compiler
               | development in order to know that, with no warning, my
               | code has always been wrong? What about if f(k)=m under
               | Gcc 10, but now under Gcc 10.1 that whole section of code
               | is skipped? What you are asking programmers to do is to
               | both master the fine points of UB (which is impractical)
               | and look into the future to see what changes may be
               | invisibly committed to the compiler code.
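[Editor's note: the classic concrete instance of the -O0 vs -O3 divergence described above is a post-hoc overflow check. At -O0 the UB version usually "works" on wraparound hardware; at higher optimization levels GCC and Clang are entitled to fold it to 0, because x + 1 overflowing is assumed not to happen.]

```c
#include <limits.h>

/* UB version: relies on wraparound to detect overflow after the fact.
   An optimizer may legally simplify x + 1 < x to 0, deleting the check. */
int will_overflow_ub(int x) { return x + 1 < x; }

/* Well-defined version: test before doing the arithmetic. */
int will_overflow_ok(int x) { return x == INT_MAX; }
```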
        
         | marcosdumay wrote:
         | The C standards have the perfectly fine name "implementation
         | dependent" to describe those things. Undefined behavior is much
          | less constrained than implementation dependent, and thus more
          | problematic.
        
           | cperciva wrote:
            | No. "Implementation defined" says "the standard doesn't
            | specify what happens here but the compiler must document what
            | it does". That's a step removed from "the compiler _may_
            | define what this does".
        
           | tom_mellior wrote:
           | > The C standards have the perfectly fine name
           | "implementation dependent" to describe those things.
           | 
           | That term is not used by the C standards. Do you mean
           | "implementation-defined"? asm is not among the explicitly
           | specified implementation-defined behaviors, it's listed under
           | "Common extensions". I don't see any mention at all of
           | syscalls in C99. (I'm working with http://www.dragonwins.com/
           | courses/ECE1021/STATIC/REFERENCES/... here.)
        
             | formerly_proven wrote:
             | I'm not sure why syscalls would be UB; it's just not
             | something defined by the C standard.
             | 
             | Edit: To clarify, I meant UB in the sense it is typically
             | used in these discussions, where the standard more-or-less
             | explicitly says "If you do X, the behavior is undefined."
             | Not in the literal sense of "ISO C does not say anything
             | about write(2), hence using write(2) is undefined behavior
             | according to the C standard", which seems like a rather
             | tautological and useless statement to me.
        
               | hvdijk wrote:
               | What do you think UB is if not something where the
               | behaviour is not defined?
        
               | Google234 wrote:
               | So is everything UB since all hardware isn't perfect?
        
               | tsimionescu wrote:
               | UB, in this context, is very explicitly used in the
               | standard: it is undefined behavior related to a construct
               | that the standard describes.
        
               | faho wrote:
               | So if it is behavior that is not defined by the C
               | standard, would that not make it undefined behavior?
        
               | [deleted]
        
               | colejohnson66 wrote:
               | While _technically_ correct, "undefined behavior" in
               | terms of C and C++ refer to what the _standard_ calls out
               | explicitly as undefined, and not a simple "it's not
               | referenced, therefore it's undefined."
               | 
                | For example, signed(?) integer overflow is explicitly
                | undefined by the standard, but as formerly_proven said,
                | just because write(2) isn't mentioned doesn't mean usage
                | of it is undefined.
        
               | _kst_ wrote:
               | Actually, that's exactly what it means:
               | 
               | > If a "shall" or "shall not" requirement that appears
               | outside of a constraint or runtime-constraint is
               | violated, the behavior is undefined. Undefined behavior
               | is otherwise indicated in this International Standard by
               | the words "undefined behavior" or by the omission of any
               | explicit definition of behavior. There is no difference
               | in emphasis among these three; they all describe
               | "behavior that is undefined".
               | 
               | write() is a function, and a call to it behaves like a
               | function call, but the C standard says nothing about what
               | that function does. You could have a function named
               | "write" that writes 0xdeadbeef over the caller's stack
               | frame. Of course if "write" is the function defined by
               | POSIX, then POSIX defines how it behaves.
        
               | steveklabnik wrote:
               | There is a difference between jargon in context and the
               | use of those words in a general sense. It can be
               | "undefined behavior" in a general sense, but not
               | necessarily "undefined behavior" in the jargon sense.
               | 
               | After all, if I were to use the words "undefined
               | behavior" in a sentence unrelated to the standards, the
               | definition in the standard of "behavior, upon use of a
               | nonportable or erroneous program construct or of
               | erroneous data, for which this International Standard
               | imposes no requirements." would be nonsense. Same goes in
               | the other direction.
        
         | jcranmer wrote:
         | Neither system calls nor the asm keyword are undefined behavior
         | in the sense that C uses the term. They are, simply put, not
         | covered by the standard at all.
         | 
         | System calls--assuming you're referring to the C prototypes you
         | call--work as normal external function definitions, just having
         | semantics which are defined by the library (i.e., the kernel)
         | and not the C specification itself. The asm keyword is a
         | compiler language extension and is effectively implementation-
         | defined (as C would call it), although compilers today tend to
         | poorly document the actual semantics of their extensions.
        
         | Joker_vD wrote:
         | > all of the tricks which make "malloc" work,
         | 
         | What are those, exactly? AFAIK, you can safely track memory
         | addresses by storing them as intptr_t/uintptr_t.
        
           | cperciva wrote:
           | The C standard says very little about how those types work.
           | In particular, you can cast a pointer to one of them and then
           | cast back to a pointer -- but only if you cast the exact same
           | value back, and the intptr values are not guaranteed to be in
           | any way meaningful.
           | 
           | In particular, casting a pointer to intptr_t, doing
           | arithmetic on it, and casting back is not guaranteed to do
           | anything useful. It almost certainly will, since most systems
           | treat it as roughly the same as casting to char *, but the
           | standard does not guarantee it.
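[Editor's note: the two halves of that guarantee, in code. The round trip is guaranteed by C99 7.18.1.4; the arithmetic mentioned in the comment is the part the standard does not promise. The helper name `roundtrip_ok` is made up for illustration.]

```c
#include <stdint.h>

/* Guaranteed by C99 7.18.1.4: a valid object pointer converted to
   intptr_t and back compares equal to the original. */
int roundtrip_ok(int *p) {
    intptr_t ip = (intptr_t)p;
    int *q = (int *)ip;
    return q == p;
}

/* NOT guaranteed: treating ip as a byte address and doing arithmetic on
   it, e.g. (int *)(ip + 4). It usually works because most ABIs make
   intptr_t a flat byte address, but the standard says nothing about the
   numeric meaning of the converted value. */
```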
        
             | magnostherobot wrote:
                | Do you have an example of a situation in which you'd want
                | to cast the result of arithmetic on intptr_t values back
                | to a pointer? The situations I can think of off the top
                | of my head would be better done as arithmetic between
                | pointers.
        
               | bonzini wrote:
                | Arithmetic on pointers in turn is only defined if the
                | pointers point within the same object (or one past the
                | end of that object).
               | 
               | One example of using intptr_t would be going from a
               | pointer passed to free() to a metadata block for the
               | memory that must be freed.
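[Editor's note: a sketch of the metadata-block pattern bonzini mentions. The names `my_malloc`/`my_free` are made up; real allocators are more elaborate and would also pad the header to max_align_t alignment. The key point is that stepping back from the user pointer to the header stays inside the single allocated object, so the pointer arithmetic remains well-defined.]

```c
#include <stdlib.h>

typedef struct { size_t size; } header;

/* Allocate n usable bytes, preceded by a header recording the size. */
void *my_malloc(size_t n) {
    header *h = malloc(sizeof(header) + n);
    if (h == NULL)
        return NULL;
    h->size = n;
    return h + 1;                      /* user pointer: just past the header */
}

/* Recover the recorded size from a pointer returned by my_malloc. */
size_t my_block_size(void *p) {
    return ((header *)p - 1)->size;    /* step back to the metadata */
}

void my_free(void *p) {
    if (p != NULL)
        free((header *)p - 1);         /* free the real start of the block */
}
```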
        
           | jacquesm wrote:
           | Oh, for instance, on some implementations there is a lot of
           | interesting stuff just prior to the allocated block returned.
           | Not exactly the pinnacle of elegance but it gets the job
           | done.
        
       | kmm01 wrote:
       | I'm afraid the author themselves is misreading the definition of
       | undefined behavior. Undefined behavior is not "behavior upon use
       | of a nonportable or erroneous program construct". That rephrasing
       | completely changed the meaning. Undefined behavior is, as the C
       | standard states, "behavior [, ...,] for which the Standard
       | imposes no requirements". The whole ", upon use of...," part is
       | just exemplifying situations in which undefined behavior can
       | occur. The standard will sometimes say that a certain construct
       | results in undefined behavior but more importantly any construct
       | for which the standard does not specify a certain behavior has
       | (what else?) undefined behavior.
        
         | warkdarrior wrote:
         | True. I think the point of the author is that the C standards
         | group is not doing their job by leaving so much room for
         | compilers to interpret undefined behavior.
        
       | mumblemumble wrote:
       | > C can accommodate significant optimization while regaining
       | semantic coherence - if only the standard and the compilers stop
       | a lazy reliance on a mistaken reading of "impose no
       | requirements".
       | 
       | That wouldn't be enough, because, sadly for us, rewriting history
       | is only an option in Git repositories and speculative fiction. On
       | this timeline, it doesn't much matter how the standard should
       | have been interpreted (I'll refrain from opining on that), what
       | matters is how the standard was interpreted, and its influence on
       | the behavior of major compilers.
       | 
        | Given the current situation, it seems to me that reaping the
        | benefits of such an enterprise would require getting everyone on
        | board with a mode that's based on the new interpretation, leaving
        | it turned off by default, and then basically never actually using
        | it for fear of breaking compatibility with toolchains that have
        | to use an older version of the compiler for a good decade or two
        | while we wait for them to age out of existence.
       | 
        | I'm not sure the world actually has much practical use for an
        | unapproachable ivory-tower dialect of C. Personally, I'd much
        | rather have a go at a language like Zig that's taken it upon
        | itself to dream even bigger.
        
       | alerighi wrote:
        | To me the real problem is that compilers these days do too much.
        | Nobody asked for these optimizations; if you want optimized code,
        | use C++, or Rust, or whatever other "modern" language that has
        | higher-level constructs and thus can optimize the code better.
       | 
       | The reason to use C is to have a "high level assembler", and
       | there is no reason to use C if I can't rely on the output of the
       | compiler, and if the compiler eliminates some code that I write.
       | We all know what happens when a signed integer overflows, we all
       | know that most architectures are little endian, etc.
       | 
       | Even if something is "undefined behaviour" for the standard, it
       | usually has some well defined behaviour on the platform that you
       | are targeting.
       | 
       | Unfortunately, for these reasons gcc is dangerous to use. It's
       | not a problem for desktop applications (if one crashes, who
       | cares), but I'm talking about critical systems. It would be
       | unacceptable if a machine ends up killing someone because the gcc
       | developers thought they could remove a check they considered
       | useless. And so you have to spend thousands of dollars just to
       | get reliable compilers that don't do silly optimizations.
       | 
       | I think they should make a sort of "gcc lite" that does exactly
       | what (at least to me) a C compiler has to do: translate the code
       | from C to assembly, without changing the semantics of the code.
       | If there is an undefined behaviour, that behaviour will be
       | translated into the assembly code and the processor will be left
       | to deal with it.
       | 
       | Also, we talk about optimizations that could be made by
       | exploiting undefined behaviour, fine, but the programmer usually
       | also optimizes by hand by exploiting those same undefined
       | behaviours.
        
       | curtisf wrote:
       | I think the reasoning here is flowing backwards.
       | 
       | The writer wants to believe that C is a well-designed language
       | suitable for writing large programs (because programmers
       | understandably use it that way; there's not really an alternative
       | to C), and so people reading the spec and finding a minefield
       | _must_ be reading the spec wrong. So many important programs are
       | written in C, and so many of them, with a very strict reading of
       | the C standard, can hit cases where their behavior is
       | "undefined". This _is_ scary, if the C-language-lawyers are
       | right!
       | 
       | The C language was originally largely descriptive, rather than
       | prescriptive. Early "C" compilers disagreed on what to do in
       | strange cases (e.g., one might wrap integer overflow, one might
       | saturate, one might have wider ints). Even when using the less-
       | chaotic "implementation defined behavior", behavior can still
       | diverge wildly: `x == x + 1` is definitely `false` under some of
       | those interpretations and maybe `true` in some of those
       | interpretations.
       | 
       | However, the C spec clearly says that the compiler may "ignore
       | the situation" that "the result is ... not in the range of
       | representable values for its type"; it is "permissible" that `x
       | == x + 1` is replaced with `false` despite the "actual"
       | possibility that adding 1 to x produces the same value, if `+`
       | was compiled to be a saturating add.
       | 
       | This has significant practical consequences, even without the
       | "poisoning" result commonly understood of undefined behavior.
       | Since the value is known statically to be `false`, that might be
       | inlined into a call into a function. That function may
       | _dynamically_ re-check `x == x + 1` and find that it is `true`;
       | obviously that function doesn't have a `if (true && false) {`
       | case, so it results in the function misbehaving arbitrarily
       | (maybe it causes a buffer overrun to the argument of a syscall!).
       | 
       | 'Intuition' does not make a programming-language semantics. You
       | need to write down all the rules. If you want to have a language
       | without undefined behavior, you need to write down the rules for
       | what must happen, keeping in mind that many examples of undefined
       | behavior, like dereferencing out-of-bounds pointers, _cannot_ be
       | detected dynamically in C without massive performance costs. To
       | detect if a pointer is out-of-bounds, you need to always pair it
       | with information about its provenance; you need to track whether
       | or not the object has been freed, or the stack-frame it came from
       | has expired. Is replacing all pointers with fat-pointers
       | indicating their provenance and doing multiple comparisons before
       | every dereference the "right" way to compile C?
        
       | rocqua wrote:
       | Quoting from the same passage in the standard as the article
       | does:
       | 
       | > Permissible undefined behavior ranges from ignoring the
       | situation completely with unpredictable results, to ...
       | 
       | Ignoring the situation completely with unpredictable results
       | seems like it covers the current compiler behavior.
       | 
       | The author does not like how current compilers work. But his
       | argument against it mixes "it would be better if it worked
       | differently" with "A specific pedantic reading of the standard
       | says they are wrong". The second kind of argument seems to
       | undercut his wider point. For his wider point is "Compilers
       | should be reasonable rather than turning on pedantry". At least,
       | that is what I think his point is, and it seems like the much
       | stronger argument to me.
       | 
       | Trying to "trip up" the proponents of current behavior by
       | pointing to a possible misreading of a comma is not going to do
       | much. Arguing instead that their practice is harmful seems like
       | an approach much more likely to work. That said, such an argument
       | should probably be civil. The article links to this [1]
       | discussion. The link is supposed to show the peculiar arguments
       | used by proponents of current behavior. What I read there is
       | someone lashing out, calling names, and grand-standing.
       | Convincing compilers to be more reasonable is probably going to
       | require a very different tone. Not one of "how dare you be so
       | stupid" but one of "perhaps you could consider this side-effect
       | of current behavior" and "have you considered this approach".
       | 
       | [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
        
         | tsimionescu wrote:
         | > The article links to this [1] discussion. The link is
         | supposed to show the peculiar arguments used by proponents of
         | current behavior. What I read there is someone lashing out,
         | calling names, and grand-standing.
         | 
         | Even worse, the person who raised that issue is mostly just
         | wrong: the code they wrote had not worked with any
         | optimizations turned on for about 13 years at the time they
         | raised that bug. The fact that it worked in debug builds seems
         | irrelevant, so the whole bug is complaining about a change that
         | had no impact on the 99% of released code which is compiled
         | with -O2 or -O3.
        
       | superjan wrote:
       | I think the best point in this article is that C, a language
       | invented for writing OS code, is now difficult to use for that
       | purpose due to the current handling of undefined behavior. If you
       | write low level OS code, you are keenly aware of the behavior of
       | the machine you target. There is no point trying to define a
       | language like C in such a way that the only valid programs are
       | those that would run identically on any hypothetically allowed
       | hardware. For instance, on x64, there are no alignment
       | restrictions, why should I be punished for making use of that
       | fact? And what about all the C code with SSE intrinsics? Should
       | we ban that too?
        
       | Spivak wrote:
       | > So, "i << 32" buried in a million lines of database code makes
       | the entire program junk - supposedly as an "optimization"!
       | 
       | I think the author has it wrong here because they're assuming
       | lines and order have any meaning when the program is compiled as
       | a whole. Imagine a compiler where (I know that this is not
       | actually possible to do) undefined behavior was a compiler error.
       | 
       | You would sound crazy for complaining that "a syntax error on
       | line 4325 makes the whole program uncompilable."
       | 
       | It's not like running C programs have any idea that they are
       | hitting undefined behavior. It's just that the generated assembly
       | is allowed to assume that you know what you're doing.
       | 
       |     i << runtime_val
       | 
       | can just generate the shift without worrying that it will blow
       | up, and then propagate the knowledge up to the optimizer that
       | runtime_val's domain is [0,31].
        
       | 1MachineElf wrote:
       | This was posted by a user with a name _closely_ matching the
       | domain (perhaps the original author?) 16 hours prior, and
       | flagged: https://news.ycombinator.com/item?id=27215697
       | 
       | When I stumbled across it last night, I couldn't understand why
       | that would be. The content seemed good enough for readers on
       | here, and this one's placement on the 2nd page of HN seems to
       | confirm that. What's going on?
        
       | DangitBobby wrote:
       | It truly boggles my mind that this discussion (linked in TFA)
       | played out the way it did. In my mind it's 100% not okay to (by
       | default) optimize away a check for integer overflow. I've never
       | really written in C (or any unsafe language) before so I had
       | little context for the types of traps C sets for developers.
       | Based on the responses of the person who presumably implemented
       | the optimization, it comes as no surprise that C is so dangerous.
       | After reading this I hope I never have to use gcc.
       | 
       | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
        
         | tom_mellior wrote:
         | > In my mind it's 100% not okay to (by default) optimize away a
         | check for integer overflow.
         | 
         | But a + 100 > a is not a check for overflow. If a >= INT_MAX -
         | 100, it _is_ an overflow. The  "will this operation overflow"
         | check would be a >= INT_MAX - 100, and GCC would not optimize
         | that away.
        
           | DangitBobby wrote:
           | What you've written doesn't demonstrate the issue outlined in
           | the big report. The issue is that real-world code defends
           | against certain hostile inputs like so, and that the
            | optimization breaks those defenses:
            | 
            |     int a,b,c;
            |     a=INT_MAX;  /* statement ONE */
            |     b=a+2;      /* statement TWO */
            |     c=(b>a);    /* statement THREE */
           | 
           | Whether or not you think this is _technically_ permissible by
           | a sufficiently self-serving (from the perspective of a person
           | whose ONLY goal is speed optimization) reading of spec is
           | irrelevant. Any reading of the spec should err on the side of
           | protection against real-world consequences.
           | 
           | I don't want lawyering and technically correct-ing, I want
           | pragmatism.
        
             | tom_mellior wrote:
             | If you want pragmatism, check whether something is legal
             | _before_ you do it.
        
               | DangitBobby wrote:
               | This is addressed quite thoroughly in the thread that I
               | linked. It is generally impossible to determine whether
               | your code has a bug, and you cannot possibly be
               | responsible for every bit of code that your code may
               | happen to touch, but even if you could, mistakes and bad
               | practices will happen. Practically speaking, it does not
               | matter if you can find a way to displace blame onto the
               | writer of the code. Placing blame is not how one
               | minimizes harm. Maybe we have different definitions of
               | pragmatism in our heads.
               | 
                | I would also say that leaving something as common as
                | integer overflow undefined, rather than recognizing
                | that it practically always has platform-specific
                | behavior, doesn't make much sense. I mean, it does
                | always have a platform-specific implementation, right?
                | Does that not exclude it from the definition of
                | undefined behavior? Genuinely curious.
               | 
               | The downvotes on my previous comments tell me this sort
               | of thinking is endemic. I suddenly have more insight into
               | the concern people have about software development as a
               | field.
        
       | olliej wrote:
       | I dislike how UB is used where unspecified would work.
       | 
       | For instance overflowing arithmetic is well specified on every
       | architecture - the behaviour may be different on say sparc vs an
       | hc12 vs x86, but on any of those architectures it will always be
       | the same. Yet compiler devs have instead decided that it is
       | undefined and so can be treated however they like.
       | 
       | There are so many of these UB cases that could be unspecified
       | that it's absurd - UB should be reserved for when the outcome is
       | not consistent - UaF, invalid pointers, out-of-range accesses,
       | etc.
        
         | MauranKilom wrote:
         | The C++ committee recently (for the C++20 standard) voted
         | against making signed overflow defined, opting to keep it
         | undefined. Primarily for performance reasons (and because it
         | would usually be a bug anyway).
         | 
         | www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r4.html#r0r1
         | 
         | You're of course free to disagree with the reasoning. But you'd
         | probably agree that a new standard revision that _forces_ the
         | average application to become, say, 5% slower, just so that
         | broken programs would be just as broken, would not be very
         | well-received.
        
       | joelkevinjones wrote:
       | Two points:
       | 
       | The notion that compilers that encounter undefined behavior are
       | allowed to generate any code they want is a new interpretation,
       | for some value of "new". I can't remember encountering such an
       | interpretation being used by compiler writers to justify
       | something they wanted to do until sometime after 2000.
       | 
       | The notion that John Regehr has (quoted in the article) that
       | undefined behavior implies the whole execution is meaningless is
       | not supported by the language of either the C89 or C99 standard,
       | at least by my reading. The C89 standard has a notion of sequence
       | points. Wouldn't all sequence points executed before undefined
       | behavior is encountered be required to occur as if the undefined
       | behavior wasn't there? It would seem so:
       | 
       | From the C89 standard: 2.1.2.3 Program execution
       | 
       | The semantic descriptions in this Standard describe the behavior
       | of an abstract machine in which issues of optimization are
       | irrelevant. Accessing a volatile object, modifying an object,
       | modifying a file, or calling a function that does any of those
       | operations are all side effects, which are changes in the state
       | of the execution environment. Evaluation of an expression may
       | produce side effects. At certain specified points in the
       | execution sequence called sequence points, all side effects of
       | previous evaluations shall be complete and no side effects of
       | subsequent evaluations shall have taken place.
       | 
       | The C99 standard has nearly identical language:
       | 
       | 5.1.2.3 Program execution 1 The semantic descriptions in this
       | International Standard describe the behavior of an abstract
       | machine in which issues of optimization are irrelevant. 2
       | Accessing a volatile object, modifying an object, modifying a
       | file, or calling a function that does any of those operations are
       | all side effects,11) which are changes in the state of the
       | execution environment. Evaluation of an expression may produce
       | side effects. At certain specified points in the execution
       | sequence called sequence points, all side effects of previous
       | evaluations shall be complete and no side effects of subsequent
       | evaluations shall have taken place.
        
         | AlotOfReading wrote:
         | No, sequence points aren't relevant because they occur at
         | runtime. "Time traveling UB" happens in the compiler, typically
         | in optimization passes and can cause otherwise valid code to
         | exhibit completely different behavior than it would if the UB
         | didn't exist.
        
         | MauranKilom wrote:
         | The standard (in the prevailing reading of the UB section, and
         | also in practice) places no requirements on the behavior of
         | programs containing UB. None of the paragraphs you quoted have
         | any bearing on how an UB-laden program behaves.
        
         | zlynx wrote:
         | > Wouldn't all sequence points executed before undefined
         | behavior is encountered be required to occur as if the
         | undefined behavior wasn't there? It would seem so:
         | 
         | No. Code optimization is a series of logic proofs. It is like
         | playing Minesweeper. If a revealed square has 1 neighboring
         | mine and a count of 1, then you know that all 7 other squares
         | are safe. In other Minesweeper situations you make a proof that
         | is much more complex and allows you to clear squares many steps
         | away from a revealed mine. If you make a false assumption of
         | where a mine is, via a faulty proof, then you explode.
         | 
         | The compiler is exactly like that. "If there is only one
         | possible code path through this function, then I can assume the
         | range of inputs to this function, then I can assume which
         | function generated those inputs..."
         | 
         | You can see how the compiler's optimization proof goes "back in
         | time" proving further facts about the program's valid behavior.
         | 
         | If the only valid array indexes are 0 and 1 then the only valid
         | values used to compute those indexes are those values that
         | produce 0 and 1.
         | 
         | This isn't even program execution. In many cases the code is
         | collapsed into precomputed results which is why code
         | benchmarking is complicated and not for beginners. Many naive
         | benchmark programs collapse 500 lines of code and loops into
         | "xor eax,eax; ret;" A series of putchar, printf and puts calls
         | can be reduced to a single fwrite and a malloc/free pair can be
         | replaced with an implicit stack alloca because all Standard
         | Library functions are known and defined and there is no need to
         | actually call them as written.
        
       | muth02446 wrote:
       | Also relevant:
       | 
       | What every compiler writer should know about programmers, or
       | "Optimization" based on undefined behaviour hurts performance
       | 
       | by M. Anton Ertl
       | 
       | https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_20...
        
       | mjburgess wrote:
       | Commenters here seem to be missing the core thesis of this
       | article. It's not about what the standard literally means; it's
       | about its spirit -- and the _reason_ for its spirit.
       | 
       | The issue is "undefined behaviour" should never have been
       | interpreted this extremely. The standard may be silent on how
       | extreme, but it is implausible to suggest that the standard was
       | actually written to enable this.
       | 
       | Compiler writers _for C!_ dismiss severe issues which occur in
       | compiling any program with undefined behaviour; issues which
       | would render any modern language a bad joke.
       | 
       | Compiler writers are using this as a perverse shield to simply
       | fail to optimize for correctness, or provide means to; and to
       | enable them to optimize only for performance.
       | 
       | Are we really saying the _standard_ supports this sleight-of-
       | hand? It seems more like using the Second Amendment to murder
       | someone.
        
         | steveklabnik wrote:
         | Personally, I think that arguing that those who define and
         | implement a standard don't understand one of the most
         | fundamental aspects of said standard is going to be an uphill
         | battle.
         | 
         | You could argue that they've lost their way, and the article
         | flirts with this, but the path forward is the hard part, and
         | IMHO rings a bit hollow: it's asserted that these rules aren't
         | needed for performance, but no evidence is given, and what
         | similar evidence we do have (compiling at lower optimization
         | levels) doesn't seem to support this thesis. You could argue
         | that the kernel, which turns off strict aliasing, is plenty
         | performant without it, and that's a decent argument, but it's
         | not clear that it wouldn't be even faster with it, and it's
         | much harder to test this empirically than removing the flag,
         | since doing so would miscompile.
        
           | bonzini wrote:
           | Different code depends on different optimizations. A loop on
           | an int** might benefit a lot from aliasing optimizations,
           | because the compiler will assume that a[i] will remain the
           | same after writing to a[i][j]. Other code may not benefit at
           | all.
           | 
           | Likewise that loop may not benefit from signed overflow;
           | instead an initialization loop that, by way of multiple level
           | of macros, ends up doing a[i]=b[i]*1000/100, might become
           | twice as fast if signed overflow rules let the compiler
           | rewrite the assignment as a[i]=b[i]*10.
        
       | _kst_ wrote:
       | The author suggests that the text following the definition of
       | "undefined behavior", listing the permitted or possible range of
       | undefined behavior, should be read to restrict the consequences.
       | 
       | But the first possibility listed is "ignoring the situation
       | completely with unpredictable results". Surely that covers any
       | possible consequences.
       | 
       | The author also says:
       | 
       | > Returning a pointer to indeterminate value data, surely a
       | "use", is not undefined behavior because the standard mandates
       | that malloc will do that.
       | 
       | Returning a pointer to data is not a use of that data. The fact
       | that its value is indeterminate isn't relevant until you attempt
       | to read it (without first writing it).
       | 
       | It may be worthwhile to reduce the number of constructs whose
       | behavior is undefined, making them implementation-defined or
       | unspecified instead. For example, if signed integer overflow
       | yielded an unspecified result rather than causing undefined
       | behavior, I wonder if any implementations would be adversely
       | affected. (But it would remove the possibility of aborting a
       | program that computes INT_MAX+1.)
       | 
       | I don't think reinterpreting "undefined behavior" as anything
       | other than "the Standard imposes _no_ requirements " is
       | practical. If a program writes through a dangling pointer and,
       | for example, clobbers a function's return address, what
       | constraints could be imposed on what the program might do next?
        
         | btilly wrote:
         | _The author suggests that the text following the definition of
         | "undefined behavior", listing the permitted or possible range
         | of undefined behavior, should be read to restrict the
         | consequences._
         | 
         |  _But the first possibility listed is "ignoring the situation
         | completely with unpredictable results". Surely that covers any
         | possible consequences._
         | 
         | Absolutely not. In the C89 standard, undefined behavior becomes
         | undefined *UPON USE OF* the thing that is undefined. In current
         | compilers, the existence of undefined behavior anywhere in your
         | program is an excuse to do anything that the compiler wants to
         | with all of the rest of your program. Even if the undefined
         | behavior is never executed. Even if the undefined behavior
         | happens after the code that you have encountered.
         | 
         | So, for example, undefined behavior that can be encountered
         | within a loop makes it allowable to simply remove the loop.
         | Even if the undefined behavior is inside of an if that does not
         | happen to evaluate to true with your inputs.
        
           | aw1621107 wrote:
           | > In the C89 standard, undefined behavior becomes undefined
           | _UPON USE OF_ the thing that is undefined.
           | 
           | Is that still the case for current C standards, or did
           | something change in C99/C11?
        
             | btilly wrote:
             | I don't have the C11 standard. But that part of the passage
             | remained unchanged in C99.
             | 
             | In C89 there was a list of PERMISSIBLE things that
             | compilers could do upon encountering undefined behavior. In
             | C99 that was changed to a list of POSSIBLE things. And
             | compilers have taken full advantage of that.
        
           | Jweb_Guru wrote:
           | This is actually _desired_ though, at least by some programs.
           | For example, say you have a function with a very expensive
           | loop that repeatedly performs a null check and then executes
           | some extra code if it 's null, but never sets the value. This
           | is called from another function which uses the checked value
           | without a null check (proving it's not null) before and after
           | the loop ends. The first function is inlined. You want to
           | tell the compiler not to optimize out the null check and
           | extra code in the loop? Or that it can't optimize stuff out
           | to reuse the value from the first use of the value? If so,
           | what _is_ the compiler allowed to optimize out or reorder?
           | 
           | Now, to see why this might actually produce a bug in working
           | code--say some other thread has access to the not-null value
           | and sets it racily (non-atomically) to null. Or (since most
           | compilers are super conservative about checks of values that
           | escape a function because they can't do proper alias
           | analysis), some code accidentally buffer overflows and
           | updates the pointer to null while intending to do something
           | else. Suddenly, this obvious optimization becomes invalid!
           | 
           | Arguments to the effect of "the compiler shouldn't optimize
           | out that loop due to assuming absence of undefined behavior"
           | are basically arguments for compilers to leave tons of
           | performance on the table, due to the fact that sometimes C
           | programs don't follow the standard (e.g. forgetting to use
           | atomics, or indexing out of bounds). While it's a legitimate
           | argument, I don't think people would be too happy to find
           | their C programs losing to Java in benchmarks on -O3, either.
        
           | MauranKilom wrote:
           | > So, for example, undefined behavior that can be encountered
           | within a loop makes it allowable to simply remove the loop.
           | Even if the undefined behavior is inside of an if that does
           | not happen to evaluate to true with your inputs.
           | 
           | The last sentence is not true. If there is UB inside the if,
           | the compiler may assume that the if condition never evaluates
           | to true (and hence delete that branch of the if), but it may
           | certainly not remove the surrounding loop (unless it can
           | _also_ prove that the condition _must_ be true).
        
         | ynik wrote:
         | > For example, if signed integer overflow yielded an
         | unspecified result rather than causing undefined behavior, I
         | wonder if any implementations would be adversely affected.
         | 
         | You don't need to wonder. You can use -fwrapv to make signed
         | integer overflow defined behavior.
         | 
         | C++20 introduced the guarantee that signed integers are two's
         | complement. The original version of that proposal also defined
         | the behavior on overflow; but that part was rejected (signed
         | integer overflow remains UB):
         | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p090...
         | So at least the committee seems to think that the performance
         | advantages are worth it.
        
         | anarazel wrote:
         | > For example, if signed integer overflow yielded an
         | unspecified result rather than causing undefined behavior, I
         | wonder if any implementations would be adversely affected.
         | 
         | I suspect so - makes it harder to reason about loop counts
         | because the compiler can't necessarily guarantee that an
         | incremented loop counter won't become negative and thus the
         | loop needs to iterate more.
         | 
         | E.g. something like for (int i=param; i < param + 16; i++) has
         | a guaranteed loop count with the current rules, but not with
         | yours?
         | 
         | That's no excuse for not having any way to do proper
         | overflowing operations on signed integers, though.
        
           | Asooka wrote:
           | > I suspect so - makes it harder to reason about loop counts
           | because the compiler can't necessarily guarantee that an
           | incremented loop counter won't become negative and thus the
           | loop needs to iterate more.
           | 
           | This is a favourite example that gets thrown around, but for
           | all practical loops GCC and clang seem to have no problem
           | even when you compile with -fwrapv
        
             | anarazel wrote:
              | Postgres has compiled with -fwrapv for many years now,
              | and yes, it does introduce measurable CPU overhead. Not
              | 10%, but also not just 0.1%.
        
           | OskarS wrote:
           | Assuming you defined signed integer overflow to follow two's
           | complement rules (the only reasonable interpretation other
           | than UB), it would still be a guaranteed loop count of 16.
           | (EDIT: i'm a dumbass, this is obvs not true. disregard this
           | paragraph)
           | 
           | There's an interesting thing to note with that example
           | though: even if you did make signed integer overflow defined,
           | that code is still obviously incorrect if param + 16
           | overflows. Like, the fact that signed integer overflow is UB
           | is totally fine in this example: making it defined behavior
           | doesn't fix the code, and if making it UB allows the compiler
           | to optimize, then why not?
           | 
           | Arguably, this is the case with the vast majority of signed
           | integer overflow examples: the UB isn't really the issue, the
           | issue is that the programmer didn't consider overflow, and if
           | overflow happens the code is incorrect regardless. Why
           | cripple the compiler's ability to optimize to protect cases
           | which are almost certainly incorrect anyway?
        
             | Gibbon1 wrote:
             | The real problem is that in a better world 'int' would be
             | replaced by types that actually exhibit the correct
             | behavior.
             | 
             | For a loop counter you want an index type that will seg
             | fault on overflow. If you think not having that check is
             | worth it the programmer would need to tag it with unsafe.
             | 
             | It's also problematic because its size is defined as at
             | least 16 bits, which means you should never use it to
             | store a constant larger than 16 bits. But people do that
             | all the time.
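
GCC and Clang already provide a building block for the checked behaviour this comment asks for: `__builtin_add_overflow` performs the addition with wraparound semantics and reports whether the mathematical result fit. A minimal sketch (the `checked_add` wrapper is an illustrative name, not a standard function):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Checked addition: returns true and stores the sum on success,
 * false if the mathematical result does not fit in an int.
 * __builtin_add_overflow is a GCC/Clang extension, not ISO C. */
bool checked_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```

The caller then decides whether to abort, saturate, or propagate the error, rather than relying on trapping hardware or a particular overflow semantics.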
        
               | thewakalix wrote:
               | Many people write C programs that are not intended to be
               | portable to 16-bit architectures.
        
               | OskarS wrote:
               | I'm not sure I agree. If signed overflow is UB, loops
               | like this can be optimized the hell out of. The most
               | obvious way would be to unroll it and eliminate the loop
               | (and loop variable) entirely, but you can also do things
               | like vectorize it, maybe turn it in to just a small
               | number of SIMD instructions. The performance gains are
               | potentially enormous if this is in a hot loop.
               | 
               | With your magic int that traps on overflow, you couldn't
               | do that if the compiler was forced to rely on that
               | behaviour. This is exactly why signed overflow is UB in
               | C, and I don't think that's an unreasonable case for a
               | language like C.
               | 
               | To be clear, my point is that this program is incorrect
               | if overflow happens regardless of whether overflow is UB
               | or not. So you might as well make it UB and optimize the
               | hell out of it.
        
               | cygx wrote:
               | The broader argument is that signedness of the integer
               | type used for indexing is a non-obvious gotcha affecting
               | vectorizability. It makes sense once you understand C
               | integer semantics, but putting on a language designer
               | hat, I'd go with something more explicit.
        
           | _kst_ wrote:
           | for (int i=param; i < param + 16; i++)
           | 
           | does not have a guaranteed loop count with the current rules.
           | The loop body will execute 16 times if param <= INT_MAX-16,
           | but if the expression "param + 16" can overflow, the behavior
           | is undefined. (I'm assuming param is of type int.)
        
             | BruiseLee wrote:
             | The compiler is allowed to act as if this loop executes
             | exactly 16 times. That means it could unroll and vectorize
             | it for example.
        
               | vyodaiken wrote:
               | It is completely useless to allow compilers to assume
               | false things about the code they generate.
        
               | fooker wrote:
               | It's not useless. The assumption is not false if the
               | program doesn't have undefined behavior. The assumption
               | allows the code to be a few times faster. To disallow
               | this assumption would inhibit these optimizations.
        
             | msbarnett wrote:
             | > does not have a guaranteed loop count with the current
             | rules. The loop body will execute 16 times if param <=
             | INT_MAX-16, but if the expression "param + 16" can
             | overflow, the behavior is undefined. (I'm assuming param is
             | of type int.)
             | 
             | And the standard permits us (among other responses) to
             | ignore undefined behaviour, so it _does_ have a guaranteed
             | loop count under a reading of the standard which the
             | standard specifically and explicitly allows.
        
               | _kst_ wrote:
               | No, the standard permits the implementation to ignore the
               | behavior "with unpredictable results".
               | 
               | If the value of param is INT_MAX, the behavior of
               | evaluating param + 16 is undefined. It doesn't become
               | defined behavior because a particular implementation
               | makes a particular choice. And the implementation doesn't
               | have to tell you what choice it makes.
               | 
               | What the standard means by "ignoring the situation
               | completely" is that the implementation doesn't have to be
               | aware that the behavior is undefined. In this particular
               | case:
               | 
               | for (int i=param; i < param + 16; i++)
               | 
               | that means the compiler can assume there's no overflow
               | and generate code that always executes the loop body
               | exactly 16 times, or it can generate naive code that
               | computes param + 16 and uses whatever result the hardware
               | gives it. And the implementation is under no obligation
               | to tell you how it decides that.
        
               | tsimionescu wrote:
               | > If the value of param is INT_MAX, the behavior of
               | evaluating param + 16 is undefined. It doesn't become
               | defined behavior because a particular implementation
               | makes a particular choice. And the implementation doesn't
               | have to tell you what choice it makes.
               | 
               | The compiler writer argument is as follows:
               | 
               | The program either has UB (when param is INT_MAX - 15 or
               | higher) or executes exactly 16 iterations. Since we are free
               | to give any semantics to a UB program, it is standard-
               | compliant to always execute 16 times regardless of
               | param's value.
        
               | msbarnett wrote:
               | > that means the compiler can assume there's no overflow
               | and generate code that always executes the loop body
               | exactly 16 times
               | 
               | Right. That's what I said.
               | 
               | And just to be super-precise about the wording, the
               | standard doesn't say "ignore the behavior 'with
               | unpredictable results'" it says "Permissible undefined
               | behavior ranges from ignoring the situation completely
               | with unpredictable results". Nitpicky, but the former
               | wording could be taken to imply that ignoring behavior is
               | only permissible if the behavior is unpredictable, when
               | what the standard actually says is that you can ignore
               | the behavior, even if the results of ignoring it are
               | unpredictable.
        
               | _kst_ wrote:
               | And my point is that as far as the language is concerned,
               | there is no guaranteed loop count under any
               | circumstances. (An implementation is allowed, but not
               | required, to define the behavior for that
               | implementation.)
        
               | Smaug123 wrote:
               | The two of you are not disagreeing except insofar as
               | you're both using the word "guaranteed" to mean
               | completely different things. _kst_, you're using it to
               | mean "the programmer can rely on it". msbarnett, you're
               | using it to mean "the compiler can rely on it".
        
               | dooglius wrote:
               | Either the limit on param is guaranteed in some way by
               | the rest of the program, or it is not. If it is, then the
               | loop count is guaranteed in both cases. If it is not, the
               | loop count is not guaranteed in either case.
        
               | msbarnett wrote:
               | That you wish that the C Standard mandated this
               | interpretation does not change the fact that this is not
               | what the C Standard says.
        
               | dooglius wrote:
               | You are mistaken, the C standard is quite clear that it
               | does not make any guarantees regarding the behavior of
               | programs that exhibit undefined behavior, and that signed
               | integer overflow is undefined behavior.
        
               | masklinn wrote:
               | They're not mistaken. What compilers will do is assume
               | that UB doesn't happen. If no UB happens, that means `param
               | + 16` never overflowed, therefore there are always
               | exactly 16 operations.
        
               | msbarnett wrote:
               | "for (int i=param; i < param + 16; i++) does not have a
               | guaranteed loop count in the presence of undefined
               | behavior" is true, but it's equally true that the C
               | standard is quite clear that undefined behavior can be
               | ignored, so we can validly treat "for (int i=param; i <
               | param + 16; i++)" as if it were guaranteed to loop 16
               | times in all cases.
        
               | _kst_ wrote:
               | No, the C standard doesn't say that "undefined behavior
               | can be ignored" (which would mean what, making it
               | defined?).
               | 
               | It says, "NOTE Possible undefined behavior ranges from
               | ignoring the situation completely with unpredictable
               | results, ...".
               | 
               | It doesn't say that the _behavior_ can be ignored. It
               | says that the _undefinedness_ can be ignored. The
               | implementation doesn't have to take notice of the fact
               | that the behavior is undefined.
               | 
               | Let's take a simpler example:
               | printf("%d\n", INT_MAX + 1);
               | 
               | The behavior is undefined. The standard does not
               | guarantee anything about it. A conforming implementation
               | can reject it at compile time, or it can generate code
               | that crashes, or it can generate code that emits an ADD
               | instruction and print whatever the hardware returns, or
               | it can play rogue at compile time. (The traditional joke
               | is that it can make demons fly out of your nose. Of course
               | it can't, but an implementation that did so would be
               | physically impossible, not non-conforming.)
               | 
               | An implementation might define the behavior, but it's
               | still "undefined behavior" as that term is defined by the
               | ISO C standard.
        
               | msbarnett wrote:
               | "undefined behavior can be ignored" (meaning: the case
               | where this _could_ overflow need not be considered and
               | can be treated as though it does not exist) vs  "The
               | implementation doesn't have to take notice of the fact
               | that the behavior is undefined" strikes me as a
               | distinction without a difference given that we land in
               | exactly the same spot: the standard allows us to treat
               | "for (int i=param; i < param + 16; i++)" as if it were
               | guaranteed to loop 16 times in all cases.
               | 
               | > An implementation might define the behavior, but it's
               | still "undefined behavior" as that term is defined by the
               | ISO C standard.
               | 
               | The point where we seem to disagree (and the pedantry
               | here is getting tiresome so I don't know that there's any
               | value in continuing to go back and forth of on it) is
               | that yes, it's undefined behavior by the ISO C standard.
               | BUT, the ISO C standard _also_ defines the allowable
               | interpretations of and responses _to_ undefined
               | behaviour. Those responses don't exist "outside" the
               | standard - they flow directly from it.
               | 
               | So it's simultaneously true that the standard does not
               | define it and that the standard gives us a framework in
               | which to give its undefinedness some treatment and
               | response, even if that response is "launch angband" or,
               | in this case, "act as if it loops 16 times in all cases".
        
               | _kst_ wrote:
               | Of course an implementation can do anything it likes,
               | including defining the behavior. That's one of the
               | infinitely many ways of handling it -- precisely because
               | it's _undefined behavior_.
               | 
               | I'm not using "undefined behavior" as the English two-
               | word phrase. I'm using the technical term as it's defined
               | by the ISO C standard. "The construct has _undefined
               | behavior_" and "this implementation defines the behavior
               | of the construct" are not contradictory statements.
               | 
               | And "ignoring the situation completely" does not imply
               | any particular behavior. You seemed to be suggesting that
               | "ignoring the situation completely" would result in the
               | loop iterating exactly 16 times.
        
               | msbarnett wrote:
               | > Of course an implementation can do anything it likes,
               | including defining the behavior. That's one of the
               | infinitely many ways of handling it -- precisely because
               | it's undefined behavior.
               | 
               | An implementation can do whatever it likes _within the
               | prescribed bounds the standard provides for reacting to
               | "undefined behavior"_, and conversely whatever the
               | implementation chooses to do within those bounds is
               | consistent with the standard.
               | 
               | Which, again, is the entire point of this: "the loop
               | iterates exactly 16 times" is a standards-conforming
               | interpretation of the code in question. There's nothing
               | outside the standard or non-standard about that. That is,
               | in fact, exactly what the standard says that it is
               | allowed to mean.
               | 
               | > I'm not using "undefined behavior" as the English two-
               | word phrase. I'm using the technical term as it's defined
               | by the ISO C standard.
               | 
               | So am I. Unlike you, I'm merely taking into account the
               | part of the standard that says "NOTE: Possible undefined
               | behavior ranges from ignoring the situation completely
               | with unpredictable results" and acknowledging that things
               | that _do so_ are standards-conforming.
               | 
               | > You seemed to be suggesting that "ignoring the
               | situation completely" would result in the loop iterating
               | exactly 16 tyimes.
               | 
               | I'm merely reiterating what the standard says: that the
               | case in which the loop guard overflows can be ignored,
               | allowing an implementation to conclude that the loop
               | iterates exactly sixteen times in all scenarios it is
               | required to consider.
               | 
               | All you seem to be doing here is reiterating, over and
               | over again, "the standard says the behavior of the loop
               | is undefined" to argue that the loop has no meaning,
               | while ignoring that a different page of the same standard
               | actually gives an allowable range of meanings to what it
               | means for "behavior to be undefined", and that therefore
               | any one of those meanings is, in fact, precisely within
               | the bounds of the standard.
               | 
               | We can validly say that the standard says "for (int
               | i=param; i < param + 16; i++)" means "iterate 16 times
               | always". We can validly say that the standard says "for
               | (int i=param; i < param + 16; i++)" means "launch angband
               | when param + 16 exceeds INT_MAX". Both are true
               | statements.
        
               | thewakalix wrote:
               | Perhaps there's an implicit quantifier here: "for all
               | valid implementations of the C standard, the loop count
               | is guaranteed to be 16" versus "there exists a valid
               | implementation of the C standard in which...".
               | 
               | (This line of thought inspired by RankNTypes, "who
               | chooses the type", etc.)
        
             | anarazel wrote:
             | That's precisely my point? Because the overflow case is
             | undefined, the compiler can assume it doesn't happen and
             | optimize based on the fixed loop count.
        
               | rurban wrote:
               | The overflow case is not UB. param can be unsigned, or
               | fwrapv may be declared. Or the compiler chooses to
               | declare fwrapv by default. In no case is the compiler
               | allowed to declare the overflow away, unless it knows
               | from before that param can not overflow. The optimization
               | on loop count 16 can still happen with a runtime guard.
        
               | _kst_ wrote:
               | If param is unsigned, then "param + 16" cannot overflow;
               | rather, the value wraps around in a language-defined
               | manner. I've been assuming that param is of type int (and
               | I stated that assumption).
        
               | OskarS wrote:
               | The loop counter is signed even if param is not, so i++
               | could overflow. fwrapv is a compiler flag, it is not part
               | of the standard: it is a flag that mandates a certain
               | behaviour in this case, but in standard C, the loop
               | variable overflowing is definitely UB. No runtime guard
               | needed, C compilers are just allowed to assume a fixed
               | length. This is the whole reason signed overflow is UB in
               | C, for exactly cases like this.
        
           | not2b wrote:
           | That's the exact reason why this rule was introduced into the
           | standard: it was so C compilers could compete with Fortran
           | compilers (Fortran has similar rules and at the time they
           | were beating C compilers on equivalent scientific codes by
           | 2-3x).
           | 
           | Fortran has even more restrictive aliasing rules than C: a
           | function is allowed to assume that any two array arguments
           | passed as arguments do not overlap. If they do, the behavior
           | is undefined.
        
           | simias wrote:
           | I don't know if there exists a C compiler that leverages this
           | feature but there are ISAs (for instance MIPS) that can trap
           | on signed overflow.
           | 
           | The fact that it's UB in C means that you can tell the
           | compiler to generate these exception-generating instructions,
           | which could make some overflow bugs easier to track down
           | without any performance implications. And your compiler would
           | still be 100% compliant with the standard.
           | 
           | That being said I just tried and at least by default GCC
           | emits the non-trapping "ADDU" even for signed adds, so maybe
           | nobody actually uses that feature in practice.
        
             | anarazel wrote:
             | That doesn't really help with the compiler optimization
             | aspect: a typical use of the range information would be to
             | unroll the loop - in which case there's no addition to trap
             | on anymore.
        
               | tsimionescu wrote:
               | To be fair, if you want to make sure that loop is
               | unrolled even in the presence of -fwrapv, writing it as
               | for (int i=0; i < 16; i++) {/* use i+param */} is a very
               | simple change for you to make even today. You'll have to
               | make much uglier changes to code if you're at the level
               | of optimization where loop unrolling really matters for
               | your code on a modern processor.
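
The rewrite suggested above, sketched out (`sum_window` and the summing body are the editor's illustration): the loop bound becomes a literal 16, so the trip count is obvious under any overflow rules, with or without -fwrapv:

```c
#include <assert.h>

/* Fixed-range form of the loop: the bound no longer involves
 * `param + 16`, so the trip count is 16 regardless of overflow
 * semantics and the compiler can unroll or vectorize without any
 * UB-based reasoning.  The offset is widened to long long so the
 * body itself stays free of signed int overflow. */
long long sum_window(int param) {
    long long sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (long long)param + i;
    return sum;
}
```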
        
         | vyodaiken wrote:
         | E.g. removing a check for overflow is definitely NOT
         | ignoring the behavior. Deleting a write because it would be
         | undefined behavior for a pointer to point at some location is
         | also NOT ignoring the behavior. Ignoring the behavior is
         | exactly what the rationale is describing when it says UB allows
         | compilers to not detect certain kinds of errors.
         | 
         | Returning a pointer is certainly a use. In any event, the
         | prevailing interpretation makes it impossible to write a
         | defined memory allocator in C.
         | 
         | If a program writes through a dangling pointer and clobbers a
         | return address, the programmer made an error and unpredictable
         | results follow. C is inherently memory unsafe. No UB based
         | labyrinth of optimizations can change that. It is not designed
         | to be memory safe: it has other design goals.
        
           | aw1621107 wrote:
           | > e.g. removing a check for for overflow is definitely NOT
           | ignoring the behavior. Deleting write because it would be
           | undefined behavior for a pointer to point at some location is
           | also NOT ignoring the behavior.
           | 
           | Depending on how you look at it, this _is_ ignoring the
           | behavior.
           | 
           | For example, say you have this:
           | 
           |     int f(int a) {
           |         if (a + 1 < a) {
           |             // Handle error
           |         }
           |         // Do work
           |     }
           | 
           | You have 2 situations:
           | 
           |     1. a + 1 overflows
           |     2. a + 1 does not overflow
           | 
           | Situation 1 contains undefined behavior. If the compiler
           | decides to "ignor[e] the situation completely", then
           | Situation 1 can be dropped from consideration, leaving
           | Situation 2. Since this is the only situation left, the
           | compiler can then deduce that the condition is always false,
           | and a later dead code elimination pass would result in the
           | removal of the error handling code.
           | 
           | So the compiler is ignoring the behavior, but makes the
           | decision to do so by not ignoring the behavior. It's slightly
           | convoluted, but not unreasonable.
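
The usual fix for this footgun (a sketch; the function names are the editor's): test against the limit before doing the arithmetic, so no execution ever reaches the undefined operation and there is nothing for the compiler to delete:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* UB-free overflow check: `a == INT_MAX` is a plain comparison the
 * compiler cannot reason away, unlike `a + 1 < a`, which can only
 * be true if overflow (UB) has already happened. */
bool increment_would_overflow(int a) {
    return a == INT_MAX;
}

int saturating_increment(int a) {
    if (increment_would_overflow(a))
        return INT_MAX;   /* handle the error; here we saturate */
    return a + 1;
}
```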
        
             | vyodaiken wrote:
             | More than slightly convoluted. The obvious intention is
             | that the compiler ignores overflow and lets the processor
             | architecture make the decision. Assuming that overflow
             | doesn't happen is assuming something false. There's no
             | excuse for that and it doesn't "optimize" anything.
        
         | UncleMeat wrote:
         | IMO, implementation-defined is worse. It is still a time bomb,
         | but now it is one that you cannot catch automatically with
         | compiler errors.
        
         | pif wrote:
         | Unspecified result means the compiler must think about what
         | could happen in case I made an error.
         | 
         | UB means the compiler will trust me and concentrate on
         | generating the fastest code ever.
         | 
         | C is for clever programmers; if you don't want to be clever,
         | you are free to use Go or something like that.
        
           | masklinn wrote:
           | > UB means the compiler will trust me and concentrate on
           | generating the fastest code ever.
           | 
           | In reality, UB means the compiler will assume it doesn't
           | happen and work from there.
           | 
           | Of course a more expressive language could just make it so
           | the compiler doesn't have to assume this e.g. a C compiler
           | will consider a dereference as meaning the pointer is non-
           | null, both backwards and forwards.
           | 
           | But if the language had non-null pointers, it would not need
           | to bother with that, it would have a non-null pointer in the
           | first place. It could still optimise nullable pointers (aka
           | lower nullable pointers to non-nullable if they're provably
           | non-nullable, usually after a few rounds of inlining), but
           | that would be a much lower priority.
        
           | cygx wrote:
           | It's not so much about cleverness, but knowledge and
           | vigilance. You first have to be aware of all the footguns,
           | and then be careful not to let any of them slip through...
        
             | pif wrote:
             | > You first have to be aware of all the footguns,
             | 
             | Knowing your tools is part of being a professional. C is
             | not for amateurs.
        
               | rurban wrote:
               | Then use such a tool, but don't call it C, rather
               | -std=gnuc-opt11, which always knows better than the
               | author, without any warning.
               | 
               | Call it randomC, unsuitable for professional programmers,
               | but extremly suitable for benchmark games and managers.
               | Who prefer to ignore pesky overflows, underflows, memset,
               | memcpy, dereferencing NULL pointers and other rare cases.
        
         | nwallin wrote:
         | > For example, if signed integer overflow yielded an
         | unspecified result rather than causing undefined behavior, I
         | wonder if any implementations would be adversely affected.
         | 
         | Yes.
         | 
         | There are several architectures where signed integer overflow
         | traps, just like division by 0 on x86 (which is why division
         | by 0 is UB). If a C compiler for those architectures was
         | required to yield an unspecified result instead of trapping,
         | every time the code performed a signed integer
         | addition/subtraction, it would need to update a trap handler
         | before and afterward to return an unspecified value instead of
         | invoking the normal trap handler.
        
       | jacinabox wrote:
       | Sort of a second amendment for footguns.
        
       | layer8 wrote:
       | Because there seems to be some confusion in this thread:
       | 
       | - "Implementation-defined behavior" means that the C standard
       | specifies the allowable behaviors that a C implementation must
       | choose from, and the implementation must document its particular
       | choice.
       | 
       | - "Unspecified behavior" means that the C standard provides two
       | or more possibilities for the behavior, and a C implementation
       | must pick one, but need not document its choice or even make
       | the same choice consistently.
       | 
       | - "Undefined behavior" means that C implementations are allowed
       | to assume that the respective runtime condition does not ever
       | occur, and for example can generate optimized code based on that
       | assumption. In particular, it is free to _not_ decide any
       | behavior for the condition (let alone document it). As a
       | consequence, if the runtime condition actually _does_ occur, this
       | can affect the behavior of _any_ part of the program, even the
       | behavior of code executed before the condition would occur. This
       | is because from a false assumption the truth of _any_ statement
       | can be logically derived (principle of explosion [0]). And that
       | is why the C standard does not restrict the behavior of the whole
       | program if it contains undefined behavior.
       | 
       | [0] https://en.m.wikipedia.org/wiki/Principle_of_explosion
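
The three categories can be pinned down with one expression each (editor's illustrations, not taken from the comment):

```c
#include <assert.h>
#include <limits.h>

/* Well-defined: unsigned arithmetic wraps modulo 2^N by the
 * standard itself, on every implementation. */
unsigned wraps(void) { return UINT_MAX + 1u; /* always 0 */ }

/* Implementation-defined: converting an out-of-range value to a
 * signed type (C11 6.3.1.3p3).  The implementation must document
 * the result; on common two's-complement compilers it is -1. */
int converts(void) { return (int)UINT_MAX; }

/* Undefined: signed overflow.  This body is only UB if actually
 * reached with x == INT_MAX; the compiler may assume that call
 * never happens. */
int overflows(int x) { return x + 1; /* UB when x == INT_MAX */ }
```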
        
         | tsimionescu wrote:
         | I would note that the article is explicitly contesting the
         | definition of UB that you are giving here (though you are
         | absolutely right that this is the de facto definition used by
         | all major compilers, and the committee).
         | 
         | Basically the article is arguing that UB should be similar to
         | Unspecified behavior - behavior that the implementation leaves
         | up to the hardware and/or OS.
         | 
         | I'm not sure where I fall on this issue, though I would note
         | that the definition of UB in the standard needs quite a bit of
         | interpretation to arrive at the commonly used definition you
         | are quoting. That is, while I think the definition you give is
         | _compatible_ with the one in the standard, I don't think it
         | _is_ the definition of the standard, which is much softer about
         | how UB should impact the semantics of a C program. In
         | particular, nothing in the wording of the standard explicitly
         | says that an implementation is expected to assume UB doesn't
         | happen, or that a standard-conforming program can't have UB.
        
           | Gaelan wrote:
           | > and the committee
           | 
           | I'd argue that for a document like the C standard, if there's
           | a well-known intended meaning, that _is_ the meaning of the
           | document - any other interpretation is purely academic.
        
             | simonh wrote:
             | These are all interpretations. You're trying to give one
             | interpretation special privilege by calling it the
             | intended meaning, when really it's what I'd call an
             | authoritative interpretation: the one accepted by an
             | authority. That doesn't necessarily make it the actual
             | intended meaning though.
        
               | Gaelan wrote:
               | I was under the impression the "committee" here was the
               | people who wrote the C standard? So the accepted
               | interpretation by the people who wrote the document must
               | be the intended meaning, no? If it wasn't, they'd have
               | written something else.
               | 
               | (Sure, it's possible they wrote something, then only
               | later came to understand the implications of this. But
               | the standard has been understood as allowing nasal demons
               | since long before the latest revision of the C standard,
               | so I'd argue that by publishing a new version without
               | rewording that section, they're making clear that they
               | intend its current interpretation to be part of the
               | standard.)
        
           | bregma wrote:
           | from ISO/IEC 9899:2011 "Programming Languages -- C"
           | 3.4.3
           | 
           | 1 undefined behavior: behavior, upon use of a nonportable
           | or erroneous program construct or of erroneous data, for
           | which this International Standard imposes no requirements
           | 
           | 2 NOTE Possible undefined behavior ranges from ignoring the
           | situation completely with unpredictable results, to behaving
           | during translation or program execution in a documented
           | manner characteristic of the environment (with or without
           | the issuance of a diagnostic message), to terminating a
           | translation or execution (with the issuance of a diagnostic
           | message).
           | 
           | It doesn't look like a leap to go from this definition to
           | that of ignoring the situation completely with unpredictable
           | results.
        
             | alerighi wrote:
             | To me, ignoring the situation completely with unpredictable
             | results would mean: the compiler generates assembly that
             | may not be correct, and the behaviour of the program is
             | then determined by what the processor does.
             | 
             | Doing something like removing checks is not ignoring the
             | situation: it is acting in a particular way when undefined
             | behaviour is detected.
             | 
             | Nor does it have unpredictable results: the behaviour is
             | specified, since these checks are removed systematically.
             | 
             | I don't see anywhere in the standard that compilers are
             | free to change the semantics of the program at their
             | discretion if undefined behaviour is detected. Rather,
             | undefined means to me that the compiler generates code
             | whose result cannot be known because it depends on external
             | factors (basically the hardware implementation).
        
             | fiter wrote:
             | This quote is the topic of the original article and the
             | article goes into detail about how it believes the quote
             | should be interpreted.
        
               | MauranKilom wrote:
               | ...and yet the article completely ignores the "with
               | _unpredictable_ results" part and instead spends a lot
               | of time discussing all the other valid consequences
               | (which are also only mentioned as examples, at least in
               | the common understanding of "from ... to ...").
               | 
               | Downthread commenters go into more detail regarding the
               | "ignoring e.g. the possibility of signed overflow may
               | mean to assume that it never happens" reading, so I won't
               | elaborate on it here.
        
               | _Nat_ wrote:
               | The original article's interpretation seemed untenable.
               | 
               | While the difference between "_Permissible_" and
               | "_Possible_" could be quite significant, in this case,
               | it was qualifying:
               | 
               | > [ _Permissible_ / _Possible_ ] undefined behavior
               | ranges from ignoring the situation completely with
               | unpredictable results, to behaving during translation or
               | program execution in a documented manner characteristic
               | of the environment (with or without the issuance of a
               | diagnostic message), to terminating a translation or
               | execution (with the issuance of a diagnostic message).
               | 
               | The previously-"_Permissible_" behaviors were so broad
               | that they basically allowed anything, including
               | translating the source-code in any documented manner,
               | which basically means that, as long as a compiler says
               | how it'll treat undefined-behavior, it can do it that
               | way, because it's free to completely reinterpret the
               | source-code in any (documented) manner.
        
         | cygx wrote:
         | To my knowledge, in case of unspecified behaviour, it's not
         | required to pick and document a particular choice. Behaviour
         | may even vary on a case by case basis where it makes sense (eg
         | order of evaluation of function arguments, whether an inline or
         | external definition of an inline function gets used, etc).
         | 
         | The need to properly document the choice is the defining
         | characteristic of implementation-defined behaviour.
        
         | nwallin wrote:
         | > - "Undefined behavior" means that C implementations are
         | allowed to assume that the respective runtime condition does
         | not ever occur, and for example can generate optimized code
         | based on that assumption.
         | 
         | Please note that the article is making the specific argument
         | that this interpretation of UB is an _incorrect_
         | interpretation. The author is arguing that you, me, the llvm
         | and gcc teams are wrong to interpret UB that way.
         | 
         | Linux had a bug in it a few years ago; the code would
         | dereference a pointer, then check if it was null, then return
         | an error state if it was null, or continue performing the
         | important part of the function. The compiler deduced that if
         | the pointer had been null when it was dereferenced, that's UB,
         | so the null check was unnecessary, and optimized the null check
         | out. The trouble was that in that context, a null pointer
         | dereference didn't trap, (because it was kernel code? not
         | sure.) so the bug was not detected. It ended up being an
         | exploitable security vulnerability in the kernel, I think a
         | local privilege escalation.
         | 
         | The article is making the argument that the compiler should
         | _not_ be free to optimize out the null check before subsequent
         | dereferences. The compiler is permitted to summon nasal demons
         | where the pointer is dereferenced the first time, but should
         | not be free to summon nasal demons at later lines of code,
         | after the no-nasal-demons-please check.
         | 
         | (The linux kernel now uses -fno-delete-null-pointer-checks to
         | ensure that doesn't happen again. The idea is that even though
         | it was a bug that UB was invoked, the failure behavior should
         | be safe instead of granting root privileges to an unprivileged
         | user.)
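         | A minimal sketch of that bug pattern (hypothetical names, not
         | the actual kernel code): the early dereference licenses the
         | compiler to assume the pointer is non-null, so the later
         | check can be deleted as dead code.

```c
#include <stddef.h>

struct device { int flags; };

/* The load on the first line implies dev != NULL under the UB rules,
   so a compiler may delete the null check below. */
int read_flags(const struct device *dev) {
    int flags = dev->flags;   /* UB if dev == NULL */
    if (dev == NULL)          /* may be optimized away */
        return -1;
    return flags;
}
```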
         | 
         | Fun with NULL pointers part 1 https://lwn.net/Articles/342330/
         | 
         | Fun with NULL pointers part 2 https://lwn.net/Articles/342420/
        
           | tom_mellior wrote:
           | > The trouble was that in that context, a null pointer
           | dereference didn't trap, (because it was kernel code? not
           | sure.) so the bug was not detected.
           | 
           | Yes, because it was kernel code. Because that dereference is
           | completely legal in kernel code. The C code was fine,
           | assuming that it was compiled with appropriate kernel flags.
           | This was not a bug in Linux, at least not on the level of the
           | C code itself.
           | 
           | > The linux kernel now uses -fno-delete-null-pointer-checks
           | to ensure that doesn't happen again.
           | 
           | I also seem to remember that it was already using other
           | "please compile this as kernel code" flags that _should have_
           | implied  "no-delete-null-pointer-checks" behavior, and that
           | the lack of this implication was considered a bug in GCC and
           | fixed.
        
           | alerighi wrote:
           | By the way, dereferencing NULL is well-defined behaviour on
           | every computer architecture: you are basically reading
           | memory at address 0. It just causes a crash if you have an
           | operating system, since it will cause a page fault, but in
           | kernel mode or on devices without an OS it is a legitimate
           | thing to do (and even useful in some cases).
           | 
           | Why should C compilers make it undefined? The standard
           | doesn't mandate that undefined behaviour should change the
           | semantics of the program. Just define all the undefined
           | behaviour that you can; to me, keeping it undefined makes no
           | sense (even from the standard's point of view, everyone
           | knows that if you overflow an int it wraps around, so why
           | should it be undefined??)
        
       | gameswithgo wrote:
       | Lots of us use languages that have no/almost no undefined
       | behavior, it is really weird to see people trying to come up with
       | reasons why this is hard. Perhaps there are still computing
       | environments so slow that undefined behavior still makes sense in
       | C, and C should remain that way. If so, use something with less
       | of this for anything else.
        
       | ho_schi wrote:
       | This is interesting. Humans interpret standards and compilers
       | interpret code, too :)
       | 
       | We have the benefits of a standard, portability and multiple
       | compilers. But it also comes with duties.
       | 
       | BTW.
       | 
       | The C++ people seem to tackle some of the issues around
       | widespread 'undefined behavior' and have defined a lot of it:
       | 
       | 1) If a side effect on a scalar object is unsequenced relative
       | to another side effect on the same scalar object, the behavior
       | is undefined.
       | 
       |     i = ++i + 2;       // undefined behavior until C++11
       |     i = i++ + 2;       // undefined behavior until C++17
       |     f(i = -2, i = -2); // undefined behavior until C++17
       |     f(++i, ++i);       // undefined behavior until C++17,
       |                        // unspecified after C++17
       |     i = ++i + i++;     // undefined behavior
       | 
       | 2) If a side effect on a scalar object is unsequenced relative
       | to a value computation using the value of the same scalar
       | object, the behavior is undefined.
       | 
       |     cout << i << i++; // undefined behavior until C++17
       |     a[i] = i++;       // undefined behavior until C++17
       |     n = ++i + i;      // undefined behavior
       | 
       | Source: https://en.cppreference.com/w/cpp/language/eval_order
       | 
       | They replaced the entire "Sequence Point Rules" with "Sequenced-
       | before rules". Compiler implementers cannot choose what to do
       | (implementation-defined) or neglect the issue (undefined
       | behavior) - they must now act well-defined in many situations.
        
       | mpweiher wrote:
       | https://news.ycombinator.com/item?id=17454467 made very much the
       | same argument.
        
       | dfabulich wrote:
       | I think the problem is cultural in the C community. C programmers
       | have Stockholm Syndrome around UB optimizations.
       | 
       | As TFA notes, "There is No Reliable Way to Determine if a Large
       | Codebase Contains Undefined Behavior"
       | https://blog.llvm.org/2011/05/what-every-c-programmer-should...
       | 
       | That's because UB is a bug that occurs as your program runs (e.g.
       | dereferencing a null pointer). You'd have to prove that your
       | program is free of bugs to prove that you won't encounter UB at
       | runtime, which is not possible in general.
       | 
       | But C programmers believe they deserve to be abused this way.
       | 
       | "Sure, the C compiler is allowed to nuke your entire program
       | whenever there's a bug at runtime, but all you have to do is
       | prove that your program is bug free! Why, any _real_ C programmer
       | can write bug-free code, so if you have any UB bugs, you're
       | probably not good enough to write C. Sure, I've been bruised by
       | UB bugs in the past, we all have, but when I made those mistakes,
       | I deserved to be bruised, and those bruises made me a better
       | programmer."
        
         | pif wrote:
         | You don't get it: if you are ready to give up top performance,
         | you can choose among a multitude of more forgiving languages.
         | 
         | There is simply no point in having a C language with less than
         | top level optimizations.
        
           | kstrauser wrote:
           | Name a feature of C that makes it inherently "faster" than
           | another compiles-to-machine-code language. I can name plenty
           | that make it slower.
        
         | vyodaiken wrote:
         | Good point. C programmers have come to assume that C is a
         | shitty language with terrible semantics and it's their fault
         | for not using Rust or something. They blame the language for
         | the mess made by standards/compilers.
        
         | jerf wrote:
         | I expect there's already been a lot of evaporative cooling in
         | the C community and there will only be more over time.
         | Increasingly the people who are going to be left in the C
         | community are precisely those people who think C is generally
         | OK. Anyone who has developed a high degree of fear of C is
         | ever-increasingly doing what it takes to move away.
         | 
         | I have tried to phrase this neutrally, as to not be a comment
         | about whether or not C "really is" generally OK. I am not sure
         | I have 100% succeeded, as I have my opinions. But the point
         | stands regardless; the people in the C community are generally
         | going to be the ones who are OK with C.
        
         | anarazel wrote:
         | The situation has gotten at least somewhat better since then.
         | Ubsan and friends are not a guarantee, but make it much more
         | realistic to find UB issues.
        
         | optymizer wrote:
         | Anecdotally, I find about as many bugs in my Java code at work,
         | as I do in my hobby C code, including "dereferencing null
         | pointers" aka NullPointerExceptions.
        
           | anarazel wrote:
           | NPE isn't really a relevant comparison point - it raises an
           | exception of sorts in both languages. Accessing an array past
           | its end, accessing freed memory, reading uninitialized memory
           | seem more apt.
        
             | optymizer wrote:
             | These are all issues a compiler could insert checks for,
             | saving you the time to do it manually, while trading
             | performance for security or correctness.
             | 
             | C doesn't trade performance away, so if you'd like to pay
             | the price of these checks, it's on you, the programmer to
             | add them in. C programmers have to put in extra effort to
             | make the executables safer and output correct results.
             | 
             | Other languages trade off performance for safety and/or
             | correctness. The programmers using them have to put in
             | extra effort to make the executables run as fast as they do
             | in C.
             | 
             | Ultimately, programmers tend to make rational choices, and
             | the stockholm syndrome mentioned above is really just C
             | programmers dealing with the consequences of the trade-off
             | they made by using C.
        
               | vyodaiken wrote:
               | C is designed to allow programmers to insert or remove
               | checks as performance tuning. Compiler UB "optimizations"
               | that remove that ability from the programmer make the
               | language unusable.
        
           | jerf wrote:
           | "Undefined behavior" is a term of art relevant to C, meaning
           | that the standard no longer has _any_ comment about what
           | happens. Thus the comments about launching nukes, or
           | destroying your machine, etc., being standards complaint,
           | even though obvious real compilers won 't actually emit code
           | that does that.
           | 
           | Dereferencing a nil pointer in Java is not undefined
           | behavior. It is defined; it throws a NullPointerException,
           | and "it throws a NullPointerException" is a very specific
           | term of art in the Java community that is very precisely
           | defined.
           | 
           | Many languages don't have an explicit concept of "undefined
           | behavior" in their standard (though, of course, due to
           | deficiencies in the standard they may have implicitly
           | undefined behavior), and of those that do I doubt anyone
           | matches C or C++ in the commonly-used category of languages.
        
             | optymizer wrote:
             | Point taken about NPEs being defined behavior, but from a
             | practical point of view, a bug is a bug (and bugs are what
             | the parent comment was referring to).
             | 
             | Whether the bug was due to a well-defined or undefined
             | behavior seems like an exercise in assigning blame to
             | another entity (the language, the committee, the compiler,
             | etc).
        
               | jerf wrote:
               | Correct assignment of (technical) blame is an important
               | engineering task, though. You sound to me like you are
               | thinking that's some sort of wrong thing to do, but I
               | would disagree. Identifying whether the bug is a defined
               | (in C) behavior or the result of the compiler making a
               | certain choice as the result of your code invoking
               | undefined behavior is an important element of fixing the
               | bug; if you don't know which is which you're operating
               | under a critical handicap.
        
         | Asooka wrote:
         | > But C programmers believe they deserve to be abused this way.
         | 
         | I don't know of any C programmers who think that way. We accept
         | the state of affairs because there is little alternative. We're
         | not going to rewrite our code in another language, because we
         | mostly need to write in C for various reasons. Also Rust
         | inherits LLVM's undefined behaviour, so it's no panacea. We
         | mostly keep adding flags to turn off the most excessive
         | optimisations and pray.
        
           | steveklabnik wrote:
           | Rust inheriting it is the equivalent of a codegen bug, and
           | they get patched. The cause here isn't LLVM, it's language
           | semantics.
        
       | pjmlp wrote:
       | Duplicate, https://news.ycombinator.com/item?id=27217981
        
         | yesenadam wrote:
         | That only got 1 comment.
        
           | pjmlp wrote:
           | Just like this one, when I posted mine here
        
             | setr wrote:
             | Linking a failed discussion from another (possibly) failed
             | discussion seems kind of pointless.
        
               | pjmlp wrote:
               | Just like this thread.
        
       | whatshisface wrote:
       | I don't see any utility in inventing a new reading of the
       | standard. Getting everyone to agree on a new interpretation of a
       | sentence can't possibly be easier than getting everyone to agree
       | on a more clearly worded sentence. The actual thing you'd have to
       | convince everyone of (the utility of the new consensus) and the
       | people you'd have to convince (compiler writers, documentation
       | authors) are the same in both cases.
        
         | setr wrote:
         | The difference is that, once decided and written, no party
         | (both old and new) can't chime in with a new interpretation.
        
           | JadeNB wrote:
           | > The difference is that, once decided and written, no party
           | (both old and new) can't chime in with a new interpretation.
           | 
           | That double negative ("no party ... can't") was accidental,
           | right?
        
       | klodolph wrote:
       | I think one of the best attempts to solve this is the attempt to
       | classify undefined behavior into "bounded" and "critical" UB,
       | which is a distinction that hasn't gained as much traction as I'd
       | like.
       | 
       | Something like 1<<32 is "bounded" UB, and can't result in
       | arbitrary things happening to your program... it is permitted
       | to result in indeterminate values, it is permitted to trap, but
       | that's basically it.
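       | For illustration, a guarded shift sidesteps this UB entirely (a
       | sketch with a hypothetical helper name, not from the standard):
       | shifting a 32-bit value by 32 or more is undefined, so check
       | the count before shifting.

```c
#include <stdint.h>

/* Shifting a 32-bit value by >= 32 is UB, so guard the shift count
   rather than relying on whatever the hardware happens to do. */
uint32_t checked_shl(uint32_t v, unsigned n) {
    return n < 32 ? v << n : 0;
}
```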
       | 
       | Critical UB is stuff like writing to a const object, calling a
       | function pointer after casting it to an incompatible type,
       | dereferencing an invalid pointer, etc. Basically, anything goes.
       | 
       | This is part of the "analyzability" optional extension which I
       | wish would gain more traction.
        
         | MauranKilom wrote:
         | > Something like 1<<32 is "bounded" UB, and can't result in
         | arbitrary things happening to your program... it is permitted
         | to result in indeterminate values, it is permitted to trap,
         | but that's basically it.
         | 
         | That is what implementation-defined behavior is for. Feel free
         | to advocate for changing the standard accordingly.
        
           | klodolph wrote:
           | > Feel free to advocate for changing the standard
           | accordingly.
           | 
           | I am, in fact, describing the existing C standard, not
           | advocating for changes. Please refer to Annex L
           | "Analyzability", which defines the terms I used: "bounded
           | undefined behavior" and "critical undefined behavior". Note
           | that this section is conditional... it is not widely adopted.
           | I am advocating for increased adoption of this part of the
           | standard, or barring that, revisions that would make adoption
           | more palatable.
           | 
           | And in general, "If you don't like it, advocate changing the
           | standard" is not a useful or insightful response. It is
           | completely reasonable and normal to complain about something
           | without trying to fix it.
        
             | MauranKilom wrote:
             | You're right, apologies.
        
       | jcranmer wrote:
       | It's easy to pick on undefined behavior in C when you focus on
       | the more gratuitous undefined behaviors such as signed overflow
       | or oversized shifts. I'm not certain why these are undefined
       | behavior instead of implementation-defined, but my suspicion is
       | that these caused traps on some processors, and traps are
       | inherently undefined behavior.
       | 
       | Instead, if you dislike undefined behavior, I challenge you to
       | come up with workable semantics for non-gratuitous scenarios, of
       | which I'll lay out three. Remember, if it's not undefined, then
       | there are some defined semantics that need to be preserved, even
       | if implementation-defined, so you should be able to explain those
       | semantics. If you can't find such semantics, then maybe undefined
       | behavior _isn 't_ such a bad thing after all.
       | 
       | The first case is the compiler hint. A function marked _Noreturn
       | returns. Or you alias two pointers marked restrict. Remember that
       | the entire point of these hints is to permit the compiler to not
       | generate code to check for these scenarios.
       | 
       | The second case is uninitialized memory. You've probably already
       | come up with such semantics, so I'll point out an optimization
       | scenario that you probably don't object to that your semantics
       | didn't cover:
       | 
       |     static int y; /* uninitialized */
       |     _Bool cond = f();
       |     int x = cond ? 2 : y; /* Is it legal to fold this to
       |                              int x = 2; ? */
       | 
       | Hopefully, you'll agree that that is a reasonable optimization.
       | Now consider this code:
       | 
       |     static int y; /* uninitialized */
       |     _Bool cond1 = f(), cond2 = g();
       |     int x1 = cond1 ? 2 : y; /* So int x1 = 2; */
       |     int x2 = cond2 ? 3 : y; /* So int x2 = 3; */
       |     if (!cond1 && !cond2) {
       |       assert(x1 == x2); /* Uh... */
       |     }
       | 
       | This is just scraping the tip of the surface of uninitialized
       | values. Developing sane semantics around these sorts of values
       | that also allow reasonable optimizations is challenging. You can
       | look at the very lengthy saga that is undef, poison, and freeze
       | in LLVM (still ongoing!) to see what it looks like in practice.
       | 
       | The third category is traps. Let's pick an easy example: what
       | happens if you dereference a null pointer? Now let's consider the
       | consequences of those semantics on some more code examples:
       | 
       |     int *x = NULL;
       |     int *y = &*x; /* Can this be lowered to int *y = x; ? */
       |     size_t offset = (size_t)&((struct foo *)NULL)->field;
       |                     /* Remember, x->a is actually (*x).a */
       |     int *z = get_pointer();
       |     *z; /* No one uses the result of the load, can I delete it? */
       |     for (int i = 0; i < N; i++) {
       |       foo(*z); /* Can I hoist the load out of the loop? */
       |     }
       | 
       | Note that all of the optimizations I'm alluding to here are ones
       | that would have existed all the way back in the 1980s when C was
       | being standardized, and these are pretty basic, pedestrian
       | optimizations that you will cover in Compilers 101.
        
         | vyodaiken wrote:
         | "static int y; /* uninitialized _/ _Bool cond = f(); int x =
         | cond ? 2 : y;  /_ Is it legal to fold this to int x = 2; ? _/
         | Hopefully, you 'll agree that that is a reasonable
         | optimization. Now consider this code:"
         | 
         | Do not agree. That's not an optimization it's just false
         | reasoning. Reasonable would be to either ignore it so the code
         | depends on the uniitialized data, whatever it is, or to flag an
         | error. Also according to the standard, static variable of
         | arithmetic type are initialized to zero by default so there is
         | no UB at all.
         | 
         | Second example has the same problem.
         | 
         | Third example has, for example, assumptions that C does not let
         | the compiler make e.g. that z is a restricted pointer. Imagine
         | that f(int c) { z = getsamepointer(); _z +=1;return *z; }
         | 
         | And none of those optimizations existed in the 1980s.
        
         | anarazel wrote:
         | > It's easy to pick on undefined behavior in C when you focus
         | on the more gratuitous undefined behaviors such as signed
         | overflow or oversized shifts.
         | 
         | I wouldn't even mind the signed integer overflow thing that
         | much if there were a reasonable way in standard C to check
         | whether a signed operation would overflow.
         | 
         | It's not impossible to do correctly in a compiler independent
         | way, but ridiculously hard. And slow.
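         | One portable pre-C23 sketch (hypothetical helper name): test
         | the bounds before performing the addition, so the overflow
         | itself never happens.

```c
#include <limits.h>

/* Returns nonzero iff a + b would overflow int. The comparisons
   themselves cannot overflow, so this is fully defined behavior. */
int add_would_overflow(int a, int b) {
    if (b > 0) return a > INT_MAX - b;
    if (b < 0) return a < INT_MIN - b;
    return 0;
}
```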
        
           | anarazel wrote:
           | Another thing is that the wording around some of the UB
           | issues is just plain bad. The most extreme probably is the
           | rules around strict aliasing. That there was, for quite a
           | while, uncertainty whether the rules allow type punning by
           | reading a union member when the last write was to another
           | member is a good example of not taking reality into account.
           | Yes, memcpy exists - but it is even less type safe!
        
             | jcranmer wrote:
             | The union punning trick is UB in C89 and well-defined in
             | C99 and later, although it was erroneously listed in the
             | (non-normative) Annex listing UBs in C99 (removed by C11).
             | 
             | Strict aliasing is another category of UB that I'd consider
             | gratuitous.
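             | The punning trick in question, as a sketch (assumes float
             | is 32-bit IEEE 754, which all mainstream platforms
             | provide): write one union member, read another.

```c
#include <stdint.h>

/* Union type punning: UB in C89, well-defined in C99 and later.
   Assumes float is 32-bit IEEE 754. */
static uint32_t float_bits(float f) {
    union { float f; uint32_t u; } p;
    p.f = f;
    return p.u;   /* read a member other than the last one written */
}
```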
        
               | anarazel wrote:
               | > The union punning trick is UB in C89 and well-defined
               | in C99 and later, although it was erroneously listed in
               | the (non-normative) Annex listing UBs in C99 (removed by
               | C11).
               | 
               | Right, that's my point. If the standard folks can't
               | understand their standard, how are mere mortals supposed
               | to?
               | 
               | > Strict aliasing is another category of UB that I'd
               | consider gratuitous.
               | 
               | I'm of a bit of a split mind on it. It can yield
               | substantial speedups, but it is also impractical in a
               | lot of cases. And it's often not strong enough anyway,
               | requiring explicit restrict annotations for the compiler
               | to understand that two pointers don't alias. It turns
               | out two pointers of the same (or compatible) type aren't
               | rare in performance-critical sections...
               | 
               | Realistically it should have been opt-in.
        
               | nullc wrote:
               | > Strict aliasing is another category of UB that I'd
               | consider gratuitous.
               | 
               | Without it you cannot vectorize (or even internally re-
               | order) many loops which are currently vectorizable
               | because the compiler can't statically prove arguments
               | won't alias otherwise.
        
           | anarazel wrote:
           | See e.g. https://github.com/postgres/postgres/blob/8bdd6f563a
           | a2456de6...
        
         | tlb wrote:
         | What version of C doesn't require `static int y;` to be
         | initialized to 0?
        
           | jcranmer wrote:
           | You're right, it should be automatic storage duration instead
           | of static storage duration.
        
         | Lvl999Noob wrote:
         | I am not a compiler dev or a high skill coder so my opinion
         | might not matter much but I'll still lay them out here.
         | 
         | > The first case is the compiler hint...
         | 
         | The compiler should refuse to compile if it comes upon such a
         | case. Those hints are as much used by programmers as they are
         | by the compiler. It should emit a warning and necessary checks
         | + abort code if those hints can neither be proven nor disproven
         | statically.
         | 
         | > The second case is uninitialized memory...
         | 
         | For the first example, unless the compiler knows that f() must
         | always be true, it should give a compile-time error. In case
         | it does know that f() is always true, it should still emit a
         | warning (or give an error anyway).
         | 
         | > The third category is traps...
         | 
         | I am honestly not sure about this one. It would depend on the
         | behaviour defined for multiple references / dereferences done
         | simultaneously and type casting. I would still probably expect
         | the compiler to at least give out warnings.
         | 
         | Edit: language / grammar / typos
        
           | jcranmer wrote:
           | As a rule of thumb, yes, compilers should issue warnings
           | where undefined behavior is obviously occurring. (And there
           | is something to be said for compiling with -Werror). However,
           | that's not going to always work, and there are two reasons
           | for this.
           | 
           | The first reason is that undefined behavior is ultimately a
           | statement about dynamic execution of the program. The set of
           | expressions that could potentially cause undefined behavior
           | is essentially all of them--every signed arithmetic
           | expression, every pointer dereference, hell, almost all
           | function calls in C++--and figuring out whether or not they
           | actually do cause undefined behavior is effectively
           | impossible at the compiler level. This is why sanitizers were
           | developed, and also why sanitizers only work as a dynamic
           | property.
           | 
           | For a concrete example, consider the following code:
           |       extern void do_something(int * restrict a,
           |                                int * restrict b,
           |                                size_t n);
           | 
           |       void my_function(int *y, int *z, size_t x, size_t n) {
           |         if (x > n)
           |           do_something(y, z, n);
           |       }
           | 
           | This code could produce undefined behavior. Or it could not.
           | It depends on whether or not y and z overlaps. Maybe the
           | check of the if statement is sufficient to guarantee it.
           | Maybe it's not. It's hard to advocate that the compiler
           | should warn, let alone error, about this kind of code.
           | 
           | The second issue to be aware of is that there is a general
           | separation of concerns between the part of the compiler that
           | gives warnings and errors (the frontend), and the part that
           | is actually optimizing the code. It is difficult, if not
           | impossible, to give any kind of useful warning or error
           | message in the guts of the optimizer; by the time code
           | reaches that stage, it is often incredibly transformed from
           | the original source code, to the point that its correlation
           | with the original can be difficult to divine.
           | 
           | So I once came across some really weird code that broke an
           | optimization pass I was working on. It looked roughly like
           | this (approximate C translation of the actual IR):
           |       if (nullptr != nullptr) {
           |         int *x = nullptr;
           |         do {
           |           /* do some stuff with x */
           |           x++;
           |         } while (x != nullptr);
           |       }
           | 
           | What hideous code creates a loop that iterates a pointer
           | through all of memory? Why, this (after reducing the test
           | case):
           |       void foo() {
           |         std::vector<std::set<int>> x;
           |         x.emplace_back();
           |       }
           | 
           | So the crazy code was generated from the compiler very
           | heavily inlining the entire details of the STL, and the
           | original code was a more natural iteration from a start to an
           | end value. The compiler figured out enough to realize that
           | the start and end values were both null pointers, but didn't
           | quite manage to actually fully elide the original loop in
           | that case. Warning the user about the resulting undefined
           | behavior in this case is completely counterproductive; it's
           | not arising from anything they did, and there isn't much
           | they can do to silence that warning.
        
         | bhk wrote:
         | > It's easy to pick on undefined behavior in C when you focus
         | on the more gratuitous undefined behaviors ...
         | 
         | That is the _whole point_. There are scores of instances of
         | gratuitous UB.
         | 
         | > I'm not certain why these are undefined behavior instead of
         | implementation-defined, but my suspicion is that these caused
         | traps on some processors, and traps are inherently undefined
         | behavior.
         | 
         | Traps are not inherently undefined. The C standard discusses
         | floating point traps in detail. Many details may of course be
         | left to the implementation or platform to describe, but that's
         | very different from saying "all bets are off".
         | 
         | The real reason for the gratuitous UB was misunderstanding and
         | ignorance.
         | 
         | > Hopefully, you'll agree that that is a reasonable
         | optimization.
         | 
         | Hopefully, you'll agree that the code is buggy, and should be
         | fixed. We end up running expensive and cumbersome static
         | analysis tools that try to detect situations just like this ...
         | which the compiler itself has detected, but chosen not to
         | warn us about.
        
           | richardwhiuk wrote:
            | The first code sample isn't necessarily buggy.
        
       ___________________________________________________________________
       (page generated 2021-05-20 23:01 UTC)