[HN Gopher] Undefined behavior in C is a reading error
___________________________________________________________________
Undefined behavior in C is a reading error
Author : zdw
Score : 119 points
Date : 2021-05-20 14:29 UTC (8 hours ago)
(HTM) web link (www.yodaiken.com)
(TXT) w3m dump (www.yodaiken.com)
| jart wrote:
| > license for the kinds of dramatic and unintuitive
| transformations we've seen from the compilers, and any indication
| that undefined behavior should be a vehicle for permitting
| optimizations.
|
| Does anyone have an example of a time where Clang or GCC actually
| did something bad upon witnessing undefined behavior, rather than
| simply doing nothing, as the standard proposes? I ask because
| every time I've seen people get mad about UB it's always because
| they envisioned that the compiler would do something, but instead
| the compiler did nothing.
| Jiro wrote:
| This may not necessarily count as an example in the wild, but
| the 2013 Underhanded C contest at
| http://www.underhanded-c.org/_page_id_25.html includes this
| example:
|
|       h = abs(h) % HASHSIZE;
|       // Extra sanity check
|       if (h < 0 || h >= HASHSIZE)
|           h = 0;
|       return h;
|
| where h=INT_MIN causes h to remain negative and the sanity
| check is optimized out because abs(INT_MIN) is UB.
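|
| One way to sidestep the trap, as a sketch (assuming HASHSIZE
| is a positive int), is to do the reduction in unsigned
| arithmetic, where wrapping is defined:
|
|       unsigned uh = (unsigned)h;   /* conversion is well-defined */
|       h = (int)(uh % HASHSIZE);    /* always in [0, HASHSIZE) */
|       return h;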
| jart wrote:
| That's such a good example for teaching purposes, because it
| manages to combine the lack of understanding surrounding the
| remainder operator (i.e. unlike python % in c is _not_
| modulus) with the lack of understanding surrounding
| 0x80000000 (two's complement bane) into a single example.
| However it still makes my point that in this circumstance,
| the compiler's strategy still is to do nothing, because it
| can prove that the check could only be true under undefined
| behavior circumstances, so doing nothing means not compiling
| the check. I'm fine with that. Would anyone prefer that the
| compiler's internal definition of logic assume that absolute
| values can be negative? Must we throw out centuries of math
| because we've optimized integers to be 32 bits?
|
| The only thing that's problematic is we need better tools to
| bring logic assumptions to our attention. Currently, UBSAN
| can only warn us about that when two's bane actually gets
| passed to the function at runtime. So the only way to spot
| faulty logic we failed to consider is to both enable UBSAN
| and be super methodical about unit testing.
|
| Well, another thing I like to do is just read the assembly
| output. Constantly. Whenever I write something like a parser
| I've got a keyboard mapping that shows me the assembly output
| in Emacs with UBSAN enabled. If I turn off the noisy ones
| like pointer overflow then I can avoid these issues
| altogether by writing my code so that no UBSAN assembly gets
| generated, since the compiler won't show that as a warning.
| You literally have to read the -S assembly output to get the
| compiler warnings that are actually meaningful.
| anarazel wrote:
| What do you define as "bad"?
|
| Arguably several.
|
| I've seen cases of overflow checks that were implemented
| assuming signed overflow wraps (which all relevant platforms
| implement!) getting optimized away. Correct, given that
| "signed overflow is UB" and thus can be assumed not to
| happen. Problematic given how widespread such checks are, and
| given that there's no easy portable alternative.
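|
| A portable check has to test against the limits before
| adding, instead of inspecting the wrapped result afterwards.
| A sketch for int operands:
|
|       #include <limits.h>
|       /* relies on wrapping; the compiler may delete it */
|       int overflows_bad(int a, int b) { return a + b < a; }
|       /* tests before adding; no UB */
|       int overflows_ok(int a, int b) {
|           return (b > 0 && a > INT_MAX - b) ||
|                  (b < 0 && a < INT_MIN - b);
|       }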
|
| Entire checks getting optimized away because of a presumed
| strict aliasing violation, IIRC between structures with a
| compatible layout. Pretty code? No. UB? Yes. Reasonable? IDK.
| ben0x539 wrote:
| Sounds like you're interested specifically in
| https://blog.tchatzigiannakis.com/undefined-behavior-can-lit...
| bhk wrote:
| I've seen a real-world example something like this:
|       int a[32] = {...};
|       int flag = 1 << index;
|       if (index < ARRAYSIZE) {
|           a[index] = x;
|           return flag;
|       } else {
|           return 0;
|       }
|
| The "1 << index" operation is _undefined behavior_ (!) when
| index is greater than 32 (on a platform with 32-bit integers),
| even if the result is never used!
|
| The compiler inferred that index must always be less than 32,
| which allowed it to optimize out the array bounds check, which
| turns the code into a write-anywhere gadget.
|
| Note that if the standard had not declared "n << 32" to be all-
| bets-are-off UB, but instead had said something like, "it
| results in some implementation-specific value, or maybe traps"
| -- as a rational person would presume -- then this would not
| turn into a security problem.
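|
| The defensive rewrite is simple enough; a sketch, using an
| unsigned shift so that index == 31 is also defined:
|
|       if (index < ARRAYSIZE) {
|           a[index] = x;
|           return 1u << index;   /* shift only on the guarded path */
|       }
|       return 0;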
| msbarnett wrote:
| > Note that if the standard had not declared "n << 32" to be
| all-bets-are-off UB, but instead had said something like, "it
| results in some implementation-specific value, or maybe
| traps" -- as a rational person would presume -- then this
| would not turn into a security problem.
|
| But also note that a lot of existing code doing bitshift-and-
| index inside a hot loop that never went out of bounds would
| now _get slower_ if it started having to run bounds checks it
| had previously elided in an optimization pass.
|
| Let's not pretend that "it results in some implementation-
| specific value, or maybe traps" is a clear win with no
| downsides that Standards Authors and Compiler Engineers are
| ignoring out of some kind of malice - there are very real
| performance tradeoffs here, and a new version of the standard
| that makes a lot of existing real-world code slower isn't
| going to be a popular one with many people.
| bhk wrote:
| It isn't clear to me precisely what example you have in
| mind.
|
| If you are saying that deleting array bounds checks might
| have performance benefits that outweigh the security
| concerns, then I disagree.
|
| If you are saying that the compiler would have to _insert_
| bounds checks, I don't see how you arrive at that.
|
| I have seen claims that gratuitous UB is important for
| enabling meaningful optimizations, but in every such case
| the examples did not hold up to scrutiny. In the end, the
| same optimization remains possible without the gratuitous
| UB, although it might involve a little more work on the
| part of the compiler engineer.
|
| Regarding "malice": "Never attribute to malice..."
| bhk wrote:
| For some reason I couldn't 'Reply' to compiler-guy's
| reply directly, so I'll try here.
|
| I'm familiar with the Chris Lattner article. Most of it
| (especially the second installment) shows bad outcomes
| from UB optimization. When it comes to signed integer
| overflow UB, I see two examples where performance is
| cited as a motivation.
|
| One mentions unrolling, and gives an example similar to
| one elsewhere in this thread:
| https://news.ycombinator.com/item?id=27223870 In my reply
| to that I explain how unrolling is not actually enabled
| by UB.
|
| The other instance of integer overflow UB in the Lattner
| article is optimizing X*2/2 to X. That's perhaps a
| stronger case, but I haven't seen any numbers on the
| real-world implications of this particular optimization.
| steveklabnik wrote:
| > For some reason I couldn't 'Reply' to compiler-guy's
| reply directly, so I'll try here.
|
| Hacker News has a timeout where if you try to reply to
| someone too quickly, it will hide the reply button. This
| timeout increases the deeper the comment tree gets.
| msbarnett wrote:
| > If you are saying that deleting array bounds checks
| might have performance benefits that outweigh the
| security concerns, then I disagree.
|
| I'm saying that there is existing code in this world in
| which some variation on
|
|       /* insanely hot loop where ARRAYSIZE > 32 */
|       while (true) {
|           ...
|           int x = 1 << index;
|           if (index < ARRAYSIZE) {
|               a[index] = x;
|           } else {
|               a[index] = 0;
|           }
|           ...
|       }
|
| exists that's currently compiling down to just "a[index]
| = 1 << index", with everything working fine.
|
| I'm saying that the authors and their customers are
| unlikely to be excited when your new compiler release
| stops assuming that index is < 32 (which it always was in
| practice) and their program gets slower because there's
| now extra tests in this part of the hot loop, which is
| also consuming more icache, evicting other important bits
| of the loop. "There's some work-around to win that
| performance back, given enough effort by the compiler
| author to give you some means to tell it what it had
| previously assumed" isn't likely to sell people on your
| patch, particularly if they'd have to make many such
| annotations. "They could just remove the tests if they
| know that index < 32" in this synthetic example, yes, but
| there are cases when this is less obvious but nonetheless
| true. And compiler updates that force you to go delete
| working code, work out un-obvious deductions the compiler
| had previously made, and re-validate just to regain the
| status quo still aren't going to make anybody happy.
|
| The point, broadly: People care a lot about performance.
| These UB discussions in which people blithely assert that
| compilers "should" do XYZ conservative assumption while
| eliding any mention of the real-world performance impact
| the changes they want would have on existing code are,
| frankly, masturbatory.
|
| Compiler engineers have to care when PostgreSQL and a
| dozen other critical programs get 4x slower because they
| stopped assuming that "1 << index" wouldn't happen with
| index >= 32, or that loop bounds won't overflow. Like all
| software engineering, decision making here has to be
| driven by balancing tradeoffs, not by insisting that one
| treatment of the spec is obviously "the best approach"
| while ignoring any inconvenient consequences that change
| would have vs the status quo.
| [deleted]
| compiler-guy wrote:
| There are a half-dozen examples on this very thread and
| in linked bug reports, with detailed explanations by
| professional compiler writers.
|
| If you think they don't hold up to scrutiny, then you
| should get to work implementing these things, because you
| are likely a better compiler writer than most others in
| the world, including Chris Lattner of LLVM fame, who
| provides many examples here.
|
| https://blog.llvm.org/2011/05/what-every-c-programmer-
| should...
| pklausler wrote:
| I think the compiler that you were using is broken. One can't
| infer "index < 32" unless on a code path to a _use_ of
| "flag", and that inference can't override the predicates that
| dominate that use.
| bonzini wrote:
| No, the initializer already counts as a use of the
| undefined expression.
| pklausler wrote:
| You're right. Thanks!
| steveklabnik wrote:
| A fun historical example: https://feross.org/gcc-ownage/
|
| But yes, generally, UB manifests as "the compiler is allowed to
| assume this doesn't happen" and the bad stuff is a consequence
| of doing things following those assumptions, not the compiler
| going "oops well lol UB time to screw with this person."
| SloopJon wrote:
| The types of surprises that I've seen have to do with
| inferences that the compiler draws: this program would be
| undefined if _foo_ was negative, therefore _foo_ is not
| negative, therefore I can optimize away this condition.
| wallstprog wrote:
| True story -- someone (not me) decided to initialize a C++
| virtual class by defining a default ctor which does a
| "memset(this, 0, sizeof(*this))", then uses placement new to re-
| create the vptr.
|
| Newer compilers complained more and more until gcc 8 which just
| silently ignored this awful hack -- no diagnostic, no error, just
| crickets.
|
| Strictly speaking silently ignoring this atrocity is allowed by
| the standard, but it sure took a while to figure out. So be
| careful with code that "just works" even if it shouldn't.
| zokier wrote:
| Reading error or not, the ship has sailed long ago. As the
| article notes, C99 formalized the current UB interpretation. What
| C89 or K&R said is more just historical curiosity than of any
| real relevance today. I guess you could construct an argument
| that gcc should disable some optimizations when invoked with
| -std=c89, but I doubt anyone really cares at this point enough to
| justify the maintenance burden.
|
| C is a minefield today, and that is the reality we must live
| with. You can't turn back time 25 years and change what has
| happened.
| anarazel wrote:
| > C is a minefield today, and that is the reality we must live
| with. You can't turn back time 25 years and change what has
| happened.
|
| I think it's doable to make some of the UB issues considerably
| less painful. E.g. define facilities to check for overflow
| safely, add facilities for explicit wrapping operations.
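|
| GCC and clang already ship builtins along those lines; a
| sketch of a checked add using them:
|
|       int checked_add(int a, int b, int *sum) {
|           /* returns nonzero if a + b overflowed */
|           return __builtin_add_overflow(a, b, sum);
|       }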
| Animats wrote:
| Most of this discussion revolves around integer overflow.
|
| Part of the problem is that most computer hardware now uses
| two's-complement arithmetic. Programmers think of that as part of
| the language. It's not, for C. C has run, in the past, on
|
| - 36 bit ones complement machines (DEC and UNIVAC)
|
| - Machines with 7-bit "char" (DEC)
|
| - Machines with 9-bit "char" (UNIVAC, DEC)
|
| - Machines where integer overflow yields a promotion to float
| (Burroughs)
|
| - Many GPUs provide saturating arithmetic, where INT_MAX + 1 ==
| INT_MAX.
|
| Go, Java and Rust have explicit byte-oriented models with defined
| overflow semantics, but C does not. C has undefined overflow
| semantics.
|
| Many years ago, around the time Ada was being defined, I wrote a
| note titled "Type Integer Considered Harmful". I was pushing the
| idea that integer variables should all have explicit range
| bounds, not type names. As in Ada, overflow would be checked.
| Intermediate variables in expressions would have bounds such that
| the intermediate value could not overflow without the final
| result overflowing and being caught. Intermediate values often
| have to be bigger than the operands for this to work.
|
| This never caught on, partly because long arithmetic hardware was
| rare back then, but it illustrates the problem. Numeric typing in
| programming languages addresses a numeric problem with a
| linguistic solution. Hence, trouble. Bounds are the right answer
| numerically, but force people to think too hard about the limits
| of their values.
| goalieca wrote:
| If C is just a portable assembler then what if the assembly
| itself has undefined behaviour? :)
| jart wrote:
| I seem to recall that Intel and AMD CPUs will behave in strange
| and unusual ways, particularly when it comes to things like
| bitshift op flags, if you shift by out of range values, or by 0
| or 1. So I guess undefined behaviors in C are somewhat
| consistent with CPUs. But as other people mentioned Intel is
| much more forgiving than X3J11. If you ever wanted to find all
| the dirty corner cases that exist between ANSI and hardware, I
| swear, try writing C functions that emulate the hardware, and
| then fuzz the two in lockstep. It's harder than you'd think.
| [don't click here if you intend to try that: https://github.com
| /jart/cosmopolitan/blob/master/tool/build/...]
| mhh__ wrote:
| It's not though is it
| kmeisthax wrote:
| This exists, but the effect of undefined behavior in CPU
| architectures is a little bit more forgiving than the
| interpretation of UB in C to mean "literally the entire program
| has no meaning". Instead, usually the program will execute
| correctly _up to_ the invalid instruction, and then something
| happens, and then the CPU will continue executing from that
| state. It's actually fairly difficult to build an instruction
| with undefined behavior that contaminates unrelated parts of
| the program.
|
| Though it HAS happened: notably, brucedawson explains here [1]
| that the Xbox 360 has an instruction so badly thought out that
| merely having it in an executable page is enough to make your
| program otherwise meaningless due to speculative execution.
|
| [1] https://randomascii.wordpress.com/2018/01/07/finding-a-
| cpu-d...
| dooglius wrote:
| There actually is a fair amount of truly undefined behavior
| for CPUs, but it's always at system/kernel mode rather than
| userspace for security reasons. You can search an ARM ISA for
| "UNPREDICTABLE" to see examples.
| masklinn wrote:
| > This exists, but the effect of undefined behavior in CPU
| architectures is a little bit more forgiving than the
| interpretation of UB in C to mean "literally the entire
| program has no meaning".
|
| That is not quite what the interpretation of UB in C is,
| AFAIK. UB in C is generally interpreted as meaning that any
| _path_ which would trigger UB is invalid: if execution will
| be invalid once the UB is reached, and at the instruction
| before it we already know for sure the UB will be hit, we
| can say that we're already in UB there.
|
| Whole-program invalidity can occur when the compiler manages
| to infer that no execution path is UB-free, in which case yes
| the program is meaningless. More generally, programs will go
| off the rails as far ahead of the UB as the compiler managed
| to determine that the UB would unfailingly be reached.
|
| And it's usually because the compiler works backwards: if a
| path would trigger an UB, that path can not be legally taken,
| therefore it can be deleted. That's why e.g. `if (a > a + 1)`
| gets deleted, that expression makes sense if you assume
| signed integers can overflow, but the compiler assumes signed
| integers _can't_ overflow, therefore this expression can
| never be true, therefore it's dead code.
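|
| Concretely, a minimal sketch of what the optimizer sees:
|
|       int f(int a) {
|           /* signed overflow is UB, so the compiler may assume
|              a + 1 > a always holds and fold this to return 0 */
|           return a > a + 1;
|       }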
|
| This is important, because many such UBs get generated from
| macro expansion and optimisations (mainly inlining), so the
| assumption of UB-impossibility (and thus dead code) enables
| not just specific optimisations but a fair amount of DCE,
| which reduces function size, which triggers further inlining,
| and thus the optimisations build upon one another.
| coliveira wrote:
| The situation is different, because a CPU is by definition an
| interpreter. It doesn't perform code transformation, at least
| not at the higher level a compiler does. The CPU only looks
| at the next few instructions and performs them. A compiler,
| however, is responsible for taking a large compilation unit
| and producing a transformation that is efficient. That
| process requires reasoning about what code is invalid and
| operating on that.
| infogulch wrote:
| Wow! Interesting to see hints that Meltdown existed years
| before it was officially published.
| mhh__ wrote:
| Skimmed the article and didn't see a reference to it, you
| may be interested to know that our good friends and
| protectors at the NSA may have stumbled on to Meltdown-like
| issues in the mid 90s
|
| https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabil
| i...
| infogulch wrote:
| I see the NSA strategy for 'securing' the nation against
| technology threats in their 'unique' way was going strong
| back in 1995.
| coliveira wrote:
| They "secure" the country by exploiting vulnerabilities
| and leaving everyone else in the dark. They see the world
| as just a game between them and other foreign
| surveillance institutions.
| tlb wrote:
| It does: reading uninitialized memory, simultaneous writes from
| multiple threads, using memory below the stack pointer with
| interrupts enabled, ...
|
| Some of C's UB is due to this, some of it is due to the
| compiler.
| djoldman wrote:
| Specifying that anything can be done in the presence of UB is a
| poor specification. The word "specify" is pretty much the
| opposite of "anything."
|
| Perhaps compilers should delete all scopes with UB: much more UB
| would be purged from code as a result (programmers would be
| forced to enable compiler errors on UB).
| tom_mellior wrote:
| I'm not convinced. The argument seems to hinge to a very large
| extent on the sentence:
|
| > Permissible undefined behavior ranges from A, to B, to C.
|
| The observation that "Permissible" has a specific meaning is
| important and interesting. But what about "ranges from ... to
| ..."? The author reads this as "Permissible undefined behavior is
| either A or B or C.", but that seems like a stretch to me.
| (Unless ISO defines "ranges" to mean just this.)
|
| Also, in the above, A = "ignoring the situation completely with
| unpredictable results". Both "ignoring" and "unpredictable" do a
| lot of heavy lifting here. In the signed integer overflow case
| discussed in the linked GCC bug report
| (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475), one could
| very well argue that "ignoring" a case where signed arithmetic
| overflows is exactly what GCC is doing. It "ignores" the possibility
| that a + 100 might overflow, hence obviously a + 100 > a is true,
| leading to results that the reporter considers "unpredictable".
| Somehow the author seems to think that this is not what the
| wording intended, but they also fail to explain what they think
| should happen here.
|
| Sounds to me like the wording has always been very vague.
|
| Also:
|
| > Returning a pointer to indeterminate value data, surely a
| "use", is not undefined behavior because the standard mandates
| that malloc will do that.
|
| Yes, malloc will do that, but using a _pointer to_ indeterminate
| data is not the same as using _the indeterminate data itself_.
| The author is doing themselves a disservice by misreading this.
| hedora wrote:
| The original intent was for signed overflow to be architecture-
| specific (not everything was 2s complement back then).
|
| On x86, the correct behavior was previously (and obviously)
| "perform a 2s complement wraparound".
|
| To "ignore" the situation used to mean "delegate to a lower
| level". It now means "silently generate code that violates the
| semantics of the target microarchitecture".
|
| As the article argues, I think this has gone too far. For
| example, people are starting to claim that asm blocks are
| undefined behavior. They're clearly implementation specific (so
| undefined by the spec), but also well defined by each
| implementation.
|
| In current readings of the spec, compilers are free to ignore
| them completely. Doing so would break all known operating
| systems, and many user space programs, so they have not managed
| to do so yet.
|
| Edit: for signed overflow, other architecture-specific behavior
| (such as optionally trapping floating point exceptions) would
| also have been permissible, assuming the architecture supported
| it.
| jcranmer wrote:
| > The original intent was for signed overflow to be
| architecture-specific (not everything was 2s complement back
| then).
|
| The term for that is implementation-defined, not undefined.
| If it were to be architecture-specific, back in C89, they
| would have used the term implementation-defined, as they do
| for things like the size of pointers.
| anarazel wrote:
| There's "implementation defined" for the concept you
| describe.
| Jiro wrote:
| >Somehow the author seems to think that this is not what the
| wording intended, but they also fail to explain what they think
| should happen here.
|
| What "should" happen is that the implementation-defined
| behavior in the assert matches the implementation-defined
| behavior in the calculation. That is, assert(a+100>a) should
| produce the same result as:
|
|       int x = a + 100;
|       int y = a;
|       if (x > y)
|           ;
|       else
|           printf("assertion failed\n");
| richardwhiuk wrote:
| That behaviour is no less undefined in the process of
| overflow.
| Jiro wrote:
| Okay, then let's rephrase without a code example: The
| implementation-defined behavior in the assert should
| produce "true" if and only if the number printed for x+100
| (also using implementation-defined behavior) is always
| larger than the number printed for x.
| aw1621107 wrote:
| A related post with a similar claim: "How One Word Broke C"
| (https://news.quelsolaar.com/2020/03/16/how-one-word-broke-c/).
| HN comments at https://news.ycombinator.com/item?id=22589657
| anarazel wrote:
| First: I do dislike how hard it is to avoid some UB / how
| impractical some of the rules are.
|
| But I also think a lot of discussions of this topic caricaturize
| compiler writers to a ridiculous degree. Almost describing them
| as writing optimization passes looking for UB so they can over-
| optimize something, while cackling loudly in glee about all the
| programs they can break.
|
| The set of people doing so overlaps to a significant degree
| with the set of people complaining that the compiler doesn't
| optimize their code sufficiently.
|
| Lots of compiler optimizations need to know the ranges of values,
| hence logic to infer ranges. One of the sources for that is
| "can't happen" style logic - which nearly all of the time are
| things the code author would agree with if they thought long and
| hard. Not just about the code as written, but also about what
| the code looks like after inlining (across TUs with LTO).
| rectang wrote:
| > _But I also think a lot of discussions of this topic
| caricaturize compiler writers to a ridiculous degree._
|
| The inflamed backlash should tell you just how damaging it is
| to impose silent failure on meticulously written, previously
| fine programs.
| vinkelhake wrote:
| If the program was previously "fine" on version x.y.z of some
| compiler, then it is most likely still fine on it. That's the
| target that the program was written for.
|
| There's some disagreement on whether you can call a program
| "fine" that breaks after switching to a newer version, or a
| different compiler.
|
| I see a lot of programmers out there that unfortunately use
| the behavior of their code on whatever compiler they're using
| at the moment as a proxy for what the language actually
| guarantees.
| jcranmer wrote:
| > meticulously written, previously fine programs
|
| With relatively few exceptions, if your program hits
| undefined behavior, then your program was already doing
| something pretty wrong to begin with. Signed overflow is a
| poignant example: in how many contexts is INT_MAX + 1
| overflowing to INT_MIN actually sane semantics? Unless you're
| immediately attempting to check the result to see if it
| overflowed (which is extremely rare in the code I see), this
| overflow is almost certain to be unexpected, and a program
| which would have permitted this was not "previously fine" nor
| "meticulously written."
|
| I feel compelled right now to point out that software
| development is a field where it is routine to tell users that
| it's their fault for expecting our products to work (that's
| what the big all-caps block of every software license and
| EULA says, translated into simple English).
| rocqua wrote:
| I recall a bug-report discussion that I sadly have never
| been able to find. It contains a pretty bad side-effect of
| this.
|
| It had code like:
|
|       int *p;
|       // lots of code
|       if (p == NULL)
|           return 1;
|       // use p
|
| Then a later refactor wrongly added a single line before
| the if statement:
|
|       int *p;
|       // lots of code
|       int a = *p;
|       if (p == NULL)
|           return 1;
|       // use p
|
| This meant the null check was optimized away: dereferencing a
| null pointer is undefined behavior, so the if-statement can
| be assumed to be always false. This then led to actual errors
| (perhaps even an exploit, I do not recall) arising from the
| removed null-check.
|
| I think in general the "sanity check" cases are the worst.
| It is hard to determine whether an expression causes
| undefined behavior if you cannot try and evaluate it.
| Perhaps a compiler intrinsic that checks (at runtime)
| whether an expression causes undefined behavior could be
| useful here. Though I can imagine such an intrinsic being
| essentially impossible to implement.
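|
| The closest existing thing is probably the sanitizers; a
| sketch using the clang/gcc flag spelling:
|
|       /* cc -O2 -fsanitize=signed-integer-overflow */
|       int x = a + 100;   /* reported at runtime if it overflows */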
| tom_mellior wrote:
| This was in the Linux kernel, which is compiled with
| special kernel flags which make dereferencing null
| pointers legal. In the context of that code,
| dereferencing a pointer and later checking it for null
| was absolutely meaningful. Optimizing away the later null
| check was _a compiler bug that was acknowledged and fixed_.
| The compiler here didn't respect the semantics it had
| promised to kernel code. This was entirely
| uncontroversial; it was _not_ a case of "unwanted
| optimization based on undefined behavior", it was a case
| of "compiler bug breaking well-defined code". Again, all
| this in a _kernel_ context.
|
| In _user code_ GCC will still happily remove the null
| check because in user code _this is an actual bug in the
| user's code_.
| nwallin wrote:
| https://lwn.net/Articles/342330/
|
| https://lwn.net/Articles/342420/
| rectang wrote:
| > _extremely rare in the code I see_
|
| Well, I learned of the change in compiler behavior some
| years back because I had written loop code with a sanity
| check which depended on signed integer overflow wrapping,
| along with a test case to prove that the sanity check
| worked, and that test case started failing:
|       not ok 2 - catch overflow in token position calculation
|       Failed test 'catch overflow in token position calculation'
|       at t/152-inversion.t line 70.
|
| To the extent I can be, I'm done with C. I leave it to
| people who think that silently optimizing away previously
| functional sanity checks is an acceptable engineering
| tradeoff, and who disparage those of us who have been
| bitten.
| anarazel wrote:
| > in how many contexts is INT_MAX + 1 overflowing to
| INT_MIN actually sane semantics? Unless you're immediately
| attempting to check the result to see if it overflowed
| (which is extremely rare in the code I see)
|
| It's not that rare - I know that postgres got bit by that
| particular issue, and several other projects as well.
| Particularly painful because it obviously can cause security
| issues.
| lixtra wrote:
| > With relatively few exceptions, if your program hits
| undefined behavior, then your program was already doing
| something pretty wrong to begin with.
|
| The author claims: "We have the absurd situation that C,
| specifically constructed to write the UNIX kernel, cannot
| be used to write operating systems. In fact, Linux and
| other operating systems are written in an unstable dialect
| of C that is produced by using a number of special flags
| that turn off compiler transformations based on undefined
| behavior (with no guarantees about future "optimizations").
| The Postgres database also needs some of these flags as
| does the libsodium encryption library and even the machine
| learning tensor-flow package."
|
| So basically the programmers of the most used C programs
| consider the C standard so broken that they force the
| compiler to deviate from it. (Or they are not able to do
| things right?)
| mjw1007 wrote:
| I agree.
|
| I don't have much sympathy for people who were doing things
| like writing multithreaded programs in the days before C
| documented its memory model and then becoming unhappy because
| new optimisations that legitimately help single-threaded code
| broke their programs.
|
| In my experience C compiler maintainers have generally been
| open to the idea of offering guarantees beyond a narrow reading
| of the standard, but they want to be able to clearly state what
| it is that they're guaranteeing. "Keep old programs running"
| isn't enough.
|
| I think the "Prevailing interpretation" that Yodaiken complains
| about is coming from the same place as suspicion of the "be
| lenient in what you accept" IETF principle: that sort of thing
| doesn't lead to robustness in the long run.
|
| The way forward at this point is surely to define more things
| that are currently undefined (whether in the standard or by
| widely-implemented extensions).
| vyodaiken wrote:
| The current situation is not good for compiler writers either.
| But nobody has ever shown that either C programmers want to
| sacrifice safety for "optimizations", or that these UB
| optimizations actually improve performance of anything.
| ynik wrote:
| What do you mean by "these UB optimizations"? C is a low-
| level language; it's basically impossible for a compiler to
| reason about the code unless it makes certain assumptions. It
| needs to assume the code is not self-modifying to do pretty
| much any code-generation more intelligent than a macro
| assembler. It needs to assume the code isn't messing with the
| stack frames/return addresses in order to inline functions.
| It needs to assume the code isn't using an out-of-bounds
| pointer to access a neighboring local variable, so that it
| can move local variables into registers. "gcc -O0" is a good
| approximation for the performance you get if the compiler
| isn't allowed to optimize based on UB.
|
| Yes, that means C optimized without UB assumptions is slower
| than Java. Optimizations need some form of reasoning about
| what's happening. For Java it's optimizing based on
| guarantees provided by the language (there's no raw pointers
| that could mess with the things listed above). But C doesn't
| provide any hard guarantees, so instead it needs to blindly
| assume that the code will behave sanely.
|
| Also note that for many of the more manageable sources of UB,
| most compilers provide a choice (-fwrapv, -fno-strict-
| aliasing, ...). Yet few projects use these options, even when
| they use other gcc/clang-specific features. Doesn't that
| indicate that C programmers indeed want to sacrifice safety
| for optimizations?
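|
| For example, a sketch of what -fwrapv buys you:
|
|       /* compiled with: cc -O2 -fwrapv */
|       int wrapped(int a) {
|           /* with -fwrapv, signed overflow wraps and this test
|              is evaluated; without it, it may be folded to 0 */
|           return a + 1 < a;
|       }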
| gpderetta wrote:
| Exactly.
|
| For example there were programs 30-40 years ago that relied
| on exact stack layouts. These days everybody would agree
| they are completely broken.
|
| The issue of course is that it is extremely hard to write
| programs that have no UB. It would be nice for compilers to
| have an option to automatically introduce assertions
| whenever they rely on some UB-derived axiom, basically as a
| sort of lightweight sanitizer.
|
| In fact if we had sanitizers 30-40 years ago probably
| things would be better today.
| MauranKilom wrote:
| > It would be nice for compilers to have an option to
| automatically introduce assertions whenever they rely on
| some UB-derived axiom
|
| Modifying a value from a different thread without
| synchronization is UB. The compiler assumes this does not
| happen in order to e.g. move things into registers. Could
| you elaborate how (and how often) you would like to have
| this kind of UB-derived axiom ("this value remains the
| same from here to there") checked with assertions?
| gpderetta wrote:
| Obviously you wouldn't be able to catch many, or even
| most cases. Use-after-free is another case that would be
| very expensive to detect.
| vyodaiken wrote:
| That's a good example, because nobody would complain if
| stack layouts changed and those programs failed. But if
| the compiler chooses to "optimize away" checks on stack
| layout, that's a different thing altogether. Also note
| that if you use pthreads or Linux clone or you are
| writing an operating system you may need to rely on exact
| stack layouts even today.
| vyodaiken wrote:
| For your last point, the extent of UB-driven changes to
| semantics is still not widely known in the programmer
| community. Programmers don't read the standard - they read
| K&R, and K&R is right now describing a different language.
| We've had 15 years of programmers repeatedly filing bug
| reports to be told that the expected, tested, relied on,
| behavior was ephemeral. Only very sophisticated projects
| figure out about UB.
|
| Of course compilers have to make assumptions. The debate is
| (a) over what assumptions it is proper to make and (b) what
| are the permissible behaviors. The false dichotomy: either
| do without any optimizations at all or accept whatever UB
| gives you, is not a useful approach.
| MauranKilom wrote:
| So what optimizations do you mean with "these UB
| optimizations" then? And would it change your mind to see
| a benchmark proving the usefulness of that particular UB
| optimization?
| vyodaiken wrote:
| e.g. assuming UB can't happen: deleting overflow or null
| pointer checks, deleting comparisons between pointers
| that are assumed to point at different objects, ...
| rndgermandude wrote:
| To do UB "optimizations", the compiler first needs to figure
| out that there is an UB it can "optimize" anyway. At this
| point instead of "optimizing" it could, and in my humble
| opinion absolutely should, blow up the compilation by
| generating an UB error, so people can fix their stuff.
|
| What about backwards compatibility in regards to a new
| compiler version deciding to issue errors on UB now? You
| don't have any guarantees about what happens with UB right
| now, so if you upgrade to a new compiler version that
| generates errors instead of "optimizations", everything would
| still be as before: no guarantees. And it's frankly a lot
| better to blow up the compilation with errors than to have
| the compiler accept the UB code and roll a dice on how the
| final binary will behave later. You can either fix the code
| to make it compile again, or use an older "known good"
| version of the compiler that you previously used as a stopgap
| measure.
|
| I fail to see any reason whatsoever why compilers are still
| doing all kinds of stupid stuff with UB instead of doing the
| right thing and issuing errors when they encounter UB.
|
| I also fail to see why the C language designers still insist
| on keeping so much of the legacy shit around.
| gpderetta wrote:
| > To do UB "optimizations", the compiler first needs to
| figure out that there is an UB it can "optimize" anyway.
|
| That's not how compilers work. In fact in the general case
| it is impossible to figure out at compile time that "there
| is an UB".
|
| The compiler instead assumes as an axiom that no UB can
| ever happen and uses that axiom to prove properties of the
| code.
|
| These days if you want to catch UB, compile with
| -fsanitize=undefined. The program will then trap
| if UB is actually detected at runtime.
| nullc wrote:
| > To do UB "optimizations", the compiler first needs to
| figure out that there is an UB it can "optimize" anyway.
|
| The compiler assumes UB will never happen and it makes
| transformations that will be valid if there happens to be
| no UB. This doesn't require any explicit detection of UB,
| and in some cases UB or not is simply undecidable at
| compile time (as in _no_ compiler could detect it without
| incorrect results).
|
| Without these assumptions the resulting compiled code would
| be much slower, though some optimizations have different
| danger vs speed impact and there certainly can be a case
| that there are some optimizations that should be eschewed
| because they're a poor trade-off.
|
| There are many cases where current compilers will warn you
| when you've done something that is UB. It's probably not
| the case that they warn for every such detectable case and
| if so it would be reasonable to ask them to warn about more
| of them.
|
| I think your irritation is just based on a misunderstanding
| of the situation.
|
| Compiler authors are C(++) programmers too, they also don't
| like footguns. They're not trying to screw anyone over.
| They don't waste their time adding optimizations that don't
| make real performance improvements just to trip up invalid
| code.
| dooglius wrote:
| The caricatures are somewhat accurate though, optimizations
| that look at UB adversarially are never anywhere close to
| justified.
|
| > The set of people doing so overlaps with the set of people
| complaining that the compiler doesn't optimize their code
| sufficiently to a significant degree.
|
| There's no contradiction here, and the overlap is generally
| just "people who care". The optimizations that are not safe
| shouldn't exist, and the optimizations that are safe should be
| good.
|
| > nearly all of the time are things the code author would agree
| with if they thought long and hard
|
| I highly doubt this is the case for even one situation.
| MauranKilom wrote:
| Name one such optimization. We'll be happy to refute your
| points for that one.
| pornel wrote:
| > optimizations that look at UB adversarially
|
| The whole point is that there is no adversarial thing
| like "we're going to find the UB right there, won't even
| print a warning about it, and mess up your crappy code,
| haha!"
|
| Optimizers aren't reasoning about code like people do (start
| to finish with high-level understanding of the whole
| function), but rather as series of mostly dumb, mostly
| isolated small passes, each pass changing one little thing
| about the code.
|
| It just happens that one pass marks certain instructions as
| "can't happen" (like the spec says), then another pass
| simplifies expressions, and then another pass that removes
| code that doesn't do anything, usually left over from the
| previous steps. They sometimes combine in an "adversarial"
| way, but individually each pass is justified and necessary.
|
| Compilers already have lots of different passes. Splitting
| optimizations into passes is a way to keep complexity closer
| to O(n) rather than O(n^2), but this architecture makes
| interactions between passes very delicate and difficult to
| coordinate, so it's difficult to instrument the data to avoid
| only cases of annoying UB without pessimizing cases that
| users want optimized.
| GuB-42 wrote:
| The C standard is not the Bible, it is not written by an almighty
| god. As respectable as K&R are, they are humans who wrote a standard
| for their own needs, based on the state of the art of that time.
| Sadly C is the work of mortals...
|
| Reading the standard trying to understand the way of god like
| religious scholars do is a pointless exercise. Modern compiler
| developers found that exploiting undefined behavior the way we do
| now leads to interesting optimizations, others found it reasonable
| so now it is the standard.
|
| I think the issue most people have now is that compilers use
| advanced solvers that are able to infer a lot from undefined
| behavior and other things, so UB is no longer just "it works or
| it crashes".
| ChrisSD wrote:
| I think the real problem stems from the mismatch between modern
| processors and the processors C was originally designed for.
|
| C programmers want their code to be fast. Vanilla C no longer
| gives them the tools to do that on a modern processor. Either the
| language needs to be extended or the compiler needs to get more
| creative in interpreting the existing language. The latter is the
| least disruptive and it doesn't stop judicious use of the former.
|
| So, in short, UB is what gives room for the compiler to be more
| creative without the programmer having to change their code. It
| wasn't a reading error, it was an opportunity the compiler devs
| enthusiastically embraced.
| bhk wrote:
| Even on modern processors, an ADD instruction does not corrupt
| memory. The C standard, in declaring that an integer overflow
| results in all-bets-are-off UB, is not enabling compilers to
| provide valuable optimizations.
| nullc wrote:
| It would be nice if that were true, but it's not. The ability
| to assume incrementing a counter won't overflow makes range
| analysis possible in many cases where it otherwise wouldn't
| be. Because of this you can perform vectorization because the
| compiler can assume it knows how many times the loop will
| run.
|
| The performance differences are not small.
|
| You can also tell some compilers to treat signed integer
| overflow as wrapping-- but people don't usually do this
| because the optimizations you lose are valuable.
| vyodaiken wrote:
| I bet the performance differences are minimal except
| perhaps on some contrived cases. The compiler should not
| assume things it doesn't know. But show me a benchmark.
| rectang wrote:
| When compiler writers get "creative" with C undefined
| behavior, programming C no longer produces predictable results.
|
| > least disruptive
|
| Like starting to optimize away loop checks that can "never
| happen" because signed integer overflow is UB, suddenly
| changing the behavior of programs that were fine for years?
|
| I wish I could just fence off this insanity by never starting
| another project in C. Unfortunately, C is ubiquitous in the
| ecosystem so all of us are stuck cleaning it up.
| msbarnett wrote:
| > Like starting to optimize away loop checks that can "never
| happen" because signed integer overflow is UB, suddenly
| changing the behavior of programs that were fine for years?
|
| Yeah. Not doing that on modern processors is actually quite
| disruptive.
|
| Here:
|
|       for (i = offset; i < (offset + 16); i++) {
|           arr[i] = i + 32;
|       }
|
| What C compilers currently do is, in line with the standard,
| ignore the case that offset + 16 might overflow. This makes
| this eligible for loop unrolling, and depending on the
| specifics of the math inside the loop, the compiler can do a
| lot to pre-calculate things because it knows this is
| happening 16 times exactly.
|
| If, instead, we force compilers to think about the fact that
| offset + 16 _could_ have some implementation-defined meaning
| like wrapping, then all bets are off & we have to throw a
| bunch of optimization opportunities out the window.
|
| Lots and lots of hot & tight loops which are currently able
| to be compiled into something suitable for the preferences of
| modern CPUs instead have to be naively compiled because
| of the need to hold back due to the _possibility_ of
| something that largely wasn't happening, happening.
|
| Most people write most loops this way, never expecting or
| intending to overflow anything. Most loops are benefitting
| from this optimization. A lot of code would get slower, and
| programmers would have to do a lot more fragile hand-
| unrolling of operations to get that performance back. And
| they'd need to update that more often, as whatever the
| optimal "stride" of unrolling changes with the evolution of
| CPU pipelines.
|
| It's slower code and more work more often for more people, to
| satisfy a minority use-case that should really just have its
| own separate "please wrap this" construct.
| rectang wrote:
| > to satisfy a minority use-case
|
| Every single C program is potentially in that "minority".
| Nobody can tell when the compiler writers are going to
| change up behavior on you.
|
| It doesn't matter how carefully the codebase has been
| written, whether you've had `-Wall -Wextra` enabled. What
| was fine at one time is no longer fine today. Any C program
| may suddenly start exhibiting misbehavior from innocuous to
| catastrophic to horrendously insecure.
|
| It's psycho, maddening, irresponsible. And the only way to
| deal with it is to purge C programs compiled by these
| psychotic compilers from our systems.
| msbarnett wrote:
| > Every single C program is potentially in that
| "minority". Nobody can tell when the compiler writers are
| going to change up behavior on you.
|
| This is ridiculously hyperbolic, and bringing unthinking
| emotional responses like "psycho" and "irresponsible"
| only obscures the fact that there are very serious
| engineering tradeoffs involved in trying to balance "not
| surprising people whose code contains an assumption that
| some case is going to behave a certain way when by-the-
| standard-as-written that case can be ignored" and "not
| making everything with a hot loop 8x slower because we
| can't assume anything about loop bounds any more", and
| that compilers that do the latter are unlikely to prove
| popular with a lot of people either.
| dooglius wrote:
| > minority use-case
|
| The amount of code that looks like this in a big enough hot
| loop to make a difference is negligible. Can you provide
| even one real-world example where this makes a difference,
| i.e. not some microbenchmark? The amount of code that can
| break as a result of signed overflows being UB, on the
| other hand, is huge.
|
| > programmers would have to do a lot more fragile hand-
| unrolling of operations to get that performance back
|
| Much easier ways to do this, e.g. by using an assert
| wrapper around __builtin_unreachable. Alternatively, an
| unsafe_int_t could be defined that gives the optimize-able
| behavior. The important thing is to make it opt-in;
| sensible defaults matter.
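|
| A sketch of that opt-in idiom with the gcc/clang builtin:
|
|       #define ASSUME(c) \
|           do { if (!(c)) __builtin_unreachable(); } while (0)
|
|       ASSUME(index < 32);      /* hand the range fact to the optimizer */
|       a[index] = 1 << index;   /* may now be unrolled/vectorized */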
| MauranKilom wrote:
| > The amount of code that can break as a result of signed
| overflows being UB, on the other hand, is huge
|
| C++ recently decided to _not_ make signed overflow
| defined, despite having the explicit opportunity to do
| so. Here is the reasoning:
|
| http://www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2018/p090...
|
| > Performance concerns, whereby defining the behavior
| prevents optimizers from assuming that overflow never
| occurs;
|
| > Implementation leeway for tools such as sanitizers;
|
| > Data from Google suggesting that over 90% of all
| overflow is a bug, and defining wrapping behavior would
| not have solved the bug.
|
| Presumably, data from Google disproves your assertion
| that the amount of code that breaks due to signed
| overflow being UB is huge.
| msbarnett wrote:
| > Can you provide even one real-world example where this
| makes a difference, i.e. not some microbenchmark
|
| Sure. I don't even have to leave this thread to find one:
| https://news.ycombinator.com/item?id=27223954 reports a
| measurable speed impact to PostgreSQL when compiled with
| -fwrapv, which rules out the exact optimization in
| question.
|
| This shouldn't be surprising; loops are extremely common
| and superscalar processors benefit enormously from almost
| anything other than a naive translation of them.
|
| Here's -fwrapv cutting performance of a function in half
| in Cython vs the non-fwrapv compilation:
| https://stackoverflow.com/questions/46496295/poor-
| performanc...
| vyodaiken wrote:
| This is a common argument, but it is flat out wrong. If, as
| claimed, compilers have complete freedom of choice about
| how to handle UB, they have the choice to e.g. make this
| behavior depend on the processor architecture. Compiler
| developers are _choosing_ to use UB to make C semantics
| defective.
| adwn wrote:
| > _If, instead, we force compilers to think about the fact
| that offset + 16 could have some implementation-defined
| meaning like wrapping, then all bets are off & we have to
| throw a bunch of optimization opportunities out the
| window._
|
| Uh huh. If `i` is declared as `unsigned int` instead of
| `int`, then overflow _is_ defined and the compiler can't
| apply those optimizations. And yet the world doesn't end
| and the sun will still rise tomorrow...
| tom_mellior wrote:
| The world doesn't end, but in the "int" case you get nice
| vector code and in the "unsigned int" case you get much
| less nice scalar code:
| https://gcc.godbolt.org/z/cje6naYP4
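|
| The two variants side by side, roughly what that link shows:
|
|       void f_int(int *arr, int off) {        /* vectorizes */
|           for (int i = off; i < off + 16; i++) arr[i] = i + 32;
|       }
|       void f_uint(int *arr, unsigned off) {  /* scalar: i may wrap */
|           for (unsigned i = off; i < off + 16; i++) arr[i] = i + 32;
|       }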
| adwn wrote:
| Yes, that is true. The proper way for a compiler to
| handle this, would be to add a single overflow check
| before the loop, which branches to a scalar translation
| of the loop. Most realistic code will need a scalar
| version anyway, to deal with the prolog/epilog of the
| unrolled loop for iteration counts that aren't multiples
| of the unrolling factor.
|
| Surely you agree that treating unsigned overflow
| differently from signed does not make any sense
| semantically? Why is signed overflow UB, but unsigned
| wrapping, and not the other way around? The terms
| 'signed' and 'unsigned' denote the value range, not
| "operations on this type might overflow/will never
| overflow".
| rocqua wrote:
| To a mathematician, wrapping 2^n - 1 back to 0 is a lot
| more intuitive than wrapping 2^(n-1) - 1 to -2^(n-1).
| Mathematically the two systems are largely equivalent. They
| are equivalent when considering addition and multiplication:
| both implement arithmetic modulo 2^n.
|
| However, the canonical representatives of this system run
| from 0 to 2^n - 1. Hence, if you were going to make only one
| kind of integer overflow wrap, and not the other, C made the
| correct choice.
|
| That leaves out the question of whether the difference
| between the two cases is significant enough to have a
| difference in how overflow works.
| tom_mellior wrote:
| > The proper way for a compiler to handle this, would be
| to add a single overflow check before the loop, which
| branches to a scalar translation of the loop. Most
| realistic code will need a scalar version anyway, to deal
| with the prolog/epilog of the unrolled loop for iteration
| counts that aren't multiples of the unrolling factor.
|
| That's true, I agree that that would be a clever way to
| handle this particular case. It would still happily
| invoke undefined behavior if the indices don't match the
| array's length, of course. Many assumptions about the
| programmer knowing what they are doing go into the
| optimization of C code.
|
| > Surely you agree that treating unsigned overflow
| differently from signed does not make any sense
| semantically?
|
| Yes. Silently wrapping unsigned overflow is also very
| often semantically meaningless.
| msbarnett wrote:
| > And yet the world doesn't end and the sun will still
| rise tomorrow...
|
| No, you just get much slower, non-vectorized code because
| the compiler is forced to forgo an optimization if you
| use unsigned int as the loop bound (EDIT: tom_mellior's
| reply illustrates this extremely well:
| https://gcc.godbolt.org/z/cje6naYP4)
|
| Which is precisely the point: forcing a bunch of existing
| code with int loop bounds, which currently enjoys
| optimization, to take on the unsigned int semantics and
| get slower, is just going to piss off a different (and
| probably larger) set of people than the "compilers
| shouldn't assume that undefined behaviour can't happen"
| set of people.
|
| It's a tradeoff with some big downsides; this isn't the
| obvious win the anti-optimization crowd pretends it is.
| bhk wrote:
| Well-defined integer overflow would not preclude loop
| unrolling in this case. One simple alternative would be for
| the compiler to emit a guard, skipping unrolling in the
| case that (offset+16) overflows. This guard would be
| outside the unrolled loop. Furthermore, unsigned values are
| often used for indices (the unsigned-ness of size_t pushes
| programmers in that direction) and unsigned overflow _is_
| well-defined, so any compiler engineer implementing
| unrolling should be able to emit such a guard so that
| the optimization can be applied to loops with unsigned
| indices.
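|
| A sketch of such a guard:
|
|       if (offset <= INT_MAX - 16) {
|           /* unrolled/vectorized body: exactly 16 iterations */
|       } else {
|           /* rare scalar fallback for the would-overflow case */
|       }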
| [deleted]
| msbarnett wrote:
| > Well-defined integer overflow would not preclude loop
| unrolling in this case. One simple alternative would be
| for the compiler to emit a guard, skipping unrolling in
| the case that (offset+16) overflows.
|
| To what end?
|
|       for (i = offset; i < (offset + 16); i++) {
|           arr[i] = i + 32;
|       }
|
| like most loops in most programs isn't _designed_ to
| overflow. The program isn't any more correct for
| emitting two translations of the loop, one unrolled and
| one which is purely a bugged case anyways.
|
| Changing the way the UB manifests while altering the
| nature of the optimization hasn't actually fixed anything
| at all here. All this would seem to accomplish would be
| to increase pressure on the icache.
| tomcam wrote:
| > Vanilla C no longer gives them the tools to do that on a
| modern processor
|
| Can you elaborate on this point?
| I_Byte wrote:
| C was created during a time when instructions were executed
| linearly with no vectorization, memory was a flat space with
| no CPU caches, and there wasn't a branch predictor that may
| or may not execute the correct program branch in advance. The
| list goes on but the rest is beyond my scope.
|
| C was designed for a now obsolete computer architecture model
| and over the years this old model has essentially become an
| abstraction that sits between C and the CPU. As such, C
| programmers aren't really programming in their CPU's domain
| anymore and C, by default, lacks the commands necessary to
| effectively utilize these new developments. It is left up to
| the compiler to translate C code from the old architecture
| abstraction into efficient machine code for our new machines.
|
| For a more in depth look into this topic I recommend you
| check out (0).
|
| (0) - https://queue.acm.org/detail.cfm?id=3212479
| david2ndaccount wrote:
| _Compilers have become more powerful (opening up new ways to
| exploit undefined behavior) and the primary C compilers are free
| software with corporate sponsors, not programmer customers (or
| else perhaps Andrew Pinski would not have been so blithe about
| ignoring his customer Felix-gcc in the GCC bug report cited
| above)._
|
| This is the real problem. We have reached a situation where a
| small number of compilers dominate the space, yet do not
| charge the users and so do not treat them as customers. The C
| standard is a product of an era where you would pay for your
| tools and so would demand a refund from any compiler vendor that
| would treat undefined behavior in an absurd manner.
| [deleted]
| mhh__ wrote:
| Wouldn't the largest customers (by far) still be the companies
| funding development of the compilers already?
| msbarnett wrote:
| > The C standard is a product of an era where you would pay for
| your tools and so would demand a refund from any compiler
| vendor that would treat undefined behavior in an absurd manner.
|
| Since current compilers aren't out to do anything malicious
| with UB, but instead simply treat it as "assume this can't
| happen and proceed accordingly", it's not clear at all to me
| what you think paid compilers would do here instead: refuse to
| compile vast swaths of code that currently compiles? Or compile
| it very pessimistically, forgoing any optimization
| opportunities by instead assuming it can happen and inserting a
| bunch of runtime checks in order to catch it then?
|
| In either case, I doubt there's any real market for "pay money
| for this compiler and it either won't build your code, or it
| will run more slowly". I'm just old enough to remember the paid
| C compiler market and the thing that was driving everybody to
| pay for the latest and greatest upgrade was "how much better is
| it at optimization than before?"
| vyodaiken wrote:
| I see this argument a lot, but it's silly. I don't have to
| care whether gcc changed the settled semantics without notice
| or documentation out of sincere belief that they were helping
| or out of a desire to beat a benchmark that doesn't matter to
| me, or out of habit. It's not the intent of the compiler
| authors that matters, but their disregard of the needs of
| application programmers.
| Quekid5 wrote:
| What are they to do, exactly?
|
| Optimization is a very 'generic' process and application
| programmers want optimization. The only sensible thing to
| do is to assume UB cannot occur and optimize accordingly.
|
| What else is there?
|
| I can already predict that whatever you suggest will very
| shortly end up in Halting Problem territory _or_ will mean:
| No optimization. There are a _lot_ of UBs that (if defined)
| would require run-time checking to define. That wouldn't
| inhibit optimization per se, but it would ultimately mean
| slower execution.
| vyodaiken wrote:
| C programmers prefer control to "optimization". And if
| you assume UB cannot occur, you should not generate code
| that makes it happen. Radical UB is not required for
| optimization: in fact it appears to mostly do nothing
| positive. There is not a single paper or study showing
| significantly better performance for substantial C code
| that depends on assuming UB can't happen - just a bunch
| of hand waving.
| swiley wrote:
| Are you arguing that e.g. Turbo C and friends from the 80s were
| higher quality than modern C compilers?
| setr wrote:
| I believe he's arguing that they were more sane/pragmatic
| compilers -- they would be inherently less comfortable
| exploiting UB to do anything other than what people expected
| or were used to, because there is more real possibility of
| retribution (GCC could get away with making demons fly out
| of your nose and just lose marketshare [which it doesn't
| directly depend on anyway], whereas a commercial compiler
| would go out of business)
| msbarnett wrote:
| > I believe he's arguing that they were more sane/pragmatic
| compilers
|
| I can't imagine they have any actual experience with
| Borland or Symantec's C or C++ compilers, then. These
| things had notoriously shaky standards compliance and
| their own loopy implementations of certain things, along
| with legions of bugs - it's not hard to find older C++ libs
| with long conditional compilation workarounds for Borland's
| brain damage. Microsoft's C++ compiler was for years out of
| step with the standard in multiple ways.
|
| Part of the reason GCC ate these compilers' markets for
| lunch was the much more rigorous and reliable standards
| adherence, not just the lack of cost.
|
| This reads like nostalgia for an age that never was.
| [deleted]
| cperciva wrote:
| The author seems to be missing this essential text: "the
| implementor may augment the language by providing a definition of
| the officially undefined behavior."
|
| Making a system call is undefined behavior in the C standard, but
| it's not undefined behavior in clang-on-FreeBSD, because the
| implementors of clang on FreeBSD have defined what those system
| calls do.
|
| Ditto for "asm" (UD unless/until you're running on a compiler
| which defines what that does), all of the tricks which make
| "malloc" work, and all of his other examples of acceptable uses
| of code which the C standard does not define.
| bitwize wrote:
| The thing about UB is that it tends to happen when the C
| standard refuses to specify when a program segment is erroneous
| or valid. Some C environments treat memory as a large array of
| undifferentiated bytes or words, by design. Other C
| environments have tagged, bounds-checked regions of memory,
| again by design. (For example, the C compiler for the Lisp
| machine.) Usually, indirecting through a null pointer or
| walking off the end of an array are erroneous, but sometimes
| you _want_ to read from memory location 0, or scan through all
| of available memory. The C standard allows for both kinds of
| environments by stating that these behaviors are undefined,
| allowing the implementation to error out or do something
| sensible, depending on the environment.
|
| The idea that UB is carte blanche for implementations to do
| whatever is an unintended consequence of the vague language of
| the standard. Maybe a future C standard should use "safe" and
| "unsafe" instead of UB for some of these operations, and
| clarify that unsafe code will be erroneous in a safe
| environment and do something sensible but potentially dangerous
| in an unsafe environment, so you must really know what you're
| doing.
| [deleted]
| Rusky wrote:
| > The idea that UB is carte blanche for implementations to do
| whatever is an unintended consequence of the vague language
| of the standard.
|
| Whether or not this was originally intended, it's certainly
| become the way the standard is written and used today, so
| that's kind of beside the point.
|
| Further, this is not some new idea that arose from the C
| standard. It's a basic, core idea in both software
| engineering and computer science! You define some meaning for
| your input, which may or may not cover all possible inputs,
| so that you can go on to process it without considering
| inputs that don't make sense.
|
| Now, to be fair, the "guardrail-free" approach where UB is
| _silent_ is a bit out of the ordinary. A lot of software that
| makes assumptions about its input will at least try to
| validate them first, and a lot of programming language
| research will avoid UB by construction. But C is in a unique
| place where neither of those approaches fully work.
|
| > The C standard allows for both kinds of environments by
| stating that these behaviors are undefined, allowing the
| implementation to error out or do something sensible,
| depending on the environment.
|
| This is true, but it doesn't mean that "something sensible"
| is actually something the programmer should rely on! That's
| just asking too much of UB - programmers need to work _with_
| the semantics implemented by their toolchain, not make up an
| intuitive/"sensible" meaning for their undefined program and
| then get mad when it doesn't work.
|
| For example, if you want to scan through a bunch of memory,
| _tell the language that's what you're doing._ Is that memory
| at a fixed address? Tell the linker about it so it can show
| up as a normal global object in the program. Is it dynamic?
| Memory allocators fabricate new objects in the abstract
| machine all the time, perhaps your compiler supports an
| attribute that means "this function returns a pointer to a
| new object."
|
| The solution is not just to shrug and say "do something
| sensible but potentially dangerous." It's to precisely define
| the operations available to the programmer, and then provide
| tools to help them avoid misuse. If an operation isn't in the
| language, we can add it! If it's too easy to mess up, we can
| implement sanitizers and static analyzers, or provide
| alternatives! Yelling about a supposed misreading of
| "undefined behavior" is never going to be anywhere near as
| effective.
| vyodaiken wrote:
| One issue is that under the prevailing interpretation, the
| existing semantics are not reliable. You do not know when or
| if the compilers will take advantage of UB to completely
| change the semantics they are providing. That's not
| tenable.
| Rusky wrote:
| That's not how it works. Taking advantage of UB doesn't
| change the semantics, it just exposes which behaviors
| were never in the semantics to begin with. Barring
| compiler or spec bugs, we _do_ in principle know exactly
| when the compiler may take advantage of UB. That's the
| point of a document like the standard - it describes the
| semantics in a precise way.
|
| To be fair, the existing semantics are certainly complex
| and often surprising, and people sometimes disagree over
| what they are, perhaps even to an untenable degree, but
| that's a very different thing from being unreliable.
| vyodaiken wrote:
| The net result of your argument is the language has no
| semantics. I write and test with -O0 and show that
| f(k)=m. Then I run with -O3 and f(k)=random. Am I
| required to be an expert on C Standard and compiler
| development in order to know that, with no warning, my
| code has always been wrong? What about if f(k)=m under
| Gcc 10, but now under Gcc 10.1 that whole section of code
| is skipped? What you are asking programmers to do is to
| both master the fine points of UB (which is impractical)
| and look into the future to see what changes may be
| invisibly committed to the compiler code.
| marcosdumay wrote:
| The C standards have the perfectly fine name "implementation
| dependent" to describe those things. Undefined behavior is much
| less constrained than implementation dependent, and thus more
| problematic.
| cperciva wrote:
| No. "Implementation defined" says "the standard doesn't
| specify what happens here but the compiler must document what
| it does". That's a step removed from "the compiler _may_
| define what this does".
| tom_mellior wrote:
| > The C standards have the perfectly fine name
| "implementation dependent" to describe those things.
|
| That term is not used by the C standards. Do you mean
| "implementation-defined"? asm is not among the explicitly
| specified implementation-defined behaviors, it's listed under
| "Common extensions". I don't see any mention at all of
| syscalls in C99. (I'm working with http://www.dragonwins.com/
| courses/ECE1021/STATIC/REFERENCES/... here.)
| formerly_proven wrote:
| I'm not sure why syscalls would be UB; it's just not
| something defined by the C standard.
|
| Edit: To clarify, I meant UB in the sense it is typically
| used in these discussions, where the standard more-or-less
| explicitly says "If you do X, the behavior is undefined."
| Not in the literal sense of "ISO C does not say anything
| about write(2), hence using write(2) is undefined behavior
| according to the C standard", which seems like a rather
| tautological and useless statement to me.
| hvdijk wrote:
| What do you think UB is if not something where the
| behaviour is not defined?
| Google234 wrote:
| So is everything UB since all hardware isn't perfect?
| tsimionescu wrote:
| UB, in this context, is very explicitly used in the
| standard: it is undefined behavior related to a construct
| that the standard describes.
| faho wrote:
| So if it is behavior that is not defined by the C
| standard, would that not make it undefined behavior?
| [deleted]
| colejohnson66 wrote:
| While _technically_ correct, "undefined behavior" in
| terms of C and C++ refers to what the _standard_ calls out
| explicitly as undefined, and not a simple "it's not
| referenced, therefore it's undefined."
|
| For example, signed(?) integer overflow is explicitly
| undefined by the standard, but as @formerly_proven said,
| just because write(2) isn't mentioned doesn't mean usage
| of it is undefined.
| _kst_ wrote:
| Actually, that's exactly what it means:
|
| > If a "shall" or "shall not" requirement that appears
| outside of a constraint or runtime-constraint is
| violated, the behavior is undefined. Undefined behavior
| is otherwise indicated in this International Standard by
| the words "undefined behavior" or by the omission of any
| explicit definition of behavior. There is no difference
| in emphasis among these three; they all describe
| "behavior that is undefined".
|
| write() is a function, and a call to it behaves like a
| function call, but the C standard says nothing about what
| that function does. You could have a function named
| "write" that writes 0xdeadbeef over the caller's stack
| frame. Of course if "write" is the function defined by
| POSIX, then POSIX defines how it behaves.
| steveklabnik wrote:
| There is a difference between jargon in context and the
| use of those words in a general sense. It can be
| "undefined behavior" in a general sense, but not
| necessarily "undefined behavior" in the jargon sense.
|
| After all, if I were to use the words "undefined
| behavior" in a sentence unrelated to the standards, the
| definition in the standard of "behavior, upon use of a
| nonportable or erroneous program construct or of
| erroneous data, for which this International Standard
| imposes no requirements." would be nonsense. Same goes in
| the other direction.
| jcranmer wrote:
| Neither system calls nor the asm keyword are undefined behavior
| in the sense that C uses the term. They are, simply put, not
| covered by the standard at all.
|
| System calls--assuming you're referring to the C prototypes you
| call--work as normal external function definitions, just having
| semantics which are defined by the library (i.e., the kernel)
| and not the C specification itself. The asm keyword is a
| compiler language extension and is effectively implementation-
| defined (as C would call it), although compilers today tend to
| poorly document the actual semantics of their extensions.
| Joker_vD wrote:
| > all of the tricks which make "malloc" work,
|
| What are those, exactly? AFAIK, you can safely track memory
| addresses by storing them as intptr_t/uintptr_t.
| cperciva wrote:
| The C standard says very little about how those types work.
| In particular, you can cast a pointer to one of them and then
| cast back to a pointer -- but only if you cast the exact same
| value back, and the intptr values are not guaranteed to be in
| any way meaningful.
|
| In particular, casting a pointer to intptr_t, doing
| arithmetic on it, and casting back is not guaranteed to do
| anything useful. It almost certainly will, since most systems
| treat it as roughly the same as casting to char *, but the
| standard does not guarantee it.
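|
| A sketch of the distinction:
|
|     #include <stdint.h>
|
|     int x;
|
|     void demo(void) {
|         intptr_t n = (intptr_t)&x; /* allowed */
|         int *p = (int *)n;         /* allowed: same value back */
|         /* int *q = (int *)(n + 1);   no portable meaning */
|         (void)p;
|     }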
| magnostherobot wrote:
| Do you have an example of a situation in which you'd want
| to cast the result of arithmetic on intptr_t values to a
| pointer? The situations I can think of off the top of my
| head would be better done as arithmetic between pointers.
| bonzini wrote:
| Arithmetic on pointers in turn is only defined if the
| pointers point within the same object (or right past the
| end of that object).
|
| One example of using intptr_t would be going from a
| pointer passed to free() to a metadata block for the
| memory that must be freed.
| jacquesm wrote:
| Oh, for instance, on some implementations there is a lot of
| interesting stuff just prior to the allocated block returned.
| Not exactly the pinnacle of elegance but it gets the job
| done.
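|
| A sketch of that layout (hypothetical names, not any
| particular malloc implementation):
|
|     #include <stddef.h>
|
|     typedef struct { size_t size; } header;
|
|     /* free() recovers its bookkeeping by stepping back from
|        the user pointer it was handed. */
|     static header *hdr_of(void *user_ptr) {
|         return (header *)((char *)user_ptr - sizeof(header));
|     }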
| kmm01 wrote:
| I'm afraid the author themselves is misreading the definition of
| undefined behavior. Undefined behavior is not "behavior upon use
| of a nonportable or erroneous program construct". That rephrasing
| completely changed the meaning. Undefined behavior is, as the C
| standard states, "behavior [, ...,] for which the Standard
| imposes no requirements". The whole ", upon use of...," part is
| just exemplifying situations in which undefined behavior can
| occur. The standard will sometimes say that a certain construct
| results in undefined behavior but more importantly any construct
| for which the standard does not specify a certain behavior has
| (what else?) undefined behavior.
| warkdarrior wrote:
| True. I think the point of the author is that the C standards
| group is not doing their job by leaving so much room for
| compilers to interpret undefined behavior.
| mumblemumble wrote:
| > C can accommodate significant optimization while regaining
| semantic coherence - if only the standard and the compilers stop
| a lazy reliance on a mistaken reading of "impose no
| requirements".
|
| That wouldn't be enough, because, sadly for us, rewriting history
| is only an option in Git repositories and speculative fiction. On
| this timeline, it doesn't much matter how the standard should
| have been interpreted (I'll refrain from opining on that), what
| matters is how the standard was interpreted, and its influence on
| the behavior of major compilers.
|
| Given the current situation, it seems to me that reaping the
| benefits of such an enterprise would require getting everyone on
| board with a mode that's based on the new interpretation, leaving
| it turned off by default, and then basically never actually using
| it for fear of breaking compatibility with toolchains that are
| stuck on an older version of the compiler for a good decade or
| two while we wait for them to age out of existence.
|
| I'm not sure the world actually has much practical use for an
| unapproachable ivory tower dialect of C. Personally, I'd much
| rather have a go at a language like Zig that's taken it upon
| itself to dream even bigger.
| alerighi wrote:
| To me the real problem is that compilers these days do too much.
| Nobody asked for these optimizations; if you want optimized code,
| use C++, or Rust, or whatever other "modern" language that has
| higher level constructs and thus can optimize the code better.
|
| The reason to use C is to have an "high level assembler", and
| there is no reason to use C if I can't rely on the output of the
| compiler, and if the compiler eliminates some code that I write.
| We all know what happens when a signed integer overflows, we all
| know that most architectures are little endian, etc.
|
| Even if something is "undefined behaviour" for the standard, it
| usually has some well defined behaviour on the platform that you
| are targeting.
|
| Unfortunately, for these reasons gcc is dangerous to use. It's
| not a problem for desktop applications (if one crashes, who
| cares), but I'm talking about critical systems. It would be
| unacceptable for a machine to end up killing someone because the
| gcc developers thought they could remove a check they considered
| useless. And so you have to spend thousands of dollars just to
| get reliable compilers that don't do silly optimizations.
|
| I think they should make a sort of "gcc lite" that does
| exactly what (at least to me) a C compiler has to do: translate
| the code from C to assembly without changing the semantics of
| the code. If there is an undefined behaviour, that behaviour
| will be translated into the assembly code and the processor
| will be left to deal with it.
|
| Also, we talk about optimizations that could be made by
| exploiting undefined behaviour, fine, but the programmer also
| usually optimizes by hand by exploiting those same undefined
| behaviours.
| curtisf wrote:
| I think the reasoning here is flowing backwards.
|
| The writer wants to believe that C is a well-designed language
| suitable for writing large programs (because programmers
| understandably use it that way; there's not really an alternative
| to C), and so people reading the spec and finding a minefield
| _must_ be reading the spec wrong. So many important programs are
| written in C, and so many of them, with a very strict reading of
| the C standard, can hit cases where their behavior is
| "undefined". This _is_ scary, if the C-language-lawyers are
| right!
|
| The C language was originally largely descriptive, rather than
| prescriptive. Early "C" compilers disagreed on what to do in
| strange cases (e.g., one might wrap integer overflow, one might
| saturate, one might have wider ints). Even when using the less-
| chaotic "implementation defined behavior", behavior can still
| diverge wildly: `x == x + 1` is definitely `false` under some of
| those interpretations and maybe `true` under others.
|
| However, the C spec clearly says that the compiler may "ignore
| the situation" that "the result is ... not in the range of
| representable values for its type"; it is "permissible" that `x
| == x + 1` is replaced with `false` despite the "actual"
| possibility that adding 1 to x produces the same value, if `+`
| was compiled to be a saturating add.
|
| This has significant practical consequences, even without the
| "poisoning" result commonly understood of undefined behavior.
| Since the value is known statically to be `false`, that might be
| inlined into a call into a function. That function may
| _dynamically_ re-check `x == x + 1` and find that it is `true`;
| obviously that function doesn't have a `if (true && false) {`
| case, so it results in the function misbehaving arbitrarily
| (maybe it causes a buffer overrun to the argument of a syscall!).
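|
| A sketch of that shape (hypothetical, imagining `+` compiled
| as a saturating add):
|
|     int is_fixed_point(int x) {
|         /* statically folded to 0; a saturating add would
|            make it 1 at INT_MAX, so an inlined constant and
|            a dynamic re-check can disagree */
|         return x == x + 1;
|     }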
|
| 'Intuition' does not make a programming-language semantics. You
| need to write down all the rules. If you want to have a language
| without undefined behavior, you need to write down the rules for
| what must happen, keeping in mind that many examples of undefined
| behavior, like dereferencing out-of-bounds pointers, _cannot_ be
| detected dynamically in C without massive performance costs. To
| detect if a pointer is out-of-bounds, you need to always pair it
| with information about its provenance; you need to track whether
| or not the object has been freed, or the stack-frame it came from
| has expired. Is replacing all pointers with fat-pointers
| indicating their provenance and doing multiple comparisons before
| every dereference the "right" way to compile C?
| rocqua wrote:
| Quoting from the same passage in the standard as the article
| does:
|
| > Permissible undefined behavior ranges from ignoring the
| situation completely with unpredictable results, to ...
|
| Ignoring the situation completely with unpredictable results
| seems like it covers the current compiler behavior.
|
| The author does not like how current compilers work. But his
| argument against it mixes "it would be better if it worked
| differently" with "A specific pedantic reading of the standard
| says they are wrong". The second kind of argument seems to
| undercut his wider point. For his wider point is "Compilers
| should be reasonable rather than turning on pedantry". At least,
| that is what I think his point is, and it seems like the much
| stronger argument to me.
|
| Trying to "trip up" the proponents of current behavior by
| pointing to a possible miss-reading of a comma is not going to do
| much. Arguing instead that their practice is harmful seems like a
| much more likely to work approach. That said, such an argument
| should probably be civil. The article links to this [1]
| discussion. The link is supposed to show the peculiar arguments
| used by proponents of current behavior. What I read there is
| someone lashing out, calling names, and grand-standing.
| Convincing compilers to be more reasonable is probably going to
| require a very different tone. Not one of "how dare you be so
| stupid" but one of "perhaps you could consider this side-effect
| of current behavior" and "have you considered this approach".
|
| [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
| tsimionescu wrote:
| > The article links to this [1] discussion. The link is
| supposed to show the peculiar arguments used by proponents of
| current behavior. What I read there is someone lashing out,
| calling names, and grand-standing.
|
| Even worse, the person who raised that issue is mostly just
| wrong: the code they wrote had not worked with any
| optimizations turned on for about 13 years at the time they
| raised that bug. The fact that it worked in debug builds seems
| irrelevant, so the whole bug is complaining about a change that
| had no impact on the 99% of released code which is compiled
| with -O2 or -O3.
| superjan wrote:
| I think the best point in this article is that C, a language
| invented to write OS code, is currently difficult to use for that
| purpose due to the current handling of undefined behavior. If you
| write low level OS code, you are keenly aware of the behavior of
| the machine you target. There is no point trying to define a
| language like C in such a way that the only valid programs are
| those that would run identically on any hypothetically allowed
| hardware. For instance, on x64, there are no alignment
| restrictions; why should I be punished for making use of that
| fact? And what about all this C code with SSE intrinsics? Should
| we ban that too?
| Spivak wrote:
| > So, "i << 32" buried in a million lines of database code makes
| the entire program junk - supposedly as an "optimization"!
|
| I think the author has it wrong here because they're assuming
| lines and their order have any meaning when the program is
| compiled all together. Imagine a compiler where (I know that
| this is not actually possible to do) undefined behavior was a
| compiler error.
|
| You would sound crazy for complaining that "a syntax error on
| line 4325 makes the whole program uncompilable."
|
| It's not like running C programs have any idea that they are
| hitting undefined behavior. It's just that the generated
| assembly is allowed to assume that you know what you're doing.
|
|     i << runtime_val
|
| can just generate the shift without worry that it will blow up,
| and propagate the knowledge up to the optimizer that
| runtime_val's domain is [0,31].
| 1MachineElf wrote:
| This was posted by a user with a name _closely_ matching the
| domain (perhaps the original author?) 16 hours prior, and
| flagged: https://news.ycombinator.com/item?id=27215697
|
| When I stumbled across it last night, I couldn't understand why
| that would be. The content seemed good enough for readers on
| here, and this one's placement on the 2nd page of HN seems to
| confirm that. What's going on?
| DangitBobby wrote:
| It truly boggles my mind that this discussion (linked in TFA)
| played out the way it did. In my mind it's 100% not okay to (by
| default) optimize away a check for integer overflow. I've never
| really written in C (or any unsafe language) before so I had
| little context for the types of traps C sets for developers.
| Based on the responses of the person who presumably implemented
| the optimization, it comes as no surprise that C is so dangerous.
| After reading this I hope I never have to use gcc.
|
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
| tom_mellior wrote:
| > In my mind it's 100% not okay to (by default) optimize away a
| check for integer overflow.
|
| But a + 100 > a is not a check for overflow. If a > INT_MAX -
| 100, it _is_ an overflow. The "will this operation overflow"
| check would be a > INT_MAX - 100, and GCC would not optimize
| that away.
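|
| A sketch of the difference (hypothetical helper name):
|
|     #include <limits.h>
|     #include <stdbool.h>
|
|     /* Tests *before* adding, so the overflowing addition is
|        never evaluated and no UB is involved. */
|     static bool add100_would_overflow(int a) {
|         return a > INT_MAX - 100;
|     }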
| DangitBobby wrote:
| What you've written doesn't demonstrate the issue outlined in
| the bug report. The issue is that real-world code defends
| against certain hostile inputs like so, and that the
| optimization breaks those defenses:
|
|     int a, b, c;
|     a = INT_MAX; /* statement ONE */
|     b = a + 2;   /* statement TWO */
|     c = (b > a); /* statement THREE */
|
| Whether or not you think this is _technically_ permissible by
| a sufficiently self-serving (from the perspective of a person
| whose ONLY goal is speed optimization) reading of the spec is
| irrelevant. Any reading of the spec should err on the side of
| protection against real-world consequences.
|
| I don't want lawyering and technically correct-ing, I want
| pragmatism.
| tom_mellior wrote:
| If you want pragmatism, check whether something is legal
| _before_ you do it.
| DangitBobby wrote:
| This is addressed quite thoroughly in the thread that I
| linked. It is generally impossible to determine whether
| your code has a bug, and you cannot possibly be
| responsible for every bit of code that your code may
| happen to touch, but even if you could, mistakes and bad
| practices will happen. Practically speaking, it does not
| matter if you can find a way to displace blame onto the
| writer of the code. Placing blame is not how one
| minimizes harm. Maybe we have different definitions of
| pragmatism in our heads.
|
| I would also say that leaving such a common occurrence as
| integer overflow undefined behavior, rather than recognizing
| that it practically always has platform- or implementation-
| specific behavior, doesn't make much sense. I mean, it does
| always have a platform-specific implementation, right? Does
| that not exclude it from the definition of undefined
| behavior? Genuinely curious.
|
| The downvotes on my previous comments tell me this sort
| of thinking is endemic. I suddenly have more insight into
| the concern people have about software development as a
| field.
| olliej wrote:
| I dislike how UB is used where unspecified would work.
|
| For instance overflowing arithmetic is well specified on every
| architecture - the behaviour may be different on say sparc vs an
| hc12 vs x86, but on any of those architectures it will always be
| the same. Yet compiler devs have instead decided that it is
| undefined and so can be treated however they like.
|
| There are so many of these UBs that could be unspecified that
| it's absurd - UB should be reserved for when the outcome is not
| consistent - UaF, invalid pointers, out-of-range accesses, etc.
| MauranKilom wrote:
| The C++ committee recently (for the C++20 standard) voted
| against making signed overflow defined, opting to keep it
| undefined. Primarily for performance reasons (and because it
| would usually be a bug anyway).
|
| www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r4.html#r0r1
|
| You're of course free to disagree with the reasoning. But you'd
| probably agree that a new standard revision that _forces_ the
| average application to become, say, 5% slower, just so that
| broken programs would be just as broken, would not be very
| well-received.
| joelkevinjones wrote:
| Two points:
|
| The notion that compilers that encounter undefined behavior are
| allowed to generate any code they want is a new interpretation,
| for some value of "new". I can't remember the first time I
| encountered such an interpretation being used by compiler writers
| to justify something they wanted to do until sometime after 2000.
|
| The notion that John Regehr has (quoted in the article) that
| undefined behavior implies the whole execution is meaningless is
| not supported by the language of either the C89 or C99 standard,
| at least by my reading. The C89 standard has a notion of sequence
| points. Wouldn't all sequence points executed before undefined
| behavior is encountered be required to occur as if the undefined
| behavior wasn't there? It would seem so:
|
| From the C89 standard: 2.1.2.3 Program execution
|
| The semantic descriptions in this Standard describe the behavior
| of an abstract machine in which issues of optimization are
| irrelevant. Accessing a volatile object, modifying an object,
| modifying a file, or calling a function that does any of those
| operations are all side effects, which are changes in the state
| of the execution environment. Evaluation of an expression may
| produce side effects. At certain specified points in the
| execution sequence called sequence points, all side effects of
| previous evaluations shall be complete and no side effects of
| subsequent evaluations shall have taken place.
|
| The C99 standard has nearly identical language:
|
| 5.1.2.3 Program execution
|
| 1 The semantic descriptions in this International Standard
| describe the behavior of an abstract machine in which issues
| of optimization are irrelevant.
|
| 2 Accessing a volatile object, modifying an object, modifying a
| file, or calling a function that does any of those operations are
| all side effects,11) which are changes in the state of the
| execution environment. Evaluation of an expression may produce
| side effects. At certain specified points in the execution
| sequence called sequence points, all side effects of previous
| evaluations shall be complete and no side effects of subsequent
| evaluations shall have taken place.
| AlotOfReading wrote:
| No, sequence points aren't relevant because they occur at
| runtime. "Time traveling UB" happens in the compiler, typically
| in optimization passes and can cause otherwise valid code to
| exhibit completely different behavior than it would if the UB
| didn't exist.
| MauranKilom wrote:
| The standard (in the prevailing reading of the UB section, and
| also in practice) places no requirements on the behavior of
| programs containing UB. None of the paragraphs you quoted have
| any bearing on how an UB-laden program behaves.
| zlynx wrote:
| > Wouldn't all sequence points executed before undefined
| behavior is encountered be required to occur as if the
| undefined behavior wasn't there? It would seem so:
|
| No. Code optimization is a series of logic proofs. It is like
| playing Minesweeper. If a revealed square shows a count of 1
| and you have already found its one neighboring mine, then you
| know the 7 other neighboring squares are safe. In other
| Minesweeper situations you make a proof that
| is much more complex and allows you to clear squares many steps
| away from a revealed mine. If you make a false assumption of
| where a mine is, via a faulty proof, then you explode.
|
| The compiler is exactly like that. "If there is only one
| possible code path through this function, then I can assume the
| range of inputs to this function, then I can assume which
| function generated those inputs..."
|
| You can see how the compiler's optimization proof goes "back in
| time" proving further facts about the program's valid behavior.
|
| If the only valid array indexes are 0 and 1 then the only valid
| values used to compute those indexes are those values that
| produce 0 and 1.
|
| This isn't even program execution. In many cases the code is
| collapsed into precomputed results which is why code
| benchmarking is complicated and not for beginners. Many naive
| benchmark programs collapse 500 lines of code and loops into
| "xor eax,eax; ret;" A series of putchar, printf and puts calls
| can be reduced to a single fwrite and a malloc/free pair can be
| replaced with an implicit stack alloca because all Standard
| Library functions are known and defined and there is no need to
| actually call them as written.
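|
| A sketch of that benchmarking trap (assuming -O2 or
| similar):
|
|     #include <stdio.h>
|
|     static long work(void) {
|         long sum = 0;
|         for (long i = 0; i < 1000000000L; i++)
|             sum += i & 1;         /* folds to a constant */
|         return sum;
|     }
|
|     int main(void) {
|         printf("%ld\n", work()); /* nothing left to time */
|         return 0;
|     }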
| muth02446 wrote:
| Also relevant:
|
| What every compiler writer should know about programmers or
| "Optimization" based on undefined behaviour hurts performance
|
| by M. Anton Ertl
|
| https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_20...
| mjburgess wrote:
| Commenters here seem to be missing the core thesis of this
| article. It's not about what the standard literally means; it's
| about its spirit -- and the _reason_ for its spirit.
|
| The issue is "undefined behaviour" should never have been
| interpreted this extremely. The standard may be silent on how
| extreme, but it is implausible to suggest that the standard was
| actually written to enable this.
|
| Compiler writers _for C!_ dismiss severe issues which occur in
| compiling any program with undefined behaviour; issues which
| would render any modern language a bad joke.
|
| Compiler writers are using this as a perverse shield: to simply
| fail to optimize for correctness (or to provide the means to),
| and to enable themselves to optimize only for performance.
|
| Are we really saying the _standard_ supports this sleight-of-
| hand? It seems more like using the Second Amendment to murder
| someone.
| steveklabnik wrote:
| Personally, I think that arguing that those who define and
| implement a standard don't understand one of the most
| fundamental aspects of said standard is going to be an uphill
| battle.
|
| You could argue that they've lost their way, and the article
| flirts with this, but the path forward is the hard part, and
| IMHO rings a bit hollow: it's asserted that these rules aren't
| needed for performance, but no evidence is given, and what
| similar evidence we do have (compiling on lower optimization
| levels) doesn't seem to support this thesis. You could argue
| that the kernel, which turns off strict aliasing, is plenty
| performant without it, and that's a decent argument, but it's
| not clear that it wouldn't be even faster with it, and it's
| much harder to empirically test this than removing the flag,
| since re-enabling it would miscompile the kernel.
| bonzini wrote:
| Different code depends on different optimizations. A loop on
| an int** might benefit a lot from aliasing optimizations,
| because the compiler will assume that a[i] will remain the
| same after writing to a[i][j]. Other code may not benefit at
| all.
|
| Likewise that loop may not benefit from signed overflow;
| instead an initialization loop that, by way of multiple levels
| of macros, ends up doing a[i]=b[i]*1000/100, might become
| twice as fast if signed overflow rules let the compiler
| rewrite the assignment as a[i]=b[i]*10.
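|
| A sketch of that second transform (valid exactly because
| the compiler may assume b[i]*1000 cannot wrap):
|
|     void init(int *a, const int *b, int n) {
|         for (int i = 0; i < n; i++)
|             a[i] = b[i] * 1000 / 100; /* may become b[i] * 10 */
|     }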
| _kst_ wrote:
| The author suggests that the text following the definition of
| "undefined behavior", listing the permitted or possible range of
| undefined behavior, should be read to restrict the consequences.
|
| But the first possibility listed is "ignoring the situation
| completely with unpredictable results". Surely that covers any
| possible consequences.
|
| The author also says:
|
| > Returning a pointer to indeterminate value data, surely a
| "use", is not undefined behavior because the standard mandates
| that malloc will do that.
|
| Returning a pointer to data is not a use of that data. The fact
| that its value is indeterminate isn't relevant until you attempt
| to read it (without first writing it).
|
| It may be worthwhile to reduce the number of constructs whose
| behavior is undefined, making them implementation-defined or
| unspecified instead. For example, if signed integer overflow
| yielded an unspecified result rather than causing undefined
| behavior, I wonder if any implementations would be adversely
| affected. (But it would remove the possibility of aborting a
| program that computes INT_MAX+1.)
|
| I don't think reinterpreting "undefined behavior" as anything
| other than "the Standard imposes _no_ requirements " is
| practical. If a program writes through a dangling pointer and,
| for example, clobbers a function's return address, what
| constraints could be imposed on what the program might do next?
| btilly wrote:
| _The author suggests that the text following the definition of
| "undefined behavior", listing the permitted or possible range
| of undefined behavior, should be read to restrict the
| consequences._
|
| _But the first possibility listed is "ignoring the situation
| completely with unpredictable results". Surely that covers any
| possible consequences._
|
| Absolutely not. In the C89 standard, undefined behavior becomes
| undefined *UPON USE OF* the thing that is undefined. In current
| compilers, the existence of undefined behavior anywhere in your
| program is an excuse to do anything that the compiler wants to
| with all of the rest of your program. Even if the undefined
| behavior is never executed. Even if the undefined behavior
| happens after the code that you have encountered.
|
| So, for example, undefined behavior that can be encountered
| within a loop makes it allowable to simply remove the loop.
| Even if the undefined behavior is inside of an if that does not
| happen to evaluate to true with your inputs.
| aw1621107 wrote:
| > In the C89 standard, undefined behavior becomes undefined
| _UPON USE OF_ the thing that is undefined.
|
| Is that still the case for current C standards, or did
| something change in C99/C11?
| btilly wrote:
| I don't have the C11 standard. But that part of the passage
| remained unchanged in C99.
|
| In C89 there was a list of PERMISSIBLE things that
| compilers could do upon encountering undefined behavior. In
| C99 that was changed to a list of POSSIBLE things. And
| compilers have taken full advantage of that.
| Jweb_Guru wrote:
| This is actually _desired_ though, at least by some programs.
| For example, say you have a function with a very expensive
| loop that repeatedly performs a null check and then executes
| some extra code if it's null, but never sets the value. This
| is called from another function which uses the checked value
| without a null check (proving it's not null) before and after
| the loop ends. The first function is inlined. You want to
| tell the compiler not to optimize out the null check and
| extra code in the loop? Or that it can't optimize stuff out
| to reuse the value from the first use of the value? If so,
| what _is_ the compiler allowed to optimize out or reorder?
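|
| A sketch of that shape (hypothetical names):
|
|     void expensive(int *p, int n) {
|         for (int i = 0; i < n; i++) {
|             if (p == 0) { /* ... fallback work ... */ }
|             /* ... expensive body ... */
|         }
|     }
|
|     int caller(int *p, int n) {
|         int before = *p; /* dereference proves p non-null */
|         expensive(p, n); /* after inlining, the in-loop null
|                             check is provably dead */
|         return before + *p;
|     }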
|
| Now, to see why this might actually produce a bug in working
| code--say some other thread has access to the not-null value
| and sets it racily (non-atomically) to null. Or (since most
| compilers are super conservative about checks of values that
| escape a function because they can't do proper alias
| analysis), some code accidentally buffer overflows and
| updates the pointer to null while intending to do something
| else. Suddenly, this obvious optimization becomes invalid!
|
| Arguments to the effect of "the compiler shouldn't optimize
| out that loop due to assuming absence of undefined behavior"
| are basically arguments for compilers to leave tons of
| performance on the table, due to the fact that sometimes C
| programs don't follow the standard (e.g. forgetting to use
| atomics, or indexing out of bounds). While it's a legitimate
| argument, I don't think people would be too happy to find
| their C programs losing to Java in benchmarks on -O3, either.
| MauranKilom wrote:
| > So, for example, undefined behavior that can be encountered
| within a loop makes it allowable to simply remove the loop.
| Even if the undefined behavior is inside of an if that does
| not happen to evaluate to true with your inputs.
|
| The last sentence is not true. If there is UB inside the if,
| the compiler may assume that the if condition never evaluates
| to true (and hence delete that branch of the if), but it may
| certainly not remove the surrounding loop (unless it can
| _also_ prove that the condition _must_ be true).
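|
| A sketch of the distinction (hypothetical code):
|
|     void f(int *a, int n, int k) {
|         for (int i = 0; i < n; i++) {
|             if (k == 0)
|                 a[i] /= k; /* UB whenever this branch runs,
|                               so the compiler may assume it
|                               never runs and delete it... */
|             a[i] += 1;     /* ...but the loop itself stays */
|         }
|     }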
| ynik wrote:
| > For example, if signed integer overflow yielded an
| unspecified result rather than causing undefined behavior, I
| wonder if any implementations would be adversely affected.
|
| You don't need to wonder. You can use -fwrapv to make signed
| integer overflow defined behavior.
|
| C++20 introduced the guarantee that signed integers are two's
| complement. The original version of that proposal also defined
| the behavior on overflow; but that part was rejected (signed
| integer overflow remains UB): http://www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2018/p090... So at least the
| committee seems to think that the performance advantages are
| worth it.
| anarazel wrote:
| > For example, if signed integer overflow yielded an
| unspecified result rather than causing undefined behavior, I
| wonder if any implementations would be adversely affected.
|
| I suspect so - makes it harder to reason about loop counts
| because the compiler can't necessarily guarantee that an
| incremented loop counter won't become negative and thus the
| loop needs to iterate more.
|
| E.g. something like for (int i=param; i < param + 16; i++) has
| a guaranteed loop count with the current rules, but not with
| yours?
|
| That's not an excuse for not having any way to do proper
| overflowing operations on signed integers though.
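|
| (GCC and Clang do provide non-standard checked builtins; a
| sketch with a hypothetical wrapper name:
|
|     #include <stdbool.h>
|
|     bool checked_add(int a, int b, int *out) {
|         /* true iff the mathematically exact sum did not fit */
|         return __builtin_add_overflow(a, b, out);
|     }
|
| but ISO C itself offers no portable equivalent.)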
| Asooka wrote:
| > I suspect so - makes it harder to reason about loop counts
| because the compiler can't necessarily guarantee that an
| incremented loop counter won't become negative and thus the
| loop needs to iterate more.
|
| This is a favourite example that gets thrown around, but for
| all practical loops GCC and clang seem to have no problem
| even when you compile with -fwrapv
| anarazel wrote:
| Postgres has compiled with -fwrapv for many years now, and yes,
| it does introduce a measurable CPU overhead. Not 10%, but
| also not just 0.1%.
| OskarS wrote:
| Assuming you defined signed integer overflow to follow two's
| complement rules (the only reasonable interpretation other
| than UB), it would still be a guaranteed loop count of 16.
| (EDIT: i'm a dumbass, this is obvs not true. disregard this
| paragraph)
|
| There's an interesting thing to note with that example
| though: even if you did make signed integer overflow defined,
| that code is still obviously incorrect if param + 16
| overflows. Like, the fact that signed integer overflow is UB
| is totally fine in this example: making it defined behavior
| doesn't fix the code, and if making it UB allows the compiler
| to optimize, then why not?
|
| Arguably, this is the case with the vast majority of signed
| integer overflow examples: the UB isn't really the issue, the
| issue is that the programmer didn't consider overflow, and if
| overflow happens the code is incorrect regardless. Why
| cripple the compilers ability to optimize to protect cases
| which are almost certainly incorrect anyway?
| Gibbon1 wrote:
| The real problem is that in a better world 'int' would be
| replaced by types that actually exhibit the correct
| behavior.
|
| For a loop counter you want an index type that will seg
| fault on overflow. If you think omitting that check is
| worth it, the programmer would need to tag it with unsafe.
|
| It's also problematic because its size is defined as at
| least 16 bits, which means you should never use it to store
| a constant larger than 16 bits. But people do that all the
| time.
| thewakalix wrote:
| Many people write C programs that are not intended to be
| portable to 16-bit architectures.
| OskarS wrote:
| I'm not sure I agree. If signed overflow is UB, loops
| like this can be optimized the hell out of. The most
| obvious way would be to unroll it and eliminate the loop
| (and loop variable) entirely, but you can also do things
| like vectorize it, maybe turn it in to just a small
| number of SIMD instructions. The performance gains are
| potentially enormous if this is in a hot loop.
|
| With your magic int that traps on overflow, you couldn't
| do that if the compiler was forced to rely on that
| behaviour. This is exactly why signed overflow is UB in
| C, and I don't think that's an unreasonable case for a
| language like C.
|
| To be clear, my point is that this program is incorrect
| if overflow happens regardless of whether overflow is UB
| or not. So you might as well make it UB and optimize the
| hell out of it.
| cygx wrote:
| The broader argument is that signedness of the integer
| type used for indexing is a non-obvious gotcha affecting
| vectorizability. It makes sense once you understand C
| integer semantics, but putting on a language designer
| hat, I'd go with something more explicit.
| _kst_ wrote:
| for (int i=param; i < param + 16; i++)
|
| does not have a guaranteed loop count with the current rules.
| The loop body will execute 16 times if param <= INT_MAX-16,
| but if the expression "param + 16" can overflow, the behavior
| is undefined. (I'm assuming param is of type int.)
| BruiseLee wrote:
| The compiler is allowed to act as if this loop executes
| exactly 16 times. That means it could unroll and vectorize
| it for example.
| vyodaiken wrote:
| It is completely useless to allow compilers to assume
| false things about the code they generate.
| fooker wrote:
| It's not useless. The assumption is not false if the
| program doesn't have undefined behavior. The assumption
| allows the code to be a few times faster. To disallow
| this assumption would inhibit these optimizations.
| msbarnett wrote:
| > does not have a guaranteed loop count with the current
| rules. The loop body will execute 16 times if param <=
| INT_MAX-16, but if the expression "param + 16" can
| overflow, the behavior is undefined. (I'm assuming param is
| of type int.)
|
| And the standard permits us (among other responses) to
| ignore undefined behaviour, so it _does_ have a guaranteed
| loop count under a reading of the standard which the
| standard specifically and explicitly allows.
| _kst_ wrote:
| No, the standard permits the implementation to ignore the
| behavior "with unpredictable results".
|
| If the value of param is INT_MAX, the behavior of
| evaluating param + 16 is undefined. It doesn't become
| defined behavior because a particular implementation
| makes a particular choice. And the implementation doesn't
| have to tell you what choice it makes.
|
| What the standard means by "ignoring the situation
| completely" is that the implementation doesn't have to be
| aware that the behavior is undefined. In this particular
| case:
|
| for (int i=param; i < param + 16; i++)
|
| that means the compiler can assume there's no overflow
| and generate code that always executes the loop body
| exactly 16 times, or it can generate naive code that
| computes param + 16 and uses whatever result the hardware
| gives it. And the implementation is under no obligation
| to tell you how it decides that.
| tsimionescu wrote:
| > If the value of param is INT_MAX, the behavior of
| evaluating param + 16 is undefined. It doesn't become
| defined behavior because a particular implementation
| makes a particular choice. And the implementation doesn't
| have to tell you what choice it makes.
|
| The compiler writer argument is as follows:
|
| The program either has UB (when param is INT_MAX - 15 or
| higher) or exactly 16 iterations. Since we are free
| to give any semantics to a UB program, it is standard-
| compliant to always execute 16 times regardless of
| param's value.
| msbarnett wrote:
| > that means the compiler can assume there's no overflow
| and generate code that always executes the loop body
| exactly 16 times
|
| Right. That's what I said.
|
| And just to be super-precise about the wording, the
| standard doesn't say "ignore the behavior 'with
| unpredictable results'" it says "Permissible undefined
| behavior ranges from ignoring the situation completely
| with unpredictable results". Nitpicky, but the former
| wording could be taken to imply that ignoring behavior is
| only permissible if the behavior is unpredictable, when
| what the standard actually says is that you can ignore
| the behavior, even if the results of ignoring it are
| unpredictable.
| _kst_ wrote:
| And my point is that as far as the language is concerned,
| there is no guaranteed loop count under any
| circumstances. (An implementation is allowed, but not
| required, to define the behavior for that
| implementation.)
| Smaug123 wrote:
| The two of you are not disagreeing except insofar as
| you're both using the word "guaranteed" to mean
| completely different things. _kst_, you're using it to
| mean "the programmer can rely on it". msbarnett, you're
| using it to mean "the compiler can rely on it".
| dooglius wrote:
| Either the limit on param is guaranteed in some way by
| the rest of the program, or it is not. If it is, then the
| loop count is guaranteed in both cases. If it is not, the
| loop count is not guaranteed in either case.
| msbarnett wrote:
| That you wish that the C Standard mandated this
| interpretation does not change the fact that this is not
| what the C Standard says.
| dooglius wrote:
| You are mistaken, the C standard is quite clear that it
| does not make any guarantees regarding the behavior of
| programs that exhibit undefined behavior, and that signed
| integer overflow is undefined behavior.
| masklinn wrote:
| They're not mistaken. What compilers will do is assume
| that UB doesn't happen. If no UB happens, that means `param
| + 16` never overflowed, therefore there are always
| exactly 16 iterations.
| msbarnett wrote:
| "for (int i=param; i < param + 16; i++) does not have a
| guaranteed loop count in the presence of undefined
| behavior" is true, but it's equally true that the C
| standard is quite clear that undefined behavior can be
| ignored, so we can validly treat "for (int i=param; i <
| param + 16; i++)" as if it were guaranteed to loop 16
| times in all cases.
| _kst_ wrote:
| No, the C standard doesn't say that "undefined behavior
| can be ignored" (which would mean what, making it
| defined?).
|
| It says, "NOTE Possible undefined behavior ranges from
| ignoring the situation completely with unpredictable
| results, ...".
|
| It doesn't say that the _behavior_ can be ignored. It
| says that the _undefinedness_ can be ignored. The
| implementation doesn 't have to take notice of the fact
| that the behavior is undefined.
|
| Let's take a simpler example:
| printf("%d\n", INT_MAX + 1);
|
| The behavior is undefined. The standard does not
| guarantee anything about it. A conforming implementation
| can reject it at compile time, or it can generate code
| that crashes, or it can generate code that emits an ADD
| instruction and print whatever the hardware returns, or
| it can play rogue at compile time. (The traditional joke
| is that it can make demons fly out of your nose. Of course
| it can't, but an implementation that did so would be
| physically impossible, not non-conforming.)
|
| An implementation might define the behavior, but it's
| still "undefined behavior" as that term is defined by the
| ISO C standard.
| msbarnett wrote:
| "undefined behavior can be ignored" (meaning: the case
| where this _could_ overflow need not be considered and
| can be treated as though it does not exist) vs "The
| implementation doesn't have to take notice of the fact
| that the behavior is undefined" strikes me as a
| distinction without a difference given that we land in
| exactly the same spot: the standard allows us to treat
| "for (int i=param; i < param + 16; i++)" as if it were
| guaranteed to loop 16 times in all cases.
|
| > An implementation might define the behavior, but it's
| still "undefined behavior" as that term is defined by the
| ISO C standard.
|
| The point where we seem to disagree (and the pedantry
| here is getting tiresome, so I don't know that there's any
| value in continuing to go back and forth on it) is
| that yes, it's undefined behavior by the ISO C standard.
| BUT, the ISO C standard _also_ defines the allowable
| interpretations of and responses _to_ undefined
| behaviour. Those responses don't exist "outside" the
| standard - they flow directly from it.
|
| So it's simultaneously true that the standard does not
| define it and that the standard gives us a framework in
| which to give its undefinedness some treatment and
| response, even if that response is "launch angband" or,
| in this case, "act as if it loops 16 times in all cases".
| _kst_ wrote:
| Of course an implementation can do anything it likes,
| including defining the behavior. That's one of the
| infinitely many ways of handling it -- precisely because
| it's _undefined behavior_.
|
| I'm not using "undefined behavior" as the English two-
| word phrase. I'm using the technical term as it's defined
| by the ISO C standard. "The construct has _undefined
| behavior_ " and "this implementation defines the behavior
| of the construct" are not contradictory statements.
|
| And "ignoring the situation completely" does not imply
| any particular behavior. You seemed to be suggesting that
| "ignoring the situation completely" would result in the
| loop iterating exactly 16 times.
| msbarnett wrote:
| > Of course an implementation can do anything it likes,
| including defining the behavior. That's one of the
| infinitely many ways of handling it -- precisely because
| it's undefined behavior.
|
| An implementation can do whatever it likes _within the
| prescribed bounds the standard provides for reacting to
| "undefined behavior"_, and conversely whatever the
| implementation chooses to do within those bounds is
| consistent with the standard.
|
| Which, again, is the entire point of this: "the loop
| iterates exactly 16 times" is a standards-conforming
| interpretation of the code in question. There's nothing
| outside the standard or non-standard about that. That is,
| in fact, exactly what the standard says that it is
| allowed to mean.
|
| > I'm not using "undefined behavior" as the English two-
| word phrase. I'm using the technical term as it's defined
| by the ISO C standard.
|
| So am I. Unlike you, I'm merely taking into account the
| part of the standard that says "NOTE: Possible undefined
| behavior ranges from ignoring the situation completely
| with unpredictable results" and acknowledging that things
| that _do so_ are standards-conforming.
|
| > You seemed to be suggesting that "ignoring the
| situation completely" would result in the loop iterating
| exactly 16 times.
|
| I'm merely reiterating what the standard says: that the
| case in which the loop guard overflows can be ignored,
| allowing an implementation to conclude that the loop
| iterates exactly sixteen times in all scenarios it is
| required to consider.
|
| All you seem to be doing here is reiterating, over and
| over again, "the standard says the behavior of the loop
| is undefined" to argue that the loop has no meaning,
| while ignoring that a different page of the same standard
| actually gives an allowable range of meanings to what it
| means for "behavior to be undefined", and that therefore
| any one of those meanings is, in fact, precisely within
| the bounds of the standard.
|
| We can validly say that the standard says "for (int
| i=param; i < param + 16; i++)" means "iterate 16 times
| always". We can validly say that the standard says "for
| (int i=param; i < param + 16; i++)" means "launch angband
| when param + 16 exceeds INT_MAX". Both are true
| statements.
| thewakalix wrote:
| Perhaps there's an implicit quantifier here: "for all
| valid implementations of the C standard, the loop count
| is guaranteed to be 16" versus "there exists a valid
| implementation of the C standard in which...".
|
| (This line of thought inspired by RankNTypes, "who
| chooses the type", etc.)
| anarazel wrote:
| That's precisely my point? Because the overflow case is
| undefined, the compiler can assume it doesn't happen and
| optimize based on the fixed loop count.
| rurban wrote:
| The overflow case is not UB. param can be unsigned, or
| fwrapv may be declared. Or the compiler chooses to
| declare fwrapv by default. In no case is the compiler
| allowed to declare the overflow away, unless it knows
| from before that param can not overflow. The optimization
| on loop count 16 can still happen with a runtime guard.
| _kst_ wrote:
| If param is unsigned, then "param + 16" cannot overflow;
| rather, the value wraps around in a language-defined
| manner. I've been assuming that param is of type int (and
| I stated that assumption).
| OskarS wrote:
| The loop counter is signed even if param is not, so i++
| could overflow. fwrapv is a compiler flag, not part of
| the standard: it mandates a certain behaviour in this
| case, but in standard C, the loop variable overflowing
| is definitely UB. No runtime guard is needed; C compilers
| are simply allowed to assume a fixed length. This is the
| whole reason signed overflow is UB in C: exactly cases
| like this.
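|
| A minimal sketch of the difference (assuming param is a
| plain int or unsigned, purely for illustration):
|     /* Unsigned: param + 16u may wrap, so the trip count
|        can legitimately be less than 16. No UB, just
|        wrapping. */
|     for (unsigned i = param; i < param + 16u; i++) { /* ... */ }
|     /* Signed: overflow is UB, so the compiler may assume
|        it never happens and treat the trip count as
|        exactly 16. */
|     for (int i = param; i < param + 16; i++) { /* ... */ }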
| not2b wrote:
| That's the exact reason why this rule was introduced into the
| standard: it was so C compilers could compete with Fortran
| compilers (Fortran has similar rules and at the time they
| were beating C compilers on equivalent scientific codes by
| 2-3x).
|
| Fortran has even more restrictive aliasing rules than C: a
| function is allowed to assume that any two arrays passed as
| arguments do not overlap. If they do, the behavior
| is undefined.
| simias wrote:
| I don't know if there exists a C compiler that leverages this
| feature but there are ISAs (for instance MIPS) that can trap
| on signed overflow.
|
| The fact that it's UB in C means that you can tell the
| compiler to generate these exception-generating instructions,
| which could make some overflow bugs easier to track down
| without any performance implications. And your compiler would
| still be 100% compliant with the standard.
|
| That being said I just tried and at least by default GCC
| emits the non-trapping "ADDU" even for signed adds, so maybe
| nobody actually uses that feature in practice.
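|
| (GCC does have -ftrapv, which is the opt-in version of
| this: it makes signed add/sub/mul trap on overflow, e.g.
|     /* built with gcc -ftrapv, this aborts at runtime on
|        overflow instead of silently misbehaving */
|     int add(int a, int b) { return a + b; }
| Since the behaviour was undefined anyway, trapping is a
| fully conforming choice; it just isn't the default.)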
| anarazel wrote:
| That doesn't really help with the compiler optimization
| aspect: a typical use of the range information would be to
| unroll the loop - in which case there's no addition to trap
| on anymore.
| tsimionescu wrote:
| To be fair, if you want to make sure that loop is
| unrolled even in the presence of -fwrapv, writing it as
| for (int i=0; i < 16; i++) {/* use i+param */} is a very
| simple change for you to make even today. You'll have to
| make much uglier changes to code if you're at the level
| of optimization where loop unrolling really matters for
| your code on a modern processor.
| vyodaiken wrote:
| e.g. removing a check for overflow is definitely NOT
| ignoring the behavior. Deleting a write because it would be
| undefined behavior for a pointer to point at some location is
| also NOT ignoring the behavior. Ignoring the behavior is
| exactly what the rationale is describing when it says UB allows
| compilers to not detect certain kinds of errors.
|
| Returning a pointer is certainly a use. In any event, the
| prevailing interpretation makes it impossible to write a
| defined memory allocator in C.
|
| If a program writes through a dangling pointer and clobbers a
| return address, the programmer made an error and unpredictable
| results follow. C is inherently memory unsafe. No UB-based
| labyrinth of optimizations can change that. It is not designed
| to be memory safe: it has other design goals.
| aw1621107 wrote:
| > e.g. removing a check for overflow is definitely NOT
| ignoring the behavior. Deleting a write because it would be
| undefined behavior for a pointer to point at some location is
| also NOT ignoring the behavior.
|
| Depending on how you look at it, this _is_ ignoring the
| behavior.
|
| For example, say you have this:
|     int f(int a) {
|         if (a + 1 < a) {
|             // Handle error
|         }
|         // Do work
|     }
|
| You have 2 situations:
|     1. a + 1 overflows
|     2. a + 1 does not overflow
|
| Situation 1 contains undefined behavior. If the compiler
| decides to "ignor[e] the situation completely", then
| Situation 1 can be dropped from consideration, leaving
| Situation 2. Since this is the only situation left, the
| compiler can then deduce that the condition is always false,
| and a later dead code elimination pass would result in the
| removal of the error handling code.
|
| So the compiler is ignoring the behavior, but makes the
| decision to do so by not ignoring the behavior. It's slightly
| convoluted, but not unreasonable.
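|
| (For comparison, a check that doesn't rely on UB has to
| test before the overflow can occur; a minimal sketch using
| INT_MAX from <limits.h>:
|     #include <limits.h>
|     int f(int a) {
|         if (a == INT_MAX) {
|             return 0;  // handle error: a + 1 would overflow
|         }
|         return a + 1;  // the add can no longer overflow
|     }
| GCC and Clang also provide __builtin_add_overflow for the
| general case, though that is an extension, not standard C.)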
| vyodaiken wrote:
| More than slightly convoluted. The obvious intention is
| that the compiler ignores overflow and lets the processor
| architecture make the decision. Assuming that overflow
| doesn't happen is assuming something false. There's no
| excuse for that and it doesn't "optimize" anything.
| UncleMeat wrote:
| IMO, implementation-defined is worse. It is still a time bomb
| but now it is a time bomb that you cannot use compiler errors
| to prevent automatically.
| pif wrote:
| Unspecified result means the compiler must think about what
| could happen in case I made an error.
|
| UB means the compiler will trust me and concentrate on
| generating the fastest code ever.
|
| C is for clever programmers; if you don't want to be clever,
| you are free to use Go or something like that.
| masklinn wrote:
| > UB means the compiler will trust me and concentrate on
| generating the fastest code ever.
|
| In reality, UB means the compiler will assume it doesn't
| happen and work from there.
|
| Of course a more expressive language could just make it so
| the compiler doesn't have to assume this, e.g. a C compiler
| will consider a dereference as meaning the pointer is non-
| null, both backwards and forwards.
|
| But if the language had non-null pointers, it would not need
| to bother with that, it would have a non-null pointer in the
| first place. It could still optimise nullable pointers (aka
| lower nullable pointers to non-nullable if they're provably
| non-nullable, usually after a few rounds of inlining), but
| that would be a much lower priority.
| cygx wrote:
| It's not so much about cleverness, but knowledge and
| vigilance. You first have to be aware of all the footguns,
| and then be careful not to let any of them slip through...
| pif wrote:
| > You first have to be aware of all the footguns,
|
| Knowing your tools is part of being a professional. C is
| not for amateurs.
| rurban wrote:
| Then use such a tool, but don't call it C, rather
| -std=gnuc-opt11, which always knows better than the
| author, without any warning.
|
| Call it randomC, unsuitable for professional programmers,
| but extremely suitable for benchmark games and managers.
| Who prefer to ignore pesky overflows, underflows, memset,
| memcpy, dereferencing NULL pointers and other rare cases.
| nwallin wrote:
| > For example, if signed integer overflow yielded an
| unspecified result rather than causing undefined behavior, I
| wonder if any implementations would be adversely affected.
|
| Yes.
|
| There are several architectures where signed integer overflow
| traps, just like division by 0 on x86 (which is why division
| by 0 is UB). If a C compiler for those architectures were
| required to yield an unspecified result instead of trapping,
| then every time the code performed a signed integer
| addition/subtraction, it would need to install a trap handler
| beforehand and restore it afterward to return an unspecified
| value instead of invoking the normal trap handler.
| jacinabox wrote:
| Sort of a second amendment for footguns.
| layer8 wrote:
| Because there seems to be some confusion in this thread:
|
| - "Implementation-defined behavior" means that the C standard
| specifies the allowable behaviors that a C implementation must
| choose from, and the implementation must document its particular
| choice.
|
| - "Unspecified behavior" means that the C standard places no
| particular restrictions on the behavior, but a C implementation
| must pick a behavior and document its choice.
|
| - "Undefined behavior" means that C implementations are allowed
| to assume that the respective runtime condition does not ever
| occur, and for example can generate optimized code based on that
| assumption. In particular, it is free to _not_ decide any
| behavior for the condition (let alone document it). As a
| consequence, if the runtime condition actually _does_ occur, this
| can affect the behavior of _any_ part of the program, even the
| behavior of code executed before the condition would occur. This
| is because from a false assumption the truth of _any_ statement
| can be logically derived (principle of explosion [0]). And that
| is why the C standard does not restrict the behavior of the whole
| program if it contains undefined behavior.
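|
| A classic illustration of how far this reaches (a sketch
| added for illustration, not taken from any particular
| compiler's documentation):
|     int table[4];
|     int contains(int v) {
|         for (int i = 0; i <= 4; i++)  /* off-by-one: i == 4
|                                          reads past the end */
|             if (table[i] == v)
|                 return 1;
|         return 0;
|     }
| Because the out-of-bounds read is UB, a compiler may reason
| that the loop must return before i reaches 4, and compile
| the whole function down to "return 1;".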
|
| [0] https://en.m.wikipedia.org/wiki/Principle_of_explosion
| tsimionescu wrote:
| I would note that the article is explicitly contesting the
| definition of UB that you are giving here (though you are
| absolutely right that this is the de facto definition used by
| all major compilers, and the committee).
|
| Basically the article is arguing that UB should be similar to
| Unspecified behavior - behavior that the implementation leaves
| up to the hardware and/or OS.
|
| I'm not sure where I fall on this issue, though I would note
| that the definition of UB in the standard needs quite a bit of
| interpretation to arrive at the commonly used definition you
| are quoting. That is, while I think the definition you give is
| _compatible_ with the one in the standard, I don't think it
| _is_ the definition of the standard, which is much softer about
| how UB should impact the semantics of a C program. In
| particular, nothing in the wording of the standard explicitly
| says that an implementation is expected to assume UB doesn't
| happen, or that a standard-conforming program can't have UB.
| Gaelan wrote:
| > and the commtitee
|
| I'd argue that for a document like the C standard, if there's
| a well-known intended meaning, that _is_ the meaning of the
| document - any other interpretation is purely academic.
| simonh wrote:
| These are all interpretations. You're trying to give special
| privilege to one interpretation by calling it the intended
| meaning when really it's what I'd call an authoritative
| interpretation: the accepted interpretation by an
| authority. That doesn't necessarily make it the actual
| intended meaning though.
| Gaelan wrote:
| I was under the impression the "committee" here was the
| people who wrote the C standard? So the accepted
| interpretation by the people who wrote the document must
| be the intended meaning, no? If it wasn't, they'd have
| written something else.
|
| (Sure, it's possible they wrote something, then only
| later came to understand the implications of this. But
| the standard has been understood as allowing nasal demons
| since long before the latest revision of the C standard,
| so I'd argue that by publishing a new version without
| rewording that section, they're making clear that they
| intend its current interpretation to be part of the
| standard.)
| bregma wrote:
| from ISO/IEC 9899:2011 "Programming Languages -- C"
|     3.4.3
|     1 undefined behavior
|       behavior, upon use of a nonportable or erroneous program
|       construct or of erroneous data, for which this
|       International Standard imposes no requirements
|     2 NOTE Possible undefined behavior ranges from ignoring
|       the situation completely with unpredictable results, to
|       behaving during translation or program execution in a
|       documented manner characteristic of the environment
|       (with or without the issuance of a diagnostic message),
|       to terminating a translation or execution (with the
|       issuance of a diagnostic message).
|
| It doesn't look like a leap to go from this definition to
| that of ignoring the situation completely with unpredictable
| results.
| alerighi wrote:
| To me, ignoring the situation completely with unpredictable
| results would mean: the compiler generates assembly that may
| not be correct, and then the behaviour of the program is
| determined by what the processor does.
|
| Doing something like removing checks is not ignoring the
| situation: it is acting in some particular way when undefined
| behaviour is detected.
|
| Nor does it have unpredictable results: it specifies what
| happens, since these checks are removed systematically.
|
| I don't see anywhere in the standard that the compilers are
| free to change the semantics of the program at their choice
| if undefined behaviour is detected. Rather, undefined means
| to me that the compiler generates code whose result cannot
| be known because it will depend on external factors
| (basically the hardware implementation).
| fiter wrote:
| This quote is the topic of the original article and the
| article goes into detail about how it believes the quote
| should be interpreted.
| MauranKilom wrote:
| ...and yet the article completely ignores the "with
| _unpredictable_ results" part and instead spends a lot
| of time discussing all the other valid consequences
| (which are also only mentioned as examples, at least in
| the common understanding of "from ... to ...").
|
| Downthread commenters go into more detail regarding the
| "ignoring e.g. the possibility of signed overflow may
| mean to assume that it never happens" reading, so I won't
| elaborate on it here.
| _Nat_ wrote:
| The original article's interpretation seemed untenable.
|
| While the difference between "_Permissible_" and
| "_Possible_" could be quite significant, in this case, it
| was qualifying:
|
| > [_Permissible_/_Possible_] undefined behavior
| ranges from ignoring the situation completely with
| unpredictable results, to behaving during translation or
| program execution in a documented manner characteristic
| of the environment (with or without the issuance of a
| diagnostic message), to terminating a translation or
| execution (with the issuance of a diagnostic message).
|
| The previously-"_Permissible_" behaviors were so broad
| that they basically allowed anything, including
| translating the source-code in any documented manner...
| which means that, as long as a compiler says how it'll
| treat undefined-behavior, it can do it that way, because
| it's free to completely reinterpret the source-code in
| any (documented) manner.
| cygx wrote:
| To my knowledge, in case of unspecified behaviour, it's not
| required to pick and document a particular choice. Behaviour
| may even vary on a case-by-case basis where it makes sense
| (e.g. order of evaluation of function arguments, whether an
| inline or external definition of an inline function gets
| used, etc.).
|
| The need to properly document the choice is the defining
| characteristic of implementation-defined behaviour.
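|
| E.g. argument or operand evaluation order is unspecified and
| need not be documented, or even consistent between call
| sites (f and g here are placeholders):
|     int f(void), g(void);
|     void h(void) {
|         int x = f() - g();  /* f may run before or after g;
|                                either order is conforming */
|     }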
| nwallin wrote:
| > - "Undefined behavior" means that C implementations are
| allowed to assume that the respective runtime condition does
| not ever occur, and for example can generate optimized code
| based on that assumption.
|
| Please note that the article is making the specific argument
| that this interpretation of UB is an _incorrect_
| interpretation. The author is arguing that you, me, the llvm
| and gcc teams are wrong to interpret UB that way.
|
| Linux had a bug in it a few years ago; the code would
| dereference a pointer, then check if it was null, then return
| an error state if it was null, or continue performing the
| important part of the function. The compiler deduced that if
| the pointer had been null when it was dereferenced, that's UB,
| so the null check was unnecessary, and optimized the null check
| out. The trouble was that in that context, a null pointer
| dereference didn't trap, (because it was kernel code? not
| sure.) so the bug was not detected. It ended up being an
| exploitable security vulnerability in the kernel, I think a
| local privilege escalation.
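|
| The shape of that bug, roughly (a sketch of the pattern, not
| the actual kernel source):
|     struct sock *sk = tun->sk;  /* dereference first */
|     if (!tun)
|         return POLLERR;  /* the compiler infers tun != NULL
|                             from the dereference above, so
|                             it may delete this check */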
|
| The article is making the argument that the compiler should
| _not_ be free to optimize out the null check before subsequent
| dereferences. The compiler is permitted to summon nasal demons
| where the pointer is dereferenced the first time, but should
| not be free to summon nasal demons at later lines of code,
| after the no-nasal-demons-please check.
|
| (The linux kernel now uses -fno-delete-null-pointer-checks to
| ensure that doesn't happen again. The idea is that even though
| it was a bug that UB was invoked, the failure behavior should
| be safe instead of granting root privileges to an unprivileged
| user.)
|
| Fun with NULL pointers part 1 https://lwn.net/Articles/342330/
|
| Fun with NULL pointers part 2 https://lwn.net/Articles/342420/
| tom_mellior wrote:
| > The trouble was that in that context, a null pointer
| dereference didn't trap, (because it was kernel code? not
| sure.) so the bug was not detected.
|
| Yes, because it was kernel code. Because that dereference is
| completely legal in kernel code. The C code was fine,
| assuming that it was compiled with appropriate kernel flags.
| This was not a bug in Linux, at least not on the level of the
| C code itself.
|
| > The linux kernel now uses -fno-delete-null-pointer-checks
| to ensure that doesn't happen again.
|
| I also seem to remember that it was already using other
| "please compile this as kernel code" flags that _should have_
| implied "no-delete-null-pointer-checks" behavior, and that
| the lack of this implication was considered a bug in GCC and
| fixed.
| alerighi wrote:
| By the way, dereferencing NULL is a well-defined behaviour on
| every computer architecture: you are basically reading at
| address 0 of memory. It just causes a crash if you have an
| operating system, since it will cause a page fault, but in
| kernel mode or on devices without an OS it is a legit thing
| to do (and even useful in some cases).
|
| Why should C compilers make it undefined? The standard
| doesn't mandate that undefined behaviour should change the
| semantics of the program. Just define all the undefined
| behaviour that you can; to me, keeping it undefined makes
| no sense (even from the standard's point of view; everyone
| knows that if you overflow an int it wraps around, so why
| should it be undefined??)
| gameswithgo wrote:
| Lots of us use languages that have no (or almost no) undefined
| behavior; it is really weird to see people trying to come up
| with reasons why this is hard. Perhaps there are still computing
| environments so slow that undefined behavior still makes sense in
| C, and C should remain that way. If so, use something with less
| of this for anything else.
| ho_schi wrote:
| This is interesting. Humans interpret standards and compilers
| interpret code, too :)
|
| We have the benefits of a standard, portability and multiple
| compilers. But it also comes with duties.
|
| BTW.
|
| The C++ people seem to tackle some of the issues around
| widespread 'undefined behavior' and defined a lot of behavior:
| 1) If a side effect on a scalar object is unsequenced relative
| to another side effect on the same scalar object, the behavior
| is undefined.
|     i = ++i + 2;       // undefined behavior until C++11
|     i = i++ + 2;       // undefined behavior until C++17
|     f(i = -2, i = -2); // undefined behavior until C++17
|     f(++i, ++i);       // undefined behavior until C++17,
|                        // unspecified after C++17
|     i = ++i + i++;     // undefined behavior
| 2) If a side effect on a scalar object is unsequenced relative
| to a value computation using the value of the same scalar
| object, the behavior is undefined.
|     cout << i << i++;  // undefined behavior until C++17
|     a[i] = i++;        // undefined behavior until C++17
|     n = ++i + i;       // undefined behavior
|
| Source: https://en.cppreference.com/w/cpp/language/eval_order
|
| They replaced the entire "Sequence Point Rules" with
| "Sequenced-before rules". Compiler implementers cannot choose
| what to do (implementation defined) or neglect the issue
| (undefined behavior) - they must act in a well-defined way
| now in many situations.
| mpweiher wrote:
| https://news.ycombinator.com/item?id=17454467 made very much the
| same argument.
| dfabulich wrote:
| I think the problem is cultural in the C community. C programmers
| have Stockholm Syndrome around UB optimizations.
|
| As TFA notes, "There is No Reliable Way to Determine if a Large
| Codebase Contains Undefined Behavior"
| https://blog.llvm.org/2011/05/what-every-c-programmer-should...
|
| That's because UB is a bug that occurs as your program runs (e.g.
| dereferencing a null pointer). You'd have to prove that your
| program is free of bugs to prove that you won't encounter UB at
| runtime, which is not possible in general.
|
| But C programmers believe they deserve to be abused this way.
|
| "Sure, the C compiler is allowed to nuke your entire program
| whenever there's a bug at runtime, but all you have to do is
| prove that your program is bug free! Why, any _real_ C programmer
| can write bug-free code, so if you have any UB bugs, you're
| probably not good enough to write C. Sure, I've been bruised by
| UB bugs in the past, we all have, but when I made those mistakes,
| I deserved to be bruised, and those bruises made me a better
| programmer."
| pif wrote:
| You don't get it: if you are ready to give up top performance,
| you can choose among a multitude of more forgiving languages.
|
| There is simply no point in having a C language with less than
| top level optimizations.
| kstrauser wrote:
| Name a feature of C that makes it inherently "faster" than
| another compiles-to-machine-code language. I can name plenty
| that make it slower.
| vyodaiken wrote:
| Good point. C programmers have come to assume that C is a
| shitty language with terrible semantics and it's their fault
| for not using Rust or something. They blame the language for
| the mess made by standards/compilers.
| jerf wrote:
| I expect there's already been a lot of evaporative cooling in
| the C community and there will only be more over time.
| Increasingly the people who are going to be left in the C
| community are precisely those people who think C is generally
| OK. Anyone who has developed a high degree of fear of C is
| ever-increasingly doing what it takes to move away.
|
| I have tried to phrase this neutrally, as to not be a comment
| about whether or not C "really is" generally OK. I am not sure
| I have 100% succeeded, as I have my opinions. But the point
| stands regardless; the people in the C community are generally
| going to be the ones who are OK with C.
| anarazel wrote:
| The situation has gotten at least somewhat better since then.
| Ubsan and friends are not a guarantee, but make it much more
| realistic to find UB issues.
| optymizer wrote:
| Anecdotally, I find about as many bugs in my Java code at work
| as I do in my hobby C code, including "dereferencing null
| pointers" aka NullPointerExceptions.
| anarazel wrote:
| NPE isn't really a relevant comparison point - it raises an
| exception of sorts in both languages. Accessing an array past
| its end, accessing freed memory, reading uninitialized memory
| seem more apt.
| optymizer wrote:
| These are all issues a compiler could insert checks for,
| saving you the time to do it manually, while trading
| performance for security or correctness.
|
| C doesn't trade performance away, so if you'd like to pay
| the price of these checks, it's on you, the programmer to
| add them in. C programmers have to put in extra effort to
| make the executables safer and output correct results.
|
| Other languages trade off performance for safety and/or
| correctness. The programmers using them have to put in
| extra effort to make the executables run as fast as they do
| in C.
|
| Ultimately, programmers tend to make rational choices, and
| the stockholm syndrome mentioned above is really just C
| programmers dealing with the consequences of the trade-off
| they made by using C.
| vyodaiken wrote:
| C is designed to allow programmers to insert or remove
| checks as performance tuning. Compiler UB "optimizations"
| that remove that ability from the programmer make the
| language unusable.
| jerf wrote:
| "Undefined behavior" is a term of art relevant to C, meaning
| that the standard no longer has _any_ comment about what
| happens. Thus the comments about launching nukes, or
| destroying your machine, etc., being standards compliant,
| even though obviously real compilers won't actually emit code
| that does that.
|
| Dereferencing a nil pointer in Java is not undefined
| behavior. It is defined; it throws a NullPointerException,
| and "it throws a NullPointerException" is a very specific
| term of art in the Java community that is very precisely
| defined.
|
| Many languages don't have an explicit concept of "undefined
| behavior" in their standard (though, of course, due to
| deficiencies in the standard they may have implicitly
| undefined behavior), and of those that do I doubt anyone
| matches C or C++ in the commonly-used category of languages.
| optymizer wrote:
| Point taken about NPEs being defined behavior, but from a
| practical point of view, a bug is a bug (and bugs is what
| the parent comment was referring to).
|
| Whether the bug was due to a well-defined or undefined
| behavior seems like an exercise in assigning blame to
| another entity (the language, the committee, the compiler,
| etc).
| jerf wrote:
| Correct assignment of (technical) blame is an important
| engineering task, though. You sound to me like you are
| thinking that's some sort of wrong thing to do, but I
| would disagree. Identifying whether the bug is a defined
| (in C) behavior or the result of the compiler making a
| certain choice as the result of your code invoking
| undefined behavior is an important element of fixing the
| bug; if you don't know which is which you're operating
| under a critical handicap.
| Asooka wrote:
| > But C programmers believe they deserve to be abused this way.
|
| I don't know of any C programmers who think that way. We accept
| the state of affairs because there is little alternative. We're
| not going to rewrite our code in another language, because we
| mostly need to write in C for various reasons. Also Rust
| inherits LLVM's undefined behaviour, so it's no panacea. We
| mostly keep adding flags to turn off the most excessive
| optimisations and pray.
| steveklabnik wrote:
| Rust inheriting it is the equivalent of a codegen bug, and
| they get patched. The cause here isn't LLVM, it's language
| semantics.
| pjmlp wrote:
| Duplicate, https://news.ycombinator.com/item?id=27217981
| yesenadam wrote:
| That only got 1 comment.
| pjmlp wrote:
| Just like this one, when I posted mine here
| setr wrote:
| Linking a failed discussion from another (possibly) failed
| discussion seems kind of pointless.
| pjmlp wrote:
| Just like this thread.
| whatshisface wrote:
| I don't see any utility in inventing a new reading of the
| standard. Getting everyone to agree on a new interpretation of a
| sentence can't possibly be easier than getting everyone to agree
| on a more clearly worded sentence. The actual thing you'd have to
| convince everyone of (the utility of the new consensus) and the
| people you'd have to convince (compiler writers, documentation
| authors) are the same in both cases.
| setr wrote:
| The difference is that, once decided and written, no party
| (both old and new) can't chime in with a new interpretation.
| JadeNB wrote:
| > The difference is that, once decided and written, no party
| (both old and new) can't chime in with a new interpretation.
|
| That double negative ("no party ... can't") was accidental,
| right?
| klodolph wrote:
| I think one of the best attempts to solve this is the attempt to
| classify undefined behavior into "bounded" and "critical" UB,
| which is a distinction that hasn't gained as much traction as I'd
| like.
|
| Something like 1<<32 is "bounded" UB, and can't result in
| just anything happening to your program... it is permitted
| to yield indeterminate values, it is permitted to trap, but
| that's basically it.
|
| Critical UB is stuff like writing to a const object, calling a
| function pointer after casting it to an incompatible type,
| dereferencing an invalid pointer, etc. Basically, anything goes.
|
| This is part of the "analyzability" optional extension which I
| wish would gain more traction.
| MauranKilom wrote:
| > Something like 1<<32 is "bounded" UB, and can't result in
| just anything happening to your program... it is permitted
| to yield indeterminate values, it is permitted to trap, but
| that's basically it.
|
| That is what implementation-defined behavior is for. Feel free
| to advocate for changing the standard accordingly.
| klodolph wrote:
| > Feel free to advocate for changing the standard
| accordingly.
|
| I am, in fact, describing the existing C standard, not
| advocating for changes. Please refer to Annex L
| "Analyzability", which defines the terms I used: "bounded
| undefined behavior" and "critical undefined behavior". Note
| that this section is conditional... it is not widely adopted.
| I am advocating for increased adoption of this part of the
| standard, or barring that, revisions that would make adoption
| more palatable.
|
| And in general, "If you don't like it, advocate changing the
| standard" is not a useful or insightful response. It is
| completely reasonable and normal to complain about something
| without trying to fix it.
| MauranKilom wrote:
| You're right, apologies.
| jcranmer wrote:
| It's easy to pick on undefined behavior in C when you focus on
| the more gratuitous undefined behaviors such as signed overflow
| or oversized shifts. I'm not certain why these are undefined
| behavior instead of implementation-defined, but my suspicion is
| that these caused traps on some processors, and traps are
| inherently undefined behavior.
|
| Instead, if you dislike undefined behavior, I challenge you to
| come up with workable semantics for non-gratuitous scenarios, of
| which I'll lay out three. Remember, if it's not undefined, then
| there are some defined semantics that need to be preserved, even
| if implementation-defined, so you should be able to explain those
| semantics. If you can't find such semantics, then maybe undefined
| behavior _isn't_ such a bad thing after all.
|
| The first case is the compiler hint. A function marked _Noreturn
| returns. Or you alias two pointers marked restrict. Remember that
| the entire point of these hints is to permit the compiler to not
| generate code to check for these scenarios.
|
| The second case is uninitialized memory. You've probably already
| come up with such semantics, so I'll point out an optimization
| scenario that you probably don't object to that your semantics
| didn't cover:
|     static int y;          /* uninitialized */
|     _Bool cond = f();
|     int x = cond ? 2 : y;  /* Is it legal to fold this to
|                               int x = 2; ? */
|
| Hopefully, you'll agree that that is a reasonable optimization.
| Now consider this code:
|     static int y;  /* uninitialized */
|     _Bool cond1 = f(), cond2 = g();
|     int x1 = cond1 ? 2 : y;  /* So int x1 = 2; */
|     int x2 = cond2 ? 3 : y;  /* So int x2 = 3; */
|     if (!cond1 && !cond2) {
|         assert(x1 == x2);    /* Uh... */
|     }
|
| This is just scraping the tip of the surface of uninitialized
| values. Developing sane semantics around these sorts of values
| that also allow reasonable optimizations is challenging. You can
| look at the very lengthy saga that is undef, poison, and freeze
| in LLVM (still ongoing!) to see what it looks like in practice.
|
| The third category is traps. Let's pick an easy example: what
| happens if you dereference a null pointer? Now let's consider the
| consequences of those semantics on some more code examples:
|     int *x = NULL;
|     int *y = &*x;  /* Can this be lowered to int *y = x; ? */
|     size_t offset = (size_t)&((struct foo *)NULL)->field;
|                    /* Remember, x->a is actually (*x).a */
|     int *z = get_pointer();
|     *z;  /* No one uses the result of the load, can I
|             delete it? */
|     for (int i = 0; i < N; i++) {
|         foo(*z);  /* Can I hoist the load out of the loop? */
|     }
|
| Note that all of the optimizations I'm alluding to here are ones
| that would have existed all the way back in the 1980s when C was
| being standardized, and these are pretty basic, pedestrian
| optimizations that you will cover in Compilers 101.
| vyodaiken wrote:
| "static int y; /* uninitialized _/ _Bool cond = f(); int x =
| cond ? 2 : y; /_ Is it legal to fold this to int x = 2; ? _/
| Hopefully, you 'll agree that that is a reasonable
| optimization. Now consider this code:"
|
| Do not agree. That's not an optimization, it's just false
| reasoning. Reasonable would be to either ignore it, so the
| code depends on the uninitialized data, whatever it is, or to
| flag an error. Also, according to the standard, static
| variables of arithmetic type are initialized to zero by
| default, so there is no UB at all.
|
| Second example has the same problem.
|
| The third example has, for example, assumptions that C does
| not let the compiler make, e.g. that z is a restricted
| pointer. Imagine that:
|     f(int c) { z = getsamepointer(); *z += 1; return *z; }
|
| And none of those optimizations existed in the 1980s.
| anarazel wrote:
| > It's easy to pick on undefined behavior in C when you focus
| on the more gratuitous undefined behaviors such as signed
| overflow or oversized shifts.
|
| I wouldn't even mind the signed integer overflow thing that
| much if there were a reasonable way in standard C to check
| whether a signed operation would overflow.
|
| It's not impossible to do correctly in a compiler independent
| way, but ridiculously hard. And slow.
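|
| The portable pre-check looks roughly like this (a sketch;
| checked_add is my name, not a standard function):
|     #include <limits.h>
|     /* Returns 0 if a + b would overflow; the overflowing
|        addition itself never executes. */
|     int checked_add(int a, int b, int *out) {
|         if ((b > 0 && a > INT_MAX - b) ||
|             (b < 0 && a < INT_MIN - b))
|             return 0;
|         *out = a + b;
|         return 1;
|     }
| Multiplication is considerably worse, which is where the
| "ridiculously hard, and slow" part really bites.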
| anarazel wrote:
| Another thing is that the wording around some of the UB
| issues is just plain bad. The most extreme probably is the
| rules around strict aliasing. That there was, for quite a
| while, uncertainty whether the rules allow type punning by
| reading a union member when the last write was to another
| member is a good example of not taking reality into account.
| Yes, memcpy exists - but it is even less type safe!
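|
| The two punning idioms side by side (a sketch):
|     #include <stdint.h>
|     #include <string.h>
|     uint32_t bits_union(float f) {
|         union { float f; uint32_t u; } pun;
|         pun.f = f;
|         return pun.u;  /* reads a member other than the one
|                           last written */
|     }
|     uint32_t bits_memcpy(float f) {
|         uint32_t u;
|         memcpy(&u, &f, sizeof u);  /* always allowed, but the
|                                       compiler sees raw
|                                       bytes, not types */
|         return u;
|     }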
| jcranmer wrote:
| The union punning trick is UB in C89 and well-defined in
| C99 and later, although it was erroneously listed in the
| (non-normative) Annex listing UBs in C99 (removed by C11).
|
| Strict aliasing is another category of UB that I'd consider
| gratuitous.
| anarazel wrote:
| > The union punning trick is UB in C89 and well-defined
| in C99 and later, although it was erroneously listed in
| the (non-normative) Annex listing UBs in C99 (removed by
| C11).
|
| Right, that's my point. If the standard folks can't
| understand their standard, how are mere mortals supposed
| to?
|
| > Strict aliasing is another category of UB that I'd
| consider gratuitous.
|
| I'm of a bit of a split mind on it. It can yield substantial
| speedups. But it is also impractical in a lot of cases. And
| it's often not strong enough anyway, requiring explicit
| restrict annotations for the compiler to understand two
| pointers don't alias. Turns out two pointers of the same
| (or compatible) type aren't rare in performance critical
| sections...
|
| Realistically it should have been opt-in.
| nullc wrote:
| > Strict aliasing is another category of UB that I'd
| consider gratuitous.
|
| Without it you cannot vectorize (or even internally re-
| order) many loops which are currently vectorizable
| because the compiler can't statically prove arguments
| won't alias otherwise.
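|
| A sketch of the kind of loop at stake (scale is a made-up
| example, not from any particular codebase):
|     /* The compiler must assume dst and src may overlap,
|        which constrains reordering and vectorization: */
|     void scale(float *dst, const float *src, int n) {
|         for (int i = 0; i < n; i++)
|             dst[i] = 2.0f * src[i];
|     }
|     /* restrict promises no overlap, so the loop can be
|        vectorized freely: */
|     void scale_r(float *restrict dst,
|                  const float *restrict src, int n) {
|         for (int i = 0; i < n; i++)
|             dst[i] = 2.0f * src[i];
|     }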
| anarazel wrote:
| See e.g. https://github.com/postgres/postgres/blob/8bdd6f563a
| a2456de6...
| tlb wrote:
| What version of C doesn't require `static int y;` to be
| initialized to 0?
| jcranmer wrote:
| You're right, it should be automatic storage duration instead
| of static storage duration.
| Lvl999Noob wrote:
| I am not a compiler dev or a high-skill coder, so my opinions
| might not matter much, but I'll still lay them out here.
|
| > The first case is the compiler hint...
|
| The compiler should refuse to compile if it comes upon such a
| case. Those hints are as much used by programmers as they are
| by the compiler. It should emit a warning and necessary checks
| + abort code if those hints can neither be proven nor disproven
| statically.
|
| > The second case is uninitialized memory...
|
| For the first example, unless the compiler knows that f() must
| always be true, it should give a compile-time error. In the
| case that it does know that f() is always true, it should
| still emit a warning (or give an error anyway).
|
| > The third category is traps...
|
| I am honestly not sure about this one. It would depend on the
| behaviour defined for multiple references / dereferences done
| simultaneously and type casting. I would still probably expect
| the compiler to give out warnings at least.
|
| Edit: language / grammar / typos
| jcranmer wrote:
| As a rule of thumb, yes, compilers should issue warnings
| where undefined behavior is obviously occurring. (And there
| is something to be said for compiling with -Werror). However,
| that's not going to always work, and there are two reasons
| for this.
|
| The first reason is that undefined behavior is ultimately a
| statement about dynamic execution of the program. The set of
| expressions that could potentially cause undefined behavior
| is essentially all of them--every signed arithmetic
| expression, every pointer dereference, hell, almost all
| function calls in C++--and figuring out whether or not they
| actually do cause undefined behavior is effectively
| impossible at the compiler level. This is why sanitizers were
| developed, and also why sanitizers only work as a dynamic
| property.
|
| For a concrete example, consider the following code:
|     extern void do_something(int * restrict a,
|                              int * restrict b, size_t n);
|     void my_function(int *y, int *z, size_t x, size_t n) {
|         if (x > n)
|             do_something(y, z, n);
|     }
|
| This code could produce undefined behavior. Or it could not.
| It depends on whether or not y and z overlaps. Maybe the
| check of the if statement is sufficient to guarantee it.
| Maybe it's not. It's hard to advocate that the compiler
| should warn, let alone error, about this kind of code.
|
| The second issue to be aware of is that there is a general
| separation of concerns between the part of the compiler that
| gives warnings and errors (the frontend), and the part that
| is actually optimizing the code. It is difficult, if not
| impossible, to give any kind of useful warning or error
| message in the guts of the optimizer; by the time code
| reaches that stage, it is often incredibly transformed from
| the original source code, to the point that its correlation
| with the original can be difficult to divine.
|
| So I once came across some really weird code that broke an
| optimization pass I was working on. It looked roughly like
| this (approximate C translation of the actual IR):
|     if (nullptr != nullptr) {
|         int *x = nullptr;
|         do {
|             /* do some stuff with x */
|             x++;
|         } while (x != nullptr);
|     }
|
| What hideous code creates a loop that iterates a pointer
| through all of memory? Why, this (after reducing the test
| case):
|     void foo() {
|         std::vector<std::set<int>> x;
|         x.emplace_back();
|     }
|
| So the crazy code was generated by the compiler very
| heavily inlining the entire details of the STL, and the
| original code was a more natural iteration from a start to an
| end value. The compiler figured out enough to realize that
| the start and end values were both null pointers, but didn't
| quite manage to actually fully elide the original loop in
| that case. Warning the user about the resulting undefined
| behavior in this case is completely counterproductive; it's
| not arising from anything they did, and there isn't much they
| can do to silence that warning.
| bhk wrote:
| > It's easy to pick on undefined behavior in C when you focus
| on the more gratuitous undefined behaviors ...
|
| That is the _whole point_. There are scores of instances of
| gratuitous UB.
|
| > I'm not certain why these are undefined behavior instead of
| implementation-defined, but my suspicion is that these caused
| traps on some processors, and traps are inherently undefined
| behavior.
|
| Traps are not inherently undefined. The C standard discusses
| floating point traps in detail. Many details may of course be
| left to the implementation or platform to describe, but that's
| very different from saying "all bets are off".
|
| The real reason for the gratuitous UB was misunderstanding and
| ignorance.
|
| > Hopefully, you'll agree that that is a reasonable
| optimization.
|
| Hopefully, you'll agree that the code is buggy, and should be
| fixed. We end up running expensive and cumbersome static
| analysis tools that try to detect situations just like this ...
| which the compiler itself has detected but has chosen not to
| warn us about.
| richardwhiuk wrote:
| The first code isn't necessarily buggy.
___________________________________________________________________
(page generated 2021-05-20 23:01 UTC)