[HN Gopher] Summary of C/C++ integer rules
       ___________________________________________________________________
        
       Summary of C/C++ integer rules
        
       Author : dmulholl
       Score  : 101 points
       Date   : 2021-04-02 07:50 UTC (15 hours ago)
        
 (HTM) web link (www.nayuki.io)
 (TXT) w3m dump (www.nayuki.io)
        
       | lifthrasiir wrote:
       | > Signed numbers may be encoded in binary as two's complement,
       | ones' complement, or sign-magnitude; this is implementation-
       | defined.
       | 
       | Thankfully, in addition to what MaxBarraclough helpfully pointed
       | out, every (u)intN_t type provided by <stdint.h> is guaranteed to
       | use two's complement even in C99.
        
       | saagarjha wrote:
       | > Having signed and unsigned variants of every integer type
       | essentially doubles the number of options to choose from. This
       | adds to the mental burden, yet has little payoff because signed
       | types can do almost everything that unsigned ones can.
       | 
       | Unsigned types are quite useful when doing bit twiddling because
       | they don't overflow or have a bit taken up by the sign.
        
         | dreinhardt wrote:
         | Which is why sane languages of the time had a bitfield type.
        
         | enriquto wrote:
         | > Unsigned types are quite useful when doing bit twiddling
         | because they don't overflow or have a bit taken up by the sign.
         | 
         | That's essentially their only application. The rest are stupid
         | single-bit memory-size optimizations. As Jens Gustedt noted,
         | it's one of the (many) misnomers in the C language. It should
         | be better called "modulo" instead of "unsigned". Other such
          | misnomers that I recall:
          | 
          |     unsigned -> modulo
          |     char     -> byte
          |     union    -> overlay
          |     typedef  -> typealias
          |     const    -> immutable
          |     inline   -> negligible
          |     static   -> intern
          |     register -> addressless
         | 
         | EDIT: found the reference
         | https://gustedt.wordpress.com/2010/08/18/misnomers-in-c/
        
           | renox wrote:
           | > const -> immutable
           | 
           | const -> read_only_view is better
        
           | ActorNightly wrote:
           | Thanks for this, gonna add some #defines to my headers :)
        
           | dkersten wrote:
           | > That's essentially their only application.
           | 
           | What about when it doesn't make semantic sense to have
           | negative values? Eg for counting things, indexing into a
           | vector, size of things. If negative doesn't make sense, I use
            | unsigned types. It's not about the memory size in that case.
        
             | enriquto wrote:
             | Positive values are a particular case of signed values, you
             | can still use signed ints to store positive values. No need
             | to enforce your semantics through type, and especially not
             | when the values of the type are trivially particular cases
             | of the values of another type. For example, when you write
             | a function in C that computes prime factors of an int, do
             | you need a type for prime numbers? No, you just use int.
             | The same thing for positive numbers, and for even numbers,
             | and for odd numbers. You can and should do everything with
             | signed integers, except bitfields, of course.
        
               | masklinn wrote:
               | > Positive values are a particular case of signed values,
               | you can still use signed ints to store positive values.
               | 
               | And yet Java's lack of unsigned integers is considered a
               | major example of its (numerous) design errors.
               | 
               | > No need to enforce your semantics through type, and
               | especially not when the values of the type are trivially
               | particular cases of the values of another type.
               | 
               | Of course not, there's no _need_ for any type at all, you
               | can do everything with just the humble byte.
               | 
               | > The same thing for positive numbers
               | 
               | No?
               | 
               | > You can and should do everything with signed integers
               | 
               | You really should not. If a value should not have
               | negative values, then making it so it _can not_ have
               | negative values is strictly better than the alternative.
               | Making invalid values impossible makes software clearer
               | and more reliable.
               | 
               | > except bitfields, of course.
               | 
               | There's no more justification for that than for the other
               | things you object to.
        
               | enriquto wrote:
                | Well, you and I are different people and we don't have
                | to agree on everything. In this case, it seems that we
                | don't agree on _anything_. But it's still OK, if it
                | works for you ;)
        
               | dkersten wrote:
               | > No need to enforce your semantics through type
               | 
               | Maybe I'm spoiled by other languages with more powerful
               | type systems, but this is exactly what I want my types to
               | do! Isn't this why we have type traits and concepts and
               | whatnot in C++ now? If not for semantics, why have types
               | at all, the compiler could figure out what amount of
               | bytes it needs to store my data in, after all.
               | 
               | I use types for two things: to map semantics to hardware
               | (if memory or performance optimization are important,
               | which is rare) and to enforce correctness in my code.
               | You're telling me that the latter is not a valid use of
               | types and I say that's the single-biggest reason I use
               | statically typed languages over dynamically typed
               | languages, when I do so.
               | 
               | But even if that's not the case, why would I use a more
               | general type than I need, when I know the constraints of
               | my code? If I know that negative values are not
               | semantically valid, why not use a type that doesn't allow
               | those? What benefit would I get from not doing that? I
               | mean, why do we have different sizes of integers when all
               | the possible ones I could want can be represented as a
               | machine-native size and I can enforce size constraints in
                | software instead? We could also just use doubles for all
               | numbers, like some languages do.
        
               | enriquto wrote:
               | Would you really write a function find_prime_factors()
               | that takes an input of type "integer" and an output of
               | type "prime", that you have previously defined? Then if
               | you want to sum or multiply such primes you have to cast
               | them back to integers. Maybe it makes sense for you, but
               | for me this is the textbook example of useless over-
               | engineering.
               | 
               | The same ugliness occurs when using unsigned types to
               | store values that happen to be positive. Well, in that
               | case it is even worse, because it is incomplete and
               | asymmetric. What's so special about the lower bound of
               | the possible set of values? If it's an index to an array
               | of length N, you'll surely want an integer type whose
               | values cannot exceed N. And this is a can of worms that I
               | prefer not to open...
        
               | dkersten wrote:
               | > Would you really write a function find_prime_factors()
               | that takes an input of type "integer" and an output of
               | type "prime", that you have previously defined?
               | 
                | If the language allows me to and it's an important
               | semantic part of my program, then yes. The same way as I
               | would create types for units that need conversion.
               | 
               | Unless I'm writing low level performance sensitive code,
               | yes, I want to encode as much of my semantics as I can,
               | so that I can catch mistakes and mismatches at compile
               | time, make sure units get properly converted and whatnot.
               | 
               | > What's so special about the lower bound of the possible
               | set of values?
               | 
               | Nothing, I would encode a range if I can. But many things
               | don't have a knowable upper-bound but do have a lower
               | bound at zero: you can't have a negative size (for most
               | definitions of size), usually when you have a count of
               | things you don't have negatives, you know that a
               | dynamically sized array can never have an element index
               | less than 0, but you may not know the upper bound.
               | 
               | Also, the language has limitations, so I have to work
               | within them. I don't understand your objection for using
               | what is available to make sure software is correct. Also,
               | remember that many of the security bugs we've seen in
               | recent years came about because of C not being great at
               | enforcing constraints. Are you really suggesting not to
               | even try?
               | 
               | > And this is a can of worms that I prefer not to open...
               | 
               | And yet many languages do and even C++20 is introducing
               | ranges which kind of sort of fall into this space.
        
               | giomasce wrote:
               | To me it could totally make sense. It depends on the
               | context, but I can very well see contexts where such a
               | choice could make sense. For example, in line of
               | principle it would make sense, for an RSA implementation,
               | to accept to construct a type PublicKey only computing
               | the product of two Prime's, and not two arbitrary
               | numbers. And the Prime type would only be constructible
               | by procedures that provably (perhaps with high
               | probability) generate an actual prime number. It would be
               | a totally sensible form of defensive programming. You
               | don't want to screw up your key generation algorithm, so
               | it makes sense to have your compiler help you to not
               | construct keys from anything.
               | 
               | For the same reason, say, in an HTTP server I could store
               | a request as a char* or std::string, but I would
               | definitely create a class that ensures, upon
               | construction, that the request is valid and legitimate.
               | Code that processes the request would accept HTTPRequest,
               | but not char*, so that unverified requests cannot even
                | risk crossing the trust boundary.
        
               | UncleMeat wrote:
               | But "unsigned" doesn't actually enforce the semantics you
               | want. Missing an overflow check means your value will
               | never be negative, but it is almost certainly still a
               | bug. And because unsigned overflow is defined, the
               | compiler isn't allowed to prevent you from doing it!
               | 
               | This is just enough type semantics to injure oneself.
        
               | dkersten wrote:
                | So, because it's not perfect, should you throw it all out?
        
               | jcelerier wrote:
               | > Maybe I'm spoiled by other languages with more powerful
               | type systems, but this is exactly what I want my types to
               | do! Isn't this why we have type traits and concepts and
               | whatnot in C++ now? If not for semantics, why have types
               | at all, the compiler could figure out what amount of
               | bytes it needs to store my data in, after all.
               | 
               | yes, but understand that, despite the name, what unsigned
               | models in C / C++ is not "positive numbers" but "modulo
               | 2^N" arithmetic (while signed models the usual
               | arithmetic).
               | 
               | There is no good type that says "always positive" by
               | default in C or C++ - any type which gives you an
               | infinite loop if you do
                | 
                |     for ({int,unsigned,whatever} i = 0; i < n - 1; i++) {
                |         // oops, n was zero, n - 1 is 4.something
                |         // billion, see you tomorrow
                |     }
               | 
               | is _not_ a good type.
               | 
               | If you want a "always positive" type use some safe_int
               | template such as https://github.com/dcleblanc/SafeInt -
               | here if you do "x - y" and the result of the computation
               | should be negative, then you'll get the rightful runtime
               | error that you want, not some arbitrarily high and
               | incorrect number
               | 
               | The correct uses of unsigned are for instance for
               | computations of hashes, crypto algorithms, random number
               | generation, etc... as those are in general defined in
               | modular arithmetic
        
               | oddthink wrote:
               | +1 for this. I was just bitten by this last week, when I
               | switched from using a custom container where size() was
               | an int to a std::vector where size() is size_t.
               | 
                | The code was check-all-pairs, e.g.
                | 
                |     for (int i = 0; i < container.size() - 1; ++i) {
                |         for (int j = i + 1; j < container.size(); ++j) {
                |             stuff(container[i], container[j]);
                |         }
                |     }
               | 
               | Which worked just fine for int size, but failed
               | spectacularly for size_t size when size==0.
               | 
               | I totally should have caught that one, but I just
               | couldn't see it until someone else pointed it out. And
               | then it was obvious, like many bugs.
        
               | jcelerier wrote:
               | I recommend using -fsanitize=undefined -fsanitize=integer
               | if you can build with clang - it will print a warning
               | when an unsigned int underflows which catches a
               | terrifying amount of similar bugs the first time it is
               | run (there are a lot of false positives in hash
                | functions, etc., though imho it's well worth using
               | regularly)
        
             | UncleMeat wrote:
             | If negative doesn't make sense then you are saving one bit
             | using this method, but introducing a _ton_ of fun footguns
             | involving things like conversions. Further, the compiler
             | cannot assume no overflowing and must now do extra work to
             | handle those cases in conforming fashion, even if your
              | value width doesn't match the CPU width. This can make
             | your code slower!
        
             | chrchang523 wrote:
             | Also, go to the Compiler Explorer and compare the generated
             | code for C++ "num / 2" when num is an int, and when num is
             | an unsigned int.
             | 
             | While there are a few cases where the compiler tends to do
             | a better job of optimizing signed ints than unsigned ints
             | (generally by exploiting the fact that signed integer
             | overflow is undefined), they are not as fundamental as "num
             | / 2". Being forced to write "num >> 1" all over the place
             | whenever I care about performance is basically a
             | dealbreaker for me in many projects; and I haven't even
             | gotten into the additional safety issues introduced by
             | undefined overflow.
        
             | adrian_b wrote:
             | While I also like to use unsigned numbers when that is the
             | correct type of a variable, the C language does not really
             | have support for unsigned integers.
             | 
             | As someone else already said, the so called "unsigned"
             | integers in C are in fact remainders modulo 2^N, not
             | unsigned integers.
             | 
             | While the sum and the product of 2 unsigned integers is
             | also an unsigned integer, the difference of 2 unsigned
             | integers is a signed integer.
             | 
             | The best behavior for a programming language would be to
             | define correctly the type of the difference of 2 unsigned
             | integers and the second best behavior would be to specify
             | that the type of the result is unsigned, but to insert
             | automatically checks for out-of-domain results, to detect
             | the negative results.
             | 
             | As C does not implement any of these behaviors, whenever
             | using unsigned integers you must either not use subtraction
             | or always check for negative results, unless it is possible
             | to always guarantee that negative results cannot happen.
             | 
             | This is a source of frequent errors in C when unsigned
             | integers are used.
             | 
             | The remainders modulo 2^N can be very useful, so an ideal
             | programming language would support signed integers,
             | unsigned integers and modular numbers.
        
       | RMPR wrote:
       | What a coincidence this gets posted today, I posted something[0]
       | a couple of hours ago about how specifically a combination of
       | these rules can bite you very hard.
       | 
       | 0: https://rmpr.xyz/Integers-in-C/
        
       | MaxBarraclough wrote:
       | [Dons language lawyer hat]
       | 
       | > floating-point number types will not be discussed at all,
       | because that mostly deals with how to analyze and handle
       | approximation errors that stem from rounding. By contrast,
       | integer math is a foundation of programming and computer science,
       | and all calculations are always exact in theory (ignoring
        | implementation issues like overflow).
       | 
       | Integer overflow is no mere implementation issue, any more than
       | errors are an implementation issue with floating-point.
       | 
       | > Unqualified char may be signed or unsigned, which is
       | implementation-defined.
       | 
       | > Unqualified short, int, long, and long long are signed. Adding
       | the unsigned keyword makes them unsigned.
       | 
        | There's an additional point here that's not mentioned: _char_,
        | _signed char_, and _unsigned char_ are distinct types, but
        | that's only true of _char_. That is, _signed int_ describes the same
       | type as _int_. You can see this using the _std::is_same_ type-
       | trait with a conforming compiler. Whether _char_ behaves like a
       | signed integer type or an unsigned integer type, depends on the
       | platform.
       | 
       | > Signed numbers may be encoded in binary as two's complement,
       | ones' complement, or sign-magnitude; this is implementation-
       | defined.
       | 
       | This is no longer true of C++. As of C++20, signed integer types
       | are defined to use two's complement. [0] I don't think C intends
       | to do the same.
       | 
       | > Character literals (in single quotes) have the type (signed)
       | int in C, but (signed or unsigned) char in C++.
       | 
        | That's not correct. In C++, the type of a character literal
        | is simply _char_, never _signed char_ nor _unsigned char_. As I
        | mentioned above, whether _char_ is signed depends on the
        | platform, but it's always a distinct type.
       | 
       | > Signed division can overflow - e.g. INT_MIN / -1.
       | 
       | This isn't just overflow, it's undefined behaviour.
       | 
       | > Counting down
       | 
       | > Whereas an unsigned counter would require code like:
       | 
       | > for (unsigned int i = len; i > 0; i--) { process(array[i - 1]);
       | }
       | 
        | That's one solution, but it might be a good place for a
        | _do/while_ loop.
       | 
       | [0] https://stackoverflow.com/q/57363324/
        
         | rualca wrote:
         | > This is no longer true of C++. As of C++20, signed integer
         | types are defined to use two's complement. [0] I don't think C
         | intends to do the same.
         | 
         | As no good language lawyer discussion should be free from
         | pedantry, there is no such thing as "As of C++20". C++20 is
         | just a new version of the C++ standard. Projects that target
         | C++11 or C++14 or C++17 are all still here and won't go away
         | any time soon, and the respective C++ rule still apply to them.
         | Passing a new revision of the C++ standard changes nothing with
         | regards to which rules actually apply to those projects, unless
         | project maintainers explicitly decide to migrate their
         | projects.
        
           | jcelerier wrote:
           | > C++20 is just a new version of the C++ standard.
           | 
           | and per ISO rules, older versions are withdrawn (as can be
           | confirmed for C++ here:
           | https://www.iso.org/standard/79358.html) and not to be used
           | anymore: https://www.iso.org/files/live/sites/isoorg/files/st
            | ore/en/P...
            | 
            | > Other reasons why a committee may decide to propose a
            | > standard for withdrawal include the following:
            | >
            | > - the standard does not reflect current practice or
            | >   research
            | > - it is not suitable for new and existing applications
            | >   (products, systems or processes)
            | > - it is not compatible with current views and expectations
            | >   regarding quality, safety and the environment
        
             | 0xffff2 wrote:
             | Wow. Didn't know this. It doesn't have any bearing
             | whatsoever on reality though. If it did, I wouldn't still
             | be writing C++98 conformant C++.
        
               | jcelerier wrote:
               | well, your code is nonstandard, that is all, just like a
                | house with power plugs installed 30 years ago is not
               | standard, even if it "works"
        
         | dataflow wrote:
         | >> Character literals (in single quotes) have the type (signed)
         | int in C, but (signed or unsigned) char in C++.
         | 
         | > That's not correct. In C++, the type of a character literal
         | is simply char, never signed char nor unsigned char.
         | 
         | I'd assume the author meant (signed `char` | unsigned `char`)
         | rather than (`signed char` | `unsigned char`).
        
         | focus2020 wrote:
         | What is the reference to "Dons"?
        
           | MaxBarraclough wrote:
            | _Don_ is a somewhat uncommon verb meaning _to put on
            | clothing_. https://en.wiktionary.org/wiki/don#Verb
           | https://en.wiktionary.org/wiki/don#Verb
        
         | saagarjha wrote:
         | Signed overflow is undefined behavior.
        
           | quietbritishjim wrote:
           | That seems to be exactly what the parent comment said.
        
             | MaxBarraclough wrote:
             | I think saagarjha's point was that the article already
             | points out that signed overflow causes undefined behaviour.
             | That's true, but I think it still bears emphasising that
             | (INT_MIN / -1) causes undefined behaviour.
        
         | quietbritishjim wrote:
         | > char, signed char, and unsigned char are distinct types, but
         | that's only true of char.
         | 
         | That's correct, I was going to bring that up too.
         | 
         | This is particularly important because char and unsigned char
          | are special in that they are an exception to the aliasing
          | rules. That is, in this function:
          | 
          |     float foo(char* cp, float* fp) {
          |         *fp = 7;
          |         return *(float*)cp;
          |     }
          |     /* ... */
          |     float f = 2;
          |     float g = foo((char*)&f, &f);
         | 
         | Then g should end up equal to 7. That's true even if you change
         | the type of the cp parameter to const char*! If you change
         | "char" to "unsigned char" in both places then its behaviour
         | stays the same, but if you change it to "signed char" in both
         | places then it has undefined behaviour (if I've remembered
         | everything correctly). Now I think about it, this conflation of
         | char's use in the C standard has probably prevented a lot of
         | optimisations where code was just using char* for strings
         | rather than for potential aliasing.
         | 
         | Another point, which is very related, is that uint8_t and
         | int8_t do not necessarily have to be a typedef for unsigned
         | char / signed char or char, even if char is 8 bits wide. So you
         | could end up with (at least) 5 types that are 8-bit wide!
         | 
         | Combined with the above aliasing rules _only_ applying to char
         | and unsigned char, that means you cannot reliably expect
         | uint8_t to have that aliasing exception. Indeed, gcc originally
         | made a new type of uint8_t and int8_t but that caused so many
         | bugs that they ended up switching them to unsigned char and
         | char (and I think Visual Studio has always done so).
         | 
         | > > Character literals (in single quotes) have the type
         | (signed) int in C, but (signed or unsigned) char in C++.
         | 
         | > That's not correct. In C++, the type of a character literal
         | is simply char, never signed char nor unsigned char.
         | 
         | I was going to bring this up too, although I wouldn't quite say
         | it's outright incorrect because I'm not sure they were making
         | the claim you think they were - it could be interpreted to mean
         | that it's always char in C++ but by the way don't forget that
         | could be a signed or unsigned type (note the lack of monospace
         | font for their use of "signed" and "unsigned"). But probably
         | best not to overanalyse it since they probably didn't know the
          | types were distinct - the main thing is to reiterate, as you've
         | done, that it's always `char` regardless of whether that's
         | signed or unsigned.
        
           | logicchains wrote:
           | >So you could end up with (at least) 5 types that are 8-bit
           | wide!
           | 
           | Don't forget std::byte.
        
             | lifthrasiir wrote:
             | It is not a full-featured arithmetic type though. It
             | doesn't implement operator+/-/* etc.
        
       | criddell wrote:
       | This is one of the misconceptions:
       | 
       | > sizeof(T) represents the number of 8-bit bytes (octets) needed
       | to store a variable of type T.
       | 
       | That's a misconception I had and I've never run into a problem.
       | What's a platform where sizeof works differently?
       | 
       | Also, what's the reasoning for sizeof to be an operator rather
       | than a function?
        
         | GlitchMr wrote:
         | See https://stackoverflow.com/questions/2098149/what-
         | platforms-h.... As for `sizeof` being an operator, well, C
         | doesn't have generics, so it has no choice but to make `sizeof`
         | somehow special.
         | 
         | If you don't want to bother supporting platforms where byte is
         | not 8-bit (a reasonable choice I would say), use
         | `int8_t`/`uint8_t` instead. Those types won't exist on
         | platforms that don't have 8-bit bytes.
        
           | MaxBarraclough wrote:
           | > C doesn't have generics, so it has no choice but to make
           | `sizeof` somehow special
           | 
           | It could have used a different syntax though. Ada has a
           | special syntax for compile-time inquiries like this, so
           | there's no way to confuse them with function calls. Ada calls
           | these _attributes_.
           | 
           | https://en.wikibooks.org/wiki/Ada_Programming/Attributes#Lan.
           | ..
        
           | masklinn wrote:
           | > If you don't want to bother supporting platforms where byte
           | is not 8-bit (a reasonable choice I would say), use
           | `int8_t`/`uint8_t` instead. Those types won't exist on
           | platforms that don't have 8-bit bytes.
           | 
           | You'll have the issue that, as one of the commenters
           | explained above, `char` is its own thing, independent and
           | separate from `signed char` and `unsigned char` to say
           | nothing of `int8_t` and `uint8_t`. This means that while you
           | can use your own thing for your own functions you can _not_
           | do so if your values have to interact with libc functions (or
           | most of the ecosystem at large).
           | 
           | If you only want to support platforms using 8-bit chars, you
           | should check CHAR_BIT. That is actually reliable and correct.
        
       | dahfizz wrote:
       | One thing I think should have been mentioned: size_t is
       | guaranteed to be large enough to index all of memory, which is
          | why it is the return type of sizeof.
        
         | dusanz wrote:
         | size_t is only guaranteed to be large enough to store the size
         | of the largest object. This is not the same as being able to
         | index all of memory. You could imagine a platform with
          | restricted contiguous allocation size where the maximum object
         | size is smaller than the size of the address space.
        
       | flohofwoe wrote:
       | In the myths section:
       | 
       | > char is always 8 bits wide. int is always 32 bits wide
       | 
       | > Signed overflow is guaranteed to be wrap around. (e.g. INT_MAX
       | + 1 == INT_MIN.)
       | 
       | Are there any current, relevant hardware architectures where this
       | is not true (e.g. bytes are not 8 bits, and integers are not 2's
       | complement)?
       | 
       | E.g. what's the point of "portability" if there is no physical
       | hardware around anymore where those restrictions would apply?
        
         | AshamedCaptain wrote:
         | Remember to add: that can actually run standard C++ (i.e. with
         | exceptions)?
         | 
         | Certainly you can find an architecture which may run some type
         | of C-like language with strange arithmetic rules (e.g. DSPs). I
         | would bet it's harder to find one such architecture where one
         | can run standard C, and impossible to find one which can run
         | standard C++.
        
           | lultimouomo wrote:
           | This. I don't understand why everyone must suffer the pain of
           | the possibility of weird char widths instead of just settling
           | on using a non-standard C in a bunch of DSPs. It's not like
           | you're going to link a bunch of regular run of the mill C
           | libraries on them anyway.
        
             | not_knuth wrote:
             | Isn't this an artifact of the age of C? When it was first
             | created it was a major concern to support every
             | architecture, so they put it in the standard. I don't think
             | anyone has wanted to go through the pain of removing it
             | ever since.
             | 
             | After all, who are language nerds to dictate chip
             | manufacturers what the ISA should look like? :P
             | 
             | And it was only in the last 2 decades that everything got
             | dominated by x86...
        
             | [deleted]
        
         | beeforpork wrote:
         | This is the trap with 'undefined behaviour': it has nothing to
         | do with portability, but it is a language level definition.
         | 
         | I.e., if the C std says it's 'undefined', it is not to be
         | avoided for portability reasons (hardware, assembler), but it
         | must not be used, end of story. The portability stuff is called
         | 'implementation defined' in C, not 'undefined behaviour'. The
         | problem is that the compiler can (and will!) exploit undefined
         | behaviour rules. E.g., the following code is officially broken
         | (and not just on weird hardware, but everywhere, as defined by
          | the C std):
          | 
          |     int saturated_increment(int i)
          |     {
          |         if ((i + 1) < i) { /* if it overflows, do not inc */
          |             return i;
          |         }
          |         return i + 1;
          |     }
         | 
         | The compiler may (and many will) remove the whole if() block,
         | because i+1<i is trivially false, because int cannot overflow
         | (says the C standard).
         | 
         | As one can imagine, when compilers started exploiting this, a
         | lot of discussion about sensibility followed. And gcc added
         | -fwrapv among other things.
         | 
          | (And the code would be fine if 'unsigned' was used instead of
         | 'int', because this is only a problem of signed ints.)
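          | A UB-free rewrite (a sketch) checks the operand before
          | adding, so no signed overflow ever happens and the optimizer
          | has nothing it can legally delete:

```c
#include <limits.h>

/* Saturating increment without undefined behaviour: test against
   INT_MAX before adding, not the (overflowed) result afterwards. */
int saturated_increment(int i)
{
    if (i == INT_MAX) {
        return i;       /* already at the top: saturate */
    }
    return i + 1;       /* cannot overflow here */
}
```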
        
           | dkersten wrote:
           | "Undefined behavior" really means that the standard doesn't
           | define what should happen and that the compiler is therefore
           | free to do whatever it pleases, under the assumption that
           | such code will never occur.
           | 
           | Reminds me of the examples where the code gets compiled in a
           | way where a branch that returns from the function is
           | unintuitively always taken because the compiler was able to
           | detect that there is undefined behavior later in the function
           | and since undefined behavior isn't legal, it assumed that it
           | therefore can never reach there, so the branch must always
           | get taken and the actual condition check got optimised away
           | (IIRC).
           | 
           | So yeah, undefined behavior isn't "implementation defined"
           | nor "unportable" but rather "illegal not allowed wrong code".
        
             | MaxBarraclough wrote:
             | > So yeah, undefined behavior isn't "implementation
             | defined" nor "unportable" but rather "illegal not allowed
             | wrong code".
             | 
             | There are edge-cases even there. Calling a function
             | generated by a JIT compiler is undefined behaviour, but
             | there's a gentleman's agreement that the compiler won't
             | screw it up for you.
             | 
              | Almost all C/C++ compilers promise IEEE 754 behaviour
              | for floating-point division by zero (infinity, or NaN
              | for 0.0/0.0), but according to the C/C++ standards
              | themselves, it's undefined behaviour.
             | 
             | You're right though that in general, one should not be
             | complacent about UB.
        
               | giomasce wrote:
               | > There are edge-cases even there. Calling a function
               | generated by a JIT compiler is undefined behaviour, but
               | there's a gentleman's agreement that the compiler won't
               | screw it up for you.
               | 
               | Though you're not writing C/C++ in that case. You're
               | writing "C/C++ for that particular architecture, ABI, OS
               | and compiler".
               | 
                | In general C/C++, if your code is correct, every
                | present and future compiler, known or unknown, is
                | supposed to generate a correct executable; if it
                | doesn't, that's a compiler bug. You can pretend to be
                | smarter and rely on UB, but then the responsibility
                | shifts to you: you have (in principle) to validate
                | each compiler and environment, and you can blame no
                | one but yourself for any bug.
        
               | MaxBarraclough wrote:
                | Sounds right. If you're doing floating-point work it's
                | not generally a problem to assume that division by
                | zero follows IEEE 754 (infinity, or NaN for 0.0/0.0).
                | Virtually all C and C++ compilers commit to this
                | behaviour in the name of IEEE 754 compliance (even if
                | that compliance is incomplete).
        
         | gallier2 wrote:
         | DSP's have often uncommon sizes. tms320c5502 for example has
         | following sizes: char-- 16 bits short --16 bits int --16 bits
         | long-- 32 bits long long -- 40 bits float-- 32 bits double --
         | 64 bits
        
           | rocqua wrote:
           | > 40 bits float-- 32 bits double
           | 
           | Isn't double required to have more precision than float?
        
             | ericbarrett wrote:
             | I think their formatting got swallowed by HN:
             | 
             | char-- 16 bits
             | 
             | short --16 bits
             | 
             | int --16 bits
             | 
             | long-- 32 bits
             | 
             | long long -- 40 bits
             | 
             | float-- 32 bits
             | 
             | double -- 64 bits
        
           | Asraelite wrote:
           | > long long -- 40 bits
           | 
           | Isn't this in direct contradiction to what the article says?
           | 
           | > long long: At least 64 bits, and at least as wide as long.
        
           | brandmeyer wrote:
           | Indeed. The C28x line by the same company shares CHAR_BIT ==
           | 16 with C55. C28x is quite popular in power electronics
           | applications.
           | 
          | "Relevant" is in the eye of the beholder, and it's all too
          | easy to no-true-scotsman your way out of existing
           | architectures. I claim that both of these architectures are
           | relevant by virtue of suppliers continuing to make new chips
           | that use them, and system builders continuing to select those
           | chips in new products.
        
         | jeffbee wrote:
         | There are loads of DSPs, MCUs, and other non-PC junk where
          | CHAR_BIT is not 8. On the SHARC, for example, CHAR_BIT is 32
          | and absolutely every type is 32 bits wide.
        
         | beeforpork wrote:
         | Those are two different things:
         | 
         | For 'char has 8 bits': the bitwidth is 'implementation defined'
         | in C. If you know your target architectures, you can assume
         | it's 8 bits, because it's indeed a question of portability.
         | 
         | For 'int must not overflow': this is 'undefined behaviour' in
         | C. You must not do it, regardless of what you know about your
         | target architectures, because this is a language level
         | prohibition.
        
         | amelius wrote:
         | On ARM, char is always unsigned, whereas on Intel it's usually
         | signed. This silly inconsistency broke a lot of code.
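          | The choice is implementation-defined and easy to probe
          | portably (a sketch; the function name is mine):

```c
#include <limits.h>

/* CHAR_MIN is negative exactly when plain char is signed (typical
   x86 ABIs) and 0 when it is unsigned (typical ARM ABIs). */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```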
        
         | pornel wrote:
         | C code doesn't just run on the architecture you compile for. It
         | first "runs" on a C virtual machine simulated by the optimizer.
         | This low-level virtual machine (you may call it LLVM) usually
         | implements signed overflow by deleting the code that caused it.
        
         | jstimpfle wrote:
         | I heard that one case where defining int-overflow as wrapping
         | would be very bad for performance is pointer arithmetic - e.g.
          | offsetting a pointer by i times sizeof(type). I think the x64
         | instruction "lea" accomplishes this. If this instruction is
         | used, it is impossible to simulate 32-bit 2's complement
         | overflow by just discarding the upper 32 bits of a 64-bit
         | integer.
         | 
         | So the UB that is associated with overflowing an int is
         | required to efficiently compile loops that use a counter `int
         | i` to index an array. There is a huge number of these loops in
         | the wild.
         | 
         | This problem might be just some unfortunate coincidence with
         | how array indexing is defined in C. I don't understand this
         | deeply, but just wanted to bring it up. I believe I read this
         | on Fabian Giesen's blog.
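          | The kind of loop in question looks like this (a sketch with
          | an illustrative name):

```c
/* Because signed overflow is UB, the compiler may assume the int
   counter never wraps, widen it to pointer size once, and address
   a[i] with a single scaled-index instruction (e.g. x86-64 lea).
   With defined wrapping semantics it would have to re-truncate i to
   32 bits on every iteration. */
void scale_by_two(double *a, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] *= 2.0;    /* a[i] is *(a + i) */
    }
}
```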
        
           | simiones wrote:
           | The part about lea doesn't seem especially convincing, it's
           | not hard to imagine that pointer arithmetic could be defined
           | such that overflow is still UB, while allowing regular signed
           | integer arithmetic to overflow safely.
        
             | jstimpfle wrote:
             | I can't say much about this. What I know is that in C,
             | pointer arithmetic is defined in terms of "normal"
             | arithmetic. p[i] is defined as *(p + i). And (p + i) means
              | to offset p by (i * sizeof *p), with that multiplication
              | computed in the type of i (e.g. (32-bit) int or an even
              | smaller type).
        
               | simiones wrote:
               | That multiplication is entirely implicit, so there is no
               | reason the compiler needs to handle it the same as it
                | handles an explicit multiplication. Given that `p + i`
                | is obviously not an integer addition and already has
                | much more UB than `i + j`, there is no reason why
                | `i + j` having defined overflow rules would mean that
                | `p + i` also has them (just like `i + j` is safe for
                | any small enough i and j, while p + i is only
                | meaningful if it points within the same object as p;
                | to be fair, it's not UB to compute p + i for any i,
                | it's UB to use the value).
        
           | brandmeyer wrote:
           | I think the nasty cases are in supporting subregister-sized
           | arithmetic. ARMv8 can perform almost any integer operation on
           | its registers either as 64-bit or 32-bit registers.
           | 
           | The classic RISC machines could only perform full-register
           | arithmetic. RISC-V has a small handful of instructions that
           | can accelerate signed subregister arithmetic, but none that
           | accelerate unsigned subregister arithmetic. So, if you need a
           | 32-bit unsigned integer operation to guarantee wrap-around
           | behavior on 64-bit RISC-V, the compiler may have to insert
           | additional zero-extension instruction sequences if it cannot
           | prove the absence of overflow.
        
         | tsimionescu wrote:
         | > Are there any current, relevant hardware architectures where
         | this is not true (e.g. bytes are not 8 bits, and integers are
         | not 2's complement)?
         | 
         | For char, not sure, but the problem with signed overflow is not
         | that you can't be sure whether it's 2's complement, it's that
         | the compiler is allowed to assume it won't happen. So, if you
         | read two numbers into 2 ints and add them up, then check for
         | overflow somehow, the compiler will just remove your check
          | while optimizing, since integer addition can't overflow in a
         | valid program.
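          | A check the optimizer cannot legally remove tests the
          | operands before the addition, staying inside defined
          | behaviour the whole time (a sketch; the name is mine):

```c
#include <limits.h>

/* Returns 1 and stores a + b in *out if the sum fits in an int,
   returns 0 otherwise. No expression here can overflow, so the
   compiler cannot discard the test. */
int checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return 0;   /* would overflow */
    }
    *out = a + b;
    return 1;
}
```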
        
           | SAI_Peregrinus wrote:
           | And the compiler is allowed to remove the check, *even if not
           | optimizing*. -O0 doesn't guarantee it'll be kept.
           | 
            | In practice, compilers should be considered to follow
            | Murphy's Law: Undefined Behavior will work perfectly fine
            | on a developer's machine or when observed by any support or
            | QA
           | staff, but will occasionally cause intermittent problems on
           | production machines when observed by users or during
           | demonstrations to executives.
        
         | nemetroid wrote:
         | Signed integers wrapping and 2's complement are separate
          | issues. C++20 specifies that signed integers are 2's
          | complement, but signed overflow is still undefined.
        
       | dvfjsdhgfv wrote:
       | > Python only has one integer type, which is a signed bigint.
       | Compared to C/C++, this renders moot all discussions about bit
       | widths, signedness, and conversions - one type rules all the
       | code. But the price to pay includes slow execution and
       | inconsistent memory usage.
       | 
       | Well, the beauty of C is that you can have that too, if you wish,
       | and you have many options to choose from.
        
       | dreinhardt wrote:
       | It's funny that you would end up with a similar conclusion for
       | other parts of the language (e.g. operators) as well. Just a
       | gigantic set of inane rules everywhere causing you to constantly
       | be in danger of introducing bugs and portability issues.
        
         | bregma wrote:
          | It's discouraging. If the language requires you to actually
          | know what you're doing, you can't hire dirt-cheap easily-
          | replaced code monkeys to bang out your ideas, and the end
          | result is you get to keep less of the investors' money for
          | yourself.
        
           | AndriyKunitsyn wrote:
            | It can feel good to imagine yourself an enlightened master
            | among code monkeys, yet in practice everybody can be a code
            | monkey sometimes, and when this happens in C/C++, it will
            | leave a ticking time bomb in the codebase that will lie
            | there until a customer blows up on it, no matter how many
            | millions went into QA of the product.
            | 
            | And in practice, C/C++ developers are among the lower-paid
            | programmers - probably because "banging out ideas" and
            | producing actual programs that actually work are valued
            | more than language elitism.
        
             | bigcorp-slave wrote:
             | It's actually not true at all that C++ developers are lower
             | paid. Rather, their pay is highly bimodal. Most work at all
             | FAANGMULA companies is C++.
        
         | Joker_vD wrote:
         | But! And that's important -- it allows for great performance,
         | so you can make ten/hundred times more mistakes per second than
         | in other, "safer" languages.
        
           | tammerk wrote:
            | Nowadays it doesn't provide any performance gain. I didn't
            | see those days, but maybe it was important for performance
            | back in the 70s/80s/90s, even if it was risky? E.g. the
            | null-terminated string was chosen due to its low space
            | overhead.
        
             | creato wrote:
             | It depends on what you are doing. For some kinds of
             | programs, C/C++ are going to be much faster than most
             | "modern" languages.
        
               | nicoburns wrote:
               | Most, but not all. Languages like Rust and Zig show that
               | you can have the performance without the landmines.
        
               | hajile wrote:
                | Also, theoretical performance is overrated. Almost all
                | the things that lend themselves to speed make code
                | brittle and incapable of future modification.
               | 
               | Once you've got your C code doing safety checks with data
               | types that won't break under the littlest change, the
               | code becomes much slower than code golf would suggest. A
               | common example is passing void pointers everywhere. You
               | either check every call every time (aka dynamic typing)
                | or rest everything on the idea that the programmer
               | understands the system completely and never forgets or
               | messes up. Better types give you all the speed AND all
               | the safety here.
        
               | tammerk wrote:
               | I didn't mean C is not fast or not faster than other
               | languages. It's still the fastest one I believe.
               | 
               | What I meant is undefined behaviors allow compilers to
               | optimize in a way that would not be possible otherwise.
               | So, it might be a deliberate decision back then, to
               | leverage performance. I don't know, just an idea.
        
               | jjgreen wrote:
               | It used to be "folk knowledge" that only Fortran and
               | hand-crafted ASM were faster. Not sure if that's still
               | (or ever was) true.
        
               | [deleted]
        
               | hajile wrote:
               | I guess it was maybe true one time.
               | 
               | http://www.catb.org/jargon/html/story-of-mel.html
        
           | lifthrasiir wrote:
           | > it allows for great performance, so you can make
           | ten/hundred times more mistakes per second than in other,
           | "safer" languages.
           | 
           | This is false. For a long time C performance used to be
           | inferior to Fortran, which is arguably safer than C. It's
            | hilarious that strict aliasing and the `restrict` keyword
            | were born out of making C on par with Fortran, and UB
            | became a major issue for C programmers as a result!
        
             | atkwarriors wrote:
             | Yes, that's why C has undefined behavior. Absolutely
        
         | RMPR wrote:
         | It's a feature, not a bug.
        
         | tammerk wrote:
          | It's even funnier that although the language is full of
          | traps, in practice it works quite well. I don't think any C
          | developer (or let's say 95%) knows all the rules mentioned in
          | the article, yet we are still in one piece.
         | 
         | Does anybody know any paper for bugs per lines of code for
         | different languages or something similar?
        
       ___________________________________________________________________
       (page generated 2021-04-02 23:01 UTC)