[HN Gopher] Summary of C/C++ integer rules
___________________________________________________________________
Summary of C/C++ integer rules
Author : dmulholl
Score : 101 points
Date : 2021-04-02 07:50 UTC (15 hours ago)
(HTM) web link (www.nayuki.io)
(TXT) w3m dump (www.nayuki.io)
| lifthrasiir wrote:
| > Signed numbers may be encoded in binary as two's complement,
| ones' complement, or sign-magnitude; this is implementation-
| defined.
|
| Thankfully, in addition to what MaxBarraclough helpfully pointed
| out, every (u)intN_t type provided by <stdint.h> is guaranteed to
| use two's complement even in C99.
| saagarjha wrote:
| > Having signed and unsigned variants of every integer type
| essentially doubles the number of options to choose from. This
| adds to the mental burden, yet has little payoff because signed
| types can do almost everything that unsigned ones can.
|
| Unsigned types are quite useful when doing bit twiddling because
| they don't overflow or have a bit taken up by the sign.
| dreinhardt wrote:
| Which is why sane languages of the time had a bitfield type.
| enriquto wrote:
| > Unsigned types are quite useful when doing bit twiddling
| because they don't overflow or have a bit taken up by the sign.
|
| That's essentially their only application. The rest are stupid
| single-bit memory-size optimizations. As Jens Gustedt noted,
| it's one of the (many) misnomers in the C language. It should
| be better called "modulo" instead of "unsigned". Other such
             | misnomers that I recall:
             |     unsigned -> modulo
             |     char     -> byte
             |     union    -> overlay
             |     typedef  -> typealias
             |     const    -> immutable
             |     inline   -> negligible
             |     static   -> intern
             |     register -> addressless
|
| EDIT: found the reference
| https://gustedt.wordpress.com/2010/08/18/misnomers-in-c/
| renox wrote:
| > const -> immutable
|
| const -> read_only_view is better
| ActorNightly wrote:
| Thanks for this, gonna add some #defines to my headers :)
| dkersten wrote:
| > That's essentially their only application.
|
| What about when it doesn't make semantic sense to have
| negative values? Eg for counting things, indexing into a
| vector, size of things. If negative doesn't make sense, I use
           | unsigned types. It's not about the memory size in that case.
| enriquto wrote:
| Positive values are a particular case of signed values, you
| can still use signed ints to store positive values. No need
| to enforce your semantics through type, and especially not
| when the values of the type are trivially particular cases
| of the values of another type. For example, when you write
| a function in C that computes prime factors of an int, do
| you need a type for prime numbers? No, you just use int.
| The same thing for positive numbers, and for even numbers,
| and for odd numbers. You can and should do everything with
| signed integers, except bitfields, of course.
| masklinn wrote:
| > Positive values are a particular case of signed values,
| you can still use signed ints to store positive values.
|
| And yet Java's lack of unsigned integers is considered a
| major example of its (numerous) design errors.
|
| > No need to enforce your semantics through type, and
| especially not when the values of the type are trivially
| particular cases of the values of another type.
|
| Of course not, there's no _need_ for any type at all, you
| can do everything with just the humble byte.
|
| > The same thing for positive numbers
|
| No?
|
| > You can and should do everything with signed integers
|
| You really should not. If a value should not have
| negative values, then making it so it _can not_ have
| negative values is strictly better than the alternative.
| Making invalid values impossible makes software clearer
| and more reliable.
|
| > except bitfields, of course.
|
| There's no more justification for that than for the other
| things you object to.
| enriquto wrote:
               | Well, you and I are different people and we don't have
               | to agree on everything. In this case, it seems that we
               | don't agree on _anything_. But it's still OK, if it
               | works for you ;)
| dkersten wrote:
| > No need to enforce your semantics through type
|
| Maybe I'm spoiled by other languages with more powerful
| type systems, but this is exactly what I want my types to
| do! Isn't this why we have type traits and concepts and
| whatnot in C++ now? If not for semantics, why have types
| at all, the compiler could figure out what amount of
| bytes it needs to store my data in, after all.
|
| I use types for two things: to map semantics to hardware
| (if memory or performance optimization are important,
| which is rare) and to enforce correctness in my code.
| You're telling me that the latter is not a valid use of
| types and I say that's the single-biggest reason I use
| statically typed languages over dynamically typed
| languages, when I do so.
|
| But even if that's not the case, why would I use a more
| general type than I need, when I know the constraints of
| my code? If I know that negative values are not
| semantically valid, why not use a type that doesn't allow
| those? What benefit would I get from not doing that? I
| mean, why do we have different sizes of integers when all
| the possible ones I could want can be represented as a
| machine-native size and I can enforce size constraints in
               | software instead? We could also just use doubles for all
| numbers, like some languages do.
| enriquto wrote:
| Would you really write a function find_prime_factors()
| that takes an input of type "integer" and an output of
| type "prime", that you have previously defined? Then if
| you want to sum or multiply such primes you have to cast
| them back to integers. Maybe it makes sense for you, but
| for me this is the textbook example of useless over-
| engineering.
|
| The same ugliness occurs when using unsigned types to
| store values that happen to be positive. Well, in that
| case it is even worse, because it is incomplete and
| asymmetric. What's so special about the lower bound of
| the possible set of values? If it's an index to an array
| of length N, you'll surely want an integer type whose
| values cannot exceed N. And this is a can of worms that I
| prefer not to open...
| dkersten wrote:
| > Would you really write a function find_prime_factors()
| that takes an input of type "integer" and an output of
| type "prime", that you have previously defined?
|
               | If the language allows me to and it's an important
| semantic part of my program, then yes. The same way as I
| would create types for units that need conversion.
|
| Unless I'm writing low level performance sensitive code,
| yes, I want to encode as much of my semantics as I can,
| so that I can catch mistakes and mismatches at compile
| time, make sure units get properly converted and whatnot.
|
| > What's so special about the lower bound of the possible
| set of values?
|
| Nothing, I would encode a range if I can. But many things
| don't have a knowable upper-bound but do have a lower
| bound at zero: you can't have a negative size (for most
| definitions of size), usually when you have a count of
| things you don't have negatives, you know that a
| dynamically sized array can never have an element index
| less than 0, but you may not know the upper bound.
|
| Also, the language has limitations, so I have to work
| within them. I don't understand your objection for using
| what is available to make sure software is correct. Also,
| remember that many of the security bugs we've seen in
| recent years came about because of C not being great at
| enforcing constraints. Are you really suggesting not to
| even try?
|
| > And this is a can of worms that I prefer not to open...
|
| And yet many languages do and even C++20 is introducing
| ranges which kind of sort of fall into this space.
| giomasce wrote:
               | To me it could totally make sense. It depends on the
               | context, but I can easily see contexts where such a
               | choice would pay off. For example, in principle it would
               | make sense, for an RSA implementation, to allow
               | constructing a PublicKey type only from the product of
               | two Primes, and not of two arbitrary numbers. And the
               | Prime type would only be constructible by procedures
               | that provably (perhaps with high probability) generate
               | an actual prime number. It would be a totally sensible
               | form of defensive programming. You don't want to screw
               | up your key generation algorithm, so it makes sense to
               | have your compiler help you not construct keys from
               | just anything.
|
| For the same reason, say, in an HTTP server I could store
| a request as a char* or std::string, but I would
| definitely create a class that ensures, upon
| construction, that the request is valid and legitimate.
| Code that processes the request would accept HTTPRequest,
| but not char*, so that unverified requests cannot even
               | risk crossing the trust boundary.
| UncleMeat wrote:
| But "unsigned" doesn't actually enforce the semantics you
| want. Missing an overflow check means your value will
| never be negative, but it is almost certainly still a
| bug. And because unsigned overflow is defined, the
| compiler isn't allowed to prevent you from doing it!
|
| This is just enough type semantics to injure oneself.
| dkersten wrote:
               | So, because it's not perfect, should you throw it all out?
| jcelerier wrote:
| > Maybe I'm spoiled by other languages with more powerful
| type systems, but this is exactly what I want my types to
| do! Isn't this why we have type traits and concepts and
| whatnot in C++ now? If not for semantics, why have types
| at all, the compiler could figure out what amount of
| bytes it needs to store my data in, after all.
|
| yes, but understand that, despite the name, what unsigned
| models in C / C++ is not "positive numbers" but "modulo
| 2^N" arithmetic (while signed models the usual
| arithmetic).
|
| There is no good type that says "always positive" by
| default in C or C++ - any type which gives you an
| infinite loop if you do
               |     for ({int,unsigned,whatever} i = 0; i < n - 1; i++) {
               |         // oops, n was zero, n - 1 is 4.something billion,
               |         // see you tomorrow
               |     }
|
| is _not_ a good type.
|
| If you want a "always positive" type use some safe_int
| template such as https://github.com/dcleblanc/SafeInt -
| here if you do "x - y" and the result of the computation
| should be negative, then you'll get the rightful runtime
| error that you want, not some arbitrarily high and
| incorrect number
|
| The correct uses of unsigned are for instance for
| computations of hashes, crypto algorithms, random number
| generation, etc... as those are in general defined in
| modular arithmetic
| oddthink wrote:
| +1 for this. I was just bitten by this last week, when I
| switched from using a custom container where size() was
| an int to a std::vector where size() is size_t.
|
               | The code was check-all-pairs, e.g.
               |
               |     for (int i = 0; i < container.size() - 1; ++i) {
               |         for (int j = i + 1; j < container.size(); ++j) {
               |             stuff(container[i], container[j]);
               |         }
               |     }
|
| Which worked just fine for int size, but failed
| spectacularly for size_t size when size==0.
|
| I totally should have caught that one, but I just
| couldn't see it until someone else pointed it out. And
| then it was obvious, like many bugs.
| jcelerier wrote:
| I recommend using -fsanitize=undefined -fsanitize=integer
| if you can build with clang - it will print a warning
| when an unsigned int underflows which catches a
| terrifying amount of similar bugs the first time it is
| run (there are a lot of false positives in hash
             | functions etc., though; imho it's still well worth using
| regularly)
| UncleMeat wrote:
| If negative doesn't make sense then you are saving one bit
| using this method, but introducing a _ton_ of fun footguns
| involving things like conversions. Further, the compiler
| cannot assume no overflowing and must now do extra work to
| handle those cases in conforming fashion, even if your
           | value width doesn't match the CPU width. This can make
| your code slower!
| chrchang523 wrote:
| Also, go to the Compiler Explorer and compare the generated
| code for C++ "num / 2" when num is an int, and when num is
| an unsigned int.
|
| While there are a few cases where the compiler tends to do
| a better job of optimizing signed ints than unsigned ints
| (generally by exploiting the fact that signed integer
| overflow is undefined), they are not as fundamental as "num
| / 2". Being forced to write "num >> 1" all over the place
| whenever I care about performance is basically a
| dealbreaker for me in many projects; and I haven't even
| gotten into the additional safety issues introduced by
| undefined overflow.
| adrian_b wrote:
| While I also like to use unsigned numbers when that is the
| correct type of a variable, the C language does not really
| have support for unsigned integers.
|
| As someone else already said, the so called "unsigned"
| integers in C are in fact remainders modulo 2^N, not
| unsigned integers.
|
           | While the sum and the product of 2 unsigned integers are
           | also unsigned integers, the difference of 2 unsigned
| integers is a signed integer.
|
| The best behavior for a programming language would be to
| define correctly the type of the difference of 2 unsigned
| integers and the second best behavior would be to specify
| that the type of the result is unsigned, but to insert
| automatically checks for out-of-domain results, to detect
| the negative results.
|
| As C does not implement any of these behaviors, whenever
| using unsigned integers you must either not use subtraction
| or always check for negative results, unless it is possible
| to always guarantee that negative results cannot happen.
|
| This is a source of frequent errors in C when unsigned
| integers are used.
|
| The remainders modulo 2^N can be very useful, so an ideal
| programming language would support signed integers,
| unsigned integers and modular numbers.
| RMPR wrote:
| What a coincidence this gets posted today, I posted something[0]
| a couple of hours ago about how specifically a combination of
| these rules can bite you very hard.
|
| 0: https://rmpr.xyz/Integers-in-C/
| MaxBarraclough wrote:
| [Dons language lawyer hat]
|
| > floating-point number types will not be discussed at all,
| because that mostly deals with how to analyze and handle
| approximation errors that stem from rounding. By contrast,
| integer math is a foundation of programming and computer science,
| and all calculations are always exact in theory (ignoring
         | implementation issues like overflow).
|
| Integer overflow is no mere implementation issue, any more than
| errors are an implementation issue with floating-point.
|
| > Unqualified char may be signed or unsigned, which is
| implementation-defined.
|
| > Unqualified short, int, long, and long long are signed. Adding
| the unsigned keyword makes them unsigned.
|
         | There's an additional point here that's not mentioned: _char_,
         | _signed char_, and _unsigned char_ are distinct types, but
         | that's only true of _char_. That is, _signed int_ describes the
         | same type as _int_. You can see this using the _std::is_same_
         | type-trait with a conforming compiler. Whether _char_ behaves
         | like a signed integer type or an unsigned integer type depends
         | on the platform.
|
| > Signed numbers may be encoded in binary as two's complement,
| ones' complement, or sign-magnitude; this is implementation-
| defined.
|
| This is no longer true of C++. As of C++20, signed integer types
| are defined to use two's complement. [0] I don't think C intends
| to do the same.
|
| > Character literals (in single quotes) have the type (signed)
| int in C, but (signed or unsigned) char in C++.
|
         | That's not correct. In C++, the type of a character literal is
         | simply _char_, never _signed char_ nor _unsigned char_. As I
         | mentioned above, whether _char_ is signed depends on the
         | platform, but it's always a distinct type.
|
| > Signed division can overflow - e.g. INT_MIN / -1.
|
| This isn't just overflow, it's undefined behaviour.
|
| > Counting down
|
| > Whereas an unsigned counter would require code like:
|
| > for (unsigned int i = len; i > 0; i--) { process(array[i - 1]);
| }
|
| That's one solution, but it might be a good place for a _do
| /while_ loop.
|
| [0] https://stackoverflow.com/q/57363324/
| rualca wrote:
| > This is no longer true of C++. As of C++20, signed integer
| types are defined to use two's complement. [0] I don't think C
| intends to do the same.
|
| As no good language lawyer discussion should be free from
| pedantry, there is no such thing as "As of C++20". C++20 is
| just a new version of the C++ standard. Projects that target
| C++11 or C++14 or C++17 are all still here and won't go away
| any time soon, and the respective C++ rule still apply to them.
| Passing a new revision of the C++ standard changes nothing with
| regards to which rules actually apply to those projects, unless
| project maintainers explicitly decide to migrate their
| projects.
| jcelerier wrote:
| > C++20 is just a new version of the C++ standard.
|
| and per ISO rules, older versions are withdrawn (as can be
| confirmed for C++ here:
| https://www.iso.org/standard/79358.html) and not to be used
           | anymore: https://www.iso.org/files/live/sites/isoorg/files/store/en/P...
           |
           | > Other reasons why a committee may decide to propose a
           | > standard for withdrawal include the following:
           | > - the standard does not reflect current practice or research
           | > - it is not suitable for new and existing applications
           | >   (products, systems or processes)
           | > - it is not compatible with current views and expectations
           | >   regarding quality, safety and the environment
| 0xffff2 wrote:
| Wow. Didn't know this. It doesn't have any bearing
| whatsoever on reality though. If it did, I wouldn't still
| be writing C++98 conformant C++.
| jcelerier wrote:
             | well, your code is nonstandard, that is all, just like a
             | house with power plugs installed 30 years ago is not
             | standard, even if it "works"
| dataflow wrote:
| >> Character literals (in single quotes) have the type (signed)
| int in C, but (signed or unsigned) char in C++.
|
| > That's not correct. In C++, the type of a character literal
| is simply char, never signed char nor unsigned char.
|
| I'd assume the author meant (signed `char` | unsigned `char`)
| rather than (`signed char` | `unsigned char`).
| focus2020 wrote:
| What is the reference to "Dons"?
| MaxBarraclough wrote:
| _Don_ is a somewhat uncommon verb, _To put on clothing_.
| https://en.wiktionary.org/wiki/don#Verb
| saagarjha wrote:
| Signed overflow is undefined behavior.
| quietbritishjim wrote:
| That seems to be exactly what the parent comment said.
| MaxBarraclough wrote:
| I think saagarjha's point was that the article already
| points out that signed overflow causes undefined behaviour.
| That's true, but I think it still bears emphasising that
| (INT_MIN / -1) causes undefined behaviour.
| quietbritishjim wrote:
| > char, signed char, and unsigned char are distinct types, but
| that's only true of char.
|
| That's correct, I was going to bring that up too.
|
| This is particularly important because char and unsigned char
         | are special in that they are an exception to the aliasing
         | rules. That is, in this code:
         |
         |     float foo(char* cp, float* fp) {
         |         *fp = 7;
         |         return *(float*)cp;
         |     }
         |     /* ... */
         |     float f = 2;
         |     float g = foo((char*)&f, &f);
|
| Then g should end up equal to 7. That's true even if you change
| the type of the cp parameter to const char*! If you change
| "char" to "unsigned char" in both places then its behaviour
| stays the same, but if you change it to "signed char" in both
| places then it has undefined behaviour (if I've remembered
         | everything correctly). Now that I think about it, this conflation of
| char's use in the C standard has probably prevented a lot of
| optimisations where code was just using char* for strings
| rather than for potential aliasing.
|
| Another point, which is very related, is that uint8_t and
| int8_t do not necessarily have to be a typedef for unsigned
| char / signed char or char, even if char is 8 bits wide. So you
| could end up with (at least) 5 types that are 8-bit wide!
|
| Combined with the above aliasing rules _only_ applying to char
| and unsigned char, that means you cannot reliably expect
| uint8_t to have that aliasing exception. Indeed, gcc originally
| made a new type of uint8_t and int8_t but that caused so many
| bugs that they ended up switching them to unsigned char and
| char (and I think Visual Studio has always done so).
|
| > > Character literals (in single quotes) have the type
| (signed) int in C, but (signed or unsigned) char in C++.
|
| > That's not correct. In C++, the type of a character literal
| is simply char, never signed char nor unsigned char.
|
| I was going to bring this up too, although I wouldn't quite say
| it's outright incorrect because I'm not sure they were making
| the claim you think they were - it could be interpreted to mean
| that it's always char in C++ but by the way don't forget that
| could be a signed or unsigned type (note the lack of monospace
| font for their use of "signed" and "unsigned"). But probably
| best not to overanalyse it since they probably didn't know the
         | types were distinct - the main thing is to reiterate, as you've
| done, that it's always `char` regardless of whether that's
| signed or unsigned.
| logicchains wrote:
| >So you could end up with (at least) 5 types that are 8-bit
| wide!
|
| Don't forget std::byte.
| lifthrasiir wrote:
| It is not a full-featured arithmetic type though. It
| doesn't implement operator+/-/* etc.
| criddell wrote:
| This is one of the misconceptions:
|
| > sizeof(T) represents the number of 8-bit bytes (octets) needed
| to store a variable of type T.
|
| That's a misconception I had and I've never run into a problem.
| What's a platform where sizeof works differently?
|
| Also, what's the reasoning for sizeof to be an operator rather
| than a function?
| GlitchMr wrote:
         | See https://stackoverflow.com/questions/2098149/what-platforms-h....
         | As for `sizeof` being an operator, well, C
| doesn't have generics, so it has no choice but to make `sizeof`
| somehow special.
|
| If you don't want to bother supporting platforms where byte is
| not 8-bit (a reasonable choice I would say), use
| `int8_t`/`uint8_t` instead. Those types won't exist on
| platforms that don't have 8-bit bytes.
| MaxBarraclough wrote:
| > C doesn't have generics, so it has no choice but to make
| `sizeof` somehow special
|
| It could have used a different syntax though. Ada has a
| special syntax for compile-time inquiries like this, so
| there's no way to confuse them with function calls. Ada calls
| these _attributes_.
|
           | https://en.wikibooks.org/wiki/Ada_Programming/Attributes#Lan...
| masklinn wrote:
| > If you don't want to bother supporting platforms where byte
| is not 8-bit (a reasonable choice I would say), use
| `int8_t`/`uint8_t` instead. Those types won't exist on
| platforms that don't have 8-bit bytes.
|
| You'll have the issue that, as one of the commenters
| explained above, `char` is its own thing, independent and
| separate from `signed char` and `unsigned char` to say
| nothing of `int8_t` and `uint8_t`. This means that while you
| can use your own thing for your own functions you can _not_
| do so if your values have to interact with libc functions (or
| most of the ecosystem at large).
|
| If you only want to support platforms using 8-bit chars, you
| should check CHAR_BIT. That is actually reliable and correct.
| dahfizz wrote:
| One thing I think should have been mentioned: size_t is
| guaranteed to be large enough to index all of memory, which is
         | why it is the return type of sizeof.
| dusanz wrote:
| size_t is only guaranteed to be large enough to store the size
| of the largest object. This is not the same as being able to
| index all of memory. You could imagine a platform with
         | restricted contiguous allocation sizes where the maximum object
| size is smaller than the size of the address space.
| flohofwoe wrote:
| In the myths section:
|
| > char is always 8 bits wide. int is always 32 bits wide
|
| > Signed overflow is guaranteed to be wrap around. (e.g. INT_MAX
| + 1 == INT_MIN.)
|
| Are there any current, relevant hardware architectures where this
| is not true (e.g. bytes are not 8 bits, and integers are not 2's
| complement)?
|
| E.g. what's the point of "portability" if there is no physical
| hardware around anymore where those restrictions would apply?
| AshamedCaptain wrote:
| Remember to add: that can actually run standard C++ (i.e. with
| exceptions)?
|
| Certainly you can find an architecture which may run some type
| of C-like language with strange arithmetic rules (e.g. DSPs). I
| would bet it's harder to find one such architecture where one
| can run standard C, and impossible to find one which can run
| standard C++.
| lultimouomo wrote:
| This. I don't understand why everyone must suffer the pain of
| the possibility of weird char widths instead of just settling
| on using a non-standard C in a bunch of DSPs. It's not like
| you're going to link a bunch of regular run of the mill C
| libraries on them anyway.
| not_knuth wrote:
| Isn't this an artifact of the age of C? When it was first
| created it was a major concern to support every
| architecture, so they put it in the standard. I don't think
| anyone has wanted to go through the pain of removing it
| ever since.
|
| After all, who are language nerds to dictate chip
| manufacturers what the ISA should look like? :P
|
| And it was only in the last 2 decades that everything got
| dominated by x86...
| [deleted]
| beeforpork wrote:
| This is the trap with 'undefined behaviour': it has nothing to
| do with portability, but it is a language level definition.
|
         | I.e., if the C std says something is 'undefined', avoiding it
         | is not a matter of portability (hardware, assembler): it must
         | not be used at all, end of story. The portability stuff is called
| 'implementation defined' in C, not 'undefined behaviour'. The
| problem is that the compiler can (and will!) exploit undefined
| behaviour rules. E.g., the following code is officially broken
| (and not just on weird hardware, but everywhere, as defined by
         | the C std):
         |
         |     int saturated_increment(int i)
         |     {
         |         if ((i + 1) < i) { /* if it overflows, do not inc */
         |             return i;
         |         }
         |         return i + 1;
         |     }
|
| The compiler may (and many will) remove the whole if() block,
| because i+1<i is trivially false, because int cannot overflow
| (says the C standard).
|
| As one can imagine, when compilers started exploiting this, a
| lot of discussion about sensibility followed. And gcc added
| -fwrapv among other things.
|
         | (And the code would be fine if 'unsigned' were used instead of
| 'int', because this is only a problem of signed ints.)
| dkersten wrote:
| "Undefined behavior" really means that the standard doesn't
| define what should happen and that the compiler is therefore
| free to do whatever it pleases, under the assumption that
| such code will never occur.
|
| Reminds me of the examples where the code gets compiled in a
| way where a branch that returns from the function is
| unintuitively always taken because the compiler was able to
| detect that there is undefined behavior later in the function
| and since undefined behavior isn't legal, it assumed that it
| therefore can never reach there, so the branch must always
| get taken and the actual condition check got optimised away
| (IIRC).
|
| So yeah, undefined behavior isn't "implementation defined"
| nor "unportable" but rather "illegal not allowed wrong code".
| MaxBarraclough wrote:
| > So yeah, undefined behavior isn't "implementation
| defined" nor "unportable" but rather "illegal not allowed
| wrong code".
|
| There are edge-cases even there. Calling a function
| generated by a JIT compiler is undefined behaviour, but
| there's a gentleman's agreement that the compiler won't
| screw it up for you.
|
| Almost all C/C++ compilers promise that floating-point
| division-by-zero results in NaN (the IEEE 754 behaviour),
| but according to the C/C++ standards themselves, it's
| undefined behaviour.
|
| You're right though that in general, one should not be
| complacent about UB.
| giomasce wrote:
| > There are edge-cases even there. Calling a function
| generated by a JIT compiler is undefined behaviour, but
| there's a gentleman's agreement that the compiler won't
| screw it up for you.
|
| Though you're not writing C/C++ in that case. You're
| writing "C/C++ for that particular architecture, ABI, OS
| and compiler".
|
             | In general C/C++, if your code is correct, every present
| and future, known and unknown compiler is supposed to
| generate a correct executable. If they don't, they have a
             | bug. You can pretend to be smarter and rely on UB, but then
             | the responsibility shifts to you: you have (in principle)
             | to validate each compiler and environment, and you can
             | blame the resulting bugs on nobody but yourself.
| MaxBarraclough wrote:
| Sounds right. If you're doing floating-point work it's
| not generally a problem to assume that division by zero
| will result in NaN. Virtually all C and C++ compilers
| commit to this behaviour in the name of IEEE 754
| compliance (even if the IEEE 754 compliance is
| incomplete).
| gallier2 wrote:
| DSP's have often uncommon sizes. tms320c5502 for example has
| following sizes: char-- 16 bits short --16 bits int --16 bits
| long-- 32 bits long long -- 40 bits float-- 32 bits double --
| 64 bits
| rocqua wrote:
| > 40 bits float-- 32 bits double
|
| Isn't double required to have more precision than float?
| ericbarrett wrote:
| I think their formatting got swallowed by HN:
|
| char-- 16 bits
|
| short --16 bits
|
| int --16 bits
|
| long-- 32 bits
|
| long long -- 40 bits
|
| float-- 32 bits
|
| double -- 64 bits
| Asraelite wrote:
| > long long -- 40 bits
|
| Isn't this in direct contradiction to what the article says?
|
| > long long: At least 64 bits, and at least as wide as long.
| brandmeyer wrote:
| Indeed. The C28x line by the same company shares CHAR_BIT ==
| 16 with C55. C28x is quite popular in power electronics
| applications.
|
         | "Relevant" is in the eye of the beholder, and it's all too
| easy to no-true-scotsman your way out of existing
| architectures. I claim that both of these architectures are
| relevant by virtue of suppliers continuing to make new chips
| that use them, and system builders continuing to select those
| chips in new products.
| jeffbee wrote:
| There are loads of DSPs, MCUs, and other non-PC junk where
         | CHAR_BIT is not 8. For example, on the SHARC, CHAR_BIT is 32:
         | absolutely every type is 32 bits wide.
| beeforpork wrote:
| Those are two different things:
|
| For 'char has 8 bits': the bitwidth is 'implementation defined'
| in C. If you know your target architectures, you can assume
| it's 8 bits, because it's indeed a question of portability.
|
| For 'int must not overflow': this is 'undefined behaviour' in
| C. You must not do it, regardless of what you know about your
| target architectures, because this is a language level
| prohibition.
| amelius wrote:
| On ARM, char is always unsigned, whereas on Intel it's usually
| signed. This silly inconsistency broke a lot of code.
| pornel wrote:
| C code doesn't just run on the architecture you compile for. It
| first "runs" on a C virtual machine simulated by the optimizer.
| This low-level virtual machine (you may call it LLVM) usually
| implements signed overflow by deleting the code that caused it.
| jstimpfle wrote:
| I heard that one case where defining int-overflow as wrapping
| would be very bad for performance is pointer arithmetic - e.g.
         | offsetting a pointer by i times sizeof(type). I think the x64
| instruction "lea" accomplishes this. If this instruction is
| used, it is impossible to simulate 32-bit 2's complement
| overflow by just discarding the upper 32 bits of a 64-bit
| integer.
|
| So the UB that is associated with overflowing an int is
| required to efficiently compile loops that use a counter `int
| i` to index an array. There is a huge number of these loops in
| the wild.
|
| This problem might be just some unfortunate coincidence with
| how array indexing is defined in C. I don't understand this
| deeply, but just wanted to bring it up. I believe I read this
| on Fabian Giesen's blog.
| simiones wrote:
| The part about lea doesn't seem especially convincing: it's
| not hard to imagine that pointer arithmetic could be defined
| such that overflow is still UB, while allowing regular signed
| integer arithmetic to overflow safely.
| jstimpfle wrote:
| I can't say much about this. What I know is that in C,
| pointer arithmetic is defined in terms of "normal"
| arithmetic. p[i] is defined as *(p + i). And (p + i) means
| to offset p by (i * sizeof *p), and that multiplication is
| computed as the type of i (e.g. (32-bit) int or even
| smaller type).
| simiones wrote:
| That multiplication is entirely implicit, so there is no
| reason the compiler needs to handle it the same as it
| handles an explicit multiplication. Given that `p + i` is
| obviously not an integer addition and it already has much
| more UB than `i + j`, there is no reason why `i + j`
| having defined overflow rules needs to mean `p + i` also
| has them. Just like `i + j` is safe for any small enough
| i and j, p + i is only meaningful if it points within
| the same object as p (to be fair, it's not UB to compute
| p + i for any i, it's UB to use the value).
| brandmeyer wrote:
| I think the nasty cases are in supporting subregister-sized
| arithmetic. ARMv8 can perform almost any integer operation on
| its registers either as 64-bit or 32-bit registers.
|
| The classic RISC machines could only perform full-register
| arithmetic. RISC-V has a small handful of instructions that
| can accelerate signed subregister arithmetic, but none that
| accelerate unsigned subregister arithmetic. So, if you need a
| 32-bit unsigned integer operation to guarantee wrap-around
| behavior on 64-bit RISC-V, the compiler may have to insert
| additional zero-extension instruction sequences if it cannot
| prove the absence of overflow.
| tsimionescu wrote:
| > Are there any current, relevant hardware architectures where
| this is not true (e.g. bytes are not 8 bits, and integers are
| not 2's complement)?
|
| For char, not sure, but the problem with signed overflow is not
| that you can't be sure whether it's 2's complement, it's that
| the compiler is allowed to assume it won't happen. So, if you
| read two numbers into 2 ints and add them up, then check for
| overflow somehow, the compiler will just remove your check
| while optimizing, since integer addition can't overflow in a
| valid program.
| SAI_Peregrinus wrote:
| And the compiler is allowed to remove the check, *even if not
| optimizing*. -O0 doesn't guarantee it'll be kept.
|
| In practice compilers should be considered to follow Murphy's
| Law: Undefined Behavior will work perfectly fine on a
| developer's machine or when observed by any support or QA
| staff, but will occasionally cause intermittent problems on
| production machines when observed by users or during
| demonstrations to executives.
| nemetroid wrote:
| Signed integers wrapping and 2's complement are separate
| issues. C++20 specifies that signed integers are 2's complement,
| but signed overflow is still undefined.
| dvfjsdhgfv wrote:
| > Python only has one integer type, which is a signed bigint.
| Compared to C/C++, this renders moot all discussions about bit
| widths, signedness, and conversions - one type rules all the
| code. But the price to pay includes slow execution and
| inconsistent memory usage.
|
| Well, the beauty of C is that you can have that too, if you wish,
| and you have many options to choose from.
| dreinhardt wrote:
| It's funny that you would end up with a similar conclusion for
| other parts of the language (e.g. operators) as well. Just a
| gigantic set of inane rules everywhere causing you to constantly
| be in danger of introducing bugs and portability issues.
| bregma wrote:
| It's discouraging. If the language requires you to actually know
| what you're doing you can't hire dirt-cheap easily-replaced
| code monkeys to bang out your ideas and the end result is you
| get to keep less of the investors' money for yourself.
| AndriyKunitsyn wrote:
| It can feel good to imagine yourself an enlightened master
| among code monkeys, yet in practice everybody can be a code
| monkey sometimes, and when this happens in C/C++, it will
| leave a ticking time bomb in the codebase that will lie
| there until a customer blows up on it, no matter how many
| millions went into QA of the product.
|
| And in practice, C/C++ developers are among the lower-paid
| programmers - probably because "banging out ideas" and
| producing actual programs that actually work, are valued more
| than language elitism.
| bigcorp-slave wrote:
| It's actually not true at all that C++ developers are lower
| paid. Rather, their pay is highly bimodal. Most work at all
| FAANGMULA companies is C++.
| Joker_vD wrote:
| But! And that's important -- it allows for great performance,
| so you can make ten/hundred times more mistakes per second than
| in other, "safer" languages.
| tammerk wrote:
| Nowadays, it doesn't provide much of a performance gain. I
| didn't live through those days, but maybe it was important
| for performance back in the 70s/80s/90s even though it was
| risky? E.g. null-terminated strings were chosen due to
| their low space overhead.
| creato wrote:
| It depends on what you are doing. For some kinds of
| programs, C/C++ are going to be much faster than most
| "modern" languages.
| nicoburns wrote:
| Most, but not all. Languages like Rust and Zig show that
| you can have the performance without the landmines.
| hajile wrote:
| Also, theoretical performance is overrated. Almost all
| the things that lend themselves to speed make code
| brittle and incapable of future modification.
|
| Once you've got your C code doing safety checks with data
| types that won't break under the slightest change, the
| code becomes much slower than code golf would suggest. A
| common example is passing void pointers everywhere. You
| either check every call every time (aka dynamic typing)
| or trust everything to the idea that the programmer
| understands the system completely and never forgets or
| messes up. Better types give you all the speed AND all
| the safety here.
| tammerk wrote:
| I didn't mean C is not fast or not faster than other
| languages. It's still the fastest one I believe.
|
| What I meant is undefined behaviors allow compilers to
| optimize in a way that would not be possible otherwise.
| So, it might be a deliberate decision back then, to
| leverage performance. I don't know, just an idea.
| jjgreen wrote:
| It used to be "folk knowledge" that only Fortran and
| hand-crafted ASM were faster. Not sure if that's still
| (or ever was) true.
| [deleted]
| hajile wrote:
| I guess it was maybe true at one time.
|
| http://www.catb.org/jargon/html/story-of-mel.html
| lifthrasiir wrote:
| > it allows for great performance, so you can make
| ten/hundred times more mistakes per second than in other,
| "safer" languages.
|
| This is false. For a long time C performance used to be
| inferior to Fortran, which is arguably safer than C. It's
| hilarious that strict aliasing and the `restrict` keyword were
| born out of making C on par with Fortran and UB became a
| major issue to C programmers as a result!
| atkwarriors wrote:
| Yes, that's why C has undefined behavior. Absolutely
| RMPR wrote:
| It's a feature, not a bug.
| tammerk wrote:
| It's even funnier that although the language is full of traps,
| in practice it works quite well. I don't think any C developer
| (or let's say 95% of them) knows all the rules mentioned in the
| article, yet we are still in one piece.
|
| Does anybody know any paper for bugs per lines of code for
| different languages or something similar?
___________________________________________________________________
(page generated 2021-04-02 23:01 UTC)