[HN Gopher] Choosing the Right Integers
___________________________________________________________________
Choosing the Right Integers
Author : xmhidalgo
Score : 53 points
Date : 2022-04-17 15:50 UTC (7 hours ago)
(HTM) web link (www.thecodedmessage.com)
(TXT) w3m dump (www.thecodedmessage.com)
| jandrewrogers wrote:
| I am generally in the camp of maximally explicit and consistent
| semantics for integer types. The old C-style types (short, int,
| etc.) and related promotion rules were the source of numerous bugs
| in practice, particularly when porting code across architectures
| with different type models. While not convenient in C, in C++
| there is almost unlimited ability to define new families of
| integer types that follow their own rules.
|
| I don't follow the argument for using signed integers outside of
| contexts where a negative value is valid. In code bases that
| always use unsigned integers for values that cannot be negative
| (common in C++), the MAX value is treated as a sentinel. This
| extends the range of valid values versus signed types, which is
| sometimes useful, and there will never be undefined behavior
| (decrementing zero generates the sentinel, for example).
|
| The infinite loop example often given for why to use signed
| integers doesn't happen in practice in "default unsigned" code
| bases because it is not an idiomatic expression of that case,
| and the case will come up even in code bases that are default
| signed. Writing loops like this is often a code smell.
| andrepd wrote:
| How will there never be undefined behaviour?
| jandrewrogers wrote:
| I meant in the sense that unsigned integers have C-specified
| defined behaviors for conditions like overflow that are
| undefined for signed integers. That said, if you bother to
| create these integer types (e.g. the common MAX sentinel
| idiom), their behavior is often fully defined. If you go
| crazy with metaprogramming, type inference, etc in recent
| C++, you can often verify at compile-time that these
| conditions will not occur across arithmetic operations,
| though this is a lot less trivial.
| Jensson wrote:
| > I am generally in the camp of maximally explicit and
| consistent semantics for integer types
|
| Size 3 minus size 4 being 4 billion is not correct type
| semantics, though. If unsigned integers crashed the program on
| those operations, or returned something akin to NaN, it would
| be correct. That makes signed integers more correct for sizes:
| when a signed size goes negative, you try to allocate a
| negative-length array, which fails and crashes, and that is
| what you want. Do the same thing with unsigned integers and
| you get a very small array, and when you try to write values
| to that array you write them to arbitrary places in memory,
| causing memory corruption, which is much worse than any issue
| you get with signed integers.
|
| So the reason to use signed everywhere is that it reduces the
| mental overhead, and it is safer: you now only need to learn a
| single integer math system and don't have to worry about all
| the issues that come with casting unsigned and signed values
| to each other. Unsigned only makes sense if you explicitly want
| the overflow mathematics of unsigned, you don't want overflow
| math for array sizes though so they should be signed in any
| sane type system.
| jandrewrogers wrote:
| A negative size is generally not a valid type instance
| regardless. In well-behaved code I would expect this to be an
| unreachable condition because you are explicitly checking for
| sensible inputs _before_ doing this type of arithmetic. It
| makes no sense to check for sensible inputs _after_ you do
| the arithmetic, that is the same amount of code done in the
| wrong order. It is not safer to use signed in this case nor
| does it decrease mental overhead.
|
| Assuming an integer is signed can be much _less_ safe in
| practice because there is often no guarantee that an integer
| type is always signed or unsigned in a given context.
| Typedefs are a ubiquitous thing, and you don't always
| control those definitions. In code designed to be robust and
| maintainable, it is usually idiomatic to always implement
| these things in a way that is correct regardless of
| signedness, unless that signedness is an immutable contract
| (e.g. some syscall stuff), because whether or not a type is
| signed can sometimes change unexpectedly.
| MauranKilom wrote:
| > In well-behaved code I would expect this to be an
| unreachable condition because you are explicitly checking
| for sensible inputs before doing this type of arithmetic
|
| You previously claimed "there will never be undefined
| behavior [by using unsigned types]", but now you also
| require that the UB scenario is checked beforehand? I'm
| confused how you conclude unsigned types to be safer than
| signed ones ("no UB") when you require UB be ruled out in
| the first place.
|
| Unless you explicitly intend for wraparound to happen, your
| program has a bug on overflow regardless of signedness. You
| can (and possibly should) guard against the situation, but
| that favors neither side.
|
| > Assuming an integer is signed can be much less safe in
| practice because there is often no guarantee that an
| integer type is always signed or unsigned in a given
| context. [...]
|
| That is not an argument for or against using signed types.
| If types change from under you, you have a host of other
| issues anyway (as you say). Mixed sign comparisons,
| unexpected promotions, overflows... You get the exact same
| issues whether you yourself favor signed or unsigned types
| when mixing them with (what are, in your description)
| effectively unknown types. So again, your argument against
| signed types appears to be "you have to be careful either
| way", which gives zero reasons to favor unsigned over
| signed types.
| xscott wrote:
| > [unsigned] extends the range of valid values versus signed
| types, which is sometimes useful
|
| When is it useful?
|
| On a 32 bit platform, for file offsets, maybe that last bit was
| useful for a few years. But as soon as disks got bigger than
| 4GB, you needed to switch to a 64 bit `off_t` and `lseek`
| anyways.
|
| For array indexes, on a 32 bit platform, there is really only
| one VERY CONTRIVED case where that last bit for `unsigned`
| matters: your program creates a single array of bytes greater
| than 2GB. As soon as it's an array of 2-byte shorts or anything
| larger, you never need that last bit. And the configurations
| where a 32 bit kernel lets a program use more than 2GB of
| address space were not super common - it was better to switch
| to a 64 bit platform at that point.
|
| On 64 bit platforms, you never need that last bit for array
| indexes or file offsets because no system has 10,000 petabytes
| of memory or 10,000 petabyte file sizes. And they won't for a
| while. Unless clock speeds get back on Moore's law, that for
| loop is going to take a long time to run anyways...
|
| > Writing loops like this is often a code smell.
|
| One person's code smell is another person's daily idiom. I
| think signed array indexes are like complex numbers - people
| who don't need them can't imagine why anyone else does.
| (https://github.com/golang/go/issues/19921)
|
| For me, I frequently do math on the array index. Reverse loops
| are necessary sometimes (I can give examples), but let's ignore
| that for now. How would you write this simple loop with
| unsigned loop indices?
|
|     double scale = <non-integer value>
|     for (ssize_t ii = 0; ii < len; ii++) {
|         data[ii] = sinc(scale * (ii - len/2));
|     }
|
| Note that `len/2` using integer division (truncating) is doing
| the right thing for both even and odd lengths. I really want
| that `sinc` function centered on a specific sample.
|
| If either `ii` or `len` is unsigned, C/C++ quietly does
| something awful here. Stuff like this makes me resent the STL
| for choosing size_t over ssize_t.
|
| I think it's ugly in Rust too:
|
|     let len = data.len(); // usize because that's the way
|     for ii in 0..len {
|         data[ii] = sinc(scale * (ii as isize - len as isize / 2) as f64);
|     }
|
| Thankfully the `as` operator binds pretty tightly.
|
| I don't think there are any good reasons to use unsigned
| integers for array indexes or sizes. It doesn't help with bound
| checking (it's the same assembly to check for unsigned out of
| bounds as it is to check for signed out of bounds or less than
| zero). It doesn't help with "large" arrays (except in one very
| contrived case). And it gets in the way when you need to do
| math on the indexes.
| tialaramex wrote:
| (In Rust) the ugliness for me stems from the fact we got this
| array from somewhere and now we're overwriting it, and I
| start to think probably the code should actually make the
| data structure instead, whereupon the need for these to be
| indices goes away and it's some sort of iterator into collect
| ?
| reikonomusha wrote:
| I feel like, ideally, one should never have to face the conundrum
| of choosing an integer width for general-purpose use.
|
| I _really_ like that the "normal" integers in Common Lisp are
| arbitrary length, but they're represented automatically as
| register-sized quantities when small enough (which, for typical
| code, is most of the time). That way, at least in terms of space
| efficiency, and for the most part in terms of time efficiency,
| you can have your cake and eat it too.
|
| If I really need performance, I can always declare that a
| variable will never contain a non-machine-sized integer:
|
|     (defun f (x)
|       (declare (type (unsigned-byte 64) x))
|       ...)
|
| This is me promising to Lisp that x will always be bound to an
| unsigned 64-bit integer. If I dial the optimization settings to
| high speed and low safety (which I can scope to just that file,
| or even just that function):
|
|     (declaim (optimize (speed 3) (safety 0)))
|
| then Lisp will eliminate bounds checks, bignum-promotions, etc.
| and produce the native assembly you expect.
|
| My point is that I really appreciate the "safe and general by
| default" philosophy, while still being able to opt-in to
| practical efficiency considerations when I absolutely need it.
|
| (If I'm pedantic: The above optimization behavior, while
| encouraged in various ways by the Common Lisp standard, is
| technically not _de jure_ required. But any general purpose Lisp
| compiler from the past 30 years will do these things, like SBCL.)
| cpeterso wrote:
| Here's Rob Pike's proposal to change Go's int to be arbitrary
| precision in Go 2.0. But after five years of debate about
| "performance", the proposal has stalled.
|
| https://github.com/golang/go/issues/19623
| lisper wrote:
| This should really have been called, "Choosing the right integer
| types in C/C++" rather than "Choosing the right integers".
| thesneakin wrote:
| Been thinking the same thing for just as long. "int" was supposed
| to be a full native word. Nowadays it's best to never use "int".
| cpeterso wrote:
| Google's C++ coding style recommends using bare "int" by
| default or "int64_t" if the number might be "big". It also
| recommends using signed integers over unsigned.
|
| https://google.github.io/styleguide/cppguide.html#Integer_Ty...
| Animats wrote:
| Too many years ago, I wrote something in-house, now lost, titled
| "Type integer considered harmful". I was arguing for the idea
| that everything should have an explicit range, written
| foo: 0..100;
|
| This was around the time Ada was being designed. The way I wanted
| this to work is that the size of intermediate variables was
| determined by the compiler, with the following rules:
|
| - Unless a named typed variable result will overflow, no unnamed
| intermediate variable can overflow. The compiler must size
| intermediate temporary values accordingly.
|
| - If this requires a larger intermediate variable than the
| machine can provide, a compile error is reported.
|
| The implication is that you often need longer integers than you'd
| expect. Expressions such as
|
|     x = a + b - c
|     x = (a + b) / c
|
| with signed values, all, say, 32-bit integers, can overflow in
| a+b without overflowing x. So such expressions have to be
| computed in 64 bits, then checked for overflow. This eliminates
| hidden overflow in intermediates. An expression with only one
| operator never overflows in a recoverable way, so it just has to
| be checked, not expanded.
|
| That was written in an era when there were computers in use with
| word sizes of 8, 12, 16, 18, 24, 32, 36, 48, 60, and 64 bits. So
| word size was more of a practical issue than it is now, when non
| power of 2 machines have died off. Also, machines of that era
| tended to be slower on longer values. Much slower on some
| machines which had to do double-length arithmetic in software.
| Thus, there was a performance hit for this approach, which was
| the main objection.
| [deleted]
| Someone wrote:
| > An expression with only one operator never overflows in a
| recoverable way, so it just has to be checked, not expanded
|
| (Nitpick: I think you intended to write either _always_ or
| _unrecoverable_.)
|
| _If the target CPU doesn't have status flags indicating
| overflow_, that may be more work than using a larger
| intermediate. For _a+b_, detecting overflow after the fact is
| very likely faster, but is already a bit convoluted if _a_
| and _b_ are signed and, thus, can be negative
| (https://stackoverflow.com/a/45261894).
|
| For _a×b_, https://stackoverflow.com/questions/1815367/catch-
| and-comput... has a simple algorithm, but it requires a
| division by _b_, which I don't see performing well. Possibly,
| double-size multiplication and then testing for overflow beats
| that (but if the CPU has a 'count leading zeroes' instruction,
| I would use that: https://stackoverflow.com/a/59109502. I
| haven't checked that for correctness, but even if it can't be
| made correct, it likely can provide a fast path for most
| multiplications that your program does).
|
| For CPUs with an overflow bit, the compiler can use it, but not
| all CPUs have that.
| tialaramex wrote:
| WUFFS calls that type refinement, although you are free to
| refine much larger types, e.g. you can say base.u32[0 ..= 15]
| and this is a 32-bit unsigned integer... that can't be more
| than 15.
|
| WUFFS not only refuses overflow (you get a compiler diagnostic)
| it also infers these refinements from array sizing so as to (at
| compile time) ensure bounds errors never occur. If you try to
| index into an array of 426 integers with k, WUFFS will go ahead
| and refine k to the 0..425 range.
|
| Of course Ada (and presumably your proposal) are for general
| purpose software, whereas WUFFS is specifically targeted at
| software for well, Wrangling Untrusted File Formats Safely. So
| it's fine for some things to be completely impossible in WUFFS
| (e.g. you cannot write "Hello, World" in WUFFS, because WUFFS
| intentionally has no idea what a "string" is, or what "console
| output" is).
___________________________________________________________________
(page generated 2022-04-17 23:00 UTC)