[HN Gopher] Static Integer Types
       ___________________________________________________________________
        
       Static Integer Types
        
       Author : matt_d
       Score  : 50 points
       Date   : 2021-06-29 21:44 UTC (2 days ago)
        
 (HTM) web link (tratt.net)
 (TXT) w3m dump (tratt.net)
        
       | nightpool wrote:
       | > Let's start with Rust, which offers the usize type, defining it
       | as "The pointer-sized unsigned integer type. The size of this
       | primitive is how many bytes it takes to reference any location in
       | memory." That definition is based on the assumption that a
       | machine's address and data widths are the same
       | 
       | Is this true? I don't know the history here, but just from
       | reading that definition, it seems like `usize` is tied _only_ to
       | the address width. Where does the data width come in?
        
         | magicalhippo wrote:
          | I might be wrong, but what came to mind was segmented memory,
          | i.e. something like the 8086, where registers are 16-bit but
          | (far) pointers are 16:16 (ignoring the quirk of overlapping
          | segments).
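          | 
          | For reference, a rough sketch of how a real-mode 16:16 far
          | pointer maps to a 20-bit linear address (and why different
          | segment:offset pairs can overlap):
          | 
          |     /* 8086 real mode: linear = segment * 16 + offset, so
          |      * two 16-bit registers name a 20-bit address space;
          |      * 0x1000:0x0010 and 0x1001:0x0000 are the same byte. */
          |     unsigned long linear_address(unsigned seg, unsigned off)
          |     {
          |         return (unsigned long)seg * 16u + off;
          |     }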
        
           | moring wrote:
           | That would make usize a 32-bit integer, regardless of the
           | data width.
           | 
            | It does make other hidden assumptions though, such as that
            | there is a single "memory" whose locations are uniquely
            | identified by a usize value, regardless of how they are
            | accessed. On x86, for example, the same address can refer
            | either to a location in memory or to a location in an I/O
            | device, depending on whether in/out instructions or
            | "normal" instructions are used.
        
             | magicalhippo wrote:
             | > That would make usize a 32-bit integer, regardless of the
             | data width.
             | 
             | Doh, long day at work, misinterpreted "data width".
        
         | gautamcgoel wrote:
         | Yeah I didn't understand that part either. The quote doesn't
         | mention data width at all...
        
         | fanf2 wrote:
         | Rust's usize does the jobs of both uintptr_t (a pointer-width
         | type) and size_t (a data-width type)
        
       | oj002 wrote:
       | > As far as I can see, the fast types are a failure, and the
       | least types are unused, possibly because no-one can understand
       | exactly what the latter mean.
       | 
        | I personally quite like that we have the fast/least types,
        | because they allow you to write portable code that would
        | otherwise require uintN_t types, which are only defined if the
        | platform supports that specific width.
       | 
        | Let's say you want to implement a cryptographic algorithm (these
        | often require 32/64-bit modular arithmetic) that works on any
        | conforming C compiler. uint32_t isn't guaranteed to be defined,
        | but uint_least32_t is. You can now use uint_least32_t instead of
       | uint32_t and mask out the upper bits that may exist when
       | necessary. You'd do this with e.g. UINT32_C(0xFFFFFFFF), which
       | ~~coincidentally~~ by design also has the type uint_least32_t.
       | This would result in "perfect" code gen on any platform that has
       | a 32-bit unsigned integer type (given a reasonable optimizing
       | compiler), since in that case uint_least32_t is the same type as
       | uint32_t, and otherwise it will still work.
       | 
       | The fast types could be used for something similar, if they were
        | properly defined. But C has a history of compilers implementing
        | features not as intended; I'm looking at you, rand(): "The C89
        | Committee decided that an implementation should be allowed to
        | provide a rand function which generates the best random sequence
        | possible in that implementation, and therefore mandated no
        | standard algorithm" (C99 Rationale).
       | 
       | Edit: Note that you also need to make sure that the variables
       | aren't promoted to signed integers, e.g. x << 3 should be 1u*x <<
       | 3.
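        | 
        | Roughly the pattern I have in mind, as a sketch (assuming a
        | conforming C99 implementation; the helper names are my own):
        | 
        |     #include <stdint.h>
        |     
        |     /* 32-bit modular arithmetic using only uint_least32_t,
        |      * which every conforming C99 compiler provides (uint32_t
        |      * itself is optional).  The mask discards any high bits
        |      * that exist when uint_least32_t is wider than 32 bits;
        |      * when it is exactly 32 bits wide the mask is a no-op
        |      * and optimizes away.  The 1u* forces unsigned
        |      * arithmetic, so operands are never promoted to a signed
        |      * int that could overflow. */
        |     #define MASK32 UINT32_C(0xFFFFFFFF)
        |     
        |     static uint_least32_t add32(uint_least32_t a,
        |                                 uint_least32_t b)
        |     {
        |         return (1u * a + b) & MASK32;
        |     }
        |     
        |     static uint_least32_t mul32(uint_least32_t a,
        |                                 uint_least32_t b)
        |     {
        |         return (1u * a * b) & MASK32;
        |     }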
        
         | nayuki wrote:
          | I found types like uint_least32_t to be much less useful
         | after learning that standard C/C++ types already have these
         | guarantees. Namely: char is at least 8 bits wide, short is at
         | least 16 bits wide, int is at least 16 bits wide, long is at
         | least 32 bits wide, long long is at least 64 bits wide.
         | 
          | Implementing cryptographic algorithms using uint_least32_t can
          | be dangerous. Suppose that short is 32 bits wide and int is 48
          | bits wide. Your uint_least32_t might be mapped to unsigned
         | short. When you do something like (x << 31), x gets promoted to
         | signed int, and then you might be shifting a 1 into the sign
         | bit, which is undefined behavior. So I would say that a better
         | idea is to use unsigned long, which is guaranteed not to
         | promote to a higher-rank type.
        
           | rssoconnor wrote:
            | Indeed. The trick here appears to be to write (1U*x << 31) so
            | that 1U*x has a type that is the larger of uint_least32_t and
            | unsigned int.
        
             | nayuki wrote:
             | Yup. Alternatively ((0U + x) << 31). I first thought about
             | this problem years ago:
             | https://stackoverflow.com/questions/39964651/is-masking-
             | befo...
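              | 
              | Both fixes side by side, as a sketch, on the hypothetical
              | platform above (short is 32 bits, int is 48 bits, so
              | uint_least32_t maps to unsigned short and promotes to
              | signed int):
              | 
              |     #include <stdint.h>
              |     
              |     uint_least32_t top_bit(uint_least32_t x)
              |     {
              |         /* x << 31 would be undefined here: x promotes
              |          * to signed int and a 1 lands in the sign bit.
              |          * Forcing unsigned arithmetic fixes it: */
              |         return (1U * x) << 31;
              |         /* (0U + x) << 31 is the equivalent fix. */
              |     }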
        
           | oj002 wrote:
           | Oh, you are right, this can be problematic, although I don't
           | like your solution either, since unsigned long is going to be
           | larger than 32-bit on 50% of the platforms (not really, but
           | you know the drill, Linux: 64-bit, Windows: 32-bit).
        
         | steerablesafe wrote:
         | Well, rand has problems other than just being a poor random
         | generator.
        
       | steerablesafe wrote:
       | > If you dive into the C specification, you'll discover that the
       | values I tried to cast lead to undefined behaviour [10] which
       | means that my program could have done anything. In practice, the
       | behaviour I observed can be explained very simply.
       | 
       | This is not entirely correct. Note that this is integer
       | conversion as opposed to signed integer arithmetic overflow. The
       | latter is indeed undefined behavior, while the former is
       | implementation defined.
       | 
       | https://cigix.me/c17#6.3.1.3.p3
       | 
       | > Otherwise, the new type is signed and the value cannot be
       | represented in it; either the result is implementation-defined or
       | an implementation-defined signal is raised.
       | 
       | In C++ since C++20 the result is defined. Before that the result
       | was implementation-defined (without the option to raise a
       | signal).
       | 
       | AFAIK all major compilers do the obvious thing in both C and C++.
       | The undefined behavior of signed integer overflow is indeed used
       | by compiler optimizations.
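        | 
        | The distinction in code form (a sketch; what the first function
        | returns is up to the implementation, pre-C++20):
        | 
        |     /* Out-of-range conversion to a signed type: the result
        |      * is implementation-defined (C17 6.3.1.3p3); since
        |      * C++20 it is defined to wrap.  Major compilers wrap. */
        |     int convert(unsigned int u)
        |     {
        |         return (int)u;
        |     }
        |     
        |     /* Signed arithmetic overflow is undefined behaviour,
        |      * and optimizers exploit it: this is typically folded
        |      * to 1 because a + 1 "cannot" wrap. */
        |     int always_true(int a)
        |     {
        |         return a + 1 > a;
        |     }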
        
       | bmc7505 wrote:
       | I thought this blog post would discuss type-level integers but I
       | didn't see anything about that: https://users.rust-
       | lang.org/t/notes-on-type-level-integral-n...
        
       | goldenkey wrote:
       | The author of this post is oblivious to fractional bit encodings.
       | When encoding an array of items, you can always encode the array
       | optimally, only wasting less than 1 bit.
       | 
       | For example, if we had an array of n test scores from 1-100, we
       | only need ceil(n*log2(100)) bits.
       | 
       | The key is that when reading the values, instead of shifting like
       | we would normally to iterate an array, we instead divide by 100.
       | Instead of masking bits, we get the current value by doing mod
        | 100. It's equivalent to shifting/masking for a power of 2, but
       | extends packing to fractional bits. It's quite beautiful.
       | 
       | Take a look at this blog post to understand more about fractional
       | bit representations:
       | 
       | https://hbfs.wordpress.com/2011/11/01/fractional-bits-part-i...
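        | 
        | A toy version of the divide/mod access pattern (scores 0-99 and
        | a single 64-bit word, just for illustration):
        | 
        |     #include <stdint.h>
        |     
        |     /* Treat up to 9 scores in 0..99 as digits of a base-100
        |      * number (100^9 < 2^64).  Each score costs log2(100)
        |      * ~= 6.64 bits instead of a rounded-up 7. */
        |     enum { BASE = 100, PER_WORD = 9 };
        |     
        |     uint64_t pack(const unsigned scores[PER_WORD])
        |     {
        |         uint64_t word = 0;
        |         for (int i = PER_WORD - 1; i >= 0; i--)
        |             word = word * BASE + scores[i];
        |         return word;
        |     }
        |     
        |     unsigned get(uint64_t word, int i)
        |     {
        |         while (i--)
        |             word /= BASE;   /* "shift" one base-100 digit */
        |         return (unsigned)(word % BASE);  /* "mask" one digit */
        |     }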
        
         | jerf wrote:
         | Speaking from experience on this... you get some _crazy looks_
         | from code reviewers. I got perhaps overclever with this at one
         | point using it to generate passwords by extracting log2(44)
          | bits progressively out of a random number to choose from my
         | blessed set of characters for passwords and it took some
         | patience with the security team to verify that it was safe.
         | 
         | I did it just because it's one of the more elegant ways out of
         | the problem of truncation of bits when you have a non-even-
         | power-of-two set of possibilities to choose from, especially
         | when you need to do it repeatedly from a random source. Instead
         | of discarding or fussing about, if you need fractional bits
         | from a given random source of bits, you just... take fractional
         | bits. No fuss. No muss.
        
           | jhgb wrote:
            | > by extracting log2(44) bits progressively out of a random
           | number to choose from my blessed set of characters for
           | passwords and it took some patience with the security team to
           | verify that it was safe.
           | 
           | It's safe but if you somehow depended on all the passwords
           | having exactly equal probability, you'd be wrong (unless
           | you're already using a random number from a limited range).
           | Of course this is easily fixed by a range check on the random
           | number and generating it again if it failed the check.
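            | 
            | A sketch of that fix (rand32 here stands in for whatever
            | 32-bit random source is being drawn from):
            | 
            |     #include <stdint.h>
            |     
            |     /* Unbiased pick from 44 characters: reject the few
            |      * top values that fall in the last, incomplete run
            |      * of 44, then draw again. */
            |     unsigned pick44(uint32_t (*rand32)(void))
            |     {
            |         const uint32_t limit =
            |             UINT32_MAX - UINT32_MAX % 44u;
            |         uint32_t r;
            |         do {
            |             r = rand32();
            |         } while (r >= limit);
            |         return r % 44u;
            |     }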
        
         | smrq wrote:
         | I feel like the core idea here is really intuitive, but the
         | presentation as fractional bit encoding obscures it--at least
         | to me. It's just encoding an array of values with K distinct
         | possibilities as a base-K number. I had to read the code in the
         | link above to understand that; the exposition didn't help me at
         | all.
        
           | zozbot234 wrote:
           | > It's encoding an array of values with K distinct
           | possibilities as a base-K number
           | 
           | Yes, and the encoding and decoding step needs to do
           | arithmetic so there's a practical limit to how big these
           | numbers can be. But sometimes you do get some real gains.
        
         | jhgb wrote:
         | > Coding with an non-integer number of bits is counter-
         | intuitive
         | 
         | I've reinvented this very thing several times in the past. I
         | don't see how it is in any way unnatural or unintuitive. You're
         | simply numbering the members of a Cartesian product of some
         | finite sets. The maximum number possible, if you start from
         | number 1, is the product of the cardinalities of all of your
         | sets. It's basically elementary school math, which is probably
         | why I had no problems with coming up with this idea even in my
         | dilettante past.
        
         | bryondowd wrote:
         | Interesting concept, but seems like it only works for fairly
         | small arrays, right?
         | 
          | You could fit 40 base-3 values into a 64-bit integer, but in
         | order to read/write index 39, you have to perform 39
         | multiplications/divisions. So no efficient random access. And
         | is there any way to encode more than 40 base-3 values if your
         | largest integer is 64 bits? If you need 45 elements, I guess
         | you could use an array of 5 8-bit integers where each 8-bit
         | element represents 5 base-3 elements. Of course, now you're
         | limited to increasing the size of your array to increments of 5
          | (or, more generally, the smallest efficient integer size
          | divided by log base 2 of the number of representable values).
        
           | goldenkey wrote:
           | > Interesting concept, but seems like it only works for
           | fairly small arrays, right?
           | 
           | Any size array will work.
           | 
           | > And is there any way to encode more than 40 base-3 values
           | if your largest integer is 64 bits?
           | 
           | Yes, absolutely. And with ease. Bignums make it seamless.
           | Some languages like Perl and Python and Mathematica support
           | them out of the box.
           | 
           | You can still do random access, it just becomes: divide
           | bignum by 3^n, take result mod 3. It will be costly though,
           | like you mentioned. There is a way to bound the cost.
           | 
            | You can chunk your elements into groups whose packed size is
            | a whole number of bytes. This allows you to do random access
            | with minimal overhead while keeping most of the savings,
            | bounded by your choice of chunk size.
            | 
            | For example, 100 of your elements fit neatly into 20 bytes.
            | 
            | 101 elements would need more than 20 bytes (160 bits), but
            | 100 elements only need about 158.5 bits, so we lose only 1.5
            | bits per chunk.
           | 
           | https://www.wolframalpha.com/input/?i=log2%283%29*100.+vs+lo.
           | ..
           | 
            | Consider that we do this chunking. Let us say we have 10000
            | elements and want to look at the value of element 6000. We
            | can immediately skip to byte (6000/100)*20=1200 and simply
            | take the value there mod 3 (element 6000 happens to be the
            | first digit of its chunk; in general you also divide by
            | 3^(index mod 100) within the chunk first). This is possible
            | because the chunks are independent of each other. So we get
            | random access with a small procedure that gives us a huge
            | space saving.
           | 
            | If we want byte-aligned chunks, we can graph (log2(3)*n) % 8
            | to look for which chunk sizes n would lose the fewest bits
            | of storage.
           | 
           | https://www.wolframalpha.com/input/?i=discrete+plot+%28log2%.
           | ..
           | 
           | We want to choose chunk sizes where the mod value is close to
           | 8, since we have to round up.
           | 
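            | A small fixed-chunk sketch of this in C, using 40 trits per
            | 64-bit word (3^40 < 2^64) rather than 100 elements per 20
            | bytes, so no bignums are needed:
            | 
            |     #include <stddef.h>
            |     #include <stdint.h>
            |     
            |     /* ~63.4 of the 64 bits carry data, versus 80 bits
            |      * for a plain 2-bits-per-trit layout.  Chunks are
            |      * independent words, so access stays local. */
            |     enum { TRITS_PER_WORD = 40 };
            |     
            |     unsigned get_trit(const uint64_t *words, size_t i)
            |     {
            |         uint64_t w = words[i / TRITS_PER_WORD];
            |         for (size_t k = i % TRITS_PER_WORD; k > 0; k--)
            |             w /= 3;
            |         return (unsigned)(w % 3);
            |     }
            |     
            |     void set_trit(uint64_t *words, size_t i, unsigned v)
            |     {
            |         uint64_t *w = &words[i / TRITS_PER_WORD];
            |         uint64_t p = 1;
            |         for (size_t k = i % TRITS_PER_WORD; k > 0; k--)
            |             p *= 3;              /* p = 3^(i mod 40)  */
            |         uint64_t old = *w / p % 3;
            |         *w -= old * p;           /* clear old digit   */
            |         *w += (uint64_t)v * p;   /* write new digit   */
            |     }
            | 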
           | Hope this helps.
        
       | chrismorgan wrote:
       | On Rust's pointer-sized integer types, the design was subject to
       | much agonising in 2013-2014. If you're interested in the history,
       | the end point is https://rust-lang.github.io/rfcs/0544-rename-
       | int-uint.html, and the GitHub comments on that RFC and earlier
       | proposals, and earlier discussions on discuss.rust-lang.org as it
       | was at the time (now renamed without redirect (grr!), change any
       | links to internals.rust-lang.org).
       | 
       | Beyond this, I speak as one who hasn't done any embedded work or
       | touched a platform where what this article calls "data width" and
       | "address width" differ.
       | 
       | I suspect that the key here for Rust is that it simply doesn't
       | _have_ a type for what this article calls "data width", because
       | it wasn't found to be useful. I'm not certain why you'd actually
       | _need_ such a type; I'm guessing that the only place it matters
       | is the size of registers, which surface-level Rust doesn't need
       | to worry about, though I'm not sure if asm! has anything to do
        | with it. I can imagine that things like GPIO pins (which might
        | need the data width?) will be handled in a library for a
        | specific board, which can thus define the type itself.
       | 
        | I also have a vague feeling that there's a third concept
       | here, not just the two: that what I've called "data width" in the
       | previous paragraph is actually "word size", which could
       | potentially be smaller again. Something about an 8-bit
       | architecture with 16-bit address space and 32-bit pointers. Can't
       | remember if register sizes are consistent, either.
       | 
       | I'm writing fairly incoherently and inaccurately here because I
       | really don't know what I'm talking about. :-)
        
         | Someone wrote:
         | One can also argue data width is the size of cache lines, which
         | nowadays often are larger than registers (and might even be
         | different in the various caches, but I don't know whether such
         | architectures exist)
        
         | myrrlyn wrote:
         | I am the author of a library that requires knowledge and
         | awareness of integers that are strictly not wider than the
         | general purpose register on the processor; the fact that
         | `usize` is "the largest GPR" on every target Rust knows about
         | is great but I'd still appreciate a distinction between "this
         | fits in exactly one GPR" vs "this fits in the address bus"
         | 
         | _especially_ since the library in question creates pointers
         | that are wider than `usize`
        
       | thrower123 wrote:
       | One feature I wish was more common is the ability to easily
       | range-restrict primitive types, without all the boxing and
       | boilerplate and operator overloading.
       | 
        | I can write a type that only handles integers 0-10, or floats
        | [0...1], but it is more of a lift than I'd like in most cases,
        | and often inefficient, so I don't, even though it would be a
        | boon for correctness and avoiding idiot bugs.
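        | 
        | As a sketch, the kind of boilerplate I mean (names made up, and
        | this still doesn't get you operators or zero cost):
        | 
        |     #include <assert.h>
        |     
        |     /* An integer that is only ever 0..10, enforced by
        |      * construction rather than by the type system. */
        |     typedef struct { int value; } Score;
        |     
        |     Score score_new(int v)
        |     {
        |         assert(v >= 0 && v <= 10);
        |         return (Score){ v };
        |     }
        |     
        |     int score_get(Score s) { return s.value; }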
        
         | touisteur wrote:
          | Ada is for you then: https://learn.adacore.com/courses/intro-
         | to-ada/chapters/stro...
        
           | mpweiher wrote:
           | Ada? Newfangled nonsense. :-)) Pascal! http://www.dcs.ed.ac.u
           | k/home/SUNWspro/3.0/pascal/lang_ref/re...
        
             | touisteur wrote:
             | TIL Pascal had those. I'll need to check Turbo Pascal some
              | day. I could have had the joys of static bondage typing
              | sooner.
        
           | _0ffh wrote:
           | Or Nim's subrange types.
           | 
           | https://nim-lang.org/docs/manual.html#types-subrange-types
        
             | mixedCase wrote:
              | Or Idris dependent types, which allow you to prove
             | properties of any value, not just numbers.
        
               | zokier wrote:
                | Afaik refinement types (e.g. LiquidHaskell) are a bit
               | appropriate than dependent types for this sort of thing
        
         | staticassertion wrote:
         | Rust's const generics eval is quite powerful. It _may_ be able
         | to represent this.
        
           | nynx wrote:
           | It probably can't yet on stable, but it will be able to
           | within the next year or so.
        
             | staticassertion wrote:
             | Oh yeah, I guess I should have mentioned I'm talking
             | explicitly about the nightly const feature.
        
       ___________________________________________________________________
       (page generated 2021-07-01 23:02 UTC)