[HN Gopher] Static Integer Types
___________________________________________________________________
Static Integer Types
Author : matt_d
Score : 50 points
Date : 2021-06-29 21:44 UTC (2 days ago)
(HTM) web link (tratt.net)
(TXT) w3m dump (tratt.net)
| nightpool wrote:
| > Let's start with Rust, which offers the usize type, defining it
| as "The pointer-sized unsigned integer type. The size of this
| primitive is how many bytes it takes to reference any location in
| memory." That definition is based on the assumption that a
| machine's address and data widths are the same
|
| Is this true? I don't know the history here, but just from
| reading that definition, it seems like `usize` is tied _only_ to
| the address width. Where does the data width come in?
| magicalhippo wrote:
| I might be wrong, but what came to mind was segmented memory,
| i.e. like the 8086, where registers are 16-bit but (far)
| pointers are 16:16 (ignoring the quirk with overlapping
| segments).
| moring wrote:
| That would make usize a 32-bit integer, regardless of the
| data width.
|
| It does make other hidden assumptions though, such as that
| there is a single "memory" in which each location is uniquely
| identified by a usize value, regardless of how it is accessed.
| On x86, for example, the same address can refer either to a
| location in memory or to a location in an I/O device, depending
| on whether in/out instructions or "normal" instructions are
| used.
| magicalhippo wrote:
| > That would make usize a 32-bit integer, regardless of the
| data width.
|
| Doh, long day at work, misinterpreted "data width".
| gautamcgoel wrote:
| Yeah I didn't understand that part either. The quote doesn't
| mention data width at all...
| fanf2 wrote:
| Rust's usize does the jobs of both uintptr_t (a pointer-width
| type) and size_t (a data-width type)
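|
| To make the two jobs concrete in C terms (this is just the
| standard C picture, not something from the article):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   /* size_t counts bytes/objects; uintptr_t (optional in C) holds
|      a pointer's bit pattern. On common platforms they have the
|      same width, but the standard doesn't tie them together.     */
|   void demo(void)
|   {
|       int x = 0;
|       size_t    len  = sizeof x;              /* an object size  */
|       uintptr_t addr = (uintptr_t)(void *)&x; /* a pointer value */
|       (void)len; (void)addr;                  /* silence warnings */
|   }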
| oj002 wrote:
| > As far as I can see, the fast types are a failure, and the
| least types are unused, possibly because no-one can understand
| exactly what the latter mean.
|
| I personally quite like that we have the fast/least types,
| because they allow you to write portable code that would
| otherwise require the uintN_t types, which are only defined if
| the platform supports that specific width.
|
| Let's say you want to implement a cryptographic algorithm (these
| often require 32/64-bit modular arithmetic) that works on any
| conforming C compiler. uint32_t isn't guaranteed to be defined,
| but uint_least32_t is. You can now use uint_least32_t instead of
| uint32_t and mask out the upper bits that may exist when
| necessary. You'd do this with e.g. UINT32_C(0xFFFFFFFF), which
| ~~coincidentally~~ by design also has the type uint_least32_t.
| This would result in "perfect" code gen on any platform that has
| a 32-bit unsigned integer type (given a reasonable optimizing
| compiler), since in that case uint_least32_t is the same type as
| uint32_t, and otherwise it will still work.
|
| The fast types could be used for something similar, if they were
| properly defined. But C has a history of compilers implementing
| features not as intended; I'm looking at you, rand(): "The C89
| Committee decided that an implementation should be allowed to
| provide a rand function which generates the best random sequence
| possible in that implementation, and therefore mandated no
| standard algorithm" (C99 Rationale).
|
| Edit: Note that you also need to make sure that the variables
| aren't promoted to signed integers, e.g. x << 3 should be 1u*x <<
| 3.
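|
| A minimal sketch of the masking approach (rotl32/add32 are just
| illustrative names, not from the article):
|
|   #include <stdint.h>
|
|   #define MASK32 UINT32_C(0xFFFFFFFF)
|
|   /* r must be in 1..31; the 1u* guards against promotion to a
|      signed int and the mask drops any bits above bit 31 that
|      uint_least32_t is allowed to carry.                         */
|   static uint_least32_t rotl32(uint_least32_t x, unsigned r)
|   {
|       x &= MASK32;
|       return ((1u * x << r) | (1u * x >> (32u - r))) & MASK32;
|   }
|
|   /* addition modulo 2^32 */
|   static uint_least32_t add32(uint_least32_t a, uint_least32_t b)
|   {
|       return (1u * a + 1u * b) & MASK32;
|   }
|
| On a platform where uint_least32_t is exactly 32 bits wide, the
| masks should compile away with a reasonable optimizing compiler;
| elsewhere they keep the arithmetic correct.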
| nayuki wrote:
| I found types like uint_least32_t to be much less useful
| after learning that standard C/C++ types already have these
| guarantees. Namely: char is at least 8 bits wide, short is at
| least 16 bits wide, int is at least 16 bits wide, long is at
| least 32 bits wide, long long is at least 64 bits wide.
|
| Implementing cryptographic algorithms using uint_least32_t can
| be dangerous. Suppose that short is 32 bits wide and int is 48
| bits wide. Your uint_least32_t might be mapped to unsigned
| short. When you do something like (x << 31), x gets promoted to
| signed int, and then you might be shifting a 1 into the sign
| bit, which is undefined behavior. So I would say that a better
| idea is to use unsigned long, which is guaranteed not to
| promote to a higher-rank type.
| rssoconnor wrote:
| Indeed. The trick here appears to be to write (1U*x << 31) so
| that 1U*x takes the larger of the two types uint_least32_t and
| unsigned int.
| nayuki wrote:
| Yup. Alternatively ((0U + x) << 31). I first thought about
| this problem years ago:
| https://stackoverflow.com/questions/39964651/is-masking-
| befo...
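|
| A sketch of both the hazard and the fix, under a hypothetical
| ABI where unsigned short is 32 bits and int is 48 bits (so
| uint_least32_t is unsigned short and promotes to signed int):
|
|   #include <stdint.h>
|
|   uint_least32_t shift_bad(uint_least32_t x)
|   {
|       /* x promotes to signed int; for x >= 0x10000 the shift
|          overflows the 48-bit int, which is undefined behaviour. */
|       return (x << 31) & UINT32_C(0xFFFFFFFF);
|   }
|
|   uint_least32_t shift_ok(uint_least32_t x)
|   {
|       /* 1U*x (or 0U + x) makes the operand at least unsigned
|          int, so the shift happens in an unsigned type and any
|          excess bits are simply discarded by the mask.           */
|       return (1U * x << 31) & UINT32_C(0xFFFFFFFF);
|   }
|
| On the usual platforms, where uint_least32_t is already unsigned
| int or wider, the two functions should generate identical code.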
| oj002 wrote:
| Oh, you are right, this can be problematic, although I don't
| like your solution either, since unsigned long is going to be
| larger than 32 bits on 50% of the platforms (not really, but
| you know the drill, Linux: 64-bit, Windows: 32-bit).
| steerablesafe wrote:
| Well, rand has problems other than just being a poor random
| generator.
| steerablesafe wrote:
| > If you dive into the C specification, you'll discover that the
| values I tried to cast lead to undefined behaviour [10] which
| means that my program could have done anything. In practice, the
| behaviour I observed can be explained very simply.
|
| This is not entirely correct. Note that this is integer
| conversion as opposed to signed integer arithmetic overflow. The
| latter is indeed undefined behavior, while the former is
| implementation-defined.
|
| https://cigix.me/c17#6.3.1.3.p3
|
| > Otherwise, the new type is signed and the value cannot be
| represented in it; either the result is implementation-defined or
| an implementation-defined signal is raised.
|
| In C++ since C++20 the result is defined. Before that the result
| was implementation-defined (without the option to raise a
| signal).
|
| AFAIK all major compilers do the obvious thing in both C and C++.
| The undefined behavior of signed integer overflow is indeed used
| by compiler optimizations.
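|
| A small C illustration of the distinction (the function names
| are made up):
|
|   int conversion_example(unsigned int u)
|   {
|       /* Out-of-range unsigned -> signed conversion: the result
|          is implementation-defined (or an implementation-defined
|          signal) in C; well-defined wrapping since C++20.        */
|       return (int)u;
|   }
|
|   int overflow_example(int x)
|   {
|       /* Signed arithmetic overflow: undefined behaviour.
|          Optimizers really do use this, e.g. they may assume
|          x + 1 > x always holds and fold this to "return 1".     */
|       return x + 1 > x;
|   }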
| bmc7505 wrote:
| I thought this blog post would discuss type-level integers but I
| didn't see anything about that: https://users.rust-
| lang.org/t/notes-on-type-level-integral-n...
| goldenkey wrote:
| The author of this post is oblivious to fractional bit encodings.
| When encoding an array of items, you can always encode the array
| optimally, only wasting less than 1 bit.
|
| For example, if we had an array of n test scores from 1-100, we
| only need ceil(n*log2(100)) bits.
|
| The key is that when reading the values, instead of shifting as
| we normally would to iterate an array, we divide by 100. Instead
| of masking bits, we get the current value by taking it mod 100.
| It's equivalent to shifting/masking for a power of 2, but it
| extends packing to fractional bits. It's quite beautiful.
|
| Take a look at this blog post to understand more about fractional
| bit representations:
|
| https://hbfs.wordpress.com/2011/11/01/fractional-bits-part-i...
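|
| A minimal C sketch of the divide/mod idea, packing nine values
| in 0..99 into one uint64_t (100^9 < 2^64, so roughly 6.64 bits
| per value instead of 7; pack100/unpack100 are made-up names):
|
|   #include <stdint.h>
|
|   uint64_t pack100(const uint8_t vals[9])
|   {
|       uint64_t n = 0;
|       for (int i = 8; i >= 0; i--)
|           n = n * 100 + vals[i];   /* "shift left and OR in"     */
|       return n;
|   }
|
|   uint8_t unpack100(uint64_t n, int i)
|   {
|       while (i-- > 0)
|           n /= 100;                /* "shift right" one element  */
|       return (uint8_t)(n % 100);   /* "mask" off one element     */
|   }
|
| For the 1-100 scores above you'd store score-1 and add 1 back on
| the way out.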
| jerf wrote:
| Speaking from experience on this... you get some _crazy looks_
| from code reviewers. I got perhaps overclever with this at one
| point using it to generate passwords by extracting log2(44)
| bits progressively out of a random number to choose from my
| blessed set of characters for passwords and it took some
| patience with the security team to verify that it was safe.
|
| I did it just because it's one of the more elegant ways out of
| the problem of truncating bits when the set of possibilities
| you're choosing from isn't a power of two in size, especially
| when you need to do it repeatedly from a random source. Instead
| of discarding or fussing about, if you need fractional bits
| from a given random source of bits, you just... take fractional
| bits. No fuss. No muss.
| jhgb wrote:
| > by extracting log2(44) bits progressively out of a random
| number to choose from my blessed set of characters for
| passwords and it took some patience with the security team to
| verify that it was safe.
|
| It's safe but if you somehow depended on all the passwords
| having exactly equal probability, you'd be wrong (unless
| you're already using a random number from a limited range).
| Of course this is easily fixed by a range check on the random
| number and generating it again if it failed the check.
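|
| That range check, sketched in C for jerf's 44-character set
| (rand64() stands in for whatever uniform 64-bit source you
| already have):
|
|   #include <stdint.h>
|
|   extern uint64_t rand64(void);   /* assumed uniform over 2^64  */
|
|   unsigned pick44(void)
|   {
|       /* largest multiple of 44 that fits in the 64-bit range */
|       const uint64_t limit = UINT64_MAX - (UINT64_MAX % 44);
|       uint64_t r;
|       do {
|           r = rand64();
|       } while (r >= limit);       /* reject the biased sliver   */
|       return (unsigned)(r % 44);
|   }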
| smrq wrote:
| I feel like the core idea here is really intuitive, but the
| presentation as fractional bit encoding obscures it--at least
| to me. It's just encoding an array of values with K distinct
| possibilities as a base-K number. I had to read the code in the
| link above to understand that; the exposition didn't help me at
| all.
| zozbot234 wrote:
| > It's encoding an array of values with K distinct
| possibilities as a base-K number
|
| Yes, and the encoding and decoding step needs to do
| arithmetic so there's a practical limit to how big these
| numbers can be. But sometimes you do get some real gains.
| jhgb wrote:
| > Coding with an non-integer number of bits is counter-
| intuitive
|
| I've reinvented this very thing several times in the past. I
| don't see how it is in any way unnatural or unintuitive. You're
| simply numbering the members of a Cartesian product of some
| finite sets. The maximum number possible, if you start from
| number 1, is the product of the cardinalities of all of your
| sets. It's basically elementary school math, which is probably
| why I had no problems with coming up with this idea even in my
| dilettante past.
| bryondowd wrote:
| Interesting concept, but seems like it only works for fairly
| small arrays, right?
|
| You could fit 40 base-3 values into a 64 bit integer, but in
| order to read/write index 39, you have to perform 39
| multiplications/divisions. So no efficient random access. And
| is there any way to encode more than 40 base-3 values if your
| largest integer is 64 bits? If you need 45 elements, I guess
| you could use an array of nine 8-bit integers where each 8-bit
| element represents 5 base-3 elements. Of course, now you're
| limited to growing your array in increments of 5 (or, more
| generally, increments of floor(bits in your smallest efficient
| integer / log2 of the number of representable values)).
| goldenkey wrote:
| > Interesting concept, but seems like it only works for
| fairly small arrays, right?
|
| Any size array will work.
|
| > And is there any way to encode more than 40 base-3 values
| if your largest integer is 64 bits?
|
| Yes, absolutely. And with ease. Bignums make it seamless.
| Some languages like Perl and Python and Mathematica support
| them out of the box.
|
| You can still do random access, it just becomes: divide
| bignum by 3^n, take result mod 3. It will be costly though,
| like you mentioned. There is a way to bound the cost.
|
| You can chunk your elements into groups that each occupy a
| whole number of bytes. This allows you to do random access with
| minimal overhead while keeping most of the savings; the waste
| is bounded by your choice of chunk size.
|
| For example, 100 of your elements fit neatly into 20 bytes.
|
| 101 elements would need more than 20 bytes (160 bits), but 100
| elements only need about 158.5 bits, so we lose only about 1.5
| bits per chunk.
|
| https://www.wolframalpha.com/input/?i=log2%283%29*100.+vs+lo.
| ..
|
| Suppose we do this chunking. Say we have 10000 elements and
| want the value of element 6000. We can skip straight to byte
| (6000/100)*20 = 1200, read the 20-byte chunk starting there as
| a number, divide it by 3^(6000 mod 100) = 3^0 = 1, and take the
| result mod 3. This works because the chunks are independent of
| each other. So we get random access with a small procedure and
| a huge space saving.
|
| If we require each chunk to be a whole number of bytes, we can
| graph (log2(3)*n) % 8 to look for which chunk sizes (n elements
| per chunk) would lose the fewest bits of storage.
|
| https://www.wolframalpha.com/input/?i=discrete+plot+%28log2%.
| ..
|
| We want to choose chunk sizes where the mod value is close to
| 8, since we have to round up.
|
| Hope this helps.
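|
| A C sketch of the chunked layout with 64-bit chunks instead of
| 20-byte bignums: 40 base-3 values per uint64_t (3^40 < 2^64), so
| you waste about 0.6 bits per 40 values and any element can be
| reached by touching a single chunk (the names are made up):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   #define TRITS_PER_CHUNK 40
|
|   static uint64_t pow3(unsigned n)     /* 3^n for n <= 39 */
|   {
|       uint64_t p = 1;
|       while (n--) p *= 3;
|       return p;
|   }
|
|   unsigned get_trit(const uint64_t *chunks, size_t i)
|   {
|       uint64_t chunk = chunks[i / TRITS_PER_CHUNK];
|       return (unsigned)(chunk / pow3(i % TRITS_PER_CHUNK) % 3);
|   }
|
|   void set_trit(uint64_t *chunks, size_t i, unsigned v) /* v: 0..2 */
|   {
|       uint64_t *chunk = &chunks[i / TRITS_PER_CHUNK];
|       uint64_t  p     = pow3(i % TRITS_PER_CHUNK);
|       unsigned  old   = (unsigned)(*chunk / p % 3);
|       *chunk = *chunk - (uint64_t)old * p + (uint64_t)v * p;
|   }
|
| With 8-bit chunks you get the 5-trits-per-byte layout from the
| parent comment; bigger chunks waste fewer bits per element but
| need wider arithmetic per access.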
| chrismorgan wrote:
| On Rust's pointer-sized integer types, the design was subject to
| much agonising in 2013-2014. If you're interested in the history,
| the end point is https://rust-lang.github.io/rfcs/0544-rename-
| int-uint.html, and the GitHub comments on that RFC and earlier
| proposals, and earlier discussions on discuss.rust-lang.org as it
| was at the time (now renamed without redirect (grr!), change any
| links to internals.rust-lang.org).
|
| Beyond this, I speak as one who hasn't done any embedded work or
| touched a platform where what this article calls "data width" and
| "address width" differ.
|
| I suspect that the key here for Rust is that it simply doesn't
| _have_ a type for what this article calls "data width", because
| it wasn't found to be useful. I'm not certain why you'd actually
| _need_ such a type; I'm guessing that the only place it matters
| is the size of registers, which surface-level Rust doesn't need
| to worry about, though I'm not sure if asm! has anything to do
| with it. I can imagine that things like GPIO pins (which might
| need data width?) will be handled in a library for a specific
| board, which can thus define the type itself.
|
| I also have a vague feeling that there's a third concept here,
| not just the two: that what I've called "data width" in the
| previous paragraph is actually "word size", which could
| potentially be smaller again. Something about an 8-bit
| architecture with 16-bit address space and 32-bit pointers. Can't
| remember if register sizes are consistent, either.
|
| I'm writing fairly incoherently and inaccurately here because I
| really don't know what I'm talking about. :-)
| Someone wrote:
| One can also argue data width is the size of cache lines, which
| nowadays often are larger than registers (and might even be
| different in the various caches, but I don't know whether such
| architectures exist)
| myrrlyn wrote:
| I am the author of a library that requires knowledge and
| awareness of integers that are strictly not wider than the
| general purpose register on the processor; the fact that
| `usize` is "the largest GPR" on every target Rust knows about
| is great but I'd still appreciate a distinction between "this
| fits in exactly one GPR" vs "this fits in the address bus"
|
| _especially_ since the library in question creates pointers
| that are wider than `usize`
| thrower123 wrote:
| One feature I wish was more common is the ability to easily
| range-restrict primitive types, without all the boxing and
| boilerplate and operator overloading.
|
| I can write a type that only handles integers 0-10, or floats
| in [0...1], but it is more of a lift than I'd like in most
| cases, and often inefficient, so I don't, even though it would
| be a boon for correctness and avoiding idiot bugs.
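|
| For comparison, here's roughly what that lift looks like in
| plain C (score_t and its helpers are a made-up example):
|
|   #include <assert.h>
|
|   typedef struct { int v; } score_t;  /* invariant: 0 <= v <= 10 */
|
|   static score_t score_new(int v)
|   {
|       assert(v >= 0 && v <= 10);      /* runtime check, not static */
|       return (score_t){ v };
|   }
|
|   static score_t score_add(score_t a, score_t b)
|   {
|       return score_new(a.v + b.v);    /* re-check on every op      */
|   }
|
| Every operation needs its own checked wrapper, and nothing stops
| code from poking at .v directly.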
| touisteur wrote:
| Ada is for you then : https://learn.adacore.com/courses/intro-
| to-ada/chapters/stro...
| mpweiher wrote:
| Ada? Newfangled nonsense. :-)) Pascal! http://www.dcs.ed.ac.u
| k/home/SUNWspro/3.0/pascal/lang_ref/re...
| touisteur wrote:
| TIL Pascal had those. I'll need to check Turbo Pascal some
| day. I could have had the joys of static bondage typing sooner.
| _0ffh wrote:
| Or Nim's subrange types.
|
| https://nim-lang.org/docs/manual.html#types-subrange-types
| mixedCase wrote:
| Or Idris's dependent types, which allow you to prove
| properties of any value, not just numbers.
| zokier wrote:
| AFAIK refinement types (e.g. LiquidHaskell) are a bit more
| appropriate than dependent types for this sort of thing.
| staticassertion wrote:
| Rust's const generics eval is quite powerful. It _may_ be able
| to represent this.
| nynx wrote:
| It probably can't yet on stable, but it will be able to
| within the next year or so.
| staticassertion wrote:
| Oh yeah, I guess I should have mentioned I'm talking
| explicitly about the nightly const feature.
___________________________________________________________________
(page generated 2021-07-01 23:02 UTC)