[HN Gopher] Float Self-Tagging
       ___________________________________________________________________
        
       Float Self-Tagging
        
       Author : laurenth
       Score  : 24 points
       Date   : 2024-11-26 20:19 UTC (2 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | vanderZwan wrote:
       | Really cool solution! One question: maybe I missed it, but
       | there's no _technical_ reason the tag bits could not use the
       | entire range of exponent bits, no? Other than the fact that
       | having up to 2048 tags would be ridiculously branchy, I guess.
       | 
       | Here's a variation I just thought of, which probably has a few
       | footguns I'm overlooking right now: use the seven highest
       | exponent bits instead of three. Then we can directly read the
       | most significant byte of a double, skipping the need for a
       | rotation altogether.
       | 
       | After reading the most significant byte, subtract 16 (or
       | 0b00010000) from it, then mask out the sign. To test for unboxed
       | doubles, test if the resulting value is bigger than 15. If so, it
       | is an unboxed double, otherwise the lower four bits of the byte
       | are available as alternative tags (so 16 tags).
       | 
        | Effectively, we adjusted the exponent range 2^-767..2^-511
        | into the 0b(0)0000000 - 0b(0)0001111 range and made them boxed
        | doubles.
        | _Every_ other double, which is 15/16ths of all possible
        | doubles, is now unboxed. This includes subnormals, zero, all
        | NaN encodings (so it even can exist in superposition with a
        | NaN or NuN tagging system I guess?) and both infinities.
       | 
       | To be clear, this is off the top of my head so maybe I made a few
       | crucial mistakes here.
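        | A quick sketch of this variant in JS (hypothetical code, not
        | from the paper; assumes a little-endian host, so the most
        | significant byte of the double is the last byte of the
        | buffer):

```javascript
// Mock of the 7-high-bit tagging scheme described above.
// All names here are made up for illustration.
const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const u8 = new Uint8Array(buf);

function classify(d) {
  f64[0] = d;
  const msb = u8[7];                    // sign bit + top 7 exponent bits
  const t = (msb - 0b00010000) & 0x7f;  // subtract 16, mask out the sign
  return t > 15
    ? { unboxed: true }                 // incl. 0, NaN, both infinities
    : { unboxed: false, tag: t };       // low 4 bits usable as 16 tags
}
```

        | With these constants, ordinary values like 1.0, 0, NaN and
        | Infinity come out unboxed, while tiny magnitudes around 2^-767
        | get boxed.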
        
         | vanderZwan wrote:
          | Also, funny enough, this idea and variations of it look
          | surprisingly easy to implement in JS itself using typed
          | arrays. I won't make any claims about the impact on
          | performance though...
         | 
          | You need a Float64Array for the values (plus a
          | Uint32Array/Uint8Array using the same buffer to be able to
          | manipulate the integer bits, which _technically_ also
          | requires knowledge of endianness since it's _technically_
          | not in the JS spec and some funny mobile hardware still
          | exists out there). Mocking a heap is easy enough: the "heap"
          | can be a plain array, the "pointers" indices into the array.
          | Using a Map<value, index> would let you "intern" duplicate
          | values (i.e. ensure doubles outside the unboxed range are
          | only "heap"-allocated once).
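          | Something like this, for the interning part (a sketch, names
          | made up):

```javascript
// Mock heap for doubles that fall outside the unboxed range.
const heap = [];                // the "heap": a plain array
const interned = new Map();     // value -> index, to intern duplicates

function box(d) {
  let idx = interned.get(d);    // Map keys use SameValueZero, so NaN works
  if (idx === undefined) {
    idx = heap.length;
    heap.push(d);
    interned.set(d, idx);
  }
  return idx;                   // the "pointer" is just an index
}
```

          | One caveat: SameValueZero also means 0 and -0 would intern
          | to the same slot, which matters if those ever end up boxed.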
        
         | omelancon wrote:
         | Hi, I'm one of the authors of the paper. Thanks for your
         | questions and comments!
         | 
          | There are many reasons why 3-bit tags work well in practice.
          | Importantly, they allow aligning heap objects on 64-bit
          | machine words. Dereferencing a tagged pointer can then be
          | done in a single machine instruction, a MOV offset by the
          | tag.
         | 
          | One of our goals is to make self-tagging as straightforward
          | to implement in existing systems as possible. Thus,
          | requiring a move away from the ubiquitous 3-bit tag scheme
          | is definitely a no-go.
         | 
          | Another goal is performance, obviously. It turns out that if
          | the transformation from self-tagged float to IEEE 754 float
          | requires more than a few (2-3) machine instructions, it is
          | no longer as advantageous. Thus the choice of tags 000, 011
          | and 100, whose encoding/decoding is a single bitwise
          | rotation.
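          | To illustrate, here is my own sketch of such a rotation
          | scheme, not code from the paper: I'm guessing a left
          | rotation by 4, so that the top three exponent bits land in
          | the low-bit tag position; the paper's exact rotation amount
          | and constants may differ.

```javascript
// Self-tagging sketch: floats whose top three exponent bits are
// 000, 011 or 100 round-trip through a single rotation; all other
// floats would need to be boxed (signalled here by returning null).
const MASK = 0xFFFFFFFFFFFFFFFFn;
const rotl = (x, n) => ((x << n) | (x >> (64n - n))) & MASK;
const rotr = (x, n) => rotl(x, 64n - n);

const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const u64 = new BigUint64Array(buf);
const toBits = (d) => (f64[0] = d, u64[0]);
const fromBits = (b) => (u64[0] = b, f64[0]);

const FLOAT_TAGS = new Set([0b000n, 0b011n, 0b100n]);

function encode(d) {
  const w = rotl(toBits(d), 4n);   // sign + exp bits 1-3 -> low 4 bits
  return FLOAT_TAGS.has(w & 0b111n) ? w : null;
}
const decode = (w) => fromBits(rotr(w, 4n));
```

          | With this guess at the constants, +-0.0 and magnitudes
          | roughly in [2^-256, 2^256) self-tag, while e.g. 2^-767 would
          | have to be boxed.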
         | 
         | Also keep in mind that assigning more tags to self-tagging to
         | capture floats that are never used in practice just adds a
         | strain on other objects. That's why we include a usage analysis
         | of floats to guide tag selection in our paper.
         | 
          | In fact, we are currently working on an improved version
          | that uses a single tag to capture all "useful" values.
          | Capturing 3/8 of floats seems unnecessary, especially since
          | one of the tags is only meant to capture +-0.0. The trick is
          | to rotate the tag to bits 2-3-4 of the exponent instead of
          | 1-2-3 and add an offset to the exponent to "shift" the range
          | of captured values.
         | 
         | But in the end, it feels like we are barely scratching the
         | surface and that a lot of fine-tuning can still be done by
         | toying with tag placement and transformation. But I think the
         | next step is a quality over quantity improvement: capture less
         | floats but capture the right ones.
        
           | vanderZwan wrote:
           | Thank you for the explanation! I was not aware of the context
           | behind the choice of three-bit tags.
           | 
           | > _The trick is to rotate the tag to bits 2-3-4 of the
           | exponent instead of 1-2-3 and add an offset to the exponent
           | to "shift" the range of captured values._
           | 
           | Maybe I misunderstand, but isn't that a similar idea to what
           | I just described? Adding an offset to "rotate" the ranges of
           | the exponent by a segment, putting the one with zero in the
           | high side? The main difference being that you stick to the
           | upper four bits of the exponent, and that I suggested using
           | one of the upper three bit sequences to mark bits 4-5-6-7 as
           | tag bits? (I mistakenly included the sign bit before and
           | claimed this unboxes 15/16ths of all doubles, it's actually
           | 7/8ths) Which probably has consequences for minimizing
           | instructions, like you mentioned.
           | 
           | > _But I think the next step is a quality over quantity
           | improvement: capture less floats but capture the right ones._
           | 
            | I suspect that ensuring NaN and Infinity are in there will
            | be crucial to avoid performance cliffs in certain types of
            | code; I have seen production code where it is known that
            | initialized values are always finite, so either of those
            | two is then used as a way to "tag" a float value as
            | "missing" or something to that degree.
           | 
           | Anyway, looking forward to future results of your research!
        
             | omelancon wrote:
              | > _Maybe I misunderstand, but isn't that a similar idea
              | to what I just described? Adding an offset to "rotate"
              | the ranges of the exponent by a segment..._
              | 
              | Yes, it is similar. It seems to me that there really
              | aren't that many useful operations that can be applied
              | to the exponent besides adding an offset. But that's
              | only a suspicion, do not take my word for it.
             | 
             | > _I suspect that ensuring NaN and Infinity are in there
             | will be crucial to avoid performance cliffs..._
             | 
              | This is a reasonable assumption. There are in fact ways
              | to rotate and add an offset such that the exponent can
              | overflow/underflow to capture exponents 0 and 0x7ff (for
              | inf and nan) with a single well-positioned tag. Making
              | it work in practice is not as simple, but we are working
              | on it.
        
           | funny_falcon wrote:
            | Just a correction: CRuby has used "Float Self-Tagging" for
            | years.
        
             | refulgentis wrote:
             | This is your 4th comment claiming this: it's not true.
             | 
              | You're right that Ruby uses tags; e.g. Objective-C does
              | too, and has for a while.
              | 
              | The innovation here is it's a tag _without the tag
              | bits_. That's why it's self-tagging, not tagging.
        
           | jacobp100 wrote:
            | Just to add some JS engine info - every engine stores
            | numbers as 32-bit integers if possible, which is usually
            | done by tagging a pointer. Also, JSC needs pointers to be
            | represented exactly as-is, because it will scan the stack
            | for anything that looks like a pointer and retain it when
            | running the GC.
        
       | pavpanchekha wrote:
       | I loved this clever, weird, awesome paper, so a short summary.
       | 
       | In many dynamic languages some values are stored on the heap
       | ("boxed") and represented as a pointer, while others are
       | represented as an immediate value ("unboxed" or "immediate").
       | Pointer tagging is a common way to do that: the low bit of the
       | value tells you the value's type, and some types are immediate
       | while others are boxed.
       | 
       | Naturally, the tag bits have a fixed value, so can't be used to
       | store data. So for example your language might offer 61-bit
       | integer immediates instead of 64-bit integers; the other three
       | bits are used for tags. Possibly, larger integers are stored on
       | the heap and treated as a different type (for example Python 2.X
       | had separate int and long types for these cases).
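        | As a sketch, a 61-bit integer immediate might be packed like
        | this (the tag value here is hypothetical, not any particular
        | runtime's):

```javascript
// Tagged small-integer sketch: 61-bit payload + 3-bit tag.
const TAG_BITS = 3n;
const TAG_FIXINT = 0b001n;   // hypothetical tag meaning "small integer"
const MASK64 = 0xFFFFFFFFFFFFFFFFn;

const boxInt = (n) => ((n << TAG_BITS) | TAG_FIXINT) & MASK64;
const isFixInt = (w) => (w & 0b111n) === TAG_FIXINT;
// Reinterpret as signed, then arithmetic-shift so the sign survives.
const unboxInt = (w) => BigInt.asIntN(64, w) >> TAG_BITS;
```

        | Integers that don't fit in 61 bits would overflow this packing
        | and have to spill to a boxed representation, as with Python
        | 2's int/long split.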
       | 
       | However, it's hard to use this strategy for floats, because
       | floats need all 64 bits (or 32 bits for single-precision, same
       | difference). There's a trick called "NaN boxing" which makes use
       | of the large number of NaNs in the float representation, but read
       | the paper if you want more on that.
       | 
       | The authors' insight is that, suppose you have a three-bit tag
       | and 011 is the tag for floats. By totally random chance, _some_
       | floats will end in 011; you can represent those as immediates
       | with those tag bits. Obviously, that's unlikely, though you can
       | raise the chances by using, like, 010, 011, 100, and 101 all as
       | float tags. Still, the low bits are a bad choice. But what about
       | high bits? Most floats have one of a couple high bit patterns,
       | because most floats are either 0 or between, say, 1e-100 and
       | 1e100. Floats outside that range can be boxed but since they're
       | really rare it's not a big cost to box them.
       | 
       | So basically, we use high bits as our tag bits and map all the
       | common float prefixes to float tags. This allows unboxing the
       | vast majority of floats, which leads to _big_ speedups on float-
       | heavy benchmarks.
       | 
       | A personal note: I've been working in numerics and floating-point
       | for a decade now and have had to deal with float boxing both from
       | a research point of view (lots of runtime analysis systems for
       | floats), from a user point of view (using unboxed float vectors
       | for significant speedup in my own software), and from a teaching
       | point of view (discussing boxing in my compilers class, using
       | NaN-boxing as an example of cleverness).
       | 
       | This idea is so simple, so crazy, so stupid, and works so well,
       | but I never thought of it. Bravo to the authors.
        
         | jonnycomputer wrote:
         | Thank you for the clear explanation!
        
         | pansa2 wrote:
         | > _This allows unboxing the vast majority of floats, which
         | leads to big speedups on float-heavy benchmarks._
         | 
         | NaN-boxing allows _all_ floats to be unboxed though. The main
         | benefit of the self-tagging approach seems to be that by boxing
         | _some_ floats, we can make space for 64-bit pointers which are
         | too large for NaN-boxing.
         | 
          | The surprising part of the paper is that "_some_ floats" is
          | only a small minority of values - not, say, 50% of them.
        
           | adgjlsfhk1 wrote:
           | 50% means you only get 1 tag bit.
           | 
            | Also, you totally can fit 64-bit pointers inside a NaN:
            | 64-bit pointers are only 48 bits and you have 53 bits of
            | NaN payload. (You could also get an extra 3 bits if you
            | only allow storing 8-byte-aligned pointers unboxed.)
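            | For instance, a minimal NaN-boxing layout might look like
            | this (constants are illustrative; real engines carve up
            | the NaN space differently):

```javascript
// NaN-box a <=48-bit pointer into a quiet-NaN payload.
const QNAN = 0x7FF8000000000000n;     // canonical quiet-NaN pattern
const PAYLOAD = 0x0000FFFFFFFFFFFFn;  // 48 bits of pointer payload

const boxPtr = (p) => QNAN | p;       // p assumed to fit in 48 bits
const unboxPtr = (w) => w & PAYLOAD;
// A boxed pointer is a quiet-NaN pattern with a nonzero payload;
// the hardware's own NaN (payload 0) stays usable as a plain float.
const isBoxedPtr = (w) => (w & QNAN) === QNAN && (w & PAYLOAD) !== 0n;
```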
        
             | pansa2 wrote:
             | > _50% means you only get 1 tag bit._
             | 
             | That's enough to distinguish between "unboxed float" and
             | "something else", where the latter can have additional tag
             | bits.
             | 
             | > _[64-bit] pointers are only 48 bits and you have 53 bits
             | of NaN payload._
             | 
             | The paper specifically talks about support for "high memory
             | addresses that do not fit in 48 bits". If you don't have to
             | handle those high addresses, I don't think this approach
             | has any benefits compared to NaN-boxing.
        
               | dzaima wrote:
               | Of note is that even if you have some massive >=2^48 data
               | sources, you could still quite likely get away with
               | having NaN-boxed pointers to within the low-size heap,
                | with an extra indirection for massive data. This would
                | only break apart if you managed to reach around 2^45
                | distinct referenceable objects, which you probably
                | shouldn't ever have (esp. in a GCd language).
        
           | skybrian wrote:
           | A small minority, but apparently it includes all the floats
           | you're likely to use. It seems the insight is that you only
           | need 8 bits of exponent in most cases. (And single-precision
           | floating point only has 8 bits of exponent.)
           | 
           |  _Most_ double-precision floats are never used because they
           | have high exponents.
        
             | pansa2 wrote:
             | > _A small minority, but apparently it includes all the
             | floats you're likely to use._
             | 
             | Sorry, I meant a small minority need to be boxed - all the
             | floats you're likely to use can remain unboxed.
        
         | rtpg wrote:
         | Do all float operations need to reconfirm those bits afterwards
         | though? I suppose if you have some sort of JIT you can end up
         | with a bunch of unboxed floats and would only pay the cost on
         | boundaries though
        
           | lifthrasiir wrote:
           | Only when they have to be boxed, but yes if you are talking
           | about that.
        
             | ndjdjddjsjj wrote:
             | You need to check after the floating point operation though
             | just in case. Or after the boundary where you pass the
             | float to something else expecting this scheme.
        
           | pansa2 wrote:
           | > _reconfirm those bits afterwards_
           | 
           | Thanks - I hadn't thought about that but it seems to be the
           | main downside of this approach. The benefit of NaN-boxing is
           | that it reassigns values that are otherwise unused -
           | floating-point calculations will never generate NaNs with
           | those bit patterns.
        
             | AlotOfReading wrote:
              | An additional wrinkle is that NaNs are a bit unstable
              | and can have large performance penalties. You can't let
              | the NaNs ever escape into arithmetic, and you may even
              | have issues just storing them in a register.
        
           | phire wrote:
           | Yes, but there should be some optimisation opportunities.
           | 
           | Off the top of my head: Any multiply by a constant less than
           | 1.0 will never overflow the unboxed range (but might
           | underflow) and there should be times when it's provably
           | better to check the inputs are inside a range, rather than
           | checking the outputs.
           | 
           | It's worth pointing out that these overflow/underflow checks
           | will be very cold (on typical code). They won't waste much in
           | the way of branch-prediction resources.
           | 
           | I wonder if it's worth taking advantage of floating point
           | overflow/underflow exceptions. I think a multiplication by
           | 2^767 will trigger an exception if the value would overflow,
           | and the corresponding multiply by 2^-765 will catch
           | underflows.
           | 
           | It's tempting to allocate two more tags for floats (001 and
           | 010), covering the entire range from -2^257 to +2^257. It
           | will be rare to actually see those small floats near zero,
           | but it could be worth eliminating the possibility of
           | underflows.
        
           | ithkuil wrote:
           | You check the tag before doing float operations
        
             | pansa2 wrote:
             | And afterwards, because floating-point arithmetic can
             | change the value of the tag. This isn't necessary with NaN-
             | boxing, because it uses NaN bit patterns that the hardware
             | never generates.
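              | In code, the post-operation check might look like this
              | (a sketch; the range used here is illustrative, standing
              | in for whatever exponent prefixes the runtime's tags
              | actually reserve):

```javascript
// After arithmetic, re-check whether the result still fits the
// self-taggable range; if not, it must spill to the heap.
const inUnboxedRange = (d) =>
  d === 0 || (Math.abs(d) >= 2 ** -256 && Math.abs(d) < 2 ** 256);

function guardedMul(a, b) {
  const r = a * b;
  return inUnboxedRange(r)
    ? { unboxed: r }   // keep as tagged immediate
    : { boxed: r };    // overflow/underflow of the tag range: box it
}
```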
        
         | daanx wrote:
         | > This idea is so simple, so crazy, so stupid, and works so
         | well, but I never thought of it. Bravo to the authors.
         | 
          | Thanks for the nice summary -- looking forward to reading
          | the paper!
          | 
          | The same idea of self-tagging is actually also used in the
          | Koka language [1] runtime system, where by default the Koka
          | compiler only heap-allocates float64's when their absolute
          | value is outside the range [2^-511, 2^512) and not 0,
          | infinity, or NaN (see [2]). This indeed saves many (many!)
          | heap allocations for float-intensive programs.
         | 
          | Since Koka only uses 1 bit to distinguish pointers from
          | values, another slightly faster option is to box only
          | negative float64's, but of course negative numbers are still
          | quite common, so it saves fewer allocations in general.
         | 
         | [1] https://koka-lang.github.io/koka/doc/book.html#sec-value-
         | typ...
         | 
         | [2] https://github.com/koka-
         | lang/koka/blob/dev/kklib/src/box.c#L...
         | 
         | ps. If you enjoy reading about tagging, I recently wrote a note
         | on efficiently supporting seamless large integer arithmetic (as
         | used in Koka as well) and discuss how certain hardware
         | instructions could really help to speed this up [3]:
         | 
         | [3] https://www.microsoft.com/en-
         | us/research/uploads/prod/2022/0... (ML workshop 2022)
        
         | skybrian wrote:
         | It's clever, but not random chance. That would be too much of a
         | coincidence. They rotate the floats to make it happen the way
         | they want.
         | 
         | It's hardly random that only 8 bits of exponent are needed for
         | many calculations.
        
         | gus_massa wrote:
          | Nice explanation, but it took me a while to understand the
          | trick. They are hiding the tag in the "exponent" of the
          | float, not in the "mantissa"!
        
       | plagiarist wrote:
       | The idea in the paper is really cool.
       | 
       | People who enjoyed this might also like to read how Apple used
       | tagged pointers for short strings in Objective-C [0]. I think
       | that's when I first learned about tagged pointers. NaN-boxing was
       | mindblowing for me. I love this kind of stuff.
       | 
       | [0] https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-
       | point...
        
         | wging wrote:
         | Another cool thing that seems related: exploiting alignment to
         | free up N bits in a 'pointer' representation, because your
         | values have to be aligned. The JVM does this to expand the set
         | of possible addresses representable in 32 bits:
         | https://shipilev.net/jvm/anatomy-quarks/23-compressed-refere...
         | 
         | So, for example, with 3 bits of alignment required, the first
         | valid address for a pointer to point to after 0x0 is 0x8, and
         | after that is 0x10, but you represent those as 0x1 and 0x2
         | respectively, and use a shift to get back the actual address
         | (0x1 << 3 = 0x8, actual address). I think this is gestured at
         | in section 1.1 of the paper, sort of, except they envision
         | using the space thus freed for tags, rather than additional
         | bits. (Which only makes sense if your address is 32 bits
         | anyway, rather than 64 as in the paper: no one has 67-bit
         | addresses. So saving 3 bits doesn't buy you anything. I think.)
         | 
         | > Aligning all heap-allocated values to 64-bit machine words
         | conveniently frees the low bits of pointers to store a 3-bit
         | tag.
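            | The shift pair from the example above, in code form (a
            | sketch; BigInt just to avoid 32-bit truncation on large
            | addresses):

```javascript
// JVM-style compressed references: store addr >> 3 in 32 bits,
// shift back on use (assumes all objects are 8-byte aligned).
const SHIFT = 3n;
const compress = (addr) => addr >> SHIFT;     // 0x8n  -> 0x1n
const decompress = (ref) => ref << SHIFT;     // 0x2n  -> 0x10n
```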
        
           | plagiarist wrote:
           | It's interesting which runtimes exploit the extra space for
           | what reasons! Definitely makes more sense to have the extra
           | address space on 32 bits compared to 64. I wonder if the
           | extra addresses are specific to JVM / not something that
           | works well in the C family?
        
             | lmm wrote:
             | Well in C you have non-aligned pointers, because you can
             | have pointers to things that aren't objects and might not
             | be aligned (e.g. individual chars or shorts). In Java
             | everything is at least 8-byte-aligned, you can't store a
             | loose char/short/int on the heap (it has to go in a boxed
             | object that's 8-byte-aligned, though the compiler will do
             | this semi-automatically) and you can't take a pointer to an
             | individual element of an array.
             | 
             | If you applied the C standard strictly, you could use a
             | JVM-style representation for pointers to longs, pointers,
             | and structs that start with longs and pointers, so you
             | could theoretically have an implementation where those
             | pointers were shorter. But you'd have to convert back and
             | forth when casting to and from void* (and char*), and in
              | practice C people expect to be able to cast a long* to
              | int*, cast that to void*, and get the same result as
              | casting long* to void*, even though doing that and using
              | it is undefined behaviour according to the standard.
        
       | comex wrote:
       | > For instance, NaN-tagging prevents (or largely complicates)
       | optimizations relying on stack allocations. The stack uses high
       | memory addresses that do not fit in 48 bits unless encoded
       | relative to the location of the stack segment.
       | 
       | Er, what? The paper says they tested on a Xeon CPU, so x86-64,
       | running Linux. On traditional x86-64, all pointers fit in 48
       | bits, period. Stack memory is no exception. More recently the
       | architecture was extended to allow 56-bit pointers, but my
       | impression is that Linux (like other OSes) keeps them disabled by
       | default in userspace. According to the documentation [1]:
       | 
       | > Not all user space is ready to handle wide addresses. [..] To
       | mitigate this, we are not going to allocate virtual address space
       | above 47-bit by default.
       | 
       | So how would the stack end up above 47 bits? Is the documentation
       | out of date?
       | 
       | [1] https://docs.kernel.org/arch/x86/x86_64/5level-paging.html
        
         | pizlonator wrote:
         | Yeah I think they're just wrong about this.
        
         | 4ad wrote:
          | The address space size limitation doesn't mean that only the
          | least significant bits are used; the memory hole is in the
          | middle of the address space[1].
         | 
          | I don't know what Linux does specifically (or under what
          | configurations), but on some other operating systems the
          | user space stack is in the higher half[2].
         | 
         | [1]
         | https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresse...
         | 
         | [2] https://github.com/illumos/illumos-
         | gate/blob/master/usr/src/...
        
       | funny_falcon wrote:
       | CRuby uses this technique on 64bit platforms for years.
       | 
       | Edit: the commit
       | https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8...
        
         | scottlamb wrote:
         | > CRuby uses this technique on 64bit platforms for years.
         | 
         | What do you mean by "this technique"?
         | 
         | The paper says that CRuby uses tagged objects but could benefit
         | from the innovation being discussed here, a specific bit
         | pattern used to tag floats. See the following quote:
         | 
         | > Therefore, implementations that represent floats as tagged
         | pointers could benefit from it with minimal implementation
         | effort. Such popular implementations include CPython [11],
         | CRuby [32] and Google's V8 [33].
        
           | funny_falcon wrote:
            | I mean, CRuby has done "Float Self Tagging" for years. The
            | paper just has a mistake about CRuby.
        
           | funny_falcon wrote:
           | In fact, CRuby successfully combined "Self Tagging" with
           | pointer tagging. Here's the commit:
           | 
           | https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8.
           | ..
        
       | pizlonator wrote:
        | I doubt this is actually faster than NaN tagging, or what
        | they call NuN tagging. The code sequences they cite for
        | encoding and decoding are worse than what I expect NuN
        | tagging to give you.
       | 
       | If they want to convince me that their thing is faster, they
       | should do a comparison against a production implementation of NuN
       | tagging. Note that the specifics of getting it right involve
       | wacky register allocation tricks on x86 and super careful
       | instruction selection on arm.
       | 
       | It seems that they use some very nonstandard JS implementation of
       | NaN tagging as a strawman comparison.
       | 
       | (Source: I wrote a lot of the NuN tagging optimizations in
       | JavaScriptCore, but I didn't invent the technique.)
        
       ___________________________________________________________________
       (page generated 2024-11-28 23:00 UTC)