[HN Gopher] Float Self-Tagging
___________________________________________________________________
Float Self-Tagging
Author : laurenth
Score : 24 points
Date : 2024-11-26 20:19 UTC (2 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| vanderZwan wrote:
| Really cool solution! One question: maybe I missed it, but
| there's no _technical_ reason the tag bits could not use the
| entire range of exponent bits, no? Other than the fact that
| having up to 2048 tags would be ridiculously branchy, I guess.
|
| Here's a variation I just thought of, which probably has a few
| footguns I'm overlooking right now: use the seven highest
| exponent bits instead of three. Then we can directly read the
| most significant byte of a double, skipping the need for a
| rotation altogether.
|
| After reading the most significant byte, subtract 16 (or
| 0b00010000) from it, then mask out the sign. To test for unboxed
| doubles, test if the resulting value is bigger than 15. If so, it
| is an unboxed double, otherwise the lower four bits of the byte
| are available as alternative tags (so 16 tags).
|
 | Effectively, we adjusted the exponent range 2^-767..2^-511 into the
| 0b(0)0000000 - 0b(0)0001111 range and made them boxed doubles.
 | _Every_ other double, which is 15/16ths of all possible doubles,
| is now unboxed. This includes subnormals, zero, all NaN encodings
| (so it even can exist in superposition with a NaN or NuN tagging
| system I guess?) and both infinities.
|
| To be clear, this is off the top of my head so maybe I made a few
| crucial mistakes here.
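A quick sketch of that check in Python — this is only the idea as described in the comment above, not the paper's scheme, and big-endian `struct` packing stands in for reading the most significant byte directly:

```python
import struct

def top_byte(x: float) -> int:
    # Most significant byte of the IEEE 754 binary64 encoding of x.
    return struct.pack(">d", x)[0]

def is_unboxed(x: float) -> bool:
    # Subtract 16 from the top byte and mask out the sign bit; values
    # above 15 are unboxed doubles, while 0..15 would be boxed (the low
    # four bits then being reusable as alternative tags).
    return ((top_byte(x) - 16) & 0x7F) > 15

# Ordinary magnitudes, zeros, infinities and NaNs stay unboxed...
assert all(is_unboxed(v) for v in
           (0.0, -0.0, 1.0, -1.0, 1e100, float("inf"), float("nan")))
# ...while doubles in the displaced exponent range get boxed.
assert not is_unboxed(2.0 ** -600)
```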
| vanderZwan wrote:
| Also, funny enough, this idea and variations of it look
| surprisingly easy to implement in JS itself using typed arrays.
 | I won't make any claims about the impact on performance though...
|
| You need a Float64Array for the values (plus a
| Uint32Array/Uint8Array using the same buffer to be able to
| manipulate the integer bits, which _technically_ also requires
 | knowledge of endianness since it's _technically_ not in the JS
| spec and some funny mobile hardware still exists out there).
| Mocking a heap is easy enough: the "heap" can be a plain
| array, the "pointers" indices into the array. Using a
 | Map<value, index> would let you "intern" duplicate values (i.e.
| ensure doubles outside the unboxed range are only "heap"
| allocated once).
| omelancon wrote:
| Hi, I'm one of the authors of the paper. Thanks for your
| questions and comments!
|
 | There are many reasons why 3-bit tags work well in practice.
 | Importantly, they allow aligning heap objects on 64-bit machine
| words. Dereferencing a tagged pointer can then be done in a
| single machine instruction, a MOV offset by the tag.
|
 | One of our goals is to make self-tagging as straightforward as
 | possible to implement in existing systems. Thus requiring a move
 | away from the ubiquitous 3-bit tag scheme is definitely a
 | no-go.
|
| Another goal is performance, obviously. It turns out that if
| the transformation from self-tagged float to IEEE754 float
| requires more than a few (2-3) machine instructions, it is no
| longer as advantageous. Thus the choice of tags 000, 011 and
 | 100, whose encoding/decoding is a single bitwise rotation.
|
| Also keep in mind that assigning more tags to self-tagging to
| capture floats that are never used in practice just adds a
| strain on other objects. That's why we include a usage analysis
| of floats to guide tag selection in our paper.
|
| In fact, we are currently working on an improved version that
| uses a single tag to capture all "useful" values. Capturing 3/8
 | of floats seems unnecessary, especially since one of the tags is
 | only meant to capture +-0.0. The trick is to rotate the tag to
| bits 2-3-4 of the exponent instead of 1-2-3 and add an offset
| to the exponent to "shift" the range of captured values.
|
| But in the end, it feels like we are barely scratching the
| surface and that a lot of fine-tuning can still be done by
| toying with tag placement and transformation. But I think the
| next step is a quality over quantity improvement: capture less
| floats but capture the right ones.
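To make the rotation concrete, here is a rough Python sketch of the mechanism only: how a single 64-bit rotation moves the sign bit and the top two exponent bits into a low 3-bit tag and back. Which particular tag values a runtime accepts as unboxed floats is a separate design choice, as discussed above.

```python
import struct

MASK = (1 << 64) - 1

def rotl(v: int, n: int) -> int:
    return ((v << n) | (v >> (64 - n))) & MASK

def rotr(v: int, n: int) -> int:
    return ((v >> n) | (v << (64 - n))) & MASK

def encode(x: float) -> int:
    # Rotate the double's bits left by 3: the sign bit and the two
    # top exponent bits land in the low 3 bits, doubling as the tag.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return rotl(bits, 3)

def decode(word: int) -> float:
    # Decoding is just the inverse rotation.
    return struct.unpack(">d", struct.pack(">Q", rotr(word, 3)))[0]

assert decode(encode(1.0)) == 1.0
assert decode(encode(-2.5)) == -2.5
assert encode(1.0) & 0b111 == 0x3FF0000000000000 >> 61  # tag = top 3 bits
```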
| vanderZwan wrote:
| Thank you for the explanation! I was not aware of the context
| behind the choice of three-bit tags.
|
| > _The trick is to rotate the tag to bits 2-3-4 of the
| exponent instead of 1-2-3 and add an offset to the exponent
| to "shift" the range of captured values._
|
| Maybe I misunderstand, but isn't that a similar idea to what
| I just described? Adding an offset to "rotate" the ranges of
| the exponent by a segment, putting the one with zero in the
| high side? The main difference being that you stick to the
| upper four bits of the exponent, and that I suggested using
| one of the upper three bit sequences to mark bits 4-5-6-7 as
| tag bits? (I mistakenly included the sign bit before and
| claimed this unboxes 15/16ths of all doubles, it's actually
| 7/8ths) Which probably has consequences for minimizing
| instructions, like you mentioned.
|
| > _But I think the next step is a quality over quantity
| improvement: capture less floats but capture the right ones._
|
| I suspect that ensuring NaN and Infinity are in there will be
| crucial to avoid performance cliffs in certain types of code;
 | I have seen production code where it is known that initialized
 | values are always finite, so either of those two is then
 | used as a way to "tag" a float value as "missing" or something
| to that degree.
|
| Anyway, looking forward to future results of your research!
| omelancon wrote:
 | > _Maybe I misunderstand, but isn't that a similar idea to
| what I just described? Adding an offset to "rotate" the
| ranges of the exponent by a segment..._
|
 | Yes, it is similar. It seems to me that there really aren't
 | that many useful operations that can be applied to the
 | exponent besides adding an offset. But that's only a
 | suspicion, so do not take my word for it.
|
| > _I suspect that ensuring NaN and Infinity are in there
| will be crucial to avoid performance cliffs..._
|
| This is a reasonable assumption. There are in fact ways to
 | rotate and add an offset such that the exponent can
 | overflow/underflow to capture exponents 0 and 0x7ff (for
| inf and nan) with a single well-positioned tag. Making it
| work in practice is not as simple, but we are working on
| it.
| funny_falcon wrote:
 | Just a correction: CRuby has used "Float Self-Tagging" for years.
| refulgentis wrote:
| This is your 4th comment claiming this: it's not true.
|
| You're right that Ruby uses tags, ex. Objective-C does also
| and has for a while.
|
 | The innovation here is that it's a tag _without the tag bits_.
 | That's why it's self-tagging, not tagging.
| jacobp100 wrote:
| Just to add some JS engine info - every engine stores numbers
| as 32 bit integers if possible, which is usually done by
 | tagging a pointer. Also, JSC needs pointers to be represented
 | exactly as-is, because it will scan the stack for anything
 | that looks like a pointer and retain it when running the GC.
| pavpanchekha wrote:
 | I loved this clever, weird, awesome paper, so here's a short
 | summary.
|
| In many dynamic languages some values are stored on the heap
| ("boxed") and represented as a pointer, while others are
| represented as an immediate value ("unboxed" or "immediate").
| Pointer tagging is a common way to do that: the low bit of the
| value tells you the value's type, and some types are immediate
| while others are boxed.
|
| Naturally, the tag bits have a fixed value, so can't be used to
| store data. So for example your language might offer 61-bit
| integer immediates instead of 64-bit integers; the other three
| bits are used for tags. Possibly, larger integers are stored on
| the heap and treated as a different type (for example Python 2.X
| had separate int and long types for these cases).
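A minimal sketch of that kind of low-bit tagging in Python (the tag values here are invented for illustration; real runtimes differ):

```python
# Low 3 bits hold the tag; the payload lives in the upper 61 bits.
TAG_BITS, TAG_MASK = 3, 0b111
TAG_INT = 0b001  # made-up tag value for "small integer immediate"

def box_int(n: int) -> int:
    # A 61-bit integer immediate: shift the payload up, OR in the tag.
    return (n << TAG_BITS) | TAG_INT

def unbox_int(v: int) -> int:
    assert v & TAG_MASK == TAG_INT, "not a small-int immediate"
    return v >> TAG_BITS  # arithmetic shift restores the sign too

assert unbox_int(box_int(42)) == 42
assert unbox_int(box_int(-5)) == -5
```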
|
| However, it's hard to use this strategy for floats, because
| floats need all 64 bits (or 32 bits for single-precision, same
| difference). There's a trick called "NaN boxing" which makes use
| of the large number of NaNs in the float representation, but read
| the paper if you want more on that.
|
| The authors' insight is that, suppose you have a three-bit tag
| and 011 is the tag for floats. By totally random chance, _some_
| floats will end in 011; you can represent those as immediates
| with those tag bits. Obviously, that's unlikely, though you can
| raise the chances by using, like, 010, 011, 100, and 101 all as
| float tags. Still, the low bits are a bad choice. But what about
| high bits? Most floats have one of a couple high bit patterns,
| because most floats are either 0 or between, say, 1e-100 and
| 1e100. Floats outside that range can be boxed but since they're
| really rare it's not a big cost to box them.
|
| So basically, we use high bits as our tag bits and map all the
| common float prefixes to float tags. This allows unboxing the
| vast majority of floats, which leads to _big_ speedups on float-
| heavy benchmarks.
|
| A personal note: I've been working in numerics and floating-point
 | for a decade now and have had to deal with float boxing from
| a research point of view (lots of runtime analysis systems for
| floats), from a user point of view (using unboxed float vectors
| for significant speedup in my own software), and from a teaching
| point of view (discussing boxing in my compilers class, using
| NaN-boxing as an example of cleverness).
|
| This idea is so simple, so crazy, so stupid, and works so well,
| but I never thought of it. Bravo to the authors.
| jonnycomputer wrote:
| Thank you for the clear explanation!
| pansa2 wrote:
| > _This allows unboxing the vast majority of floats, which
| leads to big speedups on float-heavy benchmarks._
|
| NaN-boxing allows _all_ floats to be unboxed though. The main
| benefit of the self-tagging approach seems to be that by boxing
| _some_ floats, we can make space for 64-bit pointers which are
| too large for NaN-boxing.
|
| The surprising part of the paper is that " _some_ floats " is
| only a small minority of values - not, say, 50% of them.
| adgjlsfhk1 wrote:
| 50% means you only get 1 tag bit.
|
 | also you totally can fit 64-bit pointers inside a NaN: such
 | pointers are only 48 bits and you have 53 bits of NaN
| payload. (you also could get an extra 3 bits if you only
| allow storing 8 byte aligned pointers unboxed)
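As a rough Python sketch of that layout — the quiet-NaN base pattern and the address below are illustrative, not any engine's actual encoding:

```python
import struct

QNAN = 0x7FF8_0000_0000_0000   # quiet NaN with an all-zero payload
PTR_MASK = (1 << 48) - 1       # a 48-bit address fits in the payload

def box_ptr(addr: int) -> float:
    assert addr == addr & PTR_MASK, "address must fit in 48 bits"
    return struct.unpack(">d", struct.pack(">Q", QNAN | addr))[0]

def unbox_ptr(x: float) -> int:
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits & PTR_MASK

addr = 0x7F3A_1234_5678        # made-up 48-bit address
boxed = box_ptr(addr)
assert boxed != boxed          # the boxed value really is a NaN
assert unbox_ptr(boxed) == addr
```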
| pansa2 wrote:
| > _50% means you only get 1 tag bit._
|
| That's enough to distinguish between "unboxed float" and
| "something else", where the latter can have additional tag
| bits.
|
| > _[64-bit] pointers are only 48 bits and you have 53 bits
| of NaN payload._
|
| The paper specifically talks about support for "high memory
| addresses that do not fit in 48 bits". If you don't have to
| handle those high addresses, I don't think this approach
| has any benefits compared to NaN-boxing.
| dzaima wrote:
| Of note is that even if you have some massive >=2^48 data
| sources, you could still quite likely get away with
| having NaN-boxed pointers to within the low-size heap,
| with an extra indirection for massive data. This only
| would break apart if you managed to reach around 2^45
| distinct referenceable objects, which you probably
| shouldn't ever have (esp. in a GCd language).
| skybrian wrote:
| A small minority, but apparently it includes all the floats
| you're likely to use. It seems the insight is that you only
| need 8 bits of exponent in most cases. (And single-precision
| floating point only has 8 bits of exponent.)
|
| _Most_ double-precision floats are never used because they
| have high exponents.
| pansa2 wrote:
| > _A small minority, but apparently it includes all the
| floats you're likely to use._
|
| Sorry, I meant a small minority need to be boxed - all the
| floats you're likely to use can remain unboxed.
| rtpg wrote:
| Do all float operations need to reconfirm those bits afterwards
| though? I suppose if you have some sort of JIT you can end up
| with a bunch of unboxed floats and would only pay the cost on
 | boundaries
| lifthrasiir wrote:
| Only when they have to be boxed, but yes if you are talking
| about that.
| ndjdjddjsjj wrote:
| You need to check after the floating point operation though
| just in case. Or after the boundary where you pass the
| float to something else expecting this scheme.
| pansa2 wrote:
| > _reconfirm those bits afterwards_
|
| Thanks - I hadn't thought about that but it seems to be the
| main downside of this approach. The benefit of NaN-boxing is
| that it reassigns values that are otherwise unused -
| floating-point calculations will never generate NaNs with
| those bit patterns.
| AlotOfReading wrote:
| An additional wrinkle is that NaNs are a bit unstable and
| can have large performance penalties. You can't let the
| NaNs ever escape into arithmetic and you may even have
 | issues storing them in a register.
| phire wrote:
| Yes, but there should be some optimisation opportunities.
|
| Off the top of my head: Any multiply by a constant less than
| 1.0 will never overflow the unboxed range (but might
| underflow) and there should be times when it's provably
| better to check the inputs are inside a range, rather than
| checking the outputs.
|
| It's worth pointing out that these overflow/underflow checks
| will be very cold (on typical code). They won't waste much in
| the way of branch-prediction resources.
|
| I wonder if it's worth taking advantage of floating point
| overflow/underflow exceptions. I think a multiplication by
| 2^767 will trigger an exception if the value would overflow,
| and the corresponding multiply by 2^-765 will catch
| underflows.
|
| It's tempting to allocate two more tags for floats (001 and
| 010), covering the entire range from -2^257 to +2^257. It
| will be rare to actually see those small floats near zero,
| but it could be worth eliminating the possibility of
| underflows.
| ithkuil wrote:
| You check the tag before doing float operations
| pansa2 wrote:
| And afterwards, because floating-point arithmetic can
| change the value of the tag. This isn't necessary with NaN-
| boxing, because it uses NaN bit patterns that the hardware
| never generates.
| daanx wrote:
| > This idea is so simple, so crazy, so stupid, and works so
| well, but I never thought of it. Bravo to the authors.
|
 | Thanks for the nice summary -- looking forward to reading the
 | paper!
|
 | The same idea of self-tagging is actually also used in the Koka
 | language [1] runtime system, where by default the Koka compiler
 | only heap allocates float64's when their absolute value is
 | outside the range [2^-511, 2^512) and not 0, infinity, or NaN
 | (see [2]). This indeed saves many (many!) heap allocations for
 | float-intensive programs.
|
| Since Koka only uses 1 bit to distinguish pointers from values,
| another slightly faster option is to only box negative
| float64's but of course, negative numbers are still quite
 | common, so it saves fewer allocations in general.
|
| [1] https://koka-lang.github.io/koka/doc/book.html#sec-value-
| typ...
|
| [2] https://github.com/koka-
| lang/koka/blob/dev/kklib/src/box.c#L...
|
| ps. If you enjoy reading about tagging, I recently wrote a note
| on efficiently supporting seamless large integer arithmetic (as
| used in Koka as well) and discuss how certain hardware
| instructions could really help to speed this up [3]:
|
| [3] https://www.microsoft.com/en-
| us/research/uploads/prod/2022/0... (ML workshop 2022)
| skybrian wrote:
| It's clever, but not random chance. That would be too much of a
| coincidence. They rotate the floats to make it happen the way
| they want.
|
| It's hardly random that only 8 bits of exponent are needed for
| many calculations.
| gus_massa wrote:
| Nice explanation but it took me a while to understand the
 | trick. They are hiding the tag in the "exponent" of the float,
 | not in the "mantissa"!
| plagiarist wrote:
| The idea in the paper is really cool.
|
| People who enjoyed this might also like to read how Apple used
| tagged pointers for short strings in Objective-C [0]. I think
| that's when I first learned about tagged pointers. NaN-boxing was
| mindblowing for me. I love this kind of stuff.
|
| [0] https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-
| point...
| wging wrote:
| Another cool thing that seems related: exploiting alignment to
| free up N bits in a 'pointer' representation, because your
| values have to be aligned. The JVM does this to expand the set
| of possible addresses representable in 32 bits:
| https://shipilev.net/jvm/anatomy-quarks/23-compressed-refere...
|
| So, for example, with 3 bits of alignment required, the first
| valid address for a pointer to point to after 0x0 is 0x8, and
| after that is 0x10, but you represent those as 0x1 and 0x2
| respectively, and use a shift to get back the actual address
| (0x1 << 3 = 0x8, actual address). I think this is gestured at
| in section 1.1 of the paper, sort of, except they envision
| using the space thus freed for tags, rather than additional
| bits. (Which only makes sense if your address is 32 bits
| anyway, rather than 64 as in the paper: no one has 67-bit
| addresses. So saving 3 bits doesn't buy you anything. I think.)
|
| > Aligning all heap-allocated values to 64-bit machine words
| conveniently frees the low bits of pointers to store a 3-bit
| tag.
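In code, that compressed-reference arithmetic is just a shift (a toy Python sketch, assuming 8-byte alignment as in the comment above):

```python
SHIFT = 3  # 8-byte alignment leaves the low 3 bits always zero

def compress(addr: int) -> int:
    assert addr % 8 == 0, "heap addresses are 8-byte aligned"
    return addr >> SHIFT   # 35-bit addresses now fit in 32 bits

def decompress(ref: int) -> int:
    return ref << SHIFT    # e.g. 0x1 << 3 == 0x8, the actual address

assert (compress(0x8), compress(0x10)) == (0x1, 0x2)
assert decompress(0x2) == 0x10
```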
| plagiarist wrote:
| It's interesting which runtimes exploit the extra space for
| what reasons! Definitely makes more sense to have the extra
| address space on 32 bits compared to 64. I wonder if the
| extra addresses are specific to JVM / not something that
| works well in the C family?
| lmm wrote:
| Well in C you have non-aligned pointers, because you can
| have pointers to things that aren't objects and might not
| be aligned (e.g. individual chars or shorts). In Java
| everything is at least 8-byte-aligned, you can't store a
| loose char/short/int on the heap (it has to go in a boxed
| object that's 8-byte-aligned, though the compiler will do
| this semi-automatically) and you can't take a pointer to an
| individual element of an array.
|
| If you applied the C standard strictly, you could use a
| JVM-style representation for pointers to longs, pointers,
| and structs that start with longs and pointers, so you
| could theoretically have an implementation where those
| pointers were shorter. But you'd have to convert back and
| forth when casting to and from void* (and char*), and in
| practice C people expect to be able to cast a long* to int,
| cast that to void*, and get the same result as casting
| long* to void*, even though doing that and using it is
| undefined behaviour according to the standard.
| comex wrote:
| > For instance, NaN-tagging prevents (or largely complicates)
| optimizations relying on stack allocations. The stack uses high
| memory addresses that do not fit in 48 bits unless encoded
| relative to the location of the stack segment.
|
| Er, what? The paper says they tested on a Xeon CPU, so x86-64,
| running Linux. On traditional x86-64, all pointers fit in 48
| bits, period. Stack memory is no exception. More recently the
| architecture was extended to allow 56-bit pointers, but my
| impression is that Linux (like other OSes) keeps them disabled by
| default in userspace. According to the documentation [1]:
|
| > Not all user space is ready to handle wide addresses. [..] To
| mitigate this, we are not going to allocate virtual address space
| above 47-bit by default.
|
| So how would the stack end up above 47 bits? Is the documentation
| out of date?
|
| [1] https://docs.kernel.org/arch/x86/x86_64/5level-paging.html
| pizlonator wrote:
| Yeah I think they're just wrong about this.
| 4ad wrote:
 | The address space size limitation doesn't mean that only the
 | least significant bits are used; the memory hole is in the
| middle of the address space[1].
|
| I don't know what Linux does specifically (or under what
 | configurations), but on some other operating systems the user
| space stack is in the higher half[2].
|
| [1]
| https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresse...
|
| [2] https://github.com/illumos/illumos-
| gate/blob/master/usr/src/...
| funny_falcon wrote:
| CRuby uses this technique on 64bit platforms for years.
|
| Edit: the commit
| https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8...
| scottlamb wrote:
| > CRuby uses this technique on 64bit platforms for years.
|
| What do you mean by "this technique"?
|
| The paper says that CRuby uses tagged objects but could benefit
| from the innovation being discussed here, a specific bit
| pattern used to tag floats. See the following quote:
|
| > Therefore, implementations that represent floats as tagged
| pointers could benefit from it with minimal implementation
| effort. Such popular implementations include CPython [11],
| CRuby [32] and Google's V8 [33].
| funny_falcon wrote:
 | I mean, CRuby has done "Float Self Tagging" for years. The
 | paper is just mistaken about CRuby.
| funny_falcon wrote:
| In fact, CRuby successfully combined "Self Tagging" with
| pointer tagging. Here's the commit:
|
| https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8.
| ..
| pizlonator wrote:
 | I doubt this is actually faster than NaN or what they call NuN
| tagging. The code sequences they cite for encoding and decoding
| are worse than what I expect NuN tagging to give you.
|
| If they want to convince me that their thing is faster, they
| should do a comparison against a production implementation of NuN
| tagging. Note that the specifics of getting it right involve
| wacky register allocation tricks on x86 and super careful
| instruction selection on arm.
|
| It seems that they use some very nonstandard JS implementation of
| NaN tagging as a strawman comparison.
|
| (Source: I wrote a lot of the NuN tagging optimizations in
| JavaScriptCore, but I didn't invent the technique.)
___________________________________________________________________
(page generated 2024-11-28 23:00 UTC)