[HN Gopher] What is the best pointer tagging method?
       ___________________________________________________________________
        
       What is the best pointer tagging method?
        
       Author : celeritascelery
       Score  : 102 points
       Date   : 2024-09-09 13:33 UTC (2 days ago)
        
 (HTM) web link (coredumped.dev)
 (TXT) w3m dump (coredumped.dev)
        
       | erik_seaberg wrote:
        | I had never heard of "top byte ignore," but it reminds me of the
        | classic Mac OS migration to "32-bit clean" software as the
        | hardware began to support more than the low 24 address lines.
       | 
       | https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...
       | 
        | The other approach is CompressedOops, where instead of wasting
        | pointer bits (and maybe using them for tags), Java's HotSpot VM
        | chooses to store only a 32-bit offset for an 8-byte-aligned heap
        | object if the entire heap is known to fit within 2^(32+3) bytes
        | (32 GB) of its base address.
       | 
       | https://news.ycombinator.com/item?id=22398251
       | 
       | And didn't somebody write about creating a large aligned arena
       | for each type and essentially grabbing the base address of the
       | arena as a (non-unique) type tag for its objects? Then the moving
       | GC would use these arenas as semispaces.
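        | 
        | A minimal sketch of the compressed-oops decode in C, assuming
        | a hypothetical heap_base and 8-byte alignment (illustrative
        | names, not HotSpot's actual code):
        | 
        |   #include <stdint.h>
        | 
        |   /* A 32-bit "compressed" reference is shifted by the
        |    * alignment (3 bits for 8-byte alignment) and added to
        |    * the heap base, so 2^(32+3) bytes are addressable. */
        |   static uint8_t *heap_base;   /* set at VM startup */
        | 
        |   static inline void *decompress_oop(uint32_t coop) {
        |       return heap_base + ((uint64_t)coop << 3);
        |   }
        | 
        |   static inline uint32_t compress_oop(void *p) {
        |       return (uint32_t)(((uint8_t *)p - heap_base) >> 3);
        |   }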
        
       | saagarjha wrote:
       | Sorry, if the tests were run on a MacBook why are there Intel
       | assembly snippets?
        
         | hansvm wrote:
         | Older MacBooks are Intel, and newer MacBooks claim to have an
         | emulation layer faster than native x86. If nothing else, it's
          | the machine the author had, and it's still a data point.
        
           | saagarjha wrote:
           | They said it was an ARM MacBook
        
         | celeritascelery wrote:
          | I wrote the snippets in x86 asm because I assumed that more
          | people are familiar with it than with ARM. I also ran them on
          | an x86 processor.
        
       | jandrewrogers wrote:
       | It is worth mentioning that on Intel x86 going all the way back
       | to Haswell you can use PDEP/PEXT instructions to efficiently
       | combine the high bits and the low bits into a single seamless
        | tag. This can provide a lot of bit width. The caveat is that AMD
        | implemented these instructions as uselessly slow microcode until
        | quite recently, which created some portability issues.
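        | 
        | A rough sketch of that trick with the BMI2 intrinsics (needs
        | -mbmi2; the mask is illustrative: 3 low alignment bits plus
        | 16 unused top bits gathered into one contiguous 19-bit tag):
        | 
        |   #include <immintrin.h>   /* BMI2: _pext_u64 / _pdep_u64 */
        |   #include <stdint.h>
        | 
        |   #define TAG_MASK 0xFFFF000000000007ULL
        | 
        |   static inline uint64_t get_tag(uint64_t tagged) {
        |       return _pext_u64(tagged, TAG_MASK);   /* gather */
        |   }
        | 
        |   static inline uint64_t set_tag(uint64_t ptr, uint64_t tag) {
        |       /* scatter the tag bits back into the masked positions */
        |       return (ptr & ~TAG_MASK) | _pdep_u64(tag, TAG_MASK);
        |   }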
        
       | chc4 wrote:
       | > Currently, ARM is the only major vendor with support for TBI
       | 
        | is not true. Intel and AMD both have variants of TBI on their
        | chips, called Linear Address Masking and Upper Address Ignore
        | respectively. It's a bit of a mess, unfortunately, with the two
        | schemes masking off different bits from the top of the address
        | (and different bits than ARM TBI does), but it does exist.
        
         | cryptonector wrote:
         | Then it might as well not exist, if it's so messy.
        
         | Joker_vD wrote:
         | Honestly, TBI seems like a very bad idea, because it breaks
         | forward-compatibility.
        
           | zero_iq wrote:
           | 72057594037927936 addresses ought to be enough for anybody...
           | ;)
        
             | Joker_vD wrote:
             | The problem is, those addresses are completely
             | interchangeable, nothing stops e.g. malloc() from
             | allocating addresses somewhere around the very top of the
             | legal addresses instead from starting near the .data's end.
             | In fact, it seems that mmap(3) in general does pretty much
             | that by default, so reusing address's top-bits is
             | inherently unreliable: you don't know how much of those are
             | actually unused which is precisely the reason why x64 made
             | addresses effectively signed-extended integers.
        
               | chc4 wrote:
                | You opt in to any of the top-byte masking schemes via
                | prctl on Linux. It's fully forward compatible, in that
                | programs that don't enable it will continue to work as
                | normal. Additionally, Linux won't map memory at addresses
                | higher than 2^48 by default either, because _non-hardware
                | accelerated_ top-bits pointer tagging would have the same
                | problem. I don't think either of your complaints is
                | valid here.
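                | 
                | For reference, the arm64 tagged-address opt-in looks
                | roughly like the sketch below (LAM on x86 goes through
                | arch_prctl instead); illustrative, not from the
                | article:
                | 
                |   #include <stdio.h>
                |   #include <sys/prctl.h>
                | 
                |   int main(void) {
                |       /* Relax syscall checks for tagged pointers;
                |        * TBI loads/stores already ignore the top
                |        * byte in user space. */
                |       if (prctl(PR_SET_TAGGED_ADDR_CTRL,
                |                 PR_TAGGED_ADDR_ENABLE, 0, 0, 0)) {
                |           perror("prctl");
                |           return 1;
                |       }
                |       puts("tagged address ABI enabled");
                |       return 0;
                |   }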
        
           | hinkley wrote:
           | Java has been using (or at least had the ability to use) the
           | upper bits for concurrent mark and sweep for a decade - to
           | implement write barriers on objects that are still in the
           | process of being manipulated.
           | 
            | It's an idea Cliff Click first employed while at Azul, and
            | it has now made its way back into HotSpot.
        
       | dzaima wrote:
       | On NaN-boxing, it's possible to put tags in the top instead of
       | low bits - 64-bit floats have 52 bits of mantissa, 4 of which are
       | in the top 16; though you only get 7 tags due to having to leave
       | qNaN & infinity (8 may be possible if you can guarantee that the
       | zero tag never has zero payload), or 4 for potentially simpler
       | tag checks. Or, going the other direction, you can double the tag
       | count to 14 or 16 by also using the sign, at the cost of a "<<1"
       | in the is-float check.
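        | 
        | A sketch of one possible high-bits layout along these lines
        | (not necessarily the exact scheme above): values with the
        | sign, exponent, and quiet bit all set and a non-zero 3-bit
        | tag are boxed; everything else reads as a plain double:
        | 
        |   #include <stdint.h>
        |   #include <string.h>
        | 
        |   /* sign + 11 exponent bits + quiet bit, all set */
        |   #define QNAN_MASK 0xFFF8000000000000ULL
        |   #define TAG_SHIFT 48   /* 3 tag bits: values 1..7 */
        | 
        |   static inline uint64_t box(unsigned tag, uint64_t pay48) {
        |       /* pay48 is the low 48 bits (pointer or small int) */
        |       return QNAN_MASK | ((uint64_t)tag << TAG_SHIFT) | pay48;
        |   }
        | 
        |   static inline int is_double(uint64_t bits) {
        |       /* tag 0 is reserved so the canonical negative qNaN
        |        * still reads as a float */
        |       return (bits & QNAN_MASK) != QNAN_MASK
        |           || ((bits >> TAG_SHIFT) & 0x7) == 0;
        |   }
        | 
        |   static inline double as_double(uint64_t bits) {
        |       double d;
        |       memcpy(&d, &bits, sizeof d);   /* bit-cast, no UB */
        |       return d;
        |   }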
        
       | o11c wrote:
       | This doesn't mention split tagging (or of course, the best answer
       | of "no tagging", which is often best implemented as "tag stored
       | in debug mode, but not normally since static types are tracked").
       | 
        | If you can reduce your tag to a single bit (object vs primitive),
        | a single byte of tag data can cover 8 variables, and a 64-bit
        | integer can cover 64 variables, plenty for most functions.
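        | 
        | A sketch of what that split layout could look like for a
        | frame of locals (layout and names made up for illustration):
        | 
        |   #include <stdint.h>
        | 
        |   /* One bit of tag per local slot, kept apart from the
        |    * values themselves (e.g. for the GC to scan). */
        |   struct frame {
        |       uint64_t tag_bits;  /* bit i set => slots[i] is a ptr */
        |       uint64_t slots[64]; /* raw pointers or primitives */
        |   };
        | 
        |   static inline int slot_is_object(const struct frame *f,
        |                                    unsigned i) {
        |       return (f->tag_bits >> i) & 1;
        |   }
        | 
        |   static inline void set_slot(struct frame *f, unsigned i,
        |                               uint64_t v, int is_obj) {
        |       f->slots[i] = v;
        |       f->tag_bits = (f->tag_bits & ~(1ULL << i))
        |                   | ((uint64_t)(is_obj != 0) << i);
        |   }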
        
         | adgjlsfhk1 wrote:
          | These tag bits are often used by the GC, and there you really
          | don't want to gather all of the tag data together, since that
          | would cause false-sharing issues.
        
       | JonChesterfield wrote:
       | I suspect, but haven't properly measured, that pointer tagging
       | upsets speculative loads / branch prediction (when you're loading
       | an address) to varying extent across different tagging schemes
        | and different CPU implementations. I'd hope setting low bits is
        | cheaper than setting high bits, but I really should write the
        | microbenchmark to find out someday. Anyone know of existing
        | attempts to characterise that?
        
         | Joker_vD wrote:
         | I'm fairly certain that the lower bits are masked away on
         | memory reads by pretty much everything that has an on-board
         | cache anyhow, so they're fair game. Some ISAs even mandate this
          | masking-away for larger-than-byte loads.
        
           | JonChesterfield wrote:
           | My guesswork for x64 would be that all is good if
           | dereferencing the tagged value would hit in the same cache
           | line as dereferencing the untagged value. Though I could also
           | be persuaded that x64 completely ignores the top 16 bits
            | until the last moment (checking only that they are
            | consistent with bit 47), in which case high tagging would be
            | free. It seems relatively likely to be something that
            | differs across x64 implementations. But so far I'm just
            | running with "it's probably fine, should benchmark later".
        
         | celeritascelery wrote:
         | Why would that impact speculative loads/branch prediction? The
         | pointers are untagged before they are accessed so it should not
         | impact the loads.
        
           | JonChesterfield wrote:
           | You want the address to be visible to the CPU somewhat early
           | so that the target (might be) in the cache before you use it.
           | I'd expect pointer tagging to obstruct that mechanism - in
           | the worst case codegen might mask out the bits immediately
           | before the memory operation. I don't know how transparent
           | this sort of thing is to the core in practice and haven't
           | found anyone else measuring it.
        
             | chc4 wrote:
              | That's not really how out-of-order execution in CPUs works.
              | The address doesn't have to be fully computed X cycles
              | before a load in order to be filled. Loads are filled as
              | their dependencies are computed: requiring an additional
              | operation to compute the address means the address is
              | essentially delayed by 1 cycle - but that's added latency,
              | not lost throughput, and it only actually makes your code
              | slower if your pipeline stalls.
        
               | dzaima wrote:
               | Data memory-dependent prefetchers are a thing (..with
               | expected side-channel potential), and tagging would
               | conceivably make it non-functional. Though,
               | realistically, I wouldn't expect for it to make much
               | difference.
        
       | tempodox wrote:
       | I like how SBCL does it. Heap box addresses have their bit 0 set,
        | which makes them odd and thus unfit for direct access. But real
        | accesses are register-indirect with an offset, using odd offsets
        | to get an even effective address. So each time you see an odd
        | address offset in SBCL-generated assembly, you know you're
        | dealing with a heap
       | box. I can only surmise this was a deliberate choice to aid
       | orientation when reading generated assembly. If so, somebody
       | among the designers of SBCL has a heart for crazy people like me.
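        | 
        | A sketch of the kind of access this describes, with made-up
        | names and a 1-bit lowtag (not SBCL's actual layout):
        | 
        |   #include <stdint.h>
        | 
        |   struct cons { uint64_t car, cdr; };
        | 
        |   #define LOWTAG 1ULL
        | 
        |   static inline uint64_t tag_box(struct cons *p) {
        |       return (uint64_t)p | LOWTAG;   /* odd "address" */
        |   }
        | 
        |   static inline uint64_t load_cdr(uint64_t boxed) {
        |       char *base = (char *)(uintptr_t)boxed;
        |       /* cdr lives at offset 8; the odd displacement 8 - 1
        |        * cancels the lowtag, giving one even-address load */
        |       return *(uint64_t *)(base + 8 - 1);
        |   }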
        
         | dzaima wrote:
         | On the other hand, this may make it worse for aarch64 & RISC-V,
         | which can have shorter encodings for loads/stores with an
         | immediate offset that's a multiple of the operated-on data:
         | https://godbolt.org/z/zT5czdvWc
        
         | whizzter wrote:
          | One huge benefit of keeping boxed objects odd (tag in the low
          | bit) is that you can implement addition/subtraction of the
          | integers without removing the flag bit, doing the operation,
          | and then re-adding it: you can just add two tagged values with
          | the regular add instruction and use the result, since the
          | lowest bit won't change. On the other hand, offsetting the
          | pointer doesn't make much of a difference, since heap values
          | will often be accessed with an offset anyway (and offset
          | load/store instructions are more or less free in most cases,
          | so subtracting one from an offset doesn't change the cost).
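          | 
          | A sketch of that fixnum trick (illustrative, not any
          | particular Lisp's layout): fixnums keep a 0 in the low bit,
          | so a plain add works and the tag survives:
          | 
          |   #include <stdint.h>
          | 
          |   typedef uint64_t value;   /* low bit 0: fixnum, 1: box */
          | 
          |   static inline value make_fixnum(int64_t n) {
          |       return (uint64_t)n << 1;   /* store shifted by one */
          |   }
          |   static inline int64_t fixnum_val(value v) {
          |       return (int64_t)v >> 1;
          |   }
          |   static inline value fixnum_add(value a, value b) {
          |       /* (2a) + (2b) == 2(a + b): the tag bit stays 0.
          |        * Overflow handling omitted. */
          |       return a + b;
          |   }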
        
       | naasking wrote:
       | > However, that is not something that is easy to create a
       | microbenchmark for. The benefit of nan-boxing is that we don't
       | have to dereference a pointer to get the float.
       | 
       | That's not the only benefit. The main benefit is arguably that
       | you don't have to allocate floats on the heap and garbage collect
       | them. Numerical code allocates _lots_ of numbers, so having these
       | all be inline rather than heap-allocated saves lots of space and
       | time.
        
         | hinkley wrote:
         | The wasted space has administrative overhead, and the
         | administrative overhead can partially poison the cache.
        
       ___________________________________________________________________
       (page generated 2024-09-11 23:02 UTC)