[HN Gopher] What is the best pointer tagging method?
___________________________________________________________________
What is the best pointer tagging method?
Author : celeritascelery
Score : 102 points
Date : 2024-09-09 13:33 UTC (2 days ago)
(HTM) web link (coredumped.dev)
(TXT) w3m dump (coredumped.dev)
| erik_seaberg wrote:
| I had never heard of "top byte ignore," but it reminds me of the
| macOS migration to "32-bit clean" software as the hardware began
| to support more than the low 24 address lines.
|
| https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...
|
| The other approach is CompressedOops, where instead of wasting
| pointer bits (and maybe using them for tags), Java's HotSpot VM
| chooses to only store a 32-bit offset for an eight-aligned heap
| object if the entire heap is known to fit within 2^(32+3) which
| is 32 GB from its base address.
|
| https://news.ycombinator.com/item?id=22398251
|
| And didn't somebody write about creating a large aligned arena
| for each type and essentially grabbing the base address of the
| arena as a (non-unique) type tag for its objects? Then the moving
| GC would use these arenas as semispaces.
| saagarjha wrote:
| Sorry, if the tests were run on a MacBook why are there Intel
| assembly snippets?
| hansvm wrote:
| Older MacBooks are Intel, and newer MacBooks claim to have an
| emulation layer faster than native x86. If nothing else, it's
| the machine the author had, and it's some data point.
| saagarjha wrote:
| They said it was an ARM MacBook
| celeritascelery wrote:
| I wrote the snippets in x86 asm because I assumed that more
| people are familiar with it than with ARM. I also ran them on
| an x86 processor.
| jandrewrogers wrote:
| It is worth mentioning that on Intel x86 going all the way back
| to Haswell you can use PDEP/PEXT instructions to efficiently
| combine the high bits and the low bits into a single seamless
| tag. This can provide a lot of bit width. The caveat is AMD x86
| implemented these instructions as uselessly slow microcode until
| quite recently, which created some portability issues.
| chc4 wrote:
| > Currently, ARM is the only major vendor with support for TBI
|
| is not true. Intel and AMD both have variants of TBI on their
| chips, called Linear Address Masking and Upper Address Ignore
| respectively. It's a bit of a mess, unfortunately, with both
| masking off different bits from the top of the address (and
| different bits than ARM TBI does), but it does exist.
| cryptonector wrote:
| Then it might as well not exist, if it's so messy.
| Joker_vD wrote:
| Honestly, TBI seems like a very bad idea, because it breaks
| forward-compatibility.
| zero_iq wrote:
| 72057594037927936 addresses ought to be enough for anybody...
| ;)
| Joker_vD wrote:
| The problem is that those addresses are completely
| interchangeable: nothing stops e.g. malloc() from allocating
| addresses near the very top of the legal range instead of
| starting near the end of .data. In fact, mmap(2) seems to do
| pretty much that by default, so reusing an address's top bits
| is inherently unreliable: you don't know how many of them are
| actually unused, which is precisely why x64 made addresses
| effectively sign-extended integers.
| chc4 wrote:
| You opt in to any of the top-byte masking schemes via
| prctl on Linux. It's fully forward compatible, in that
| programs that don't enable it will continue to work like
| normal. Additionally, Linux won't map memory at addresses
| higher than 2^48 by default, because _non-hardware
| accelerated_ top-bits pointer tagging would have the same
| problem. I don't think either of your complaints is
| valid here.
| hinkley wrote:
| Java has been using (or at least has had the ability to use)
| the upper bits for concurrent mark-and-sweep for a decade, to
| implement write barriers on objects that are still in the
| process of being manipulated.
|
| It's an idea Cliff Click first employed while at Azul, and it
| has now made it back into HotSpot.
| dzaima wrote:
| On NaN-boxing, it's possible to put tags in the top instead of
| low bits - 64-bit floats have 52 bits of mantissa, 4 of which are
| in the top 16; though you only get 7 tags due to having to leave
| qNaN & infinity (8 may be possible if you can guarantee that the
| zero tag never has zero payload), or 4 for potentially simpler
| tag checks. Or, going the other direction, you can double the tag
| count to 14 or 16 by also using the sign, at the cost of a "<<1"
| in the is-float check.
| o11c wrote:
| This doesn't mention split tagging (or of course, the best answer
| of "no tagging", which is often best implemented as "tag stored
| in debug mode, but not normally since static types are tracked").
|
| If you can reduce your tag to a single bit (object vs primitive),
| a single byte of tag data can cover 8 variables, and a bigger
| integer can cover a whole 64 variables, plenty for most
| functions.
| adgjlsfhk1 wrote:
| these tag bits are often used for the GC and there you really
| don't want to collect all of the tag data together since it
| would cause false sharing issues
| JonChesterfield wrote:
| I suspect, but haven't properly measured, that pointer tagging
| upsets speculative loads / branch prediction (when you're loading
| an address) to varying extent across different tagging schemes
| and different cpu implementations. I'd hope setting low bits are
| cheaper than high bits but really should write the microbenchmark
| to find out someday. Anyone know of existing attempts to
| characterise that?
| Joker_vD wrote:
| I'm fairly certain that the lower bits are masked away on
| memory reads by pretty much everything that has an on-board
| cache anyhow, so they're fair game. Some ISAs even mandate
| this masking-away for larger-than-byte loads.
| JonChesterfield wrote:
| My guesswork for x64 would be that all is good if
| dereferencing the tagged value would hit in the same cache
| line as dereferencing the untagged value. Though I could also
| be persuaded that x64 completely ignores the top 16 bits
| until the last moment (to check consistency with the 17th
| bit) in which case high tagging would be free. It seems
| relatively likely to be something that is different across
| different x64 implementations. But so far I'm just running
| with "it's probably fine, should benchmark later"
| celeritascelery wrote:
| Why would that impact speculative loads/branch prediction? The
| pointers are untagged before they are accessed so it should not
| impact the loads.
| JonChesterfield wrote:
| You want the address to be visible to the CPU somewhat early
| so that the target (might be) in the cache before you use it.
| I'd expect pointer tagging to obstruct that mechanism - in
| the worst case codegen might mask out the bits immediately
| before the memory operation. I don't know how transparent
| this sort of thing is to the core in practice and haven't
| found anyone else measuring it.
| chc4 wrote:
| That's not really how out-of-order execution in CPUs works.
| The address doesn't have to be fully computed X cycles
| before a load in order to be filled. Loads are filled as
| their dependencies are computed: requiring an additional
| operation to compute the address means your address is
| essentially delayed by one cycle - but that's latency, not
| throughput, and it only actually makes your code slower if
| your pipeline stalls.
| dzaima wrote:
| Data memory-dependent prefetchers are a thing (..with
| expected side-channel potential), and tagging would
| conceivably make it non-functional. Though,
| realistically, I wouldn't expect for it to make much
| difference.
| tempodox wrote:
| I like how SBCL does it. Heap box addresses have their bit 0 set,
| which makes them odd and thus unfit for direct access. But real
| accesses are register indirect with offset, with odd offsets to
| get an even address. So each time you see an odd address offset
| in SBCL-generated assembly, you know you're dealing with a heap
| box. I can only surmise this was a deliberate choice to aid
| orientation when reading generated assembly. If so, somebody
| among the designers of SBCL has a heart for crazy people like me.
| dzaima wrote:
| On the other hand, this may make it worse for aarch64 & RISC-V,
| which can have shorter encodings for loads/stores with an
| immediate offset that's a multiple of the operated-on data:
| https://godbolt.org/z/zT5czdvWc
| whizzter wrote:
| One huge benefit of keeping boxed objects at odd values (tag
| in the low bits) is that you can add or subtract integers
| without removing the tag bit, doing the operation, and then
| re-adding it: you can just add two values with the regular
| add instruction and use the result, since the lowest bit
| won't change. On the other hand, offsetting the pointer
| doesn't make much of a difference, since heap values will
| often be accessed with an offset anyway (and offset
| load/store instructions are more or less free in most cases,
| so subtracting one from an offset doesn't change the cost).
| naasking wrote:
| > However, that is not something that is easy to create a
| microbenchmark for. The benefit of nan-boxing is that we don't
| have to dereference a pointer to get the float.
|
| That's not the only benefit. The main benefit is arguably that
| you don't have to allocate floats on the heap and garbage collect
| them. Numerical code allocates _lots_ of numbers, so having these
| all be inline rather than heap-allocated saves lots of space and
| time.
| hinkley wrote:
| The wasted space comes with administrative overhead, and
| that overhead can partially poison the cache.
___________________________________________________________________
(page generated 2024-09-11 23:02 UTC)