[HN Gopher] Pointer Tagging for x86 Systems
       ___________________________________________________________________
        
       Pointer Tagging for x86 Systems
        
       Author : rwmj
       Score  : 147 points
       Date   : 2022-03-31 09:47 UTC (13 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | devit wrote:
       | Seems like, at least as an initial implementation, the kernel
       | could disable the feature when entering kernel mode and then
       | manually mask all accesses to user space addresses. This way
       | mistakes will not impact security but only cause spurious system
       | call failures.
        
         | rwmj wrote:
          | These kinds of virtual-address-related mode switches tend to
          | be very slow, so, although I've not tested it, I'm going to
          | guess this would be an absolute performance killer.
        
       | erwincoumans wrote:
        | In some cases (especially user-land code), using array indexing
        | can be better than using pointers. It lets you re-use data
        | between GPU and CPU, use fewer bits for indices, and simplify
        | serialization. FlatBuffers is a very nice example of this:
        | https://google.github.io/flatbuffers/
        
         | rayiner wrote:
         | The issue addressed in the article with pointer tagging applies
         | whether you use array indexing or pointers. Both are converted
         | to pointer operations under the hood, which on X86 are some
         | form of ADDR = BASE + SCALE*INDEX + OFFSET. Such an operation
         | is still considered a pointer dereference (of BASE) from the
         | view of the hardware, and will trap if tags are stored in the
         | unused upper bits of the pointer.
         | 
          | Depending on the size of the array elements, you won't even be
          | able to use the built-in indexed addressing mode, because the
          | scale factor can only be 1, 2, 4, or 8. For that reason, LLVM
          | doesn't
         | even have array indexing. Loads/stores can only be done against
         | pointers, and pointers to items in arrays must be expressly
         | computed using a get-element-pointer instruction:
         | https://llvm.org/docs/GetElementPtr.html.
        
           | erwincoumans wrote:
           | Thanks for the insightful reply. I didn't realize array
           | indexing gets converted to pointers in the CPU. I expected
           | the array indexing assembly instructions to be plumbed all
           | the way down without converting to pointers until the very
           | end.
           | 
           | LLVM not supporting array indexing means all those
           | instructions go unused, on both x86 and arm? How about gcc or
           | other compilers?
        
             | clord wrote:
              | The compiler can convert small known-size, known-offset
              | array accesses into registers or direct static loads and
              | stores, but array indexing is usually done on large chunks
              | through a dynamic index via a reference or pointer, which
              | means pointer math.
        
             | rayiner wrote:
              | They don't go unused. Typically the separate address
              | calculation and load will get merged back together during
              | instruction selection if there is an applicable
              | instruction. (Although that may not happen either--the
              | compiler might transform the code to get rid of the array
              | indexing. Keeping around the base pointer takes an
              | additional register.)
             | 
              | But on x86 and ARM these instructions are limited. On x86,
              | a load's addressing mode (via the SIB byte) can treat one
              | operand as the array base pointer and another operand as
              | the index, but the size of the array element is limited to
              | 8 bytes, so if your array elements are, say, 16-byte
              | structs, you can't use those instructions. As to ARM, I
              | think it doesn't have an array indexing instruction at
              | all, just base + offset.
        
             | saagarjha wrote:
             | This is less about "array indexing" and more about arrays
             | being located at a specific memory address. At some point
             | you're going to have a buffer at address 0x1000 and want to
             | access the 42nd element, and you can't just use a tagged
             | address (e.g. 0x1000000001000) and 42 together without
             | stripping the tag somewhere.
        
         | uvdn7 wrote:
          | And that's not to mention better CPU cache hit rates!
        
           | KMag wrote:
           | I used to sit next to a guy who previously professionally
           | wrote compilers for both DEC Alpha and Itanium
           | (http://robert.muth.org/), who mentioned that programs often
           | ran faster on Alpha (which only supported 64-bit addressing)
           | when modified to use 32-bit array indexes instead of
           | pointers, due to reduced memory bandwidth/cache usage. Of
           | course, one had to first determine if one needs to use any
           | arrays larger than 2^32 elements.
        
             | mmastrac wrote:
             | Linux used to have the x32 ABI for similar improvements:
             | https://en.wikipedia.org/wiki/X32_ABI
        
       | skybrian wrote:
       | It seems like this isn't all that useful for JavaScript or other
       | dynamic languages where what you really want is NaN-boxing? But I
       | think I read somewhere that there are special instructions for
       | this.
        
         | saagarjha wrote:
         | NaN-boxing is just one way of doing pointer tagging.
        
       | kvakvs wrote:
        | Virtual machines use pointer tagging in the least significant
        | bits: since most data is 4- or 8-byte aligned, you can assume
        | that zeroing those bits will always give you a safe, correct
        | pointer value. The tag bits can tell the code that the value is
        | not a pointer but carries more data in its other bits, or that
        | it points to a structure on the heap needing special treatment.
        
         | khuey wrote:
         | Depends on the virtual machine. LuaJIT and Spidermonkey (and
         | probably others that I don't know about) use NaN-boxing which
         | ends up storing the metadata in the higher bits.
        
       | ChuckMcM wrote:
       | The UAI feature is nice! Back when I was noodling on creative
       | ways to use 64 bit addresses the idea of a 16 bit 'space id' was
       | bandied about. A space ID in the upper 16 bits would indicate the
       | address space for a particular thing, be it a process, the
        | kernel, a shared library, etc. A new opcode to trampoline from
        | one space to another, for things like system calls or library
        | calls, could provide a gateway for secure code analysis
        | (basically no "side" entrances into a space), and it would be a
        | much more robust pointer-testing mechanism because any read or
        | write with a pointer "outside" the space would be immediately
        | flagged as an error. The use of segments on the 286 was sort of
        | like this.
       | 
        | Ideally you'd have a fine-grained page map so that you could
        | avoid copies from one space to another (instead of
        | copy_in()/copy_out() you would do map_in()/map_out()) to get
        | access to physical memory, while preserving the semantic that
        | when something was mapped out of your space you couldn't affect
        | it. (Zero-copy network stacks have been known to have bugs of
        | the "a process left holding a pointer to a buffer modified it
        | when it wasn't 'owned'" kind.)
       | 
       | Could be a fun project for a RISC-V design. Not sure how many
       | people are experimenting with various MMU systems but it is at
       | least _possible_ to do these kinds of things in nominally  "real"
       | hardware (it has always been possible to simulate of course).
        
       | nraynaud wrote:
       | I can imagine the gold rush on those high order bits!
        
       | bogomipz wrote:
       | >"There are numerous use cases for stashing metadata into those
       | unused bits. Memory allocators could use that space to track
       | different memory pools, for example, or for garbage collection.
       | Database management systems have their own uses for that space."
       | 
        | Can someone say which databases do this, and what they're using
        | those unused bits in the address for?
        
         | ayende wrote:
         | Marking a page pointer as modified / accessed for example
        
       | celeritascelery wrote:
        | What would make this extra useful is if they could add an
        | instruction that reads the tag portion of the pointer (instead
        | of needing to do a shift). The ucode could optimize this away to
        | nothing more than a special case of load.
        
         | adrian_b wrote:
         | The tag is in the high byte (always for AMD and also for Intel
         | in the 57-bit mode).
         | 
         | Instead of a shift and a mask, on Intel/AMD CPUs you can use
         | the instruction BSWAP (to move the highest byte into the lowest
         | byte) and then you may access the register as a byte (even if
         | usually it does not matter what size you use to access the
         | register, because in most cases you just test individual tag
         | bits).
         | 
         | Moreover, on any Atom CPUs or any other CPUs newer than Haswell
         | (2013), you may skip the BSWAP by loading the pointer with a
         | MOVBE instruction, which does the byte swapping during the
          | load. This use of BSWAP/MOVBE is identical to the handling of
          | big-endian (a.k.a. network byte order) data, like in the
          | headers of network data packets. Because of that, these
          | operations are already exposed to C/C++ through OS-dependent
          | macros, e.g. __be64_to_cpu(x) on Linux, which can be used to
          | access the pointer tag.
         | 
         | No new instruction could be more compact than the old BSWAP or
         | MOVBE instructions, so there is absolutely no need for a new
         | special instruction.
         | 
         | For Intel in 48-bit address mode, the same BSWAP or MOVBE will
         | bring the tag into the 16-bit part of a register. You just need
         | to take care that the 2 bytes of the tag are swapped. This does
         | not matter, because you should define the structure that
         | corresponds to the tag to match the layout that results after
         | loading the byte-swapped pointer.
        
       | netfl0 wrote:
       | https://d3fend.mitre.org/technique/d3f:PointerAuthentication
       | 
       | We are building a KB of these sorts of things.
        
         | pjmlp wrote:
          | I failed to find any reference to SPARC ADI, the longest-
          | standing stable solution to this problem:
         | 
         | https://docs.oracle.com/cd/E53394_01/html/E54815/gqajs.html
         | 
         | https://www.kernel.org/doc/html/latest/sparc/adi.html
        
           | netfl0 wrote:
           | Thank you, it sounds like you have a lot of experience in
           | this domain, if you'd like to contribute we'd welcome more of
           | your perspective.
           | 
           | https://github.com/d3fend/d3fend-ontology
           | 
           | Otherwise we'll get this reference added.
        
       | londons_explore wrote:
        | Years ago, someone said "32-bit addresses! That's huge! Let's
        | use the top few bits for other stuff, like gate A20 and select
        | lines of hardware".
       | 
       | That came to bite people in the form of the "3GB hole", the "PCI
       | hole", etc. And those hacks were painful for lots of people.
       | 
       | I feel that by reusing address bits for other purposes in a non-
       | flexible way like this, we're just repeating the mistakes of
       | history. After all, there are 44 zettabytes of data in the world
       | (in 2020), and addressing that is already beyond what a 64 bit
       | number can do!
       | 
        | And one day, we'll have that much storage in our hands - your
        | phone has about 2^80 atoms in it, so storing 2^64 bytes in there
        | is totally physically possible.
        
         | scottlamb wrote:
         | > I feel that by reusing address bits for other purposes in a
         | non-flexible way like this, we're just repeating the mistakes
         | of history.
         | 
         | Those old schemes were permanent machine-wide assumptions on
         | physical addresses. They were essentially saying "32 - N bits
         | is enough for anyone" (on this hardware design).
         | 
         | This is on virtual addresses, and the kernel developers want
         | (and it sounds like the Intel feature allows) this to be
         | configured per-process. I think it's totally reasonable for
         | some process to say at runtime "56 bits is enough for me". In
         | fact, it's common for Java code to run with "Compressed Oops"
         | (32-bit pointers to 8-byte-aligned addresses, for a total of 32
         | GiB of addressable heap). This happens automatically when the
         | configured heap size is sufficiently small.
         | 
         | > After all, there are 44 zettabytes of data in the world (in
         | 2020), and addressing that is already beyond what a 64 bit
         | number can do!
         | 
         | I don't think it makes sense for one process to be able to mmap
         | all the data in the world. The major page fault latency would
         | be nasty!
         | 
         | > And one day, we'll have that much storage in your hand - your
         | phone has about 2^80 atoms in it, so storing 2^64 bytes in
         | there is totally physically possible.
         | 
         | That day is pretty far off I think, but even when it happens
         | I'm not sure it makes sense for all virtual pointers to be
         | >=64-bit. A couple reasons it might not:
         | 
         | * Most programs (depending somewhat on programming language)
         | use a lot of their memory for pointers, so it seems wasteful.
         | 
         | * I assume this hypothetical future device will still have
         | slower and faster bytes to access. It may still make more sense
         | to have different APIs for the fast stuff and the slow stuff.
         | mmap's limitation of stalling the thread while something is
         | paged in is a real problem today, and I don't know if that
         | would get any better. Likewise error handling.
        
         | hyperman1 wrote:
         | The A20 gate story was something different though, it abused
         | the numeric overflow of 16 bit addresses, not some unused bits
         | in the middle of an address.
         | 
          | An x86 address had a 16-bit segment and a 16-bit offset, with
          | the address being calculated as 16*segment+offset, truncated
          | to 20 bits. Segment F000 was in ROM and 0000 was in RAM, with
          | the low RAM addresses used for various BIOS functionality.
         | 
          | If you use a segment like 0xFF00, then offsets 0 to 0x0FFF
          | correspond to linear addresses 0xFF000 to 0xFFFFF, in ROM.
          | Offsets 0x1000 to 0xFFFF correspond to 0x00000 to 0x0EFFF, in
          | RAM. This trick meant you could set the segment register only
          | once, and use the offset to both read tables in ROM and use
          | variables in RAM. Of course contemporary BIOS and other
          | software used this trick to shave off a few instructions.
         | 
          | You know what happens next. The 80286 had 24 address lines,
          | not 20, so the addresses just above the existing 0xFFFFF = 1MB
          | became valid instead of wrapping around. In the name of
          | backward compatibility, someone found an unused pin on the
          | keyboard controller and attached an AND gate to it and to A20.
          | By talking to the keyboard controller, you could choose
          | between 16MB of RAM or the old wraparound behavior. The
          | keyboard controller was of course dog slow, and you had to set
          | and reset the A20 gate constantly to switch between not
          | triggering BIOS/DOS/TSR bugs and having usable upper RAM. This
          | hack is AFAIK still there today in every x86-based PC, even if
          | the gate stays enabled all the time.
        
           | ajross wrote:
           | > The A20 gate story was something different though, it
           | abused the numeric overflow of 16 bit addresses, not some
           | unused bits in the middle of an address.
           | 
           | It wasn't even an abuse. It was a genuine attempt by the
           | board manufacturer (IBM) to address a real backwards
           | compatibility problem with a new CPU part. Obviously as
           | memories grew far beyond 1M and real mode became a part of
           | history (well, and of bootstrap code) it was a giant wart.
           | But it solved a real problem.
        
         | colejohnson66 wrote:
         | The whole reason x86-64 required "canonical"[a] addresses in
         | the first place was to prevent people using them for this
         | purpose. Sure, this proposal allows applications to know how
         | many bits they can work with, but what happens when an
         | application developer writes `assert(freeBits >= 7)` when only
         | 6 are available in the future?
         | 
          | [a] a "canonical" address is one where bits 63 down to the
          | highest implemented virtual-address bit are identical. So on a
          | processor with 48-bit virtual addresses, bits 63:47 must be
          | identical (either all clear or all set). Attempting reads or
          | writes with non-canonical addresses raises a #GP fault.
        
           | zozbot234 wrote:
           | > Sure, this proposal allows applications to know how many
           | bits they can work with
           | 
           | That is entirely orthogonal to this "upper bits ignore"
           | feature. Ideally, a process would be able to set _any_
           | reasonable amount of upper bits as  "reserved for tagging",
           | and the system allocator and kernel would then simply not
           | require it to work with user virtual addresses that involve
           | those upper bits. But upper bits can't simply be "ignored" if
           | an app is to be forward-compatible; they still need to be
           | canonicalized whenever external code is called.
        
           | saagarjha wrote:
           | Then the application does not work on a future processor,
           | that is correct. The person who wrote it, who likely is a JIT
           | engineer, will get a bug report asking for this to be fixed,
           | although most likely they're already aware of the upcoming
           | hardware change and have prepared a different tagging scheme
           | already.
        
         | pjmlp wrote:
        | Hopefully by then the languages that are the reason pointer
        | tagging exists in the first place won't matter as much as they
        | still do today.
        
           | cmrdporcupine wrote:
           | Given its longevity and installed base, I see no reasonable
           | possibility of C/C++ going away or being significantly
           | curtailed in the next half century.
           | 
            | C is 50 years old this year. And it's _everywhere_.
           | 
            | It's more likely we'll be (or my kids will be) involved in
            | harm reduction around these systems, rather than outright
            | replacement.
        
             | praptak wrote:
             | Who knows, maybe we'll have a cheap (semi-)automatic
             | rewrite from C to a memory-safe language by then.
        
             | londons_explore wrote:
             | Look to other industries. There are plenty of standard
             | things that are suboptimal, but still used because they're
             | so standard. For example, the English language has plenty
             | of inefficiencies and inconsistencies that date back over
             | 1000 years. But to fix them would be too much work for too
             | little gain.
        
             | pjmlp wrote:
              | Cobol and Fortran are still around; they even support
              | modules, some form of generics, and OOP, yet you don't see
              | many people jumping for joy writing new libraries in them.
        
               | cmrdporcupine wrote:
               | Whereas people still jump for joy at writing new
               | libraries in C++ and it's coming up on 40 years old.
               | 
               | C++11 saved the language from dying off. Somehow it's my
               | favourite language to work in (though mainly because
               | nobody is paying me to work in Rust)
        
               | pjmlp wrote:
                | We are talking mostly about C here, but in any case, I
                | don't jump for joy when looking at code that depends on
                | SFINAE, ADL and CTAD.
        
           | rwmj wrote:
           | I don't think C is going away any time soon, and even if it
           | did, C reflects a common model of how hardware works that is
           | shared by plenty of other higher level languages. Also GCs
           | use pointer tagging.
        
             | pjmlp wrote:
             | The irony of C Machines.
        
           | karatinversion wrote:
           | Pointers are a core part of the CPU memory model on both ARM
           | and x64, so they aren't going away anytime soon; there's no
           | reason compilers, JITs or runtimes couldn't use pointer
           | tagging, just the same as C programmers.
        
             | pjmlp wrote:
             | True, but the way they are exposed to the programmers
             | matter.
             | 
             | Languages like Modula-2 or Ada, to use two C like
             | languages, don't suffer from pointer abuse as much as C,
             | because they provide better abstractions, thus raw pointer
             | tricks aren't as dangerous to the whole system stability.
             | 
              | While JITs and runtimes could make use of tagging as well,
              | Lisp and Ada machines kind of proved the point that, with
              | memory-safe runtimes, special CPU extensions aren't as
              | useful as general-purpose CPUs, which is why the industry
              | moved away from such solutions.
              | 
              | So there is a certain irony that we are now going back to
              | C Machines as the ultimate mitigation.
        
               | saagarjha wrote:
               | Lisp and Ada machines happened to be very slow in
               | practice. Turns out people prefer CPU extensions to make
               | their C code fast rather than compiling it to a more
               | exotic architecture.
        
               | pjmlp wrote:
                | I think you missed the point: Lisp and Ada machines were
                | never designed to run C code, other than as a curiosity.
        
               | saagarjha wrote:
               | Well, the two points I was making were 1. people want to
               | run C code anyways, so that's not really an option for
               | general-purpose use and 2. Lisp and Ada machines ended up
               | being slow for Lisp and Ada, which makes them pretty
               | unattractive.
        
           | saagarjha wrote:
           | You mean most languages that run on virtual machines today,
           | right? Java, JavaScript, C#, Lua, ...
        
             | pjmlp wrote:
             | LLVM, WASM,....
        
               | tawaypol wrote:
               | LLVM is as much a virtual machine as the "C Virtual
               | Machine" from the spec is. To my knowledge there are no
               | LLVM IR interpreters.
        
       | amelius wrote:
       | I suppose this could be useful for (mark-and-sweep) garbage
       | collectors.
        
         | fsfod wrote:
          | I know at least Java's ZGC uses top bits as metadata for a
          | load barrier to check for relocated objects. It fakes hardware
          | pointer masking by mapping the same heap at multiple
          | addresses: https://www.opsian.com/blog/javas-new-zgc-is-very-
          | exciting/
        
       | pjmlp wrote:
        | Intel's history has quite a few failed attempts at memory
        | tagging (iAPX 432, i960, MPX); maybe it needs another AMD push
        | to make it work.
        
         | adrian_b wrote:
          | There are several cases where Intel botched the definition of
          | some instructions, AMD later had to redefine them correctly,
          | and eventually Intel adopted the corrected versions in its ISA
          | as well.
         | 
         | Nevertheless, in this case AMD is wrong, without doubt.
         | 
         | Intel has published one year ago their version of "Linear
         | Address Masking", which might become available in Sapphire
         | Rapids.
         | 
         | There is nothing to criticize about Intel LAM. It keeps the
         | highest address bit with the same meaning of today, to
         | distinguish supervisor (kernel) pointers from user pointers.
         | 
          | Address masking can be enabled or disabled separately for
          | supervisor pointers and user pointers. Address masking has 2
          | variants, depending on whether you choose 48-bit linear
          | addresses (4-level page tables) or 57-bit linear addresses
          | (5-level page tables).
         | 
          | AMD has now published a different, incompatible method. It is
          | likely that they conceived it a few years ago, when the design
          | of Zen 4 started, which is why it is incompatible with
          | Intel's.
         | 
         | The fact that it is incompatible with Intel would not have been
         | a big deal, except that the AMD method is wrong, because they
         | also mask the MSB, breaking all the kernel code that tests
         | pointers to determine if they are user pointers or not.
         | 
          | Like everyone else, I cannot understand why AMD did not
          | specify this feature correctly, like Intel. It certainly is
          | neither rocket science nor brain surgery.
        
           | pjmlp wrote:
           | Oh, really bad then. So only SPARC and ARM will rescue us.
        
             | adrian_b wrote:
              | The 64-bit ARM ISA has allowed the use of the high byte as
              | a pointer tag since the beginning (Top Byte Ignore).
              | 
              | Now Intel and AMD are catching up with this feature.
              | Unfortunately they went separate ways.
             | 
             | This is a trivial feature, which does not have any
             | potential for competitive differentiation.
             | 
             | However, for the users, it is very important for it to be
             | standardized, to behave identically on all systems. It
             | would have been much better if Intel and AMD had discussed
             | this and they had decided on a common specification.
             | 
              | AMD is guiltier about this, because since the beginning of
              | Zen they have only very seldom provided any information in
              | advance about which features will be present in their next
              | generation.
             | 
              | While there are many other policies that are very bad at
              | Intel, at least they have remained the CPU company
              | providing the best documentation for their products, and
              | they have always announced ISA changes at least one year,
              | if not a few years, in advance. Therefore everybody has
              | known about the proposed Intel Linear Address Masking,
              | while nobody has known how AMD would do it.
             | 
             | The correction to the AMD address masking method is
             | absolutely trivial, just disable the masking gate for the
             | highest address bit, reducing the usable pointer tag from 7
             | bits to 6 bits, but maintaining compatibility with the
             | existing operating systems.
             | 
             | This small change can be made in 5 minutes by the AMD CPU
             | design team and it does not require any change in the
             | existing mask layout.
             | 
              | At least one year has passed since Intel published their
              | better version. Supposing that AMD implemented their
              | masking version in Zen 4 but kept it secret, I am pretty
              | certain that a year ago AMD did not yet have their final
              | mask set for Zen 4, and that there have been some
              | revisions since then. So there have certainly been
              | opportunities when this could have been corrected at
              | absolutely no cost to AMD.
             | 
             | It appears that nobody at AMD has realized that breaking
             | software compatibility with their address masking variant
             | is not good.
        
         | saagarjha wrote:
         | You're misunderstanding what this tag is used for: it's to
         | accelerate virtual machines, rather than for memory safety.
        
         | hansendc wrote:
         | By an "AMD push" do you mean Intel should post a superior
         | implementation a year before AMD does? ;)
         | 
         | https://lore.kernel.org/lkml/20210205151631.43511-1-kirill.s...
         | 
          | BTW, MPX was clearly not the right thing. Nobody _really_
          | wanted it. The world would be a better place if these address-
          | bit-ignoring things (ARM TBI, AMD UAI, Intel LAM) had been
          | implemented years ago instead of all the effort spent on MPX.
         | 
         | Believe me, I know. I put a lot of blood, sweat and tears into
         | MPX for Linux.
         | 
         | Disclaimer: If you didn't figure it out by now, I work on Linux
         | at Intel.
        
       | wongarsu wrote:
       | Instead of dealing with all the complexities of giving user
       | processes control over the CPU's pointer masking with all the
       | involved complexities of context switching and processes now
       | using bit 63 (which marks kernel addresses), why doesn't the
       | kernel just turn the feature on system wide when available,
       | reserve a couple bits for kernel usage (like say bit 63 to mark
       | kernel addresses) and provide a syscall that simply informs
       | processes which bits they can use for pointer tagging, if any.
       | 
       | Are there any compatibility concerns that make it necessary to
       | keep faulting on addresses outside the valid address range?
        
         | zozbot234 wrote:
         | > Are there any compatibility concerns that make it necessary
         | to keep faulting on addresses outside the valid address range?
         | 
         | As a sibling comment points out, forward compatibility is the
         | whole reason why this "faulting on non-canonical addresses" was
         | introduced. We used to have systems with 32-bit addresses and
          | "upper byte ignore", leaving 24 bits of actual address space.
         | 
         | Applications that took advantage of that "ignore" feature by
         | issuing non-normalized load or store operations broke on
         | "32-bit clean" versions of the same architectures, even when
         | limited to the same 24-bit address space. If they had stuck to
         | issuing loads and stores with canonicalized addresses and
         | treated "tagging" as a pure memory-saving thing, they could've
         | been supported.
        
           | irdc wrote:
            | Note that this was a problem on the classic Mac OS
            | (https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...,
            | upper bits of pointers were used as flags) and early ARM
            | processors
            | (https://en.wikipedia.org/wiki/26-bit_computing#Early_ARM_pro...,
            | combined program status register and program counter
            | resulting in an effectively 26-bit system). Neither of these
           | systems had an MMU, thus causing the breakage you mention. On
           | a modern system with an MMU, one could instead reduce the
           | size of a process' address space in exchange for tag bits.
        
         | ajross wrote:
         | Because that changes behavior. A userspace process that would
         | expect to fault currently on access to pointers with high bits
          | set would suddenly start touching different memory. I don't
         | have specific examples, but there have been implementations of
         | this kind of thing in the past that rely on traps like that to
         | detect equivalent pointer tags or alternate memory spaces.
        
       | JonChesterfield wrote:
       | Not what I was expecting from the article. I don't see how new
       | ISA features could be net positive here.
       | 
        | Pointer tagging works fine on x86 already: low bits from
        | alignment, and the high 16 bits provided one does the
        | masking/shifting to get back to canonical form before
        | dereferencing, or only the high 8 if paranoid about future
        | address space increases.
        
         | saagarjha wrote:
         | Are you familiar with what the ISA feature does? It lets code
          | skip the masking step, allowing it to dereference a non-
          | canonical tagged address as if it had been masked into canonical
         | form. Since applications that rely on tagging are typically
         | performance-sensitive, skipping that step can be a win for
         | them.
        
           | JonChesterfield wrote:
           | I see that benefit. For code compiled appropriately. The
           | article also thinks this will need to be per-process, so
           | context switching gets even more expensive.
           | 
           | Maybe the mask gets in the way of prefetching the pointer
           | destination, though the machine might speculate that the
           | pointer will be in canonical form. The arithmetic mask itself
           | is very close to free.
           | 
           | I guess someone thinks it's worth the silicon and software
           | dev necessary to deploy. If it makes my code faster then I'll
           | probably use it.
        
             | saagarjha wrote:
             | It's one less instruction, so it also reduces code size.
        
       | temac wrote:
       | I'm intrigued by:
       | 
       | > Turning on UAI would allow user space to create pointer values
       | that look like kernel addresses, but which would actually be
       | valid user-space pointers. Those pointers can, of course, be
       | passed into the kernel via system calls where, in the absence of
       | due care, they might be interpreted as kernel-space addresses.
       | The consequences of such confusion would not be good, and the
       | possibility of it happening is relatively high.
       | 
       | Userspace can already forge "pointers" with whatever it wants in
       | their bits. If the idea is to allow tagged pointers to userspace
       | memory to be passed to the kernel and from there back to
       | userspace, maybe just don't allow that?
       | 
       | I actually don't see why the kernel should accept tagged pointers
       | to userspace at all. And it seems this should _already_ be
       | checked everywhere; otherwise userspace could already make the
       | kernel access unintended kernel memory. I don't see how it would
       | change anything if some of those pointers became dereferenceable
       | from standard userspace under some configuration.
        
         | temac wrote:
         | Ok so I read the mail from Andy and understand the real problem
         | better: UAI is not context switched and is to be enabled system
         | wide. I don't know what AMD has been smoking.
        
           | saagarjha wrote:
           | Pretty sure this is how it works in ARM as well, except that
           | TBI can be configured per-exception level so it can be turned
           | off in the kernel.
        
             | pm215 wrote:
             | Yes. The thing which _is_ controlled per-process (per-
             | thread, really) is the extent to which you can pass a
             | tagged address to a kernel syscall and have it strip the
             | tag and operate on the  'real' address versus handing you a
             | 'bad address' error. The default is 'syscalls (mostly)
             | don't detag addresses'.
             | https://www.kernel.org/doc/html/latest/arm64/tagged-
             | address-... has the details.
             | 
             | Because TBI is separate for userspace and the kernel you
             | could make it per-process -- just context switch the TBI
             | control bit. But there's no need.
        
       | KMag wrote:
       | The historical longevity of hard-wired pointer tagged systems
       | isn't great.
       | 
       | Others have alluded to early m68k Macintoshes ignoring the top 8
       | bits of pointers, resulting in later Macintosh systems needing a
       | per-application compatibility mode to allow some applications to
       | use more than 16 MB of RAM and others to keep using pointer
       | tagging.
       | 
       | Sun's SPARC processors have dedicated tagged-arithmetic
       | instructions that assume 30-bit values with 2 tag bits. It
       | turned out basically no software ended up using those
       | instructions; even Sun's JVM JITs on SPARC avoided them, I
       | believe.
       | 
       | There are older Lisp machines, Burroughs machines, etc. IBM's AS/400
       | Technology Independent Machine Interface (TIMI) (AoT compiled
       | bytecode for userland executables, essentially) uses tagged
       | 128-bit pointers. These all seemed better designed for longevity,
       | but I'm less familiar with these details, and they're all
       | significantly less popular than m68k or SPARC.
       | 
       | It's a shame that none of these pointer tagging solutions are
       | implemented as faster/more compact encodings of common userspace
       | pointer tagging operations already emitted by JITs. For instance,
       | pointer tagging usually is used to inline the fast path where the
       | tagged type is what's expected, with a conditional branch for
       | handling all of the unexpected cases. So, a compact
       | representation of "IP-relative jump by W if the X most
       | significant bits of Y aren't Z" would reduce the code size (and
       | instruction cache/memory bandwidth used) for dynamically typed
       | language JITs. Similarly, it's common to have instructions for
       | bitwise-and with a sign-extended small immediate (32 bits, 13
       | bits, etc., depending on architecture). For these pointer tagging
       | operations, it's much more useful to instead use the immediate to
       | specify the most significant N bits of the bitwise-and, with all
       | of the low bits 1.
       | 
       | By not hard-wiring the number of bits used in tagging, the
       | features would seem to be more future-proof. By making the
       | instructions purely user-space, they wouldn't require any kernel
       | support and could be adopted faster. Adoption in JITs (the most
       | common current use case for tagged pointers) could be especially
       | fast, since JITs can perform a runtime check to avoid emitting
       | the new instructions on unsupported CPUs. Granted, the extent of
       | my CPU design experience was one undergrad class. I'm sure it's
       | much more complex to push this extra logic into the instruction
       | decoding and micro-op issuing logic instead of modifying the MMU.
       | However, hard-wired pointer tagging systems have typically had
       | shorter lifetimes than their designers expected.
       | 
       | On a related note, if your userspace programs are at the top of
       | your address space, you get NaN-boxed pointers "for free", since
       | any 64-bit pointer with the first 13 bits set is some form of NaN
       | when interpreted as an IEEE-754 double. What this means is that
       | the different NaN-tagging tradeoffs (Safari decided to make
       | object pointer use faster, Firefox chose to make Number use
       | faster) collapse: neither Numbers nor pointers need to be munged
       | (though one still needs to be careful about the two actual NaN
       | values the FPU might generate). Unfortunately for Linux, too
       | much kernel
       | code (and perhaps also too much userland code) at this point
       | assumes the kernel lives at the top of the address space for
       | Linux to change this.
       | 
       | On a side note, garbage collectors would benefit with a feature
       | allowing traps of writes to read-only memory to a userland
       | handler without a context switch. That way, the collector can
       | mark a few pages read-only, lazily copy objects and fix up
       | pointers as user code mutates objects, and finally fix up any
       | stragglers when the GC attempts to mark the unmarked objects
       | during a mark phase. Some garbage collectors already do this sort
       | of thing, but using SIGSEGV handlers that trap to the kernel and
       | call back. It would be nice to reduce the overhead of these.
       | Presumably, installing these lightweight handlers would require
       | kernel cooperation.
       | 
       | Historical tangent: early 16-bit x86 processors generated 20-bit
       | addresses by left-shifting the segment by 4 bits, adding the
       | address, and ignoring overflow. So, once machines started
       | supporting more than 1 MB of memory (not ignoring the overflow),
       | IBM needed a backward-compatibility mechanism. They had an unused
       | pin on their keyboard controller chip, so they routed the 21st
       | bit (address line 20, using zero-based numbering aka "a20")
       | through the keyboard controller. The OS had to explicitly tell
       | the keyboard controller to stop zeroing-out the "a20" line. So,
       | during your boot up process, your x86 CPU still has to emulate
       | the portion of the IBM keyboard controller that zeroes out the
       | a20 line until the mode-switch. Your 64-bit machine boots with
       | 20-bit addressing, switches to 21-bit addressing, switches to
       | 32-bit addressing, and finally switches to 64-bit addressing.
       | Given that the BIOS checks that the last two bytes of the boot
       | sector are 0x55 0xAA, it's a shame that when designing x86_64,
       | AMD didn't
       | have the processor start in 64-bit addressing with an instruction
       | to drop back down to 20-bit addressing for legacy systems. That
       | way, the BIOS could use the first two bytes of the boot sector to
       | determine if it needed to drop into legacy mode immediately
       | before executing the boot sector. It's not a lot of transistors
       | or complexity to emulate the accumulated cruft, but there was an
       | opportunity to set a roadmap to remove that cruft from the
       | hardware.
        
       ___________________________________________________________________
       (page generated 2022-03-31 23:01 UTC)