[HN Gopher] Pointer Tagging for x86 Systems
___________________________________________________________________
Pointer Tagging for x86 Systems
Author : rwmj
Score : 147 points
Date : 2022-03-31 09:47 UTC (13 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| devit wrote:
| Seems like, at least as an initial implementation, the kernel
| could disable the feature when entering kernel mode and then
| manually mask all accesses to user space addresses. This way
| mistakes will not impact security but only cause spurious system
| call failures.
| rwmj wrote:
| These kinds of virtual address related mode switches tend to be
| very slow, so, although I've not tested it, I'm going to guess
| this would be an absolute performance killer.
| erwincoumans wrote:
| In some cases (especially user land code), using array indexing
| can be better than using pointers. It allows re-using data
| between GPU and CPU, using fewer bits for indices, and
| simplifying serialization. FlatBuffers is a nice example:
| https://google.github.io/flatbuffers/
| rayiner wrote:
| The issue addressed in the article with pointer tagging applies
| whether you use array indexing or pointers. Both are converted
| to pointer operations under the hood, which on X86 are some
| form of ADDR = BASE + SCALE*INDEX + OFFSET. Such an operation
| is still considered a pointer dereference (of BASE) from the
| view of the hardware, and will trap if tags are stored in the
| unused upper bits of the pointer.
|
| Depending on the size of the array elements, you won't even be
| able to use the built-in indexed addressing mode, because the
| scale factor can only be 1, 2, 4, or 8. For that reason, LLVM doesn't
| even have array indexing. Loads/stores can only be done against
| pointers, and pointers to items in arrays must be expressly
| computed using a get-element-pointer instruction:
| https://llvm.org/docs/GetElementPtr.html.
| erwincoumans wrote:
| Thanks for the insightful reply. I didn't realize array
| indexing gets converted to pointers in the CPU. I expected
| the array indexing assembly instructions to be plumbed all
| the way down without converting to pointers until the very
| end.
|
| LLVM not supporting array indexing means all those
| instructions go unused, on both x86 and arm? How about gcc or
| other compilers?
| clord wrote:
| The compiler can convert small known-size known-offset
| array accesses into registers or direct static stores and
| loads, but array indexing is usually on large chunks
| through a dynamic index via a ref or pointer, which will be
| via pointer math.
| rayiner wrote:
| They don't go unused. Typically the separate address
| calculation and load will get merged back together during
| instruction selection if there is an applicable
| instruction. (Although that may not happen either--the
| compiler might transform the code to get rid of the array
| indexing. Keeping around the base pointer takes an
| additional register.)
|
| But on x86 and ARM these instructions are limited. On x86, a
| load instruction's addressing mode can treat one register as
| the array base pointer and another register as the index. But
| the scale applied to the index is limited to 8, so if your
| array elements are, say, 16-byte structs,
| you can't use those instructions. As to ARM, I think it
| doesn't have an array indexing instruction at all, just
| base + offset.
| saagarjha wrote:
| This is less about "array indexing" and more about arrays
| being located at a specific memory address. At some point
| you're going to have a buffer at address 0x1000 and want to
| access the 42nd element, and you can't just use a tagged
| address (e.g. 0x1000000001000) and 42 together without
| stripping the tag somewhere.
| uvdn7 wrote:
| And not to mention better CPU cacheline hits!
| KMag wrote:
| I used to sit next to a guy who previously professionally
| wrote compilers for both DEC Alpha and Itanium
| (http://robert.muth.org/), who mentioned that programs often
| ran faster on Alpha (which only supported 64-bit addressing)
| when modified to use 32-bit array indexes instead of
| pointers, due to reduced memory bandwidth/cache usage. Of
| course, one had to first determine if one needs to use any
| arrays larger than 2^32 elements.
| mmastrac wrote:
| Linux used to have the x32 ABI for similar improvements:
| https://en.wikipedia.org/wiki/X32_ABI
| skybrian wrote:
| It seems like this isn't all that useful for JavaScript or other
| dynamic languages where what you really want is NaN-boxing? But I
| think I read somewhere that there are special instructions for
| this.
| saagarjha wrote:
| NaN-boxing is just one way of doing pointer tagging.
| kvakvs wrote:
| Virtual machines use pointer tagging in the least significant
| bits: since most data is 4- or 8-byte aligned, you can assume
| that zeroing those bits will always give you a safe, correct
| pointer value. The tag bits can tell the code that the value is
| not a pointer but carries more data in its other bits, or that
| the value is a structure on the heap needing special treatment.
| khuey wrote:
| Depends on the virtual machine. LuaJIT and Spidermonkey (and
| probably others that I don't know about) use NaN-boxing which
| ends up storing the metadata in the higher bits.
| ChuckMcM wrote:
| The UAI feature is nice! Back when I was noodling on creative
| ways to use 64 bit addresses the idea of a 16 bit 'space id' was
| bandied about. A space ID in the upper 16 bits would indicate the
| address space for a particular thing, be it a process, the
| kernel, a shared library, etc. And a new opcode to trampoline
| from one space to another for doing things like system calls or
| library calls, could provide a gateway for doing secure code
| analysis (basically no "side" entrances into a space) and it
| would be a much more robust pointer testing mechanism because any
| read or write with a pointer "outside" the space would be
| immediately flagged as an error. The use of segments was sort of
| like this on the 286.
|
| Ideally you'd have a fine grained page map so that you could
| avoid copies from one space to another (instead of
| copy_in()/copy_out() you would do map_in()/map_out()) to get
| access to physical memory, while preserving the semantic that
| when something was mapped out of your space you couldn't affect
| it. (Zero-copy network stacks have been known to have bugs
| where a process left holding a pointer to a buffer modified it
| after it no longer "owned" the buffer.)
|
| Could be a fun project for a RISC-V design. Not sure how many
| people are experimenting with various MMU systems but it is at
| least _possible_ to do these kinds of things in nominally "real"
| hardware (it has always been possible to simulate of course).
| nraynaud wrote:
| I can imagine the gold rush on those high order bits!
| bogomipz wrote:
| >"There are numerous use cases for stashing metadata into those
| unused bits. Memory allocators could use that space to track
| different memory pools, for example, or for garbage collection.
| Database management systems have their own uses for that space."
|
| Can someone say which databases do this and what they're making
| use of those unused bits in the address for?
| ayende wrote:
| Marking a page pointer as modified / accessed for example
| celeritascelery wrote:
| What would make this extra useful is if they could add an
| instruction that reads the tag portion of the pointer (instead
| of needing to do a shift). The ucode could optimize this away to
| nothing more than a special case of load.
| adrian_b wrote:
| The tag is in the high byte (always for AMD and also for Intel
| in the 57-bit mode).
|
| Instead of a shift and a mask, on Intel/AMD CPUs you can use
| the instruction BSWAP (to move the highest byte into the lowest
| byte) and then you may access the register as a byte (even if
| usually it does not matter what size you use to access the
| register, because in most cases you just test individual tag
| bits).
|
| Moreover, on any Atom CPUs or any other CPUs newer than Haswell
| (2013), you may skip the BSWAP by loading the pointer with a
| MOVBE instruction, which does the byte swapping during the
| load. This use of BSWAP/MOVBE is identical to the handling of
| big-endian (a.k.a. network byte order) data, like in the
| headers of network data packets. Because of that, they are
| already available in C/C++ in macros that depend on the
| operating system, e.g. __be64_to_cpu(x) for Linux, which can be
| used in C/C++ to access the pointer tag.
|
| No new instruction could be more compact than the old BSWAP or
| MOVBE instructions, so there is absolutely no need for a new
| special instruction.
|
| For Intel in 48-bit address mode, the same BSWAP or MOVBE will
| bring the tag into the 16-bit part of a register. You just need
| to take care that the 2 bytes of the tag are swapped. This does
| not matter, because you should define the structure that
| corresponds to the tag to match the layout that results after
| loading the byte-swapped pointer.
| netfl0 wrote:
| https://d3fend.mitre.org/technique/d3f:PointerAuthentication
|
| We are building a KB of these sorts of things.
| pjmlp wrote:
| I failed to find any reference to SPARC ADI, the longest-standing
| solution to this problem:
|
| https://docs.oracle.com/cd/E53394_01/html/E54815/gqajs.html
|
| https://www.kernel.org/doc/html/latest/sparc/adi.html
| netfl0 wrote:
| Thank you, it sounds like you have a lot of experience in
| this domain, if you'd like to contribute we'd welcome more of
| your perspective.
|
| https://github.com/d3fend/d3fend-ontology
|
| Otherwise we'll get this reference added.
| londons_explore wrote:
| Years ago, someone said "32 bit addresses! Thats huge! Lets use
| the top few bits for other stuff, like gate A20 and select lines
| of hardware".
|
| That came to bite people in the form of the "3GB hole", the "PCI
| hole", etc. And those hacks were painful for lots of people.
|
| I feel that by reusing address bits for other purposes in a non-
| flexible way like this, we're just repeating the mistakes of
| history. After all, there are 44 zettabytes of data in the world
| (in 2020), and addressing that is already beyond what a 64 bit
| number can do!
|
| And one day, we'll have that much storage in your hand - your
| phone has about 2^80 atoms in it, so storing 2^64 bytes in there
| is totally physically possible.
| scottlamb wrote:
| > I feel that by reusing address bits for other purposes in a
| non-flexible way like this, we're just repeating the mistakes
| of history.
|
| Those old schemes were permanent machine-wide assumptions on
| physical addresses. They were essentially saying "32 - N bits
| is enough for anyone" (on this hardware design).
|
| This is on virtual addresses, and the kernel developers want
| (and it sounds like the Intel feature allows) this to be
| configured per-process. I think it's totally reasonable for
| some process to say at runtime "56 bits is enough for me". In
| fact, it's common for Java code to run with "Compressed Oops"
| (32-bit pointers to 8-byte-aligned addresses, for a total of 32
| GiB of addressable heap). This happens automatically when the
| configured heap size is sufficiently small.
|
| > After all, there are 44 zettabytes of data in the world (in
| 2020), and addressing that is already beyond what a 64 bit
| number can do!
|
| I don't think it makes sense for one process to be able to mmap
| all the data in the world. The major page fault latency would
| be nasty!
|
| > And one day, we'll have that much storage in your hand - your
| phone has about 2^80 atoms in it, so storing 2^64 bytes in
| there is totally physically possible.
|
| That day is pretty far off I think, but even when it happens
| I'm not sure it makes sense for all virtual pointers to be
| >=64-bit. A couple reasons it might not:
|
| * Most programs (depending somewhat on programming language)
| use a lot of their memory for pointers, so it seems wasteful.
|
| * I assume this hypothetical future device will still have
| slower and faster bytes to access. It may still make more sense
| to have different APIs for the fast stuff and the slow stuff.
| mmap's limitation of stalling the thread while something is
| paged in is a real problem today, and I don't know if that
| would get any better. Likewise error handling.
| hyperman1 wrote:
| The A20 gate story was something different though, it abused
| the numeric overflow of 16 bit addresses, not some unused bits
| in the middle of an address.
|
| An x86 address had a 16-bit segment and a 16-bit offset, with the
| address being calculated as 16*segment+offset, truncated to 20
| bits. Segment F000 was in ROM and 0000 was in RAM, with the low
| RAM addresses used for different BIOS functionalities.
|
| If you use a segment like 0xFF00, then offsets 0 to 0x0FFF
| correspond to linear addresses 0xFF000 to 0xFFFFF, in ROM.
| Offsets 0x1000 to 0xFFFF correspond to 0x0000 to 0xEFFF, in RAM.
| This trick meant you could set the segment register only once,
| and use the offset both to read tables in ROM and to use
| variables in RAM. Of course contemporary BIOS and other software
| used this trick to shave off a few instructions.
|
| You know what happens next. The 80286 has 24 address lines, not
| 20, so the addresses just above the existing 0xFFFFF = 1MB
| became valid instead of wrapping around. In the name of
| backward compatibility, someone found an unused pin on the
| keyboard controller and attached an AND gate to it and to A20.
| By talking to the keyboard controller, you could choose between
| 16 MB of RAM and the old wraparound behavior. The keyboard
| controller was of course dog slow, and you have to set and
| reset the A20 gate the whole time to switch between not
| triggering BIOS/DOS/TSR bugs and usable upper RAM. This hack is
| AFAIK still there today in every x86 based PC, even if the gate
| stays enabled all the time.
| ajross wrote:
| > The A20 gate story was something different though, it
| abused the numeric overflow of 16 bit addresses, not some
| unused bits in the middle of an address.
|
| It wasn't even an abuse. It was a genuine attempt by the
| board manufacturer (IBM) to address a real backwards
| compatibility problem with a new CPU part. Obviously as
| memories grew far beyond 1M and real mode became a part of
| history (well, and of bootstrap code) it was a giant wart.
| But it solved a real problem.
| colejohnson66 wrote:
| The whole reason x86-64 required "canonical"[a] addresses in
| the first place was to prevent people using them for this
| purpose. Sure, this proposal allows applications to know how
| many bits they can work with, but what happens when an
| application developer writes `assert(freeBits >= 7)` when only
| 6 are available in the future?
|
| [a] a "canonical" address is one where bits 63:MaxPhysAddr are
| identical. So on a processor supporting 48 address lines, bits
| 63:48 must be identical (either all clear or all set).
| Attempting reads or writes with non canonical addresses raises
| a #GP fault.
| zozbot234 wrote:
| > Sure, this proposal allows applications to know how many
| bits they can work with
|
| That is entirely orthogonal to this "upper bits ignore"
| feature. Ideally, a process would be able to set _any_
| reasonable number of upper bits as "reserved for tagging",
| and the system allocator and kernel would then simply not
| require it to work with user virtual addresses that involve
| those upper bits. But upper bits can't simply be "ignored" if
| an app is to be forward-compatible; they still need to be
| canonicalized whenever external code is called.
| saagarjha wrote:
| Then the application does not work on a future processor,
| that is correct. The person who wrote it, who likely is a JIT
| engineer, will get a bug report asking for this to be fixed,
| although most likely they're already aware of the upcoming
| hardware change and have prepared a different tagging scheme
| already.
| pjmlp wrote:
| Hopefully by then the languages that are the reason pointer
| tagging exists in the first place won't matter as much as they
| still do today.
| cmrdporcupine wrote:
| Given its longevity and installed base, I see no reasonable
| possibility of C/C++ going away or being significantly
| curtailed in the next half century.
|
| C is 50 years old this year. And it's _everywhere_
|
| It's more likely we'll be (or my kids will be) involved in
| harm reduction around these systems, rather than outright
| replacement.
| praptak wrote:
| Who knows, maybe we'll have a cheap (semi-)automatic
| rewrite from C to a memory-safe language by then.
| londons_explore wrote:
| Look to other industries. There are plenty of standard
| things that are suboptimal, but still used because they're
| so standard. For example, the English language has plenty
| of inefficiencies and inconsistencies that date back over
| 1000 years. But to fix them would be too much work for too
| little gain.
| pjmlp wrote:
| Cobol and Fortran are still around, they even support
| modules, some form of generics and OOP, yet you don't see
| many people jumping for joy writing new libraries in them.
| cmrdporcupine wrote:
| Whereas people still jump for joy at writing new
| libraries in C++ and it's coming up on 40 years old.
|
| C++11 saved the language from dying off. Somehow it's my
| favourite language to work in (though mainly because
| nobody is paying me to work in Rust)
| pjmlp wrote:
| We are talking mostly about C here, but in any case, I
| don't jump for joy when looking at code that depends on
| SFINAE, ADL and CTAD.
| rwmj wrote:
| I don't think C is going away any time soon, and even if it
| did, C reflects a common model of how hardware works that is
| shared by plenty of other higher level languages. Also GCs
| use pointer tagging.
| pjmlp wrote:
| The irony of C Machines.
| karatinversion wrote:
| Pointers are a core part of the CPU memory model on both ARM
| and x64, so they aren't going away anytime soon; there's no
| reason compilers, JITs or runtimes couldn't use pointer
| tagging, just the same as C programmers.
| pjmlp wrote:
| True, but the way they are exposed to the programmers
| matter.
|
| Languages like Modula-2 or Ada, to use two C like
| languages, don't suffer from pointer abuse as much as C,
| because they provide better abstractions, thus raw pointer
| tricks aren't as dangerous to the whole system stability.
|
| While JITs and runtimes could make use of tagging as well, the
| Lisp and Ada machines kind of proved the point that with
| memory-safe runtimes, special CPU extensions aren't as useful
| as general-purpose CPUs, which is why the industry moved away
| from such solutions.
|
| So there is a certain irony that we are now going back into
| C Machines as the ultimate mitigation.
| saagarjha wrote:
| Lisp and Ada machines happened to be very slow in
| practice. Turns out people prefer CPU extensions to make
| their C code fast rather than compiling it to a more
| exotic architecture.
| pjmlp wrote:
| I missed the point, Lisp and Ada machines were never
| designed to run C code, other than a curiosity.
| saagarjha wrote:
| Well, the two points I was making were 1. people want to
| run C code anyways, so that's not really an option for
| general-purpose use and 2. Lisp and Ada machines ended up
| being slow for Lisp and Ada, which makes them pretty
| unattractive.
| saagarjha wrote:
| You mean most languages that run on virtual machines today,
| right? Java, JavaScript, C#, Lua, ...
| pjmlp wrote:
| LLVM, WASM,....
| tawaypol wrote:
| LLVM is as much a virtual machine as the "C Virtual
| Machine" from the spec is. To my knowledge there are no
| LLVM IR interpreters.
| amelius wrote:
| I suppose this could be useful for (mark-and-sweep) garbage
| collectors.
| fsfod wrote:
| I know at least Java's ZGC uses the top bits as metadata for a
| load barrier to check for relocated objects. It fakes hardware
| pointer masking by mapping the same heap at multiple addresses:
| https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/
| pjmlp wrote:
| Intel's history has quite a few failed attempts at memory
| tagging (iAPX 432, i960, MPX); maybe it needs an AMD push again
| to make it work.
| adrian_b wrote:
| There are several cases when Intel had botched the definition
| of some instructions and later AMD had to redefine them
| correctly, and eventually Intel also adopted the corrected
| versions in their ISA.
|
| Nevertheless, in this case AMD is wrong, without doubt.
|
| Intel has published one year ago their version of "Linear
| Address Masking", which might become available in Sapphire
| Rapids.
|
| There is nothing to criticize about Intel LAM. It keeps the
| highest address bit with the same meaning of today, to
| distinguish supervisor (kernel) pointers from user pointers.
|
| Address masking can be enabled or disabled separately for
| supervisor pointers and user pointers. Address masking has 2
| variants, depending on whether you choose 48-bit linear
| addresses (4-level page tables) or 57-bit linear addresses
| (5-level page tables).
|
| AMD has published now a different incompatible method. It is
| likely that they have conceived it a few years ago, when the
| design of Zen 4 had started, which is why it is incompatible
| with Intel.
|
| The fact that it is incompatible with Intel would not have been
| a big deal, except that the AMD method is wrong, because they
| also mask the MSB, breaking all the kernel code that tests
| pointers to determine if they are user pointers or not.
|
| Like everyone else, I cannot understand why AMD did not specify
| this feature correctly, like Intel did. It certainly is neither
| rocket science nor brain surgery.
| pjmlp wrote:
| Oh, really bad then. So only SPARC and ARM will rescue us.
| adrian_b wrote:
| The 64-bit ARM ISA has allowed since the beginning the use
| of the high byte as a pointer tag.
|
| Now Intel and AMD are catching up with this feature.
| Unfortunately they went separate ways.
|
| This is a trivial feature, which does not have any
| potential for competitive differentiation.
|
| However, for the users, it is very important for it to be
| standardized, to behave identically on all systems. It
| would have been much better if Intel and AMD had discussed
| this and they had decided on a common specification.
|
| AMD is more at fault here, because since the beginning of Zen
| they have only very seldom provided any advance information
| about what features will be present in their next generation.
|
| While there are many other policies that are very bad at Intel,
| at least they have remained the CPU company providing the best
| documentation for their products, and they have always
| announced ISA changes at least a year, if not a few years, in
| advance. Therefore everybody has known about the proposed Intel
| Linear Address Masking, while nobody has known how AMD would do
| it.
|
| The correction to the AMD address masking method is
| absolutely trivial, just disable the masking gate for the
| highest address bit, reducing the usable pointer tag from 7
| bits to 6 bits, but maintaining compatibility with the
| existing operating systems.
|
| This small change can be made in 5 minutes by the AMD CPU
| design team and it does not require any change in the
| existing mask layout.
|
| At least one year has passed since Intel published their
| better version. Supposing that AMD has implemented their
| masking version in Zen 4, but they have kept this secret, I
| am pretty certain that 1 year ago AMD did not have their
| final mask set for Zen 4 and there have been some revisions
| since then. So there have certainly been opportunities when
| this could have been corrected without absolutely any cost
| for AMD.
|
| It appears that nobody at AMD has realized that breaking
| software compatibility with their address masking variant
| is not good.
| saagarjha wrote:
| You're misunderstanding what this tag is used for: it's to
| accelerate virtual machines, rather than for memory safety.
| hansendc wrote:
| By an "AMD push" do you mean Intel should post a superior
| implementation a year before AMD does? ;)
|
| https://lore.kernel.org/lkml/20210205151631.43511-1-kirill.s...
|
| BTW, MPX was clearly not the right thing. Nobody _really_
| wanted it. The world would be a better place if these address-
| bit-ignoring things (ARM TBI, AMD UAI, Intel LAM) had been
| implemented years ago instead of all the effort spent on MPX.
|
| Believe me, I know. I put a lot of blood, sweat and tears into
| MPX for Linux.
|
| Disclaimer: If you didn't figure it out by now, I work on Linux
| at Intel.
| wongarsu wrote:
| Instead of dealing with all the complexities of giving user
| processes control over the CPU's pointer masking (context
| switching, processes now using bit 63, which marks kernel
| addresses), why doesn't the kernel just turn the feature on
| system-wide when available, reserve a couple of bits for kernel
| usage (say, bit 63 to mark kernel addresses), and provide a
| syscall that simply informs processes which bits they can use
| for pointer tagging, if any?
|
| Are there any compatibility concerns that make it necessary to
| keep faulting on addresses outside the valid address range?
| zozbot234 wrote:
| > Are there any compatibility concerns that make it necessary
| to keep faulting on addresses outside the valid address range?
|
| As a sibling comment points out, forward compatibility is the
| whole reason why this "faulting on non-canonical addresses" was
| introduced. We used to have systems with 32-bit addresses and
| "upper byte ignore", leaving 24-bits of actual address space.
|
| Applications that took advantage of that "ignore" feature by
| issuing non-normalized load or store operations broke on
| "32-bit clean" versions of the same architectures, even when
| limited to the same 24-bit address space. If they had stuck to
| issuing loads and stores with canonicalized addresses and
| treated "tagging" as a pure memory-saving thing, they could've
| been supported.
| irdc wrote:
| Note that this was a problem on the classic Mac OS
| (https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...,
| upper bits of pointers were used as flags) and early ARM
| processors
| (https://en.wikipedia.org/wiki/26-bit_computing#Early_ARM_pro...,
| the combined program status register and program counter
| resulting in an effectively 26-bit system). Neither of these
| systems had an MMU, thus causing the breakage you mention. On
| a modern system with an MMU, one could instead reduce the
| size of a process' address space in exchange for tag bits.
| ajross wrote:
| Because that changes behavior. A userspace process that would
| expect to fault currently on access to pointers with high bits
| set would suddenly start touching different memory. I don't
| have specific examples, but there have been implementations of
| this kind of thing in the past that rely on traps like that to
| detect equivalent pointer tags or alternate memory spaces.
| JonChesterfield wrote:
| Not what I was expecting from the article. I don't see how new
| ISA features could be net positive here.
|
| Pointer tagging works fine on x86 already: low bits from
| alignment, and the high 16 bits provided one does the
| masking/shifting to get back to canonical form before
| dereferencing. High 8 if paranoid about future address space
| increases.
| saagarjha wrote:
| Are you familiar with what the ISA feature does? It lets code
| skip the masking step, allowing them to dereference a non-
| canonical tagged address as if it was masked into canonical
| form. Since applications that rely on tagging are typically
| performance-sensitive, skipping that step can be a win for
| them.
| JonChesterfield wrote:
| I see that benefit. For code compiled appropriately. The
| article also thinks this will need to be per-process, so
| context switching gets even more expensive.
|
| Maybe the mask gets in the way of prefetching the pointer
| destination, though the machine might speculate that the
| pointer will be in canonical form. The arithmetic mask itself
| is very close to free.
|
| I guess someone thinks it's worth the silicon and software
| dev necessary to deploy. If it makes my code faster then I'll
| probably use it.
| saagarjha wrote:
| It's one less instruction, so it also reduces code size.
| bitcharmer wrote:
| temac wrote:
| I'm intrigued by:
|
| > Turning on UAI would allow user space to create pointer values
| that look like kernel addresses, but which would actually be
| valid user-space pointers. Those pointers can, of course, be
| passed into the kernel via system calls where, in the absence of
| due care, they might be interpreted as kernel-space addresses.
| The consequences of such confusion would not be good, and the
| possibility of it happening is relatively high.
|
| Userspace can already forge "pointers" with whatever it wants in
| their bits. If the idea is to allow tagged pointers to userspace
| memory to be passed into the kernel and from there back to
| userspace, maybe just don't allow that?
|
| I actually don't see why the kernel should accept tagged
| pointers to userspace at all. And it seems it should _already_
| be checked everywhere, otherwise userspace could already make
| the kernel access unintended kernel memory. I don't see how, if
| some of those pointers became dereferenceable under some
| configuration from standard userspace, it would change anything.
| temac wrote:
| Ok so I read the mail from Andy and understand the real problem
| better: UAI is not context switched and is to be enabled system
| wide. I don't know what AMD has been smoking.
| saagarjha wrote:
| Pretty sure this is how it works in ARM as well, except that
| TBI can be configured per-exception level so it can be turned
| off in the kernel.
| pm215 wrote:
| Yes. The thing which _is_ controlled per-process (per-
| thread, really) is the extent to which you can pass a
| tagged address to a kernel syscall and have it strip the
| tag and operate on the 'real' address versus handing you a
| 'bad address' error. The default is 'syscalls (mostly)
| don't detag addresses'.
| https://www.kernel.org/doc/html/latest/arm64/tagged-
| address-... has the details.
|
| Because TBI is separate for userspace and the kernel you
| could make it per-process -- just context switch the TBI
| control bit. But there's no need.
| KMag wrote:
| The historical longevity of hard-wired pointer tagged systems
| isn't great.
|
| Others have alluded to early m68k Macintoshes ignoring the top 8
| bits of pointers, resulting in later Macintosh systems needing
| per-application compatibility mode to allow some applications to
| use more than 16 MB of RAM and others to keep using pointer
| tagging.
|
| Sun's SPARC processors have dedicated instructions that have
| 30-bit addressing and 2 tag bits. It turned out basically no
| software ended up using those instructions, even Sun's JVM JITs
| on SPARC avoided these instructions, I believe.
|
| There are older Lisp machines, Burroughs machines, etc. IBM's AS/400
| Technology Independent Machine Interface (TIMI) (AoT compiled
| bytecode for userland executables, essentially) uses tagged
| 128-bit pointers. These all seemed better designed for longevity,
| but I'm less familiar with these details, and they're all
| significantly less popular than m68k or SPARC.
|
| It's a shame that none of these pointer tagging solutions are
| implemented as faster/more compact encodings of common userspace
| pointer tagging operations already emitted by JITs. For instance,
| pointer tagging usually is used to inline the fast path where the
| tagged type is what's expected, with a conditional branch for
| handling all of the unexpected cases. So, a compact
| representation of "IP-relative jump by W if the X most
| significant bits of Y aren't Z" would reduce the code size (and
| instruction cache/memory bandwidth used) for dynamically typed
| language JITs. Similarly, it's common to have instructions for
| bitwise-and with a sign-extended small immediate (32 bits, 13
| bits, etc., depending on architecture). For these pointer tagging
| operations, it's much more useful to instead use the immediate to
| specify the most significant N bits of the and-mask, with all of
| the low bits set to 1.
|
| By not hard-wiring the number of bits used in tagging, the
| features would seem to be more future-proof. By making the
| instructions purely user-space, they wouldn't require any kernel
| support and could be adopted faster. Adoption in JITs (the most
| common current use case for tagged pointers) could be especially
| fast, since JITs can perform a runtime check to avoid emitting
| the new instructions on unsupported CPUs. Granted, the extent of
| my CPU design experience was one undergrad class. I'm sure it's
| much more complex to push this extra logic into the instruction
| decoding and micro-op issuing logic instead of modifying the MMU.
| However, hard-wired pointer tagging systems have typically had
| shorter lifetimes than their designers expected.
|
| On a related note, if your userspace programs are at the top of
| your address space, you get NaN-boxed pointers "for free", since
| any 64-bit pointer with the top 13 bits set is some form of NaN
| when interpreted as an IEEE-754 double. What this means is that
| the different NaN-tagging tradeoffs (Safari decided to make
| object pointer use faster, Firefox chose to make Number use
| faster) collapse. Neither Numbers nor pointers need to be munged
| (though one needs to be careful about the two actual NaN values
| the FPU might generate). Unfortunately for Linux, too much kernel
| code (and perhaps also too much userland code) at this point
| assumes the kernel lives at the top of the address space for
| Linux to change this.
|
| On a side note, garbage collectors would benefit from a feature
| allowing traps of writes to read-only memory to a userland
| handler without a context switch. That way, the collector can
| mark a few pages read-only, lazily copy objects and fix up
| pointers as user code mutates objects, and finally fix up any
| stragglers when the GC attempts to mark the unmarked objects
| during a mark phase. Some garbage collectors already do this sort
| of thing, but using SIGSEGV handlers that trap into the kernel and
| call back. It would be nice to reduce the overhead of these.
| Presumably, installing these lightweight handlers would require
| kernel cooperation.
|
| Historical tangent: early 16-bit x86 processors generated 20-bit
| addresses by left-shifting the segment by 4 bits, adding the
| offset, and ignoring any overflow. So, once machines started
| supporting more than 1 MB of memory (not ignoring the overflow),
| IBM needed a backward-compatibility mechanism. They had an unused
| pin on their keyboard controller chip, so they routed the 21st
| bit (address line 20, using zero-based numbering aka "a20")
| through the keyboard controller. The OS had to explicitly tell
| the keyboard controller to stop zeroing-out the "a20" line. So,
| during your boot up process, your x86 CPU still has to emulate
| the portion of the IBM keyboard controller that zeroes out the
| a20 line until the mode-switch. Your 64-bit machine boots with
| 20-bit addressing, switches to 21-bit addressing, switches to
| 32-bit addressing, and finally switches to 64-bit addressing.
| Given that the BIOS checks that the last two bytes of the boot
| sector are 0x55 0xAA, it's a shame that when designing x86_64,
| AMD didn't have the processor start in 64-bit addressing with an
| instruction to drop back down to 20-bit addressing for legacy
| systems. That way, the BIOS could use those bytes of the boot
| sector to determine whether it needed to drop into legacy mode
| immediately before executing the boot sector. It's not a lot of
| transistors
| or complexity to emulate the accumulated cruft, but there was an
| opportunity to set a roadmap to remove that cruft from the
| hardware.
___________________________________________________________________
(page generated 2022-03-31 23:01 UTC)