[HN Gopher] Ointers: A library for representing pointers where b...
___________________________________________________________________
Ointers: A library for representing pointers where bits have been
stolen (2021)
Author : fanf2
Score : 113 points
Date : 2024-05-08 10:42 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| couchand wrote:
| Am I misreading the bitmask code? It looks like (in addition to a
| few other ideas) it's using the old "stick a few extra bits in an
| aligned pointer", but it seems to be only manipulating high bits,
| whereas aligned pointer zeroes are low-order bits.
|
| I'd suggest a heavier investment in testing infrastructure.
| pixelesque wrote:
| It looks like it can use both traditional "tagged pointer"
| alignment bits AND virtual address bits...
| couchand wrote:
| I see that in the readme but I don't see where that's handled
| in the bitmasking. It appears to be universally masking high-
| order bits here: https://github.com/irrustible/ointers/blob/1
| 961f75bbb9818d72...
| formerly_proven wrote:
| Read the function above the function you linked to.
| couchand wrote:
| Oh, that's frightening: they shift the entire pointer by
| the alignment.
| db48x wrote:
| Why? Alignment causes some low-order bits to be zero.
| Shifting the pointer to the right drops those zeros on
| the floor, leaving high-order bits zero (or sign extended
| or whatever). Then you can put your tag up there instead.
| Shifting the value left by the same amount drops the tag
| and recovers the same pointer for use.
| vlovich123 wrote:
| Any idea why it's preferred to do the alignment route
| instead of storing the bits in the upper part of the
| pointer & masking?
| db48x wrote:
| Personal preference, architectural differences, phase of
| the moon, etc, etc.
| couchand wrote:
| My guideline with pointer tricks is: if you're not just
| masking bits it's too complicated for real use.
| Sharlin wrote:
| The stored representation is packed such that all the stealable
| bits are contiguous. To get the original pointer value it's
| unpacked first.
| masklinn wrote:
| 64 bit architectures don't actually have 64 bit address spaces,
| both AMD64 and ARM64 have 48 bit address spaces by default
| (some CPUs have extensions you can enable to request larger
| address spaces e.g. LVA on ARM64 and 5LP / LA57 on AMD64 but
| that's opt-in on a per-process basis).
|
| So while you have 3 bits available at the bottom of the
| pointer, there are 16 at the top. That's a lot more payload you
| can smuggle. There are even CPU extensions which tell the
| processor to ignore some of that (Linear Address Masking for
| Intel, Upper Address Ignore for AMD, and Top Byte Ignore for
| ARM).
| vlovich123 wrote:
| Small correction. That's true for 4-level paging where
| virtual memory is capped at 256 TiB (LAM48). There's a
| 5-level extension that reduces the wasted space to 7 bits
| (LAM57) to allow for 128 PiB of virtual space.
|
| I have no idea what the purpose of the extension is, since
| I don't believe you can attach that much secondary storage
| to a CPU unless you go to tape, which is pointless for
| virtual mapping.
| masklinn wrote:
| ...
|
| I literally mentioned five-level paging in my comment.
|
| But it's an opt-in extension, you need the kernel to
| support it, and then you need to ask the kernel to enable
| it on your behalf. So it's not much of an issue; you just
| have to not use this sort of extension (it's not the only
| one, ARM64 also has a similar one though it only goes up
| to 52-bit address spaces) if you rely on 16~19 bits of
| tagging.
| LegionMammal978 wrote:
| You do need to opt-in when compiling the kernel, but on
| Linux it doesn't take anything particularly special on
| the program's side to enable it. The rule is that non-
| fixed mmap() will only ever yield 48-bit addresses, until
| the program specifies a 57-bit address as a hint, after
| which it can yield 57-bit addresses at will. (This rule
| has lots of curious consequences for the location of the
| vDSO, when 5-level paging is enabled and an ELF program
| uses big addresses for its segments.)
| zgs wrote:
| Every time I've seen something similar done in the past, it
| has come back to hurt the instigator.
| pixelesque wrote:
| I've used tagged pointers with no issues at all over the years
| to "smuggle" / store values in...
| speed_spread wrote:
| If you build your whole app with it, yeah. If you use it
| tactically for a certain data type it can be very nice. Just
| don't try to dissimulate its specialness.
| db48x wrote:
| Every lisp system that ever existed uses this technique, and it
| never hurt them any. Emacs or SBCL or Open Genera; they all
| work perfectly well.
| _flux wrote:
| OCaml uses the lowest bit, which is nice because (aligned)
| pointers work as-is.
| lmm wrote:
| > Every lisp system that ever existed uses this technique,
| and it never hurt them any.
|
| Wasn't it part of the reason they ended up with poor
| "mechanical sympathy" on regular PCs, and got a bad
| performance reputation as a result?
| db48x wrote:
| No. Their reputation for poor performance mostly arises
| because the default data structure is a singly-linked list.
| Although the language makes these lists extremely
| convenient for the programmer to use, linked lists
| invariably result in terrible cache locality. High-
| performance systems that are written in lisp (and they do
| exist) achieve that performance by avoiding lists and
| using vectors and arenas instead.
|
| Of course, it must be said that this performance penalty
| rarely matters in practice. For most programs, the limiting
| factor is not performance but the cost of implementation
| and maintenance. Lisp has many powerful and convenient
| features that make implementation faster and surer than in
| any other language, and which usually greatly lower the
| cost of maintenance as well. Thus by writing a new program
| in Lisp you can get it off the ground and earning you money
| faster than you could in almost any other language. You can
| also add features more cheaply as time goes on, allowing
| you to keep up with your competitors. It is only when your
| system is nearing completion, and all necessary features
| have been invented and implemented, that you should think
| about rewriting select parts of that system in a systems
| language where more efficiency is possible. Not only will
| you only understand the system after it is written, but you
| can measure the actual performance of the system to figure
| out which parts you need to rewrite, and which parts have
| acceptable performance already and thus do not need to be
| rewritten.
| jxf wrote:
| > For most programs, the limiting factor is not
| performance but the cost of implementation and
| maintenance.
|
| The limiting factor for what? Their commercial success,
| or something else?
| db48x wrote:
| Success, yes. It need not be direct commercial success;
| lots of programs are written purely for internal use
| within some larger organization. And even when there are
| known performance requirements, Lisp is often still a
| good choice because the performance is handled by
| specialized hardware. Consider games, for example. Time
| to market and rapid iteration are both paramount, while
| performance is taken care of not by writing a fast
| renderer but by sending everything to the GPU to be
| rendered on specialized hardware.
| pfdietz wrote:
| Cache locality is improved if the Lisp has a compacting
| garbage collector, but you still need extra memory for
| the cdr links. Lisp Machines had cdr-coding that allowed
| those to be avoided but I don't think that's practical on
| stock architectures.
| db48x wrote:
| True, but don't forget that one of the _other_ causes of
| performance problems in lisp programs is the creation and
| subsequent collection of garbage. If you know ahead of
| time which of your data needs to be accessed or created
| by the most performance-sensitive parts of your program,
| then you can put that data into vectors to start with and
| avoid all of the extra work.
| pfdietz wrote:
| Long-lived things end up in older generations and aren't
| traversed much during GC, which is the moral equivalent
| of being statically allocated.
|
| There's overhead if you mutate old objects in a GCed
| system due to card marking.
|
| Lisp vectors are vectors of pointers, so there's still an
| overhead of dereferencing through it. Presumably the
| objects being pointed to end up compacted together,
| eventually.
| db48x wrote:
| I should have been more clear; I meant using vectors as
| arena allocators, so that all of your data ends up stored
| contiguously with no extra indirection.
| mik1998 wrote:
| > linked lists invariably result in terrible cache
| locality
|
| With a compacting GC (ie most of the relevant ones) lists
| often provide better memory locality than arrays. They
| only lose at large scale due to the extra 8 bytes per
| element, which makes arrays occasionally cheaper for large
| sequences.
| nequo wrote:
| Why do lists provide _better_ memory locality than
| arrays? The array already stores its elements next to
| each other in memory. The compacting GC helps arrange the
| elements of the list close to each other too, right? Then
| wouldn't the compacting GC at best even out the
| performance difference for sequential traversal?
| gpderetta wrote:
| Even with perfect cache locality, a linked list will be
| significantly slower to traverse than an array because of
| worse pipelining and parallelization. Partially unrolled
| lists can help, but at the cost of additional complexity.
|
| Of course not all container accesses require traversal.
| bitwize wrote:
| I don't know what you mean. Maybe in the early early 80s
| that was the case, but PCs were still 16-bit back then, and
| would've been a poor fit for Lisp anyway.
|
| One of the reasons why the Lisp machines died out is that
| round about the mid-80s or so, compiler technology improved
| for generic 32-bit processors, and it became possible to
| run Lisp software on a VAX, 68k, or RISC CPU faster than it
| ran on the Lisp machines' bespoke architecture. Back during
| the first AI hypecycle, the makers of Golden Common Lisp
| introduced the Hummingboard to the market, billed as an
| inexpensive solution to have a "Lisp machine in a PC". It
| was just a 386 on an ISA card (predating Compaq's 386 PC by
| about a year) with gobs of memory on board; a special build
| of Golden Common Lisp allowed code to be run on that CPU
| rather than the main one.
| coldtea wrote:
| > _One of the reasons why the Lisp machines died out is
| that round about the mid-80s or so, compiler technology
| improved for generic 32-bit processors, and it became
| possible to run Lisp software on a VAX, 68k, or RISC CPU
| faster than it ran on the Lisp machines' bespoke
| architecture._
|
| I'd say Lisp machines died out because Lisp died out
| (commercially). Other languages got popular, the second
| AI winter didn't help at all either.
|
| If Lisp itself had fared better (even if it was on
| generic hardware), Lisp machines could have been saved
| too, they could still use a VAX, 68k, or RISC CPU
| underneath, and optimize for the developer experience. Or
| they'd have turned Lisp machines from hardware into a
| "Lisp machine" OS or IDE/REPL/etc environment for other
| OSes succeeding. But none of that took off.
| zozbot234 wrote:
| > Or they'd have turned Lisp machines from hardware into
| a "Lisp machine" OS
|
| Emacs is that Lisp OS. But it's still lacking a good text
| editor.
| pfdietz wrote:
| Ok, that made me laugh.
| bitwize wrote:
| > Or they'd have turned Lisp machines from hardware into
| a "Lisp machine" OS or IDE/REPL/etc environment for other
| OSes succeeding.
|
| You mean like Allegro CL? LispWorks? Genera on DEC
| Ultrix?
|
| With the exception of Genera, these solutions are
| available, maintained, and supported today. Even Genera
| hung on for a good few years. Lisp _machines_ started to
| wane a few years before the AI winter hit in full force,
| because the idea of dedicated hardware to run Lisp
| programs made sense when your alternative was Maclisp
| struggling on a PDP-10, but became expensive, slow, and
| fiddly compared to the generic boxes running fast 32-bit
| processors with, again, much improved compiler tech.
| Genera on DEC Alpha, even as an interpreted VM, was so
| much faster than any of Symbolics's bespoke CPU
| architectures that Symbolics just quit making hardware
| and declared the Alpha version of Genera the official
| upgrade path.
| rjsw wrote:
| One road not taken would have been to port the MIT/LMI/TI
| Lisp Machine software to stock hardware, it was 32-bit
| until the end so didn't need an Alpha as the host. A fast
| 68020 with a custom MMU would have been fine. There was
| only around 12k of "assembler" to rewrite to provide the
| VM and simple RTOS.
| lispm wrote:
| Symbolics had its own PC implementation of Lisp: CLOE.
|
| From the Lisp FAQ: "CLOE (Common Lisp Operating
| Environment) is a cross-development environment for IBM
| PCs (MSDOS) and Symbolics Genera. It includes CLOS,
| condition error system, generational garbage collection,
| incremental compilation, code time/space profiling, and a
| stack-frame debugger. It costs from $625 to $4000 and
| requires 4-8mb RAM and a 386 processor."
|
| Later they also ported Genera to DEC Alpha. Currently I
| have access to an implementation which runs on ARM64 and
| Intel x64.
| lispm wrote:
| No, the reason for less than stellar performance of Lisp on
| the 32bit x86 CPUs were things like too few registers.
|
| On 64bit Intel CPUs this is much less of a problem and Lisp
| runs roughly twice as fast in 64bit mode. I would think
| that this even holds without taking advantage of wider data
| words.
|
| On other architectures this was less of a problem. 32bit
| CPUs had more registers on a RISC or some other CISC CPUs.
| samatman wrote:
| It's not. Two counterexamples: LuaJIT's interpreter/VM, and
| Wren's, both use NaN boxing, and are among the fastest VMs
| out there (we're ignoring JITs as out of scope).
|
| It isn't tagging pointers that makes things (some Lisps are
| 'things', some are not) slow: it's pervasive abstraction
| and indirection. Doing some masking on a pointer is
| pipelined, out-of-order, one to three cycles, and is
| absolutely dwarfed by cache misses and conditional
| mispredictions, let alone the sort of pointer chasing which
| is common in idiomatic Lisp (or indeed Python) code.
| rjsw wrote:
| Maclisp on PDP-10 and Franz Lisp didn't, they used bibop type
| checking instead of tagged pointers.
| db48x wrote:
| That's merely an elaboration on the same theme.
| rjsw wrote:
| It isn't, pointers in Franz Lisp can be used unmodified
| in C.
| db48x wrote:
| But those pointers still have bits in them that indicate
| the type of the object that they point to. It is merely
| the case that the tag bits are chosen at run time so that
| they correspond to allocated memory regions, rather than
| being chosen ahead of time.
| rjsw wrote:
| None of the bits in the pointer have been "stolen"
| though, which is the point of the referenced article,
| they are all still available to access the full address
| space of the CPU.
| db48x wrote:
| That doesn't mean that they aren't still there, and don't
| tell you the type. It's still a tagged pointer, and
| because all objects of the same type must be allocated
| out of the same page (or collection of pages), it still
| limits how much memory can be dedicated to objects of
| that type.
| gpderetta wrote:
| And the modern take on this is NaN Boxing!
| junon wrote:
| VM writers use this egregiously. I've not heard of it causing
| issues. It's not that complicated.
|
| Lua, Lisps, many others.
| masklinn wrote:
| Apple's been using that for more than a decade:
| https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-bui...
| http://www.sealiesoftware.com/blog/archive/2013/09/24/objc_e...
|
| OCaml's also been using tagging to mark unboxed type pretty
| much from the start, nearly 30 years ago:
| https://dev.realworldocaml.org/runtime-memory-
| layout.html#:~....
| duskwuff wrote:
| And it's not like Apple isn't aware of the dangers, either;
| they used top-byte tagging on the 68000, and it took them a
| lot of time and effort to migrate away from that to support
| 16+ MB systems.
| gpderetta wrote:
| IIRC the issue was that 68000 would transparently mask the
| top bits, so code could get away with avoiding the explicit
| masking, which of course breaks when you move to a larger
| address space.
|
| More modern cpus enforce specific bit patterns for the
| MSBs, so broken code would be caught immediately.
| coldtea wrote:
| This "trick" has been used since the dawn of time in major
| platforms
| adql wrote:
| I dunno, pointers were kinda small on 8/16 bit platforms
| titzer wrote:
| All fast JavaScript VMs use some form of pointer tagging. V8
| uses (compressed) tagged pointers and both JSC and Firefox used
| NaN-boxing.
|
| WasmGC has a i31ref as part of its reference hierarchy, which
| is implemented with pointer tagging.
| jstimpfle wrote:
| It's common for libraries to use the 2 lowest bits in tree
| structures e.g. RB tree. It's not a problem at all since the
| data structure is controlled by the library. Even if tree nodes
| are allocated/provided by the user, having a "4-byte aligned"
| requirement in the API contract isn't a problem at all -- in
| fact you need to work quite hard to allocate a structure with a
| pointer that isn't at least 4 byte aligned (i386), or 8 byte
| aligned (x64).
| a1369209993 wrote:
| > in fact you need to work quite hard to allocate a structure
| with a pointer that isn't at least 4 byte aligned (i386), or
| 8 byte aligned (x64).
|
| Well, no, actually; it's:
|
|     p = malloc(size+1)+1;
|
| It's just quite implausible that you'd do that _by accident_.
| menaerus wrote:
| Well, no, it isn't, since malloc returns a void*, and for a
| void* the compiler cannot figure out the offset by which to
| increment the address. For that operation to compile you
| need to cast the result of malloc to p* first, which means
| the resulting alignment will still be correct.
|
| The actual danger is in reinterpreting a pointer, or some
| arbitrary chunk of memory, as a type that doesn't
| necessarily satisfy the alignment requirements for that
| given memory address.
| adql wrote:
| Any example ?
| mbrubeck wrote:
| This bitvec I wrote for Servo and Gecko uses this technique,
| and has been shipping in Firefox releases for over six years
| with no bugs found:
|
| https://docs.rs/smallbitvec/
|
| I'm pretty sure you can find several more examples in Firefox,
| as well as other major browsers.
| gizmo wrote:
| Stealing bits from pointers makes sense sometimes, but in most
| cases an easier and less error-prone way to increase memory
| efficiency is to just use indices or relative pointers. Instead
| of shaving individual bits of a pointer take advantage of the
| fact that most pointers in a program are almost exactly alike.
|
| Instead of storing a 64 bit prev and next pointer for an item in
| a linked list just store e.g. two 16 bit offsets. Then the
| next_ptr = (this_ptr & base_mask) + offset * alignment.
|
| Most of the time wasting memory by storing full 64bit pointers is
| no big deal. But when it is, you can go a lot further than just
| shaving off a few bits.
|
| If you want to be extra clever you can encode values in the base
| address of your pointers by taking advantage of how virtual
| memory is allocated. Every process has its own virtual memory
| pages and you can request memory pages at specific address
| ranges. For instance if you align all your memory at a 32gb
| boundary then you can use 32bit pointers + a base pointer without
| even having to mask. You can also use virtual memory allocation
| tricks to convey information about alignment, typing, whether a
| pointer is allowed to be shared between threads and so on. And
| again, without having to butcher the pointer itself.
| jsheard wrote:
| Considering how common it is to use indices in lieu of pointers
| in Rust, it would be nice if it had native NonMax integer types
| similar to the NonZero types it already has. NonZero gives you
| the niche-filling optimization (e.g. Option<NonZero> and
| NonZero are the same size) but with indices you _want_ zero to
| be a valid value, so it would be natural to use the maximum as
| the illegal value instead. You can kind of do it by offsetting
| the index by 1 or inverting the bits then putting it in a
| NonZero but that 's a small extra runtime cost on every use.
| joseluis wrote:
| You can do a handy struct wrapper over a private Nonzero that
| xors the given value with the prohibited value (max in this
| case) at the new constructor. And like so for the get method.
| Xoring is very cheap. That's my favorite way of storing
| indices/links for nodes, since you can wrap them in Option
| for free.
| zozbot234 wrote:
| This requires using arenas extensively in your program or
| perhaps playing tricks with virtual memory, because your system
| allocator will give you arbitrary pointers by default and
| you'll often want to point to an object from a different
| allocation.
| gizmo wrote:
| Right. Which is why I favor doing your own memory management
| entirely (which has some huge advantages) or not worrying
| about memory at all and trusting a garbage collector. I don't
| think there are many situations left where memory management
| is too important to leave to a gc but not important enough to
| do from scratch.
| adql wrote:
| Plenty of niche ways that are limited enough to not require
| program wide architecture change
|
| https://muxup.com/2023q4/storing-data-in-pointers#some-
| real-...
| gpderetta wrote:
| Definitely indices are a great option, but then you need a base
| pointer and a way to allocate off of it. That can add
| significant complexity. So it is all a tradeoff.
| dkjaudyeqooe wrote:
| That's really not the application for this; it's tags, which
| are endlessly useful for characterizing the pointer. For
| instance you can use the MSBs to establish the type of what's
| pointed to, or for reference counting. I use the LSB to
| indicate if the value is a pointer or an unsigned integer,
| whereby you can store both an (integer) key to a record in a
| database, or a pointer to the record data in memory. You could
| use another bit in the pointer to indicate that the record is
| dirty.
|
| Using unused bits in pointers can make your code more efficient
| and simpler.
| westurner wrote:
| Can unused bits be used for signed pointers or CRCs?
| aetherspawn wrote:
| I would argue that we shouldn't bother to build CRC
| checking of memory into programs and that ECC RAM should
| just become the norm eventually.
|
| 99.9999% of memory operations aren't CRC-d so trying to
| fight against flaky hardware is a bit of an uphill battle.
| pavlov wrote:
| Technically it's an ointer on PowerPC and SPARC, but a pointe
| everywhere else.
| sojuz151 wrote:
| This looks like something that debuggers would hate, because
| following pointers would be broken.
| jxf wrote:
| I was just thinking this. You'd either need to special-case
| your debuggers or be able to use something that provided
| bespoke hooks for pointer-following.
| gpderetta wrote:
| Wrap it in a class and teach the debugger to interpret it. It
| might be possible with gdb and python pretty-printers for
| example.
| zozbot234 wrote:
| Are there ways on Linux to make sure that all user-visible
| pointers in a process are allocated within some subset of the
| overall address space? For instance, imagine writing a 64-bit
| program such that all userspace pointers are allocated within the
| first 32 bits. This would allow for representing them in memory
| using four bytes, as is commonly done on 32-bit architectures--
| without adding more extensive or ad-hoc changes such as a new
| "x32" binary format. Except that you could also limit to 24-bit
| (giving 16M of address space), 16-bit (for a "tiny" code model)
| or anything else. Of course, this would require some amount of
| support in the linker/loader, userspace libraries and the kernel
| itself. But the effort might be worthwhile if it can allow for
| replacing more ad-hoc approaches.
| fathyb wrote:
| An allocator is free to only `mmap` addresses that are within a
| range. I think jemalloc could allow you to do that using custom
| arenas.
| ksherlock wrote:
| Linux mmap (since 2.6) has a MAP_32BIT flag, but only for
| x86-64:
|
| Put the mapping into the first 2 Gigabytes of the process
| address space. This flag is supported only on x86-64, for
| 64-bit programs. It was added to allow thread stacks to be
| allocated somewhere in the first 2 GB of memory, so as to
| improve context-switch performance on some early 64-bit
| processors. Modern x86-64 processors no longer have this
| performance problem, so use of this flag is not required on
| those systems. The MAP_32BIT flag is ignored when MAP_FIXED is
| set.
| asb wrote:
| I somehow hadn't come across this library, but have a whole blog
| post on the various ways people store data in pointers (and
| when+why it's safe) https://muxup.com/2023q4/storing-data-in-
| pointers
| queuebert wrote:
| If the ointer points to a very large block of memory, is it an
| oinker?
| willcipriano wrote:
| I never had the problem of too many pointers, but if I did I'd
| try the "arena"[0] approach.
|
| Preallocate a huge area of memory and use smaller indexes
| relative to the start instead, provided a pointer is bigger
| than an int. You might be able to use this library in that
| approach too, and keep partial pointers instead.
|
| [0]https://www.rfleury.com/p/untangling-lifetimes-the-arena-
| all...
| gumby wrote:
| > What do you call a pointer with the high bits stolen? An
| ointer!
|
| What is the name of a pointer with the low bits stolen? Machines
| with strict word addressing will ignore the bottom two bits so
| you can safely store what you want there.
|
| Not sure I've used, much less programmed a machine with that
| strict constraint in a while though. These days CPUs are all
| designed to implement the C abstract machine. At least the GPU
| folks are not subject to this constraint.
| gwbas1c wrote:
| Could something like this be used to lower memory footprint?
| menaerus wrote:
| Besides making some lock-free algorithms feasible at all and
| others easier to implement, yes, it can also be used to reduce
| memory pressure, at the cost of extra CPU cycles, if you know
| that the metadata you want to track fits in those 16+ bits.
___________________________________________________________________
(page generated 2024-05-08 23:00 UTC)