[HN Gopher] Tiny Pointers
       ___________________________________________________________________
        
       Tiny Pointers
        
       Author : levzettelin
       Score  : 106 points
       Date   : 2025-02-12 09:43 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | levzettelin wrote:
       | Can someone ELI5?
        
         | trymas wrote:
         | Most likely related with this post:
         | https://news.ycombinator.com/item?id=43002511
        
           | chews wrote:
           | Correct, this is about Yao's conjecture and the implications
           | for other hash implementations.
        
         | taeric wrote:
         | I could be wrong, but I think the easiest way to think of this
         | is to consider how much extra memory programs took when
         | compiled with 64 bit pointers over 32 bit ones. Suddenly every
         | pointer takes double the memory. Which, sure, isn't a huge deal
         | if you don't have a lot of memory allocations. But, if you do,
         | it can add up.
         | 
         | Places it would likely impact more than you'd realize is in
         | higher level language arrays where each item in the array is a
         | pointer. For similar reasons, many datastructures can be coded
         | such that each "pointer" is an array index instead.
         | 
         | So, extending all of that, what if you could make your pointers
         | even smaller than 32 bits? If you know the addressable need for
         | where the pointer is used, there is no reason you can't go
         | smaller.
        
           | card_zero wrote:
           | Yet these aren't offsets from an address, that is, array
           | indexes?
        
             | taeric wrote:
             | But the idea is still the same? You can make "smart
             | pointers" that are attached to arenas or whatever you want
             | to call them. On those, the addressable size of the pointer
             | is confined to how large the arena is.
        
               | card_zero wrote:
               | Eh, you underestimate my naivety. To rephrase: are these
               | just array indexes, then?
               | 
               | I mean I presume it's something cleverer than just "let's
               | bit-pack our indexes to save space".
               | 
               | ..."and make it dynamic" (arenas)
               | 
               | Oh wait, _variable-size_ tiny pointers. Yes. Now it 's
               | getting satisfyingly complicated. Maybe like how UTF-8
               | works?
        
               | taeric wrote:
               | I mean, in a way, a pointer is just an array index into
               | heap space? Such that, I think this is accurate? I'd drop
               | the "just", though. They can be thought of that way, but
               | the idea is how to keep the programmer from having to do
               | that manually.
               | 
               | To rephrase, this is how to mechanically bit-pack indexes
               | to save space. Doing it in a pointer means you are likely
               | wanting to share these pointers with other parts of the
               | code, so you also need to have them be self contained.
               | Naive way would be for each pointer to be a tuple
               | identifying arena and having its offset. But there is
               | still a lot of effort that has to be done for that to
               | work well. This paper looks at that effort and gives a
               | pass at how worthwhile it is.
        
               | judofyr wrote:
               | > To rephrase: are these just array indexes, then? I mean
               | I presume it's something cleverer than just "let's bit-
               | pack our indexes to save space".
               | 
               | The pointers can be _dereferenced_ into an array index
               | (when given the key), but they are smaller in size than a
               | "bit-packed index", although I'm not quite sure what you
               | mean by this. If you want to to represent an index into
               | an array of size `n` there's no representation where
               | _all_ indexes take less than log(n). You can make some of
               | them smaller, but then you make other bigger (like in
               | UTF-8). The Tiny Pointers are _all_ smaller than log(n)
               | (and some of them would be exactly equal, despite
               | representing different indexes!), but the only way of
               | determining the actual index is to dereference it
               | together with original key that was used to create it.
        
               | robertsdionne wrote:
               | How is this novel?
               | 
               | This is like using a fixed sharding scheme plus delta-
               | compression for indices relative to shard addresses plus
               | varint encoding the deltas.
        
               | robertsdionne wrote:
               | After reading more, it seems the novelty is in the two
               | specific schemes that use these techniques to store data
               | with high probability into tables of extremely high load
               | factors with bounded pointer sizes, which are more
               | complex than simple array indices due to having to
               | resolve which fallback table is storing the key-value
               | pair.
               | 
               | One scheme proves tiny pointer size bounds for fixed
               | length tiny pointers. The other proves bounds for
               | variable length pointers.
        
         | ceeam wrote:
         | Zoomers invented indexes.
        
           | ramon156 wrote:
           | Programmer made program. What's new! When will they make
           | magic orbs?!
        
           | nimih wrote:
           | Every author on this paper is a millennial or gen x.
        
       | mont_tag wrote:
       | Didn't Python's compact dictionary implementation do this a
       | decade ago?
       | 
       | "The dict type now uses a "compact" representation based on a
       | proposal by Raymond Hettinger which was first implemented by
       | PyPy. The memory usage of the new dict() is between 20% and 25%
       | smaller compared to Python 3.5." --
       | https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-com...
       | 
       | "Note, the sizeof(index) can be as small as a single byte for
       | small dicts, two bytes for bigger dicts and up to
       | sizeof(Py_ssize_t) for huge dict." --
       | https://mail.python.org/pipermail/python-dev/2012-December/1...
       | 
       | The "tiny pointers" are in the _make_index method in the proof of
       | concept code. --
       | https://code.activestate.com/recipes/578375-proof-of-concept...
       | @staticmethod           def _make_index(n):               'New
       | sequence of indices using the smallest possible datatype'
       | if n <= 2**7: return array.array('b', [FREE]) * n       # signed
       | char               if n <= 2**15: return array.array('h', [FREE])
       | * n      # signed short               if n <= 2**31: return
       | array.array('l', [FREE]) * n      # signed long
       | return [FREE] * n
       | 
       | The logic is still present today in CPython. --
       | https://raw.githubusercontent.com/python/cpython/3e222e3a159...
       | dk_indices is actual hashtable.  It holds index in entries, or
       | DKIX_EMPTY(-1)       or DKIX_DUMMY(-2).       Size of indices is
       | dk_size.  Type of each index in indices varies with dk_size:
       | * int8  for          dk_size <= 128       * int16 for 256   <=
       | dk_size <= 2**15       * int32 for 2**16 <= dk_size <= 2**31
       | * int64 for 2**32 <= dk_size
        
         | kragen wrote:
         | No, although it's not completely unrelated.
        
         | yorwba wrote:
         | These are the "traditional _log n_ -bit pointers" that tiny
         | pointers improve on.
        
         | judofyr wrote:
         | The "compact dictionary" representation you're talking about
         | not always using 64-bit numbers, but rather using log(n)-bit
         | values when you have n entries (i.e. using 8-bit values when
         | there's only <= 128 entries). The paper linked here talks about
         | pointers which uses _less_ than log(n)-bits for n entries.
        
       | kragen wrote:
       | Note that this is not the paper by Krapivin that
       | https://news.ycombinator.com/item?id=43002511 is about.
        
         | vlovich123 wrote:
         | It's the paper mentioned in the same article and one of the
         | author's is his professors.
        
           | kragen wrote:
           | Yes.
        
       | judofyr wrote:
       | I looked a bit into this a few years back and found it quite
       | interesting. Despite them calling them "Tiny Pointers" I would
       | say it's closer to a open addressing hash map. You have a
       | specific key, and then you can "allocate" an entry in the hash
       | map. This gives you back a "pointer". You can then later use the
       | original key _and_ the pointer together to determine the index of
       | the entry. There 's also a slight chance that the allocation will
       | fail (similar to a hash collision in a hash map). The "trick"
       | here is that two different keys can end up having the _exact_
       | same pointer (because we 're always dereferencing with the key).
       | This makes them more compact.
       | 
       | I was struggling a bit to come up with good use cases for it.
       | Their examples are all around combining them with existing data
       | structures and they show that the space complexity is smaller,
       | but it wasn't completely clear to me how feasible this would
       | actually be in practice.
        
         | qazxcvbnm wrote:
         | I read the paper a few years ago and I agree that for such an
         | incredible algorithmic improvement, it's not trivial to find a
         | use case, as you still need to maintain a separate (albeit
         | algorithmically insignificant) lookup table. When I read the
         | paper I (mistakenly) hoped it could be used for indexing on-
         | disk structures without ever hitting the disk for things like B
         | tree internal nodes. To get its sweet sweet algorithmic
         | complexity, up its sleeve is once again (as I recall) the age
         | old trick of rebuilding the structure when the size doubles,
         | which makes it much less efficient than it sounds for most
         | practical use cases. I suppose one good use case for this might
         | be compressing database indexes, where you need to maintain a
         | separate structure anyway and the space savings can be worth
         | it.
        
           | judofyr wrote:
           | I didn't invest a lot of time in it because when these papers
           | show no _practical_ applications I always assume that the
           | constant factors are too high. Especially when there 's also
           | no _specific_ values of the constant factors presented. It 's
           | still somewhere back in my mind to implement it one day to
           | figure out the nitty gritty details.
        
             | boothby wrote:
             | Back of the envelope calculations from the abstract: An
             | 8-bit tiny pointer would be sufficient to reference an
             | array of size 2^2^2^3 [?] 1e77 ([?]atoms in the visible
             | universe) with fullness 31/32 [?] 97%. Or size 65536 and
             | fullness 63/64. For 16-bit tiny pointers, there's no point
             | in going bigger than the universe; fullness goes to 99.98%.
             | Like you, I'd love to see such an example worked out.
        
         | taeric wrote:
         | It isn't hard to make a datastructure that indexes into itself.
         | BDDs, for example, are often coded in this way. I did an
         | admittedly poor job of one at https://taeric.github.io/trading-
         | with-bdds.html, but I think it is enough to see the idea well
         | enough.
        
         | Izikiel43 wrote:
         | This guy used that paper to create a new hash table which is
         | much faster at high load that current implementations:
         | 
         | https://www.quantamagazine.org/undergraduate-upends-a-40-yea...
        
           | nsteel wrote:
           | That paper was front page 2 days ago. It's the reason this
           | paper is front page today. And while it is fantastic work, it
           | has no obvious break-through application, as was discussed:
           | https://news.ycombinator.com/item?id=43004955
        
       | BenoitEssiambre wrote:
       | Here's a somewhat related post where I argue that logic that
       | _can_ have small pointers has lower entropy and is more optimal
       | as bayesian model of the domain:
       | https://benoitessiambre.com/entropy.html
        
       | kittikitti wrote:
       | Is there a peer reviewed version of this or does Hacker News
       | exclusively post non peer reviewed works?
        
         | wbolt wrote:
         | https://dl.acm.org/doi/10.1145/3700594
        
       | kazinator wrote:
       | > _How large do the pointers need to be? The natural answer is
       | that each pointer uses log nbits. However, the fact that each
       | pointer has a distinct owner makes it possible to compress the
       | pointers to o(log n) bits._
       | 
       | What if you have to _debug_ the whole situation, such that you
       | don 't always know who is the owner of a pointer you are looking
       | at?
       | 
       | > _A user k can call Allocate(k) in order to get a tiny pointer
       | p; they can dereference the tiny pointer pby computing a function
       | Dereference(k,p) whose value depends only on k, p, and random
       | bits; and they can free a tiny pointer p by calling a function
       | Free(k,p)._
       | 
       | That is tantamount ot saying that the pointer is not actually _p_
       | but the tuple _< k, p>_, and so its size consists of the number
       | of bits in _k_ , the number of bits in _p_ plus an indication of
       | where the division between these bits lie: where _k_ ends and _p_
       | begins.
       | 
       | We can abbreviate _< k, p>_ to _p_ in contexts where _k_ can be
       | implicitly understood.
        
       ___________________________________________________________________
       (page generated 2025-02-12 23:01 UTC)