[HN Gopher] Tiny Pointers
___________________________________________________________________
Tiny Pointers
Author : levzettelin
Score : 106 points
Date : 2025-02-12 09:43 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| levzettelin wrote:
| Can someone ELI5?
| trymas wrote:
| Most likely related with this post:
| https://news.ycombinator.com/item?id=43002511
| chews wrote:
| Correct, this is about Yao's conjecture and the implications
| for other hash implementations.
| taeric wrote:
| I could be wrong, but I think the easiest way to think of this
| is to consider how much extra memory programs took when
| compiled with 64 bit pointers over 32 bit ones. Suddenly every
| pointer takes double the memory. Which, sure, isn't a huge deal
| if you don't have a lot of memory allocations. But, if you do,
| it can add up.
|
| Places it would likely impact more than you'd realize is in
| higher level language arrays where each item in the array is a
| pointer. For similar reasons, many datastructures can be coded
| such that each "pointer" is an array index instead.
|
| So, extending all of that, what if you could make your pointers
| even smaller than 32 bits? If you know the addressable need for
| where the pointer is used, there is no reason you can't go
| smaller.
| card_zero wrote:
| Yet these aren't offsets from an address, that is, array
| indexes?
| taeric wrote:
| But the idea is still the same? You can make "smart
| pointers" that are attached to arenas or whatever you want
| to call them. On those, the addressable size of the pointer
| is confined to how large the arena is.
| card_zero wrote:
| Eh, you underestimate my naivety. To rephrase: are these
| just array indexes, then?
|
| I mean I presume it's something cleverer than just "let's
| bit-pack our indexes to save space".
|
| ..."and make it dynamic" (arenas)
|
| Oh wait, _variable-size_ tiny pointers. Yes. Now it 's
| getting satisfyingly complicated. Maybe like how UTF-8
| works?
| taeric wrote:
| I mean, in a way, a pointer is just an array index into
| heap space? Such that, I think this is accurate? I'd drop
| the "just", though. They can be thought of that way, but
| the idea is how to keep the programmer from having to do
| that manually.
|
| To rephrase, this is how to mechanically bit-pack indexes
| to save space. Doing it in a pointer means you are likely
| wanting to share these pointers with other parts of the
| code, so you also need to have them be self contained.
| Naive way would be for each pointer to be a tuple
| identifying arena and having its offset. But there is
| still a lot of effort that has to be done for that to
| work well. This paper looks at that effort and gives a
| pass at how worthwhile it is.
| judofyr wrote:
| > To rephrase: are these just array indexes, then? I mean
| I presume it's something cleverer than just "let's bit-
| pack our indexes to save space".
|
| The pointers can be _dereferenced_ into an array index
| (when given the key), but they are smaller in size than a
| "bit-packed index", although I'm not quite sure what you
| mean by this. If you want to to represent an index into
| an array of size `n` there's no representation where
| _all_ indexes take less than log(n). You can make some of
| them smaller, but then you make other bigger (like in
| UTF-8). The Tiny Pointers are _all_ smaller than log(n)
| (and some of them would be exactly equal, despite
| representing different indexes!), but the only way of
| determining the actual index is to dereference it
| together with original key that was used to create it.
| robertsdionne wrote:
| How is this novel?
|
| This is like using a fixed sharding scheme plus delta-
| compression for indices relative to shard addresses plus
| varint encoding the deltas.
| robertsdionne wrote:
| After reading more, it seems the novelty is in the two
| specific schemes that use these techniques to store data
| with high probability into tables of extremely high load
| factors with bounded pointer sizes, which are more
| complex than simple array indices due to having to
| resolve which fallback table is storing the key-value
| pair.
|
| One scheme proves tiny pointer size bounds for fixed
| length tiny pointers. The other proves bounds for
| variable length pointers.
| ceeam wrote:
| Zoomers invented indexes.
| ramon156 wrote:
| Programmer made program. What's new! When will they make
| magic orbs?!
| nimih wrote:
| Every author on this paper is a millennial or gen x.
| mont_tag wrote:
| Didn't Python's compact dictionary implementation do this a
| decade ago?
|
| "The dict type now uses a "compact" representation based on a
| proposal by Raymond Hettinger which was first implemented by
| PyPy. The memory usage of the new dict() is between 20% and 25%
| smaller compared to Python 3.5." --
| https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-com...
|
| "Note, the sizeof(index) can be as small as a single byte for
| small dicts, two bytes for bigger dicts and up to
| sizeof(Py_ssize_t) for huge dict." --
| https://mail.python.org/pipermail/python-dev/2012-December/1...
|
| The "tiny pointers" are in the _make_index method in the proof of
| concept code. --
| https://code.activestate.com/recipes/578375-proof-of-concept...
| @staticmethod def _make_index(n): 'New
| sequence of indices using the smallest possible datatype'
| if n <= 2**7: return array.array('b', [FREE]) * n # signed
| char if n <= 2**15: return array.array('h', [FREE])
| * n # signed short if n <= 2**31: return
| array.array('l', [FREE]) * n # signed long
| return [FREE] * n
|
| The logic is still present today in CPython. --
| https://raw.githubusercontent.com/python/cpython/3e222e3a159...
| dk_indices is actual hashtable. It holds index in entries, or
| DKIX_EMPTY(-1) or DKIX_DUMMY(-2). Size of indices is
| dk_size. Type of each index in indices varies with dk_size:
| * int8 for dk_size <= 128 * int16 for 256 <=
| dk_size <= 2**15 * int32 for 2**16 <= dk_size <= 2**31
| * int64 for 2**32 <= dk_size
| kragen wrote:
| No, although it's not completely unrelated.
| yorwba wrote:
| These are the "traditional _log n_ -bit pointers" that tiny
| pointers improve on.
| judofyr wrote:
| The "compact dictionary" representation you're talking about
| not always using 64-bit numbers, but rather using log(n)-bit
| values when you have n entries (i.e. using 8-bit values when
| there's only <= 128 entries). The paper linked here talks about
| pointers which uses _less_ than log(n)-bits for n entries.
| kragen wrote:
| Note that this is not the paper by Krapivin that
| https://news.ycombinator.com/item?id=43002511 is about.
| vlovich123 wrote:
| It's the paper mentioned in the same article and one of the
| author's is his professors.
| kragen wrote:
| Yes.
| judofyr wrote:
| I looked a bit into this a few years back and found it quite
| interesting. Despite them calling them "Tiny Pointers" I would
| say it's closer to a open addressing hash map. You have a
| specific key, and then you can "allocate" an entry in the hash
| map. This gives you back a "pointer". You can then later use the
| original key _and_ the pointer together to determine the index of
| the entry. There 's also a slight chance that the allocation will
| fail (similar to a hash collision in a hash map). The "trick"
| here is that two different keys can end up having the _exact_
| same pointer (because we 're always dereferencing with the key).
| This makes them more compact.
|
| I was struggling a bit to come up with good use cases for it.
| Their examples are all around combining them with existing data
| structures and they show that the space complexity is smaller,
| but it wasn't completely clear to me how feasible this would
| actually be in practice.
| qazxcvbnm wrote:
| I read the paper a few years ago and I agree that for such an
| incredible algorithmic improvement, it's not trivial to find a
| use case, as you still need to maintain a separate (albeit
| algorithmically insignificant) lookup table. When I read the
| paper I (mistakenly) hoped it could be used for indexing on-
| disk structures without ever hitting the disk for things like B
| tree internal nodes. To get its sweet sweet algorithmic
| complexity, up its sleeve is once again (as I recall) the age
| old trick of rebuilding the structure when the size doubles,
| which makes it much less efficient than it sounds for most
| practical use cases. I suppose one good use case for this might
| be compressing database indexes, where you need to maintain a
| separate structure anyway and the space savings can be worth
| it.
| judofyr wrote:
| I didn't invest a lot of time in it because when these papers
| show no _practical_ applications I always assume that the
| constant factors are too high. Especially when there 's also
| no _specific_ values of the constant factors presented. It 's
| still somewhere back in my mind to implement it one day to
| figure out the nitty gritty details.
| boothby wrote:
| Back of the envelope calculations from the abstract: An
| 8-bit tiny pointer would be sufficient to reference an
| array of size 2^2^2^3 [?] 1e77 ([?]atoms in the visible
| universe) with fullness 31/32 [?] 97%. Or size 65536 and
| fullness 63/64. For 16-bit tiny pointers, there's no point
| in going bigger than the universe; fullness goes to 99.98%.
| Like you, I'd love to see such an example worked out.
| taeric wrote:
| It isn't hard to make a datastructure that indexes into itself.
| BDDs, for example, are often coded in this way. I did an
| admittedly poor job of one at https://taeric.github.io/trading-
| with-bdds.html, but I think it is enough to see the idea well
| enough.
| Izikiel43 wrote:
| This guy used that paper to create a new hash table which is
| much faster at high load that current implementations:
|
| https://www.quantamagazine.org/undergraduate-upends-a-40-yea...
| nsteel wrote:
| That paper was front page 2 days ago. It's the reason this
| paper is front page today. And while it is fantastic work, it
| has no obvious break-through application, as was discussed:
| https://news.ycombinator.com/item?id=43004955
| BenoitEssiambre wrote:
| Here's a somewhat related post where I argue that logic that
| _can_ have small pointers has lower entropy and is more optimal
| as bayesian model of the domain:
| https://benoitessiambre.com/entropy.html
| kittikitti wrote:
| Is there a peer reviewed version of this or does Hacker News
| exclusively post non peer reviewed works?
| wbolt wrote:
| https://dl.acm.org/doi/10.1145/3700594
| kazinator wrote:
| > _How large do the pointers need to be? The natural answer is
| that each pointer uses log nbits. However, the fact that each
| pointer has a distinct owner makes it possible to compress the
| pointers to o(log n) bits._
|
| What if you have to _debug_ the whole situation, such that you
| don 't always know who is the owner of a pointer you are looking
| at?
|
| > _A user k can call Allocate(k) in order to get a tiny pointer
| p; they can dereference the tiny pointer pby computing a function
| Dereference(k,p) whose value depends only on k, p, and random
| bits; and they can free a tiny pointer p by calling a function
| Free(k,p)._
|
| That is tantamount ot saying that the pointer is not actually _p_
| but the tuple _< k, p>_, and so its size consists of the number
| of bits in _k_ , the number of bits in _p_ plus an indication of
| where the division between these bits lie: where _k_ ends and _p_
| begins.
|
| We can abbreviate _< k, p>_ to _p_ in contexts where _k_ can be
| implicitly understood.
___________________________________________________________________
(page generated 2025-02-12 23:01 UTC)