[HN Gopher] Fine, I'll Play With Skiplists
___________________________________________________________________
Fine, I'll Play With Skiplists
Author : foldU
Score : 90 points
Date : 2024-11-27 15:50 UTC (4 days ago)
(HTM) web link (buttondown.com)
(TXT) w3m dump (buttondown.com)
| ch33zer wrote:
| In the multi writer skiplist code I think there's a leftover line
| from the previous example:
|
| atomic.StorePointer(&prev.next, node)
|
| Apart from that I really like the simulations!
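|
| Something like the sketch below is what I'd expect the multi-
| writer insert to look like (hand-wavy; the types are mine, not
| the article's, and it assumes "sync/atomic" and "unsafe"): the
| link has to be published with a CAS loop, so an unconditional
| StorePointer would clobber a concurrent insert.
|
|     type node struct {
|         key  int
|         next unsafe.Pointer // *node; one per level in reality
|     }
|
|     func insertAfter(prev, n *node) {
|         for {
|             next := atomic.LoadPointer(&prev.next)
|             // n isn't published yet, so a plain store is fine
|             n.next = next
|             if atomic.CompareAndSwapPointer(
|                 &prev.next, next, unsafe.Pointer(n),
|             ) {
|                 return
|             }
|             // lost the race; reload prev.next and retry
|         }
|     }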
| foldU wrote:
| whoops! thank you!
| ajross wrote:
| Skiplist discussions that don't talk about heap effects are
| probably incomplete. The big limitation with skiplists for
| anything that isn't a "mostly-write-once" workload (something
| that, it should be noted, doesn't get much benefit from the
| atomic implementation discussed) is that the metadata record is
| _variable sized_, which does bad things to heap fragmentation
| and cache locality (two things you probably care deeply about if
| you're going so far as to junk your AVL tree for a lockless data
| structure) and isn't amenable to the slab-style optimizations
| used elsewhere.
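|
| Concretely, the per-node record is shaped something like this
| (field names are made up, not any particular implementation's):
|
|     // the tower height is random per node, so every allocation
|     // is a different size -- awkward for slabs and pools
|     type skipNode struct {
|         key, value []byte
|         next       []*skipNode // len == tower height
|     }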
|
| I love skiplists for their simplicity, especially compared with
| the balanced trees against which they compete. But the truth is
| they're very one-trick-pony data structures and don't bring a
| lot to the table as a general choice. Their home is pretty
| limited.
| surajrmal wrote:
| Doesn't the author mention using arenas?
| ajross wrote:
| Arenas are just heaps, and subject to all the same problems.
| It's true that in the LSM use case in question (and I need to
| point out that databases are not my area!) there is a clear
| "flush" operation that will put a limit on how much heap
| churn and fragmentation you need to tolerate. And maybe
| that's an answer that makes skiplists more desirable in this
| context. But things like long-lived app data and language
| runtimes don't have that, and skiplists turn into headaches.
|
| A simpler consequence of the same problem: I mostly work in
| embedded stuff and skiplists are pretty much a non-starter as
| they can't easily be made intrusive without a bunch of waste.
| peterfirefly wrote:
| Different sizes would be allocated in different arenas.
| tomnipotent wrote:
| It's my understanding that in the context of append-only
| skip lists, like what RocksDB uses for memtables, you can
| see wasted space at the end of a block within an arena but
| no fragmentation between records within a block. SSTables
| are another story.
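|
| Roughly the bump-pointer idea below (a sketch; the names are
| mine, not RocksDB's): waste only shows up as the unused tail
| of a block, never as holes between records.
|
|     const blockSize = 4096
|
|     type arena struct {
|         blocks [][]byte
|         cur    []byte // unused tail of the current block
|     }
|
|     // allocate assumes n <= blockSize
|     func (a *arena) allocate(n int) []byte {
|         if n > len(a.cur) {
|             // abandon the tail, start a fresh block
|             b := make([]byte, blockSize)
|             a.blocks = append(a.blocks, b)
|             a.cur = b
|         }
|         p := a.cur[:n:n]
|         a.cur = a.cur[n:]
|         return p
|     }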
| atombender wrote:
| Skiplists can be very effective in disk structures for a few
| reasons.
|
| For one, a sorted sequence of keys can be written to file in a
| single pass, and it's easy to interleave with the data itself.
| For example, Lucene uses multi-level skip lists in compressed
| posting list files in order to quickly jump to a term and then
| jump to the lowest matching document ID for that term. Since
| these files are immutable, the tree is only built once and all
| the data can be stored in sorted order, which has the added
| benefit that these files can also be merged in a single pass.
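|
| The single-pass part is the nice bit: because the entries
| arrive already sorted, the skip levels can be decided purely
| from each entry's position as it streams by. A toy version in
| Go (the interval is made up, and this isn't Lucene's actual
| format):
|
|     const skipInterval = 16
|
|     // every 16th entry gets a level-1 skip pointer, every
|     // 256th also a level-2 pointer, and so on
|     func skipLevelsFor(i int) int {
|         levels := 0
|         for i > 0 && i%skipInterval == 0 {
|             levels++
|             i /= skipInterval
|         }
|         return levels
|     }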
| derefr wrote:
| > It's generally my view that from a bird's eye view LSMs are
| simpler than on-disk Btrees, largely because they are more
| composable and a lot more of their data is immutable.
|
| I dunno about this. On-disk B-trees _can_ actually be pretty dang
| simple! Both conceptually and in implementation.
|
| In concept, for an on-disk B-tree "table", you've got:
|
| 1. one growable file, or a sequence of fixed-size split files,
| abstracted over into a "vector of pages" type that you can 1.
| find the base pointer of, 2. overwrite a full page of, or 3.
| extend;
|
| 2. a "metapage" type, that works like the world's most trivial
| filesystem: it owns a "freelist" of page numbers, and keeps a
| pointer to the live root of the B-tree; and it embeds its own
| hash, and a version number.
|
| 3. a mechanism for discovering/choosing the newest live+valid
| metapage, essentially making this trivial "filesystem" into a
| trivial "journalling" filesystem. (in e.g. LMDB, you just have
| two metapages in pages 0 and 1, that are alternately written to,
| sort of like double-buffering; on DB open, the newer of the two
| by version is chosen if it's valid by hash; otherwise the older
| is used.)
|
| 4. a readers-writer lock (which doesn't need to be persisted to
| disk, because the on-disk file is never in a volatile/dirty
| state. You can just OS-advisory-lock the file -- those locks
| disappear when the creator process dies, which _is_ what you want
| here);
|
| 5. read-only transactions, that take a reader lock, discover the
| newest live metapage, dereference the B-tree root, and pass that
| pointer off to the user, letting them have at it, reading the
| B-tree arbitrarily as if it were an in-memory data structure;
|
| 6. read-write transactions, that take a writer lock, discover the
| newest live metapage, dereference the b-tree root, set up an in-
| memory dirtied pages map, and then let you read arbitrarily +
| write arbitrarily (with reads indirected through the dirtied-
| pages map, acting as an overlay);
|
| 7. Copy-on-Write updates to "clean" pages during rw txs, by
| cloning the clean page contents onto free[listed] pages, dirtying
| the clone with the update, and adding the dirty version of the
| page to the dirtied pages map, to overlay the original page;
|
| 8. a read-write tx commit op that "propagates" the changes from
| dirty pages, by Copy-on-Write rewriting the B-tree ancestors of
| the parent pages to point to the new dirty page numbers, until
| you derive a new B-tree root page -- which you then create a new
| metapage for, _fsync the file_, and then write the metapage.
|
| (Note that this last bit means that a read-write transaction will
| effectively implicitly "roll back" on crash/power cut -- the data
| is written _before_ the "journal entry" for it, so if the
| "journal" doesn't make it to disk, then the data updates --
| including the updated freelist -- are "lost". All a failed tx has
| done is update free pages.)
|
| Sure, that's eight things you need to implement. But none of them
| are _complicated_ or _unintuitive_, the way that a "lock free
| concurrent skiplist" is. You could whiteboard any one of those
| abstractions above with a junior programmer, and they could
| probably figure out a basically-correct implementation just from
| the descriptions above.
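|
| For instance, item 3 is maybe a dozen lines. A toy version
| (the fields and checksum are placeholders, not LMDB's actual
| layout):
|
|     type metaPage struct {
|         version uint64 // monotonically increasing txn id
|         root    uint64 // page number of the B-tree root
|         free    uint64 // head of the freelist
|         hash    uint64 // checksum of the fields above
|     }
|
|     // placeholder checksum; a real one would hash the
|     // serialized page bytes
|     func (m *metaPage) valid() bool {
|         return m.hash == m.version^m.root^m.free
|     }
|
|     // pick the newer of the two alternating metapages,
|     // falling back to the older one if the newer write never
|     // made it to disk intact
|     func pickMeta(m0, m1 *metaPage) *metaPage {
|         newer, older := m0, m1
|         if m1.version > m0.version {
|             newer, older = m1, m0
|         }
|         if newer.valid() {
|             return newer
|         }
|         return older
|     }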
|
| And that means that the implementations of on-disk B-trees are
| usually pretty short-and-sweet.
|
| LMDB, for example, is one C file
| (https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblm...)
| with ~8k lines of actual code (according to
| https://github.com/AlDanial/cloc).
|
| Postgres's durable B-tree implementation
| (https://github.com/postgres/postgres/tree/master/src/backend...)
| is 22kloc, and ~10kloc of actual code.
| AtlasBarfed wrote:
| LSMs and SSTables involve compaction in all data write patterns
| that involve updates or deletes.
|
| I don't know how RocksDB handles this, but it is a very, very
| non-trivial problem in Cassandra.
|
| Let's say your update spans a row of data that is stored across
| multiple SSTables (the row of data, that is, a set of column
| values, was formed by multiple updates that spanned multiple
| flushes of the memtable to disk/SSTables).
|
| So either, as part of your update, you will be compacting (that
| is, rewriting) multiple SSTables of unknown, possibly very
| large, size into a new SSTable with the new updates, or you
| must employ some means of timestamping the individual values
| or, in the case of deletes, something like tombstones, delete
| markers, or timestamps.
|
| Then your read engine needs to be able to read multiple
| SSTables and apply update ordering in order to find the most
| recent actual value.
|
| This results in IO stress and reduced performance in the read
| path.
|
| LSMs exist to scale writes to large indexes. Essentially, what
| you are doing is queuing up the updates for a future process
| (compaction) to clean up.
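|
| Roughly, the per-value read-path resolution looks like the
| sketch below (crude, and not Cassandra's or RocksDB's actual
| code): the newest timestamp wins, and a tombstone that is
| newest means the value is gone.
|
|     type cell struct {
|         value     []byte
|         ts        int64 // write timestamp
|         tombstone bool  // delete marker
|     }
|
|     // resolve picks the newest version of one column across
|     // however many SSTables (plus the memtable) it appears in
|     func resolve(versions []cell) ([]byte, bool) {
|         var newest *cell
|         for i := range versions {
|             if newest == nil || versions[i].ts > newest.ts {
|                 newest = &versions[i]
|             }
|         }
|         if newest == nil || newest.tombstone {
|             return nil, false
|         }
|         return newest.value, true
|     }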
| hinkley wrote:
| In Subversion, a write replaced a node, then replaced its
| parent node all the way up to the root. So anyone doing a read
| in the middle of a merge only saw the old data, because it
| started from the old root node. I assume MVCC systems do
| something similar, but for others I would think a lock per node
| allows for a lot less contention. And your project then relies
| very much on good indexes to avoid reading rows, and thus
| parent nodes, you don't need.
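|
| The shape of that, ignoring balancing, is path copying -- a toy
| sketch, not Subversion's actual data model:
|
|     type node struct {
|         key         string
|         left, right *node
|     }
|
|     // insert clones every node on the path from the root down
|     // to the new leaf; untouched subtrees are shared, and a
|     // reader holding the old root keeps seeing the old tree
|     func insert(root *node, key string) *node {
|         if root == nil {
|             return &node{key: key}
|         }
|         cp := *root
|         if key < root.key {
|             cp.left = insert(root.left, key)
|         } else {
|             cp.right = insert(root.right, key)
|         }
|         return &cp
|     }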
___________________________________________________________________
(page generated 2024-12-01 23:00 UTC)