[HN Gopher] Fine, I'll Play With Skiplists
       ___________________________________________________________________
        
       Fine, I'll Play With Skiplists
        
       Author : foldU
       Score  : 90 points
       Date   : 2024-11-27 15:50 UTC (4 days ago)
        
 (HTM) web link (buttondown.com)
 (TXT) w3m dump (buttondown.com)
        
       | ch33zer wrote:
       | In the multi writer skiplist code I think there's a leftover line
       | from the previous example:
       | 
       | atomic.StorePointer(&prev.next, node)
       | 
       | Apart from that I really like the simulations!
        
         | foldU wrote:
         | whoops! thank you!
        
       | ajross wrote:
       | Skiplist discussions that don't talk about heap effects are
       | probably incomplete. The big limitation with skiplists for
       | anything that isn't a "mostly-write-once" workload (something
       | that, it should be noted, doesn't get much benefit from the
       | atomic implementation discussed) is that the metadata record is
       | _variable sized_ , which does bad things to heap fragmentation
       | and cache locality (two things you probably care deeply about if
       | you're going so far as to junk your AVL tree for a lockless data
       | structure) and isn't amenable to the slab-style optimizations
       | used elsewhere.
       | 
        | I love skiplists for their simplicity, especially compared with
        | the balanced trees against which they compete. But the truth is
        | they're very one-trick-pony data structures and don't bring a
        | lot to the table as a general choice. Their home is pretty
        | limited.
        
         | surajrmal wrote:
         | Doesn't the author mention using arenas?
        
           | ajross wrote:
           | Arenas are just heaps, and subject to all the same problems.
           | It's true that in the LSM use case in question (and I need to
           | point out that databases are not my area!) there is a clear
           | "flush" operation that will put a limit on how much heap
           | churn and fragmentation you need to tolerate. And maybe
           | that's an answer that makes skiplists more desirable in this
           | context. But things like long-lived app data and language
           | runtimes don't have that, and skiplists turn into headaches.
           | 
           | A simpler consequence of the same problem: I mostly work in
           | embedded stuff and skiplists are pretty much a non-starter as
           | they can't easily be made intrusive without a bunch of waste.
        
             | peterfirefly wrote:
             | Different sizes would be allocated in different arenas.
        
             | tomnipotent wrote:
             | It's my understanding that in the context of append-only
             | skip lists, like what RocksDB uses for memtables, you can
             | see wasted space at the end of a block within an arena but
             | no fragmentation between records within a block. SSTables
             | are another story.
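A toy version of that allocation pattern (illustrative only, not RocksDB's actual arena API): records are bump-allocated into fixed-size blocks and never move, so the only waste is the stranded tail of each block when the next record doesn't fit.

```go
package main

import "fmt"

// arena is a toy bump allocator over fixed-size blocks, in the
// spirit of an append-only memtable arena. Records never move,
// so there is no fragmentation between records within a block;
// the only waste is the tail of each block when the next record
// doesn't fit.
type arena struct {
	blockSize int
	blocks    [][]byte
	off       int // offset into the current (last) block
	wasted    int // bytes stranded at block tails
}

func newArena(blockSize int) *arena {
	// Start "full" so the first alloc opens the first block.
	return &arena{blockSize: blockSize, off: blockSize}
}

func (a *arena) alloc(n int) []byte {
	if a.off+n > a.blockSize {
		a.wasted += a.blockSize - a.off // tail of old block is lost
		a.blocks = append(a.blocks, make([]byte, a.blockSize))
		a.off = 0
	}
	b := a.blocks[len(a.blocks)-1]
	p := b[a.off : a.off+n]
	a.off += n
	return p
}

func main() {
	a := newArena(4096)
	for i := 0; i < 100; i++ {
		a.alloc(100) // 40 records per block, 96 bytes wasted per full block
	}
	fmt.Println("blocks:", len(a.blocks), "wasted bytes:", a.wasted)
}
```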
        
         | atombender wrote:
          | Skiplists can be very effective in disk structures for a few
          | reasons.
         | 
         | For one, a sorted sequence of keys can be written to file in a
         | single pass, and it's easy to interleave with the data itself.
         | For example, Lucene uses multi-level skip lists in compressed
         | posting list files in order to quickly jump to a term and then
         | jump to the lowest matching document ID for that term. Since
         | these files are immutable, the tree is only built once and all
         | the data can be stored in sorted order, which has the added
         | benefit that these files can also be merged in a single pass.
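The single-pass idea can be sketched like this (hypothetical Go; names are illustrative and not Lucene's actual file format): given keys already in sorted order, emit a skip level recording every k-th key and its position, so a reader can jump near a target before scanning.

```go
package main

import "fmt"

// skipEntry records every k-th key of an immutable sorted base
// level, plus its position, so a reader can jump ahead.
type skipEntry struct {
	key int // e.g. a document ID
	pos int // index of that key in the base level
}

// buildSkips emits one skip level in a single pass over the
// already-sorted keys.
func buildSkips(sorted []int, k int) []skipEntry {
	var skips []skipEntry
	for i := 0; i < len(sorted); i += k {
		skips = append(skips, skipEntry{key: sorted[i], pos: i})
	}
	return skips
}

// seek finds the lowest key >= target: skip to the last entry
// at or below target, then scan forward.
func seek(sorted []int, skips []skipEntry, target int) int {
	start := 0
	for _, s := range skips {
		if s.key <= target {
			start = s.pos
		}
	}
	for i := start; i < len(sorted); i++ {
		if sorted[i] >= target {
			return sorted[i]
		}
	}
	return -1 // no key >= target
}

func main() {
	docs := []int{2, 5, 8, 13, 21, 34, 55, 89}
	skips := buildSkips(docs, 3)       // one skip entry every 3rd doc
	fmt.Println(seek(docs, skips, 20)) // lowest doc ID >= 20
}
```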
        
       | derefr wrote:
       | > It's generally my view that from a bird's eye view LSMs are
       | simpler than on-disk Btrees, largely because they are more
       | composable and a lot more of their data is immutable.
       | 
       | I dunno about this. On-disk B-trees _can_ actually be pretty dang
       | simple! Both conceptually and in implementation.
       | 
       | In concept, for an on-disk B-tree "table", you've got:
       | 
        | 1. one growable file, or a sequence of fixed-size split files,
        | abstracted over into a "vector of pages" type that you can (a)
        | find the base pointer of, (b) overwrite a full page of, or (c)
        | extend;
       | 
       | 2. a "metapage" type, that works like the world's most trivial
       | filesystem: it owns a "freelist" of page numbers, and keeps a
       | pointer to the live root of the B-tree; and it embeds its own
       | hash, and a version number.
       | 
       | 3. a mechanism for discovering/choosing the newest live+valid
       | metapage, essentially making this trivial "filesystem" into a
       | trivial "journalling" filesystem. (in e.g. LMDB, you just have
       | two metapages in pages 0 and 1, that are alternately written to,
       | sort of like double-buffering; on DB open, the newer of the two
       | by version is chosen if it's valid by hash; otherwise the older
       | is used.)
       | 
       | 4. a readers-writer lock (which doesn't need to be persisted to
       | disk, because the on-disk file is never in a volatile/dirty
       | state. You can just OS-advisory-lock the file -- those locks
       | disappear when the creator process dies, which _is_ what you want
       | here);
       | 
       | 5. read-only transactions, that take a reader lock, discover the
       | newest live metapage, dereference the B-tree root, and pass that
       | pointer off to the user, letting them have at it, reading the
       | B-tree arbitrarily as if it were an in-memory data structure;
       | 
       | 6. read-write transactions, that take a writer lock, discover the
       | newest live metapage, dereference the b-tree root, set up an in-
       | memory dirtied pages map, and then let you read arbitrarily +
       | write arbitrarily (with reads indirected through the dirtied-
       | pages map, acting as an overlay);
       | 
       | 7. Copy-on-Write updates to "clean" pages during rw txs, by
       | cloning the clean page contents onto free[listed] pages, dirtying
       | the clone with the update, and adding the dirty version of the
       | page to the dirtied pages map, to overlay the original page;
       | 
        | 8. a read-write tx commit op that "propagates" the changes from
        | dirty pages, by Copy-on-Write rewriting each dirty page's B-tree
        | ancestors to point to the new dirty page numbers, until you
        | derive a new B-tree root page -- which you then create a new
        | metapage for, _fsync the file_ , and then write the metapage.
       | 
       | (Note that this last bit means that a read-write transaction will
       | effectively implicitly "roll back" on crash/power cut -- the data
       | is written _before_ the  "journal entry" for it, so if the
       | "journal" doesn't make it to disk, then the data updates --
       | including the updated freelist -- are "lost". All a failed tx has
       | done is update free pages.)
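The metapage-selection step (point 3) is small enough to sketch. This is a hypothetical Go rendering of the LMDB-style double-buffered scheme described above, not LMDB's actual code; field names are illustrative.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// metapage mirrors the double-buffered scheme: two metapages
// alternate, and on open we take the newer one whose checksum
// verifies, falling back to the older one on a torn write.
type metapage struct {
	version uint64
	root    uint32 // page number of the live B-tree root
	sum     uint32 // checksum over version+root
}

func (m *metapage) checksum() uint32 {
	var buf [12]byte
	binary.LittleEndian.PutUint64(buf[0:], m.version)
	binary.LittleEndian.PutUint32(buf[8:], m.root)
	return crc32.ChecksumIEEE(buf[:])
}

func (m *metapage) valid() bool { return m.sum == m.checksum() }

// liveMeta picks the newest valid metapage of the two.
func liveMeta(a, b *metapage) *metapage {
	newer, older := a, b
	if b.version > a.version {
		newer, older = b, a
	}
	if newer.valid() {
		return newer
	}
	if older.valid() {
		return older // newer was torn; last committed state wins
	}
	return nil // both corrupt: unrecoverable
}

func main() {
	m0 := &metapage{version: 41, root: 7}
	m0.sum = m0.checksum()
	m1 := &metapage{version: 42, root: 9}
	m1.sum = m1.checksum()
	fmt.Println("live root:", liveMeta(m0, m1).root) // newer wins: 9

	m1.sum++ // simulate a torn write of the newer metapage
	fmt.Println("live root:", liveMeta(m0, m1).root) // falls back: 7
}
```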
       | 
       | Sure, that's eight things you need to implement. But none of them
       | are _complicated_ or _unintuitive_ , the way that a "lock free
       | concurrent skiplist" is. You could whiteboard any one of those
       | abstractions above with a junior programmer, and they could
       | probably figure out a basically-correct implementation just from
       | the descriptions above.
       | 
       | And that means that the implementations of on-disk B-trees are
       | usually pretty short-and-sweet.
       | 
       | LMDB, for example, is one C file
       | (https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblm...)
       | with ~8k of actual code (according to
       | https://github.com/AlDanial/cloc).
       | 
       | Postgres's durable B-tree implementation
       | (https://github.com/postgres/postgres/tree/master/src/backend...)
       | is 22kloc, and ~10kloc of actual code.
        
         | AtlasBarfed wrote:
          | LSMs and SSTables involve compaction in any write pattern
          | that includes updates or deletes.
          | 
          | I don't know how RocksDB handles this, but it's a very
          | non-trivial problem in Cassandra.
          | 
          | Say your update touches a row of data that is stored across
          | multiple SSTables (the row, i.e. a set of column values, was
          | formed by multiple updates that spanned multiple flushes of
          | the memtable to disk).
          | 
          | So either your update compacts, that is, rewrites, multiple
          | SSTables of unknown and possibly very large size into a new
          | SSTable containing the new updates, or you must timestamp the
          | individual values and, in the case of deletes, employ
          | something like tombstones or delete markers.
          | 
          | Then your read engine needs to read multiple SSTables and
          | apply update ordering to find the most recent actual value.
          | 
          | This causes I/O stress and reduced performance in the read
          | path.
          | 
          | LSMs exist to scale writes to large indexes. Essentially you
          | are queuing up updates for a future process (compaction) to
          | clean up.
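The read-path merge can be sketched in a few lines (hypothetical Go, roughly Cassandra's model as described above; these are not the real types): among all versions of a column found across SSTables, the newest timestamp wins, and if the winner is a tombstone the column reads as absent.

```go
package main

import "fmt"

// cell is a timestamped column value; tombstone marks a delete.
type cell struct {
	value     string
	timestamp int64
	tombstone bool
}

// resolve merges the versions of one column found across several
// SSTables: newest timestamp wins; a winning tombstone means the
// column is absent.
func resolve(versions []cell) (string, bool) {
	var best *cell
	for i := range versions {
		if best == nil || versions[i].timestamp > best.timestamp {
			best = &versions[i]
		}
	}
	if best == nil || best.tombstone {
		return "", false
	}
	return best.value, true
}

func main() {
	// The same column appears in three SSTables from three flushes.
	v, ok := resolve([]cell{
		{value: "a", timestamp: 10},
		{value: "b", timestamp: 30},
		{value: "c", timestamp: 20},
	})
	fmt.Println(v, ok) // newest write wins: b true

	_, ok = resolve([]cell{
		{value: "a", timestamp: 10},
		{timestamp: 40, tombstone: true}, // later delete shadows it
	})
	fmt.Println(ok) // false
}
```

Note that every read pays for this merge until compaction rewrites the overlapping SSTables, which is the I/O stress mentioned above.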
        
         | hinkley wrote:
          | In Subversion, a write replaces a node, then replaces its
          | parent node all the way up to the root. So anyone doing a
          | read in the middle of a merge sees only the old data, because
          | it saw the old root node. I assume MVCC systems do something
          | similar, but for others I would think a lock per node allows
          | for a lot less contention. And your project then relies very
          | much on good indexes to avoid reading rows, and thus parent
          | nodes, you don't need.
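That root-replacement pattern can be sketched as a toy copy-on-write tree (hypothetical Go, illustrative names): an update clones the changed node and every ancestor up to a fresh root, then publishes the root with one atomic store, so in-flight readers keep a consistent snapshot.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tnode is a toy binary tree node for the copy-on-write sketch.
type tnode struct {
	val         int
	left, right *tnode
}

var root atomic.Pointer[tnode]

// withNewLeft returns a new root whose left child's value is v.
// Only the path to the change is copied; the untouched right
// subtree is shared with the old version, not cloned.
func withNewLeft(n *tnode, v int) *tnode {
	newLeft := *n.left
	newLeft.val = v
	newRoot := *n // shallow copy: shares n.right
	newRoot.left = &newLeft
	return &newRoot
}

func main() {
	old := &tnode{val: 1, left: &tnode{val: 2}, right: &tnode{val: 3}}
	root.Store(old)

	snapshot := root.Load()          // a reader grabs the old root
	root.Store(withNewLeft(old, 99)) // writer publishes a new root

	fmt.Println(snapshot.left.val)    // the reader still sees 2
	fmt.Println(root.Load().left.val) // new readers see 99
}
```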
        
       ___________________________________________________________________
       (page generated 2024-12-01 23:00 UTC)