[HN Gopher] How the append-only btree works (2010)
       ___________________________________________________________________
        
       How the append-only btree works (2010)
        
       Author : kevmo314
       Score  : 163 points
       Date   : 2023-12-29 14:38 UTC (8 hours ago)
        
 (HTM) web link (www.bzero.se)
 (TXT) w3m dump (www.bzero.se)
        
       | seunosewa wrote:
        | Append-only structures are not efficient enough for general use.
       | You're paying a steep price for creating a snapshot of the data
       | for every single operation.
       | 
       | I saw this when I was playing with Scala's immutable and mutable
       | data structures - written by the same team - ages ago. The
       | immutable structures were _much_ slower for common operations.
       | 
       | The fastest databases tend to use undo logs to re-construct
       | snapshots when they are needed.
        
         | vlovich123 wrote:
          | A mutable approach requires a write-ahead log, meaning you have
          | to copy the data twice on every write, which seems worse.
        
           | vajrabum wrote:
            | In a well-designed database system (hardware!) you aren't
            | going to use the same CPU, disk or controller to write the
            | log and the data structure. So the performance penalty, if
            | any, is going to be low.
        
             | loeg wrote:
             | Now you're paying for double the hardware.
        
             | vlovich123 wrote:
              | Low but non-zero, and you're still duplicating the values
              | written to disk. So your SSD lifetime is decreased, and
              | this is effectively a RAID0 layout, which means data loss
              | is a risk too.
        
           | iambvk wrote:
           | It is: two writes for write ahead log vs. log-n (tree-height)
           | writes for CoW
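            | 
            | As a rough back-of-the-envelope (Python, with a branching
            | factor and record count picked purely for illustration):
            | 
            |     import math
            |     
            |     keys = 100_000_000   # records in the tree (illustrative)
            |     fanout = 200         # children per interior page (illustrative)
            |     height = math.ceil(math.log(keys, fanout))
            |     
            |     wal_writes = 2       # log record + in-place page update
            |     cow_writes = height  # new copy of every page on the root-to-leaf path
            |     
            |     print(f"tree height ~{height}")                      # ~4
            |     print(f"WAL: ~{wal_writes} writes/update, CoW: ~{cow_writes}")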
        
         | londons_explore wrote:
         | I don't think this is generally true.
         | 
          | Especially on media where sequentially writing large blocks is
          | faster than random writes, you get much better performance by
          | using a log-structured data structure and putting anything
          | new/any changes at the end.
          | 
          | Due to the design of modern SSDs, 'write in place' is pretty
          | much an illusion.
        
         | scottlamb wrote:
         | You can build real useful systems on them. For example, Gmail
         | used to be implemented on top of GFS as two append-only files
         | per user: the zlog and the index. The zlog held (zlib-
         | compressed) operations such as "add this message" and "delete
         | this message". The index was an append-only btree as described
         | here. It was a true index in that it didn't include the actual
         | message contents. Those were simply pointers into the log.
         | Thus, it was much smaller than the zlog and its efficiency was
         | less of a concern. Also, periodically both files were
         | "compacted" by writing a fresh log (that built the current
         | state in the most efficient way) and matching index. This was
         | primarily for privacy (otherwise your deleted messages would
         | never actually go away) but also improved efficiency.
         | 
         | Gmail's now built on top of Spanner so uses a log-structured
         | merge tree. Still append-only files but a bit different. Files
         | aren't per user but per some arbitrary key range boundary; no
         | more btree; multiple layers with the top ones getting compacted
         | more frequently; etc.
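          | 
          | A toy sketch of that log-plus-index split (Python; a plain dict
          | standing in for the append-only btree, names hypothetical):
          | 
          |     import json, zlib
          |     
          |     log = []      # append-only "zlog" of compressed operation records
          |     index = {}    # toy index: message id -> position in the log
          |     
          |     def append_op(op):
          |         log.append(zlib.compress(json.dumps(op).encode()))
          |         return len(log) - 1
          |     
          |     def add_message(msg_id, body):
          |         pos = append_op({"op": "add", "id": msg_id, "body": body})
          |         index[msg_id] = pos   # the index holds a pointer, not the body
          |     
          |     def get_message(msg_id):
          |         return json.loads(zlib.decompress(log[index[msg_id]]))["body"]
          |     
          |     add_message("m1", "hello")
          |     print(get_message("m1"))  # "hello"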
        
           | pclmulqdq wrote:
           | It's worth noting that Google's internal scale-out filesystem
           | is append-only (for various distributed systems and
           | operational reasons), so you end up with append-only data
           | structures proliferating in that ecosystem. That does not
           | necessarily mean that the append-only data structures were
           | chosen for any technical merit other than ability to be
           | implemented on the append-only filesystem.
        
             | btilly wrote:
             | So if we ignore the technical merits based on which append-
             | only data structures were chosen in general, we don't
             | necessarily have additional technical merits for any
             | particular time append-only was chosen.
             | 
             | The reasons why append-only structures were chosen in
             | general apply surprisingly often to any particular system
             | that scales to large data, which would like to be robust to
             | a wide variety of real world problems. You won't see it in
             | looking at the data structure because the hard parts are
             | abstracted away from you. But you'll see it if you try to
             | reimplement the same system from scratch, then scale it up
             | to production.
        
               | alternative_a wrote:
               | The subtle point I read in GP is the historic correlation
               | between storage systems and data structures. Your point
               | is equally valid in that this correlation is not
               | indicative of non-general applicability.
               | 
               | Both points need to be kept in mind imho in design.
        
             | scottlamb wrote:
              | Good point. It's also worth noting that raw flash is
             | generally treated as roughly append-only for wear leveling.
             | "General-purpose" may be in the eye of the beholder, but
             | I'd say these append-only data structures are useful in at
             | least two significant environments.
        
           | lloydatkinson wrote:
           | What you described is called event sourcing
        
         | valenterry wrote:
          | That usually happens when you try to work with immutable data
          | structures the way you are used to working with mutable data
          | structures.
         | 
         | For example, if you append to a mutable list then it's going to
         | be fast. But prepending to it is much slower. With immutable
         | lists, it's the other way around. Not knowing that will make
          | you think that immutable data structures are generally slow, but
         | they are not.
         | 
          | That being said, I would say they are generally trickier, so
          | it's good to understand in which cases it's worth sacrificing
          | safety and readability and switching to mutable data structures
          | for performance reasons.
        
         | slashdev wrote:
         | LMDB is a strong counter-example to your argument. Based on its
         | benchmarks, you're wrong. [1]
         | 
          | Yes, immutable in-memory data structures are slower than their
          | mutable counterparts, but we're not talking about either of
         | those here.
         | 
         | Databases are not in-memory data structures.
         | 
         | [1] http://www.lmdb.tech/bench/microbench/july/
        
         | halayli wrote:
          | You're comparing apples and oranges. Anytime you write/modify
          | data on an SSD, you're writing to a new page. So you might as
         | leverage this behavior to your advantage. It has several
         | benefits when it comes to implementing transactions.
        
         | teo_zero wrote:
          | _If_ reads dominate writes, and multiple threads read the data
         | concurrently, this approach may win by eliminating the need for
         | locks.
        
         | hinkley wrote:
         | You probably need something special for batch operations.
         | Adding five records in one transaction shouldn't pay >= 5 times
         | as much as a single one.
         | 
         | You need a different API for something like that however.
        
           | joshlemer wrote:
           | Yes, for instance Clojure's "transients" as well as the
           | JavaScript "immer" library
        
         | cryptonector wrote:
         | The biggest cost to copy-on-write data structures is write
         | magnification. Adding a log to amortize that cost helps a lot.
         | That's what the ZFS intent log does. Any CoW b-tree scheme can
         | benefit from that approach.
         | 
         | The benefits of CoW data structures are tremendous:
          | - easy to reason about
          | - easy to multi-thread for reading
          | - O(1) snapshots (and clones)
         | 
         | The downsides of CoW data structures are mainly:
          | - the need to amortize write magnification
          | - difficulty in multi-threading writes
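          | 
          | A hedged sketch of the amortization idea (Python; a dict stands
          | in for the on-disk CoW tree, the batch threshold is arbitrary):
          | 
          |     intent_log = []   # sequential and cheap to append
          |     tree = {}         # stand-in for the CoW b-tree
          |     
          |     def write(key, value):
          |         intent_log.append((key, value))   # one append per transaction
          |         if len(intent_log) >= 64:         # arbitrary batch size
          |             flush()
          |     
          |     def flush():
          |         global tree
          |         # One tree update covers many logged transactions, so the
          |         # cost of rewriting root-to-leaf paths is amortized.
          |         new_tree = dict(tree)
          |         new_tree.update(intent_log)
          |         tree = new_tree
          |         intent_log.clear()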
        
         | joshlemer wrote:
         | Scala's immutable data structures, and especially the way they
         | are used idiomatically and most obviously, have a lot of
         | performance issues that are much bigger than the basic cost of
         | persistent data structures. Namely, most commonly I will see
         | developers default to pipelining strict operations like
         | myList.map(...).filter(...).flatMap(...).collect(...).take(2)
         | which forces/materializes the whole collection in memory for
         | every operation. Better would be to first transform the list
         | into a lazy view or iterator with myList.view or
         | myList.iterator, and then do the pipeline.
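          | 
          | The same pitfall exists outside Scala; a rough Python analogue
          | (lists materialize every stage, generators fuse them):
          | 
          |     data = range(1_000_000)
          |     
          |     # Strict: each stage builds a full intermediate list.
          |     strict = [x * 2 for x in data]
          |     strict = [x for x in strict if x % 3 == 0]
          |     first_two = strict[:2]
          |     
          |     # Lazy: only enough elements to satisfy the take are computed.
          |     lazy = (x for x in (x * 2 for x in data) if x % 3 == 0)
          |     first_two_lazy = [next(lazy) for _ in range(2)]
          |     
          |     print(first_two, first_two_lazy)  # [0, 6] [0, 6]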
         | 
         | They also lack the ability to perform multiple updates in a
         | batch, except for some very limited cases. Other
         | implementations like Clojure's support "transients" where you
         | get access to mutate the data structure over and over again (as
         | well as do reads), and then freeze the structure in place as a
         | new persistent collection. JavaScript libraries like immer
         | allow for the same thing. Scala's collections don't generally
         | support this except in the form of "builders" which don't
         | support reads and also don't support all the write access
         | patterns (such as updating a vector at a specific index, or
         | removing a key from a map/set, etc).
        
       | btilly wrote:
       | One of the advantages of this kind of architecture is that if one
       | reader is reading the old tree while it is replaced, it just
       | works. Databases are really good at having multiple versions of
       | the same data structure be accessible at the same time.
       | 
       | I hand rolled some similar data structures in higher order
       | languages when I couldn't find any Python packages that gave me
       | the same capability. But I couldn't figure out what a good name
       | would be for, say, a dictionary that could have multiple
        | concurrent versions. So I never went anywhere with that.
        
         | Hnrobert42 wrote:
          | As Phil Karlton said, "There are only two hard things in computer
         | science: cache validation and naming things." Seems like you
         | found a way to work on both!
         | 
         | https://skeptics.stackexchange.com/questions/19836/has-phil-...
        
           | timeimp wrote:
           | I always have to add the "and off-by-one" at the end.
           | 
           | Nothing like reading memory you don't own!
        
             | airstrike wrote:
             | There are actually only two hard problems in computer
             | science:
             | 
             | 0) Cache invalidation
             | 
             | 1) Naming things
             | 
             | 5) Asynchronous callbacks
             | 
             | 2) Off-by-one errors
             | 
             | 3) Scope creep
             | 
             | 6) Bounds checking
        
               | alternative_a wrote:
               | Let's be pedantic here and get all religious over the
               | words "Science" and "hard".
               | 
               | Computer "science" has a difficult conceptual problem
               | with caching. The optimal cache, this science tells us,
               | is indistinguishable from a fortune teller who is never
               | wrong (oracle). Fortune telling is a "hard" problem for a
               | science based on reasoning. The best we can do is hedge
               | bets (which is what the science of caching focuses on).
               | 
               | This same science also has a difficulty with naming
               | things. Now numbering things is easy and science loves
               | maths and maths love sciences, but science and letters
               | have a more difficult history. Science would approve of
               | "factoryfactoryfactoryImpl" btw ... it's "a rational
                | scheme of naming".
               | 
               | Here we see a "science" that is facing actual
               | difficulties.
               | 
               | The rest of your list are difficult but not "hard". The
               | science of these matters is clear and the rest is up to
               | the "scientists" struggling with "scope creep" and
               | "bounds checking" ..
        
               | Scarblac wrote:
               | There is only one hard problem in software engineering:
               | people.
        
           | pmarreck wrote:
           | _in_ validation?
        
         | j-pb wrote:
         | These things are usually called "persistent" versions of the
         | thing, e.g. persistent Dictionary. Sometimes "immutably
         | persistent" to distinguish it from the "durable" meaning of
         | persistence i.e. written to disk.
        
           | btilly wrote:
           | It seems odd to me to call a data structure "persistent" when
           | its purpose is to allow you to cheaply keep many similar
           | transitory copies around until you throw them away.
           | 
           | The specific application was using dynamic programming to
           | build up a complex data structure. You wind up with many
           | copies of very similar data structures, and doing a deep copy
           | each time is prohibitively expensive.
        
             | ncruces wrote:
             | Still that's what they're called:
             | https://en.wikipedia.org/wiki/Persistent_data_structure
             | 
             | Precisely because previous versions remain valid (persist)
             | under modification.
        
               | btilly wrote:
               | There is a key difference between that, and what I
               | created.
               | 
               | The difference is that persistent data structures allow
               | you to traverse the history of the data structure to find
               | past versions. By contrast I only allowed you to see the
               | current version, returning a new version of the root any
               | time you made a modification. As a result, the memory
               | associated with the old version could be freed once
               | nothing would want to access it again. And any version
               | that you had could still be manipulated.
               | 
               | For the example that I was dealing with, both made sense.
               | On top of that it made it possible to create a hashable
               | version of the data structure that had a good chance (not
               | a guarantee) of matching hashes when you arrived at the
               | same data structure through different histories.
        
               | ncruces wrote:
               | Well, the difference isn't that great. You wrote: "the
               | memory associated with the old version could be freed
               | once nothing would want to access it again." So all it
               | really takes is something wanting.
        
       | zogomoox wrote:
       | Instead of cloning the branches all the way up to root, wouldn't
       | it be more efficient to just write delta-records, so a structure
       | per inner node that says "it's like this original node, but with
        | the following modifications x,y,z"?
        
         | PartiallyTyped wrote:
         | You may want to rebalance a tree, or you may want to have
         | multiple readers while a writer exists without relying on
         | mutexes or locking.
        
           | zogomoox wrote:
            | As long as you don't modify the original nodes, that would
            | still work, no locks needed.
        
         | btilly wrote:
         | Efficient in what way?
         | 
         | Yes, a delta record would be smaller. But now you have to fetch
         | both the new and the old, plus do computation to figure out
         | what you have. This is trading off space and time.
         | 
         | I'm going to go to a reasonably low level to explain this.
         | 
         | Databases have generally found that the right tradeoff between
         | space and time is to always work in terms of pages of fixed
         | size. Now all reads are memory aligned, of known size. All
         | writes are as well. And inside the CPU, processing a page in
         | the most straightforward and predictable way possible is very
         | fast. In particular you want to avoid complex logic that
         | introduces too many pipeline stalls.
         | 
         | If you're going to work with pages anyways, you want to find
         | ways to fetch as few pages as possible. Only have this page
         | point to that page where you really need to. And put as much as
         | reasonable on each page. The name of the game is to fetch as
         | few pages as you need, and get everything you need from a page
         | when you fetch it. Because when you're getting a new page,
         | often you have to wait for a disk read. That's slow. It used to
         | be really slow, you needed to wait for the right part of the
         | disk to rotate around. Those disks went at something like 7200
         | rpm, but that means 120 revolutions per second, which means
         | you're waiting for anywhere from 0 to 8.333... milliseconds for
         | the read. Now consider reading through a million record
         | database...
         | 
         | That's where a BTree comes in. It is a tree structure built
         | around a page layout. When a page gets too full, it is split
         | and one record is promoted to the level above. When the top
         | page gets too full, a new level is created at top. That keeps
         | it perfectly balanced at all times. So you can get to any
         | record in very few reads. And if you're walking the whole data
         | structure, a single page generally has many records.
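          | 
          | For reference, the path-copying the parent comment asks about
          | looks like this in miniature (Python, a binary tree instead of
          | a paged b-tree, purely illustrative):
          | 
          |     from dataclasses import dataclass
          |     from typing import Optional
          |     
          |     @dataclass(frozen=True)
          |     class Node:
          |         key: int
          |         value: str
          |         left: "Optional[Node]" = None
          |         right: "Optional[Node]" = None
          |     
          |     def insert(node, key, value):
          |         # Only the nodes on the search path are recreated; everything
          |         # off the path is shared with the previous version.
          |         if node is None:
          |             return Node(key, value)
          |         if key < node.key:
          |             return Node(node.key, node.value,
          |                         insert(node.left, key, value), node.right)
          |         if key > node.key:
          |             return Node(node.key, node.value,
          |                         node.left, insert(node.right, key, value))
          |         return Node(key, value, node.left, node.right)
          |     
          |     v1 = insert(insert(None, 5, "a"), 2, "b")
          |     v2 = insert(v1, 7, "c")   # v1 is untouched: a free snapshot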
        
       | hoytech wrote:
       | LMDB [0] was derived from this project, as mentioned in its
       | copyright notice section [1].
       | 
       | [0] http://www.lmdb.tech/doc/ [1]
       | https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1...
        
         | jnwatson wrote:
         | Importantly, the LMDB API allows append mode, and it is very
         | fast.
        
       | bob1029 wrote:
       | Some years ago I was playing around with the idea of an append-
       | only splay tree for storing a database file. The thinking was
        | that the algorithm would keep recently-_accessed_ items at the
        | business end of the log (in addition to any new/updated/deleted
       | items).
       | 
       | This concept is clearly quite heavy on writes, but the tradeoff
       | is that you would then have the ability to expire unused entries
       | simply by picking an offset in the log and chopping everything
       | that precedes it. Any record accessed more recently than that
       | cutoff point would be guaranteed to have been written one or more
       | times later on in the log.
       | 
        | _Any_ access would result in a new modified tree & root node
       | being written, but the prototype did batch IO using an MPSC queue
       | abstraction which meant that I could amortize things a bit.
       | Multiple transactions could fit in a single IO command if they
       | are issued from different threads, occur within a small slice of
       | time and are smaller than the block size.
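        | 
        | The batching idea, sketched fresh in Python (not the original
        | prototype; the block size and time window are arbitrary):
        | 
        |     import queue, time
        |     
        |     def drain_batch(q, max_bytes=4096, window=0.001):
        |         # Collect writes arriving within a small time slice,
        |         # up to one block, and return them as a single IO.
        |         batch, size = [], 0
        |         deadline = time.monotonic() + window
        |         while size < max_bytes:
        |             try:
        |                 item = q.get(timeout=max(0.0, deadline - time.monotonic()))
        |             except queue.Empty:
        |                 break
        |             batch.append(item)
        |             size += len(item)
        |         return batch
        |     
        |     q = queue.Queue()
        |     q.put(b"txn-1"); q.put(b"txn-2")
        |     print(drain_batch(q))   # both transactions in one write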
        
       | paulsutter wrote:
       | Every historical version of the database is available as a
       | snapshot. The metadata block should point back to previous meta
       | block(s), not just the new root
       | 
       | And of course emphasizing the closing statement:
       | 
       | > there is no need for a transaction log, because the database
       | file is the transaction log
        
       | ww520 wrote:
       | The immutable b+tree is a great idea, except it generates a
       | tremendous amount of garbage pages. Every update of a record
       | generates several new pages along the path of the tree.
       | 
       | There are two additional techniques to make immutable b+tree
       | practical. One is to reclaim stale unreachable pages after the
       | transactions using them have closed. Two is to use a pair of
       | oscillating fixed location meta pages.
       | 
       | A page is active if it's in the latest snapshot or it is in use
        | by an open transaction; otherwise, it's unreachable and can be
       | reclaimed. When a transaction closes, it hands its list of in-use
       | pages to the reclaimer. The reclaimer tracks these pages. When no
       | other open transactions hold on to a page, it can be reclaimed.
       | 
       | When the write transaction commits, it hands the list of old
       | pages being overwritten to the reclaimer as candidates for
       | freeing, pending no open transactions using them.
       | 
       | The reclaimer can batch a number of reclaimed pages together to
       | update the free page list, by appending to the end of the file a
       | page of the newly freed page pointers. Update the meta page to
       | point to the new head page of the free page list at the end of
       | the file. This can be done as a write transaction to keep things
       | consistent.
       | 
        | On a crash, all the pages pending reclamation in memory are lost
        | and the garbage pages on disk linger. This requires periodic
       | garbage collection or compaction.
       | 
       | The most frequently updated page in the tree is the meta page.
       | Every write transaction updates it. This creates a lot of garbage
       | pages. The second technique addresses this problem.
       | 
       | One insight is that the meta page is only needed when a new
       | transaction begins by finding the tree root of the latest
       | committed snapshot. That means we only need two copies of the
       | meta pages, one for the last committed snapshot and one for the
       | new pending write transaction. When the write transaction
       | commits, the pending meta page becomes the latest committed
       | snapshot. The other page becomes available for the next pending
       | transaction.
       | 
       | We can have two fixed location meta pages, oscillating between
       | the latest committed snapshot and the pending new transaction.
       | The fixed location removes the need to search for the meta page
       | from the end of file.
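        | 
        | A sketch of the reclaimer side in Python (hypothetical names;
        | real systems track this on disk, not in in-memory lists):
        | 
        |     class Reclaimer:
        |         def __init__(self):
        |             self.open_txns = set()  # ids of currently open transactions
        |             self.pending = []       # (commit id, old pages replaced then)
        |             self.free_list = []
        |     
        |         def begin(self, txn_id):
        |             self.open_txns.add(txn_id)
        |     
        |         def commit_write(self, commit_id, replaced_pages):
        |             # Old copies become garbage once every reader that could
        |             # still see them (opened before this commit) has closed.
        |             self.pending.append((commit_id, replaced_pages))
        |     
        |         def end(self, txn_id):
        |             self.open_txns.discard(txn_id)
        |             oldest = min(self.open_txns, default=float("inf"))
        |             still_pending = []
        |             for commit_id, pages in self.pending:
        |                 if commit_id < oldest:
        |                     self.free_list.extend(pages)  # safe to reuse
        |                 else:
        |                     still_pending.append((commit_id, pages))
        |             self.pending = still_pending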
        
         | dragontamer wrote:
         | > except it generates a tremendous amount of garbage pages
         | 
         | Note that this is an advantage on say, controller-less NAND
         | Flash (JFFS-like embedded NAND Flash).
         | 
         | More modern Flash drives have embedded FTL (flash translation
         | layers) that internalizes that garbage collection process. But
         | *someone* had to write the FTL to begin with, and methinks this
         | append-only btree would work very well for that.
         | 
         | -------
         | 
         | In NAND Flash, only 10,000ish erase/write cycles are allowed
         | before any block becomes unusable. (Depending on tech of
         | course: could be 1000 on QLC could be 100k on SLC). All that
         | garbage helps cycle the write/erase cycles across the drive
         | more evenly / more consistently. Especially if you combine that
         | garbage with TRIM-like commands.
         | 
         | That might be a little bit too low level for a lot of folks
         | though.
        
           | epcoa wrote:
           | There are more appropriate and better data structures for an
           | FTL especially with the assumption of some amount of
           | auxiliary RAM (either host or on controller). Much of this
           | literature is open access/free.
           | 
           | "All that garbage helps cycle the write/erase cycles across
           | the drive more evenly"
           | 
           | Well no, you still want to minimize garbage production (and
           | related GC overhead). Wear leveling doesn't mean produce more
           | garbage.
        
             | dragontamer wrote:
             | > Well no, you still want to minimize garbage production
             | (and related GC overhead). Wear leveling doesn't mean
             | produce more garbage.
             | 
             | Surely some % of garbage helps as you're garbage collecting
             | and reliably trying to shuffle data around to wear-level
             | more evenly?
             | 
              | Let's say you have 99% used data and 1% garbage. You have
              | very little room for (static) wear leveling. Ex: If you're
              | writing 10MB of new data, and your drive is 99% full of
              | allocated data... you'd have to move 1000MB of data around
              | to "find" the 10MB of garbage, on average.
             | 
              | In the other extreme case, 0% data used and 100% garbage
              | (say a TRIM operation just completed), you simply write
              | 10MB without having to worry about moving any data around.
             | 
             | The 50% data + 50% garbage scales as appropriate. 10MB of
             | written new data will need you to find and coalesce 10MB of
             | garbage to write the new data. This will happen after
             | moving 20MB of overall data around.
             | 
             | ----------
             | 
             | I'm oversimplifying of course. But even in a real life
             | system, I'd expect that the more "garbage" you have sitting
             | around, the better the various algorithms work for FTL /
             | SSD (static or dynamic) wear leveling.
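              | 
              | The arithmetic above as a quick Python check (same
              | illustrative numbers, nothing measured):
              | 
              |     def data_moved(new_mb, garbage_fraction):
              |         # Space for new_mb of fresh writes means turning over
              |         # roughly new_mb / garbage_fraction of existing blocks
              |         # (the new write itself included).
              |         return new_mb / garbage_fraction
              |     
              |     for g in (0.01, 0.50, 1.00):
              |         print(f"{g:.0%} garbage -> ~{data_moved(10, g):,.0f} MB"
              |               f" turned over to write 10 MB")
              |     # 1% -> ~1,000 MB, 50% -> ~20 MB, 100% -> ~10 MB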
        
               | epcoa wrote:
               | I can't state any more clearly: minimize the production
               | of garbage.
               | 
               | "Surely some % of garbage helps as you're garbage
               | collecting "
               | 
               | Garbage that doesn't exist doesn't need collecting.
               | 
               | The flaw here is confusing free space with garbage. You
               | shouldn't have written in the first place if you could
               | have avoided it.
               | 
               | Every environmentalist knows this: RRR, the first R is
               | reduce not recycle.
        
               | dragontamer wrote:
               | Any append-only data-structure will have data that was
               | true (when it was written), but has become
               | false/obsolete/garbage at a later time. This is
               | unavoidable.
               | 
               | I'm not saying that we write needless garbage to the logs
               | or filesystem or whatever. I'm saying that the amount of
               | garbage in your stream that you leave behind aids in
                | later steps of (static) wear-leveling. So therefore, it's
               | not a big deal. You're going to be making plenty of this
               | (initially true data, but later garbage data) as files
               | get updated, moved around filesystems, or whatnot.
               | 
               | "Garbage" in this sense is perhaps the wrong word. Its
               | more like "Obsolete data", or "outdated data".
        
               | ncruces wrote:
               | If you read the article though, many of the updated nodes
               | (which are now garbage) don't see any updates to their
               | "data" but to internal tree pointers.
               | 
               | So lots of "data" is being copied, and garbage is being
               | generated, only for the benefit of tidying up tree
               | structure, not because the actual "data" in those pages
               | changed.
               | 
               | Not generating such garbage in the first place is an
               | obvious benefit.
        
               | dragontamer wrote:
               | I think I see what you mean now. Thanks.
        
         | codetrotter wrote:
         | This description reminds me of a document I read about how ZFS
         | is implemented. In particular, how snapshots work in ZFS, and
         | what happens when you delete a snapshot.
        
           | cryptonector wrote:
           | It's a very similar idea.
           | 
           | Traditional Unix-ish filesystems, with inodes and indirect
            | nodes, are a lot like b-trees, but ones where the indices are
           | block numbers and where you can only append indices, trim, or
           | replace indices.
           | 
           | The ZIL (ZFS intent log) is a mechanism for amortizing the
           | write magnification of copy on write trees.
        
         | thechao wrote:
         | I am _immediately_ nerd-sniped by this. Is there any code out
          | there you know of that I can see? The dual meta-data pages sound
         | _precisely_ like double-buffering (from graphics; my domain of
          | expertise). I am also drawn to append-only-_like_ and
         | persistent data-structures.
        
           | hoytech wrote:
           | LMDB works like this. The MDB_env struct keeps pointers to
           | both meta pages:
           | 
           | https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1.
           | ..
           | 
            | Which meta-page is used is determined by the transaction ID
           | (even IDs use first, odd IDs second).
           | 
           | This is the earliest description I can find of the "shadow
           | paging" concept: https://dl.acm.org/doi/10.1145/320521.320540
           | 
           | And I believe its first implementation was in IBM's System R.
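            | 
            | The parity trick in miniature (Python; a rough sketch, not
            | LMDB's actual code):
            | 
            |     metas = [{"txn_id": 0, "root": None},
            |              {"txn_id": 0, "root": None}]   # two fixed slots
            |     
            |     def current_meta():
            |         # Readers start from whichever slot holds the newer commit.
            |         return max(metas, key=lambda m: m["txn_id"])
            |     
            |     def commit(new_root):
            |         txn_id = current_meta()["txn_id"] + 1
            |         # Even ids land in slot 0, odd ids in slot 1, so each commit
            |         # overwrites the older of the two meta pages.
            |         metas[txn_id % 2] = {"txn_id": txn_id, "root": new_root}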
        
         | senderista wrote:
         | Of course, this implies a single-writer design (not necessarily
         | a bad thing!).
        
           | ww520 wrote:
            | Actually an immutable btree has a single writer in general,
            | since there's no transaction log and the meta page points to
            | one latest tree root. Even if two write transactions update
            | different parts of the tree, they both need to update the
            | same meta page, causing contention. This is one downside
            | (or upside depending on your need, since a single writer
            | simplifies things a lot and is great for sequential writes).
        
         | cryptonector wrote:
         | Yes, CoW trees have tremendous write magnification. This is
         | well-known. The fix is to amortize this cost by writing
         | transactions as a log and then doing a b-tree update operation
         | for numerous accumulated transactions.
         | 
          | Think of ZFS and its ZIL (ZFS Intent Log). That's exactly what
         | the ZIL is: a CoW tree write magnification amortization
         | mechanism.
        
           | wredue wrote:
           | FFS. The fix for absolutely bonkers CoW costs is to stop
           | buying in to this idiotic notion that "immutability is easier
           | to reason about".
           | 
           | If you're having difficulty reasoning about how to deal with
           | major performance issues _then your position is not easier to
           | reason about. Full stop_.
           | 
           | Stop the madness!
        
             | cryptonector wrote:
             | It is easy to reason about the performance issues of CoW,
             | and it's easy enough to reason about how to work around
             | those issues. Ease of reasoning is a big deal when you're
             | dealing with hundreds of millions of lines of code, as some
             | of us do. Cognitive load is a big deal in today's world.
             | It's why we're migrating from C to memory-safe languages.
        
           | ww520 wrote:
           | I'm not familiar with ZFS internals beyond the deduplication
           | part. Is that just the traditional transaction log + btree
            | update approach most databases use? Does ZFS support
           | transactional reading and writing?
        
             | cryptonector wrote:
             | > Does ZFS support transactional reading and writing?
             | 
             | I don't know that I understand your question. ZFS supports
             | POSIX transactional semantics, which is _not_ ACID, though
             | ZFS 's design _could_ support ACID.
             | 
             | > Is that just the traditional transaction log + btree
             | update approach most databases used?
             | 
             | Basically, yes. The idea is to write a sequential log of a)
             | data blocks, b) metadata operations (e.g., renames) to
             | support fast crash recovery. During normal operation ZFS
             | keeps in-memory data structures up to date but delays
             | writing new tree transactions so as to amortize write
             | magnification. On unclean shutdown ZFS checks that all
             | transactions recorded in the ZIL have been written to the
             | tree or else it will do it by a) rewinding the ZFS state to
             | the newest fully-written transaction, b) loading the
             | contents of the ZIL from that point forward.
             | 
             | Because the ZIL was designed at a time when fast flash
             | devices were expensive, it's really a separate thing from
             | the rest of the tree. If one were to start over one might
             | integrate the ZIL with the rest of the system more tightly
             | so as to further minimize write magnification (e.g., data
             | blocks written to the ZIL need not be re-written to the
             | tree).
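              | 
              | Recovery then boils down to something like this (Python
              | sketch with a made-up record format, not ZFS code):
              | 
              |     def recover(state, tree_txn_id, intent_log):
              |         # tree_txn_id: newest transaction fully written into the tree.
              |         # intent_log:  sequential (txn_id, op) records from commit time.
              |         for txn_id, op in intent_log:
              |             if txn_id > tree_txn_id:  # replay only what the tree lacks
              |                 kind, key, value = op
              |                 if kind == "set":
              |                     state[key] = value
              |         return state
              |     
              |     print(recover({"a": 1}, 3, [(3, ("set", "a", 1)),
              |                                 (4, ("set", "b", 2))]))
              |     # {'a': 1, 'b': 2}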
        
         | ruuda wrote:
         | The Hitchhiker Tree [1] addresses the write amplification at
         | the cost of making lookups and scans slightly more expensive,
         | by keeping a small array of "pending writes" in every non-leaf
         | node. The entire thing is still immutable, but instead of
         | writing a new leaf node + path from root to that leaf, in the
         | majority of cases we only write a new root. When there is no
         | space left in the array, all the pending values get pushed one
         | level down at the same time, so the write amplification is
         | amortized.
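          | 
          | A two-level toy version of that buffering (Python; the limit of
          | 4 is arbitrary, real trees size it to fit inside the node):
          | 
          |     PENDING_LIMIT = 4
          |     
          |     class Root:
          |         def __init__(self, children, pending=None):
          |             self.children = children      # lower level (a dict here)
          |             self.pending = pending or {}  # buffer of unpushed writes
          |     
          |     def insert(root, key, value):
          |         pending = {**root.pending, key: value}
          |         if len(pending) < PENDING_LIMIT:
          |             return Root(root.children, pending)  # only a new root written
          |         # Buffer full: push the whole batch one level down at once,
          |         # amortizing the rewrite of the lower levels.
          |         return Root({**root.children, **pending}, {})
          |     
          |     def lookup(root, key):
          |         return root.pending.get(key, root.children.get(key))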
         | 
         | [1]: https://www.youtube.com/watch?v=jdn617M3-P4
        
           | ww520 wrote:
           | I believe the Bw-tree does the same thing, caching new
           | updates in the intermediate branch nodes and only pushing the
           | updates in batch down to the lower layers when running out of
           | room at the branch node. These are the newer wave of
           | algorithms to address the write amplification.
        
         | mamcx wrote:
         | Naively thinking here, how about a 2 immutable b+tree setup:
         | 
         | The "hot" tree is the WAL: All the data is there and copies are
         | generated.
         | 
         | The "cold" tree is behind: In the background move from WAL and
         | at the same time compact it.
        
           | ww520 wrote:
           | That's how the traditional transaction log + btree approach
            | works. The append-only btree removes the need for a separate
           | transaction log.
        
       | packetlost wrote:
       | One enhancement I've toyed with in the past is to have subtrees
       | beneath each leaf node for the historical data for each record.
       | You switch the indexing value to the timestamp of record
       | insertion time or a TX number and keep inserting in append-only
       | fashion. There's some enhancements you can make because the
       | common case is for the key to be a monotonically increasing
       | number, but nothing is strictly necessary. I'd love to build a
       | datalog system on top of something like that at some point, even
       | if it's just for fun.
        
       | hlship wrote:
       | In such a system, how does a reader find the root node? I'd be
       | concerned about a) readers seeing a partial write to disk (or
       | page buffer) during an update or b) a crash while the writer
       | writes to disk.
       | 
       | I could imagine using two files, one containing the actual
        | b-tree, the second containing the offset to the latest root node;
       | the second file gets overwritten only after a successful write is
       | verifiably written to disk.
       | 
       | Datomic's (https://www.datomic.com/) architecture is similar to
       | this, but uses many write-once "segments" (which could be files
       | in EFS or S3, or rows in DynamoDB or other stores).
        
         | twoodfin wrote:
         | There's a "canonical" pointer to the root node which is updated
         | atomically on every append.
         | 
         | For an in-memory database, a CAS is adequate. Persistent stores
         | ultimately need some kind of file-level locking.
         | 
         | If you look at the Apache Iceberg spec, you get a good idea of
         | how this works: The only "mutability" in that universe is the
         | root table pointer in the catalog.
        
         | ww520 wrote:
         | The pointer to the root tree node is stored in the last
         | committed metadata page. A read transaction starts with the
          | reading of the metadata page. Reaching the partially written
         | pages is impossible as the writer has not committed the latest
         | metadata page yet. The transaction is committed when the latest
         | metadata page is written as the last step.
        
       | JensRantil wrote:
       | IIRC, this is how CouchDB was implemented. The benefit is that it
        | was resilient to crashes without needing a write-ahead log. The
       | downside was that it required running background compaction of
       | the B+tree regularly.
        
         | josephg wrote:
         | LMDB works the same way. But it also stores a second data
         | structure on disk listing all free blocks. Writes
         | preferentially take free blocks when they're available. The
         | result is it doesn't need any special compaction step. It sort
         | of automatically compacts constantly.
        
       | toolslive wrote:
        | With append-only data structures, you could turn the fact that
       | you rewrite the path from the node to the root to your advantage
       | and rebalance. This means you abandon the "balanced" property and
       | use update statistics to get an optimal shape. Who cares about
       | the length of paths to parts of the tree you never visit?
        
       | o11c wrote:
       | One thing that's not explicitly mentioned: append-only trees
       | necessarily lack `parent` pointers.
       | 
       | This means that your "iterator" can't be a lightweight type, it
       | has to contain an array of parent nodes to visit (again) later.
       | You can use a fixed-size array if you can reason about the
       | maximum height of the tree (for a balanced binary tree, this
       | means around `2*POINTER_BITS`, which is 1024 bytes on a 64-bit
       | platform; it will be less for a B-tree (`2*log(POINTER_MAX,
       | CHILDREN_PER_NODE)`) but you now have to either track or re-scan
       | for the index of the child you just returned from). Beware buffer
       | overflow logic bugs if your tree isn't as balanced as you
       | thought!
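        | 
        | For instance, an in-order iterator over a parent-pointer-free
        | tree carries its own stack (Python sketch with a toy node type):
        | 
        |     from dataclasses import dataclass
        |     from typing import Optional
        |     
        |     @dataclass(frozen=True)
        |     class Node:
        |         key: int
        |         left: "Optional[Node]" = None
        |         right: "Optional[Node]" = None
        |     
        |     def in_order(root):
        |         # The explicit stack of ancestors to revisit replaces parent
        |         # pointers; its depth is bounded by the tree height.
        |         stack, node = [], root
        |         while stack or node is not None:
        |             while node is not None:
        |                 stack.append(node)
        |                 node = node.left
        |             node = stack.pop()
        |             yield node.key
        |             node = node.right
        |     
        |     print(list(in_order(Node(2, Node(1), Node(3)))))  # [1, 2, 3]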
        
       ___________________________________________________________________
       (page generated 2023-12-29 23:00 UTC)