[HN Gopher] How the append-only btree works (2010)
___________________________________________________________________
How the append-only btree works (2010)
Author : kevmo314
Score : 163 points
Date : 2023-12-29 14:38 UTC (8 hours ago)
(HTM) web link (www.bzero.se)
(TXT) w3m dump (www.bzero.se)
| seunosewa wrote:
| Append-only structures are not efficient enough for general use.
| You're paying a steep price for creating a snapshot of the data
| for every single operation.
|
| I saw this when I was playing with Scala's immutable and mutable
| data structures - written by the same team - ages ago. The
| immutable structures were _much_ slower for common operations.
|
| The fastest databases tend to use undo logs to re-construct
| snapshots when they are needed.
| vlovich123 wrote:
| A mutable approach requires a write-ahead log, meaning you have
| to copy the data twice on every write, which seems worse.
| vajrabum wrote:
| In a well designed database system (hardware!) you aren't
| going to use the same CPU, disk or controller to write the
| log and the data structure. So the performance penalty if any
| is going to be low.
| loeg wrote:
| Now you're paying for double the hardware.
| vlovich123 wrote:
| Low but non-zero, and you're still duplicating the values
| written to disk. So your SSD lifetime is decreased and this
| is effectively a RAID0 layout which means data loss is
| risky too.
| iambvk wrote:
| It is: two writes for write ahead log vs. log-n (tree-height)
| writes for CoW
| londons_explore wrote:
| I don't think this is generally true.
|
| Especially on mediums where sequentially writing large blocks
| is faster than random writes, you get much better performance
| to use a log-structured datastructure and put anything new/any
| changes at the end.
|
| Due to the design of modern SSDs, 'write in place' is pretty
| much an illusion.
| scottlamb wrote:
| You can build real useful systems on them. For example, Gmail
| used to be implemented on top of GFS as two append-only files
| per user: the zlog and the index. The zlog held (zlib-
| compressed) operations such as "add this message" and "delete
| this message". The index was an append-only btree as described
| here. It was a true index in that it didn't include the actual
| message contents. Those were simply pointers into the log.
| Thus, it was much smaller than the zlog and its efficiency was
| less of a concern. Also, periodically both files were
| "compacted" by writing a fresh log (that built the current
| state in the most efficient way) and matching index. This was
| primarily for privacy (otherwise your deleted messages would
| never actually go away) but also improved efficiency.
|
| Gmail's now built on top of Spanner so uses a log-structured
| merge tree. Still append-only files but a bit different. Files
| aren't per user but per some arbitrary key range boundary; no
| more btree; multiple layers with the top ones getting compacted
| more frequently; etc.
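|
| A rough sketch of the zlog-plus-index idea in Python (hypothetical
| names, nothing like the real Gmail code); a plain dict stands in
| for the append-only btree index, and both "files" live in memory:
|
|     import json, zlib
|
|     class ZlogStore:
|         """Toy append-only log plus an index of pointers into it."""
|         def __init__(self):
|             self.zlog = bytearray()   # stands in for the append-only log
|             self.index = {}           # message id -> (offset, length)
|
|         def add_message(self, msg_id, body):
|             op = {"op": "add", "id": msg_id, "body": body}
|             record = zlib.compress(json.dumps(op).encode())
|             offset = len(self.zlog)
|             self.zlog += record                         # append-only write
|             self.index[msg_id] = (offset, len(record))  # pointer only
|
|         def get_message(self, msg_id):
|             offset, length = self.index[msg_id]
|             raw = zlib.decompress(bytes(self.zlog[offset:offset + length]))
|             return json.loads(raw)["body"]
|
|     store = ZlogStore()
|     store.add_message("m1", "hello")
|     assert store.get_message("m1") == "hello"
|
| Compaction then amounts to rewriting the surviving records into a
| fresh log and rebuilding the index, as described above.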
| pclmulqdq wrote:
| It's worth noting that Google's internal scale-out filesystem
| is append-only (for various distributed systems and
| operational reasons), so you end up with append-only data
| structures proliferating in that ecosystem. That does not
| necessarily mean that the append-only data structures were
| chosen for any technical merit other than ability to be
| implemented on the append-only filesystem.
| btilly wrote:
| So if we ignore the technical merits based on which append-
| only data structures were chosen in general, we don't
| necessarily have additional technical merits for any
| particular time append-only was chosen.
|
| The reasons why append-only structures were chosen in
| general apply surprisingly often to any particular system
| that scales to large data, which would like to be robust to
| a wide variety of real world problems. You won't see it in
| looking at the data structure because the hard parts are
| abstracted away from you. But you'll see it if you try to
| reimplement the same system from scratch, then scale it up
| to production.
| alternative_a wrote:
| The subtle point I read in GP is the historic correlation
| between storage systems and data structures. Your point
| is equally valid in that this correlation is not
| indicative of non-general applicability.
|
| Both points need to be kept in mind imho in design.
| scottlamb wrote:
| Good point. It's also worth noting that raw flash also is
| generally treated as roughly append-only for wear leveling.
| "General-purpose" may be in the eye of the beholder, but
| I'd say these append-only data structures are useful in at
| least two significant environments.
| lloydatkinson wrote:
| What you described is called event sourcing
| valenterry wrote:
| That usually happens when you try to work with immutable data
| structures the way you are used to working with mutable ones.
|
| For example, if you append to a mutable list then it's going to
| be fast. But prepending to it is much slower. With immutable
| lists, it's the other way around. Not knowing that will make
| you think that immutable datastructures are generally slow, but
| they are not.
|
| That being said, I would say they are generally trickier, so
| it's good to understand in which cases it's worth sacrificing
| safety and readability and switching to mutable data structures
| for performance reasons.
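|
| For instance, with a cons-style immutable list (a toy Python
| sketch, 2-tuples standing in for list cells), prepending reuses
| the old list unchanged while appending has to rebuild every cell:
|
|     def prepend(x, lst):
|         # O(1): a new head cell whose tail is the existing list.
|         return (x, lst)
|
|     def append(lst, x):
|         # O(n): every cell on the way to the end must be copied.
|         if lst is None:
|             return (x, None)
|         head, tail = lst
|         return (head, append(tail, x))
|
|     nums = None
|     for i in range(3):
|         nums = prepend(i, nums)   # cheap
|     nums = append(nums, 99)       # pays the full traversal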
| slashdev wrote:
| LMDB is a strong counter-example to your argument. Based on its
| benchmarks, you're wrong. [1]
|
| Yes immutable in-memory data structures are slower than their
| mutable counterparts, but we're not talking about either of
| those here.
|
| Databases are not in-memory data structures.
|
| [1] http://www.lmdb.tech/bench/microbench/july/
| halayli wrote:
| You're comparing apples and oranges. Anytime you write/modify
| on an SSD you're writing on a new page. So you might as well
| leverage this behavior to your advantage. It has several
| benefits when it comes to implementing transactions.
| teo_zero wrote:
| IF reads dominate writes, and multiple threads read the data
| concurrently, this approach may win by eliminating the need for
| locks.
| hinkley wrote:
| You probably need something special for batch operations.
| Adding five records in one transaction shouldn't pay >= 5 times
| as much as a single one.
|
| You need a different API for something like that however.
| joshlemer wrote:
| Yes, for instance Clojure's "transients" as well as the
| JavaScript "immer" library
| cryptonector wrote:
| The biggest cost to copy-on-write data structures is write
| magnification. Adding a log to amortize that cost helps a lot.
| That's what the ZFS intent log does. Any CoW b-tree scheme can
| benefit from that approach.
|
| The benefits of CoW data structures are tremendous:
|     - easy to reason about
|     - easy to multi-thread for reading
|     - O(1) snapshots (and clones)
|
| The downsides of CoW data structures are mainly:
|     - the need to amortize write magnification
|     - difficulty in multi-threading writes
| joshlemer wrote:
| Scala's immutable data structures, and especially the way they
| are used idiomatically and most obviously, have a lot of
| performance issues that are much bigger than the basic cost of
| persistent data structures. Namely, most commonly I will see
| developers default to pipelining strict operations like
| myList.map(...).filter(...).flatMap(...).collect(...).take(2)
| which forces/materializes the whole collection in memory for
| every operation. Better would be to first transform the list
| into a lazy view or iterator with myList.view or
| myList.iterator, and then do the pipeline.
|
| They also lack the ability to perform multiple updates in a
| batch, except for some very limited cases. Other
| implementations like Clojure's support "transients" where you
| get access to mutate the data structure over and over again (as
| well as do reads), and then freeze the structure in place as a
| new persistent collection. JavaScript libraries like immer
| allow for the same thing. Scala's collections don't generally
| support this except in the form of "builders" which don't
| support reads and also don't support all the write access
| patterns (such as updating a vector at a specific index, or
| removing a key from a map/set, etc).
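|
| A toy sketch of the transient pattern in Python (not Clojure's or
| Scala's actual API): single updates copy the map, while a batch
| mutates a private copy that is then frozen back into a persistent
| value:
|
|     class PersistentMap:
|         def __init__(self, data=None):
|             self._data = dict(data or {})
|         def assoc(self, k, v):        # one full copy per single update
|             new = dict(self._data)
|             new[k] = v
|             return PersistentMap(new)
|         def get(self, k):
|             return self._data.get(k)
|         def transient(self):
|             return TransientMap(dict(self._data))
|
|     class TransientMap:
|         def __init__(self, data):
|             self._data = data         # private, mutated in place
|         def assoc(self, k, v):
|             self._data[k] = v
|             return self
|         def get(self, k):             # reads work mid-batch, unlike a builder
|             return self._data.get(k)
|         def persistent(self):
|             return PersistentMap(self._data)
|
|     m = PersistentMap({"a": 1})
|     t = m.transient()
|     for i in range(1000):
|         t.assoc(str(i), i)            # one copy total, not 1000
|     m2 = t.persistent()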
| btilly wrote:
| One of the advantages of this kind of architecture is that if one
| reader is reading the old tree while it is replaced, it just
| works. Databases are really good at having multiple versions of
| the same data structure be accessible at the same time.
|
| I hand rolled some similar data structures in higher order
| languages when I couldn't find any Python packages that gave me
| the same capability. But I couldn't figure out what a good name
| would be for, say, a dictionary that could have multiple
| concurrent versions. So I never went anywhere with that.
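|
| A minimal path-copying sketch of that idea in Python (illustrative
| only): each insert returns a new root that shares most structure
| with the old one, and old versions are simply garbage-collected
| once nothing references them any more:
|
|     class Node:
|         __slots__ = ("key", "val", "left", "right")
|         def __init__(self, key, val, left=None, right=None):
|             self.key, self.val = key, val
|             self.left, self.right = left, right
|
|     def insert(node, key, val):
|         # Copy only the nodes on the path; share everything else.
|         if node is None:
|             return Node(key, val)
|         if key < node.key:
|             new_left = insert(node.left, key, val)
|             return Node(node.key, node.val, new_left, node.right)
|         if key > node.key:
|             new_right = insert(node.right, key, val)
|             return Node(node.key, node.val, node.left, new_right)
|         return Node(key, val, node.left, node.right)
|
|     def lookup(node, key):
|         while node is not None:
|             if key == node.key:
|                 return node.val
|             node = node.left if key < node.key else node.right
|         return None
|
|     v1 = insert(None, "a", 1)
|     v2 = insert(v1, "b", 2)       # v1 is still valid and unchanged
|     assert lookup(v1, "b") is None and lookup(v2, "b") == 2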
| Hnrobert42 wrote:
| As Phil Karlton said, "There are only two hard things in computer
| science: cache validation and naming things." Seems like you
| found a way to work on both!
|
| https://skeptics.stackexchange.com/questions/19836/has-phil-...
| timeimp wrote:
| I always have to add the "and off-by-one" at the end.
|
| Nothing like reading memory you don't own!
| airstrike wrote:
| There are actually only two hard problems in computer
| science:
|
| 0) Cache invalidation
|
| 1) Naming things
|
| 5) Asynchronous callbacks
|
| 2) Off-by-one errors
|
| 3) Scope creep
|
| 6) Bounds checking
| alternative_a wrote:
| Let's be pedantic here and get all religious over the
| words "Science" and "hard".
|
| Computer "science" has a difficult conceptual problem
| with caching. The optimal cache, this science tells us,
| is indistinguishable from a fortune teller who is never
| wrong (oracle). Fortune telling is a "hard" problem for a
| science based on reasoning. The best we can do is hedge
| bets (which is what the science of caching focuses on).
|
| This same science also has a difficulty with naming
| things. Now numbering things is easy and science loves
| maths and maths love sciences, but science and letters
| have a more difficult history. Science would approve of
| "factoryfactoryfactoryImpl" btw ... it's "a rational
| scheme of naming".
|
| Here we see a "science" that is facing actual
| difficulties.
|
| The rest of your list are difficult but not "hard". The
| science of these matters is clear and the rest is up to
| the "scientists" struggling with "scope creep" and
| "bounds checking" ..
| Scarblac wrote:
| There is only one hard problem in software engineering:
| people.
| pmarreck wrote:
| _in_ validation?
| j-pb wrote:
| These things are usually called "persistent" versions of the
| thing, e.g. persistent Dictionary. Sometimes "immutably
| persistent" to distinguish it from the "durable" meaning of
| persistence i.e. written to disk.
| btilly wrote:
| It seems odd to me to call a data structure "persistent" when
| its purpose is to allow you to cheaply keep many similar
| transitory copies around until you throw them away.
|
| The specific application was using dynamic programming to
| build up a complex data structure. You wind up with many
| copies of very similar data structures, and doing a deep copy
| each time is prohibitively expensive.
| ncruces wrote:
| Still that's what they're called:
| https://en.wikipedia.org/wiki/Persistent_data_structure
|
| Precisely because previous versions remain valid (persist)
| under modification.
| btilly wrote:
| There is a key difference between that, and what I
| created.
|
| The difference is that persistent data structures allow
| you to traverse the history of the data structure to find
| past versions. By contrast I only allowed you to see the
| current version, returning a new version of the root any
| time you made a modification. As a result, the memory
| associated with the old version could be freed once
| nothing would want to access it again. And any version
| that you had could still be manipulated.
|
| For the example that I was dealing with, both made sense.
| On top of that it made it possible to create a hashable
| version of the data structure that had a good chance (not
| a guarantee) of matching hashes when you arrived at the
| same data structure through different histories.
| ncruces wrote:
| Well, the difference isn't that great. You wrote: "the
| memory associated with the old version could be freed
| once nothing would want to access it again." So all it
| really takes is something wanting.
| zogomoox wrote:
| Instead of cloning the branches all the way up to root, wouldn't
| it be more efficient to just write delta-records, so a structure
| per inner node that says "it's like this original node, but with
| the following modifications x,y,z"
| PartiallyTyped wrote:
| You may want to rebalance a tree, or you may want to have
| multiple readers while a writer exists without relying on
| mutexes or locking.
| zogomoox wrote:
| as long as you don't modify the original nodes that would
| still work, no locks needed.
| btilly wrote:
| Efficient in what way?
|
| Yes, a delta record would be smaller. But now you have to fetch
| both the new and the old, plus do computation to figure out
| what you have. This is trading off space and time.
|
| I'm going to go to a reasonably low level to explain this.
|
| Databases have generally found that the right tradeoff between
| space and time is to always work in terms of pages of fixed
| size. Now all reads are memory aligned, of known size. All
| writes are as well. And inside the CPU, processing a page in
| the most straightforward and predictable way possible is very
| fast. In particular you want to avoid complex logic that
| introduces too many pipeline stalls.
|
| If you're going to work with pages anyways, you want to find
| ways to fetch as few pages as possible. Only have this page
| point to that page where you really need to. And put as much as
| reasonable on each page. The name of the game is to fetch as
| few pages as you need, and get everything you need from a page
| when you fetch it. Because when you're getting a new page,
| often you have to wait for a disk read. That's slow. It used to
| be really slow, you needed to wait for the right part of the
| disk to rotate around. Those disks went at something like 7200
| rpm, but that means 120 revolutions per second, which means
| you're waiting for anywhere from 0 to 8.333... milliseconds for
| the read. Now consider reading through a million record
| database...
|
| That's where a BTree comes in. It is a tree structure built
| around a page layout. When a page gets too full, it is split
| and one record is promoted to the level above. When the top
| page gets too full, a new level is created at top. That keeps
| it perfectly balanced at all times. So you can get to any
| record in very few reads. And if you're walking the whole data
| structure, a single page generally has many records.
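|
| The split-and-promote rule in miniature (a Python sketch with a
| made-up page capacity, ignoring child pointers and disk layout):
|
|     PAGE_CAPACITY = 4
|
|     def split_page(keys):
|         """Split an overfull page; the middle key is promoted upward."""
|         assert len(keys) > PAGE_CAPACITY
|         mid = len(keys) // 2
|         return keys[:mid], keys[mid], keys[mid + 1:]
|
|     left, promoted, right = split_page([10, 20, 30, 40, 50])
|     # left == [10, 20], right == [40, 50], and 30 moves up a level;
|     # when the root itself splits, a new root is created above it.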
| hoytech wrote:
| LMDB [0] was derived from this project, as mentioned in its
| copyright notice section [1].
|
| [0] http://www.lmdb.tech/doc/ [1]
| https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1...
| jnwatson wrote:
| Importantly, the LMDB API allows append mode, and it is very
| fast.
| bob1029 wrote:
| Some years ago I was playing around with the idea of an append-
| only splay tree for storing a database file. The thinking was
| that the algorithm would keep recently- _accessed_ items at the
| business end of the log (in addition to any new /updated/deleted
| items).
|
| This concept is clearly quite heavy on writes, but the tradeoff
| is that you would then have the ability to expire unused entries
| simply by picking an offset in the log and chopping everything
| that precedes it. Any record accessed more recently than that
| cutoff point would be guaranteed to have been written one or more
| times later on in the log.
|
| _Any_ access would result in a new modified tree & root node
| being written, but the prototype did batch IO using an MPSC queue
| abstraction which meant that I could amortize things a bit.
| Multiple transactions could fit in a single IO command if they
| are issued from different threads, occur within a small slice of
| time and are smaller than the block size.
| paulsutter wrote:
| Every historical version of the database is available as a
| snapshot. The metadata block should point back to previous meta
| block(s), not just the new root
|
| And of course emphasizing the closing statement:
|
| > there is no need for a transaction log, because the database
| file is the transaction log
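|
| Something like this, perhaps (a hypothetical layout, purely to
| illustrate): every commit appends a meta record carrying the new
| root offset plus the offset of the previous meta record, so older
| snapshots stay reachable by walking the chain backwards:
|
|     metas = []      # stands in for meta records appended to the file
|
|     def commit(new_root_offset):
|         prev = len(metas) - 1 if metas else None
|         metas.append({"root": new_root_offset, "prev_meta": prev})
|
|     for root in (100, 220, 310):
|         commit(root)
|
|     # Walk every historical snapshot, newest first.
|     i = len(metas) - 1
|     while i is not None:
|         print(metas[i]["root"])
|         i = metas[i]["prev_meta"]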
| ww520 wrote:
| The immutable b+tree is a great idea, except it generates a
| tremendous amount of garbage pages. Every update of a record
| generates several new pages along the path of the tree.
|
| There are two additional techniques to make immutable b+tree
| practical. One is to reclaim stale unreachable pages after the
| transactions using them have closed. Two is to use a pair of
| oscillating fixed location meta pages.
|
| A page is active if it's in the latest snapshot or it is in use
| by an open transaction; otherwise, it's unreachable and can be
| reclaimed. When a transaction closes, it hands its list of in-use
| pages to the reclaimer. The reclaimer tracks these pages. When no
| other open transactions hold on to a page, it can be reclaimed.
|
| When the write transaction commits, it hands the list of old
| pages being overwritten to the reclaimer as candidates for
| freeing, pending no open transactions using them.
|
| The reclaimer can batch a number of reclaimed pages together to
| update the free page list, by appending to the end of the file a
| page of the newly freed page pointers. Update the meta page to
| point to the new head page of the free page list at the end of
| the file. This can be done as a write transaction to keep things
| consistent.
|
| At a crash all the pending reclaiming pages in memory are lost
| and the garbage pages on disk linger. This requires periodic
| garbage collection or compaction.
|
| The most frequently updated page in the tree is the meta page.
| Every write transaction updates it. This creates a lot of garbage
| pages. The second technique addresses this problem.
|
| One insight is that the meta page is only needed when a new
| transaction begins by finding the tree root of the latest
| committed snapshot. That means we only need two copies of the
| meta pages, one for the last committed snapshot and one for the
| new pending write transaction. When the write transaction
| commits, the pending meta page becomes the latest committed
| snapshot. The other page becomes available for the next pending
| transaction.
|
| We can have two fixed location meta pages, oscillating between
| the latest committed snapshot and the pending new transaction.
| The fixed location removes the need to search for the meta page
| from the end of file.
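|
| A toy version of the oscillating meta pages in Python (illustrative
| only, ignoring durability and the reclaimer): the slot is chosen by
| transaction id parity, so the committed snapshot and the pending
| write transaction never occupy the same fixed-location page:
|
|     meta_slots = [None, None]    # two fixed "pages" at the file front
|     committed_txn = -1           # id of the last committed transaction
|
|     def begin_write_txn():
|         return committed_txn + 1
|
|     def commit(txn_id, new_root_offset, free_list_head):
|         global committed_txn
|         slot = txn_id % 2        # never the committed snapshot's slot
|         meta_slots[slot] = {"txn": txn_id,
|                             "root": new_root_offset,
|                             "free_list": free_list_head}
|         committed_txn = txn_id   # a real engine would fsync before this
|
|     def latest_meta():
|         live = [m for m in meta_slots if m is not None]
|         return max(live, key=lambda m: m["txn"])
|
|     txn = begin_write_txn()
|     commit(txn, new_root_offset=4096, free_list_head=None)
|     assert latest_meta()["root"] == 4096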
| dragontamer wrote:
| > except it generates a tremendous amount of garbage pages
|
| Note that this is an advantage on say, controller-less NAND
| Flash (JFFS-like embedded NAND Flash).
|
| More modern Flash drives have embedded FTL (flash translation
| layers) that internalizes that garbage collection process. But
| *someone* had to write the FTL to begin with, and methinks this
| append-only btree would work very well for that.
|
| -------
|
| In NAND Flash, only 10,000ish erase/write cycles are allowed
| before any block becomes unusable. (Depending on tech of
| course: could be 1000 on QLC, could be 100k on SLC). All that
| garbage helps cycle the write/erase cycles across the drive
| more evenly / more consistently. Especially if you combine that
| garbage with TRIM-like commands.
|
| That might be a little bit too low level for a lot of folks
| though.
| epcoa wrote:
| There are more appropriate and better data structures for an
| FTL especially with the assumption of some amount of
| auxiliary RAM (either host or on controller). Much of this
| literature is open access/free.
|
| "All that garbage helps cycle the write/erase cycles across
| the drive more evenly"
|
| Well no, you still want to minimize garbage production (and
| related GC overhead). Wear leveling doesn't mean produce more
| garbage.
| dragontamer wrote:
| > Well no, you still want to minimize garbage production
| (and related GC overhead). Wear leveling doesn't mean
| produce more garbage.
|
| Surely some % of garbage helps as you're garbage collecting
| and reliably trying to shuffle data around to wear-level
| more evenly?
|
| Let's say you have 99% used data and 1% garbage. You have
| very little room for (static) wear leveling. Ex: If you're
| writing 10MB of new data, and your drive is 99% full of
| allocated data... you'd have to move 1000MB of data around
| to "find" the 10MB of garbage, on the average.
|
| In the other extreme case: 0% data used and 100% garbage
| (say a TRIM operation just completed), then you simply just
| write 10MB without having to worry about moving any data
| around.
|
| The 50% data + 50% garbage scales as appropriate. 10MB of
| written new data will need you to find and coalesce 10MB of
| garbage to write the new data. This will happen after
| moving 20MB of overall data around.
|
| ----------
|
| I'm oversimplifying of course. But even in a real life
| system, I'd expect that the more "garbage" you have sitting
| around, the better the various algorithms work for FTL /
| SSD (static or dynamic) wear leveling.
| epcoa wrote:
| I can't state any more clearly: minimize the production
| of garbage.
|
| "Surely some % of garbage helps as you're garbage
| collecting "
|
| Garbage that doesn't exist doesn't need collecting.
|
| The flaw here is confusing free space with garbage. You
| shouldn't have written in the first place if you could
| have avoided it.
|
| Every environmentalist knows this: RRR, the first R is
| reduce not recycle.
| dragontamer wrote:
| Any append-only data-structure will have data that was
| true (when it was written), but has become
| false/obsolete/garbage at a later time. This is
| unavoidable.
|
| I'm not saying that we write needless garbage to the logs
| or filesystem or whatever. I'm saying that the amount of
| garbage in your stream that you leave behind aids in
| later steps of (static) wear-leveling. So therefore, its
| not a big deal. You're going to be making plenty of this
| (initially true data, but later garbage data) as files
| get updated, moved around filesystems, or whatnot.
|
| "Garbage" in this sense is perhaps the wrong word. Its
| more like "Obsolete data", or "outdated data".
| ncruces wrote:
| If you read the article though, many of the updated nodes
| (which are now garbage) don't see any updates to their
| "data" but to internal tree pointers.
|
| So lots of "data" is being copied, and garbage is being
| generated, only for the benefit of tidying up tree
| structure, not because the actual "data" in those pages
| changed.
|
| Not generating such garbage in the first place is an
| obvious benefit.
| dragontamer wrote:
| I think I see what you mean now. Thanks.
| codetrotter wrote:
| This description reminds me of a document I read about how ZFS
| is implemented. In particular, how snapshots work in ZFS, and
| what happens when you delete a snapshot.
| cryptonector wrote:
| It's a very similar idea.
|
| Traditional Unix-ish filesystems, with inodes and indirect
| nodes are a lot like b-trees, but ones where the indices are
| block numbers and where you can only append indices, trim, or
| replace indices.
|
| The ZIL (ZFS intent log) is a mechanism for amortizing the
| write magnification of copy on write trees.
| thechao wrote:
| I am _immediately_ nerd-sniped by this. Is there any code out
| there you know of I can see? The dual meta-data pages sounds
| _precisely_ like double-buffering (from graphics; my domain of
| expertise). I am also drawn to append-only-_like_ and
| persistent data-structures.
| hoytech wrote:
| LMDB works like this. The MDB_env struct keeps pointers to
| both meta pages:
|
| https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1...
|
| Which meta page is used is determined by the transaction ID
| (even IDs use the first, odd IDs the second).
|
| This is the earliest description I can find of the "shadow
| paging" concept: https://dl.acm.org/doi/10.1145/320521.320540
|
| And I believe its first implementation was in IBM's System R.
| senderista wrote:
| Of course, this implies a single-writer design (not necessarily
| a bad thing!).
| ww520 wrote:
| Actually an immutable btree has a single writer in general,
| since there's no transaction log and the meta page points to
| one latest tree root. Even if two write transactions update
| different parts of the tree, they both need to update the same
| meta page, causing contention. This is one downside (or upside
| depending on your need, since a single writer simplifies things
| a lot and is great for sequential writes).
| cryptonector wrote:
| Yes, CoW trees have tremendous write magnification. This is
| well-known. The fix is to amortize this cost by writing
| transactions as a log and then doing a b-tree update operation
| for numerous accumulated transactions.
|
| Think of ZFS and its ZIL (ZFS Intent Log). That's exactly what
| the ZIL is: a CoW tree write magnification amortization
| mechanism.
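|
| The amortization pattern in miniature (a Python sketch, nothing to
| do with the actual ZIL on-disk format): operations are appended to
| a flat intent log immediately, and the CoW tree is rewritten only
| once per batch, so the root-to-leaf copying cost is shared:
|
|     intent_log = []      # cheap sequential appends
|     tree = {}            # stands in for the committed CoW b-tree state
|     BATCH = 100
|
|     def write(key, value):
|         intent_log.append((key, value))   # one sequential write per op
|         if len(intent_log) >= BATCH:
|             flush()
|
|     def flush():
|         # One tree update (one set of page copies) pays for BATCH ops.
|         for key, value in intent_log:
|             tree[key] = value
|         intent_log.clear()
|
|     def recover():
|         # After a crash, replay whatever survived in the log.
|         flush()
|
|     for i in range(250):
|         write("k%d" % i, i)   # two flushes so far; 50 ops still logged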
| wredue wrote:
| FFS. The fix for absolutely bonkers CoW costs is to stop
| buying in to this idiotic notion that "immutability is easier
| to reason about".
|
| If you're having difficulty reasoning about how to deal with
| major performance issues _then your position is not easier to
| reason about. Full stop_.
|
| Stop the madness!
| cryptonector wrote:
| It is easy to reason about the performance issues of CoW,
| and it's easy enough to reason about how to work around
| those issues. Ease of reasoning is a big deal when you're
| dealing with hundreds of millions of lines of code, as some
| of us do. Cognitive load is a big deal in today's world.
| It's why we're migrating from C to memory-safe languages.
| ww520 wrote:
| I'm not familiar with ZFS internals beyond the deduplication
| part. Is that just the traditional transaction log + btree
| update approach most databases used? Does ZFS support
| transactional reading and writing?
| cryptonector wrote:
| > Does ZFS support transactional reading and writing?
|
| I don't know that I understand your question. ZFS supports
| POSIX transactional semantics, which is _not_ ACID, though
| ZFS's design _could_ support ACID.
|
| > Is that just the traditional transaction log + btree
| update approach most databases used?
|
| Basically, yes. The idea is to write a sequential log of a)
| data blocks, b) metadata operations (e.g., renames) to
| support fast crash recovery. During normal operation ZFS
| keeps in-memory data structures up to date but delays
| writing new tree transactions so as to amortize write
| magnification. On unclean shutdown ZFS checks that all
| transactions recorded in the ZIL have been written to the
| tree or else it will do it by a) rewinding the ZFS state to
| the newest fully-written transaction, b) loading the
| contents of the ZIL from that point forward.
|
| Because the ZIL was designed at a time when fast flash
| devices were expensive, it's really a separate thing from
| the rest of the tree. If one were to start over one might
| integrate the ZIL with the rest of the system more tightly
| so as to further minimize write magnification (e.g., data
| blocks written to the ZIL need not be re-written to the
| tree).
| ruuda wrote:
| The Hitchhiker Tree [1] addresses the write amplification at
| the cost of making lookups and scans slightly more expensive,
| by keeping a small array of "pending writes" in every non-leaf
| node. The entire thing is still immutable, but instead of
| writing a new leaf node + path from root to that leaf, in the
| majority of cases we only write a new root. When there is no
| space left in the array, all the pending values get pushed one
| level down at the same time, so the write amplification is
| amortized.
|
| [1]: https://www.youtube.com/watch?v=jdn617M3-P4
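|
| A toy two-level sketch of the pending-writes buffer in Python (not
| the real Hitchhiker tree, and with a made-up routing rule): most
| inserts copy only the root node, and a full buffer is pushed down
| to the children in one go:
|
|     PENDING_LIMIT = 4
|
|     class Root:
|         def __init__(self, children, pending=()):
|             self.children = children       # tuple of dicts, toy "leaves"
|             self.pending = tuple(pending)  # buffered (key, value) writes
|
|     def insert(root, key, value):
|         pending = root.pending + ((key, value),)
|         if len(pending) <= PENDING_LIMIT:
|             # Common case: a new root only; children are shared as-is.
|             return Root(root.children, pending)
|         # Buffer full: rewrite the children once for the whole batch.
|         children = [dict(c) for c in root.children]
|         for k, v in pending:
|             children[hash(k) % len(children)][k] = v
|         return Root(tuple(children))
|
|     root = Root(({}, {}))
|     for i in range(10):
|         root = insert(root, "k%d" % i, i)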
| ww520 wrote:
| I believe the Bw-tree does the same thing, caching new
| updates in the intermediate branch nodes and only pushing the
| updates in batch down to the lower layers when running out of
| room at the branch node. These are the newer wave of
| algorithms to address the write amplification.
| mamcx wrote:
| Naively thinking here, how about a 2 immutable b+tree setup:
|
| The "hot" tree is the WAL: All the data is there and copies are
| generated.
|
| The "cold" tree is behind: In the background move from WAL and
| at the same time compact it.
| ww520 wrote:
| That's how the traditional transaction log + btree approach
| works. The append-only btree removes the need for a separate
| transaction log.
| packetlost wrote:
| One enhancement I've toyed with in the past is to have subtrees
| beneath each leaf node for the historical data for each record.
| You switch the indexing value to the timestamp of record
| insertion time or a TX number and keep inserting in append-only
| fashion. There are some enhancements you can make because the
| common case is for the key to be a monotonically increasing
| number, but nothing is strictly necessary. I'd love to build a
| datalog system on top of something like that at some point, even
| if it's just for fun.
| hlship wrote:
| In such a system, how does a reader find the root node? I'd be
| concerned about a) readers seeing a partial write to disk (or
| page buffer) during an update or b) a crash while the writer
| writes to disk.
|
| I could imagine using two files, one containing the actual
| b-tree, the second containing the offset to the latest root node;
| the second file gets overwritten only after a successful write is
| verifiably written to disk.
|
| Datomic's (https://www.datomic.com/) architecture is similar to
| this, but uses many write-once "segments" (which could be files
| in EFS or S3, or rows in DynamoDB or other stores).
| twoodfin wrote:
| There's a "canonical" pointer to the root node which is updated
| atomically on every append.
|
| For an in-memory database, a CAS is adequate. Persistent stores
| ultimately need some kind of file-level locking.
|
| If you look at the Apache Iceberg spec, you get a good idea of
| how this works: The only "mutability" in that universe is the
| root table pointer in the catalog.
| ww520 wrote:
| The pointer to the root tree node is stored in the last
| committed metadata page. A read transaction starts with the
| reading of the metadata page. Reaching partially written
| pages is impossible as the writer has not committed the latest
| metadata page yet. The transaction is committed when the latest
| metadata page is written as the last step.
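|
| The write ordering is the whole trick; roughly (a Python sketch
| assuming the two-file layout suggested above, not any particular
| engine's code):
|
|     import os
|
|     def commit(db_file, meta_file, new_pages, new_root_offset):
|         db_file.seek(0, os.SEEK_END)
|         for page in new_pages:        # 1. append the copied tree pages
|             db_file.write(page)
|         db_file.flush()
|         os.fsync(db_file.fileno())    # 2. make them durable first
|         meta_file.seek(0)
|         meta_file.write(new_root_offset.to_bytes(8, "little"))
|         meta_file.flush()
|         os.fsync(meta_file.fileno())  # 3. publish the new root last
|
| A reader that loads the meta page before step 3 simply sees the
| previous root, never a root pointing at half-written pages.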
| JensRantil wrote:
| IIRC, this is how CouchDB was implemented. The benefit is that it
| was resilient to crash without needing a write-ahead log. The
| downside was that it required running background compaction of
| the B+tree regularly.
| josephg wrote:
| LMDB works the same way. But it also stores a second data
| structure on disk listing all free blocks. Writes
| preferentially take free blocks when they're available. The
| result is it doesn't need any special compaction step. It sort
| of automatically compacts constantly.
| toolslive wrote:
| With append-only data structures, you could turn the fact that
| you rewrite the path from the modified node to the root to your advantage
| and rebalance. This means you abandon the "balanced" property and
| use update statistics to get an optimal shape. Who cares about
| the length of paths to parts of the tree you never visit?
| o11c wrote:
| One thing that's not explicitly mentioned: append-only trees
| necessarily lack `parent` pointers.
|
| This means that your "iterator" can't be a lightweight type, it
| has to contain an array of parent nodes to visit (again) later.
| You can use a fixed-size array if you can reason about the
| maximum height of the tree (for a balanced binary tree, this
| means around `2*POINTER_BITS`, which is 1024 bytes on a 64-bit
| platform; it will be less for a B-tree (`2*log(POINTER_MAX,
| CHILDREN_PER_NODE)`) but you now have to either track or re-scan
| for the index of the child you just returned from). Beware buffer
| overflow logic bugs if your tree isn't as balanced as you
| thought!
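|
| For example, an in-order cursor over such a tree has to carry its
| own ancestor stack (a Python sketch with nodes as (key, left,
| right) tuples):
|
|     def in_order(root):
|         stack = []                  # bounded by the height of the tree
|         node = root
|         while stack or node is not None:
|             while node is not None:
|                 stack.append(node)  # remember each ancestor to revisit
|                 node = node[1]      # descend left
|             node = stack.pop()
|             yield node[0]
|             node = node[2]          # then walk the right subtree
|
|     tree = ("m", ("f", ("a", None, None), None), ("t", None, None))
|     assert list(in_order(tree)) == ["a", "f", "m", "t"]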
___________________________________________________________________
(page generated 2023-12-29 23:00 UTC)