[HN Gopher] B-Trees: More Than I Thought I'd Want to Know
       ___________________________________________________________________
        
       B-Trees: More Than I Thought I'd Want to Know
        
       Author : todsacerdoti
       Score  : 138 points
       Date   : 2021-08-18 14:42 UTC (8 hours ago)
        
 (HTM) web link (benjamincongdon.me)
 (TXT) w3m dump (benjamincongdon.me)
        
       | bern4444 wrote:
       | I had to implement a simple B+ tree for a databases class in
       | college (since they're often used when implementing databases).
       | 
       | It was one of those assignments few people fully completed. Most
       | completed parts of it to varying degrees. I got more than halfway
       | through but wasn't able to complete it.
       | 
       | It was an extremely difficult assignment and stands out in my
       | mind.
       | 
       | They're very interesting and very complex. Nice article!
        
         | efficax wrote:
         | An in-memory b-tree or one stored on disk? The first one should
         | not be a huge amount of work over implementing, say, a red-
         | black tree. But sorting out the disk storage is hard (and in
         | the end, you've just written a database!)
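          | 
          | For reference, a minimal in-memory sketch of the hard part
          | (insert with node splitting) in Python: CLRS-style, minimum
          | degree t; search, deletion and all on-disk concerns are
          | omitted, and the node layout is an assumption, not from the
          | article.
          | 
          |     from bisect import bisect_right
          | 
          |     class Node:
          |         def __init__(self, leaf):
          |             self.keys, self.children, self.leaf = [], [], leaf
          | 
          |     class BTree:
          |         def __init__(self, t=2):
          |             self.t = t               # nodes hold t-1..2t-1 keys
          |             self.root = Node(leaf=True)
          | 
          |         def insert(self, key):
          |             if len(self.root.keys) == 2 * self.t - 1:
          |                 old = self.root      # full root: grow upward
          |                 self.root = Node(leaf=False)
          |                 self.root.children.append(old)
          |                 self._split(self.root, 0)
          |             self._insert_nonfull(self.root, key)
          | 
          |         def _split(self, parent, i):
          |             t, child = self.t, parent.children[i]
          |             sib = Node(leaf=child.leaf)
          |             parent.keys.insert(i, child.keys[t - 1])  # median up
          |             parent.children.insert(i + 1, sib)
          |             sib.keys = child.keys[t:]
          |             child.keys = child.keys[:t - 1]
          |             if not child.leaf:
          |                 sib.children = child.children[t:]
          |                 child.children = child.children[:t]
          | 
          |         def _insert_nonfull(self, node, key):
          |             if node.leaf:
          |                 node.keys.insert(bisect_right(node.keys, key), key)
          |                 return
          |             i = bisect_right(node.keys, key)
          |             if len(node.children[i].keys) == 2 * self.t - 1:
          |                 self._split(node, i)    # make room first
          |                 if key > node.keys[i]:
          |                     i += 1
          |             self._insert_nonfull(node.children[i], key)
          | 
          | Persisting those nodes as fixed-size pages, with buffering and
          | crash safety, is where it turns into "you've just written a
          | database".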
        
         | mad_vill wrote:
         | uw madison?
        
           | jwitthuhn wrote:
           | I'm not the original commenter, but I sure did implement a
           | B-tree for my databases course at UW Madison.
        
         | cogman10 wrote:
          | Perhaps it was just my professor, but what really helped was
          | that we first did red-black trees, and after that he explained
          | to us that a red-black tree is just a kind of B-Tree (it's
          | equivalent to a 2-3-4 tree, i.e. a B-tree of order 4). Once
          | you have that, it's pretty easy to go from one to the other.
        
         | lmkg wrote:
         | Implementing a B+ tree was the final assignment in a grad-level
          | Algorithms & Data Structures class that I took. I thought it was
         | a reasonable assignment in that context. It felt like a good
         | capstone to the class, building on what we had learned and
         | finally crossing the line from toy examples to something that
         | approached real-world usefulness.
         | 
         | It seems very out-of-place for a databases course. Yes, B-Trees
         | are important for a database, but I don't see how
         | _implementing_ one teaches you about how a database _uses_ one
          | or why it's important.
        
           | Tostino wrote:
            | There are actually some pretty good database courses, at CMU
            | for example, that go into what is required to build a
            | database from the ground up.
            | 
            | So don't assume all DB classes are the same; there's a vast
            | difference between that type of class and one that just
            | teaches you how to use SQL.
        
           | ddlutz wrote:
            | I've seen quite a few database courses have you implement
            | one, as B-trees are the backbone of many database systems.
        
         | nicolaslem wrote:
         | I got interested in the topic a few years ago and wrote an
         | implementation in Python[0] for myself. It's just a toy but the
          | funny thing is that I regularly get waves of students starring
         | the repository, probably when a teacher gives a similar
         | assignment.
         | 
         | [0] https://github.com/NicolasLM/bplustree
        
       | cryptonector wrote:
       | The best B-tree resource is SQLite3's documentation of its B-tree
       | on-disk format.
        
         | jgwil2 wrote:
         | This one? https://www.sqlite.org/fileformat2.html#btree
        
           | jgwil2 wrote:
           | Instead of downvoting could someone just provide the correct
           | link?
        
         | emptybits wrote:
          | Fantastic. Also a great related listen is the recent
          | Corecursive Podcast episode with Richard Hipp as guest. He's
          | the man who created SQLite, and he has good technical stories
          | to share.
         | 
         | "The Untold Story of SQLite"
         | https://corecursive.com/066-sqlite-with-richard-hipp/
        
       | eatonphil wrote:
       | It would be great to see a walkthrough writing a minimal btree
       | from scratch with reading/writing from disk.
       | 
       | If you've got one (walkthrough with code, not just an existing
       | implementation), link it please!
        
         | spanspan wrote:
         | This incomplete tutorial is about how to write a sqlite clone,
         | with particular emphasis on how to implement the underlying
         | b-tree: https://cstack.github.io/db_tutorial/
        
           | ddlutz wrote:
           | for YEARS that tutorial has been stuck on "Alright. One more
           | step toward a fully-operational btree implementation. The
           | next step should be splitting internal nodes. Until then!".
           | Must not be worth the time to finish writing the tutorial.
        
       | CodeWriter23 wrote:
        | IMO, unlike in the opening drawing, in a disk-based B-Tree index
        | Leaf nodes should not have "next" pointers; one should instead go
        | back up the Tree node(s) (iterating upward when necessary) to
        | find the next Leaf node. Next pointers are a duplication of
        | information that violates referential integrity within the
        | index, and they will be invalidated when an index update causes
        | a split or merge rebalancing operation, creating concurrency
        | issues.
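        | 
        | A rough sketch of that traversal style in Python (node layout
        | with .leaf, .keys and .children is assumed): a range scan that
        | keeps the root-to-leaf path on a stack and climbs back up when a
        | leaf is exhausted, instead of following leaf-level "next"
        | pointers.
        | 
        |     from bisect import bisect_left, bisect_right
        | 
        |     def range_scan(root, start_key):
        |         """Yield keys >= start_key without leaf sibling links."""
        |         stack = []            # (internal node, child index taken)
        |         node = root
        |         while not node.leaf:  # descend toward start_key
        |             i = bisect_right(node.keys, start_key)
        |             stack.append((node, i))
        |             node = node.children[i]
        |         pos = bisect_left(node.keys, start_key)
        |         while True:
        |             for key in node.keys[pos:]:   # drain current leaf
        |                 yield key
        |             while stack:      # climb to an unexhausted ancestor
        |                 parent, i = stack[-1]
        |                 if i + 1 < len(parent.children):
        |                     break
        |                 stack.pop()
        |             if not stack:
        |                 return
        |             stack[-1] = (parent, i + 1)
        |             node = parent.children[i + 1]
        |             while not node.leaf:          # down to leftmost leaf
        |                 stack.append((node, 0))
        |                 node = node.children[0]
        |             pos = 0
        | 
        | The cost is re-reading a few upper-level pages per leaf, which a
        | page cache usually absorbs; the trade-off against sibling
        | pointers is discussed in the reply below.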
        
         | SamReidHughes wrote:
         | When you split a node, you're probably [1] already modifying
         | the node you're splitting, so there's no problem updating
         | unidirectional sibling pointers. If duplication of information
         | is a problem, well, you have a bug in your storage engine
         | somewhere.
         | 
         | [1] It occurs to me that you don't necessarily need to modify
         | the node you're splitting, if the key you're writing is on the
         | new node's half of the split.
         | 
         | However, sibling pointers would be the _dumbest_ way to
         | traverse the data, even on a read-only tree, because it has
         | serial round-trip latency to disk for every block. You would
         | never use them for traversal.
         | 
         | They _can be_ useful, for a fancy concurrency hack, if you use
         | them to split the node without updating the parent, and then
         | update the parent in a separate operation after the fact. This
         | lets write operations release their lock on the parent _before_
         | accessing the child. In that case any code traversing to a
          | child makes note of the child's right sibling stored in the
         | parent and uses the child's sibling pointer only if it's
         | different from the right sibling as stored in the parent (which
         | means the parent hasn't been updated yet).
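          | 
          | A rough sketch of that read-side check (Python; this is
          | essentially the B-link-tree trick, and the block ids plus the
          | .right_sibling and .high_key fields are assumptions, not from
          | the comment):
          | 
          |     from bisect import bisect_right
          | 
          |     def descend(parent, key, read_block):
          |         """One step of a read-only descent under splits."""
          |         i = bisect_right(parent.keys, key)
          |         expected = None       # sibling as the parent knows it
          |         if i + 1 < len(parent.children):
          |             expected = parent.children[i + 1]
          |         child = read_block(parent.children[i])
          |         # Child was split but parent not yet updated: its own
          |         # sibling link disagrees with the parent's view, and
          |         # keys past its high key live on the new sibling.
          |         if (child.right_sibling != expected
          |                 and key > child.high_key):
          |             child = read_block(child.right_sibling)
          |         return child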
        
       | masklinn wrote:
       | > In my college Data Structures and Algorithms course, we covered
       | B-Trees, but I didn't grok why I'd choose to use one. As
       | presented, B-Trees were essentially "better" Binary Search Trees,
       | with some hand-waving done that they had improved performance
       | when used in database applications.
       | 
        | With modern memory hierarchies that also tends to be the case
        | in-memory (with smaller, lower-fanout nodes sized to fit caches
        | and cache lines).
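        | 
        | A back-of-envelope illustration (all sizes assumed, not from the
        | article): roughly how many cache lines a lookup touches in a
        | pointer-chasing binary tree versus a B-tree with small,
        | cache-sized nodes.
        | 
        |     import math
        | 
        |     N = 1_000_000    # keys in the index
        |     LINE = 64        # bytes per cache line (typical x86-64)
        |     NODE = 256       # bytes per B-tree node = 4 lines (assumed)
        |     KEY = 8          # bytes per key, ignoring pointers
        | 
        |     bst_lines = math.log2(N)             # ~1 line per node hop
        |     fanout = NODE // KEY                 # ~32 keys per node
        |     btree_lines = math.log(N, fanout) * (NODE / LINE)
        | 
        |     print(f"binary tree ~{bst_lines:.0f} lines, "
        |           f"B-tree ~{btree_lines:.0f} lines")
        |     # ~20 vs ~16 lines, and the B-tree's lines are contiguous
        |     # within a node, so they prefetch well, unlike the binary
        |     # tree's random hops through memory.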
        
         | pkaye wrote:
          | This article talks about some of the reasons in the section
         | "Disk-Induced Constraints".
        
           | masklinn wrote:
           | Yes, except for #2 those reasons apply essentially as-is to
           | in-memory b-trees if you replace "memory" with "cache" and
           | "disk" with "memory".
        
             | dragontamer wrote:
             | DDR4 has a burst-length of 8, meaning the smallest you can
             | write is a block of 64 bytes (corresponding to a modern
             | cache-line).
             | 
              | It should also be noted that DDR4 has "rows" of roughly
              | 1024 bytes. A RAS command opens up a "row" (i.e. 1024
              | bytes), while a CAS command reads a 64-byte column from
              | it.
             | 
              | DDR4 cells can only be read once (!!!); afterwards, the
              | data gets mangled. To prevent the loss of data, all "reads"
              | go first into sense amplifiers... and those sense
              | amplifiers can be read repeatedly.
             | 
             | This "read into sense-amplifiers" is called a RAS command,
             | and it transfers the entire 1024-byte row into sense
             | amplifiers. A CAS can then read 64-byte chunks from the
             | page at high speeds and randomly.
             | 
             | --------
             | 
              | Before a new RAS command can be issued, a few steps are
              | needed.
             | 
             | 1. The "current row" of data needs to be written back to
             | DRAM (remember, the data in the DRAM was lost when you read
             | the data in the first place).
             | 
             | 2. After writing the data, the sense-amplifiers must be
             | emptied (aka: precharged), so that they're ready to receive
             | the new data.
             | 
             | 3. After #1 and #2, it is finally legal to issue a RAS
             | 
             | --------
             | 
             | So in fact, DDR4 RAM is also a block device. Sure, RAS is
             | much faster than a hard-drive seek, but given how much
             | slower RAS (row address strobe) is than a CAS (column
             | address strobe)... a lot of these disk-based discussions
             | (such as B-trees) end up applying to DDR4 in practice.
             | 
             | --------
             | 
             | The only thing that really works like classic RAM is L3 /
             | L2 / L1 caches. Everything else in the modern world is
             | basically a block device.
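              | 
              | A rough back-of-envelope of that RAS vs CAS gap
              | (illustrative DDR4-3200 CL22 timings assumed; real
              | modules vary):
              | 
              |     CYCLE_NS = 1 / 1.6           # 1600 MHz memory clock
              |     CL, tRCD, tRP = 22, 22, 22   # in clock cycles
              | 
              |     # CAS only: the row is already open
              |     row_hit = CL * CYCLE_NS
              |     # precharge + RAS + CAS for a new row
              |     row_miss = (tRP + tRCD + CL) * CYCLE_NS
              | 
              |     print(f"row hit  ~{row_hit:.1f} ns")   # ~13.8
              |     print(f"row miss ~{row_miss:.1f} ns")  # ~41.2
              | 
              | Roughly a 3x penalty per random 64-byte access: tiny next
              | to a disk seek, but enough that block- and locality-
              | friendly layouts still pay off in DRAM.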
        
       | toolslive wrote:
       | B+Trees minimize the maximum depth of the lookup. All the values
       | have exactly the same cost. In all honesty, I don't care about
       | the lookup cost of the values I never need.
        
         | bob1029 wrote:
         | Sounds like you are looking for a Splay Tree implementation.
         | Their dynamic optimality properties ensure your most popular
         | data lives right at the root.
        
           | toolslive wrote:
           | For example. In essence any tree that allows you to "rotate"
           | based on some statistics about subtree popularity. This
           | becomes more and more interesting in combination with copy-
           | on-write strategies because there you have to rewrite the top
           | of the tree anyway. So why not aggregate some statistics
           | while you're at it.
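            | 
            | A toy sketch of that idea in Python (hypothetical, not any
            | particular implementation): a plain BST where a node is
            | rotated up one level whenever its hit count overtakes its
            | parent's.
            | 
            |     class Node:
            |         def __init__(self, key):
            |             self.key, self.hits = key, 0
            |             self.left = self.right = None
            | 
            |     def rotate_up(parent, child):
            |         """Single rotation lifting child above parent."""
            |         if parent.left is child:
            |             parent.left, child.right = child.right, parent
            |         else:
            |             parent.right, child.left = child.left, parent
            |         return child          # new subtree root
            | 
            |     def lookup(root, key):
            |         """Return the (possibly new) root of this subtree."""
            |         if root is None or root.key == key:
            |             if root:
            |                 root.hits += 1
            |             return root
            |         if key < root.key:
            |             root.left = lookup(root.left, key) or root.left
            |             child = root.left
            |         else:
            |             root.right = lookup(root.right, key) or root.right
            |             child = root.right
            |         if child and child.hits > root.hits:
            |             return rotate_up(root, child)  # popular: lift it
            |         return root
            | 
            | Toy only: unlike a real splay tree it makes no balance
            | guarantees, but it shows the "rotate on popularity" shape,
            | and the hit counters are exactly the kind of statistics you
            | could fold into a copy-on-write rewrite of the upper nodes.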
        
       | dragontamer wrote:
        | I feel like FAT and FAT32 back in the DOS days were so simple and
        | easy to explain that my generation of programmers hit the "sweet
        | spot" in terms of learning technology.
        | 
        | FAT / FAT32 wasn't about B-trees or balancing or anything. It was
        | a simple linked list that got severely tangled up during use and
        | required an hour of "defragmenting" to get your hard drive back
        | up to speed.
       | 
        | As a computer user at the time, I was too young / ignorant to
        | know why it happened like that, but that experience stuck with me
        | until college. In college, I learned how the FAT32 filesystem
        | worked fundamentally (though obsolete, it's so simple it's a
        | great introduction).
        | 
        | From there, I learned why we have moved on to more advanced data
        | structures like B-Trees, and why things like B-trees are less
        | prone to fragmentation. Etc. etc.
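        | 
        | For flavor, a minimal sketch of that linked-list structure
        | (Python; simplified, though the masking and end-of-chain values
        | do match FAT32):
        | 
        |     def file_clusters(fat, first_cluster):
        |         """Follow a FAT32 cluster chain: each table entry
        |         holds the number of the next cluster in the file,
        |         until an end-of-chain marker."""
        |         cluster = first_cluster
        |         while cluster < 0x0FFFFFF8:      # end-of-chain marker
        |             yield cluster
        |             # next link; top 4 bits of each entry are reserved
        |             cluster = fat[cluster] & 0x0FFFFFFF
        | 
        | A fragmented file is just a chain that hops all over the disk
        | (5 -> 87 -> 6 -> 412 -> ...), which is exactly what an hour of
        | defragmenting untangled.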
       | 
       | ----------
       | 
       | A big part of my learning process was the hours of waiting I'd
       | put in to defrag my hard drive back in the day. Those memories
        | stuck with me; I know what the "downsides" are if I don't use a
       | self-balancing B-Tree to store data on a hard drive.
       | 
       | Perhaps today, we're no longer at a point where we can just
       | convince legions of new computer-users (or programmers) to use
       | inferior file systems like FAT32 in their daily lives. But we can
       | at least tell stories about those times, so that they can
       | understand why we put so much complexity into our modern
       | filesystem data-structures.
        
         | nanidin wrote:
         | Wasn't defragmentation more important in general on spinning
         | media to reduce random seeks? And now most performance oriented
         | things are happening on SSD?
        
           | dragontamer wrote:
           | I'm very confused by your question. I don't think FAT or
           | FAT32 was ever a performance-oriented filesystem.
           | 
           | Back then, we were just happy having a personal computer. I'm
           | sure there was better tech being used somewhere. But MS-DOS /
           | Windows was just your cheap $99 OS in an age when other
           | compilers and OSes cost you something like $1000 to $10,000.
           | 
           | You formatted your hard-drive in FAT because MS-DOS used FAT.
           | You used MS-DOS because that's what PCs and PC-clones came
            | with. There was no alternative, and PCs were already
            | considered quite a luxury / nerd item. (I know people brag
           | about their DEC Alphas, Unix copies or OS/2... but you got
           | enough "nerd cred" back then if you simply had a PC at all)
           | 
            | It's just how computers worked back then. Every PC was using
           | FAT, and every few months of regular use the disk
           | reads/writes would get so slow that performance was
           | unbearable.
           | 
           | At that point, you spent an hour defragmenting the hard
           | drive, and everything worked great again.
           | 
           | -------
           | 
           | Eventually 4GB hard drives were made, and you upgraded your
           | computer to FAT32. Soon after that, Windows promised an end
           | to defragmenting with its next-generation NTFS filesystem
            | (based on the B-trees discussed in this article), and we
           | never looked back.
        
       | spanspan wrote:
       | This is really cool. I recently attempted to build a toy
        | database, and subsequently implemented my own b-tree
        | (https://github.com/spandanb/learndb-py). I ended up running into
       | a lot of these issues.
       | 
       | I also did a write-up on why everyone should engage in similar
       | projects: https://www.spandanbemby.com/joys-of-database.html
        
       ___________________________________________________________________
       (page generated 2021-08-18 23:01 UTC)