[HN Gopher] B-Trees: More Than I Thought I'd Want to Know
___________________________________________________________________
B-Trees: More Than I Thought I'd Want to Know
Author : todsacerdoti
Score : 138 points
Date : 2021-08-18 14:42 UTC (8 hours ago)
(HTM) web link (benjamincongdon.me)
(TXT) w3m dump (benjamincongdon.me)
| bern4444 wrote:
| I had to implement a simple B+ tree for a databases class in
| college (since they're often used when implementing databases).
|
| It was one of those assignments few people fully completed. Most
| completed parts of it to varying degrees. I got more than halfway
| through but wasn't able to complete it.
|
| It was an extremely difficult assignment and stands out in my
| mind.
|
| They're very interesting and very complex. Nice article!
| efficax wrote:
| An in-memory b-tree or one stored on disk? The first one should
| not be a huge amount of work over implementing, say, a red-
| black tree. But sorting out the disk storage is hard (and in
| the end, you've just written a database!)
| mad_vill wrote:
| uw madison?
| jwitthuhn wrote:
| I'm not the original commenter, but I sure did implement a
| B-tree for my databases course at UW Madison.
| cogman10 wrote:
| Perhaps it was just my professor, but what really helped was we
| first did red-black trees and after doing that our teacher
| explained to us that red black trees were just a type of
| B-Tree. Once you have that, it's pretty easy to go from one to
| the other.
| lmkg wrote:
| Implementing a B+ tree was the final assignment in a grad-level
| Algorithms & Data Structures call that I took. I thought it was
| a reasonable assignment in that context. It felt like a good
| capstone to the class, building on what we had learned and
| finally crossing the line from toy examples to something that
| approached real-world usefulness.
|
| It seems very out-of-place for a databases course. Yes, B-Trees
| are important for a database, but I don't see how
| _implementing_ one teaches you about how a database _uses_ one
| or why it 's important.
| Tostino wrote:
| There are actually some pretty good database courses at CMU
| for example, that go into what is required to build databases
| from the ground up.
|
| So don't think all DB classes are the same, there are some
| vast differences in that type of class compared to a class
| teaching you how to use SQL.
| ddlutz wrote:
| I've seen quite a few database courses have you implement
| them, as they are the backbone for many database systems.
| nicolaslem wrote:
| I got interested in the topic a few years ago and wrote an
| implementation in Python[0] for myself. It's just a toy but the
| funny thing is that I regularly get waves of students staring
| the repository, probably when a teacher gives a similar
| assignment.
|
| [0] https://github.com/NicolasLM/bplustree
| cryptonector wrote:
| The best B-tree resource is SQLite3's documentation of its B-tree
| on-disk format.
| jgwil2 wrote:
| This one? https://www.sqlite.org/fileformat2.html#btree
| jgwil2 wrote:
| Instead of downvoting could someone just provide the correct
| link?
| emptybits wrote:
| Fantastic. Also a great, related, listen is the recent
| Corecursive Podcast with Richard Hipp as guest. He's the man
| who created SQLite, with good technical stories to share.
|
| "The Untold Story of SQLite"
| https://corecursive.com/066-sqlite-with-richard-hipp/
| eatonphil wrote:
| It would be great to see a walkthrough writing a minimal btree
| from scratch with reading/writing from disk.
|
| If you've got one (walkthrough with code, not just an existing
| implementation), link it please!
| spanspan wrote:
| This incomplete tutorial is about how to write a sqlite clone,
| with particular emphasis on how to implement the underlying
| b-tree: https://cstack.github.io/db_tutorial/
| ddlutz wrote:
| for YEARS that tutorial has been stuck on "Alright. One more
| step toward a fully-operational btree implementation. The
| next step should be splitting internal nodes. Until then!".
| Must not be worth the time to finish writing the tutorial.
| CodeWriter23 wrote:
| IMO unlike the opening drawing, in a disk-based B-Tree index,
| Leaf nodes should not have "next" pointers, one should instead go
| back up the Tree node(s) (iterating upward when necessary) to
| find the next Leaf node. Next pointers are a duplication of
| information violating referential integrity within the index, and
| will be invalidated when an index update causes a split or merge
| rebalancing operation, creating concurrency issues.
| SamReidHughes wrote:
| When you split a node, you're probably [1] already modifying
| the node you're splitting, so there's no problem updating
| unidirectional sibling pointers. If duplication of information
| is a problem, well, you have a bug in your storage engine
| somewhere.
|
| [1] It occurs to me that you don't necessarily need to modify
| the node you're splitting, if the key you're writing is on the
| new node's half of the split.
|
| However, sibling pointers would be the _dumbest_ way to
| traverse the data, even on a read-only tree, because it has
| serial round-trip latency to disk for every block. You would
| never use them for traversal.
|
| They _can be_ useful, for a fancy concurrency hack, if you use
| them to split the node without updating the parent, and then
| update the parent in a separate operation after the fact. This
| lets write operations release their lock on the parent _before_
| accessing the child. In that case any code traversing to a
| child makes note of the child 's right sibling stored in the
| parent and uses the child's sibling pointer only if it's
| different from the right sibling as stored in the parent (which
| means the parent hasn't been updated yet).
| masklinn wrote:
| > In my college Data Structures and Algorithms course, we covered
| B-Trees, but I didn't grok why I'd choose to use one. As
| presented, B-Trees were essentially "better" Binary Search Trees,
| with some hand-waving done that they had improved performance
| when used in database applications.
|
| With modern memory hierarchies that also tends to be the case in-
| memory (with lower node densities more suited to caches and cache
| lines).
| pkaye wrote:
| This articles talks about some of the reasons in the section
| "Disk-Induced Constraints".
| masklinn wrote:
| Yes, except for #2 those reasons apply essentially as-is to
| in-memory b-trees if you replace "memory" with "cache" and
| "disk" with "memory".
| dragontamer wrote:
| DDR4 has a burst-length of 8, meaning the smallest you can
| write is a block of 64 bytes (corresponding to a modern
| cache-line).
|
| It also should be noted that DDR4 has "rows" roughly in the
| 1024 byte size. A RAS command opens up a "row" (aka 1024
| bytes), while a CAS command reads from a column (aka:
| 64-bytes).
|
| DDR4 cells can only be read once (!!!), and afterwards, the
| data gets mangled. To prevent the loss of data, all "reads"
| are first into sense-amplifiers... and these sense-
| amplifiers can be read infinitely.
|
| This "read into sense-amplifiers" is called a RAS command,
| and it transfers the entire 1024-byte row into sense
| amplifiers. A CAS can then read 64-byte chunks from the
| page at high speeds and randomly.
|
| --------
|
| Before a new RAS command can be issued, there needs to be a
| few steps.
|
| 1. The "current row" of data needs to be written back to
| DRAM (remember, the data in the DRAM was lost when you read
| the data in the first place).
|
| 2. After writing the data, the sense-amplifiers must be
| emptied (aka: precharged), so that they're ready to receive
| the new data.
|
| 3. After #1 and #2, it is finally legal to issue a RAS
|
| --------
|
| So in fact, DDR4 RAM is also a block device. Sure, RAS is
| much faster than a hard-drive seek, but given how much
| slower RAS (row address strobe) is than a CAS (column
| address strobe)... a lot of these disk-based discussions
| (such as B-trees) end up applying to DDR4 in practice.
|
| --------
|
| The only thing that really works like classic RAM is L3 /
| L2 / L1 caches. Everything else in the modern world is
| basically a block device.
| toolslive wrote:
| B+Trees minimize the maximum depth of the lookup. All the values
| have exactly the same cost. In all honesty, I don't care about
| the lookup cost of the values I never need.
| bob1029 wrote:
| Sounds like you are looking for a Splay Tree implementation.
| Their dynamic optimality properties ensure your most popular
| data lives right at the root.
| toolslive wrote:
| For example. In essence any tree that allows you to "rotate"
| based on some statistics about subtree popularity. This
| becomes more and more interesting in combination with copy-
| on-write strategies because there you have to rewrite the top
| of the tree anyway. So why not aggregate some statistics
| while you're at it.
| dragontamer wrote:
| I feel like FAT and FAT32 back in the DOS days were so simple and
| easy to explain, that my generation of programmers have the
| "sweet point" in terms of learning technology.
|
| FAT / FAT32 wasn't about B-trees or balancing or anything. It was
| a simple linked list, that got severely tangled up during use and
| required an hour of "defragmenting" to make your hard drive go
| back at high speed.
|
| As a computer user of the time, I was too young / ignorant to
| know why it happened like that, but that experience stuck with me
| until college. In college, I learned about how FAT32 filesystem
| worked fundamentally (though obsolete, its so simple its a great
| introduction).
|
| From there, I learned why we have moved onto more advanced data-
| structures like B-Trees. Why things like B-trees are less prone
| to fragmentation. Etc. etc.
|
| ----------
|
| A big part of my learning process was the hours of waiting I'd
| put in to defrag my hard drive back in the day. Those memories
| stuck with me, I know what the "downsides" are if I don't use a
| self-balancing B-Tree to store data on a hard drive.
|
| Perhaps today, we're no longer at a point where we can just
| convince legions of new computer-users (or programmers) to use
| inferior file systems like FAT32 in their daily lives. But we can
| at least tell stories about those times, so that they can
| understand why we put so much complexity into our modern
| filesystem data-structures.
| nanidin wrote:
| Wasn't defragmentation more important in general on spinning
| media to reduce random seeks? And now most performance oriented
| things are happening on SSD?
| dragontamer wrote:
| I'm very confused by your question. I don't think FAT or
| FAT32 was ever a performance-oriented filesystem.
|
| Back then, we were just happy having a personal computer. I'm
| sure there was better tech being used somewhere. But MS-DOS /
| Windows was just your cheap $99 OS in an age when other
| compilers and OSes cost you something like $1000 to $10,000.
|
| You formatted your hard-drive in FAT because MS-DOS used FAT.
| You used MS-DOS because that's what PCs and PC-clones came
| with. There was no alternative, and PCs were already
| considered a pretty luxury / nerd item. (I know people brag
| about their DEC Alphas, Unix copies or OS/2... but you got
| enough "nerd cred" back then if you simply had a PC at all)
|
| Its just how computers worked back then. Every PC was using
| FAT, and every few months of regular use the disk
| reads/writes would get so slow that performance was
| unbearable.
|
| At that point, you spent an hour defragmenting the hard
| drive, and everything worked great again.
|
| -------
|
| Eventually 4GB hard drives were made, and you upgraded your
| computer to FAT32. Soon after that, Windows promised an end
| to defragmenting with its next-generation NTFS filesystem
| (based off of the B-trees discussed in this article) and we
| never looked back.
| spanspan wrote:
| This is really cool. I recently attempted to build a toy
| database, and subsequently implemented my own b-tree (
| https://github.com/spandanb/learndb-py). I ended up running into
| a lot of these issues.
|
| I also did a write-up on why everyone should engage in similar
| projects: https://www.spandanbemby.com/joys-of-database.html
___________________________________________________________________
(page generated 2021-08-18 23:01 UTC)