[HN Gopher] Database Fundamentals
       ___________________________________________________________________
        
       Database Fundamentals
        
       Author : tontinton
       Score  : 519 points
       Date   : 2023-12-15 15:28 UTC (7 hours ago)
        
 (HTM) web link (tontinton.com)
 (TXT) w3m dump (tontinton.com)
        
       | abriosi wrote:
       | Thank you very much for putting the article together
        
       | praveer13 wrote:
       | Great article. The book Database Internals looks amazing. Are
       | there any other such books that deep dive into internals?
        
         | cmrdporcupine wrote:
         | Not a book, but I recommend the database class lectures from
         | @apavlo's group at CMU.
         | 
         | https://www.youtube.com/c/cmudatabasegroup
         | 
         | All the classes (intro and advanced) are online, as well as
         | presentations and lectures about industry products.
         | 
         | They are very useful.
         | 
         | Also, from a more high level theoretical CS and less physical-
         | implementation focused POV, the "Alice" book ("Foundations of
         | Databases") is amazing (though very dense and mathematical).
         | Focuses on the relational algebra and Datalog (and the
         | translation of that into the Rel Algebra). Getting print copies
         | is hard now (my used copy arrived with a broken spine and pages
         | falling out), but the entire book is online:
         | http://webdam.inria.fr/Alice/
        
           | senderista wrote:
           | Another excellent DB textbook covering both theory and
           | implementation is Weikum & Vossen:
           | 
           | https://www.google.com/books/edition/_/wV5Ran71zNoC?hl=en&gb.
           | ..
           | 
           | In case you can't afford to donate $150 to Elsevier: https://
           | www.dropbox.com/scl/fi/r0fros5y2kzeqv57v7myz/Transac...
        
         | flashgordon wrote:
          | Another book I found very useful (though this DI book is more
          | modern) is Raghu Ramakrishnan's Database Management Systems.
        
         | why-el wrote:
          | There is another famous one, focusing on Postgres:
         | https://www.interdb.jp/pg/.
        
           | dgellow wrote:
           | Is it available as a print or ebook?
        
         | big_whack wrote:
         | This is a good paper for a similar kind of overview:
         | https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf
        
       | exacube wrote:
       | Nice article!
       | 
        | There is a bug in the compaction method:
        | 
        |     def compact(sstables, output_sstable):
        |         # Ordered by ascending key. pop() results in the item
        |         # of the smallest key.
        |         heap = heapq.heapify([(sstable.next(), sstable)
        |                               for sstable in sstables])
        |         while (item, sstable) := heap.pop():
        |             if not item.is_tombstone():
        |                 output_sstable.write(item)
       | 
       | You should only skip tombstones when you are compacting the final
       | (i.e., largest) level, instead of between every level. Otherwise,
       | an entry in a lower level will unmask itself because the
       | tombstone in the upper level was compacted away.
       | 
        | It's one of the properties of LSM-based DBs that
        | deletion/tombstone records linger for a long time, though some
       | databases (eg RocksDB) put in some optimizations to get around
       | this.
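        | A minimal runnable sketch of the corrected behavior (hypothetical
        | `Item`/`SSTable` stand-ins, not the article's actual types; a real
        | compaction would also deduplicate same-key entries, keeping only
        | the newest version, which is omitted here as in the article):

```python
import heapq

class Item:
    # Minimal stand-in: a key plus an optional value; value None marks a tombstone.
    def __init__(self, key, value=None):
        self.key, self.value = key, value
    def is_tombstone(self):
        return self.value is None
    def __lt__(self, other):
        return self.key < other.key

class SSTable:
    # Minimal stand-in: an iterator over key-sorted items, plus an output buffer.
    def __init__(self, items=()):
        self._it = iter(items)
        self.written = []
    def next(self):
        return next(self._it, None)
    def write(self, item):
        self.written.append(item)

def compact(sstables, output_sstable, is_final_level):
    # k-way merge: the heap always pops the item with the smallest key.
    heap = [(item, i, sst) for i, sst in enumerate(sstables)
            if (item := sst.next()) is not None]
    heapq.heapify(heap)
    while heap:
        item, i, sst = heapq.heappop(heap)
        # Tombstones are written through at intermediate levels and only
        # dropped when compacting into the final (largest) level; dropping
        # them earlier would unmask older entries in lower levels.
        if not (is_final_level and item.is_tombstone()):
            output_sstable.write(item)
        nxt = sst.next()
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, sst))
```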
        
         | tontinton wrote:
          | You're right, this was purposely left out for brevity; in
          | dbeel it is handled.
        
           | exacube wrote:
           | Ah, makes sense :)
        
         | vlovich123 wrote:
         | What kind of optimizations does RocksDB do? I know they have
         | some stuff around range deletions but not sure I've read
         | anything about point deletions.
        
       | conqrr wrote:
       | Very nice article and looks like a great way to get hands dirty
       | for anyone wanting to think like a DB engineer. Related question,
       | for all the Database engineers out there. I have always had a
       | keen interest to work on Databases. Having worked at one of the
       | cloud providers and burn my high amount of ops and oncall, I am
       | of the opinion most of the Database engineers usually have a bad
       | wlb. Given the depth and complexity of all Database internals,
       | Any major impact would take a few quarters to roll out, and its
       | not a job for everyone. Is this mostly true?
        
         | cmrdporcupine wrote:
         | I worked at a DB product startup, doing DB internals
         | development [storage engine, etc], for a few months (before my
         | personal/family situation forced me to have to move on), and
         | WLB seemed pretty reasonable -- though there was a mandatory
         | on-call rotation, that's kind of expected at most startups in
         | the cloud space.
         | 
         | People were very smart and friendly, and took vacations and all
         | that kind of stuff. I'd say the "worst" part of working there
         | was the amplification of my impostor syndrome?
         | 
         | I don't know if that's typical for the industry as a whole. I'd
         | definitely like to get back into DB internals dev at some
         | point, it's one of the few places in our industry where you get
         | to do fundamental systems dev. The job I had was in many ways
         | my "dream job" but family situation at the time meant I
         | couldn't manage it.
        
           | senderista wrote:
           | I had the most fun of my entire career writing a DB from
           | scratch at my previous job. It was fun while it lasted (i.e.,
           | until the money ran out)...
        
         | aNoob7000 wrote:
         | I think it depends on the environment. Yes, most production and
         | even some test databases are critical to businesses. Any
         | downtime or severe performance issues cause a lot of finger-
         | pointing.
         | 
         | Where I work, most database patches are applied to test
         | environments and let soak for a couple of months before rolling
         | out to production.
         | 
         | And yes, I work a lot on weekends. My only complaint about my
         | career as a DBA is that most of our work is behind the scenes
         | and goes unnoticed until something breaks.
        
           | eatonphil wrote:
           | I can't judge your personal position/responsibilities but
           | normally I think of DBA and database developer as not being
           | the same thing (except for at small database startups where
            | there's more overlap). At medium+-sized database companies
            | I've talked to, the database developers are on call but likely
           | don't have access to production. They are on call for bug
           | reports from production users. Whereas DBAs are folks
           | directly responsible for the database in production.
        
         | Icathian wrote:
         | I have worked on internals of databases for a few years. WLB is
         | just fine, we have oncall rotations same as everyone else. They
         | are deep and complex systems, but mostly what that means is
         | just that average tenure is very high as it takes forever to
         | get up to speed and people generally don't get bored and leave,
         | unless you give them good reasons to.
         | 
         | Overall I enjoy it a lot and would generally recommend it as a
         | good field to get into if you want to tackle hard problems.
         | 
         | Edit: people seem to be conflating DBAs with engineers working
         | on database internals. Those are very different jobs and I'm
         | only talking about the second one.
        
         | cowthulhu wrote:
         | I'm the de facto DB guy at work, WLB is fine... it's very rare
         | that I have to work after-hours. The main reason for this is
         | that our DBs are basically exclusively used during business
         | hours, and if you set things up right it's rare that you'll get
         | anything breaking on its own after-hours.
         | 
         | Also - maybe this is specific to me, but most emergencies I
         | deal with are of my own creation. It's rare to have a DB
          | spontaneously corrupt a page. Most emergency issues I deal with
         | are performance-related (this used to take 2s, now it's taking
         | 200s) and you can mitigate those ahead of time if you're
         | thoughtful (breaking code up so you lean less on the query
         | optimizer, automatic index maintenance, consistent and well-
         | indexed join paths).
         | 
         | Depends what you mean on major impacts. A potentially breaking
         | upgrade is a total nightmare, and never a fun process. But most
         | (almost all?) of the things you do are much lower stakes, and
         | definitely do not take a few quarters.
        
           | cmrdporcupine wrote:
           | I read the commenter as referring to working on DB internals
           | -- that is, _building_ a database -- rather than working
            | supporting workflows/queries on an existing DB product. But
           | maybe I read wrong.
        
       | twoodfin wrote:
       | I recall an hn post a bit back that cataloged all the ways POSIX
       | guaranteed atomicity, which might help bashdb. Rename is the
       | first thing I'd explore.
       | 
       | Maybe this from 2010 is what I was remembering from a repost,
       | alas the link has rotted:
       | 
       | https://news.ycombinator.com/item?id=1035100
        
         | moritonal wrote:
         | https://rcrowley.org/2010/01/06/things-unix-can-do-atomicall...
        
           | tontinton wrote:
           | This is great, maybe I'll try to improve bashdb
        
       | __alias wrote:
        | This article made me laugh because it's exactly like me trying
        | to start a project and going down different rabbit holes.
       | 
        | The author, at step 1 of a project, trying to pick the tech
        | stack, decides to read a book on databases to help with
        | choosing a DB.
       | 
       | Then proceeds to segue to writing his own database.
       | 
       | Then writes a blog about said process
       | 
       | I wonder how that original project has come along? I'm looking
       | forward to when the author gets to the stage of picking a
       | frontend framework
        
         | JoshGlazebrook wrote:
         | Don't forget about the part where you actually do start on the
         | project, but then you read one article or find another
         | tool/software package that makes you second guess everything
         | you've already done and you go back down another rabbit hole.
        
           | pphysch wrote:
           | IME this "second-guessing" is more often right than wrong.
           | You can always return to a project that motivates you, but
           | you can't get back time spent digging a hole deeper, and
           | often it leads to tunnel vision and bad habit formation.
           | 
           | Not every "project" needs to become "finished" or "product".
        
             | JoshGlazebrook wrote:
             | My problem is scope creep. It's much harder to tell myself
             | no vs. being on an engineering team at a company and there
             | being a set process.
        
         | loeber wrote:
         | Yak Shaving: https://seths.blog/2005/03/dont_shave_that/
        
           | spelunker wrote:
           | When doing personal projects I have to constantly be reeling
           | myself back in from doing x thing "The Right Way", because I
           | end up doing a bunch of useless crap and not actually making
           | progress on the personal project.
           | 
           | Easy to fall into that trap when 1) it's just you and 2)
           | there is little accountability because it's just you!
        
             | bob1029 wrote:
             | My tactic for pushing back against this is to try to trick
             | myself into doing the simplest thing that might still work.
             | It's a challenge to write "bad" code on purpose. The
             | opposite of chasing perfect/clean.
             | 
             | I have found that this frees up a lot of weird expectations
             | that you place yourself under. You can get much more
             | creative when everything is a dumb-ass-simple public static
             | member vs when you are spending 2 hours a day fighting a
             | cursed IoC/DI abstraction simply because you are worried
             | that clean code Jesus _might_ be watching you.
             | 
             | It helps to have an end goal too. It's really easy for me
             | to push through a messy prototype when I can see it
             | bringing me closer to a strategic objective.
        
               | hiAndrewQuinn wrote:
               | Bingo. First get it working, then get it right, then get
               | it fast. It's for this reason that almost all of my
               | projects start with a SQLite database - it's a program
               | I'm very familiar with, like an old reliable chef's
               | knife.
        
         | cmrdporcupine wrote:
         | Yeah this is me too apart from the writing a blog part because,
         | uh, why would I want to expose the rest of humanity to my
         | insanity?
        
         | bonestamp2 wrote:
         | It is pretty funny. That said, if it's just a personal project
         | then sometimes it's more about the journey -- smelling every
         | flower is the enjoyable part of the journey. Sometimes.
         | 
         | I mentioned smelling the flowers because I look to young kids
         | for reminders about the little things we sometimes forget to
         | enjoy along the way, even if it's just the short journey from
         | the car to the house. When you're not in a hurry, remember to
         | enjoy the wonderful things that lie in your path.
        
         | tontinton wrote:
         | One day :)
        
       | i_am_a_squirrel wrote:
       | Great read! Thank you! Now do OLAP :p
        
       | adontz wrote:
       | "Consistency - This one doesn't really belong on ACID as a
       | property of databases, as it is a property of the application."
       | Damn, that is not good on so many levels.
       | 
        | First of all, ACID are not properties of a database; they are
        | properties of a database transaction. The article kind of says
        | it, but I feel it is phrased ambiguously. Consistency is a
        | property of a database transaction, which can be easily
        | illustrated by foreign keys:
        | 
        |     AUTHORS
        |     | author_id | name |
        |     |         1 | Mary |
        |     |         2 | John |
        | 
        |     ARTICLES
        |     | article_id | author_id | title               |
        |     |         11 |         2 | About small animals |
        |     |         12 |         1 | About big machines  |
       | 
        | Now ARTICLES cannot have an author_id that does not exist in
        | AUTHORS. If a tuple is inserted into ARTICLES, its author_id
        | must reference a valid tuple in AUTHORS. If a tuple is deleted
        | from AUTHORS, then all ARTICLES tuples referencing it must be
        | deleted too; or the delete from AUTHORS must be rejected. That
        | is consistency, and it is a property of a database transaction.
        | A transaction must fail if one tries to insert an article with
        | author_id equal to 5. At the end of a transaction, article 11
        | must be deleted if author 2 was deleted.
        | 
        | Now real-life databases are much more complex, but the idea is
        | basically the same.
       | 
       | Also, consistency is between tuples. Not having
       | "31-February-2001" as a date is integrity, not consistency.
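        | The foreign-key example above can be exercised with SQLite from
        | the Python stdlib (a sketch; the table and column names follow
        | the example, and note that SQLite requires foreign-key
        | enforcement to be switched on per connection):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
con.execute("CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY,
        author_id  INTEGER NOT NULL
                   REFERENCES authors(author_id) ON DELETE CASCADE,
        title      TEXT
    )""")
con.execute("INSERT INTO authors VALUES (1, 'Mary'), (2, 'John')")
con.execute("INSERT INTO articles VALUES (11, 2, 'About small animals')")
con.execute("INSERT INTO articles VALUES (12, 1, 'About big machines')")

# Inserting an article with author_id = 5 must fail: consistency.
try:
    con.execute("INSERT INTO articles VALUES (13, 5, 'Ghost author')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# Deleting author 2 cascades: article 11 goes away with its author.
con.execute("DELETE FROM authors WHERE author_id = 2")
remaining = [r[0] for r in con.execute(
    "SELECT article_id FROM articles ORDER BY article_id")]
```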
        
         | joshxyz wrote:
         | damn right sir, and there is also the term "consistency levels"
         | which vary from one db to another.
        
         | tontinton wrote:
         | Thanks, that's something I didn't know!
         | 
         | I'll fix it later, and change the wording to say that these are
         | properties of a database transaction
        
       | dpc_01234 wrote:
        | The atomicity in the bash version can be "simply" achieved by
        | copying the file to a temporary, modifying it, then using `sync;
        | mv; sync`, right?
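        | Roughly, yes -- the same temp-file-then-rename pattern sketched
        | in Python (os.replace is an atomic rename on POSIX; full
        | durability would also need an fsync of the containing directory,
        | omitted here):

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    # Write to a temp file in the same directory (rename is only atomic
    # within one filesystem), flush and fsync it, then rename it over
    # the target. rename() atomically replaces the destination, so
    # readers see either the old file or the new one, never a torn mix.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # the atomic step ("mv" in the bash version)
    except BaseException:
        os.unlink(tmp)
        raise
```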
        
         | tontinton wrote:
         | Oh you're right I'll add that in later.
        
           | n3storm wrote:
            | I would go:
            | 
            | 1. count records
            | 2. make a copy
            | 3. insert
            | 4. ensure records = records + 1
            | 5. if records != records + 1, or grep/stat/filesize checks
            |    on the file fail, then assume it is corrupted and roll
            |    back to the copy
        
           | o11c wrote:
           | Use `look` to get `O(log(n))` lookups (writes are still slow,
           | but you could use a multi-level chain I guess). `join` does
           | not use binary search even though it could have.
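            | What `look` exploits is just binary search over sorted
            | lines; an in-memory sketch with the stdlib's bisect (the
            | "key,value" record format is hypothetical -- `look` does the
            | equivalent directly on the sorted file without loading it):

```python
import bisect

# The sorted "table": lines of a hypothetical "key,value" file.
lines = sorted(["bob,2", "alice,1", "dave,4", "carol,3"])

def lookup(prefix):
    # Binary search for the first line >= prefix, then collect every
    # line starting with it -- the same O(log n) seek `look prefix file`
    # performs on disk.
    out = []
    i = bisect.bisect_left(lines, prefix)
    while i < len(lines) and lines[i].startswith(prefix):
        out.append(lines[i])
        i += 1
    return out
```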
        
             | tontinton wrote:
             | no waaay, I have never seen look before, thanks!
        
         | ncruces wrote:
          | Also, while you're copying, you can apply an inverse grep
          | filter (grep -v) to avoid duplicates.
         | 
         | And since you're copying, you could maybe ensure sorting, but I
         | don't think doing that with "bash" (plus off the shelf utils)
         | makes a lot of sense.
         | 
          | That's what DJB's CDB (_cdbget_, _cdbmake_, etc.) is for:
         | https://cr.yp.to/cdb.html
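          | The inverse-grep idea, sketched in Python (hypothetical
          | "key,value" line format):

```python
def set_key(lines, key, value):
    # "Inverse grep" while copying: drop any earlier line for `key`,
    # then append the new record, so the copy never holds duplicates.
    kept = [line for line in lines if not line.startswith(key + ",")]
    return kept + [f"{key},{value}"]
```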
        
       | pphysch wrote:
       | I love how the article starts out by demystifying "database" by
       | implementing a trivial one as a bash "one liner". Great hook.
        
       | PeterCorless wrote:
       | "So basically it has a document API like in MongoDB with
       | leaderless replication like in Cassandra and thread-per-core
       | architecture like in ScyllaDB."
       | 
       | Very cool design. And all in Rust!
        
         | tontinton wrote:
         | Thanks :)
        
       | bob1029 wrote:
       | > The books piqued my curiosity enough to write my own little
       | database
       | 
       | I think many developers go through this phase at some point. I
        | wouldn't even try to fight it. You learn so much about what
        | _won't_ work by doing this. Extremely valuable lessons, if you can
       | spare the time.
       | 
       | Building my own databases gave me more respect for the existing
       | solutions than anything else. Getting the bytes to & from disk
       | quickly isn't the hard part. It's doing it reliably for years on
       | end while supporting use cases you couldn't have even dreamed of.
        
         | derefr wrote:
         | > while supporting use cases you couldn't have even dreamed of.
         | 
         | I often wonder: given how much of the complexity of modern
         | DBMSes exists because of constraints imposed by certain use-
         | cases that only pertain in certain business domains... what
         | efficiencies could we gain if we designed a _domain-specific_
         | DBMS with the knowledge that use-cases outside of the domain
         | are off-limits and can be ignored?
         | 
         | For example, I currently use general-purpose DBs to deal with
         | datasets that are fundamentally append-only. But what if I had
         | a DB that _only_ supported working with append-only data? A DB
         | that literally has no concept of an update or delete of
          | existing rows -- you've got insert, and _maybe_ dropping a
          | table/dataset entirely, and that's it. Could such a DB get
         | away with not implementing MVCC transactions? Could such a DB
         | avoid using a write-ahead log (because the tables are each
         | their own write-ahead logs)? Would it store data more
         | efficiently? Could its indexing be chunkwise-atomic rather than
         | whole-table-atomic and so require less locking? Etc.
        
           | diatone wrote:
           | Check out TigerBeetle!
        
           | lichtenberger wrote:
           | Have a look into my DB project: https://sirix.io |
           | https://github.com/sirixdb/sirix
           | 
           | https://sirix.io/docs/concepts.html and in progress tutorial
           | https://sirix.io/docs/jsoniq-tutorial.html may be especially
           | helpful.
           | 
            | It basically turns updates into appends and is based on a
            | persistent tree structure (only the header, with a reference
            | to the (revision) root page, has to be swapped atomically;
            | other than that, the revision indexes for new data are always
            | appended). In order to reduce copy-on-write overhead for
            | updated page fragments, a sliding snapshot for the data pages
            | is applied.
           | 
           | Naturally, unchanged pages are simply referenced (e.g.
           | through offsets into the file, thus sharing unchanged pages
           | between revisions).
           | 
           | What's also special is a path summary of all unordered paths
           | in a resource, which enables user-defined smaller tailored
           | secondary indexes and other query rewrites :-)
        
             | derefr wrote:
             | How does Sirix compare to LMDB (esp. MDBX)?
             | 
             | (I ask because AFAIK LMDB derivatives do a similar-sounding
             | thing: it updates pages within a write-transaction by first
             | allocating freelist pages to use to write out new copies of
             | those pages with the changes included; these changes
             | recurse upward because the pages are storing a B-tree,
             | until a modified copy of the root page is made; a commit-
             | log pointer is updated to point to the new root page; and
             | then the old rewritten pages are put into the freelist.)
        
               | lichtenberger wrote:
               | Basically, it retains all revisions. Furthermore, the
               | main document index is a keyed trie, much like hash array
               | mapped tries. That is storing an array as a tree and
               | using compact page layouts (bitmaps, 4 references pages,
               | full pages) to reduce the page sizes if they are not
               | full. However, Sirix assigns monotonically increasing,
               | immutable, unique node identifiers, thus most inner pages
               | are full with references to the next level pages (also
               | checksums of the child pages are stored along with the
               | references as in ZFS). The height of the tree increases
               | dynamically. Currently every inner page stores at most
               | 1024 references, thus it's a very wide tree, but we
               | should experiment with other sizes.
               | 
               | The leaf pages of the trie store either the data
               | itself/the nodes or nodes of the path summary, nodes of
               | the secondary indexes...
               | 
               | Thus, we have the main document index, but a
               | RevisionRootPage also has references to the tries, which
               | store the secondary indexes. The secondary indexes are
               | read into main memory / are reconstructed from the leaf
               | pages of the tries (usually small), also a small path
               | summary.
               | 
               | The data pages are not simply copied... only nodes, which
               | changed or fall out of a sliding window. Thus, a page may
               | have to be reconstructed in-memory from at most a small
               | number N of page fragments in the worst case. Thus, it
               | needs a device, which is suitable for fast random, small
               | sized parallel reads and sequential writes.
               | 
               | Currently you have to copy a resource starting from a
               | given revision and applying all updates up to the most
               | recent revision with intermediate commits in order to get
               | rid of old revisions, as it only uses one data file per
               | resource (a resource is equivalent to a table in a
               | relational system). Thus, the data files are basically
               | logs. Another file simply stores offsets and timestamps
               | read into memory to retrieve a given revision.
               | 
               | https://sirix.io/docs/concepts.html
               | 
               | and
               | 
               | https://sirix.io/docs/jsoniq-tutorial.html
               | 
               | Should probably help to get a further understanding.
               | 
               | HTH and let me know if you're interested in more details
               | :-)
               | 
               | Thanks for asking
        
           | tshaddox wrote:
           | The irony is an extremely popular general-purpose database
           | like PostgreSQL can sometimes work better* for niche use
           | cases than less popular databases designed specifically for
           | that niche use case.
           | 
           | * better, or perhaps nearly as well such that it's not worth
           | learning how to use and maintain the less popular single-
           | purpose database.
        
             | derefr wrote:
             | I certainly wouldn't recommend anyone _start with_ a
             | special-purpose DBMS. General-purpose DBMSes will get you
             | quite far indeed. (They got my data API company to millions
             | in ARR!)
             | 
             | But when you hit a certain scale, and your general-purpose
             | DBMS starts to struggle with things like "delivering
             | millions of sub-1ms reads per second", and you start to
             | look at what would be required to scale horizontally to
             | thousands of elastic nodes with low start-up times and good
             | QoS while serving your 100s-of-TBs dataset... you _might_
             | just start Greenspunning something. Something that you
             | might not at first realize is a domain-specific DBMS. But,
             | once you _do_ realize that, you may wonder whether someone
             | has already done it better.
             | 
             | And _that_ (rather small) set of people, are the intended
             | customers of a domain-specific DBMS.
        
           | kyllo wrote:
           | It's a lot easier to go distributed if you're append-only,
           | that's for sure.
        
           | bob1029 wrote:
           | The append only constraint is super nice until it's not.
           | Developing a way to manage this as the dataset scales is
           | challenging. Replaying a 100gb log after a crash could become
           | a problem. I've built entire product prototypes around
           | something like this, but you always reach a point where it
           | has a bug while you were way up in business logic land and so
           | it feels like being ripped right down to hell. It's no longer
           | fun after the first few cases of that experience.
        
             | derefr wrote:
             | I think you're speaking here about 1. queue-based event-
             | store systems, e.g. combining a compacting Kafka topic with
             | a CQRS reducer agent to serve as a store-of-record +
             | snapshot-state representation respectively; 2. where you've
             | likely _re_ structured what were fundamentally CRUD-esque
             | operations into CQRS commands and events, to hold in the
             | store, just so that they can be folded back down to CRUD
             | updates to a snapshot living in an RDBMS by the reducer? I
             | do agree that this kind of system can get really
             | messy/painful once you have a lot of data.
             | 
             | But I'm thinking more about:
             | 
             | 3. "the Dynamo/Hadoop model, in the small": a single-
             | process client-server row-store, but where a table is made
             | of immutable, read-only 64MB chunks, with the newest chunk
             | being an "open" buffer that becomes "closed" as soon as
             | it's filled up; and where these chunks are of a known file-
             | format such that you could directly compose them outside
             | the DBMS in parallel and push them to the DBMS _as chunks_
             | to be ingested  "whole" (i.e. "mounted" as micro-partitions
             | of the table);
             | 
             | 4. where the business-domain's types are already
              | _fundamentally immutable_ datatypes, that don't need any
              | "projecting" into a CQRS representation; i.e. where the
              | thing the DB exists to store (and query on!) _is_ the
              | immutable business data, not some latest-computed-state
              | reduction over it, so there's no need to ever play a log
             | into a CQRS reducer, let alone _replay_ that log.
             | 
             | I know that, for example, Clickhouse's MergeTree engine is
             | at its core akin to the kind of system I'm describing in 3
             | above -- but because Clickhouse is designed as a general-
             | purpose RDBMS (and so still offers an API that includes
             | UPDATE and DELETE) rather than being purpose-built for use-
             | case 4 above, it needs to do a whole bunch of stuff _on top
             | of_ that constrained chunked-immutable storage primitive,
              | to virtualize those updates/deletes pre-merge, and to
             | present MVCC transactions. Same with, for another example,
             | CouchDB: an immutable-data-store core, with a ton of logic
             | on top to allow the user to pretend they can update/delete.
             | 
             | If you imagine a version of Clickhouse or CouchDB that
             | _was_ solely focused on delivering use-case 4 above, then
             | you could strip away all the  "extra stuff." For use-case
             | 4, the "64MB immutable-once-full micro-partitions" paradigm
             | is literally all that's needed to losslessly convey all
             | domain state; and so a storage engine akin to the one
             | described in 3 is all you need to support it.
             | 
             | (If you're wondering, the business domain I'm working in,
             | where this pertains, is: analytical querying of blockchain
             | data. All the "CQRS events" [blockchain blocks and their
             | transactions] come from third parties, and are all
             | guaranteed-immutable upon insert, if not guaranteed-
             | _canonical_. [But canonicity can be tracked as a log of
              | chain tip changes, like a git commit log.] If you don't
             | care about blockchains, though, domains with similar needs
             | for immutable append-only analytical stores include
             | financial forensic accounting, and structured API audit-
             | logging as state for rule engines inside Intrusion
             | Detection Systems.)
        
               | lichtenberger wrote:
               | It's fundamentally how SirixDB approaches this (basically
               | also storing checksums) as also written in another reply
               | :-)
               | 
               | Every commit directly syncs the binary data to the
               | durable storage (currently a file) and incrementally adds
                | data. Furthermore, it optionally stores the changes (type
                | of change/ctx node/updatePosition... in JSON files). For
                | instance, lately I've implemented a simple copy mechanism
                | based on this: copy a given revision, and optionally apply
                | all changes with intermediate commits, to also copy the
                | full history up to the most recent revision. However,
               | the main idea is to use the change tracking also for diff
               | visualizations... maybe even stream these via web
               | sockets.
               | 
               | A production ready system BTW may be Datomic.
               | 
               | And it also reminds me of this paper:
               | https://dl.acm.org/doi/abs/10.5555/3275366.3284969
        
           | azurelake wrote:
           | That's more or less what Kafka is.
        
             | derefr wrote:
             | Kafka doesn't have fast random-access to a row-tuple in a
             | stream by its primary key, let alone declarative indexing
             | by compound keys / computed expressions.
             | 
              | Kafka _with_ those things would be equivalent to what I'm
             | describing, yes.
        
               | lichtenberger wrote:
               | What about storing the data and thus, the indexes in
               | Kafka. Would it make sense? Let's say currently, I'm
               | storing SirixDB resources in files. However, instead of
               | offsets into a file the index pages could be stored in
               | Kafka optionally (or Pulsar...). Is Kafka too slow for
                | this, or only for specific row-tuples? We could make a
                | combined storage, caching the pages locally or in the
                | file system, and asynchronously storing them in Kafka,
                | S3, or whatever.
        
           | polskibus wrote:
            | The "advanced" part of Andy Pavlo's great database
            | course discusses classic database design compromises and the
           | tradeoffs, among other things.
           | 
           | See http://www.cs.cmu.edu/~pavlo/
        
           | tivert wrote:
           | > I often wonder: given how much of the complexity of modern
           | DBMSes exists because of constraints imposed by certain use-
           | cases that only pertain in certain business domains... what
           | efficiencies could we gain if we designed a domain-specific
           | DBMS with the knowledge that use-cases outside of the domain
           | are off-limits and can be ignored?
           | 
           | And how many serious problems would be caused by people not
           | correctly understanding what use-cases could be ignored in
           | their domain?
           | 
           | IMHO, complexity in the service of giving an easier-to-
           | reason-about interface is usually well worth the cost.
        
             | jiggawatts wrote:
             | Precisely this type of thinking resulted in S3, which was
             | the answer to: "what if we drop everything in a traditional
             | file system that is not strictly required"?
             | 
             | Several modern database platforms are similarly designed
             | around one or more traditional interface assumptions being
             | dropped.
             | 
             | IMHO, the biggest one is "interactive transactions" or
             | alternatively: long-running transactions. Anything external
             | to the database that can interact with it in one
             | transaction requires locks and all of the complexity that
             | comes with it, such as dealing with deadlocks.
             | 
             | Dropping the requirement for "interactive Telnet-like SQL
             | shells" can massively simplify things.
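The "no interactive transactions" idea can be illustrated with a small sketch (hypothetical Python, with sqlite3 standing in for the engine; `run_transaction` is an invented name): the client submits the entire transaction as one function, so there is never an open transaction idling across client round-trips.

```python
import sqlite3


def run_transaction(conn, txn_fn):
    # The engine receives the whole transaction up front and runs it to
    # completion: commit on success, rollback on any exception. No lock
    # is ever held while waiting on the network or on user think-time,
    # because there is no interactive session to sit inside.
    with conn:  # sqlite3 context manager: commit on exit, rollback on error
        return txn_fn(conn)


# Usage: a transfer submitted as one unit, never as an interactive shell.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])


def transfer(c):
    c.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
    c.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")


run_transaction(conn, transfer)
```

This is roughly the shape several modern platforms impose: you hand over a complete unit of work and get back a result, which removes the deadlock-prone lock-holding that Telnet-style SQL shells require.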
        
           | krab wrote:
           | There are niche databases like that.
           | 
           | Check out https://evitadb.io/ for a very different use case.
           | It has a rich query language supporting various e-commerce
           | aggregations (as in faceting). They claim it benchmarks 100x
           | faster than Postgres for this specific use case.
           | 
           | Although it looks very cool, if I'm not building an e-shop,
           | I'm not using it. There will be some unfavorable tradeoffs
           | for my case.
        
       | LAC-Tech wrote:
       | Tangential but I'm a bit blown away by that mongoDB mention -
       | that it just flushes every 100ms, so you can lose writes. The
       | link in the article is invalid, but from
       | https://www.mongodb.com/docs/current/core/journaling/ we see:
       | 
       | "In between write operations, while the journal records remain in
       | the WiredTiger buffers, updates can be lost following a hard
       | shutdown of mongod."
       | 
       | Does Mongodb acknowledge writes as soon as they've hit the
       | buffers or do they wait for an fsync? Because if the former,
       | that's... shocking for a database people use as a source of
       | truth.
        
         | tontinton wrote:
         | Thanks for pointing out the link is bad! I just fixed it.
         | 
         | And yeah this was shocking to me as well.
        
       | dvas wrote:
        | Enjoyed reading it! A great overview of the different concepts
        | involved in building DBs, from using SIMD to squeeze out
        | performance on a single machine all the way to consensus
        | algorithms.
        | 
        | While on the topic of DBs, reliability, and distributed systems:
        | formal methods, and how they can be applied to database
        | internals, are another concept worth reading up on for anyone
        | interested.
       | 
       | Interesting paper on the S3 team using TLA+ to model.
       | 
       | [0] Use of Formal Methods at Amazon Web Services
       | https://lamport.azurewebsites.net/tla/formal-methods-amazon....
       | 
       | [1] How Amazon Web Services uses formal methods
       | https://www.amazon.science/publications/how-amazon-web-servi...
        
       | pyrolistical wrote:
       | > Please avoid using distributed systems when non distributed
       | solutions suffice.
       | 
       | I would give the opposite advice.
       | 
       | Every non-trivial production system is a distributed system. At
       | the very least your database is a replica set and therefore a
       | distributed system. So avoid learning distributed systems at your
       | own peril.
       | 
       | Check out https://jepsen.io/ and https://raft.github.io/
        
         | emerongi wrote:
         | > Every non-trivial production system is a distributed system.
         | 
         | In the HN bubble, yes. From the perspective of the average
         | business, definitely not. Or at least, doesn't have to be.
        
           | eatonphil wrote:
           | Not to get reductionist but if you interact with more than
           | one (e.g. Linux) process, it's already a distributed system.
           | 
           | > A distributed system is a system whose components are
           | located on different networked computers, which communicate
           | and coordinate their actions by passing messages to one
           | another.
           | 
           | https://en.wikipedia.org/wiki/Distributed_computing
        
             | nrr wrote:
             | I'm unsure whether it's worth splitting hairs over the
             | definition of what a distributed system is. If you have to
             | linearize concurrent access to shared resources, that's
             | where the bulk of your pain will come from.
             | 
             | In the distributed system case, at least as far as that
             | term is commonly understood by technologists, that
             | linearization also has to take into account odd behaviors
             | in interconnects that you don't typically see within a
             | single machine.
        
               | vlovich123 wrote:
               | You wouldn't classify storage as an odd behaving
               | interconnect?
        
               | nrr wrote:
               | No. I classify storage (read: DASDs, not main memory) as
               | universally cursed.
        
           | golergka wrote:
           | Even the P and M of the classic LAMP stack already form a
           | distributed system.
        
             | kag0 wrote:
             | I agree, but in the context of this conversation I think
             | we're just talking about databases/state. So "is M a
             | distributed system on its own?"
        
         | PeterCorless wrote:
         | If you are running shard-per-core on more than one core, you're
         | a distributed system.
        
         | majormajor wrote:
         | Some parts of your system may not be able to avoid network
         | calls or distributed aspects. That doesn't mean introducing
          | them _everywhere_ doesn't cause a lot _more_ complexity than
          | you'd otherwise have.
        
         | mekoka wrote:
         | > Every non-trivial production system is a distributed system.
         | 
          | And now you have to define _non-trivial_. Thrown out like
          | this, it does not invalidate the suggestion to avoid needless
          | complexity.
         | 
         |  _Learning_ and _using_ are different things. The trick is that
         | once you _learn_ distributed systems, you should exert
         | restraint to only _use_ the approach if it actually fits better
          | than the simpler non-distributed paradigm. Lately, there's
          | been considerable effort put into transitioning simple and
          | perfectly working systems to the distributed model, as if it
          | were, at worst, zero cost. But then you look at the problem
          | being solved, the scale of the solution, and it's obvious that
          | the single postgres instance speaking to a monolith was just
          | fine.
        
       | zackmorris wrote:
       | This is a great article!
       | 
       | But fsyncgate makes me cry inside. It sounds like PostgreSQL made
        | the decision to crash instead of retrying forever upon fsync errors
       | (which is the wrong way to do it). But it wasn't their fault,
       | because OSs and filesystems don't always perform correctly when
       | they run out of disk space. And they were able to avoid data loss
       | by continuing from the last point in the log before the crash
       | (and maybe have a supervisor process relaunch the database) so in
       | the end, it may not matter.
       | 
       | The 2010s were a strange time. People went to great lengths to
       | work around perceived limitations in the tooling. Stuff like,
       | MongoDB would sometimes lose data, even when it claimed that a
       | write completed. And people were ok with that, because they
       | wanted the speed and freedom in order to iterate quickly and make
       | money.
       | 
       | When we really should have been formally addressing problems from
       | first principles. Because if our OS can't recover gracefully when
        | the disk fills up, then we don't really have anything. And so
        | many OSs (including our desktops) panic when the disk fills up
        | that we don't really have anything. At a low enough level, all
        | of the tools we rely upon are trash.
       | 
       | But the point I wanted to make is: there's no such thing as
       | exceptional behavior, there's only how many edge cases have been
       | handled. Programs can be converted from imperative to functional
       | by replacing exceptions with blocking (or spin-locking) retries
       | so that their logic eventually finishes executing. In this case,
       | by letting programs wait until the user has cleared off disk
       | space, rather than forcing the user to orchestrate relaunching
       | processes. In other words, synchronous blocking logic can be
       | converted to a spreadsheet (flowchart) and reasoned about. But
       | asynchronous nonblocking logic allows state explosion by forcing
       | the user to deal with meta eventualities outside the core
       | business logic.
       | 
       | So it's understandable how PostgreSQL made the decision to go
       | with a nonblocking fault instead of blocking in 2018. But in
       | 2023, a better approach might be to write something like a
       | container with proper fsync behavior to protect PostgreSQL from
       | immature filesystems.
        
         | marcosdumay wrote:
          | I really envy your optimism, but no, filesystems in 2023 are
         | fucked too, both on the host and the guest. Disks are still
         | fucked in that they will tell you they saved your data, and
         | still lose it due to some failure, or when they are full. The
         | disk communication bus is a bit less fucked though, so that's
         | improvement.
         | 
         | But anyway, you still can't rely on good behavior from your
         | fsync, or even that a successful fsync means your data is safe.
        
         | ElectricalUnion wrote:
         | > When we really should have been formally addressing problems
         | from first principles. Because if our OS can't recover
         | gracefully when the disk fills up, then we don't really have
         | anything. And so many OSs (including our desktops) panic when
         | the disk fills up, that we don't really have anything. At a low
         | enough level, all of the tools we rely upon are trash.
         | 
         | Unfortunately, the OS can't really "recover" most of the time
         | because it is being constantly lied to by several(!!!) layers
         | of uncooperative hardware/firmware. It's zebras all the way
         | down:
         | 
         | https://www.youtube.com/watch?v=fE2KDzZaxvE
         | 
         | > In this case, by letting programs wait until the user has
         | cleared off disk space
         | 
         | If your system is really in a no-disk-space-left state you
         | might not be able to login to fix it:
         | 
         | https://unix.stackexchange.com/questions/165427/cant-login-d...
         | 
         | And even assuming you can login, sometimes you can't delete
         | files either; you need to free space in another way before
         | being able to delete files.
         | 
         | https://serverfault.com/questions/478733/rm-can-not-remove-x...
         | 
         | Filesystems and persistence are hard.
        
         | charcircuit wrote:
         | >there's no such thing as exceptional behavior, there's only
         | how many edge cases have been handled.
         | 
         | Some edge cases can be hard to handle. What if the processor
          | takes the wrong branch? What if one or more bit flips happen?
          | At some point, it is just better to crash than to try to
          | actually handle such an edge case.
        
       | hiAndrewQuinn wrote:
        | Articles like this remind me that I could basically build an
        | entire career, and some version of every web service I could
        | ever want, off of PostgreSQL, Django, and React at this point.
        | We're living in good times, my friends.
        
       | fjwyasdf wrote:
       | What tool do you use to make those diagrams? They look great!
        
         | tontinton wrote:
         | https://excalidraw.com/
        
           | fjwyasdf wrote:
           | Thanks!
        
       | mahastore wrote:
        | There has been nothing man-made in the world more elegant and
        | beautiful than Unix.
        
       | zschoche wrote:
       | Quick question: what are the NP-hard problems in the database
       | area?
        
         | PartiallyTyped wrote:
          | Solving for constraints, and inverting queries, i.e., going
          | from query -> data, with filtering and all that.
        
       | vaidhy wrote:
        | One quick comment on consistency - there is DB consistency and
        | app consistency. For example, you can achieve A, I, and D on
        | one table at a time, but fail on a multi-table write.
        | 
        | Once you deal with transactions that can update multiple tables
        | at the same time, consistency comes into play: all the tables
        | should be updated at the same time, or none.
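The multi-table case can be illustrated with a small sketch (a hypothetical example using Python's sqlite3; the tables and constraint are invented for illustration): either both tables change together, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE inventory (item TEXT PRIMARY KEY,
                            stock INTEGER CHECK (stock >= 0));
    INSERT INTO inventory VALUES ('widget', 1);
""")


def place_order(item):
    # Both tables must change together or not at all: if the stock
    # update fails, the already-inserted order row rolls back with it.
    with conn:  # one transaction spanning both statements
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute(
            "UPDATE inventory SET stock = stock - 1 WHERE item = ?", (item,))


place_order("widget")            # ok: one order, stock drops to 0
try:
    place_order("widget")        # CHECK (stock >= 0) fails: whole
except sqlite3.IntegrityError:   # transaction rolls back, including
    pass                         # the order row inserted above
```

Without the enclosing transaction, the failed second order would still leave an orphaned row in `orders` — per-table A, I, and D alone don't give you app-level consistency.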
        
         | tontinton wrote:
         | Oh that is a great example, I'll update the post tomorrow.
        
       | tonymet wrote:
        | Many people learn databases by learning SQL.
       | 
       | I recommend people learn DBs by doing a class like this and
       | understanding b-trees.
       | 
        | Nearly everything about the strengths & weaknesses of RDBMS is
       | understood through knowing btrees and how they affect key
       | insertion, lookup & ordering.
       | 
       | Most people try to speed up a DB by adding indexes -- but when
       | you realize you're just adding one tree on top of another you're
       | just masking the problem.
       | 
       | Some problems are well suited toward b-trees, and many aren't.
       | 
       | SQL is just a query interface to a remote b-tree system.
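The "one tree on top of another" point can be made concrete with a toy sketch (Python, using sorted lists with binary search as stand-ins for b-trees; all names are invented for illustration):

```python
import bisect

# The table is one ordered structure keyed by primary key, and a
# secondary index is *another* ordered structure whose values are
# primary keys -- adding an index really is stacking one tree on top
# of another, and every insert must now maintain both.
rows = sorted([(3, "carol"), (1, "alice"), (2, "bob")])  # (pk, name)
pks = [pk for pk, _ in rows]

name_index = sorted((name, pk) for pk, name in rows)     # the second "tree"
names = [name for name, _ in name_index]


def lookup_by_pk(pk):
    i = bisect.bisect_left(pks, pk)      # O(log n), like a b-tree descent
    if i < len(rows) and rows[i][0] == pk:
        return rows[i][1]


def lookup_by_name(name):
    i = bisect.bisect_left(names, name)  # descend the secondary tree...
    if i < len(names) and names[i] == name:
        return name_index[i][1]          # ...to recover the primary key
```

A query by name pays for two descents (secondary index, then primary), and an insert pays to keep both structures ordered — which is why an index trades write cost for read speed rather than making anything free.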
        
         | lichtenberger wrote:
         | Well, it may be a B-tree, or an LSM-tree, a trie or whatever
         | index structure suits...
         | 
         | Also, of course you may have covering indexes.
        
         | ljm wrote:
         | I think this is too reductive. A b-tree isn't the only indexing
         | strategy, and it's well understood that an index is intended to
         | improve read performance at the expense of write performance
         | because the common case is that your DB handles a lot more
         | reads than writes.
         | 
         | > but when you realize you're just adding one tree on top of
         | another you're just masking the problem.
         | 
         | What is the problem and how do you propose to solve it without
          | touching indexes? They are a hard requirement for any
          | decent-sized table.
        
       | lichtenberger wrote:
       | "Database Design and Implementation" by Edward Sciore is also a
        | great read, with a lot of examples written in Java (actually
        | a small DBS).
       | 
       | For actual research I'd recommend all stuff from Andy (Pavlo),
       | Viktor Leis, Thorsten Grust, Thomas Neumann...
        
       | dxtrous wrote:
       | I was hoping to see an LSMT implementation in bash and left
       | slightly disappointed.
        
       ___________________________________________________________________
       (page generated 2023-12-15 23:00 UTC)