[HN Gopher] Database Fundamentals
___________________________________________________________________
Database Fundamentals
Author : tontinton
Score : 519 points
Date : 2023-12-15 15:28 UTC (7 hours ago)
(HTM) web link (tontinton.com)
(TXT) w3m dump (tontinton.com)
| abriosi wrote:
| Thank you very much for putting the article together
| praveer13 wrote:
| Great article. The book Database Internals looks amazing. Are
| there any other such books that deep dive into internals?
| cmrdporcupine wrote:
| Not a book, but I recommend the database class lectures from
| @apavlo's group at CMU.
|
| https://www.youtube.com/c/cmudatabasegroup
|
| All the classes (intro and advanced) are online, as well as
| presentations and lectures about industry products.
|
| They are very useful.
|
| Also, from a more high level theoretical CS and less physical-
| implementation focused POV, the "Alice" book ("Foundations of
| Databases") is amazing (though very dense and mathematical).
| Focuses on the relational algebra and Datalog (and the
| translation of that into the Rel Algebra). Getting print copies
| is hard now (my used copy arrived with a broken spine and pages
| falling out), but the entire book is online:
| http://webdam.inria.fr/Alice/
| senderista wrote:
| Another excellent DB textbook covering both theory and
| implementation is Weikum & Vossen:
|
| https://www.google.com/books/edition/_/wV5Ran71zNoC?hl=en&gb.
| ..
|
| In case you can't afford to donate $150 to Elsevier: https://
| www.dropbox.com/scl/fi/r0fros5y2kzeqv57v7myz/Transac...
| flashgordon wrote:
| Another book I found very useful (though this DI book is more
| modern) is Raghu Ramakrishnan's Database Management Systems.
| why-el wrote:
| There is another famous one, focusing on Postgres:
| https://www.interdb.jp/pg/.
| dgellow wrote:
| Is it available as a print or ebook?
| big_whack wrote:
| This is a good paper for a similar kind of overview:
| https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf
| exacube wrote:
| Nice article!
|
| There is a bug in the compaction method:
|
|     def compact(sstables, output_sstable):
|         # Ordered by ascending key. pop() results in the item of the smallest key.
|         heap = heapq.heapify([(sstable.next(), sstable) for sstable in sstables])
|         while (item, sstable) := heap.pop()
|             if not item.is_tombstone():
|                 output_sstable.write(item)
|
| You should only skip tombstones when you are compacting into the
| final (i.e., largest) level, not on every level. Otherwise, an
| older entry in a lower level becomes visible again because the
| tombstone in the upper level that was masking it got compacted away.
|
| It's one of the properties of LSM-based DBs that deletion/tombstone
| records linger for a long time, though some databases (e.g. RocksDB)
| put in some optimizations to get around this.
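|
| A rough sketch of a compaction that respects this (not the article's
| code; it assumes each input sstable is an iterator of items sorted by
| key, passed newest-first, and that items expose .key and
| .is_tombstone()):
|
|     import heapq
|
|     def compact(sstables, output_sstable, is_bottom_level):
|         # Merge all inputs in ascending key order. Ties on equal keys go
|         # to the earlier iterable, so pass sstables newest-first.
|         merged = heapq.merge(*sstables, key=lambda item: item.key)
|         previous_key = None
|         for item in merged:
|             if item.key == previous_key:
|                 continue      # older duplicate of a key already emitted
|             previous_key = item.key
|             if item.is_tombstone() and is_bottom_level:
|                 continue      # nothing below can resurface, safe to drop
|             # On higher levels the tombstone is written out, so it keeps
|             # masking older entries in the levels below.
|             output_sstable.write(item)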
| tontinton wrote:
| You're right, this was purposely left out for brevity; it is
| handled in dbeel.
| exacube wrote:
| Ah, makes sense :)
| vlovich123 wrote:
| What kind of optimizations does RocksDB do? I know they have
| some stuff around range deletions but not sure I've read
| anything about point deletions.
| conqrr wrote:
| Very nice article, and it looks like a great way to get your hands
| dirty for anyone wanting to think like a DB engineer. Related
| question for all the database engineers out there: I have always had
| a keen interest in working on databases. Having worked at one of the
| cloud providers and burned out on the high amount of ops and oncall,
| I am of the opinion that most database engineers have a bad WLB.
| Given the depth and complexity of database internals, any major
| impact would take a few quarters to roll out, and it's not a job for
| everyone. Is this mostly true?
| cmrdporcupine wrote:
| I worked at a DB product startup, doing DB internals
| development [storage engine, etc], for a few months (before my
| personal/family situation forced me to have to move on), and
| WLB seemed pretty reasonable -- though there was a mandatory
| on-call rotation, that's kind of expected at most startups in
| the cloud space.
|
| People were very smart and friendly, and took vacations and all
| that kind of stuff. I'd say the "worst" part of working there
| was the amplification of my impostor syndrome?
|
| I don't know if that's typical for the industry as a whole. I'd
| definitely like to get back into DB internals dev at some
| point, it's one of the few places in our industry where you get
| to do fundamental systems dev. The job I had was in many ways
| my "dream job" but family situation at the time meant I
| couldn't manage it.
| senderista wrote:
| I had the most fun of my entire career writing a DB from
| scratch at my previous job. It was fun while it lasted (i.e.,
| until the money ran out)...
| aNoob7000 wrote:
| I think it depends on the environment. Yes, most production and
| even some test databases are critical to businesses. Any
| downtime or severe performance issues cause a lot of finger-
| pointing.
|
| Where I work, most database patches are applied to test
| environments and let soak for a couple of months before rolling
| out to production.
|
| And yes, I work a lot on weekends. My only complaint about my
| career as a DBA is that most of our work is behind the scenes
| and goes unnoticed until something breaks.
| eatonphil wrote:
| I can't judge your personal position/responsibilities but
| normally I think of DBA and database developer as not being
| the same thing (except for at small database startups where
| there's more overlap). At the medium+-sized database companies
| I've talked to, the database developers are on call but likely
| don't have access to production. They are on call for bug
| reports from production users. Whereas DBAs are folks
| directly responsible for the database in production.
| Icathian wrote:
| I have worked on internals of databases for a few years. WLB is
| just fine, we have oncall rotations same as everyone else. They
| are deep and complex systems, but mostly what that means is
| just that average tenure is very high as it takes forever to
| get up to speed and people generally don't get bored and leave,
| unless you give them good reasons to.
|
| Overall I enjoy it a lot and would generally recommend it as a
| good field to get into if you want to tackle hard problems.
|
| Edit: people seem to be conflating DBAs with engineers working
| on database internals. Those are very different jobs and I'm
| only talking about the second one.
| cowthulhu wrote:
| I'm the de facto DB guy at work, WLB is fine... it's very rare
| that I have to work after-hours. The main reason for this is
| that our DBs are basically exclusively used during business
| hours, and if you set things up right it's rare that you'll get
| anything breaking on its own after-hours.
|
| Also - maybe this is specific to me, but most emergencies I
| deal with are of my own creation. It's rare to have a DB
| spontaneously corrupt a page. Most emergency issues I deal with
| are performance-related (this used to take 2s, now it's taking
| 200s) and you can mitigate those ahead of time if you're
| thoughtful (breaking code up so you lean less on the query
| optimizer, automatic index maintenance, consistent and well-
| indexed join paths).
|
| Depends what you mean on major impacts. A potentially breaking
| upgrade is a total nightmare, and never a fun process. But most
| (almost all?) of the things you do are much lower stakes, and
| definitely do not take a few quarters.
| cmrdporcupine wrote:
| I read the commenter as referring to working on DB internals
| -- that is, _building_ a database -- rather than working
| on supporting workflows/queries for an existing DB product. But
| maybe I read wrong.
| twoodfin wrote:
| I recall an hn post a bit back that cataloged all the ways POSIX
| guaranteed atomicity, which might help bashdb. Rename is the
| first thing I'd explore.
|
| Maybe this from 2010 is what I was remembering from a repost,
| alas the link has rotted:
|
| https://news.ycombinator.com/item?id=1035100
| moritonal wrote:
| https://rcrowley.org/2010/01/06/things-unix-can-do-atomicall...
| tontinton wrote:
| This is great, maybe I'll try to improve bashdb
| __alias wrote:
| This article made me laugh because it's exactly like me trying
| to start a project and going down different rabbit holes.
|
| The author, at step 1 of a project and trying to pick the tech
| stack, decides to read a book on databases to help with choosing
| a DB.
|
| Then proceeds to segue to writing his own database.
|
| Then writes a blog about said process
|
| I wonder how that original project has come along? I'm looking
| forward to when the author gets to the stage of picking a
| frontend framework
| JoshGlazebrook wrote:
| Don't forget about the part where you actually do start on the
| project, but then you read one article or find another
| tool/software package that makes you second guess everything
| you've already done and you go back down another rabbit hole.
| pphysch wrote:
| IME this "second-guessing" is more often right than wrong.
| You can always return to a project that motivates you, but
| you can't get back time spent digging a hole deeper, and
| often it leads to tunnel vision and bad habit formation.
|
| Not every "project" needs to become "finished" or "product".
| JoshGlazebrook wrote:
| My problem is scope creep. It's much harder to tell myself
| no vs. being on an engineering team at a company and there
| being a set process.
| loeber wrote:
| Yak Shaving: https://seths.blog/2005/03/dont_shave_that/
| spelunker wrote:
| When doing personal projects I have to constantly be reeling
| myself back in from doing x thing "The Right Way", because I
| end up doing a bunch of useless crap and not actually making
| progress on the personal project.
|
| Easy to fall into that trap when 1) it's just you and 2)
| there is little accountability because it's just you!
| bob1029 wrote:
| My tactic for pushing back against this is to try to trick
| myself into doing the simplest thing that might still work.
| It's a challenge to write "bad" code on purpose. The
| opposite of chasing perfect/clean.
|
| I have found that this frees up a lot of weird expectations
| that you place yourself under. You can get much more
| creative when everything is a dumb-ass-simple public static
| member vs when you are spending 2 hours a day fighting a
| cursed IoC/DI abstraction simply because you are worried
| that clean code Jesus _might_ be watching you.
|
| It helps to have an end goal too. It's really easy for me
| to push through a messy prototype when I can see it
| bringing me closer to a strategic objective.
| hiAndrewQuinn wrote:
| Bingo. First get it working, then get it right, then get
| it fast. It's for this reason that almost all of my
| projects start with a SQLite database - it's a program
| I'm very familiar with, like an old reliable chef's
| knife.
| cmrdporcupine wrote:
| Yeah this is me too apart from the writing a blog part because,
| uh, why would I want to expose the rest of humanity to my
| insanity?
| bonestamp2 wrote:
| It is pretty funny. That said, if it's just a personal project
| then sometimes it's more about the journey -- smelling every
| flower is the enjoyable part of the journey. Sometimes.
|
| I mentioned smelling the flowers because I look to young kids
| for reminders about the little things we sometimes forget to
| enjoy along the way, even if it's just the short journey from
| the car to the house. When you're not in a hurry, remember to
| enjoy the wonderful things that lie in your path.
| tontinton wrote:
| One day :)
| i_am_a_squirrel wrote:
| Great read! Thank you! Now do OLAP :p
| adontz wrote:
| "Consistency - This one doesn't really belong on ACID as a
| property of databases, as it is a property of the application."
| Damn, that is not good on so many levels.
|
| First of all, ACID are not properties of a database, they are
| properties of a database transaction. The article kind of says it,
| but I feel it is phrased ambiguously. Consistency is a property of a
| database transaction, which can easily be illustrated with foreign
| keys:
|
|     AUTHORS
|     | author_id | name |
|     | 1         | Mary |
|     | 2         | John |
|
|     ARTICLES
|     | article_id | author_id | title               |
|     | 11         | 2         | About small animals |
|     | 12         | 1         | About big machines  |
|
| Now ARTICLES cannot have an author_id that does not exist in
| AUTHORS. If a tuple is inserted in ARTICLES, then author_id should
| reference a valid tuple from AUTHORS. If a tuple is deleted from
| AUTHORS, then all ARTICLES tuples referencing said AUTHORS tuple
| should be deleted too; or the delete from AUTHORS should be
| rejected. That is consistency, and that is a property of a database
| transaction. A transaction must fail if one tries to insert an
| article with author_id equal to 5. At the end of the transaction,
| article 11 must be deleted if author 2 was deleted.
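|
| A quick sketch of the same thing with Python's sqlite3 (a toy version
| of the AUTHORS/ARTICLES schema above; note that SQLite only enforces
| foreign keys after an explicit PRAGMA):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
|     con.execute("CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT)")
|     con.execute("""CREATE TABLE articles (
|         article_id INTEGER PRIMARY KEY,
|         author_id INTEGER NOT NULL
|             REFERENCES authors(author_id) ON DELETE CASCADE,
|         title TEXT)""")
|     con.executemany("INSERT INTO authors VALUES (?, ?)",
|                     [(1, "Mary"), (2, "John")])
|     con.executemany("INSERT INTO articles VALUES (?, ?, ?)",
|                     [(11, 2, "About small animals"),
|                      (12, 1, "About big machines")])
|     con.commit()
|
|     # This transaction must fail: author_id 5 does not exist in AUTHORS.
|     try:
|         with con:
|             con.execute("INSERT INTO articles VALUES (?, ?, ?)",
|                         (13, 5, "Orphan article"))
|     except sqlite3.IntegrityError as e:
|         print("rejected:", e)  # FOREIGN KEY constraint failed
|
|     # Deleting author 2 cascades, so article 11 is removed along with it.
|     with con:
|         con.execute("DELETE FROM authors WHERE author_id = 2")
|     print(con.execute("SELECT article_id FROM articles").fetchall())  # [(12,)]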
|
| Now real-life databases are much more complex, but the idea is
| basically the same.
|
| Also, consistency is between tuples. Not having
| "31-February-2001" as a date is integrity, not consistency.
| joshxyz wrote:
| damn right sir, and there is also the term "consistency levels"
| which vary from one db to another.
| tontinton wrote:
| Thanks, that's something I didn't know!
|
| I'll fix it later, and change the wording to say that these are
| properties of a database transaction
| dpc_01234 wrote:
| The atomicity in the bash version can be "simply" achieved by
| copying the file to a temporary, modifying it, then using `sync;
| mv; sync`, right?
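|
| Something like this, sketched in Python rather than bash (a
| hypothetical atomic_write helper: write a temp file in the same
| directory, fsync it, rename over the original, then fsync the
| directory so the rename itself is durable):
|
|     import os
|
|     def atomic_write(path, data):
|         tmp = path + ".tmp"
|         with open(tmp, "wb") as f:
|             f.write(data)
|             f.flush()
|             os.fsync(f.fileno())      # file contents are on disk
|         os.replace(tmp, path)         # atomic rename within one filesystem
|         dir_fd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
|         try:
|             os.fsync(dir_fd)          # make the new directory entry durable
|         finally:
|             os.close(dir_fd)
|
| Readers then see either the old file or the new one, never a partial
| write.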
| tontinton wrote:
| Oh you're right I'll add that in later.
| n3storm wrote:
| I would go:
|
| 1. count records
| 2. make copy
| 3. insert
| 4. ensure records = records + 1
| 5. if records != records + 1, or the file can be grepped / stat'd /
|    checked for filesize, then assume it is corrupted and roll back
|    to the copy
| o11c wrote:
| Use `look` to get `O(log(n))` lookups (writes are still slow,
| but you could use a multi-level chain I guess). `join` does
| not use binary search even though it could have.
| tontinton wrote:
| no waaay, I have never seen look before, thanks!
| ncruces wrote:
| Also, while you're copying you can apply an inverse grep filter to
| avoid duplicates.
|
| And since you're copying, you could maybe ensure sorting, but I
| don't think doing that with "bash" (plus off the shelf utils)
| makes a lot of sense.
|
| That's what DJB's CDB (_cdbget_, _cdbmake_, etc.) is for:
| https://cr.yp.to/cdb.html
| pphysch wrote:
| I love how the article starts out by demystifying "database" by
| implementing a trivial one as a bash "one liner". Great hook.
| PeterCorless wrote:
| "So basically it has a document API like in MongoDB with
| leaderless replication like in Cassandra and thread-per-core
| architecture like in ScyllaDB."
|
| Very cool design. And all in Rust!
| tontinton wrote:
| Thanks :)
| bob1029 wrote:
| > The books piqued my curiosity enough to write my own little
| database
|
| I think many developers go through this phase at some point. I
| wouldn't even try to fight it. You learn so much about what
| _won't_ work by doing this. Extremely valuable lessons, if you can
| spare the time.
|
| Building my own databases gave me more respect for the existing
| solutions than anything else. Getting the bytes to & from disk
| quickly isn't the hard part. It's doing it reliably for years on
| end while supporting use cases you couldn't have even dreamed of.
| derefr wrote:
| > while supporting use cases you couldn't have even dreamed of.
|
| I often wonder: given how much of the complexity of modern
| DBMSes exists because of constraints imposed by certain use-
| cases that only pertain in certain business domains... what
| efficiencies could we gain if we designed a _domain-specific_
| DBMS with the knowledge that use-cases outside of the domain
| are off-limits and can be ignored?
|
| For example, I currently use general-purpose DBs to deal with
| datasets that are fundamentally append-only. But what if I had
| a DB that _only_ supported working with append-only data? A DB
| that literally has no concept of an update or delete of
| existing rows -- you've got insert, and _maybe_ dropping a
| table/dataset entirely, and that's it. Could such a DB get
| away with not implementing MVCC transactions? Could such a DB
| avoid using a write-ahead log (because the tables are each
| their own write-ahead logs)? Would it store data more
| efficiently? Could its indexing be chunkwise-atomic rather than
| whole-table-atomic and so require less locking? Etc.
| diatone wrote:
| Check out TigerBeetle!
| lichtenberger wrote:
| Have a look into my DB project: https://sirix.io |
| https://github.com/sirixdb/sirix
|
| https://sirix.io/docs/concepts.html and in progress tutorial
| https://sirix.io/docs/jsoniq-tutorial.html may be especially
| helpful.
|
| It basically turns updates into appends and is based on a
| persistent tree structure (the header with a reference to the
| (revision) root page has to be swapped atomically). Other than
| that, the revision indexes for new data are always appended.
| In order to reduce copy-on-write overhead for updated page
| (fragments) a sliding snapshot for the data pages is applied.
|
| Naturally, unchanged pages are simply referenced (e.g.
| through offsets into the file, thus sharing unchanged pages
| between revisions).
|
| What's also special is a path summary of all unordered paths
| in a resource, which enables user-defined smaller tailored
| secondary indexes and other query rewrites :-)
| derefr wrote:
| How does Sirix compare to LMDB (esp. MDBX)?
|
| (I ask because AFAIK LMDB derivatives do a similar-sounding
| thing: it updates pages within a write-transaction by first
| allocating freelist pages to use to write out new copies of
| those pages with the changes included; these changes
| recurse upward because the pages are storing a B-tree,
| until a modified copy of the root page is made; a commit-
| log pointer is updated to point to the new root page; and
| then the old rewritten pages are put into the freelist.)
| lichtenberger wrote:
| Basically, it retains all revisions. Furthermore, the
| main document index is a keyed trie, much like hash array
| mapped tries. That is storing an array as a tree and
| using compact page layouts (bitmaps, 4 references pages,
| full pages) to reduce the page sizes if they are not
| full. However, Sirix assigns monotonically increasing,
| immutable, unique node identifiers, thus most inner pages
| are full with references to the next level pages (also
| checksums of the child pages are stored along with the
| references as in ZFS). The height of the tree increases
| dynamically. Currently every inner page stores at most
| 1024 references, thus it's a very wide tree, but we
| should experiment with other sizes.
|
| The leaf pages of the trie store either the data
| itself/the nodes or nodes of the path summary, nodes of
| the secondary indexes...
|
| Thus, we have the main document index, but a
| RevisionRootPage also has references to the tries, which
| store the secondary indexes. The secondary indexes are
| read into main memory / are reconstructed from the leaf
| pages of the tries (usually small), also a small path
| summary.
|
| The data pages are not simply copied... only nodes, which
| changed or fall out of a sliding window. Thus, a page may
| have to be reconstructed in-memory from at most a small
| number N of page fragments in the worst case. Thus, it
| needs a device, which is suitable for fast random, small
| sized parallel reads and sequential writes.
|
| Currently you have to copy a resource starting from a
| given revision and applying all updates up to the most
| recent revision with intermediate commits in order to get
| rid of old revisions, as it only uses one data file per
| resource (a resource is equivalent to a table in a
| relational system). Thus, the data files are basically
| logs. Another file simply stores offsets and timestamps
| read into memory to retrieve a given revision.
|
| https://sirix.io/docs/concepts.html
|
| and
|
| https://sirix.io/docs/jsoniq-tutorial.html
|
| Should probably help to get a further understanding.
|
| HTH and let me know if you're interested in more details
| :-)
|
| Thanks for asking
| tshaddox wrote:
| The irony is an extremely popular general-purpose database
| like PostgreSQL can sometimes work better* for niche use
| cases than less popular databases designed specifically for
| that niche use case.
|
| * better, or perhaps nearly as well such that it's not worth
| learning how to use and maintain the less popular single-
| purpose database.
| derefr wrote:
| I certainly wouldn't recommend anyone _start with_ a
| special-purpose DBMS. General-purpose DBMSes will get you
| quite far indeed. (They got my data API company to millions
| in ARR!)
|
| But when you hit a certain scale, and your general-purpose
| DBMS starts to struggle with things like "delivering
| millions of sub-1ms reads per second", and you start to
| look at what would be required to scale horizontally to
| thousands of elastic nodes with low start-up times and good
| QoS while serving your 100s-of-TBs dataset... you _might_
| just start Greenspunning something. Something that you
| might not at first realize is a domain-specific DBMS. But,
| once you _do_ realize that, you may wonder whether someone
| has already done it better.
|
| And _that_ (rather small) set of people, are the intended
| customers of a domain-specific DBMS.
| kyllo wrote:
| It's a lot easier to go distributed if you're append-only,
| that's for sure.
| bob1029 wrote:
| The append only constraint is super nice until it's not.
| Developing a way to manage this as the dataset scales is
| challenging. Replaying a 100gb log after a crash could become
| a problem. I've built entire product prototypes around
| something like this, but you always reach a point where it
| has a bug while you were way up in business logic land and so
| it feels like being ripped right down to hell. It's no longer
| fun after the first few cases of that experience.
| derefr wrote:
| I think you're speaking here about 1. queue-based event-
| store systems, e.g. combining a compacting Kafka topic with
| a CQRS reducer agent to serve as a store-of-record +
| snapshot-state representation respectively; 2. where you've
| likely _re_structured what were fundamentally CRUD-esque
| operations into CQRS commands and events, to hold in the
| store, just so that they can be folded back down to CRUD
| updates to a snapshot living in an RDBMS by the reducer? I
| do agree that this kind of system can get really
| messy/painful once you have a lot of data.
|
| But I'm thinking more about:
|
| 3. "the Dynamo/Hadoop model, in the small": a single-
| process client-server row-store, but where a table is made
| of immutable, read-only 64MB chunks, with the newest chunk
| being an "open" buffer that becomes "closed" as soon as
| it's filled up; and where these chunks are of a known file-
| format such that you could directly compose them outside
| the DBMS in parallel and push them to the DBMS _as chunks_
| to be ingested "whole" (i.e. "mounted" as micro-partitions
| of the table);
|
| 4. where the business-domain's types are already
| _fundamentally immutable_ datatypes, that don't need any
| "projecting" into a CQRS representation; i.e. where the
| thing the DB exists to store (and query on!) _is_ the
| immutable business data, not some latest-computed-state
| reduction over it, so there's no need to ever play a log
| into a CQRS reducer, let alone _replay_ that log.
|
| I know that, for example, Clickhouse's MergeTree engine is
| at its core akin to the kind of system I'm describing in 3
| above -- but because Clickhouse is designed as a general-
| purpose RDBMS (and so still offers an API that includes
| UPDATE and DELETE) rather than being purpose-built for use-
| case 4 above, it needs to do a whole bunch of stuff _on top
| of_ that constrained chunked-immutable storage primitive,
| to virtualize those updates/deletes pre-merge, and to
| present MVCC transactions. Same with, for another example,
| CouchDB: an immutable-data-store core, with a ton of logic
| on top to allow the user to pretend they can update/delete.
|
| If you imagine a version of Clickhouse or CouchDB that
| _was_ solely focused on delivering use-case 4 above, then
| you could strip away all the "extra stuff." For use-case
| 4, the "64MB immutable-once-full micro-partitions" paradigm
| is literally all that's needed to losslessly convey all
| domain state; and so a storage engine akin to the one
| described in 3 is all you need to support it.
|
| (If you're wondering, the business domain I'm working in,
| where this pertains, is: analytical querying of blockchain
| data. All the "CQRS events" [blockchain blocks and their
| transactions] come from third parties, and are all
| guaranteed-immutable upon insert, if not guaranteed-
| _canonical_. [But canonicity can be tracked as a log of
| chain tip changes, like a git commit log.] If you don't
| care about blockchains, though, domains with similar needs
| for immutable append-only analytical stores include
| financial forensic accounting, and structured API audit-
| logging as state for rule engines inside Intrusion
| Detection Systems.)
| lichtenberger wrote:
| It's fundamentally how SirixDB approaches this (basically
| also storing checksums) as also written in another reply
| :-)
|
| Every commit directly syncs the binary data to the
| durable storage (currently a file) and incrementally adds
| data. Furthermore, it stores optionally the changes (type
| of change/ctx node/updatePosition... in JSON files). For
| instance, lately I've implemented a simple copy mechanism
| based on this. Copy a given revision and optionally apply
| all changes with intermediate commits to also copy the
| full history up to the most recent revision. However,
| the main idea is to use the change tracking also for diff
| visualizations... maybe even stream these via web
| sockets.
|
| A production ready system BTW may be Datomic.
|
| And it also reminds me of this paper:
| https://dl.acm.org/doi/abs/10.5555/3275366.3284969
| azurelake wrote:
| That's more or less what Kafka is.
| derefr wrote:
| Kafka doesn't have fast random-access to a row-tuple in a
| stream by its primary key, let alone declarative indexing
| by compound keys / computed expressions.
|
| Kafka _with_ those things would be equivalent to what I 'm
| describing, yes.
| lichtenberger wrote:
| What about storing the data, and thus the indexes, in
| Kafka? Would it make sense? Let's say currently, I'm
| storing SirixDB resources in files. However, instead of
| offsets into a file the index pages could be stored in
| Kafka optionally (or Pulsar...). Is Kafka too slow for
| this or only for specific row-tuples? We could make a
| combined storage caching the pages locally or also
| storing in the file system and asynchronously storing in
| Kafka, S3, or whatever.
| polskibus wrote:
| The "advanced" part of Andy Pavlo's great great database
| course discusses classic database design compromises and the
| tradeoffs, among other things.
|
| See http://www.cs.cmu.edu/~pavlo/
| tivert wrote:
| > I often wonder: given how much of the complexity of modern
| DBMSes exists because of constraints imposed by certain use-
| cases that only pertain in certain business domains... what
| efficiencies could we gain if we designed a domain-specific
| DBMS with the knowledge that use-cases outside of the domain
| are off-limits and can be ignored?
|
| And how many serious problems would be caused by people not
| correctly understanding what use-cases could be ignored in
| their domain?
|
| IMHO, complexity in the service of giving an easier-to-
| reason-about interface is usually well worth the cost.
| jiggawatts wrote:
| Precisely this type of thinking resulted in S3, which was
| the answer to: "what if we drop everything in a traditional
| file system that is not strictly required"?
|
| Several modern database platforms are similarly designed
| around one or more traditional interface assumptions being
| dropped.
|
| IMHO, the biggest one is "interactive transactions" or
| alternatively: long-running transactions. Anything external
| to the database that can interact with it in one
| transaction requires locks and all of the complexity that
| comes with it, such as dealing with deadlocks.
|
| Dropping the requirement for "interactive Telnet-like SQL
| shells" can massively simplify things.
| krab wrote:
| There are niche databases like that.
|
| Check out https://evitadb.io/ for a very different use case.
| It has a rich query language supporting various e-commerce
| aggregations (as in faceting). They claim it benchmarks 100x
| faster than Postgres for this specific use case.
|
| Although it looks very cool, if I'm not building an e-shop,
| I'm not using it. There will be some unfavorable tradeoffs
| for my case.
| LAC-Tech wrote:
| Tangential but I'm a bit blown away by that mongoDB mention -
| that it just flushes every 100ms, so you can lose writes. The
| link in the article is invalid, but from
| https://www.mongodb.com/docs/current/core/journaling/ we see:
|
| "In between write operations, while the journal records remain in
| the WiredTiger buffers, updates can be lost following a hard
| shutdown of mongod."
|
| Does Mongodb acknowledge writes as soon as they've hit the
| buffers or do they wait for an fsync? Because if the former,
| that's... shocking for a database people use as a source of
| truth.
| tontinton wrote:
| Thanks for pointing out the link is bad! I just fixed it.
|
| And yeah this was shocking to me as well.
| dvas wrote:
| Enjoyed reading it! Great overview of different concepts involved
| while building DB's. From mentioning SIMD to squeeze out
| performance on a single machine all the way to consensus
| algorithms.
|
| While on the topic of DBs, reliability, and distributed systems:
| formal methods, and how they can be applied to database internals
| in these situations, are another concept worth reading up on for
| anyone interested.
|
| Interesting paper on the S3 team using TLA+ to model.
|
| [0] Use of Formal Methods at Amazon Web Services
| https://lamport.azurewebsites.net/tla/formal-methods-amazon....
|
| [1] How Amazon Web Services uses formal methods
| https://www.amazon.science/publications/how-amazon-web-servi...
| pyrolistical wrote:
| > Please avoid using distributed systems when non distributed
| solutions suffice.
|
| I would give the opposite advice.
|
| Every non-trivial production system is a distributed system. At
| the very least your database is a replica set and therefore a
| distributed system. So avoid learning distributed systems at your
| own peril.
|
| Check out https://jepsen.io/ and https://raft.github.io/
| emerongi wrote:
| > Every non-trivial production system is a distributed system.
|
| In the HN bubble, yes. From the perspective of the average
| business, definitely not. Or at least, doesn't have to be.
| eatonphil wrote:
| Not to get reductionist but if you interact with more than
| one (e.g. Linux) process, it's already a distributed system.
|
| > A distributed system is a system whose components are
| located on different networked computers, which communicate
| and coordinate their actions by passing messages to one
| another.
|
| https://en.wikipedia.org/wiki/Distributed_computing
| nrr wrote:
| I'm unsure whether it's worth splitting hairs over the
| definition of what a distributed system is. If you have to
| linearize concurrent access to shared resources, that's
| where the bulk of your pain will come from.
|
| In the distributed system case, at least as far as that
| term is commonly understood by technologists, that
| linearization also has to take into account odd behaviors
| in interconnects that you don't typically see within a
| single machine.
| vlovich123 wrote:
| You wouldn't classify storage as an odd behaving
| interconnect?
| nrr wrote:
| No. I classify storage (read: DASDs, not main memory) as
| universally cursed.
| golergka wrote:
| Even the P and M of the classic LAMP stack already form a
| distributed system.
| kag0 wrote:
| I agree, but in the context of this conversation I think
| we're just talking about databases/state. So "is M a
| distributed system on its own?"
| PeterCorless wrote:
| If you are running shard-per-core on more than one core, you're
| a distributed system.
| majormajor wrote:
| Some parts of your system may not be able to avoid network
| calls or distributed aspects. That doesn't mean introducing
| them _everywhere_ doesn't cause a lot _more_ complexity than
| you'd otherwise have.
| mekoka wrote:
| > Every non-trivial production system is a distributed system.
|
| And now you have to define _non-trivial_. Just thrown like this
| it does not invalidate the suggestion to avoid needless
| complexity.
|
| _Learning_ and _using_ are different things. The trick is that
| once you _learn_ distributed systems, you should exert
| restraint to only _use_ the approach if it actually fits better
| than the simpler non-distributed paradigm. Lately, there's been
| considerable effort put into transitioning simple and perfectly
| working systems to the distributed model, as if it's, at worst,
| zero cost. But then you look at the problem being solved, the
| scale of the solution, and it's obvious that the single
| postgres instance speaking to a monolith was just fine.
| zackmorris wrote:
| This is a great article!
|
| But fsyncgate makes me cry inside. It sounds like PostgreSQL made
| the decision to crash instead of retry forever upon fsync errors
| (which is the wrong way to do it). But it wasn't their fault,
| because OSs and filesystems don't always perform correctly when
| they run out of disk space. And they were able to avoid data loss
| by continuing from the last point in the log before the crash
| (and maybe have a supervisor process relaunch the database) so in
| the end, it may not matter.
|
| The 2010s were a strange time. People went to great lengths to
| work around perceived limitations in the tooling. Stuff like,
| MongoDB would sometimes lose data, even when it claimed that a
| write completed. And people were ok with that, because they
| wanted the speed and freedom in order to iterate quickly and make
| money.
|
| When we really should have been formally addressing problems from
| first principles. Because if our OS can't recover gracefully when
| the disk fills up, then we don't really have anything. And so
| many OSs (including our desktops) panic when the disk fills up,
| that we don't really have anything. At a low enough level, all of
| the tools we rely upon are trash.
|
| But the point I wanted to make is: there's no such thing as
| exceptional behavior, there's only how many edge cases have been
| handled. Programs can be converted from imperative to functional
| by replacing exceptions with blocking (or spin-locking) retries
| so that their logic eventually finishes executing. In this case,
| by letting programs wait until the user has cleared off disk
| space, rather than forcing the user to orchestrate relaunching
| processes. In other words, synchronous blocking logic can be
| converted to a spreadsheet (flowchart) and reasoned about. But
| asynchronous nonblocking logic allows state explosion by forcing
| the user to deal with meta eventualities outside the core
| business logic.
|
| So it's understandable how PostgreSQL made the decision to go
| with a nonblocking fault instead of blocking in 2018. But in
| 2023, a better approach might be to write something like a
| container with proper fsync behavior to protect PostgreSQL from
| immature filesystems.
| marcosdumay wrote:
| I really envy your optimism, but no, filesystems at 2023 are
| fucked too, both on the host and the guest. Disks are still
| fucked in that they will tell you they saved your data, and
| still lose it due to some failure, or when they are full. The
| disk communication bus is a bit less fucked though, so that's
| improvement.
|
| But anyway, you still can't rely on good behavior from your
| fsync, or even that a successful fsync means your data is safe.
| ElectricalUnion wrote:
| > When we really should have been formally addressing problems
| from first principles. Because if our OS can't recover
| gracefully when the disk fills up, then we don't really have
| anything. And so many OSs (including our desktops) panic when
| the disk fills up, that we don't really have anything. At a low
| enough level, all of the tools we rely upon are trash.
|
| Unfortunately, the OS can't really "recover" most of the time
| because it is being constantly lied to by several(!!!) layers
| of uncooperative hardware/firmware. It's zebras all the way
| down:
|
| https://www.youtube.com/watch?v=fE2KDzZaxvE
|
| > In this case, by letting programs wait until the user has
| cleared off disk space
|
| If your system is really in a no-disk-space-left state you
| might not be able to login to fix it:
|
| https://unix.stackexchange.com/questions/165427/cant-login-d...
|
| And even assuming you can login, sometimes you can't delete
| files either; you need to free space in another way before
| being able to delete files.
|
| https://serverfault.com/questions/478733/rm-can-not-remove-x...
|
| Filesystems and persistence are hard.
| charcircuit wrote:
| >there's no such thing as exceptional behavior, there's only
| how many edge cases have been handled.
|
| Some edge cases can be hard to handle. What if the processor
| takes the wrong branch? What if one or more bit flips happen?
| At some point it is just better to crash than to try and
| actually handle such an edge case.
| hiAndrewQuinn wrote:
| Articles like this remind me I could basically make an entire
| career and some version of every web service I could ever want
| off of PostgreSQL, Django, and React at this point. We're living
| in good times my friends.
| fjwyasdf wrote:
| What tool do you use to make those diagrams? They look great!
| tontinton wrote:
| https://excalidraw.com/
| fjwyasdf wrote:
| Thanks!
| mahastore wrote:
| There has been nothing man-made in the world that is more elegant
| and beautiful than Unix.
| zschoche wrote:
| Quick question: what are the NP-hard problems in the database
| area?
| PartiallyTyped wrote:
| Solving for constraints, and inverting queries, i.e. going from a
| query -> data with filtering and all that.
| vaidhy wrote:
| One quick comment on consistency - There is db consistency and
| app consistency. E.g., you can achieve A, I, and D on one table
| at a time, but fail on a multi-table write.
|
| Once you deal with transactions which can update multiple tables
| at the same time, then consistency comes into play. All the tables
| should be updated at the same time, or none.
| tontinton wrote:
| Oh that is a great example, I'll update the post tomorrow.
| tonymet wrote:
| Many people learn databases by learning sql.
|
| I recommend people learn DBs by doing a class like this and
| understanding b-trees.
|
| Nearly everything about the strengths & weaknesses of an RDBMS is
| understood through knowing btrees and how they affect key
| insertion, lookup & ordering.
|
| Most people try to speed up a DB by adding indexes -- but when
| you realize you're just adding one tree on top of another you're
| just masking the problem.
|
| Some problems are well suited toward b-trees, and many aren't.
|
| SQL is just a query interface to a remote b-tree system.
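|
| As a toy in-memory illustration (hypothetical names, nothing like a
| real on-disk B-tree): a secondary index is just one more ordered
| structure that every write has to keep in sync.
|
|     import bisect
|
|     table = {}        # primary key -> row (the base data)
|     name_index = []   # sorted (name, primary_key) pairs (the extra "tree")
|
|     def insert(pk, row):
|         table[pk] = row
|         bisect.insort(name_index, (row["name"], pk))  # every write pays here too
|
|     def find_by_name(name):
|         i = bisect.bisect_left(name_index, (name,))   # O(log n) seek into the index
|         rows = []
|         while i < len(name_index) and name_index[i][0] == name:
|             rows.append(table[name_index[i][1]])
|             i += 1
|         return rows
|
|     insert(1, {"name": "Mary"})
|     insert(2, {"name": "John"})
|     print(find_by_name("John"))   # [{'name': 'John'}]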
| lichtenberger wrote:
| Well, it may be a B-tree, or an LSM-tree, a trie or whatever
| index structure suits...
|
| Also, of course you may have covering indexes.
| ljm wrote:
| I think this is too reductive. A b-tree isn't the only indexing
| strategy, and it's well understood that an index is intended to
| improve read performance at the expense of write performance
| because the common case is that your DB handles a lot more
| reads than writes.
|
| > but when you realize you're just adding one tree on top of
| another you're just masking the problem.
|
| What is the problem and how do you propose to solve it without
| touching indexes? They are a hard requirement for any decent
| sized table.
| lichtenberger wrote:
| "Database Design and Implementation" by Edward Sciore is also a
| great read with a lot of examples written in Java (actually
| a small DBS).
|
| For actual research I'd recommend all stuff from Andy (Pavlo),
| Viktor Leis, Thorsten Grust, Thomas Neumann...
| dxtrous wrote:
| I was hoping to see an LSMT implementation in bash and was left
| slightly disappointed.
___________________________________________________________________
(page generated 2023-12-15 23:00 UTC)