[HN Gopher] We put a distributed database in the browser and mad...
___________________________________________________________________
We put a distributed database in the browser and made a game of it
Author : BratishkaErik
Score : 181 points
Date : 2023-07-11 13:15 UTC (9 hours ago)
(HTM) web link (tigerbeetle.com)
(TXT) w3m dump (tigerbeetle.com)
| eatonphil wrote:
| Direct link to the game: https://sim.tigerbeetle.com/.
|
| While TigerBeetle is open-source, the game isn't yet. But that
| may change. :)
| pohl wrote:
| This is a sweet idea and a nice, game-esque implementation.
|
| It could definitely use some onboarding.
|
| There's nothing to give the "player" a hint as to what they
| should do. What is my goal? Am I trying to defeat the
| communication of the nodes, or help them? If it's the former, why
| did I seem to win the first level after doing nothing? I started
| by trying to mash some keys. Eventually I saw that there were
| tools in the upper-right, but it wasn't clear what to do with
| them. It's a bit frustrating that they disappear after you grab
| them if your click is too far off the target. After successfully
| applying one of them to one of the targets, it's not clear what
| one should do next. Was it a good move? Who knows, because
| there's no feedback.
| kristoff_it wrote:
| Well, the point is that the replication protocol is going to fix
| all issues no matter what, so you could say that the only
| winning move is to enjoy watching TigerBeetle move forward,
| impervious to scripted or human-inputted issues.
|
| In other words, you're going to 'win' no matter what :^)
| pohl wrote:
| That makes sense: it's not a game at all, but a simulation
| pretending to be one.
| kristoff_it wrote:
| never played "walking simulators" then? plenty of games
| don't feature traditional win-lose mechanics.
| pohl wrote:
| No, I hadn't even heard of that genre. They sound
| potentially fun, though, since at least they have an
| environment to explore.
| jorangreef wrote:
| Thanks for the feedback! This is the first version of our
| educational SimTigerBeetle frontend.
|
| The goal is to see and interact with a distributed database
| running visually under Easy/Medium/Hard levels of fault
| injection, giving you the opportunity to explore how the
| consensus handles failures, poke the otherwise
| deterministically simulated world, or inspect progress of the
| replicas.
|
| So, it's a game in the sim genre, and we've got more to come,
| but the significance is hopefully what we tried to explain in
| the post:
|
| You're seeing a new design of distributed database survive
| cosmic levels of fault injection (8% storage corruption on the
| read path across all replicas in the Radioactive level!), in
| your browser... and you can smash things with a hammer. ;)
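|
| (To make "fault injection on the read path" concrete, here's a
| minimal Python sketch with illustrative names, using a seeded
| PRNG so the run stays deterministic and replayable -- not the
| actual simulator, which is written in Zig:)
|
|     import random
|
|     class FaultyDisk:
|         """Wraps a block store; corrupts reads with a fixed
|         probability, like the Radioactive level's 8% rate."""
|         def __init__(self, blocks, rate=0.08, seed=42):
|             self.blocks = blocks             # block_id -> bytes
|             self.rate = rate
|             self.prng = random.Random(seed)  # seeded: replayable
|
|         def read(self, block_id):
|             data = bytearray(self.blocks[block_id])
|             if self.prng.random() < self.rate:
|                 bit = self.prng.randrange(len(data) * 8)
|                 data[bit // 8] ^= 1 << (bit % 8)  # flip one bit
|             return bytes(data)
|
| Because the PRNG is seeded, any failure found this way can be
| replayed exactly from the same seed.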
|
| You might also appreciate the video [0] at the end of our post,
| where we dive into the three insights behind deterministic
| simulation testing, and how DST moves beyond chaos engineering.
|
| [0] SimTigerBeetle (Director's Cut!)
| https://www.youtube.com/watch?v=Vch4BWUVzMM
| charbuff wrote:
| This is a really cool sales prop
| jedisct1 wrote:
| This is insanely cool!
| Digikid13 wrote:
| It's very confusing that y'all are calling this a "game". There's
| nothing to play, it's just a simulation to watch.
| cstrahan wrote:
| "Simulation" doesn't necessarily entail stimulating
| interactivity and a pleasing aesthetic. As of writing, I don't
| know if there's a better term than "game" for what this is.
| (Maybe we can coin a new term, like "game-lite" -- somewhat
| akin to what "rogue-lite" is to the "rogue" genre).
|
| See "walking simulators":
|
| https://en.wikipedia.org/wiki/What_Remains_of_Edith_Finch
|
| https://en.wikipedia.org/wiki/The_Stanley_Parable
|
| https://en.wikipedia.org/wiki/Thirty_Flights_of_Loving
|
| Plenty argue that these aren't games either (usual complaints
| involve lack of problem(s) to resolve, and no win-lose
| dynamic); but, then, what _are_ these? The closest category I
| can think of would be "computer-animated film", but... these
| are interactive, and you can navigate and look in any direction
| you want, which yields a very different experience than
| watching a film like "Toy Story".
| Apes wrote:
| I think "demo" or "toy" or "exhibit" would be a better term
| than "game", since there's not really any story, or rules, or
| objectives, or anything resembling a normal gameplay loop.
| The level of interactivity is well below what you would
| typically expect from a game, and there's effectively no
| agency in affecting the outcome of what happens. Even walking
| simulators have at least some of those things.
| zerocrates wrote:
| "Walking simulator" is a little tricky to extrapolate from
| since it was originally a perjorative.
| shard wrote:
| I remember reading an article about the difference between
| games and these interactive demos: games have objectives and
| goals, and demos without them are better called "toys".
| captainhorst wrote:
| There are tools in the upper right corner. You can pick one and
| use it on a beetle. ;) This introduces various faults in the
| simulation that the database has to recover from.
|
| Minor spoiler: there's another tiny game hidden at the end of
| the third level.
| reidjs wrote:
| That adds some interactivity, but I still wouldn't call this
| a game yet. Games generally have some sort of challenge you
| must overcome. This is more like a highly polished
| interactive simulation of a distributed db. Still really cool
| and well done.
| [deleted]
| petrosagg wrote:
| > Sure, we're not yet injecting storage faults, but then formal
| proofs for protocols like Raft and Paxos assume that disks are
| perfect, and depend on this for correctness? After all, you can
| always run your database over RAID, right? Right?
|
| > If your distributed database was designed before 2018, you
| probably couldn't have done much. The research didn't exist.
|
| I'm trying to understand this part but something seems off. It
| seems to imply that those proofs do not apply to the real world
| because disks are not perfect, but neither is RAM. The hardware
| for both RAM and disks has an inherent error rate which can be
| brought down to arbitrary levels by using error correction codes
| (e.g ECC RAM).
|
| I'm assuming that the TigerBeetle correctness proofs are
| predicated on perfect RAM even though in reality there is a small
| probability of errors. This tells me that there is an error rate
| which they consider negligible. If that's the case what is the
| difference between:
|
| * TigerBeetle's storage approach
|
| * Paxos or Raft with enough error correction on disk writes that
| the probability of errors equals that of ECC RAM, which is
| considered negligible
|
| I've probably made a logical error in my reasoning but I can't
| see it. Can someone enlighten me?
| jorangreef wrote:
| Thanks for the question! Joran from TigerBeetle here.
|
| The research in question is the 2018 paper from UW-Madison,
| "Protocol-Aware Recovery for Consensus-Based Storage" (PAR) [0]
| by Ram Alagappan, Aishwarya Ganesan, as well as Remzi and
| Andrea Arpaci-Dusseau (who you may recognize as authors of
| OSTEP).
|
| PAR won best paper at FAST '18 for showing that a single disk
| sector fault, in the write-ahead log (WAL) of a single replica,
| could propagate through the distributed Raft or MultiPaxos
| consensus protocol, to cause global cluster data loss.
|
| This was counter-intuitive at the time, because PAR showed that
| the redundancy of these consensus and replication protocols did
| not in fact always imply fault-tolerance, as had previously
| been assumed.
|
| The reason, which we cover in depth in our recent QCon London
| talk [1], is that checksums alone were assumed to be sufficient
| to detect and recover from storage faults.
|
| However, while checksums can be used under the "Crash
| Consistency Model" to solve consistency through power loss, PAR
| showed that checksums are not sufficient to be able to
| distinguish between a torn write at the end of the
| (uncommitted) WAL caused by power loss, and a torn write in the
| middle of the (committed) WAL caused by bitrot.
|
| What you tend to find is that many implementations of these
| protocols will truncate the WAL at the first sign of a checksum
| mismatch, conflating the mismatch with power loss when it might
| be bitrot, thereby truncating committed transactions and
| undermining quorum votes in the Raft or MultiPaxos
| implementations.
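|
| (To make that failure mode concrete, a hedged Python sketch --
| not TigerBeetle's code: a checksum mismatch *followed by valid
| entries* is provably bitrot, while a bad suffix is ambiguous,
| and that ambiguity is exactly the gap PAR closes:)
|
|     import zlib
|
|     def entry_ok(entry):
|         # Each WAL entry carries a CRC32 of its payload.
|         return zlib.crc32(entry["payload"]) == entry["crc"]
|
|     def classify_wal(wal):
|         bad = [i for i, e in enumerate(wal) if not entry_ok(e)]
|         if not bad:
|             return "clean"
|         if bad[-1] < len(wal) - 1:
|             # Valid entries exist after a bad one: a torn tail
|             # can't be followed by completed writes, so this is
|             # bitrot in the committed log. Never truncate here;
|             # repair from the other replicas.
|             return "bitrot at %r: repair from peers" % bad
|         # A bad suffix could be a torn tail from power loss OR
|         # bitrot of committed entries: checksums alone can't
|         # tell, which is why naive truncate-at-first-mismatch
|         # recovery silently drops committed transactions.
|         return "ambiguous suffix %r: needs protocol help" % bad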
|
| RAID solutions don't always help here, either. See "Parity Lost
| and Parity Regained" [2] for more details. RAID-Z is better
| here, and ZFS is a huge inspiration, but with local redundancy
| under ZFS you're still not leveraging the global redundancy of
| the consensus protocol as well as you could be.
|
| To summarize PAR:
|
| Fundamental design changes need to be made to both the global
| consensus protocol and the local storage engine, if the storage
| fault model of PAR (and TigerBeetle) is to be solved correctly.
|
| Furthermore, few simulators even test for these kinds of
| storage faults. For example, misdirected I/O, where the disk
| writes to or reads from the wrong location on disk, which may
| yet carry a valid checksum.
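|
| (A checksum proves a block is internally consistent, not that
| it's the block you asked for. The usual fix, sketched below in
| illustrative Python, is to bind the block's own address and the
| cluster's identity into the checksummed header, so a
| misdirected read or write is detectable even when the checksum
| is valid:)
|
|     import struct, zlib
|
|     def write_block(disk, address, payload, cluster_id):
|         header = struct.pack("<QQ", cluster_id, address)
|         crc = zlib.crc32(header + payload)
|         disk[address] = struct.pack("<I", crc) + header + payload
|
|     def read_block(disk, address, cluster_id):
|         raw = disk[address]
|         crc, = struct.unpack_from("<I", raw)
|         cluster, addr = struct.unpack_from("<QQ", raw, 4)
|         if zlib.crc32(raw[4:]) != crc:
|             raise IOError("corrupt block at %d" % address)
|         if (cluster, addr) != (cluster_id, address):
|             # Valid checksum, wrong place: misdirected I/O.
|             raise IOError("misdirected I/O: wanted %d, got %d"
|                           % (address, addr))
|         return raw[20:]  # the payload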
|
| However, this is important, because disks fail in the real
| world. A single disk has on the order of a 0.5-1% chance of
| corruption in a 2 year period [3]. A 5 node cluster therefore
| has a 2.5-5% chance of a single disk sector fault, which,
| again, per PAR can lead to global cluster data loss.
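|
| (Sanity-checking that arithmetic: the chance that at least one
| of n independent disks corrupts is 1 - (1 - p)^n, which for
| small p is roughly n*p:)
|
|     p, n = 0.005, 5          # 0.5% per disk, 5 replicas
|     print(1 - (1 - p) ** n)  # ~0.0248, i.e. the quoted ~2.5%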
|
| On the other hand, memory (or even CPU) faults, assuming ECC,
| are not in the same order of magnitude of probability, and
| therefore TigerBeetle's memory fault model is to require ECC
| memory.
|
| But, again, to be crystal clear, checksums alone are not
| sufficient to solve the consensus corruption issue. The fix
| requires protocol changes at the design level, for the
| consensus protocol to be made storage fault-aware.
|
| Thanks for the question and happy to answer more!
|
| [0] "Protocol-Aware Recovery for Consensus-Based Storage"
| https://www.usenix.org/conference/fast18/presentation/alagap...
|
| [1] "A New Era for Database Design" (we also dive into the
| research surrounding Fsyncgate, looking into the latent
| correctness issues that remain)
| https://www.youtube.com/watch?v=_jfOk4L7CiY
|
| [2] "Parity Lost and Parity Regained"
| https://www.usenix.org/conference/fast-08/parity-lost-and-pa...
|
| [3] "An Analysis of Data Corruption in the Storage Stack"
| https://www.cs.toronto.edu/~bianca/papers/fast08.pdf
| petrosagg wrote:
| Thank you for the detailed response!
|
| > However, while checksums can be used under the "Crash
| Consistency Model" to solve consistency through power loss,
| PAR showed that checksums are not sufficient to be able to
| distinguish between a torn write at the end of the
| (uncommitted) WAL caused by power loss, and a torn write in
| the middle of the (committed) WAL caused by bitrot.
|
| The PAR paper states that "although Crash preserves safety,
| it suffers from severe unavailability". I assume that when
| TigerBeetle loads state from RAM into a CPU cache/register it
| operates under the NoDetection consistency model or the Crash
| consistency model if ECC RAM automatically resets the CPU on
| read errors. At the same time it doesn't suffer from severe
| unavailability so what gives?
|
| The answer is probably that ECC RAM is just reliable enough
| that the NoDetection/Crash models are fine in practice.
|
| I can believe that off-the-shelf checksum and redundancy
| options offered by filesystems like ext4 and ZFS or systems
| like RAID don't hit the required error probabilities but why
| does the argument stop there? Couldn't a distributed database
| generate error correcting data on every write in the
| application layer so that the probability becomes low enough
| such that NoDetection/Crash become a non-issue for storage,
| just like RAM? Is there some other fundamental difference
| between reading and writing data from RAM versus a disk?
| jorangreef wrote:
| Huge pleasure, thanks again for the question!
|
| The crux of the problem: How do you solve misdirected
| read/write I/O? Where the firmware writes/reads to/from the
| wrong disk sector (but with a valid checksum)?
|
| PAR shows how both global consensus protocol and local
| storage engine need to be modified for this, with
| foundational design changes at the protocol-level, if a
| distributed system is to not only preserve correctness, but
| also optimize for high availability.
|
| Bear in mind that PAR is not only actually correct, but it's
| also more efficient than simply dialing up local redundancy,
| because it lets you recover using the global redundancy that
| you already have via replication in the consensus protocol.
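|
| (The shape of that recovery, as a hedged sketch with
| hypothetical names -- the real logic is in the Zig code linked
| below: corrupt entries at or below the cluster's commit point
| are re-fetched from peers rather than truncated away:)
|
|     def repair_wal(wal, bad_indices, peers, commit_max):
|         for i in bad_indices:
|             if i > commit_max:
|                 wal[i] = None      # uncommitted: safe to discard
|                 continue
|             for peer in peers:
|                 entry = peer.get_entry(i)  # hypothetical RPC
|                 if entry is not None:
|                     wal[i] = entry  # healed via global redundancy
|                     break
|             else:
|                 raise RuntimeError("entry %d lost everywhere" % i)
|         return wal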
|
| The paper is great, but will especially reward a few passes
| of reading. The examples they give take time, but are great
| to work through slowly to gain a deeper understanding.
|
| And/or, you can read the Zig code of PAR in TB! :)
|
| Here's a great place to start, one of our favorite pieces
| of code in TigerBeetle:
| https://github.com/tigerbeetle/tigerbeetle/blob/4aca8a22627b...
| convolvatron wrote:
| you just bury these and play games with p. just like
| distributed consensus, there is no perfect storage medium. 1
| bit flip a week too much for you? add secded. secded on memory.
| interleaving to spread out correlated errors. etc.
|
| at some point you run into the probability of the earth being
| coincident with the sun and you call it good.
|
| none of viewstamped replication, paxos, or raft deal with
| storage errors.
|
| where it does get interesting is that in the low-p spectrum you
| can subvert the correctness of these protocols by fuzzing them.
| and then you get into byzantine protocols.
| jorangreef wrote:
| We tried to emphasize "Protocol-Aware Recovery for Consensus-
| Based Storage" in the blog post, because it's how TigerBeetle
| solves the storage fault model, and because PAR shows how you
| can actually solve this using the redundancy you already have
| in the global consensus protocol.
|
| https://www.usenix.org/conference/fast18/presentation/alagap...
| chubot wrote:
| Not an expert in this area, but I think disks have correlated
| failure modes whereas CPUs and memory generally don't.
|
| Especially spinning platter disks, not sure about SSDs.
|
| The difference in failure rates could be orders of magnitude
| ... Memory will have random bit flips but I think they are
| pretty randomly distributed (or maybe catastrophic if there is
| some cosmic event)
|
| But disks will have non-random manufacturing issues. I'd be
| interested in more info too, but my impression is that the data
| on these issues is pretty thin. FoundationDB mentioned it ~10
| years ago and Google published data >10 years ago, but
| hardware has changed a lot since then.
|
| Software redundancy will take care of non-correlated failures,
| but it fails precisely when there are correlated ones
| imtringued wrote:
| SSDs tend to have highly correlated failure modes because you
| either run into a bug in the firmware which is the same on
| every SSD or you have the same wear on every SSD, which locks
| both into read-only mode within a short period of time. You
| might argue that read-only is not a failure, but read-only
| means downtime and replacing hardware.
| planckscnst wrote:
| I work on a large distributed database system. SSDs
| absolutely have correlated failures. Also CMOS batteries.
| Also CPU and memory (think a manufacturing defect or a
| storage climate issue on specific batches that made it
| through QA). Pretty much nothing is 100% guaranteed to have
| no correlated failures. It comes down to probabilities. You
| can add flexibility, variation, vendor/sourcing diversity to
| reduce risks.
| brimstedt wrote:
| Is this something similar to couch/pouchdb?
| eatonphil wrote:
| TigerBeetle is more domain-specific, i.e. focused on financial
| transactions, high performance, and high availability. There
| are just two entity types in the database: accounts and
| transfers between accounts. More details here if you're
| interested: https://docs.tigerbeetle.com/design/data-modeling
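|
| Roughly, the whole schema fits in two records (a simplified
| sketch; the real structs have more fields, see the docs link
| above):
|
|     from dataclasses import dataclass
|
|     @dataclass
|     class Account:
|         id: int
|         ledger: int        # partitions accounts, e.g. by currency
|         debits_posted: int
|         credits_posted: int
|
|     @dataclass
|     class Transfer:
|         id: int
|         debit_account_id: int
|         credit_account_id: int
|         amount: int
|         ledger: int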
| jazzyjackson wrote:
| In comparison to {C,P}ouchDB, I think the question is around
| offline-first availability. Can a mobile client interact with
| a local database without a degraded experience, and expect
| changes to be synchronized once a connection is
| re-established? I would suppose in the financial transaction
| space this is not A Thing.
| eatonphil wrote:
| > In comparison to {C,P}ouchDB, I think the question is
| around offline-first availability.
|
| Got it, thanks! Yeah, that is indeed not how TigerBeetle
| works. If you ever cannot connect to the cluster, you keep
| retrying messages (idempotently) until you connect and the
| message succeeds.
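|
| I.e. something like this client-side loop (a sketch, not the
| actual client library; `create_transfers` stands in for any
| cluster request): the client-assigned ID is what makes the
| retry idempotent.
|
|     import time, uuid
|
|     def submit(cluster, transfer):
|         # Fix the ID once, so resubmitting is a no-op.
|         transfer.setdefault("id", uuid.uuid4().int >> 64)
|         while True:
|             try:
|                 return cluster.create_transfers([transfer])
|             except ConnectionError:
|                 time.sleep(1)  # unreachable: retry same message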
| jlokier wrote:
| That sounds similar to the account state storage used in
| blockchains. In some of those, high performance and high
| space consumption are technical challenges, and the tendency
| to have a lot of random keys (hashes) adds another.
|
| I wonder if TigerBeetle would be suited to those storage
| performance challenges, and conversely if the low-level
| storage optimisations in certain implementations of
| blockchain account history would be more broadly applicable
| to the financial applications targeted by TigerBeetle.
| eatonphil wrote:
| Yes it is similar, but simpler! We do chat with a few
| companies using blockchain, looking at TigerBeetle for
| better throughput.
|
| > if the low-level storage optimisations in certain
| implementations of blockchain account history would be more
| broadly applicable to the financial applications targeted
| by TigerBeetle.
|
| Maybe! Any examples you're thinking of?
| cdchn wrote:
| This is pretty cool from a database perspective but I'm really
| interested in what they used to write this 'game.'
| eatonphil wrote:
| It's covered a little bit in the Evolution section (add
| #evolution to the URL and hit enter).
|
| If that doesn't answer everything (it's not a long section,
| granted) Fabio (captainhorst) is on HN answering questions in
| this thread already so feel free to ask!
| nlavezzo wrote:
| This is awesome!
| moralestapia wrote:
| Nice work @eatonphil and TB team. Thanks for sharing.
|
| I'm eagerly awaiting the production-ready release (or an RC or
| something) of TB; I've tried it and it truly excels in its
| domain.
|
| I have a few clients that could make good use of it but I have to
| hold my horses a little bit.
| jorangreef wrote:
| Thanks, Alex! Awesome to hear that. SimTigerBeetle has been a
| skunkworks project, on the side, for the past 12 months. We're
| looking forward to sharing the production release of
| TigerBeetle with you soon, when it's ready.
| muhaaa wrote:
| I thought tigerbeetle distributes the db on web clients? I
| skimmed through the docs and cannot find how to design a web app
| with tigerbeetle like the game https://sim.tigerbeetle.com/
| eatonphil wrote:
| Hey! I'm not completely sure I follow the question. But the
| game is built a little differently from how you'd typically run
| TigerBeetle in production. In production you'd run a cluster of
| replicas (see
| https://docs.tigerbeetle.com/deploy/hardware#cluster-of-repl...)
| and then you'd have your application connect to the cluster
| using one of our client libraries.
|
| The game is more a reflection of how we do simulation testing
| (as the post mentioned). You can find the code for that here:
| https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/s...
| muhaaa wrote:
| I have a very interactive app with limited data use (<10MB).
| Right now it's a lot of different API calls to sync state
| between the server and multiple clients. But if I had a
| distributed sync protocol (that works), I could replicate the
| state on each client and the server by tunneling the sync
| protocol to the server and back, and reduce a lot of API
| complexity. My react UI would just subscribe to the data
| changes in the synced DB. A similar concept is implemented by
| PouchDB & CouchDB, but it's a bit outdated.
| cdchn wrote:
| What did you use to make the graphics? That's very cool.
| eatonphil wrote:
| They're handmade by Joy Machs! We credit him in the post.
| :) https://twitter.com/joymachs
| schmichael wrote:
| This is incredible! I am curious about this paragraph though:
|
| > You're going to see view changes when the primary crashes or is
| partitioned, and VSR's telltale round robin rotation of the new
| primary among replicas, until a new primary is established. This
| is in contrast to Raft, which elects a primary at random, but
| then suffers from the risk of (or increased latency to mitigate)
| dueling leaders.
|
| It seems like regardless of primary/leader selection mechanism
| used (deterministic/roundrobin vs voting), you still need a
| quorum of nodes to agree (and for the minority side to know they
| can't proceed). Surely in a round robin selection mechanism some
| sort of vote or liveness check must be performed before it is
| safe for the #2 node to promote itself to #1/primary? Otherwise
| if the _link_ between #2 and #1/primary is partitioned, #2 could
| unilaterally assume it was the primary/leader, even if the rest
| of the nodes could still communicate with #1 (the original
| primary). I don't understand how round robin solves the
| _agreement_ aspect that leader election does.
|
| The simulation seems to only partition _nodes_ and not _links_,
| so I'm not sure it exercises asymmetric connectivity between
| members. Although it does mention flaky links, so perhaps they do
| cover this case.
|
| Edit: I have been informed there is still a vote. :)
| jorangreef wrote:
| Thanks, great to hear you enjoyed it!
|
| The round robin "view change" in VSR is still consensus, and
| uses quorums to do fault isolation of the old primary, and to
| preserve the intersection property, to ensure that the
| committed log survives into the new view.
|
| What's cool about VSR's consensus, though, is that the dice are
| also loaded ahead of time, so that there's more information
| baked into the protocol than with Raft or MultiPaxos, which
| means that you can neatly sidestep their dueling-leader
| problem, or the latency padding that is often added to mitigate
| it.
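|
| Concretely (a simplified sketch of the idea, not TigerBeetle's
| implementation): the primary for any view is a pure function of
| the view number, so replicas never have to gamble on who stands
| next; they only have to agree, by quorum, that the view has
| advanced.
|
|     def primary_of(view, n):
|         # Round robin: known ahead of time, no randomized
|         # election, hence no dueling leaders as in Raft.
|         return view % n
|
|     class Replica:
|         def __init__(self, index, n):
|             self.index, self.n = index, n
|             self.view = 0
|             self.votes = set()
|
|         def on_primary_timeout(self):
|             # Suspect the primary: propose advancing to the next
|             # view, whose primary everyone already knows.
|             self.votes = {self.index}
|             return ("start_view_change", self.view + 1, self.index)
|
|         def on_start_view_change(self, new_view, sender):
|             if new_view <= self.view:
|                 return None  # stale message from an old view
|             self.votes.add(sender)
|             if len(self.votes) > self.n // 2:  # still a quorum
|                 self.view = new_view
|                 return "primary: replica %d" % primary_of(new_view, self.n)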
|
| You can read more about VSR's intuitive view change consensus
| here: http://pmg.csail.mit.edu/papers/vr-revisited.pdf
|
| The round robin view change can also make VSR (slightly) more
| resilient to weird network faults, like you describe. For
| example, variants of VSR using stable storage, as hinted at by
| the '12 paper, are in fact able to survive most of the
| OmniPaxos liveness scenarios (we submitted some errata to the
| paper's authors for this).
|
| We haven't yet explored visualizing asymmetric connectivity in
| SimTigerBeetle (we tried to start with the big things!),
| however we recently wrote about how we test for this in our
| VOPR simulator here:
| https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...
| schmichael wrote:
| Thanks for the details. A coworker pointed me at Heidi Howard
| (of Flexible Paxos, etc) and Diego Ongaro (Raft) discussing
| this very question!
| https://groups.google.com/g/raft-dev/c/cBNLTZT2q8o?pli=1
___________________________________________________________________
(page generated 2023-07-11 23:01 UTC)