[HN Gopher] We put a distributed database in the browser and mad...
       ___________________________________________________________________
        
       We put a distributed database in the browser and made a game of it
        
       Author : BratishkaErik
       Score  : 181 points
       Date   : 2023-07-11 13:15 UTC (9 hours ago)
        
 (HTM) web link (tigerbeetle.com)
 (TXT) w3m dump (tigerbeetle.com)
        
       | eatonphil wrote:
       | Direct link to the game: https://sim.tigerbeetle.com/.
       | 
       | While TigerBeetle is open-source, the game isn't yet. But that
       | may change. :)
        
       | pohl wrote:
       | This is a sweet idea and a nice, game-esque implementation.
       | 
       | It could definitely use some onboarding.
       | 
       | There's nothing to give the "player" a hint as to what they
       | should do. What is my goal? Am I trying to defeat the
       | communication of the nodes, or help them? If it's the former, why
       | did I seem to win the first level after doing nothing? I started
       | by trying to mash some keys. Eventually I saw that there were
       | tools in the upper-right, but it wasn't clear what to do with
       | them. It's a bit frustrating that they disappear after you grab
       | them if your click is too far off the target. After successfully
       | applying one of them to one of the targets, it's not clear what
       | one should do next. Was it a good move? Who knows, because
       | there's no feedback.
        
         | kristoff_it wrote:
         | Well the point is that the replication protocol is going to fix
         | all issues no matter what, so you could say that the only
         | winning move is to enjoy watching TigerBeetle move forward
         | impervious of scripted nor human-inputted issues.
         | 
         | In other words, you're going to 'win' no matter what :^)
        
           | pohl wrote:
           | That makes sense: it's not a game at all, but a simulation
           | pretending to be one.
        
             | kristoff_it wrote:
             | never played "walking simulators" then? plenty of games
             | don't feature traditional win-lose mechanics.
        
               | pohl wrote:
               | No, I hadn't even heard of that genre. They sound
               | potentially fun, though, since at least they have an
               | environment to explore.
        
         | jorangreef wrote:
         | Thanks for the feedback! This is the first version of our
         | educational SimTigerBeetle frontend.
         | 
         | The goal is to see and interact with a distributed database
         | running visually under Easy/Medium/Hard levels of fault
         | injection, giving you the opportunity to explore how the
         | consensus handles failures, poke the otherwise
         | deterministically simulated world, or inspect progress of the
         | replicas.
         | 
         | So, it's a game in the sim genre, and we've got more to come,
         | but the significance is hopefully what we tried to explain in
         | the post:
         | 
         | You're seeing a new design of distributed database survive
         | cosmic levels of fault injection (8% storage corruption on the
         | read path across all replicas in the Radioactive level!), in
         | your browser... and you can smash things with a hammer. ;)
         | 
         | You might also appreciate the video [0] at the end of our post,
         | where we dive into the three insights behind deterministic
         | simulation testing, and how DST moves beyond chaos engineering.
         | 
         | [0] SimTigerBeetle (Director's Cut!)
         | https://www.youtube.com/watch?v=Vch4BWUVzMM
        
       | charbuff wrote:
       | This is a really cool sales prop
        
       | jedisct1 wrote:
       | This is insanely cool!
        
       | Digikid13 wrote:
       | It's very confusing that y'all are calling this a "game". There's
       | nothing to play, it's just a simulation to watch.
        
         | cstrahan wrote:
         | "Simulation" doesn't necessarily entail stimulating
         | interactivity and a pleasing aesthetic. As of writing, I don't
         | know if there's a better term than "game" for what this is.
         | (Maybe we can coin a new term, like "game-lite" -- somewhat
         | akin to what "rogue-lite" is to the "rogue" genre).
         | 
         | See "walking simulators":
         | 
         | https://en.wikipedia.org/wiki/What_Remains_of_Edith_Finch
         | 
         | https://en.wikipedia.org/wiki/The_Stanley_Parable
         | 
         | https://en.wikipedia.org/wiki/Thirty_Flights_of_Loving
         | 
         | Plenty argue that these aren't games either (usual complaints
         | involve lack of problem(s) to resolve, and no win-lose
         | dynamic); but, then, what _are_ these? The closest category I
         | can think of would be  "computer-animated film", but... these
         | are interactive, and you can navigate and look in any direction
         | you want, which yields a very different experience than
         | watching a film like "Toy Story".
        
           | Apes wrote:
           | I think "demo" or "toy" or "exhibit" would be a better term
           | than "game", since there's not really any story, or rules, or
           | objectives, or anything resembling a normal gameplay loop.
           | The level of interactivity is well below what you would
           | typically expect from a game, and there's effectively no
           | agency in affecting the outcome of what happens. Even walking
           | simulators have at least some of those things.
        
           | zerocrates wrote:
           | "Walking simulator" is a little tricky to extrapolate from
           | since it was originally a perjorative.
        
           | shard wrote:
           | I remember reading an article which talks about the
           | difference between games and these interactive demos: games
           | have objectives and goals, and for the demos without such,
           | they're better to be called as "toys".
        
         | captainhorst wrote:
         | There are tools in the upper right corner. You can pick one and
         | use it on a beetle. ;) This introduces various faults in the
         | simulation that the database has to recover from.
         | 
         | Minor spoiler: there's another tiny game hidden at the end of
         | the third level.
        
           | reidjs wrote:
           | That adds some interactivity, but I still wouldn't call this
           | a game yet. Games generally have some sort of challenge you
           | must overcome. This is more like a highly polished
           | interactive simulation of a distributed db. Still really cool
           | and well done.
        
             | [deleted]
        
       | petrosagg wrote:
       | > Sure, we're not yet injecting storage faults, but then formal
       | proofs for protocols like Raft and Paxos assume that disks are
       | perfect, and depend on this for correctness? After all, you can
       | always run your database over RAID, right? Right? > If your
       | distributed database was designed before 2018, you probably
       | couldn't have done much. The research didn't exist.
       | 
       | I'm trying to understand this part but something seems off. It
       | seems to imply that those proofs do not apply to the real world
       | because disks are not perfect, but neither is RAM. The hardware
       | for both RAM and disks has an inherent error rate which can be
       | brought down to arbitrary levels by using error correction codes
       | (e.g ECC RAM).
       | 
       | I'm assuming that the TigerBeetle correctness proofs are
       | predicated on perfect RAM even though in reality there is a small
       | probability of errors. This tells me that there is an error rate
       | which they consider negligible. If that's the case what is the
       | difference between:
       | 
       | * TigerBeetle's storage approach
       | 
       | * Paxos or Raft with enough error correction on disk writes that
       | the probability of errors equals that of ECC RAM which is
       | considered negligible
       | 
       | I've probably made a logical error in my reasoning but I can't
       | see it. Can someone enlighten me?
        
         | jorangreef wrote:
         | Thanks for the question! Joran from TigerBeetle here.
         | 
         | The research in question is the 2018 paper from UW-Madison,
         | "Protocol-Aware Recovery for Consensus-Based Storage" (PAR) [0]
         | by Ram Alagappan, Aishwarya Ganesan, as well as Remzi and
         | Andrea Arpaci-Dusseau (who you may recognize as authors of
         | OSTEP).
         | 
         | PAR won best paper at FAST '18 for showing that a single disk
         | sector fault, in the write-ahead log (WAL) of a single replica,
         | could propagate through the distributed RAFT or MultiPaxos
         | consensus protocol, to cause global cluster data loss.
         | 
         | This was counter-intuitive at the time, because PAR showed that
         | the redundancy of these consensus and replication protocols did
         | not in fact always imply fault-tolerance, as had previously
         | been assumed.
         | 
         | The reason is, and we cover this in depth in our recent QCon
         | London talk [1], but it was assumed that checksums alone would
         | be sufficient to detect and recover from storage faults.
         | 
         | However, while checksums can be used under the "Crash
         | Consistency Model" to solve consistency through power loss, PAR
         | showed that checksums are not sufficient to be able to
         | distinguish between a torn write at the end of the
         | (uncommitted) WAL caused by power loss, and a torn write in the
         | middle of the (committed) WAL caused by bitrot.
         | 
         | What you tend to find is that the WALs for many of these
         | protocols will truncate the WAL at the first sign of a checksum
         | mismatch, conflating the mismatch with power loss when it might
         | be bitort, and thereby truncating committed transactions, and
         | undermining quorum votes in the Raft or MultiPaxos
         | implementations.
         | 
         | RAID solutions don't always help here, either. See "Parity Lost
         | and Parity Regained" [2] for more details. ZRAID is better
         | here, and ZFS is a huge inspiration, but with local redundancy
         | under ZFS you're still not leveraging the global redundancy of
         | the consensus protocol as well as you could be.
         | 
         | To summarize PAR:
         | 
         | There are fundamental design changes to both the global
         | consensus protocol and the local storage engine that would need
         | to be made, if the storage fault model of PAR (and TigerBeetle)
         | is to be solved correctly.
         | 
         | Furthermore, few simulators even test for these kinds of
         | storage faults. For example, misdirected I/O, where the disk
         | writes or reads to or from the wrong location of disk, which
         | may yet have a valid checksum.
         | 
         | However, this is important, because disks fail in the real
         | world. A single disk has on the order of a 0.5-1% chance of
         | corruption in a 2 year period [3]. For example, a 5 node
         | cluster has a 2.5-5% chance of a single disk sector fault,
         | which again in terms of PAR can lead to global cluster data
         | loss.
         | 
         | On the other hand, memory (or even CPU) faults, assuming ECC
         | are not in the same order of magnitude probability, and
         | therefore TigerBeetle's memory fault model is to require ECC
         | memory.
         | 
         | But, again, to be crystal clear, checksums alone are not
         | sufficient to solve the consensus corruption issue. The fix
         | requires protocol changes at the design level, for the
         | consensus protocol to be made storage fault-aware.
         | 
         | Thanks for the question and happy to answer more!
         | 
         | [0] "Protocol-Aware Recovery for Consensus-Based Storage"
         | https://www.usenix.org/conference/fast18/presentation/alagap...
         | 
         | [1] "A New Era for Database Design" (we also dive into the
         | research surrounding Fsyncgate, looking into the latent
         | correctness issues that remain)
         | https://www.youtube.com/watch?v=_jfOk4L7CiY
         | 
         | [2] "Parity Lost and Parity Regained"
         | https://www.usenix.org/conference/fast-08/parity-lost-and-pa...
         | 
         | [3] "An Analysis of Data Corruption in the Storage Stack"
         | https://www.cs.toronto.edu/~bianca/papers/fast08.pdf
        
           | petrosagg wrote:
           | Thank you for the detailed response!
           | 
           | > However, while checksums can be used under the "Crash
           | Consistency Model" to solve consistency through power loss,
           | PAR showed that checksums are not sufficient to be able to
           | distinguish between a torn write at the end of the
           | (uncommitted) WAL caused by power loss, and a torn write in
           | the middle of the (committed) WAL caused by bitrot.
           | 
           | The PAR paper states that "although Crash preserves safety,
           | it suffers from severe unavailability". I assume that when
           | TigerBeetle loads state from RAM into a CPU cache/register it
           | operates under the NoDetection consistency model or the Crash
           | consistency model if ECC RAM automatically resets the CPU on
           | read errors. At the same time it doesn't suffer from severe
           | unavailability so what gives?
           | 
           | The answer is probably that ECC RAM is just reliable enough
           | that the NoDetection/Crash models are fine in practice.
           | 
           | I can believe that off-the-shelf checksum and redundancy
           | options offered by filesystems like ext4 and ZFS or systems
           | like RAID don't hit the required error probabilities but why
           | does the argument stop there? Couldn't a distributed database
           | generate error correcting data on every write in the
           | application layer so that the probability becomes low enough
           | such that NoDetection/Crash become a non-issue for storage,
           | just like RAM? Is there some other fundamental difference
           | between reading and write data from RAM versus a disk?
        
             | jorangreef wrote:
             | Huge pleasure, thanks again for the question!
             | 
             | The crux of the problem: How do you solve misdirected
             | read/write I/O? Where the firmware writes/reads to/from the
             | wrong disk sector (but with a valid checksum)?
             | 
             | PAR shows how both global consensus protocol and local
             | storage engine need to be modified for this, with
             | foundational design changes at the protocol-level, if a
             | distributed system is to not only preserve correctness, but
             | also optimize for high availability.
             | 
             | Bear in mind that PAR is not only actually correct, but
             | it's also more efficient than simply dialing up local
             | redundancy, because it lets you recover from the global
             | redundancy that you have via replication in the consensus
             | protocol.
             | 
             | The paper is great, but will especially reward a few passes
             | of reading. The examples they give take time, but are great
             | to work through slowly to gain a deeper understanding.
             | 
             | And/or, you can read the Zig code of PAR in TB! :)
             | 
             | Here's a great place to start, one of our favorite pieces
             | of code in TigerBeetle: https://github.com/tigerbeetle/tige
             | rbeetle/blob/4aca8a22627b...
        
         | convolvatron wrote:
         | you just bury these and play games with p. just like
         | distributed consensus, there is no perfect storage medium. 1
         | bit flip a week too much for you? add secded. secded on memory.
         | interleaving to spread out correlated errors. etc.
         | 
         | at some point you run in the probability of the earth being
         | coincident with the sun and you call it good.
         | 
         | none of viewstamp replication, paxos, or raft deal with storage
         | errors.
         | 
         | where it does get interesting is that in the low p spectrum can
         | you can subvert the correctness of these protocols by fuzzing
         | them. and then you get into byzantine protocols.
        
           | jorangreef wrote:
           | We tried to emphasize "Protocol-Aware Recovery for Consensus-
           | Based Storage" in the blog post, because it's how TigerBeetle
           | solves the storage fault model, and because PAR shows how you
           | can actually solve this using the redundancy you already have
           | in the global consensus protocol.
           | 
           | https://www.usenix.org/conference/fast18/presentation/alagap.
           | ..
        
         | chubot wrote:
         | Not an expert in this area, but I think disks have correlated
         | failure modes whereas CPUs and memory generally don't.
         | 
         | Especially spinning platter disks, not sure about SSDs.
         | 
         | The difference in failure rates could be orders of magnitude
         | ... Memory will have random bit flips but I think they are
         | pretty randomly distributed (or maybe catastrophic if there is
         | some cosmic event)
         | 
         | But disks will have non-random manufacturing issues. I'd be
         | interested in more info too, but my impression is that the data
         | on these issues is pretty thin. Foundation DB mentioned it ~10
         | years ago and Google has published data >10 years ago, but
         | hardware has changed a lot since then
         | 
         | Software redundancy will take care of non-correlated failures,
         | but it fails precisely when there are correlated ones
        
           | imtringued wrote:
           | SSDs tend to have highly correlated failure modes because you
           | either run into a bug in the firmware which is the same on
           | every SSD or you have the same wear on every SSD, which locks
           | both into read-only mode within a short period of time. You
           | might argue that read-only is not a failure, but read-only
           | means downtime and replacing hardware.
        
           | planckscnst wrote:
           | I work on a large distributed database system. SSDs
           | absolutely have correlated failures. Also CMOS batteries.
           | Also CPU and memory (think a manufacturing defect or a
           | storage climate issue on specific batches that made it
           | through QA). Pretty much nothing is 100% guaranteed to have
           | no correlated failures. It comes down to probabilities. You
           | can add flexibility, variation, vendor/sourcing diversity to
           | reduce risks.
        
       | brimstedt wrote:
       | Is this something similar to couch/pouchdb?
        
         | eatonphil wrote:
         | TigerBeetle is more domain-specific, i.e. focused on financial
         | transactions and high-performance, high-availability. There are
         | just two entity types in the database: accounts and transfers
         | between accounts. More details here
         | https://docs.tigerbeetle.com/design/data-modeling if you're
         | interested!
        
           | jazzyjackson wrote:
           | In comparison to {C,P}ouchDB, I think the question is around
           | offline-first availability. Can a mobile client interact with
           | a local database without a degraded experience and expect
           | changes to be synchronized once a connection is re-
           | established. I would suppose in the financial transaction
           | space this is not A Thing.
        
             | eatonphil wrote:
             | > In comparison to {C,P}ouchDB, I think the question is
             | around offline-first availability.
             | 
             | Got it, thanks! Yeah that is indeed not how TigerBeetle
             | works. If you ever cannot connect to the cluster, you keep
             | retry messages (idempotently) until you connect and the
             | message succeeds.
        
           | jlokier wrote:
           | That sounds similar to the account state storage used in
           | blockchains. In some of those, high performance and high
           | space consumption are technical challenges, and the tendancy
           | to have a lot of random keys (hashes) adds another.
           | 
           | I wonder if TigerBeetle would be suited to those storage
           | performance challenges, and conversely if the low-level
           | storage optimisations in certain implementations of
           | blockchain account history would be more broadly applicable
           | to the financial applications targeted by TigerBeetle.
        
             | eatonphil wrote:
             | Yes it is similar, but simpler! We do chat with a few
             | companies using blockchain, looking at TigerBeetle for
             | better throughput.
             | 
             | > if the low-level storage optimisations in certain
             | implementations of blockchain account history would be more
             | broadly applicable to the financial applications targeted
             | by TigerBeetle.
             | 
             | Maybe! Any examples you're thinking of?
        
       | cdchn wrote:
       | This is pretty cool from a database perspective but I'm really
       | interested in what they used to write this 'game.'
        
         | eatonphil wrote:
         | It's covered a little bit in the Evolution section (add
         | #evolution to the URL and hit enter).
         | 
         | If that doesn't answer everything (it's not a long section,
         | granted) Fabio (captainhorst) is on HN answering questions in
         | this thread already so feel free to ask!
        
       | nlavezzo wrote:
       | This is awesome!
        
       | moralestapia wrote:
       | Nice work @eatonphil and TB team. Thanks for sharing.
       | 
       | I'm eagerly awaiting the production ready (or an RC or something)
       | of TB, I've tried it and it truly excels in its domain.
       | 
       | I have a few clients that could make good use of it but I have to
       | hold my horses a little bit.
        
         | jorangreef wrote:
         | Thanks, Alex! Awesome to hear that. SimTigerBeetle has been a
         | skunkworks project, on the side, the past 12 months. We're
         | looking forward to sharing the production release of
         | TigerBeetle with you soon when it's ready.
        
       | muhaaa wrote:
       | I thought tigerbeetle distributes the db on web clients? I
       | skimmed through the docs and cannot find how to design a web app
       | with tigerbeetle like the game https://sim.tigerbeetle.com/
        
         | eatonphil wrote:
         | Hey! I'm not completely sure I follow the question. But the
         | game is built a little differently from how you'd typically run
         | TigerBeetle in production. In production you'd run a cluster of
         | replicas (see
         | https://docs.tigerbeetle.com/deploy/hardware#cluster-of-
         | repl...) and then you'd have your application connect to the
         | cluster using one of our client libraries.
         | 
         | The game is more a reflection of how we do simulation testing
         | (as the post mentioned). You can find the code for that here: h
         | ttps://github.com/tigerbeetledb/tigerbeetle/blob/main/src/s....
        
           | muhaaa wrote:
           | I have a very interactive app with limited data use <10MB.
           | Right now its a lots of different API calls to sync state
           | between server and multiple clients. But if I have a
           | distributed sync protocol (that works), I can replicate the
           | state on each cient and server by tunneling the sync protocol
           | to the server and back. I have reduced a lots of api
           | complexity. My react UI just subscribes to the data changes
           | in the synced DB. Similar concept impelmented pouchdb &
           | couchdb but its a bit outdated.
        
           | cdchn wrote:
           | What did you use to make the graphics? Thats very cool.
        
             | eatonphil wrote:
             | They're handmade by Joy Machs! We credit him in the post.
             | :) https://twitter.com/joymachs
        
       | schmichael wrote:
       | This is incredible! I am curious about this paragraph though:
       | 
       | > You're going to see view changes when the primary crashes or is
       | partitioned, and VSR's telltale round robin rotation of the new
       | primary among replicas, until a new primary is established. This
       | is in contrast to Raft, which elects a primary at random, but
       | then suffers from the risk of (or increased latency to mitigate)
       | dueling leaders.
       | 
       | It seems like regardless of primary/leader selection mechanism
       | used (deterministic/roundrobin vs voting), you still need a
       | quorum of nodes to agree (and for the minority side to know they
       | can't proceed). Surely in a round robin selection mechanism some
       | sort of vote or liveness check must be performed before it is
       | safe for the #2 node to promote itself to #1/primary? Otherwise
       | if the _link_ between #2 and #1 /primary is partitioned, #2 could
       | unilaterally assume it was the primary/leader, even if the rest
       | of the nodes could still communicate with #1 (the original
       | primary). I don't understand how round robin solves the
       | _agreement_ aspect that leader election does.
       | 
       | The simulation seems to only partition _nodes_ and not _links_ so
       | I 'm not sure it exercises asymmetric connectivity between
       | members. Although it does mention flaky links, so perhaps they do
       | cover this case.
       | 
       | Edit: I have been informed there is still a vote. :)
        
         | jorangreef wrote:
         | Thanks, great to hear you enjoyed it!
         | 
         | The round robin "view change" in VSR is still consensus, and
         | uses quorums to do fault isolation of the old primary, and to
         | preserve the intersection property, to ensure that the
         | committed log survives into the new view.
         | 
         | What's cool about VSR's consensus though, is that the dice is
         | also preloaded, ahead of time, so that there's more information
         | baked into the protocol than with Raft or MultiPaxos, which
         | means that you can neatly sidestep their dueling leader
         | problem, or the latency padding that is often added to mitigate
         | it.
         | 
         | You can read more about VSR's intuitive view change consensus
         | here: http://pmg.csail.mit.edu/papers/vr-revisited.pdf
         | 
         | The round robin view chance can also make VSR (slightly) more
         | resilient to weird network faults, like you describe. For
         | example, variants of VSR using stable storage as hinted at by
         | the '12 paper, are in fact able to survive most of the
         | OmniPaxos liveness scenarios (we submitted some errata to the
         | paper's authors for this).
         | 
         | We haven't yet explored visualizing asymmetric connectivity in
         | SimTigerBeetle (we tried to start with the big things!),
         | however we recently wrote about how we test for this in our
         | VOPR simulator here:
         | https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...
        
           | schmichael wrote:
           | Thanks for the details. A coworker pointed me at Heidi Howard
           | (of Flexible Paxos, etc) and Diego Ongaro (Raft) discussing
           | this very question! https://groups.google.com/g/raft-
           | dev/c/cBNLTZT2q8o?pli=1
        
       ___________________________________________________________________
       (page generated 2023-07-11 23:01 UTC)