[HN Gopher] Jepsen: TigerBeetle 0.16.11
       ___________________________________________________________________
        
       Jepsen: TigerBeetle 0.16.11
        
       Author : aphyr
       Score  : 210 points
       Date   : 2025-06-06 10:53 UTC (12 hours ago)
        
 (HTM) web link (jepsen.io)
 (TXT) w3m dump (jepsen.io)
        
       | koakuma-chan wrote:
       | Curious if they got any large bank or stock exchange to use
       | TigerBeetle
        
         | nindalf wrote:
         | I think if they had, they'd brag about it on their homepage. So
         | far the biggest endorsement from there is from some YouTuber. A
         | popular YouTuber, no doubt, but a YouTuber nevertheless.
        
           | koakuma-chan wrote:
           | Yeah, TigerBeetle itself and their testing suite look
           | impressive, but putting Primeagen there makes them look like
           | Next.js or Cursor.
        
             | jorangreef wrote:
             | That's a talk for engineers that was streamed on the
             | Primeagen and went a bit viral. If you haven't watched it
             | yet, it's an intro to TigerBeetle technically.
             | 
             | Otherwise check out https://tigerbeetle.com/company if you
             | want more about the corporate side.
        
               | nindalf wrote:
               | If you can stand that guy speak, it's worth a watch.
        
               | jorangreef wrote:
               | I actually love the pace at which Prime speaks, but I
               | feel awkward at hearing my own voice. Hopefully the ideas
               | stand on merit!
        
         | jorangreef wrote:
         | Joran, creator and CEO from TigerBeetle here!
         | 
         | At a national level, we're working with the Gates Foundation to
         | integrate TigerBeetle into their non-profit central bank switch
         | that will be powering Rwanda's National Digital Payments System
         | 2.0 later this year [1].
         | 
         | At an enterprise level, TigerBeetle already powers customers
         | processing 100M+ transactions per month in production, and we
         | recently signed our first $2B fintech unicorn in Europe with a
         | few more in the US about to close. Because of the move to
         | realtime transaction processing around the world [2] there's
         | been quite a bit of interest from companies wanting to move to
         | TigerBeetle for more performance.
         | 
         | Finally, to your question, some of the founders of Clear
         | Street, a fairly large brokerage on Wall Street have since
         | invested [3] in TigerBeetle.
         | 
         | [1] https://mojaloop.io/how-mojaloop-enables-rndps-2-0-ekash/
         | 
         | [2] https://tigerbeetle.com/blog/2024-07-23-rediscovering-transa...
         | 
         | [3] https://tigerbeetle.com/company
        
           | diggan wrote:
           | > some of the founders of Clear Street, a fairly large
           | brokerage on Wall Street have since invested [3] in
           | TigerBeetle
           | 
           | "Invested" in terms of "giving you money" or in terms of "Now
           | uses the database themselves"? I read it as the first, but I
           | think the question is about usage, not investments.
        
             | jorangreef wrote:
             | Both. In terms of investing and planning to migrate.
        
               | diggan wrote:
               | Thanks for the clarification :)
        
               | jorangreef wrote:
               | You too! :)
        
               | thomaspaine wrote:
               | I work on the ledgering system at clear street and as far
               | as I know we have no plans to do this. We evaluated it
               | internally a few years ago and found that the account and
               | transaction model was too different from ours to migrate
               | over.
        
               | jorangreef wrote:
               | Hi Thomas, yes, I was there. However, this is something
               | that Sachin and I subsequently discussed last year
               | (Sachin recently provided the TPS footnote to be used in
               | the report here). However, I understand that roadmap may
               | since have changed, but this is to the best of my
               | knowledge.
        
               | sachnk99 wrote:
               | Hi -- Sachin here, one of the founders of Clear Street.
               | To clarify:
               | 
               | - The investment in TigerBeetle was done personally, not
               | through Clear Street.
               | 
               | - I'm no longer actively involved day-to-day as CTO at
               | Clear Street, but while I was, TigerBeetle was a solution
               | we very much had in mind as our volumes were increasing.
               | 
               | That said, roadmaps change, priorities shift, etc. If
               | TigerBeetle existed when we started Clear Street, I very
               | much would have used it, and it would have saved me many
               | headaches.
        
           | dralley wrote:
           | Have you had a difficult time convincing customers to use a
           | product written in a pre-1.0 programming language?
        
             | matklad wrote:
             | From the user's perspective, this doesn't matter at all.
             | Zig is an implementation detail, what we actually ship is a
             | fully statically linked native executable for the database,
             | and "links only libc" (because thread locals!) .a/.so
             | native "C" library for clients. Nothing will change, for
             | the user, if we decide to rewrite the thing in Rust, or C,
             | or Hare, nothing Zig-specific leaks out.
             | 
             | From the developer perspective, the big thing is that we
             | don't have any dependencies, so updating compiler for us is
             | just a small amount of work once in a while, and not your
             | typical ecosystem-wide coordination problem. Otherwise,
             | Zig's pretty much "finished" for our use-case, it more or
             | less just works.
        
             | jorangreef wrote:
             | Zig's pre-1.0 status also refers more to API stability. The
             | language and tooling already have more quality, at least in
             | my own experience, than if we had picked C, which was the
             | only other choice available to us when we made the decision
             | to invest in Zig's trajectory back in 2020, given we needed
             | to do static allocation and that any sort of global
             | allocator was out of the question.
             | 
             | But, no. On the commercial side, I don't think we've had
             | one conversation with a prospect or CTO or engineering team
             | where they were concerned that we picked a systems language
             | for the next thirty years. And while Zig is a beautiful,
             | perfect replacement for C, I think the real reason the
             | question has never come up, is that our customers come to
             | us instead of us to them. We're not trying to convince
             | anyone. They're already appreciating the extensive end-to-
             | end testing we do on everything we ship.
             | 
             | However, I should emphasize again, that given all the
             | assertions, fuzzing and DST we do, Zig's quality can't be
             | overstated. It holds up.
        
         | SOLAR_FIELDS wrote:
         | Not a bank or exchange but I work for a very large fintech and
         | we are using it on our newer products.
        
           | jorangreef wrote:
           | Awesome to hear that! Are we chatting in Slack? Or please DM
           | me or Lewis. Would love to chat!
        
       | nindalf wrote:
       | Very impressed with this report. Whenever I read TigerBeetle's
       | claims on reliability and scalability, I'd think "ok, let's wait
       | for the Jepsen report".
       | 
       | This report found a number of issues, which might be a cause for
       | concern. But I think it's a positive because they didn't just fix
       | the issues, they've expanded their internal test suite to catch
       | similar bugs in future. With such an approach to engineering I
       | feel like in 10 years TigerBeetle would have achieved the "just
       | use Postgres" level of default database in its niche of financial
       | applications.
       | 
       | Also great work aphyr! I feel like I learned a lot reading this
       | report.
        
         | jorangreef wrote:
         | Thanks!
         | 
         | Yes, we have around 6,000+ assertions in TigerBeetle. A few of
         | these were overtight, hence some of the crashes. But those were
         | the assertions doing their job, alerting us that we needed to
         | adjust our mental model, which we did.
         | 
         | Otherwise, apart from a small correctness bug in an internal
         | testing feature we added (only in our Java client and only for
         | Jepsen to facilitate the audit) there was only one correctness
         | bug found by Jepsen, and it didn't affect durability. We've
         | written about it here:
         | https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
         | 
         | Finally, to be fair, TigerBeetle can survive (and is tested to
         | survive) more faults than Postgres can, since it was designed with an
         | explicit storage fault model and using research that was not
         | available at the time when Postgres was released in '96. TB's
         | fault models are further tested with Deterministic Simulation
         | Testing and we use techniques such as static memory allocation
         | following NASA's Power of Ten Rules for Safety-Critical Code.
         | There are known scenarios in the literature that will cause
         | Postgres to lose data, which TigerBeetle can detect and recover
         | from.
         | 
         | For more on this, see the section in Kyle's report on helical
         | fault injection (most Raft and Paxos implementations were not
         | designed to survive this) as well as a talk we gave at QCon
         | London: https://m.youtube.com/watch?v=_jfOk4L7CiY
        
           | jrpelkonen wrote:
           | Hi Joran,
           | 
           | I have followed TigerBeetle with interest for a while, and
           | thank you for your inspirational work and informative
           | presentations.
           | 
           | However, you have stated on several occasions that the lack
           | of memory safety in Zig is not a concern since you don't
           | dynamically allocate memory post startup. However, one of the
           | defects uncovered here (#2435) was caused by dereferencing an
           | uninitialized pointer. I find this pretty concerning, so I
           | wonder if there is something that you will be doing
           | differently to eliminate all similar bugs going forward?
        
             | matklad wrote:
             | Note that that's a bug in the client, in the Zig-Java FFI
             | code, which is inherently unsafe. We'd likely have made a
             | similar bug in Rust.
             | 
             | Which is, yeah, one of the bigger technical challenges for
             | us --- we ship language-native libraries for
             | Go, Node, Java, C#, Python and Rust, and, like in the Tolstoy
             | novel, each one is peculiar in its own way. What's worse,
             | they aren't directly covered by our deterministic
             | simulator. That's one of the major reasons why we invest in
             | full-system simulation with jepsen, antithesis and vortex
             | (https://tigerbeetle.com/blog/2025-02-13-a-descent-into-the-v...).
             | We are also toying with the idea of generating
             | _more_ of that code, so there's less room for human error.
             | Maybe one day we'll even do fully native client (eg, pure
             | Java, pure Go), but we are not there yet.
             | 
             | One super-specific in-progress thing is that, at the
             | moment, the _bulk_ of the client testing is duplicated per
             | client, and also the _bulk_ of the testing is example-
             | based. Building simulator/workload is a lot of work, and
             | duplicating it for each client is unreasonable. What we
             | want to do here is to use multi-process architecture, where
             | there's a single Zig process that generates the workloads
             | and generates interesting sequences of commands for
             | clients, and then in each client we implement just a tiny
             | "interpreter" for workload language, getting a test suite
             | for free. This is still WIP though!
             | 
             | Regarding the broader memory safety issue in the database.
             | We did have a couple of memory safety bugs, which were
             | caught early in testing. We did have one very bad aliasing
             | bug, which would have been totally prevented by Rust, which
             | slipped through the bulk of our testing and into the
             | release (it was caught in testing _after_ it was
             | introduced):
             | https://github.com/tigerbeetle/tigerbeetle/pull/2774.
             | Notably, while the bug was bad enough to completely mess up
             | our internal data structure, it was immediately caught by
             | an assert down the line, and downgraded from a correctness
             | issue to a small availability issue (just restarting the
             | replica would fix it). Curiously, the root cause for that
             | bug was that we over-complicated our code. Long before the
             | actual bug we felt uneasy about the data structure in
             | question, and thought about refactoring it away (that
             | refactor is underway. Hilariously, it looks like just
             | "removing" the thing without any other code changes
             | improves performance!).
             | 
             | So, on balance, yeah, Rust would've prevented a small
             | number of easy bugs, and one gnarly bug, but then the entire
             | thing would have to look completely different, as the
             | architecture of TigerBeetle is not at all Rust-friendly. I'd
             | be curious to see someone replicating the single-threaded,
             | io_uring, no-malloc-after-startup architecture in Rust! I
             | personally don't know off the top of my head whether that
             | would work or not.
        
               | jcalabro wrote:
               | I remember reading a similar thing about FoundationDB
               | with their DST a while back. Over time, they surfaced
               | relatively few bugs in the core server, but found a bunch
               | in the client libraries because the clients were more
               | complicated and were not run under their DST.
               | 
               | Anyways, really interesting report and project. I also
               | like your youtube show - keep up the great work! :)
        
               | matklad wrote:
               | Oh, important clarification from
               | andrewrk (https://lobste.rs/c/tf6jng), which I totally
               | missed myself: this isn't actually a dereference of an
               | uninitialized pointer, it's a dereference of a pointer which
               | is explicitly set to a specific, invalid value.
        
               | jrpelkonen wrote:
               | This is indeed an important point, the way I originally
               | understood the bug was that the memory was not
               | initialized at all. Thanks for the clarification
        
             | AndyKelley wrote:
             | TigerBeetle uses ReleaseSafe optimization mode, which means
             | that the pointer was in fact initialized to
             | 0xaaaaaaaaaaaaaaaa. Since nothing is mapped to this
             | address, it reliably causes a segfault. This is equivalent
             | to an assertion failure.
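
One way to see why a pointer filled with 0xaa bytes faults reliably is sketched below. This is an illustration under a stated assumption, not something claimed in the thread: on x86-64 with 48-bit virtual addresses, bits 63 down to 47 of a valid ("canonical") address must all be equal, and the alternating 0xaa pattern breaks that rule. The helper name `is_canonical` and the `va_bits` parameter are hypothetical.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumption: va_bits is the implemented virtual address width
    (48 with 4-level paging, 57 with 5-level paging).
    """
    # Bits 63..(va_bits - 1) must be all zeros or all ones.
    upper = addr >> (va_bits - 1)
    upper_width = 64 - (va_bits - 1)
    return upper == 0 or upper == (1 << upper_width) - 1


# The fill pattern described in the comment above.
SENTINEL = 0xAAAAAAAAAAAAAAAA

if __name__ == "__main__":
    print(is_canonical(SENTINEL))            # alternating upper bits
    print(is_canonical(0x00007FFFFFFFFFFF))  # top of the lower half
```

Under that assumption the sentinel cannot be mapped by any process, which is consistent with the comment that dereferencing it behaves like an assertion failure.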
        
               | jrpelkonen wrote:
               | That's good to hear! Thanks for the clarification.
        
           | anarazel wrote:
           | > There are known scenarios in the literature that will cause
           | Postgres to lose data, which TigerBeetle can detect and
           | recover from.
           | 
           | What are you referencing here?
        
             | jorangreef wrote:
             | The scenarios described in our QCon London talk linked
             | above.
             | 
             | This surveys the excellent storage fault research from
             | UW-Madison, and in particular:
             | 
             | "Can Applications Recover from fsync Failures?"
             | 
             | "Protocol-Aware Recovery for Consensus-Based Storage"
             | 
             | Finally, I'd recommend watching "Consensus and the Art of
             | Durability", our talk from SD24 in NYC last year:
             | 
             | https://www.youtube.com/watch?v=tRgvaqpQPwE
        
         | SOLAR_FIELDS wrote:
         | I always get excited to read Kyle's write ups. I feel like I
         | level up my distributed systems knowledge every time he puts
         | something out.
        
       | cmrdporcupine wrote:
       | The article's link to the paper about "Viewstamped Replication" is
       | unfortunately broken
       | (https://pmg.csail.mit.edu/papers/vr-revisited.pdf connection refused).
       | 
       | I think it should be http://pmg.csail.mit.edu/papers/vr-revisited.pdf
       | (http scheme, not https)?
       | 
       | And now I have some Friday evening reading material.
        
         | jorangreef wrote:
         | It should be fixed soon!
         | 
         | The VSR 2012 paper is one of my favorites as is "Protocol-Aware
         | Recovery for Consensus-Based Storage", which is so powerful.
         | 
         | Hope you enjoy the read!
        
       | tomhow wrote:
       | See also:
       | 
       |  _Fuzzer Blind Spots (Meet Jepsen!)_ -
       | https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
        
       | Ygg2 wrote:
       | TigerBeetle is impressive, but it's a single purpose DB. Unless
       | you fit within the account ledger model it's extremely
       | restrictive.
        
         | jorangreef wrote:
         | Joran from TigerBeetle here!
         | 
         | Yes, TigerBeetle specializes only for transaction processing
         | (OLTP). It's not a general-purpose (OLGP) DBMS.
         | 
         | That said, we have customers from energy to gaming, and of
         | course fintech.
        
         | SOLAR_FIELDS wrote:
         | That is 100% correct. You use TigerBeetle when you need a
         | really good double entry accounting system that is open source.
         | You wouldn't use it for much else other than that. Which makes
         | it great software, it's purpose-made to solve one problem
         | really well.
        
         | saaaaaam wrote:
         | That's a slightly redundant criticism though - it doesn't
         | present itself as anything other than a single purpose database
         | designed for financial transactions.
         | 
         | That's like saying that rice noodles are no good for making
         | risotto. At the core they are both rice...
        
           | Ygg2 wrote:
           | People seem to describe it as OLTP, and one of the first DBs to
           | come up in OLTP search is MySQL.
        
             | dumah wrote:
             | OLTP (Online Transaction Processing) is a database paradigm
             | optimized for handling high volumes of short, fast
             | transactions in real-time, typically supporting day-to-day
             | operational activities like order processing, inventory
             | updates, and customer account management where data
             | integrity and quick response times are critical.
             | 
             | Another paradigm is OLAP, in which aggregation of large
             | datasets is the principal concern.
        
               | Ygg2 wrote:
               | Yes, I'm aware. It seems now there is a further
               | bifurcation. OLTP is no longer general purpose, but now
               | it's also for only one narrow use-case.
        
       | wiradikusuma wrote:
       | If memory serves, TigerBeetle is/was not free for production? I
       | can't find the Pricing page, but I kinda remember reading about
       | it somewhere (or it was implied) a while back.
        
         | jorangreef wrote:
         | The DBMS is Apache 2.0 and our customers pay us (well) for
         | everything else to run, integrate, migrate, operate and support
         | that.
         | 
         | For more on our open source thinking and how this is orthogonal
         | to business model (and product!), see our interview with the
         | Changelog: https://m.youtube.com/watch?v=Yr8Y2EYnxJs
        
           | boris wrote:
           | I watched that but I don't see it as convincing. Let's take
           | the AWS example brought up in the talk. The "compete on the
           | interface, not (open source) implementation" idea I think
           | misses (at least) the following points:
           | 
           | 1. AWS will take your initial and ongoing investment in the
           | implementation but they don't have to share theirs with you.
           | Specifically, they will take your improvements but their own
           | improvements (say some performance optimizations) they can
           | keep to themselves. It's good business sense if it allows
           | them to further differentiate their "improved" offering from
           | your "vanilla" service.
           | 
           | 2. Competing on the interface in this case really means
           | competing on related services like management, etc. So your
           | thesis is that you will provide a better/cheaper managed
           | service than AWS. Even if that's true (a big if), most of the
           | time the decision which service to use will have little to do
           | with technical merit. I.e. we already use AWS, have SLA
           | painfully negotiated, get volume discounts, etc. Do we really
           | want to go through all of this with another vendor just for
           | one extra service.
           | 
           | Just a couple of thoughts that will hopefully help you
           | sharpen your thesis.
        
             | kristoff_it wrote:
             | > AWS will take your initial and ongoing investment in the
             | implementation but they don't have to share theirs with
             | you. Specifically, they will take your improvements but
             | their own improvements (say some performance optimizations)
             | they can keep to themselves. It's good business sense if it
             | allows them to further differentiate their "improved"
             | offering from your "vanilla" service.
             | 
             | In practice all I've seen from AWS is just to add
             | integrations with their internal orchestrators and not much
             | else. Back when I was at Redis Labs, AWS added TLS support
             | to Redis and was dying to get that upstreamed (so that they
             | wouldn't have to maintain the patch), except that as far as
             | I understood nobody upstream wanted that code. In other
             | words, hypothetical improvements by AWS (and other Clouds)
             | are extremely overrated. When it comes to TigerBeetle, I
             | would put the chance that they introduce bugs and
             | vulnerabilities much higher than the possibility they add
             | any meaningful improvement over what the actual experts
             | (the TigerBeetle team) have already done.
             | 
             | > Do we really want to go through all of this with another
             | vendor just for one extra service.
             | 
             | That's a great point, and in fact I've seen AWS
             | purposefully offer insane (in Europe maybe we would say
             | anti-competitive) discounts precisely to prevent Redis Labs
             | from gaining market share. I'm sure they will try the same
             | with TB once it becomes mainstream enough. What TB has that
             | Redis doesn't have is the fact that it's a database
             | designed for truly mission-critical stuff (i.e. counting
             | the money) and maybe customers will be willing to go
             | through the extra motions to ensure they get the best
             | service they can (assuming TB will be able to provide
             | that).
        
               | boris wrote:
               | > In other words, hypothetical improvements by AWS (and
               | other Clouds) are extremely overrated.
               | 
               | Interesting, in a recent thread (I think it was about
               | Redis going back open source) an AWS employee was
               | bragging about substantial concurrency optimizations they
               | implemented in Valkey. At the time I thought it could
               | have been a great differentiator to keep proprietary but
               | perhaps they decided to sacrifice it to help make sure
               | Valkey takes over the Redis mindshare.
        
               | kristoff_it wrote:
               | That's a special case for sure, given the new fight for
               | supremacy between the two forks, that said you can see in
               | all those threads antirez bickering with the AWS people
               | over exactly who introduced what.
        
             | jorangreef wrote:
             | To be clear, we have no problem if all the hyperscalers
             | decide to offer TigerBeetle as their flagship OLTP
             | database. That builds trust and is a good thing for the
             | ecosystem as a whole.
             | 
             | We also don't expect (or need) anyone to contribute
             | improvements upstream to us. That's open source!
             | 
             | Finally, open source is not the same thing as product.
             | There are thousands of companies around the world who make
             | high quality products that people pay for. TigerBeetle is
             | no different.
        
       | jitl wrote:
       | Really happy to see TigerBeetle live up to its claims as verified
       | by aphyr - because it's good to see that when you take the right
       | approach, you get the right results.
       | 
       | Question about how people end up using TigerBeetle. There's
       | presumably a lot of external systems and other databases around a
       | TigerBeetle install for everything that isn't an Account or
       | Transfer. What's the typical pattern for those less reliable
       | systems to square up to TigerBeetle, especially to recover from
       | consistency issues between the two?
        
         | jorangreef wrote:
         | Joran from TigerBeetle here! Thanks! Really happy to see the
         | report published too.
         | 
         | The typical pattern in integrating TigerBeetle is to
         | differentiate between control plane (Postgres for general
         | purpose or OLGP) and data plane (TigerBeetle for transaction
         | processing or OLTP).
         | 
         | All your users (names, addresses, passwords etc.) and products
         | (descriptions, prices etc.) then go into OLGP as your "filing
         | cabinet".
         | 
         | And then all the Black Friday transactions these users (or
         | entities) make, to move products from inventory accounts to
         | shopping cart accounts, and from there to checkout and delivery
         | accounts--all these go into OLTP as your "bank vault".
         | TigerBeetle lets you store up to 3 user data identifiers per
         | account or transfer to link events (between entities) back to
         | your OLGP database which describes these entities.
         | 
         | This architecture [1] gives you a clean "separation of
         | concerns", allowing you to scale and manage the different
         | workloads independently. For example, if you're a bank, it's
         | probably a good idea not to keep all your cash in the filing
         | cabinet with the customer records, but rather to keep the cash
         | in the bank vault, since the information has different
         | performance/compliance/retention characteristics.
         | 
         | This pattern makes sense because users change their name or
         | email address (OLGP) far less frequently than they transact
         | (OLTP).
         | 
         | Finally, to preserve consistency, on the write path, you treat
         | TigerBeetle, the OLTP data plane, as your "system of record".
         | When a "move to shopping cart" or "checkout" transaction comes
         | in, you first write all your data dependencies to OLGP if any
         | (and say S3 if you have related blob data) and then finally you
         | commit your transaction by writing to TigerBeetle. On the read
         | path, you query your system of record first, preserving strict
         | serializability.
         | 
         | Does that make sense? Let me know if there's anything here we
         | can drill into further!
         | 
         | [1] https://docs.tigerbeetle.com/coding/system-architecture/
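
The write and read ordering described in the comment above can be sketched roughly as follows. This is a minimal in-memory illustration, not TigerBeetle's client API: the `Olgp` and `Ledger` classes, the single `user_data` field, and the `checkout` and `read_order` functions are all hypothetical stand-ins for a general-purpose database and the system of record.

```python
from dataclasses import dataclass, field


@dataclass
class Olgp:
    """Hypothetical stand-in for the general-purpose DB (the "filing cabinet")."""
    rows: dict = field(default_factory=dict)

    def upsert(self, key, row):
        self.rows[key] = row


@dataclass
class Ledger:
    """Hypothetical stand-in for the system of record (the "bank vault")."""
    transfers: dict = field(default_factory=dict)

    def create_transfer(self, transfer_id, debit, credit, amount, user_data):
        # Assumption: creation is idempotent on transfer_id, so a retry
        # of the same request cannot commit twice.
        if transfer_id in self.transfers:
            return False
        self.transfers[transfer_id] = {
            "debit": debit, "credit": credit,
            "amount": amount, "user_data": user_data,
        }
        return True


def checkout(olgp, ledger, transfer_id, order):
    # Write path, step 1: store data dependencies in the OLGP database.
    olgp.upsert(order["order_id"], order)
    # Write path, step 2: commit last, by writing to the system of record.
    # user_data links the transfer back to the OLGP row.
    return ledger.create_transfer(
        transfer_id, debit="cart", credit="checkout",
        amount=order["amount"], user_data=order["order_id"],
    )


def read_order(olgp, ledger, transfer_id):
    # Read path: ask the system of record first. An OLGP row without a
    # committed transfer is treated as if it never happened.
    transfer = ledger.transfers.get(transfer_id)
    if transfer is None:
        return None
    return olgp.rows[transfer["user_data"]]
```

With this ordering, a crash between the two write steps leaves at most an orphaned OLGP row, which the read path never surfaces because it starts from the ledger.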
        
       | andyferris wrote:
       | I found the line about TigerBeetle's model assuming entire disk
       | sector errors but not bit/byte errors rather interesting - as
       | someone who has created error correcting codes, this seems out of
       | line with my understanding. The only situation I can see it works
       | is where the disk or driver encodes and decodes the sectors...
       | and (on any disk/driver I would care to store an important
       | transactional database) would be reporting tonnes of (possibly
       | corrected) faults before TigerBeetle was even aware.
       | 
       | Or possibly my mental model of how physical disks and the driver
       | stack behave these days is outdated.
        
         | matklad wrote:
         | Just to clarify, our _model_ totally assumes bit/byte error!
         | It's just that our fuzzer was buggy and wasn't actually
         | exercising those faults!
        
       | DetroitThrow wrote:
       | This is a particularly fun Jepsen report after reading their
       | fuzzer blind spots post.
       | 
       | It looks like the segfaults on the JNI side would not have been
       | prevented even if Rust or some other memory-safe language were being
       | used - the lack of memory safety bugs gives some decent proof
       | that TigerBeetle's approach to Zig programming (TigerStyle iirc,
       | lol) does what it sets out to do.
        
         | matklad wrote:
         | See https://news.ycombinator.com/item?id=44201189. We did have
         | one bug where Rust would've saved our bacon (instead, the bacon
         | was saved by an assertion, so it was just slightly crispy, not
         | charred).
         | 
         | EDIT: But, yeah, totally, if not for TigerStyle, we'd die to
         | nasal demons!
        
       | ryeats wrote:
       | I think it is interesting but obvious in hindsight that it is
       | necessary to have the distributed system under test report the
       | time/order things actually happened to enable accurate validation
       | against an external model of the system instead of using wall-
       | clock time.
        
         | matklad wrote:
         | Note that this works because we have strict serializability.
         | With weaker consistency guarantees, there isn't necessarily a
         | single global consistent timeline.
         | 
         | This is an interesting meta pattern where doing something
         | _harder_ actually simplifies the system.
         | 
         | Another example is that, because we assume that the disk can
         | fail and need to include repair protocol, we get state-
         | synchronization for a lagging replica "for free", because it is
         | precisely the same situation as when the entire disk gets
         | corrupted!
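
A toy sketch of why repair and state sync can share one code path.
This is not TigerBeetle's actual protocol: the `repair` and
`checksum` helpers and the dict-of-blocks layout are hypothetical,
chosen only to show that "block fails its checksum" and "block was
never written" can be handled identically.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def repair(local: dict, peer: dict, expected: dict) -> list:
    """Replace every block that is missing or fails its checksum.

    local / peer map block index -> bytes; expected maps index -> checksum.
    Returns the indices that were repaired. (Hypothetical helper.)
    """
    repaired = []
    for index, want in expected.items():
        block = local.get(index)
        if block is None or checksum(block) != want:
            candidate = peer[index]
            if checksum(candidate) != want:
                raise ValueError(f"peer copy of block {index} is also bad")
            local[index] = candidate
            repaired.append(index)
    return repaired

# A healthy peer with four blocks.
peer = {i: f"block-{i}".encode() for i in range(4)}
expected = {i: checksum(b) for i, b in peer.items()}

# Case 1: one corrupted block on an otherwise up-to-date replica.
damaged = dict(peer)
damaged[2] = b"garbage"
print(repair(damaged, peer, expected))   # [2]

# Case 2: a lagging replica that only has block 0 (same code path).
lagging = {0: peer[0]}
print(repair(lagging, peer, expected))   # [1, 2, 3]
```

In this sketch the lagging replica is just the extreme case of the
damaged one: every block it lacks is treated as a failed read.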
        
           | aphyr wrote:
           | To build on this--this is something of a novel technique in
           | Jepsen testing! We've done arbitrary state machine
           | verification before, but usually that requires playing
           | forward lots of alternate timelines: one for each possible
           | ordering of concurrent operations. That search (see the
           | Knossos linearizability checker) is an exponential nightmare.
           | 
           | In TigerBeetle, we take advantage of some special properties
           | to make the state machine checking part linear-time. We let
           | TigerBeetle tell us exactly which transactions happen. We can
           | do this because it's a.) strong serializable, b.) immutable
           | (in that we can inspect DB state to determine whether an op
           | took place), and c.) exposes a totally ordered timestamp for
           | every operation. Then we check that that timestamp order is
           | consistent with real-time order, using a linear-time cycle
           | detection approach called Elle. Having established that
           | TigerBeetle's claims about the timestamp order are valid, we
           | can apply those operations to a simulated version of the
           | state machine to check semantic correctness!
           | 
           | I'd like to generalize this to other systems, but it's
           | surprisingly tricky to find all three of those properties in
           | one database. Maybe an avenue for future research!
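
The two-step check described above can be sketched roughly as
follows. This is a simplified stand-in, not Elle and not the Jepsen
test code: the `Op` record, `realtime_violation`, `replay`, and the
toy ledger rules are all assumptions made for illustration, and
timestamps are assumed unique.

```python
from dataclasses import dataclass

@dataclass
class Op:
    invoke: float      # wall-clock time the client sent the request
    complete: float    # wall-clock time the client got the response
    ts: int            # timestamp the database assigned (total order)
    src: str
    dst: str
    amount: int
    ok: bool           # result the database reported

def realtime_violation(ops):
    """Return an op whose timestamp order contradicts real time, or None.

    If A completed before B was invoked, A must have the smaller ts.
    Walking in ts order, track the latest invoke time seen so far; an
    op that completed before that moment was ordered too late.
    """
    latest_invoke = float("-inf")
    for op in sorted(ops, key=lambda o: o.ts):
        if op.complete < latest_invoke:
            return op
        latest_invoke = max(latest_invoke, op.invoke)
    return None

def replay(ops, balances):
    """Apply ops in ts order to a toy ledger; return first mismatch or None."""
    balances = dict(balances)
    for op in sorted(ops, key=lambda o: o.ts):
        expect_ok = balances.get(op.src, 0) >= op.amount
        if expect_ok:
            balances[op.src] -= op.amount
            balances[op.dst] = balances.get(op.dst, 0) + op.amount
        if expect_ok != op.ok:
            return op
    return None

history = [
    Op(invoke=0.0, complete=2.0, ts=10, src="a", dst="b", amount=5, ok=True),
    Op(invoke=1.0, complete=3.0, ts=11, src="b", dst="c", amount=5, ok=True),
    Op(invoke=4.0, complete=5.0, ts=12, src="a", dst="c", amount=9, ok=False),
]
print(realtime_violation(history))            # None
print(replay(history, {"a": 10, "b": 0}))     # None
```

After the sort, each pass is a single walk over the history, with
no search over alternate orderings of concurrent operations.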
        
       | FlyingSnake wrote:
       | Love the wonderfully detailed report. Getting it tested and
       | signed off by Jepsen is such a huge endorsement for TigerBeetle.
       | It hasn't even reached v1.0, and I can't wait to see it hit new
       | milestones in the future.
       | 
       | Special kudos to the founders who are sharing great insights in
       | this thread.
        
         | jorangreef wrote:
         | Yes, Kyle did an incredible job and I also love the detail he
         | put into the report. I kept saying to myself: "this is like a
         | work of art", the craftsmanship and precision.
         | 
         | Appreciate your kind words too, and look forward also to
         | sharing something new in our talks at SD25 in Amsterdam soon!
        
       | eevmanu wrote:
       | I have a question that I hope is not misinterpreted, as I'm
       | asking purely out of a desire to learn. I am new to distributed
       | systems and fascinated by deterministic simulation testing.
       | 
       | After reading the Jepsen report on TigerBeetle, the related blog
       | post, and briefly reviewing the Antithesis integration code on
       | GitHub workflow, I'm trying to better understand the testing
       | scope.
       | 
       | My core question is: could these bugs detected by the Jepsen test
       | suite have also been found by the Antithesis integration?
       | 
       | This question comes from a few assumptions I made, which may be
       | incorrect:
       | 
       | - I thought TigerBeetle was already comprehensively tested by its
       | internal test suite and the Antithesis product.
       | 
       | - I had the impression that the Antithesis test suite was more
       | robust than Jepsen's, so I was surprised that Jepsen found an
       | issue that Antithesis apparently did not.
       | 
       | I'm wondering if my understanding is flawed. For instance:
       | 
       | 1. Was the Antithesis test suite not fully capable of detecting
       | this specific class of bug?
       | 
       | 2. Was this particular part of the system not yet covered by the
       | Antithesis tests?
       | 
       | 3. Am I fundamentally comparing apples and oranges,
       | misunderstanding the different strengths and goals of the Jepsen
       | and Antithesis testing suites?
       | 
       | I would greatly appreciate any insights that could help me
       | understand this better. I want to be clear that my goal is to
       | educate myself on these topics, not to make incorrect assumptions
       | or assign responsibility.
        
         | aphyr wrote:
         | Yeah, TigerBeetle's blog post goes into more detail here, but
         | in short, the tests that were running in Antithesis (which were
         | remarkably thorough) didn't happen to generate the precise
         | combination of intersecting queries _and_ out-of-order values
         | that were necessary to find the index bug, whereas the Jepsen
         | generator did hit that combination.
         | 
         | There are almost certainly blind spots in the Jepsen test
         | generators too--that's part of why designing different
         | generators is so helpful!
        
           | eevmanu wrote:
           | Thanks for your answer, aphyr, and for this amazing analysis!
        
         | matklad wrote:
         | To add to what aphyr says, you generally need three components
         | for generative testing of distributed systems:
         | 
         | 1. Some sort of environment, which can run the system. The
         | simplest environment is to spin up a real cluster of machines,
         | but ideally you want something fancier, to improve performance,
         | control over responses of external APIs, determinism,
         | reproducibility, etc.
         | 
         | 2. Some sort of load generator, which makes the system in the
         | environment do interesting things.
         | 
         | 3. Some sort of auditor, which observes the behavior of the
         | system under load and decides whether the system behaves
         | according to the specification.
         | 
         | Antithesis mostly tackles problem #1, providing a deterministic
         | simulation environment as a virtual machine. The same problem
         | is tackled by Jepsen (by using real machines, but injecting
         | faults at the OS level), and by TigerBeetle's own VOPR (which
         | is co-designed with the database, and for that reason can run
         | the whole cluster on just a single thread). These three
         | approaches are complementary and are good at different things.
         | 
         | For this bug, the critical parts were #2 and #3: writing a
         | workload generator and auditor that can actually trigger the
         | bug. Here, it was aphyr's 1600 lines of TigerBeetle-specific
         | Clojure code that triggered and detected the bug (and then we
         | patched _our_ equivalent to also trigger it). Really, what's
         | buggy here is not the database, but the VOPR. A database having
         | bugs is par for the course; you can't just avoid bugs through
         | sheer force of will. So you need a testing strategy that can
         | trigger most bugs, and any bug that slips through points to a
         | deficiency in the workload generator.
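
The three components listed above can be sketched in miniature. This
is a toy, not the VOPR, Antithesis, or Jepsen: the `Ledger`
environment, the `generate` load generator, and the `audit` invariant
checks are all hypothetical and far simpler than a real database
test.

```python
import random

class Ledger:
    """Environment: a trivial in-memory stand-in for the system under test."""
    def __init__(self, accounts, opening):
        self.balances = {a: opening for a in accounts}

    def transfer(self, src, dst, amount):
        if src == dst or self.balances[src] < amount:
            return False
        self.balances[src] -= amount
        self.balances[dst] += amount
        return True

def generate(rng, accounts, n):
    """Load generator: n random transfers, some deliberately too large."""
    for _ in range(n):
        src, dst = rng.choice(accounts), rng.choice(accounts)
        yield src, dst, rng.randint(1, 150)

def audit(ledger, total):
    """Auditor: money is conserved and no balance goes negative."""
    assert sum(ledger.balances.values()) == total, "money created or lost"
    assert all(b >= 0 for b in ledger.balances.values()), "negative balance"

def run(seed, n=1000):
    rng = random.Random(seed)          # seeded, so a failure can be replayed
    accounts = ["a", "b", "c"]
    ledger = Ledger(accounts, opening=100)
    accepted = 0
    for src, dst, amount in generate(rng, accounts, n):
        accepted += ledger.transfer(src, dst, amount)
        audit(ledger, total=300)
    return accepted

print(run(seed=1) == run(seed=1))      # True: same seed, same run
```

A bug in `transfer` is only caught if `generate` happens to produce
the input that triggers it and `audit` checks the property it breaks,
which is the gap the comment above describes.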
        
           | aphyr wrote:
           | And honestly--designing a generator for a system like this is
           | hard. Really hard. I struggled for weeks to get something
           | that didn't just fail 99% of requests trivially, and it's an
           | (ahem) giant pile of probabilistic hacks. So I wouldn't be
           | too hard on the various TB test generators here!
           | 
           | https://github.com/jepsen-io/tigerbeetle/blob/main/src/jepse...
        
         | jorangreef wrote:
         | (Note also that 90% of our deterministic simulation testing is
         | done by the VOPR, TigerBeetle's own deterministic simulator,
         | which we built in-house, and which runs on a fleet of 1,000
         | dedicated CPU cores 24/7. We also use Antithesis, but as a
         | second layer of DST.)
         | 
         | To understand why the query engine bug slipped through, see:
         | https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
        
       | ManBeardPc wrote:
       | TigerBeetle is something I'm interested in. I see there is no C
       | or Zig client listed in the clients documentation. Thought these
       | would be the first ones to exist given it is written in Zig. Do
       | they exist or maybe still WIP?
        
       | 12_throw_away wrote:
       | A small appreciation for the section entitled "Panic! At the Disk
       | 0": <golf clap>
        
       ___________________________________________________________________
       (page generated 2025-06-06 23:01 UTC)