[HN Gopher] Jepsen: TigerBeetle 0.16.11
___________________________________________________________________
Jepsen: TigerBeetle 0.16.11
Author : aphyr
Score : 210 points
Date : 2025-06-06 10:53 UTC (12 hours ago)
| koakuma-chan wrote:
| Curious if they got any large bank or stock exchange to use
| TigerBeetle
| nindalf wrote:
| I think if they had, they'd brag about it on their homepage. So
| far the biggest endorsement from there is from some YouTuber. A
| popular YouTuber, no doubt, but a YouTuber nevertheless.
| koakuma-chan wrote:
| Yeah, TigerBeetle itself and their testing suite look
| impressive, but putting Primeagen there makes them look like
| Next.js or Cursor.
| jorangreef wrote:
| That's a talk for engineers that was streamed on the
| Primeagen and went a bit viral. If you haven't watched it
| yet, it's an intro to TigerBeetle technically.
|
| Otherwise check out https://tigerbeetle.com/company if you
| want more about the corporate side.
| nindalf wrote:
| If you can stand how that guy speaks, it's worth a watch.
| jorangreef wrote:
| I actually love the pace at which Prime speaks, but I
| feel awkward at hearing my own voice. Hopefully the ideas
| stand on merit!
| jorangreef wrote:
| Joran, creator and CEO from TigerBeetle here!
|
| At a national level, we're working with the Gates Foundation to
| integrate TigerBeetle into their non-profit central bank switch
| that will be powering Rwanda's National Digital Payments System
| 2.0 later this year [1].
|
| At an enterprise level, TigerBeetle already powers customers
| processing 100M+ transactions per month in production, and we
| recently signed our first $2B fintech unicorn in Europe with a
| few more in the US about to close. Because of the move to
| realtime transaction processing around the world [2] there's
| been quite a bit of interest from companies wanting to move to
| TigerBeetle for more performance.
|
| Finally, to your question, some of the founders of Clear
| Street, a fairly large brokerage on Wall Street have since
| invested [3] in TigerBeetle.
|
| [1] https://mojaloop.io/how-mojaloop-enables-rndps-2-0-ekash/
|
| [2] https://tigerbeetle.com/blog/2024-07-23-rediscovering-
| transa...
|
| [3] https://tigerbeetle.com/company
| diggan wrote:
| > some of the founders of Clear Street, a fairly large
| brokerage on Wall Street have since invested [3] in
| TigerBeetle
|
| "Invested" in terms of "giving you money" or in terms of "Now
| uses the database themselves"? I read it as the first, but I
| think the question is about usage, not investments.
| jorangreef wrote:
| Both. In terms of investing and planning to migrate.
| diggan wrote:
| Thanks for the clarification :)
| jorangreef wrote:
| You too! :)
| thomaspaine wrote:
| I work on the ledgering system at clear street and as far
| as I know we have no plans to do this. We evaluated it
| internally a few years ago and found that the account and
| transaction model was too different from ours to migrate
| over.
| jorangreef wrote:
| Hi Thomas, yes, I was there. However, this is something
| that Sachin and I subsequently discussed last year
| (Sachin recently provided the TPS footnote to be used in
| the report here). However, I understand that roadmap may
| since have changed, but this is to the best of my
| knowledge.
| sachnk99 wrote:
| Hi -- Sachin here, one of the founders of Clear Street.
| To clarify:
|
| - The investment in TigerBeetle was done personally, not
| through Clear Street.
|
| - I'm no longer actively involved day-to-day as CTO at
| Clear Street, but while I was, TigerBeetle was a solution
| we very much had in mind as our volumes were increasing.
|
| That said, roadmaps change, priorities shift, etc. If
| TigerBeetle existed when we started Clear Street, I very
| much would have used it, and it would have saved me many
| headaches.
| dralley wrote:
| Have you had a difficult time convincing customers to use a
| product written in a pre-1.0 programming language?
| matklad wrote:
| From the user's perspective, this doesn't matter at all.
| Zig is an implementation detail, what we actually ship is a
| fully statically linked native executable for the database,
| and "links only libc" (because thread locals!) .a/.so
| native "C" library for clients. Nothing will change, for
| the user, if we decide to rewrite the thing in Rust, or C,
| or Hare, nothing Zig-specific leaks out.
|
| From the developer perspective, the big thing is that we
| don't have any dependencies, so updating the compiler for us is
| just a small amount of work once in a while, and not your
| typical ecosystem-wide coordination problem. Otherwise,
| Zig's pretty much "finished" for our use-case, it more or
| less just works.
| jorangreef wrote:
| Zig's pre-1.0 status also refers more to API stability. The
| language and tooling already has more quality, at least in
| my own experience, than if we had picked C, which was the
| only other choice available to us when we made the decision
| to invest in Zig's trajectory back in 2020, given we needed
| to do static allocation and that any sort of global
| allocator was out of the question.
|
| But, no. On the commercial side, I don't think we've had
| one conversation with a prospect or CTO or engineering team
| where they were concerned that we picked a systems language
| for the next thirty years. And while Zig is a beautiful,
| perfect replacement for C, I think the real reason the
| question has never come up, is that our customers come to
| us instead of us to them. We're not trying to convince
| anyone. They're already appreciating the extensive end-to-
| end testing we do on everything we ship.
|
| However, I should emphasize again, that given all the
| assertions, fuzzing and DST we do, Zig's quality can't be
| overstated. It holds up.
| SOLAR_FIELDS wrote:
| Not a bank or exchange but I work for a very large fintech and
| we are using it on our newer products.
| jorangreef wrote:
| Awesome to hear that! Are we chatting in Slack? Or please DM
| me or Lewis. Would love to chat!
| nindalf wrote:
| Very impressed with this report. Whenever I read TigerBeetle's
| claims on reliability and scalability, I'd think "ok, let's wait
| for the Jepsen report".
|
| This report found a number of issues, which might be a cause for
| concern. But I think it's a positive because they didn't just fix
| the issues, they've expanded their internal test suite to catch
| similar bugs in future. With such an approach to engineering I
| feel like in 10 years TigerBeetle would have achieved the "just
| use Postgres" level of default database in its niche of financial
| applications.
|
| Also great work aphyr! I feel like I learned a lot reading this
| report.
| jorangreef wrote:
| Thanks!
|
| Yes, we have around 6,000+ assertions in TigerBeetle. A few of
| these were overtight, hence some of the crashes. But those were
| the assertions doing their job, alerting us that we needed to
| adjust our mental model, which we did.
|
| Otherwise, apart from a small correctness bug in an internal
| testing feature we added (only in our Java client and only for
| Jepsen to facilitate the audit) there was only one correctness
| bug found by Jepsen, and it didn't affect durability. We've
| written about it here:
| https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
|
| Finally, to be fair, TigerBeetle can survive (and is tested to survive)
| more faults than Postgres can, since it was designed with an
| explicit storage fault model and using research that was not
| available at the time when Postgres was released in '96. TB's
| fault models are further tested with Deterministic Simulation
| Testing and we use techniques such as static memory allocation
| following NASA's Power of Ten Rules for Safety-Critical Code.
| There are known scenarios in the literature that will cause
| Postgres to lose data, which TigerBeetle can detect and recover
| from.
|
| For more on this, see the section in Kyle's report on helical
| fault injection (most Raft and Paxos implementations were not
| designed to survive this) as well as a talk we gave at QCon
| London: https://m.youtube.com/watch?v=_jfOk4L7CiY
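
As a rough illustration of the "static memory allocation" and assertion style described in the comment above, here is a minimal Python sketch. It is not TigerBeetle code (which is written in Zig); `FixedPool`, its capacity, and its method names are hypothetical and exist only for this example. The idea shown is to reserve everything at startup, never grow afterwards, and assert invariants so that a violated assumption stops the process loudly instead of corrupting state.

```python
class FixedPool:
    """Hypothetical fixed-capacity buffer pool: all memory is reserved up front."""

    def __init__(self, capacity: int, buffer_size: int) -> None:
        assert capacity > 0 and buffer_size > 0
        # Everything is allocated once, at startup.
        self._buffers = [bytearray(buffer_size) for _ in range(capacity)]
        self._free = list(range(capacity))  # indices of unused buffers

    def acquire(self) -> int:
        # Running out is treated as a sizing bug, not a reason to allocate more.
        assert self._free, "pool exhausted: static limit exceeded"
        return self._free.pop()

    def release(self, index: int) -> None:
        assert 0 <= index < len(self._buffers)
        assert index not in self._free, "double release"
        self._free.append(index)

    def buffer(self, index: int) -> bytearray:
        assert index not in self._free, "use after release"
        return self._buffers[index]

    def free_count(self) -> int:
        return len(self._free)


pool = FixedPool(capacity=2, buffer_size=16)
a = pool.acquire()
b = pool.acquire()
pool.buffer(a)[0] = 0xFF
pool.release(a)
```

An assertion that turns out to be too strict shows up as a crash, which is the "overtight" failure mode mentioned above: the check fires even though the state was actually acceptable.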
| jrpelkonen wrote:
| Hi Joran,
|
| I have followed TigerBeetle with interest for a while, and
| thank you for your inspirational work and informative
| presentations.
|
| However, you have stated on several occasions that the lack
| of memory safety in Zig is not a concern since you don't
| dynamically allocate memory post startup. However, one of the
| defects uncovered here (#2435) was caused by dereferencing an
| uninitialized pointer. I find this pretty concerning, so I
| wonder if there is something that you will be doing
| differently to eliminate all similar bugs going forward?
| matklad wrote:
| Note that that's a bug in the client, in the Zig-java FFI
| code, which is inherently unsafe. We'd likely have made a
| similar bug in Rust.
|
| Which is, yeah, one of the bigger technical challenges for
| us --- we ship language-native libraries for
| Go, Node, Java, C#, Python and Rust, and, like in the Tolstoy
| novel, each one is peculiar in its own way. What's worse,
| they aren't directly covered by our deterministic
| simulator. That's one of the major reasons why we invest in
| full-system simulation with jepsen, antithesis and vortex
| (https://tigerbeetle.com/blog/2025-02-13-a-descent-into-
| the-v...). We are also toying with the idea of generating
| _more_ of that code, so there's less room for human error.
| Maybe one day we'll even do fully native client (eg, pure
| Java, pure Go), but we are not there yet.
|
| One super-specific in-progress thing is that, at the
| moment, the _bulk_ of the client testing is duplicated per
| client, and also the _bulk_ of the testing is example-
| based. Building simulator/workload is a lot of work, and
| duplicating it for each client is unreasonable. What we
| want to do here is to use multi-process architecture, where
| there's a single Zig process that generates the workloads
| and generates interesting sequences of commands for
| clients, and then in each client we implement just a tiny
| "interpreter" for workload language, getting a test suite
| for free. This is still WIP though!
|
| Regarding the broader memory safety issue in the database.
| We did have a couple of memory safety bugs, which were
| caught early in testing. We did have one very bad aliasing
| bug, which would have been totally prevented by Rust, which
| slipped through the bulk of our testing and into the
| release (it was caught in testing _after_ it was
| introduced):
| https://github.com/tigerbeetle/tigerbeetle/pull/2774.
| Notably, while the bug was bad enough to completely mess up
| our internal data structure, it was immediately caught by
| an assert down the line, and downgraded from a correctness
| issue to a small availability issue (just restarting the
| replica would fix it). Curiously, the root cause for that
| bug was that we over-complicated our code. Long before the
| actual bug we felt uneasy about the data structure in
| question, and thought about refactoring it away (that
| refactor is underway. Hilariously, it looks like just
| "removing" the thing without any other code changes
| improves performance!).
|
| So, on balance, yeah, Rust would've prevented a small
| number of easy bugs, and one gnarly bug, but then the entire
| thing would have to look completely different, as the
| architecture of TigerBeetle is not at all Rust-friendly. I'd
| be curious to see someone replicating single-thread io-
| uring no malloc after startup architecture in Rust! I
| personally don't know off the top of my head whether that
| would work or not.
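
The "tiny interpreter for a workload language" idea from the comment above could look roughly like the following Python sketch. The command format (JSON lines with an `op` field), the op names, and `FakeClient` are all assumptions made for illustration; the thread does not describe the actual workload language, and the real generator would be a separate Zig process.

```python
import json


class FakeClient:
    """Hypothetical stand-in for one language-native client library."""

    def __init__(self):
        self.balances = {}

    def create_account(self, account_id):
        if account_id in self.balances:
            return "exists"
        self.balances[account_id] = 0
        return "ok"

    def create_transfer(self, debit, credit, amount):
        if debit not in self.balances or credit not in self.balances:
            return "account_not_found"
        self.balances[debit] -= amount
        self.balances[credit] += amount
        return "ok"


def interpret(lines, client):
    """Run each workload command against the client; yield one JSON result per command."""
    for line in lines:
        cmd = json.loads(line)
        if cmd["op"] == "create_account":
            result = client.create_account(cmd["id"])
        elif cmd["op"] == "create_transfer":
            result = client.create_transfer(cmd["debit"], cmd["credit"], cmd["amount"])
        else:
            result = "unknown_op"
        yield json.dumps({"seq": cmd["seq"], "result": result})


# In a multi-process setup these lines would arrive over a pipe from the generator.
workload = [
    '{"seq": 1, "op": "create_account", "id": 1}',
    '{"seq": 2, "op": "create_account", "id": 2}',
    '{"seq": 3, "op": "create_transfer", "debit": 1, "credit": 2, "amount": 10}',
    '{"seq": 4, "op": "create_transfer", "debit": 1, "credit": 9, "amount": 5}',
]
results = [json.loads(r) for r in interpret(workload, FakeClient())]
```

The interpreter only translates commands into client calls and reports results; deciding whether those results are correct stays with the single generator process, which is what keeps the per-client code small.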
| jcalabro wrote:
| I remember reading a similar thing about FoundationDB
| with their DST a while back. Over time, they surfaced
| relatively few bugs in the core server, but found a bunch
| in the client libraries because the clients were more
| complicated and were not run under their DST.
|
| Anyways, really interesting report and project. I also
| like your youtube show - keep up the great work! :)
| matklad wrote:
| Oh, important clarification from
| andrewrk(https://lobste.rs/c/tf6jng), which I totally
| missed myself: this isn't actually a dereference of an
| uninitialized pointer, it's a deref of a pointer which is
| explicitly set to a specific, invalid value.
| jrpelkonen wrote:
| This is indeed an important point, the way I originally
| understood the bug was that the memory was not
| initialized at all. Thanks for the clarification
| AndyKelley wrote:
| TigerBeetle uses ReleaseSafe optimization mode, which means
| that the pointer was in fact initialized to
| 0xaaaaaaaaaaaaaaaa. Since nothing is mapped to this
| address, it reliably causes a segfault. This is equivalent
| to an assertion failure.
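
To make the address in the comment above concrete, this small Python sketch builds an 8-byte pointer out of the 0xaa fill pattern and checks it against the canonical-address rule for 48-bit virtual addressing. The 64-bit pointer width and 48-bit address space are assumptions (typical for x86-64), and `is_canonical_48` is a helper written only for this illustration.

```python
POINTER_BYTES = 8   # assumed 64-bit pointers
FILL_BYTE = 0xAA    # fill pattern described in the comment above

# An 8-byte pointer slot filled with the pattern, read back as an integer.
raw = bytes([FILL_BYTE]) * POINTER_BYTES
pointer = int.from_bytes(raw, "little")


def is_canonical_48(address: int) -> bool:
    """With 48-bit virtual addresses, bits 63..47 must all be equal."""
    top = address >> 47  # the upper 17 bits
    return top == 0 or top == (1 << 17) - 1


print(hex(pointer))
print(is_canonical_48(pointer))
```

Under those assumptions the pattern alternates ones and zeros in its upper bits, so it cannot be a canonical address, which is consistent with the observation that dereferencing it faults reliably.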
| jrpelkonen wrote:
| That's good to hear! Thanks for the clarification.
| anarazel wrote:
| > There are known scenarios in the literature that will cause
| Postgres to lose data, which TigerBeetle can detect and
| recover from.
|
| What are you referencing here?
| jorangreef wrote:
| The scenarios described in our QCon London talk linked
| above.
|
| This surveys the excellent storage fault research from UW-
| Madison, and in particular:
|
| - "Can Applications Recover from fsync Failures?"
|
| - "Protocol-Aware Recovery for Consensus-Based Storage"
|
| Finally, I'd recommend watching "Consensus and the Art of
| Durability", our talk from SD24 in NYC last year:
|
| https://www.youtube.com/watch?v=tRgvaqpQPwE
| SOLAR_FIELDS wrote:
| I always get excited to read Kyle's write ups. I feel like I
| level up my distributed systems knowledge every time he puts
| something out.
| cmrdporcupine wrote:
| The article's link to the paper about "Viewstamped Replication" is
| unfortunately broken (https://pmg.csail.mit.edu/papers/vr-
| revisited.pdf connection refused).
|
| I think it should be http://pmg.csail.mit.edu/papers/vr-
| revisited.pdf (http scheme not https) ?
|
| And now I have some Friday evening reading material.
| jorangreef wrote:
| It should be fixed soon!
|
| The VSR 2012 paper is one of my favorites as is "Protocol-Aware
| Recovery for Consensus-Based Storage", which is so powerful.
|
| Hope you enjoy the read!
| tomhow wrote:
| See also:
|
| _Fuzzer Blind Spots (Meet Jepsen!)_ -
| https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
| Ygg2 wrote:
| TigerBeetle is impressive, but it's a single purpose DB. Unless
| you fit within the account ledger model it's extremely
| restrictive.
| jorangreef wrote:
| Joran from TigerBeetle here!
|
| Yes, TigerBeetle specializes only for transaction processing
| (OLTP). It's not a general-purpose (OLGP) DBMS.
|
| That said, we have customers from energy to gaming, and of
| course fintech.
| SOLAR_FIELDS wrote:
| That is 100% correct. You use TigerBeetle when you need a
| really good double entry accounting system that is open source.
| You wouldn't use it for much else other than that. Which makes
| it great software, it's purpose made to solve one problem
| really well
| saaaaaam wrote:
| That's a slightly redundant criticism though - it doesn't
| present itself as anything other than a single purpose database
| designed for financial transactions.
|
| That's like saying that rice noodles are no good for making
| risotto. At the core they are both rice...
| Ygg2 wrote:
| People seem to describe it as OLTP, and one of the first DBs to
| come up in an OLTP search is MySQL.
| dumah wrote:
| OLTP (Online Transaction Processing) is a database paradigm
| optimized for handling high volumes of short, fast
| transactions in real-time, typically supporting day-to-day
| operational activities like order processing, inventory
| updates, and customer account management where data
| integrity and quick response times are critical.
|
| Another paradigm is OLAP, in which aggregation of large
| datasets is the principal concern.
| Ygg2 wrote:
| Yes, I'm aware. It seems now there is a further
| bifurcation. OLTP is no longer general purpose, but now
| it's also for only one narrow use-case.
| wiradikusuma wrote:
| If memory serves, TigerBeetle is/was not free for production? I
| can't find the Pricing page, but I kinda remember reading about
| it somewhere (or it was implied) a while back.
| jorangreef wrote:
| The DBMS is Apache 2.0 and our customers pay us (well) for
| everything else to run, integrate, migrate, operate and support
| that.
|
| For more on our open source thinking and how this is orthogonal
| to business model (and product!), see our interview with the
| Changelog: https://m.youtube.com/watch?v=Yr8Y2EYnxJs
| boris wrote:
| I watched that but I don't see it as convincing. Let's take
| the AWS example brought up in the talk. The "compete on the
| interface, not (open source) implementation" idea I think
| misses (at least) the following points:
|
| 1. AWS will take your initial and ongoing investment in the
| implementation but they don't have to share theirs with you.
| Specifically, they will take your improvements but their own
| improvements (say some performance optimizations) they can
| keep to themselves. It's good business sense if it allows
| them to further differentiate their "improved" offering from
| your "vanilla" service.
|
| 2. Competing on the interface in this case really means
| competing on related services like management, etc. So your
| thesis is that you will provide a better/cheaper managed
| service than AWS. Even if that's true (a big if), most of the
| time the decision which service to use will have little to do
| with technical merit. I.e. we already use AWS, have SLA
| painfully negotiated, get volume discounts, etc. Do we really
| want to go through all of this with another vendor just for
| one extra service.
|
| Just a couple of thoughts that will hopefully help you
| sharpen your thesis.
| kristoff_it wrote:
| > AWS will take your initial and ongoing investment in the
| implementation but they don't have to share theirs with
| you. Specifically, they will take your improvements but
| their own improvements (say some performance optimizations)
| they can keep to themselves. It's good business sense if it
| allows them to further differentiate their "improved"
| offering from your "vanilla" service.
|
| In practice all I've seen from AWS is just to add
| integrations with their internal orchestrators and not much
| else. Back when I was at Redis Labs, AWS added TLS support
| to Redis and was dying to get that upstreamed (so that they
| wouldn't have to maintain the patch), except that as far as
| I understood nobody upstream wanted that code. In other
| words, hypothetical improvements by AWS (and other Clouds)
| are extremely overrated. When it comes to tigerbeetle, I
| would put the chance that they introduce bugs and
| vulnerabilities much higher than the possibility they add
| any meaningful improvement over what the actual experts
| (the TigerBeetle team) have already done.
|
| > Do we really want to go through all of this with another
| vendor just for one extra service.
|
| That's a great point, and in fact I've seen AWS
| purposefully offer insane (in Europe maybe we would say
| anti-competitive) discounts precisely to prevent Redis Labs
| from gaining market share. I'm sure they will try the same
| with TB once it becomes mainstream enough. What TB has that
| Redis doesn't have is the fact that it's a database
| designed for truly mission-critical stuff (i.e. counting
| the money) and maybe customers will be willing to go
| through the extra motions to ensure they get the best
| service they can (assuming TB will be able to provide
| that).
| boris wrote:
| > In other words, hypothetical improvements by AWS (and
| other Clouds) are extremely overrated.
|
| Interesting, in a recent thread (I think it was about
| Redis going back open source) an AWS employee was
| bragging about substantial concurrency optimizations they
| implemented in Valkey. At the time I thought it could
| have been a great differentiator to keep proprietary but
| perhaps they decided to sacrifice it to help make sure
| Valkey takes over the Redis mindshare.
| kristoff_it wrote:
| That's a special case for sure, given the new fight for
| supremacy between the two forks, that said you can see in
| all those threads antirez bickering with the AWS people
| over exactly who introduced what.
| jorangreef wrote:
| To be clear, we have no problem if all the hyperscalers
| decide to offer TigerBeetle as their flagship OLTP
| database. That builds trust and is a good thing for the
| ecosystem as a whole.
|
| We also don't expect (or need) anyone to contribute
| improvements upstream to us. That's open source!
|
| Finally, open source is not the same thing as product.
| There are thousands of companies around the world who make
| high quality products that people pay for. TigerBeetle is
| no different.
| jitl wrote:
| Really happy to see TigerBeetle live up to its claims as verified
| by aphyr - because it's good to see that when you take the right
| approach, you get the right results.
|
| Question about how people end up using TigerBeetle. There's
| presumably a lot of external systems and other databases around a
| TigerBeetle install for everything that isn't an Account or
| Transfer. What's the typical pattern for those less reliable
| systems to square up to TigerBeetle, especially to recover from
| consistency issues between the two?
| jorangreef wrote:
| Joran from TigerBeetle here! Thanks! Really happy to see the
| report published too.
|
| The typical pattern in integrating TigerBeetle is to
| differentiate between control plane (Postgres for general
| purpose or OLGP) and data plane (TigerBeetle for transaction
| processing or OLTP).
|
| All your users (names, addresses, passwords etc.) and products
| (descriptions, prices etc.) then go into OLGP as your "filing
| cabinet".
|
| And then all the Black Friday transactions these users (or
| entities) make, to move products from inventory accounts to
| shopping cart accounts, and from there to checkout and delivery
| accounts--all these go into OLTP as your "bank vault".
| TigerBeetle lets you store up to 3 user data identifiers per
| account or transfer to link events (between entities) back to
| your OLGP database which describes these entities.
|
| This architecture [1] gives you a clean "separation of
| concerns", allowing you to scale and manage the different
| workloads independently. For example, if you're a bank, it's
| probably a good idea not to keep all your cash in the filing
| cabinet with the customer records, but rather to keep the cash
| in the bank vault, since the information has different
| performance/compliance/retention characteristics.
|
| This pattern makes sense because users change their name or
| email address (OLGP) far less frequently than they transact
| (OLTP).
|
| Finally, to preserve consistency, on the write path, you treat
| TigerBeetle, the OLTP data plane, as your "system of record".
| When a "move to shopping cart" or "checkout" transaction comes
| in, you first write all your data dependencies to OLGP if any
| (and say S3 if you have related blob data) and then finally you
| commit your transaction by writing to TigerBeetle. On the read
| path, you query your system of record first, preserving strict
| serializability.
|
| Does that make sense? Let me know if there's anything here we
| can drill into further!
|
| [1] https://docs.tigerbeetle.com/coding/system-architecture/
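
The write-path and read-path ordering described above can be sketched as follows. This is a minimal Python illustration with in-memory stand-ins: `Olgp`, `Ledger`, `checkout`, `read_order`, and the single `user_data` field are hypothetical names, not the TigerBeetle client API, and the no-op behaviour on a repeated transfer id is an assumption of this sketch.

```python
class Olgp:
    """Hypothetical stand-in for the general-purpose database (control plane)."""

    def __init__(self):
        self.orders = {}

    def upsert_order(self, order_id, details):
        self.orders[order_id] = details


class Ledger:
    """Hypothetical stand-in for the system of record (data plane)."""

    def __init__(self):
        self.transfers = {}

    def create_transfer(self, transfer_id, debit, credit, amount, user_data):
        # Assumption of this sketch: repeating an id is a no-op, so retries are safe.
        self.transfers.setdefault(transfer_id, (debit, credit, amount, user_data))

    def lookup_transfer(self, transfer_id):
        return self.transfers.get(transfer_id)


def checkout(olgp, ledger, order_id, transfer_id, debit, credit, amount, details):
    # 1. Write data dependencies first; on their own they commit nothing.
    olgp.upsert_order(order_id, details)
    # 2. The ledger write is the commit point; user_data links back to the order.
    ledger.create_transfer(transfer_id, debit, credit, amount, user_data=order_id)


def read_order(olgp, ledger, transfer_id):
    # Query the system of record first; an order row without a transfer never happened.
    transfer = ledger.lookup_transfer(transfer_id)
    if transfer is None:
        return None
    order_id = transfer[3]
    return {"transfer": transfer, "order": olgp.orders[order_id]}


olgp, ledger = Olgp(), Ledger()
olgp.upsert_order(7, {"item": "book"})  # simulated crash before the commit step
orphan = read_order(olgp, ledger, transfer_id=1001)
checkout(olgp, ledger, 7, 1001, debit=1, credit=2, amount=25, details={"item": "book"})
committed = read_order(olgp, ledger, transfer_id=1001)
```

Because readers start from the ledger, a crash between the two writes leaves only an unreferenced row in the general-purpose store, and a retry of the whole sequence completes the transaction.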
| andyferris wrote:
| I found the line about Tigerbeetle's model assuming entire disk
| sector errors but not bit/byte errors rather interesting - as
| someone who has created error correcting codes, this seems out of
| line with my understanding. The only situation I can see it works
| is where the disk or driver encodes and decodes the sectors...
| and (on any disk/driver I would care to store an important
| transactional database) would be reporting tonnes of (possibly
| corrected) faults before Tigerbeetle was even aware.
|
| Or possibly my mental model of how physical disks and the driver
| stack behave these days is outdated.
| matklad wrote:
| Just to clarify, our _model_ totally assumes bit/byte error!
| It's just that our fuzzer was buggy and wasn't actually
| exercising those faults!
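
As a sketch of the distinction discussed above, the following Python example injects both a whole-sector fault and a single-bit fault and checks that a per-sector checksum notices each. The 512-byte sector size, the use of CRC32 from `zlib`, and all function names are assumptions for illustration; nothing here reflects TigerBeetle's actual on-disk format or fuzzer.

```python
import random
import zlib

SECTOR_SIZE = 512  # assumed sector size for this sketch


def write_sector(payload: bytes) -> bytes:
    """Store a CRC32 of the payload in the last 4 bytes of the sector."""
    assert len(payload) == SECTOR_SIZE - 4
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def is_intact(sector: bytes) -> bool:
    payload, stored = sector[:-4], sector[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


def corrupt_sector(sector: bytes, rng: random.Random) -> bytes:
    """Coarse fault: replace the whole sector with random bytes."""
    return bytes(rng.getrandbits(8) for _ in range(len(sector)))


def flip_bit(sector: bytes, rng: random.Random) -> bytes:
    """Fine-grained fault: flip exactly one bit somewhere in the sector."""
    damaged = bytearray(sector)
    position = rng.randrange(len(damaged) * 8)
    damaged[position // 8] ^= 1 << (position % 8)
    return bytes(damaged)


rng = random.Random(42)  # fixed seed keeps the run reproducible
good = write_sector(bytes(SECTOR_SIZE - 4))
faults = {"sector": corrupt_sector(good, rng), "bit": flip_bit(good, rng)}
detected = {name: not is_intact(data) for name, data in faults.items()}
```

A fault injector that only ever calls `corrupt_sector` would leave the `flip_bit` path untested even though the checksum model covers it, which is the kind of gap between model and fuzzer described in the reply.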
| DetroitThrow wrote:
| This is a particularly fun Jepsen report after reading their
| fuzzer blind spots post.
|
| It looks like the segfaults on the JNI side would not have been
| prevented even if Rust or some other memory safe language were being
| used - the lack of memory safety bugs gives some decent proof
| that TigerBeetle's approach to Zig programming (TigerStyle iirc,
| lol) does what it sets out to do.
| matklad wrote:
| See https://news.ycombinator.com/item?id=44201189. We did have
| one bug where Rust would've saved our bacon (instead, the bacon
| was saved by an assertion, so it was just slightly crispy, not
| charred).
|
| EDIT: But, yeah, totally, if not for TigerStyle, we'd die to
| nasal demons!
| ryeats wrote:
| I think it is interesting but obvious in hindsight that it is
| necessary to have the distributed system under test report the
| time/order things actually happened to enable accurate validation
| against an external model of the system instead of using wall-
| clock time.
| matklad wrote:
| Note that this works because we have strict serializability.
| With weaker consistency guarantees, there isn't necessarily a
| single global consistent timeline.
|
| This is an interesting meta pattern where doing something
| _harder_ actually simplifies the system.
|
| Another example is that, because we assume that the disk can
| fail and need to include repair protocol, we get state-
| synchronization for a lagging replica "for free", because it is
| precisely the same situation as when the entire disk gets
| corrupted!
| aphyr wrote:
| To build on this--this is something of a novel technique in
| Jepsen testing! We've done arbitrary state machine
| verification before, but usually that requires playing
| forward lots of alternate timelines: one for each possible
| ordering of concurrent operations. That search (see the
| Knossos linearizability checker) is an exponential nightmare.
|
| In TigerBeetle, we take advantage of some special properties
| to make the state machine checking part linear-time. We let
| TigerBeetle tell us exactly which transactions happen. We can
| do this because it's a.) strong serializable, b.) immutable
| (in that we can inspect DB state to determine whether an op
| took place), and c.) exposes a totally ordered timestamp for
| every operation. Then we check that that timestamp order is
| consistent with real-time order, using a linear-time cycle
| detection approach called Elle. Having established that
| TigerBeetle's claims about the timestamp order are valid, we
| can apply those operations to a simulated version of the
| state machine to check semantic correctness!
|
| I'd like to generalize this to other systems, but it's
| surprisingly tricky to find all three of those properties in
| one database. Maybe an avenue for future research!
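
The two-step check described above (validate the database's timestamp order against real time, then replay in that order against a model) can be sketched in Python. This is a simplified stand-in: it uses a direct pairwise real-time check after sorting rather than Elle's cycle detection, and the `Op` fields and single-account balance model are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    invoke: float    # client wall-clock time when the request was sent
    complete: float  # client wall-clock time when the response arrived
    timestamp: int   # total-order timestamp reported by the database
    account: str
    delta: int
    observed: int    # balance the database reported after applying the op


def respects_real_time(ops):
    """If A completed before B was invoked, A's timestamp must be smaller than B's."""
    earliest_later_complete = float("inf")
    for op in sorted(ops, key=lambda o: o.timestamp, reverse=True):
        # Some op with a larger timestamp finished before this one even started.
        if earliest_later_complete < op.invoke:
            return False
        earliest_later_complete = min(earliest_later_complete, op.complete)
    return True


def replay_matches(ops):
    """Apply ops in timestamp order to a model and compare with observed results."""
    balances = {}
    for op in sorted(ops, key=lambda o: o.timestamp):
        balances[op.account] = balances.get(op.account, 0) + op.delta
        if balances[op.account] != op.observed:
            return False
    return True


history = [
    Op(invoke=0.0, complete=2.0, timestamp=10, account="a", delta=5, observed=5),
    Op(invoke=1.0, complete=3.0, timestamp=11, account="a", delta=-2, observed=3),
    Op(invoke=4.0, complete=5.0, timestamp=12, account="a", delta=1, observed=4),
]
# Same first two ops, but the third claims a timestamp earlier than ops that
# had already completed before it was invoked.
stale = history[:2] + [
    Op(invoke=4.0, complete=5.0, timestamp=9, account="a", delta=1, observed=1),
]
```

After the sort, both passes are a single scan, so there is no need to explore every possible ordering of concurrent operations; the cost is that the approach depends on the database exposing a trustworthy total order in the first place.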
| FlyingSnake wrote:
| Love the wonderfully detailed report. Getting it tested and
| signed off by Jepsen is such a huge endorsement for TigerBeetle.
| It hasn't even reached v1.0 and I can't wait to see it hit new
| milestones in the future.
|
| Special kudos to the founders who are sharing great insights in
| this thread.
| jorangreef wrote:
| Yes, Kyle did an incredible job and I also love the detail he
| put into the report. I kept saying to myself: "this is like a
| work of art", the craftsmanship and precision.
|
| Appreciate your kind words too, and look forward also to
| sharing something new in our talks at SD25 in Amsterdam soon!
| eevmanu wrote:
| I have a question that I hope is not misinterpreted, as I'm
| asking purely out of a desire to learn. I am new to distributed
| systems and fascinated by deterministic simulation testing.
|
| After reading the Jepsen report on TigerBeetle, the related blog
| post, and briefly reviewing the Antithesis integration code on
| GitHub workflow, I'm trying to better understand the testing
| scope.
|
| My core question is: could these bugs detected by the Jepsen test
| suite have also been found by the Antithesis integration?
|
| This question comes from a few assumptions I made, which may be
| incorrect:
|
| - I thought TigerBeetle was already comprehensively tested by its
| internal test suite and the Antithesis product.
|
| - I had the impression that the Antithesis test suite was more
| robust than Jepsen's, so I was surprised that Jepsen found an
| issue that Antithesis apparently did not.
|
| I'm wondering if my understanding is flawed. For instance:
|
| 1. Was the Antithesis test suite not fully capable of detecting
| this specific class of bug?
|
| 2. Was this particular part of the system not yet covered by the
| Antithesis tests?
|
| 3. Am I fundamentally comparing apples and oranges,
| misunderstanding the different strengths and goals of the Jepsen
| and Antithesis testing suites?
|
| I would greatly appreciate any insights that could help me
| understand this better. I want to be clear that my goal is to
| educate myself on these topics, not to make incorrect assumptions
| or assign responsibility.
| aphyr wrote:
| Yeah, TigerBeetle's blog post goes into more detail here, but
| in short, the tests that were running in Antithesis (which were
| remarkably thorough) didn't happen to generate the precise
| combination of intersecting queries _and_ out-of-order values
| that were necessary to find the index bug, whereas the Jepsen
| generator did hit that combination.
|
| There are almost certainly blind spots in the Jepsen test
| generators too--that's part of why designing different
| generators is so helpful!
| eevmanu wrote:
| Thanks for your answer aphyr and for this amazing analysis
| matklad wrote:
| To add to what aphyr says, you generally need three components
| for generative testing of distributed systems:
|
| 1. Some sort of environment, which can run the system. The
| simplest environment is to spin up a real cluster of machines,
| but ideally you want something fancier, to improve performance,
| control over responses of external APIs, determinism,
| reproducibility, etc.
|
| 2. Some sort of load generator, which makes the system in the
| environment do interesting things.
|
| 3. Some sort of auditor, which observes the behavior of the system
| under load and decides whether the system behaves according to
| the specification.
|
| Antithesis mostly tackles problem #1, providing a deterministic
| simulation environment as a virtual machine. The same problem
| is tackled by Jepsen (by using real machines, but injecting
| faults at the OS level), and by TigerBeetle's own VOPR (which
| is co-designed with the database, and for that reason can run
| the whole cluster on just a single thread). These three
| approaches are complementary and are good at different things.
|
| For this bug, the critical part was #2, #3 --- writing a workload
| generator and auditor that can actually trigger the bug. Here,
| it was aphyr's 1600 lines of TigerBeetle-specific Clojure code
| that triggered and detected the bug (and then we patched _our_
| equivalent to also trigger it). Really, what's buggy here is not
| the database, but the VOPR. A database having bugs is par for the
| course, you can't just avoid bugs through the sheer force of
| will. So you need a testing strategy that can trigger most bugs,
| and any bug that slips through is pointing to a deficiency in
| the workload generator.
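
The three components listed above, and the way a generator blind spot hides a bug, can be shown with a toy Python sketch. Everything here is hypothetical: the "system" is an in-memory counter with a deliberately injected bug, and `Environment`, `generate`, `audit`, and `run` are names invented for this example rather than parts of Jepsen, Antithesis, or the VOPR.

```python
import random


class Environment:
    """Runs the system under test; here a deliberately buggy in-memory counter."""

    def __init__(self):
        self.value = 0

    def apply(self, amount):
        # Injected bug: an amount of exactly 7 is silently dropped.
        if amount != 7:
            self.value += amount
        return self.value


def generate(rng, repeats, max_amount):
    """Load generator: every amount in 1..max_amount appears, in shuffled order."""
    amounts = list(range(1, max_amount + 1)) * repeats
    rng.shuffle(amounts)
    return amounts


def audit(workload, responses):
    """Auditor: compare every response with an independent model."""
    expected = 0
    for amount, response in zip(workload, responses):
        expected += amount
        if response != expected:
            return False
    return True


def run(seed, max_amount, repeats=20):
    rng = random.Random(seed)  # seeded so that any failure can be replayed
    env = Environment()
    workload = generate(rng, repeats, max_amount)
    responses = [env.apply(amount) for amount in workload]
    return audit(workload, responses)


narrow = run(seed=1, max_amount=5)   # blind spot: this generator never produces 7
wide = run(seed=1, max_amount=10)    # a wider generator reaches the buggy input
```

The environment and auditor are identical in both runs; only the generator's input range differs, and that alone decides whether the bug is ever observed.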
| aphyr wrote:
| And honestly--designing a generator for a system like this is
| hard. Really hard. I struggled for weeks to get something
| that didn't just fail 99% of requests trivially, and it's an
| (ahem) giant pile of probabilistic hacks. So I wouldn't be
| too hard on the various TB test generators here!
|
| https://github.com/jepsen-
| io/tigerbeetle/blob/main/src/jepse...
| jorangreef wrote:
| (Note also that 90% of our deterministic simulation testing is
| done primarily by the VOPR, TigerBeetle's own deterministic
| simulator, which we built inhouse, and which runs on a fleet of
| 1,000 dedicated CPU cores 24/7. We also use Antithesis, but as
| a second layer of DST.)
|
| To understand why the query engine bug slipped through, see:
| https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
| ManBeardPc wrote:
| TigerBeetle is something I'm interested in. I see there is no C
| or Zig client listed in the clients documentation. Thought these
| would be the first ones to exist given it is written in Zig. Do
| they exist or maybe still WIP?
| 12_throw_away wrote:
| A small appreciation for the section entitled "Panic! At the Disk
| 0": <golf clap>
___________________________________________________________________
(page generated 2025-06-06 23:01 UTC)