[HN Gopher] Antithesis of a One-in-a-Million Bug: Taming Demonic...
       ___________________________________________________________________
        
       Antithesis of a One-in-a-Million Bug: Taming Demonic Nondeterminism
        
       Author : eatonphil
       Score  : 112 points
       Date   : 2024-03-21 18:48 UTC (1 day ago)
        
 (HTM) web link (www.cockroachlabs.com)
 (TXT) w3m dump (www.cockroachlabs.com)
        
       | Taikonerd wrote:
       | CockroachDB is a very logical place to use Antithesis -- its
       | original use case was for FoundationDB!
        
         | WarOnPrivacy wrote:
         | Ah it _is_ coding then. I was trying to work out what Demonic
         | Nondeterminism might be and got increasingly excited about the
         | possibilities. At the least - it would have included all the
         | demon psych/religious backstory that we know about. I'd
         | finally get some answers.
        
         | wwilson wrote:
         | To be clear -- many of us worked at FoundationDB, and we were
         | certainly inspired by the testing technology we had there. But
         | Antithesis is built from scratch, and WAY more general and WAY
         | more powerful than the FDB simulator.
        
       | notresidenter wrote:
       | Antithesis are doing a great job at marketing their product and
       | building hype. Their platform looks really interesting.
        
         | wwilson wrote:
         | Thanks! But just you wait... the _really_ insane stuff isn't
         | even public yet!
        
           | thimkerbell wrote:
           | Uh.
        
       | wwilson wrote:
       | Antithesis co-founder here -- happy to answer any questions about
       | the company, the technology, or this particular engagement!
        
         | mtyurt wrote:
         | What is your target market? Is it database companies, cloud
         | providers etc. that implement a proper distributed
         | architecture? If it is not limited to that, what would be the
         | value Antithesis can provide to a company that utilizes a
         | small-scale service oriented architecture that has its own
         | share of distributed complexity?
        
           | wwilson wrote:
           | Database vendors are very natural customers, because they're
           | distributed systems for whom correctness and uptime are
           | paramount. But they're far from our only customers! We have
           | tons of people who are doing exactly what you describe --
           | running a client-server architecture or a collection of
           | microservices that need to be fault-tolerant.
           | 
           | One of our theses is that the vast majority of real world
           | "distributed systems developers" would never describe
           | themselves that way and don't read the same blogs that all
           | the database people do. Nonetheless, these people are writing
           | distributed systems, and share the pain that we've all
           | experienced. One of the most important tasks ahead of us is
           | precisely to reach this vast "distributed systems dark
           | matter" and explain to them how it is that we can help their
           | day-to-day jobs.
        
         | abtinf wrote:
         | I lead a platform engineering team. We have a lot of challenges
         | that I think Antithesis could help with.
         | 
         | I'd like to try it. But...
         | 
         | I did not know the history of FoundationDB until last week --
         | specifically about the Apple acquisition and the resulting
         | termination of client support and downloads for 5 years.
         | 
         | So my question is, how can I trust the same founding team to
         | not do that again? It's one thing to get acquired by a company
         | that would likely continue to support business users
         | (Microsoft, IBM, AWS, even Oracle). It's another to sell to
         | acquirers that will likely shut down the public service
         | (Facebook, Apple, Google).
         | 
         | If I invest the time, money, and effort to adopt Antithesis, I
         | won't even have the security of "well, at least we downloaded
         | the packages before they went offline."
         | 
         | Maybe this is an unfair question. I think it's great that you
         | had a successful liquidity event with FoundationDB. Yet I must
         | manage risk.
        
           | wwilson wrote:
           | I will let Dave and Nick chime in here if they want to, since
           | they saw the Apple acquisition closer up than I did, but here
           | are my views:
           | 
           | (1) Your beliefs about what happened to FoundationDB's
           | customers are incorrect. Everybody who had a FoundationDB
           | license when we were acquired either got a free license to
           | the software in perpetuity at their current levels (the free
           | tier), or in the case of our paid customers, they were able
           | to continue using it on an even larger scale in the future.
           | None of our paying customers were screwed.
           | 
           | (2) We are not aiming for an acquisition or any other kind of
           | early exit. Our previous successes enable us to be a little
           | bit more risk-neutral this time.
           | 
           | (3) Even if we did vanish, what have you actually lost at
           | this point? If your operational database gets yanked (which,
           | I reemphasize, didn't actually happen to anybody), then
           | you're screwed. If your exotic software testing technology
           | gets yanked, then you're literally exactly where you are
           | today. Doesn't seem like as big a risk to me.
           | 
           | Apologies if my answers here were not diplomatic or whatever,
           | but I try to be very direct about this stuff.
           | 
           | EDIT:
           | 
           | Actually let me add a number (4) We have validated that
           | people are willing to pay for Antithesis, and we have a
           | growing business being built. To the extent we're an
           | attractive acquisition target, it's likely to be due to the
           | strength of our business and their desire to scale it out /
           | make it more widely available, and not simply to improve an
           | acquirer's internal tech stack.
           | 
           | BTW, one example of an FDB customer that was still a small
           | startup when we were acquired was Snowflake. They obviously
           | had no problems continuing to grow and use FDB, as they still
           | use FoundationDB as their core metadata storage today and
           | have since they started working with us.
        
             | abtinf wrote:
             | Thank you for explaining point 1. My brief googling of the
             | issue led me to find mostly loud, angry voices, none of
             | which mentioned these facts. This completely changes my
             | evaluation.
             | 
             | Regarding point 3, I agree that the software would be in a
             | higher quality state even if the testing framework
             | disappears. However, fully adopting this framework likely
             | means integrating and explaining it in our processes and
             | compliance documentation (my company provides services to
             | financial institutions).
             | 
             | I appreciate your directness. Thanks.
        
             | SloopJon wrote:
             | > If your exotic software testing technology gets yanked,
             | then you're literally exactly where you are today.
             | 
             | As someone testing concurrent/distributed software,
             | Antithesis is potentially useful to me. It would be a
             | substantial investment to build test infrastructure around
             | it, with a very big opportunity cost if it wasn't
             | successful, or disappeared after an acquisition. If this
             | exotic technology is more than a toy, I wouldn't be so
             | cavalier about its long term prospects.
        
               | geodel wrote:
               | Well, that's the price of implementing very new
               | technology. When risk appetite is lower, people can
               | always go with established large vendors.
        
               | wwilson wrote:
               | That's true. And it's also true that investing heavily
               | would take time, and the more time goes on the less
               | likely we are to be acquired or disappear (and the more
               | likely any acquisition is to be by somebody who wants to
               | resell this, as I note above). In that sense, you have a
               | lot of
               | optionality/convexity on this bet.
               | 
               | In the meantime, you can use it _right now_ to solve your
               | problems without a ton of integration. We have customers
               | at every level of the spectrum from "sending us
               | unmodified output of their CI system" to "deeply
               | integrating with our SDKs". If the former are seeing
               | value, you can too, and your risk is genuinely minimal.
        
         | costco wrote:
         | Your software sounds amazing and will probably make you a
         | fortune. I wanted to check my understanding of how Antithesis
         | fuzzes software. The way I understand it from reading the
         | documentation is that you create a "workload" which is sort of
         | analogous to a fuzzing harness that will typically make random
         | API calls, and Antithesis will pursue sequences of events that
         | are more interesting as defined by coverage and also inject
         | faults. So something like this could probably be pretty easily
         | adapted to a workload:
         | https://github.com/grpc/grpc/blob/86953f66948aaf49ecda56a0b9....
         | Do you use any interesting coverage metrics or just basic
         | blocks?
        
           | wwilson wrote:
           | Your understanding of our approach is pretty much correct. As
           | for interesting coverage metrics... stay tuned! We're going
           | to write about this a lot in the future!
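           | 
           | A minimal sketch of the "workload" idea described above,
           | assuming a hypothetical in-memory store (ToyKV) as the system
           | under test. It only shows seeded random API calls checked
           | against a reference model. It does not use the Antithesis
           | SDK, and it has no coverage guidance or fault injection.

```python
import random


class ToyKV:
    """Hypothetical system under test: a trivial in-memory key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def run_workload(seed, steps=200):
    """Drive random API calls from one seed and check them against a model.

    Returns the list of operations performed, so a failing run can be
    replayed exactly by reusing the same seed.
    """
    rng = random.Random(seed)  # all randomness flows from the seed
    store, model, history = ToyKV(), {}, []
    for _ in range(steps):
        op = rng.choice(["put", "get", "delete"])
        key = rng.choice("abcd")
        if op == "put":
            value = rng.randrange(100)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            # Property: reads must agree with the simple reference model.
            assert store.get(key) == model.get(key), (seed, history)
        history.append((op, key))
    return history


if __name__ == "__main__":
    # Same seed, same sequence of calls: the basis for reproducing a bug.
    assert run_workload(7) == run_workload(7)
    print(len(run_workload(7)), "operations replayed identically")
```

           | A real harness would replace ToyKV with client calls against
           | the actual service. The key design choice is that every
           | random decision comes from one seeded generator.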
        
         | quadrature wrote:
         | Are there plans to support the debugging of replays? It seems
         | like a really hard problem, I'm assuming that instrumenting
         | could change the outcome of the run.
         | 
         | Is there literature that is a good starting point on
         | determinism/non-determinism in computing? I'd like to
         | understand the sources of non-determinism better.
        
           | wwilson wrote:
           | Yes, we are working on integrated debugging technology (what
           | our customers mostly do these days is just run a conventional
           | debugger inside the simulation, but that doesn't quite use
           | our full power).
           | 
           | You're correct that tiny things like modifying the binary
           | under test or attaching a debugger can change determinism and
           | result in the bug slipping away. That's why... it's a good
           | thing we've already built autonomous bug-searching
           | technology? Since a deterministic hypervisor is also a
           | hypervisor, we can generally rewind and do the intrusive
           | debugging action at the "last possible moment", when it's
           | least likely to cause the bug to disappear. Then if it still
           | does, we simply fuzz onwards from that point and re-find the
           | bug. This usually goes pretty quickly because we're in a
           | timeline that's "close" to it, and we can use clues from the
           | original repro to guide the fuzzing.
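           | 
           | A toy sketch of the rewind-and-re-find idea above, not the
           | actual hypervisor mechanism. Everything here is an assumption
           | made for illustration: the "system" is a counter advanced by a
           | seeded generator, the "bug" is reaching BUG_STATE, and the
           | intrusive debugging action is modeled as losing the original
           | random stream after the snapshot.

```python
import random

BUG_STATE = 13  # hypothetical: the "bug" fires when the state equals this
MODULUS = 17


def run(state, seed, max_steps):
    """Advance deterministically from `state`; stop as soon as the bug fires.

    Returns every state visited, so any prefix doubles as a snapshot.
    """
    rng = random.Random(seed)  # the only source of randomness
    trace = [state]
    for _ in range(max_steps):
        state = (state + rng.choice([1, 2, 3])) % MODULUS
        trace.append(state)
        if state == BUG_STATE:
            break
    return trace


def find_bug(start, max_steps, seeds):
    """Fuzz from `start`: return (seed, trace) for the first seed hitting the bug."""
    for seed in seeds:
        trace = run(start, seed, max_steps)
        if trace[-1] == BUG_STATE:
            return seed, trace
    return None


def demo():
    # 1. Original repro, found by searching from the initial state.
    _, original = find_bug(start=0, max_steps=50, seeds=range(1000))
    # 2. Rewind to a snapshot two steps before the failure.
    snapshot = original[-3]
    # 3. The intrusive action invalidates the old random stream, so fuzz
    #    onward from the snapshot with fresh seeds and a tiny step budget.
    refound = find_bug(start=snapshot, max_steps=3, seeds=range(1000, 2000))
    return original, snapshot, refound


if __name__ == "__main__":
    original, snapshot, (seed, trace) = demo()
    print("original repro length:", len(original) - 1, "steps")
    print("re-found from snapshot", snapshot, "with seed", seed, "->", trace)
```

           | Because the search restarts close to the failure, a step
           | budget of three is enough here, whereas the original search
           | needed a much longer run.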
        
         | eslaught wrote:
         | I work on a distributed runtime system for heterogeneous
         | supercomputers [1].
         | 
         | As an example of the sort of bug we regularly deal with, I am
         | at this exact moment tracking down a freeze that occurs on
         | 8,192 nodes of a supercomputer [2]. That means I'm using about
         | 64,000 GPUs and about half a million CPU cores. The smallest
         | node count at which I've seen my issue is 2,048 nodes, and at
         | that scale it only happens about 10% of the time.
         | 
         | We've been debating internally whether Antithesis could help us
         | or not. On the one hand, the fuzzing to explore the state
         | space, and deterministic reproduction, are exactly what we
         | want. On the other hand, we believe our state space is much
         | larger than what you see in a typical distributed database.
         | (And not just because of the sheer scale of things, but even on
         | a single node we have state machines with on the order of
         | hundreds to thousands of states in them.) Based on the post
         | here and the "scenario" count explored in CockroachDB, I'm not
         | convinced you'd be able to handle us. :-)
         | 
         | I'd be curious what you think. Happy to discuss here, or
         | contact info in profile.
         | 
         | [1]: https://legion.stanford.edu/
         | 
         | [2]: https://www.olcf.ornl.gov/frontier/
        
           | wwilson wrote:
           | A drawback of our approach is absolutely that it is expensive
           | to test extremely large volumes of data or compute this way.
           | Even before you start running into physical limitations of
           | our current platform, you will probably be complaining about
           | your bills. :-)
           | 
           | Our advice on this is that there are actually a lot of things
           | you can do to exercise behaviors that are usually only seen
           | at massive scale. For example, if you run a distributed
           | storage system, you can probably configure it to split and
           | move shards at 1/1,000,000th of the production size. That
           | might let us hit a tricky codepath much more cheaply. We have
           | a lot more about this in our documentation, e.g. here:
           | https://antithesis.com/docs/best_practices/optimizing.html#k...
           | and here:
           | https://antithesis.com/docs/best_practices/find_more_bugs.ht...
           | 
           | The other thing is just that the reason many bugs only happen
           | at scale is that they're some kind of subtle distributed
           | race, and you need a lot of nodes for one of the runners in
           | the race to be slow enough that the other sometimes wins. But
           | we can very easily and efficiently provoke these sorts of
           | races by pausing individual threads or freezing nodes, etc.
           | We actually pretty regularly hit issues with tiny deployments
           | that our customers only see in their largest clusters (but no
           | promises, this obviously depends on the details of the
           | software).
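           | 
           | A small sketch of how a controlled scheduler can provoke a
           | race with only two tasks, no large cluster needed. This is
           | an illustration under assumptions, not Antithesis's
           | implementation: tasks are Python generators, each `yield` is
           | a point where the task may be paused, and the names
           | (increment, run_schedule, find_lost_update) are made up.

```python
import random


def increment(shared):
    """A non-atomic increment: read, possibly get paused, then write."""
    value = shared["counter"]      # read
    yield                          # scheduling point: task may be paused here
    shared["counter"] = value + 1  # write back a possibly stale value


def run_schedule(seed, n_tasks=2):
    """Run the tasks under a seeded scheduler; return the final counter."""
    rng = random.Random(seed)
    shared = {"counter": 0}
    tasks = [increment(shared) for _ in range(n_tasks)]
    while tasks:
        task = rng.choice(tasks)   # the scheduler, not the OS, picks who runs
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)     # task finished its write
    return shared["counter"]


def find_lost_update(max_seeds=1000):
    """Search schedules for one where an update is lost (counter != 2)."""
    for seed in range(max_seeds):
        if run_schedule(seed) != 2:
            return seed
    return None


if __name__ == "__main__":
    seed = find_lost_update()
    print("lost update with seed", seed, "-> counter =", run_schedule(seed))
    # The failing interleaving replays identically from the same seed.
    assert run_schedule(seed) == run_schedule(seed)
```

           | Pausing one task between its read and its write plays the
           | role of the "slow runner" that otherwise shows up only at
           | scale.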
        
         | farresito wrote:
         | What does the tech stack look like?
        
       | zitterbewegung wrote:
       | As time goes on the bugs get weirder in a software project ...
        
       | grumpycamel wrote:
       | Appendix is not visible
       | https://docs.google.com/document/d/1hA7bpfAMyyAYix0lelUZLI7W...
        
         | srosenberg wrote:
         | Sorry about that; should be fixed now.
        
       | 38 wrote:
       | Personal and professional Go developer here. I will never use or
       | recommend that my company use something called cockroach. Call me
       | petty, but fix the fucking name.
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:01 UTC)