[HN Gopher] Antithesis of a One-in-a-Million Bug: Taming Demonic...
___________________________________________________________________
Antithesis of a One-in-a-Million Bug: Taming Demonic Nondeterminism
Author : eatonphil
Score : 112 points
Date : 2024-03-21 18:48 UTC (1 days ago)
(HTM) web link (www.cockroachlabs.com)
(TXT) w3m dump (www.cockroachlabs.com)
| Taikonerd wrote:
| CockroachDB is a very logical place to use Antithesis -- its
| original use case was for FoundationDB!
| WarOnPrivacy wrote:
| Ah it _is_ coding then. I was trying to work out what Demonic
| Nondeterminism might be and got increasingly excited about the
| possibilities. At the least - it would have included all the
| demon psych/religious backstory that we know about. I'd
| finally get some answers.
| wwilson wrote:
| To be clear -- many of us worked at FoundationDB, and we were
| certainly inspired by the testing technology we had there. But
| Antithesis is built from scratch, and WAY more general and WAY
| more powerful than the FDB simulator.
| notresidenter wrote:
| Antithesis are doing a great job at marketing their product and
| building hype. Their platform looks really interesting.
| wwilson wrote:
| Thanks! But just you wait... the _really_ insane stuff isn't
| even public yet!
| thimkerbell wrote:
| Uh.
| wwilson wrote:
| Antithesis co-founder here -- happy to answer any questions about
| the company, the technology, or this particular engagement!
| mtyurt wrote:
| What is your target market? Is it database companies, cloud
| providers, etc. that implement a proper distributed
| architecture? If it is not limited to that, what would be the
| value Antithesis can provide to a company that utilizes a
| small-scale service oriented architecture that has its own
| share of distributed complexity?
| wwilson wrote:
| Database vendors are very natural customers, because they're
| distributed systems for whom correctness and uptime are
| paramount. But they're far from our only customers! We have
| tons of people who are doing exactly what you describe --
| running a client-server architecture or a collection of
| microservices that need to be fault-tolerant.
|
| One of our theses is that the vast majority of real world
| "distributed systems developers" would never describe
| themselves that way and don't read the same blogs that all
| the database people do. Nonetheless, these people are writing
| distributed systems, and share the pain that we've all
| experienced. One of the most important tasks ahead of us is
| precisely to reach this vast "distributed systems dark
| matter" and explain to them how it is that we can help their
| day-to-day jobs.
| abtinf wrote:
| I lead a platform engineering team. We have a lot of challenges
| that I think antithesis could help with.
|
| I'd like to try it. But...
|
| I did not know the history of foundation db until last week--
| specifically about the Apple acquisition and the resulting
| termination of client support and downloads for 5 years.
|
| So my question is, how can I trust the same founding team to
| not do that again? It's one thing to get acquired by a company
| that would likely continue to support business users
| (Microsoft, IBM, AWS, even Oracle). It's another to sell to
| acquirers that will likely shut down the public service
| (Facebook, Apple, Google).
|
| If I invest the time, money, and effort to adopt antithesis, I
| won't even have the security of "well, at least we downloaded
| the packages before they went offline."
|
| Maybe this is an unfair question. I think it's great that you
| had a successful liquidity event with foundation. Yet I must
| manage risk.
| wwilson wrote:
| I will let Dave and Nick chime in here if they want to, since
| they saw the Apple acquisition closer up than I did, but here
| are my views:
|
| (1) Your beliefs about what happened to FoundationDB's
| customers are incorrect. Everybody who had a FoundationDB
| license when we were acquired either got a free license to
| the software in perpetuity at their current levels (the free
| tier), or in the case of our paid customers, they were able
| to continue using it on an even larger scale in the future.
| None of our paying customers were screwed.
|
| (2) We are not aiming for an acquisition or any other kind of
| early exit. Our previous successes enable us to be a little
| bit more risk-neutral this time.
|
| (3) Even if we did vanish, what have you actually lost at
| this point? If your operational database gets yanked (which,
| I reemphasize, didn't actually happen to anybody), then
| you're screwed. If your exotic software testing technology
| gets yanked, then you're literally exactly where you are
| today. Doesn't seem like as big a risk to me.
|
| Apologies if my answers here were not diplomatic or whatever,
| but I try to be very direct about this stuff.
|
| EDIT:
|
| Actually let me add a number (4) We have validated that
| people are willing to pay for Antithesis, and we have a
| growing business being built. To the extent we're an
| attractive acquisition target, it's likely to be due to the
| strength of our business and their desire to scale it out /
| make it more widely available, and not simply to improve the
| internal tech stack of an acquirer.
|
| BTW, one example of an FDB customer that was still a small
| startup when we were acquired was Snowflake. They obviously
| had no problems continuing to grow and use FDB, as they still
| use FoundationDB as their core metadata storage today and
| have since they started working with us.
| abtinf wrote:
| Thank you for explaining point 1. My brief googling of the
| issue led me to find mostly loud, angry voices, none of
| which mentioned these facts. This completely changes my
| evaluation.
|
| Regarding point 3, I agree that the software would be in a
| higher quality state even if the testing framework
| disappears. However, fully adopting this framework likely
| means integrating and explaining it in our processes and
| compliance documentation (my company provides services to
| financial institutions).
|
| I appreciate your directness. Thanks.
| SloopJon wrote:
| > If your exotic software testing technology gets yanked,
| then you're literally exactly where you are today.
|
| As someone testing concurrent/distributed software,
| Antithesis is potentially useful to me. It would be a
| substantial investment to build test infrastructure around
| it, with a very big opportunity cost if it wasn't
| successful, or disappeared after an acquisition. If this
| exotic technology is more than a toy, I wouldn't be so
| cavalier about its long term prospects.
| geodel wrote:
| Well, that's the price of implementing very new
| technology. When risk appetite is lower, people can always
| go with established large vendors.
| wwilson wrote:
| That's true. And it's also true that investing heavily
| would take time, and the more time goes on the less
| likely we are to be acquired or disappear (or for the
| acquisition to be by somebody who wants to resell this as
| I note above). In that sense, you have a lot of
| optionality/convexity on this bet.
|
| In the meantime, you can use it _right now_ to solve your
| problems without a ton of integration. We have customers
| at every level of the spectrum from "sending us
| unmodified output of their CI system" to "deeply
| integrating with our SDKs". If the former are seeing
| value, you can too, and your risk is genuinely minimal.
| costco wrote:
| Your software sounds amazing and will probably make you a
| fortune. I wanted to check my understanding of how Antithesis
| fuzzes software. The way I understand it from reading the
| documentation is that you create a "workload" which is sort of
| analogous to a fuzzing harness that will typically make random
| API calls, and Antithesis will pursue sequences of events that
| are more interesting as defined by coverage and also inject
| faults. So something like this could probably be adapted to a
| workload pretty easily:
| https://github.com/grpc/grpc/blob/86953f66948aaf49ecda56a0b9....
| Do you use any interesting coverage metrics or just basic blocks?
| wwilson wrote:
| Your understanding of our approach is pretty much correct. As
| for interesting coverage metrics... stay tuned! We're going
| to write about this a lot in the future!
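The "workload" idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyKV` and `run_workload` are hypothetical names, the system under test is a toy in-memory store, and nothing here uses the Antithesis SDK or models its coverage guidance or fault injection. It shows just the harness part: seeded random API calls checked against a simple reference model.

```python
import random


class ToyKV:
    """Hypothetical system under test: a tiny in-memory key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def run_workload(seed, steps=200):
    """Make random API calls and compare every read against a model dict.

    Returns the number of operations executed; raises AssertionError if the
    store and the model ever disagree.
    """
    rng = random.Random(seed)  # all randomness comes from the seed
    store, model = ToyKV(), {}
    keys = ["a", "b", "c"]
    for step in range(steps):
        op = rng.choice(["put", "get", "delete"])
        key = rng.choice(keys)
        if op == "put":
            value = rng.randrange(1000)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            # The failure message carries what is needed to replay the run.
            assert store.get(key) == model.get(key), (seed, step, key)
    return steps
```

Because the seed fully determines the sequence of calls, a failing `(seed, step, key)` tuple is enough to replay the same run locally.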
| quadrature wrote:
| Are there plans to support the debugging of replays? It seems
| like a really hard problem, I'm assuming that instrumenting
| could change the outcome of the run.
|
| Is there literature that is a good starting point on
| determinism/non-determinism in computing? I'd like to
| understand the sources of non-determinism better.
| wwilson wrote:
| Yes, we are working on integrated debugging technology (what
| our customers mostly do these days is just run a conventional
| debugger inside the simulation, but that doesn't quite use
| our full power).
|
| You're correct that tiny things like modifying the binary
| under test or attaching a debugger can change determinism and
| result in the bug slipping away. That's why... it's a good
| thing we've already built autonomous bug-searching
| technology? Since a deterministic hypervisor is also a
| hypervisor, we can generally rewind and do the intrusive
| debugging action at the "last possible moment", when it's
| least likely to cause the bug to disappear. Then if it still
| does, we simply fuzz onwards from that point and re-find the
| bug. This usually goes pretty quickly because we're in a
| timeline that's "close" to it, and we can use clues from the
| original repro to guide the fuzzing.
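A toy model may help make the "rewind, perturb, then fuzz onward" idea above concrete. Everything in it is an assumption made for illustration: `Sim`, `steps_to_bug`, and `rewind_and_refind` are hypothetical names, the "bug" is an arbitrary pattern in a seeded random stream, and the "intrusive action" is simulated by consuming one extra random number. It does not describe how the Antithesis hypervisor works internally.

```python
import random


class Sim:
    """Toy deterministic simulation: every event comes from one seeded RNG."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.recent = []  # only the last three events, so snapshots stay small

    def step(self):
        self.recent = (self.recent + [self.rng.randrange(4)])[-3:]

    def bug_hit(self):
        # Hypothetical bug: fires when three consecutive events equal 3.
        return self.recent == [3, 3, 3]

    def snapshot(self):
        return self.rng.getstate(), list(self.recent)

    def restore(self, snap):
        self.rng.setstate(snap[0])
        self.recent = list(snap[1])


def steps_to_bug(sim, budget):
    """Step until the bug fires; return the step count, or None if it never does."""
    for n in range(1, budget + 1):
        sim.step()
        if sim.bug_hit():
            return n
    return None


def rewind_and_refind(seed, budget=5000, fuzz_tries=1000):
    # 1. Original repro: the same seed always reaches the bug at the same step.
    bug_step = steps_to_bug(Sim(seed), budget)
    if bug_step is None:
        return None, "no bug found within budget"
    # 2. Replay to two steps before the bug and take a snapshot there.
    sim = Sim(seed)
    for _ in range(bug_step - 2):
        sim.step()
    snap = sim.snapshot()
    # 3. Intrusive action (stand-in for attaching a debugger): it perturbs
    #    the timeline by consuming one random number.
    sim.rng.random()
    if steps_to_bug(sim, 2) is not None:
        return bug_step, "bug survived the perturbation"
    # 4. Otherwise rewind to the snapshot and try nearby timelines (new RNG
    #    seeds) with a short budget until the bug shows up again.
    for fuzz_seed in range(fuzz_tries):
        sim.restore(snap)
        sim.rng.seed(fuzz_seed)
        if steps_to_bug(sim, 2) is not None:
            return bug_step, f"bug re-found with fuzz seed {fuzz_seed}"
    return bug_step, "bug not re-found"
```

The search in step 4 is cheap because it starts from a state that is already two steps from the failure, which is the intuition behind being in a timeline that is "close" to the bug.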
| eslaught wrote:
| I work on a distributed runtime system for heterogeneous
| supercomputers [1].
|
| As an example of the sort of bug we regularly deal with, I am
| at this exact moment tracking down a freeze that occurs on
| 8,192 nodes of a supercomputer [2]. That means I'm using about
| 64,000 GPUs and about half a million CPU cores. The smallest
| node count at which I've seen my issue is 2,048 nodes, and at that scale
| it only happens about 10% of the time.
|
| We've been debating internally whether Antithesis could help us
| or not. On the one hand, the fuzzing to explore the state
| space, and deterministic reproduction, are exactly what we
| want. On the other hand, we believe our state space is much
| larger than what you see in a typical distributed database.
| (And not just because of the sheer scale of things, but even on
| a single node we have state machines with on the order of hundreds to
| thousands of states in them.) Based on the post here and the
| "scenario" count explored in CockroachDB, I'm not convinced you'd
| be able to handle us. :-)
|
| I'd be curious what you think. Happy to discuss here, or
| contact info in profile.
|
| [1]: https://legion.stanford.edu/
|
| [2]: https://www.olcf.ornl.gov/frontier/
| wwilson wrote:
| A drawback of our approach is absolutely that it is expensive
| to test extremely large volumes of data or compute this way.
| Even before you start running into physical limitations of
| our current platform, you will probably be complaining about
| your bills. :-)
|
| Our advice on this is that there are actually a lot of things
| you can do to exercise behaviors that are usually only seen
| at massive scale. For example, if you run a distributed
| storage system, you can probably configure it to split and
| move shards at 1/1,000,000th of the production size. That
| might let us hit a tricky codepath much more cheaply. We have
| a lot more about this in our documentation, e.g. here:
| https://antithesis.com/docs/best_practices/optimizing.html#k...
| and here:
| https://antithesis.com/docs/best_practices/find_more_bugs.ht...
|
| The other thing is just that the reason many bugs only happen
| at scale is that they're some kind of subtle distributed
| race, and you need a lot of nodes for one of the runners in
| the race to be slow enough that the other sometimes wins. But
| we can very easily and efficiently provoke these sorts of
| races by pausing individual threads or freezing nodes, etc.
| We actually pretty regularly hit issues with tiny deployments
| that our customers only see in their largest clusters (but no
| promises, this obviously depends on the details of the
| software).
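The point about provoking races by pausing a participant, rather than waiting for scale to slow one down, can be illustrated with a small cooperative scheduler. This is a sketch under assumptions: Python generators stand in for threads, `worker`, `run`, and `explore` are hypothetical names, and the explicit `yield` marks the only place a worker can be paused. It is not the mechanism Antithesis uses.

```python
from itertools import permutations


def worker(shared):
    """Non-atomic increment: read, pause point, then write."""
    value = shared["counter"]      # read
    yield                          # the scheduler may pause the worker here
    shared["counter"] = value + 1  # write back a possibly stale value


def run(schedule):
    """Run two workers, advancing them in the order given by `schedule`."""
    shared = {"counter": 0}
    workers = {0: worker(shared), 1: worker(shared)}
    for wid in schedule:
        next(workers[wid], None)   # advance one step; finished workers are ignored
    return shared["counter"]


def explore():
    """Try every distinct interleaving of two 2-step workers."""
    return {s: run(s) for s in sorted(set(permutations([0, 0, 1, 1])))}
```

With only two workers, `run([0, 0, 1, 1])` returns 2, while `run([0, 1, 0, 1])` pauses worker 0 between its read and its write and returns 1, a lost update. Controlling the schedule directly surfaces the race without needing many nodes for one of them to happen to be slow.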
| farresito wrote:
| What does the tech stack look like?
| zitterbewegung wrote:
| As time goes on the bugs get weirder in a software project ...
| grumpycamel wrote:
| Appendix is not visible
| https://docs.google.com/document/d/1hA7bpfAMyyAYix0lelUZLI7W...
| srosenberg wrote:
| Sorry about that; should be fixed now.
| 38 wrote:
| Personal and professional Go developer here. I will never use or
| recommend that my company use something called cockroach. Call me
| petty, but fix the fucking name.
___________________________________________________________________
(page generated 2024-03-22 23:01 UTC)