[HN Gopher] Jepsen: Jetcd 0.8.2
___________________________________________________________________
Jepsen: Jetcd 0.8.2
Author : aphyr
Score : 123 points
Date : 2024-08-08 14:09 UTC (8 hours ago)
(HTM) web link (jepsen.io)
(TXT) w3m dump (jepsen.io)
| marksomnian wrote:
| Interesting footnote:
|
| > In the 2022 engagement, the client's engineers were
| enthusiastic about the prospect of a public analysis, and Jepsen
| was allowed to file public issues against systems including etcd.
| Following the conclusion of the contract, Jepsen independently
| completed a written report discussing the behaviors we'd found in
| etcd. However, Jepsen was unable to secure official permission
| from the client's legal department to disclose that the client
| had funded part of the work. This created an unusual state of
| affairs: the issues, test suite, and reproduction instructions
| were all public, but per Jepsen's ethics policy, the analysis
| itself could not be published. Jepsen shelved that analysis and
| it remains unpublished. The present analysis is based on entirely
| new work and verifies a different software system: jetcd, rather
| than etcd.
| protosam wrote:
| Makes me wonder if the Go v3 client has the same problem. If
| yes, that would be a major problem for all the Kubernetes
| systems in production.
| mdaniel wrote:
| At the very real risk of "talk is cheap," my understanding is
| that is part of why Jepsen publishes the test suites (e.g.
| https://github.com/jepsen-io/etcd ) so it's not "take my word
| for it" but rather "lein run test-all" and watch the
| fireworks. So a sufficiently motivated actor, say, one of the
| deep-pocketed stewards of the Kubernetes project, could run
| the tests themselves.
|
| Between my indescribable hatred for etcd and my long-held
| lust for a pluggable KV backend
| (https://github.com/kubernetes/kubernetes/issues/1957 et al)
| it'd be awesome if any provable KV safety violations were
| finally the evidence required for them to take that request
| seriously.
| protosam wrote:
| Having looked at the test suite already, I know enough to
| know that I don't understand it well enough to be the one to
| do this. For that reason, I'm going to pull out the popcorn
| and see what happens over the next few weeks.
| jhgg wrote:
| I'm currently working on a Rust v3 client, and have been
| reading the Go v3 source code, and the code definitely is
| hard to follow so I would be unsurprised if there were issues
| lurking.
| tjungblut wrote:
| Could you be more specific about what's so hard to follow?
| It's quite literally just the implementation of the gRPC
| interface [1].
|
| [1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L3...
| silverlyra wrote:
| I was curious and dug into the Go client code. You linked
| to the definition of KV - the easiest way to create one
| is with NewKV [1], which internally creates a RetryKV [2]
| wrapper around the Client you give it.
|
| RetryKV implements the KV methods by delegating to the
| underlying client. But before it delegates an immutable
| request (e.g., range), it sets the request retry policy
| to _repeatable_ [3].
|
| Retries are implemented with a gRPC interceptor, which
| checks the retry policy when deciding whether a request
| should be retried [4].
|
| The Jepsen writeup says a client can retry a request when
| "the client can prove the first request could never
| execute, or that the request is idempotent". In my (cold)
| read of the code, the Go client stays within those
| bounds.
|
| For non-idempotent requests, the Go client only retries
| when it knows the request was never sent in the first
| place [5]. For idempotent requests, any response with
| gRPC status _unavailable_ will be retried [6].
|
| Unlike jetcd, the Go client's retry behavior is safe.
|
| [1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L9...
| [2] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
| [3] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in...
| [4] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in...
| [5] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
| [6] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
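
The retry rule described in the comment above can be sketched as a small decision function. This is a toy model in Python, not the Go client's code: `RetryPolicy`, `Status`, `Failure`, and `should_retry` are hypothetical names chosen for illustration, and the `never_sent` flag stands in for whatever evidence a real client uses to prove a request never left the process.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RetryPolicy(Enum):
    REPEATABLE = auto()      # idempotent requests, e.g. a range read
    NON_REPEATABLE = auto()  # requests that may change state, e.g. a put


class Status(Enum):
    OK = auto()
    UNAVAILABLE = auto()
    OTHER_ERROR = auto()


@dataclass(frozen=True)
class Failure:
    status: Status
    # Assumption: True only when the client can prove the request was
    # never sent (for example, no connection was ever established).
    never_sent: bool = False


def should_retry(policy: RetryPolicy, failure: Failure) -> bool:
    """Decide whether a failed request may be retried safely."""
    if policy is RetryPolicy.REPEATABLE:
        # Idempotent: repeating it cannot change the outcome, so any
        # "unavailable" response is retried.
        return failure.status is Status.UNAVAILABLE
    # Non-idempotent: retry only if the first attempt provably never ran.
    return failure.never_sent
```

Under this model, a mutating request that fails with an ambiguous "unavailable" error is surfaced to the caller rather than retried, because the first attempt may already have been applied.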
| mdaniel wrote:
| > No one followed up on the jetcd issue, and it was
| automatically closed as stale.
|
| Another excellent outcome from those GH automations
| protosam wrote:
| Your posts are something I have in my bookmarks and reference
| regularly as I continue to build my own distributed data system.
| Thanks for continuing to test and report on these issues. These
| posts have clarified a lot of details about the consistency
| guarantees of these systems that I really couldn't discern from
| their own documentation. That knowledge is invaluable, given how
| readily developers trust the systems they consume to be correct.
| mjb wrote:
| The first bug is a great reminder that even strict
| serializability doesn't imply idempotency. If you're doing non-
| idempotent operations like unconditional writes, you've got to
| think very carefully before you add any retries to a system. Even
| with conditional writes, you need to think carefully about ABA
| bugs.
|
| Both of these bugs are a great reminder that distributed system
| behavior includes clients. From the application's perspective,
| a bug like this introduced by the client is practically no
| different from one introduced by the server: the same badness
| happens. A database needs to consider its properties end-to-end,
| from the application API.
|
| It's also a great reminder that APIs that make it hard for
| clients to do the right thing will likely lead to bugs like this.
| Failures happen, and a good API needs to be designed in a way
| that allows the client to do something sensible following a
| failure. A great API makes it easy for a client to do something
| sensible, and hard for a client to do the wrong thing. Perhaps my
| favorite non-distributed example of an API that gets this wrong
| is AES-GCM, the ubiquitous AEAD crypto primitive: one tiny bug
| (reusing an IV) completely blows up the whole scheme.
|
| And, as always, this is great stuff from Kyle. His Jepsen work
| has been moving the industry forward for years, and it's great to
| see him continue it (and continue to put the effort into writing
| up his findings so clearly).
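
The two retry hazards raised in the comment above (blindly retrying a non-idempotent write, and an ABA problem with a conditional write) can be reproduced against a toy in-memory store. This is only a sketch: `Store`, `LostAck`, and `with_naive_retry` are hypothetical names, the store is a single Python object rather than etcd, and the "lost acknowledgement" is simulated with a flag.

```python
class LostAck(Exception):
    """The store applied the request, but the reply never arrived."""


class Store:
    """A toy single-node key-value store (not etcd)."""

    def __init__(self):
        self.data = {}
        self.drop_next_ack = False

    def _ack(self):
        # Simulate losing the reply after the write has been applied.
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise LostAck()

    def append(self, key, item):
        # Non-idempotent: applying it twice changes the result.
        self.data.setdefault(key, []).append(item)
        self._ack()

    def put(self, key, value):
        self.data[key] = value
        self._ack()

    def compare_and_set(self, key, expected, new):
        ok = self.data.get(key) == expected
        if ok:
            self.data[key] = new
        self._ack()
        return ok


def with_naive_retry(op, *args):
    """Retry once on a lost ack, without knowing whether op was applied."""
    try:
        return op(*args)
    except LostAck:
        return op(*args)


# Scenario 1: an unconditional, non-idempotent write is retried.
s1 = Store()
s1.drop_next_ack = True
with_naive_retry(s1.append, "log", "x")
duplicated = s1.data["log"]  # ["x", "x"]: the append was applied twice

# Scenario 2: ABA with a conditional write.
s2 = Store()
s2.put("k", "A")
s2.drop_next_ack = True
try:
    s2.compare_and_set("k", "A", "B")  # applied, but the ack is lost
except LostAck:
    pass
s2.put("k", "A")  # meanwhile, another writer restores the old value
retry_result = s2.compare_and_set("k", "A", "B")  # the retry succeeds too
```

In the second scenario the guard `expected == "A"` holds on both attempts, so the conditional write is applied twice and silently overwrites the other writer's update.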
| aphyr wrote:
| > strict serializability doesn't imply idempotency
|
| I think we're probably getting at the same thing, but I do want
| to clarify a bit. A Strict Serializable history, like a
| Serializable one, requires equivalence to a total order of
| transactions. That's clearly not true for etcd+jetcd: no
| possible order of transactions can allow (e.g.) a transaction
| to read from its own future. It's totally fine to submit non-
| idempotent transactions against a Serializable system: systems
| which actually provide Serializable will execute known-
| committed transactions exactly once.
|
| Plenty of other databases pass this test; etcd+jetcd does not.
| This system is simply not Serializable.
| mjb wrote:
| Maybe what I should have said is "you can't just retry
| transactions against a strict serializable database and
| expect to still get strict serializability (from the
| application's perspective)". This is true of distributed
| system APIs more generally, too.
| aphyr wrote:
| Yeah, that's a good way of phrasing it! :-)
| mikemitchelldev wrote:
| What are the minimum resources you'd need to run similar types of
| tests at the scale that Jepsen does?
| aphyr wrote:
| For this test, you can do it on pretty much any reasonable
| Linux machine. Longer histories can churn through more CPU and
| RAM--some of the more aggressive tests I ran for this work
| involved 20 GB heaps and 50 cores--but you can tune all that
| lower.
___________________________________________________________________
(page generated 2024-08-08 23:01 UTC)