[HN Gopher] Jepsen: Jetcd 0.8.2
       ___________________________________________________________________
        
       Jepsen: Jetcd 0.8.2
        
       Author : aphyr
       Score  : 123 points
       Date   : 2024-08-08 14:09 UTC (8 hours ago)
        
 (HTM) web link (jepsen.io)
 (TXT) w3m dump (jepsen.io)
        
       | marksomnian wrote:
       | Interesting footnote:
       | 
       | > In the 2022 engagement, the client's engineers were
       | enthusiastic about the prospect of a public analysis, and Jepsen
       | was allowed to file public issues against systems including etcd.
       | Following the conclusion of the contract, Jepsen independently
       | completed a written report discussing the behaviors we'd found in
       | etcd. However, Jepsen was unable to secure official permission
       | from the client's legal department to disclose that the client
       | had funded part of the work. This created an unusual state of
       | affairs: the issues, test suite, and reproduction instructions
       | were all public, but per Jepsen's ethics policy, the analysis
       | itself could not be published. Jepsen shelved that analysis and
       | it remains unpublished. The present analysis is based on entirely
       | new work and verifies a different software system: jetcd, rather
       | than etcd
        
         | protosam wrote:
         | Makes me wonder if the Go v3 client has the same problem. If
         | yes, that would be a major problem for all the Kubernetes
         | systems in production.
        
           | mdaniel wrote:
           | At the very real risk of "talk is cheap," my understanding is
           | that is part of why Jepsen publishes the test suites (e.g.
           | https://github.com/jepsen-io/etcd ) so it's not "take my word
           | for it" but rather "lein run test-all" and watch the
           | fireworks. So a sufficiently motivated actor, say one of the
           | deep-pocketed stewards of the Kubernetes project, could run
           | the tests themselves.
           | 
           | Between my indescribable hatred for etcd and my long-held
           | lust for a pluggable KV backend
           | (https://github.com/kubernetes/kubernetes/issues/1957 et al)
           | it'd be awesome if any provable KV safety violations were
           | finally the evidence required for them to take that request
           | seriously
        
             | protosam wrote:
             | Having looked at the test suite already, I know enough to
             | know that I don't understand it well enough to be the one
             | to do this. For that reason, I'm personally going to pull
             | out the popcorn and see what happens over the next few
             | weeks.
        
           | jhgg wrote:
           | I'm currently working on a Rust v3 client, and have been
           | reading the Go v3 source code, and the code definitely is
           | hard to follow so I would be unsurprised if there were issues
           | lurking.
        
             | tjungblut wrote:
             | Could you be more specific about what's so hard to follow?
             | It's quite literally just the implementation of the gRPC
             | interface [1].
             | 
             | [1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L3...
        
               | silverlyra wrote:
               | I was curious and dug into the Go client code. You linked
               | to the definition of KV - the easiest way to create one
               | is with NewKV [1], which internally creates a RetryKV [2]
               | wrapper around the Client you give it.
               | 
               | RetryKV implements the KV methods by delegating to the
               | underlying client. But before it delegates an immutable
               | request (e.g., range), it sets the request retry policy
               | to _repeatable_ [3].
               | 
               | Retries are implemented with a gRPC interceptor, which
               | checks the retry policy when deciding whether a request
               | should be retried [4].
               | 
               | The Jepsen writeup says a client can retry a request when
               | "the client can prove the first request could never
               | execute, or that the request is idempotent". In my (cold)
               | read of the code, the Go client stays within those
               | bounds.
               | 
               | For non-idempotent requests, the Go client only retries
               | when it knows the request was never sent in the first
               | place [5]. For idempotent requests, any response with
               | gRPC status _unavailable_ will be retried [6].
               | 
               | Unlike jetcd, the Go client's retry behavior is safe.
               | 
               | [1] https://github.com/etcd-io/etcd/blob/main/client/v3/kv.go#L9...
               | [2] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
               | [3] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in...
               | [4] https://github.com/etcd-io/etcd/blob/main/client/v3/retry_in...
               | [5] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
               | [6] https://github.com/etcd-io/etcd/blob/main/client/v3/retry.go...
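               | 
               | A minimal sketch of the decision rule described above,
               | written in Python for brevity rather than Go. The names
               | RetryPolicy, RpcError, request_sent, and should_retry are
               | hypothetical and are not identifiers from the etcd
               | client; the real interceptor may well check more than
               | this.

```python
from dataclasses import dataclass
from enum import Enum


class RetryPolicy(Enum):
    REPEATABLE = "repeatable"          # immutable requests such as range
    NON_REPEATABLE = "non_repeatable"  # writes, transactions, and so on


@dataclass(frozen=True)
class RpcError:
    status: str         # gRPC status name, e.g. "UNAVAILABLE"
    request_sent: bool  # hypothetical flag: did the request leave the client?


def should_retry(policy: RetryPolicy, err: RpcError) -> bool:
    """Retry only when a second execution cannot change the outcome."""
    if policy is RetryPolicy.REPEATABLE:
        # Idempotent request: running it twice is harmless.
        return err.status == "UNAVAILABLE"
    # Non-idempotent request: retry only if it provably never went out.
    return not err.request_sent
```

               | With this rule, a write whose reply was lost
               | (request_sent=True) is never retried, which is the
               | property the Jepsen writeup asks for.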
        
         | mdaniel wrote:
         | > No one followed up on the jetcd issue, and it was
         | automatically closed as stale.
         | 
         | Another excellent outcome from those GH automations
        
       | protosam wrote:
       | Your posts are something I have in my bookmarks and reference
       | regularly as I continue to build my own distributed data system.
       | Thanks for continuing to test and report on these issues. These
       | posts have clarified a lot of details about the consistency
       | guarantees of these systems that I really couldn't discern from
       | their own documentation. That knowledge is invaluable given how
       | readily developers trust the systems they consume to be
       | correct.
        
       | mjb wrote:
       | The first bug is a great reminder that even strict
       | serializability doesn't imply idempotency. If you're doing non-
       | idempotent operations like unconditional writes, you've got to
       | think very carefully before you add any retries to a system. Even
       | with conditional writes, you need to think carefully about ABA
       | bugs.
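       | 
       | A toy illustration of both hazards, assuming a reply that is
       | lost after the server has already applied the request. Store,
       | ResponseLost, call_with_lost_reply, and naive_retry are made up
       | for this sketch and are not etcd or jetcd APIs.

```python
class ResponseLost(Exception):
    """The request reached the server, but the reply never came back."""


class Store:
    """Toy single-node key-value store; not etcd's API."""

    def __init__(self):
        self.data = {}

    def append(self, key, value):
        # Unconditional, non-idempotent write.
        self.data.setdefault(key, []).append(value)

    def cas(self, key, expected, new):
        # Conditional write: succeeds only if the current value matches.
        if self.data.get(key) == expected:
            self.data[key] = new
            return True
        return False


def call_with_lost_reply(op, *args):
    """Apply the operation, then pretend the network dropped the response."""
    op(*args)
    raise ResponseLost()


def naive_retry(op, *args, interleave=None):
    """Retry once on a lost reply, without knowing whether the first try ran."""
    try:
        return call_with_lost_reply(op, *args)
    except ResponseLost:
        if interleave is not None:
            interleave()  # another client acts between the two attempts
        return op(*args)


store = Store()

# 1. Unconditional write: the retry applies the append a second time.
naive_retry(store.append, "log", "x")
print(store.data["log"])  # ['x', 'x']

# 2. ABA: first try sets A -> B, another client sets B -> A, and the
#    retry's condition passes again, so the write is applied twice.
store.data["flag"] = "A"
naive_retry(store.cas, "flag", "A", "B",
            interleave=lambda: store.cas("flag", "B", "A"))
print(store.data["flag"])  # B
```

       | In the second case the other client's update is silently
       | overwritten, even though every individual write was conditional.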
       | 
       | Both of these bugs are a great reminder that distributed system
       | behavior includes clients. From the application's perspective, a
       | bug introduced by the client is practically no different from
       | one introduced by the server - the same badness happens. A
       | database needs to consider its properties end-to-end from the
       | application API.
       | 
       | It's also a great reminder that APIs that make it hard for
       | clients to do the right thing will likely lead to bugs like this.
       | Failures happen, and a good API needs to be designed in a way
       | that allows the client to do something sensible following a
       | failure. A great API makes it easy for a client to do something
       | sensible, and hard for a client to do the wrong thing. Perhaps my
       | favorite non-distributed example of an API that falls short here
       | is AES-GCM, the ubiquitous AEAD crypto primitive: one tiny bug
       | (reusing an IV) completely blows up the whole scheme.
       | 
       | And, as always, this is great stuff from Kyle. His Jepsen work
       | has been moving the industry forward for years, and it's great to
       | see him continue it (and continue to put the effort into writing
       | up his findings so clearly).
        
         | aphyr wrote:
         | > strict serializability doesn't imply idempotency
         | 
         | I think we're probably getting at the same thing, but I do want
         | to clarify a bit. A Strict Serializable history, like a
         | Serializable one, requires equivalence to a total order of
         | transactions. That's clearly not true for etcd+jetcd: no
         | possible order of transactions can allow (e.g.) a transaction
         | to read from its own future. It's totally fine to submit non-
         | idempotent transactions against a Serializable system: systems
         | which actually provide Serializable will execute known-
         | committed transactions exactly once.
         | 
         | Plenty of other databases pass this test; etcd+jetcd does not.
         | This system is simply not Serializable.
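         | 
         | To make "equivalence to a total order" concrete, here is a
         | brute-force sketch for tiny list-append histories. It assumes
         | every transaction is known to have committed and that appended
         | values are unique, and it ignores real-time order, so it checks
         | Serializability only. The replay and serializable helpers are
         | illustrative and are not part of Jepsen's actual checker.

```python
from itertools import permutations


def replay(order):
    """Run transactions one after another; True if every read matches."""
    state = {}
    for txn in order:
        for op, key, value in txn:
            if op == "append":
                state.setdefault(key, []).append(value)
            elif op == "read" and state.get(key, []) != value:
                return False
    return True


def serializable(txns):
    """Brute force: does any total order of the transactions explain all reads?"""
    return any(replay(order) for order in permutations(txns))


# T1 appends 1 to x; T2 reads x and sees [1]. The order T1, T2 explains this.
ok = [
    [("append", "x", 1)],
    [("read", "x", [1])],
]

# T reads x as [1] and only afterwards appends 1 itself. Values are unique,
# so the read observed T's own later write: no order can explain it.
bad = [
    [("read", "x", [1]), ("append", "x", 1)],
]

print(serializable(ok))   # True
print(serializable(bad))  # False
```

         | The second history is the "read from its own future" shape:
         | a retried transaction whose first attempt had already applied.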
        
           | mjb wrote:
           | Maybe what I should have said is "you can't just retry
           | transactions against a strict serializable database and
           | expect to still get strict serializability (from the
           | application's perspective)". This is true of distributed
           | system APIs more generally, too.
        
             | aphyr wrote:
             | Yeah, that's a good way of phrasing it! :-)
        
       | mikemitchelldev wrote:
       | What are the minimum resources you'd need to run similar types of
       | tests at the scale that Jepsen does?
        
         | aphyr wrote:
         | For this test, you can do it on pretty much any reasonable
         | Linux machine. Longer histories can churn through more CPU and
         | RAM--some of the more aggressive tests I ran for this work
         | involved 20 GB heaps and 50 cores--but you can tune all that
         | lower.
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:01 UTC)