[HN Gopher] Leader Election with S3 Conditional Writes
       ___________________________________________________________________
        
       Leader Election with S3 Conditional Writes
        
       Author : gunnarmorling
       Score  : 103 points
       Date   : 2024-08-26 13:54 UTC (4 days ago)
        
 (HTM) web link (www.morling.dev)
 (TXT) w3m dump (www.morling.dev)
        
       | mprime1 wrote:
        | Leader election and distributed locking reduce to the same
        | problem... which is proven to be impossible. That means some
        | edge case will eventually fail on you; is your system handling
        | those cases?
       | 
       | I didn't read past this:
       | 
       | > Systems like Apache ZooKeeper or Postgres (via Advisory Locks)
       | provide the required building blocks for this
       | 
        | ZooKeeper is the original sin: convincing a whole generation of
        | programmers that distributed locks are a feasible solution.
       | 
       | This is my biggest pet peeve in distributed systems.
       | 
       | ----
       | 
       | And if you don't believe me, maybe you'll trust Kyle K. of Jepsen
       | fame:
       | 
       | > However, perfect failure detectors are impossible in
       | asynchronous networks.
       | 
       | Links to:
       | https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-...
       | 
       | https://jepsen.io/analyses/datomic-pro-1.0.7075
        
         | realaleris149 wrote:
         | Reliable network communication is also proven to be impossible
          | [1], yet it happens all the time. Yes, sometimes it fails, but
          | it still "works".
         | 
         | [1] https://en.wikipedia.org/wiki/Two_Generals%27_Problem
        
           | EGreg wrote:
           | I literally had this argument with David Schwartz of Ripple
           | about our lack of a consensus protocol.
        
           | akira2501 wrote:
           | > sometime it fails
           | 
           | That failures are a possibility does not concern me. That the
           | failures are not fully characterized does.
        
         | withinboredom wrote:
         | For me, I almost stopped reading at the assertion that clock
         | drift doesn't matter. They clearly didn't think through the
          | constant fight that would occur over who the leader actually
          | was and just hand-waved it away as 'not an issue.' They need to
         | remove time from their equation completely if they want clock
         | drift to not matter.
        
           | pluto_modadic wrote:
           | part of me thinks that clock drift would be reliably biased
           | toward a particular node, not leapfrogging between two nodes.
        
             | sillysaurusx wrote:
             | It's deeper than that. See the paper on time in distributed
             | systems by Lamport:
             | https://lamport.azurewebsites.net/pubs/time-clocks.pdf
             | 
             | There's a relativity-like issue where it's impossible to
             | have a globally consistent view of time.
             | 
             | See IR2 for how to synchronize physical time in a
             | distributed system.
             | 
             | "(a) If Pg sends a message m at physical time t, then m
             | contains a timestamp Tm= C/(t). (b) Upon receiving a
             | message m at time t', process P/ sets C/(t') equal to
             | maximum (Cj(t' - 0), Tm + /Zm)."
             | 
             | I _think_ the formula is saying that the clocks will only
             | ever increase (i.e. drift upwards). If so, then you could
             | imagine two processes leapfrogging if one of them sends a
             | message that bumps the other's clock, then that one sends a
             | message back that bumps the first.
             | 
             | But I'm curious how it behaves if one of the clocks is
             | running faster, e.g. a satellite has a physically different
             | sense of time than an observer on the ground.
             | 
             | Also note the paper claims you can't get rid of clocks if
             | you want to totally order the events.
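              | 
              | To make the leapfrogging concrete, here's a toy sketch of
              | two processes applying the receive rule above (not from
              | the paper; the clock rates and minimum delay mu are made-
              | up numbers):
              | 
              |     mu = 1.0   # assumed minimum message delay
              | 
              |     class Proc:
              |         def __init__(self, rate):
              |             self.c = 0.0      # local clock value
              |             self.rate = rate  # how fast it ticks
              |         def tick(self):
              |             self.c += self.rate  # only increases
              |         def send(self):
              |             return self.c        # timestamp Tm
              |         def recv(self, tm):
              |             # IR2(b): never set the clock backwards
              |             self.c = max(self.c, tm + mu)
              | 
              |     a, b = Proc(1.0), Proc(1.3)  # b runs fast
              |     for _ in range(3):
              |         a.tick(); b.tick()
              |         a.recv(b.send())  # b drags a forward
              |         b.recv(a.send())
              |         print(round(a.c, 1), round(b.c, 1))
              | 
              | Each receive bumps the receiver ahead of the sender, so
              | the two clocks keep jumping past each other while only
              | ever moving forward.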
        
               | rhaen wrote:
                | It's a truly fantastic paper, but, as with many things,
                | the fact that it's impossible to have a _perfectly_
                | consistent global view of time doesn't mean it's not
                | possible to have a near-perfect one. The blog mentioned
                | AWS's Time Sync Service, which lets you synchronize to
                | within microseconds (https://aws.amazon.com/about-
                | aws/whats-new/2023/11/amazon-ti...), and Google's
                | TrueTime, which gives ranges of times that you're
                | guaranteed to be within
                | (https://cloud.google.com/spanner/docs/true-time-
                | external-con...).
        
               | withinboredom wrote:
               | It's near-perfect ... until it isn't and everything goes
               | to shit in a cascading wtafc (what-the-actual-fuck-
               | catastrophe).
               | 
                | 1ms is an extremely long time in computer land, where
                | decisions are made in nanoseconds. There are 1,000,000
                | nanoseconds per millisecond.
        
               | withinboredom wrote:
               | > Also note the paper claims you can't get rid of clocks
               | if you want to totally order the events.
               | 
               | The order of events isn't important here. We only care
               | about the 'last event' which everyone can agree on; they
               | just can't agree on when it happened.
               | 
               | In other words, they can all agree that there is a
               | leader; we just don't know if that leader is still alive
               | or not. The best thing to do is simply make the leader
               | election deterministically random:
               | 
                | 1. 'Seed' the file with a 'no leader' placeholder.
               | 
                | 2. At randomly short intervals (max latency you want to
               | deal with, so like once every few seconds or so), try to
               | claim the file by using the hash of the current file as
               | your etag. The winner is the leader.
               | 
               | 3. Once there is a leader and you are the leader, every N
               | seconds, update the file with a new number, using the
               | hash of the current file as the etag.
               | 
               | 4. If you are not the leader, every N*(random jitter)*C
               | (where C>N*2, adjust for latency), attempt to take the
               | file using the same method above. If you fail, retry
               | again in N*(random jitter)*C.
               | 
               | 5. If the leader is taken, you are not the leader until
               | you've held leadership for at least N seconds.
               | 
               | This doesn't necessarily remove the clock. However, most
               | -- if not all -- modern clocks will agree that the length
               | of a second is within a few nanoseconds of any other
               | clock, regardless of how accurate its total count of
               | seconds is since the epoch. It also probably doesn't
               | work, but it probably isn't _that_ far away from one that
               | does.
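                | 
                | A minimal, hedged sketch of the two conditional writes
                | that steps 1-3 rely on, using boto3 (bucket and key
                | names are placeholders). If-None-Match="*" is the
                | conditional write the article is about; the If-Match
                | (ETag compare-and-swap) precondition on PutObject is an
                | assumption here, so verify your S3 region and SDK
                | support it before building on this:
                | 
                |     import boto3
                |     from botocore.exceptions import ClientError
                | 
                |     s3 = boto3.client("s3")
                |     BUCKET, KEY = "my-bucket", "leader"  # placeholders
                | 
                |     def seed():
                |         # Step 1: create 'no leader' file if absent.
                |         try:
                |             s3.put_object(Bucket=BUCKET, Key=KEY,
                |                           Body=b"no leader",
                |                           IfNoneMatch="*")
                |         except ClientError:
                |             pass  # someone else already seeded it
                | 
                |     def try_claim(my_id):
                |         # Steps 2/3: CAS on the current ETag.
                |         try:
                |             etag = s3.head_object(Bucket=BUCKET,
                |                                   Key=KEY)["ETag"]
                |             s3.put_object(Bucket=BUCKET, Key=KEY,
                |                           Body=my_id.encode(),
                |                           IfMatch=etag)
                |             return True   # write landed: leader
                |         except ClientError:
                |             return False  # 412: lost the race
                | 
                | The retry/backoff policy in steps 4 and 5 is left out on
                | purpose; as noted above, the scheme as written probably
                | still needs work around takeover timing.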
        
         | orf wrote:
         | > ...which is proven to be impossible
         | 
         | For some definition of impossible, given that many systems
         | utilise them effectively. Not all corner cases or theoretical
         | failure modes are relevant to everyone.
        
           | lijok wrote:
           | Many systems are buggy, periodically running into corner
           | cases that are borderline impossible to debug.
        
             | ahoka wrote:
             | Sometimes it's enough to detect these cases and reset the
             | system to its stable state.
             | 
              | For example, bicycle wheels are bistable systems, but they
              | usually stay in their useful state, so it does not matter
              | in practice.
        
             | Spivak wrote:
             | Yes, and yet those systems still work and deliver real
              | value all day, every day. If every company's Rollbar I've
              | ever seen is any measure, good software can have millions
              | of faults and still work for users.
        
         | everforward wrote:
         | > Convincing a whole generation of programmers that distributed
          | locks are a feasible solution.
         | 
         | I too hate this. Not just because the edge cases exist, but
         | also because of the related property: it makes the system very
         | hard to reason about.
         | 
         | Questions that should be simple become complicated. What
         | happens when the distributed locking system is down? What
         | happens when we reboot all the nodes at once? What if they
         | don't come down at exactly the same time and there's leader
         | churn for like 2 minutes? Etc, etc.
         | 
         | Those questions should be fairly simple, but become something
         | where a senior dev is having to trace codepaths and draw on a
         | whiteboard to figure it out. It's not even enough to understand
          | how a single node works in depth; they have to figure out not
          | only how this node works but also how this node's state might
          | impact another node's.
         | 
         | All of this is much simpler in leaderless systems (where the
         | leader system is replaced with idempotency or a scheduler or
         | something else).
         | 
         | I very strongly prefer avoiding leader systems; it's a method
         | of last resort when literally nothing else will work. I would
         | much rather scale a SQL database to support the queries for
         | idempotency than deal with a leader system.
         | 
         | I've never seen an idempotent system switch to a leader system,
         | but I've sure seen the reverse a few times.
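          | 
          | To make the idempotency alternative concrete, here's a
          | generic sketch (made-up table and names; sqlite3 stands in
          | for whatever SQL database you'd actually scale): dedupe work
          | behind a unique key instead of electing a leader.
          | 
          |     import sqlite3
          | 
          |     db = sqlite3.connect("jobs.db")
          |     db.execute("""CREATE TABLE IF NOT EXISTS processed (
          |                       idempotency_key TEXT PRIMARY KEY)""")
          | 
          |     def run_once(key, work):
          |         # Any node may call this; work runs at most once.
          |         sql = "INSERT INTO processed VALUES (?)"
          |         try:
          |             with db:  # one transaction: insert + work
          |                 db.execute(sql, (key,))
          |                 work()  # only the inserter gets here
          |             return True
          |         except sqlite3.IntegrityError:
          |             return False  # someone already did this one
          | 
          | Scaling that insert is usually a much better-understood
          | problem than reasoning about leader churn.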
        
           | anothername12 wrote:
           | >> Convincing a whole generation of programmers that
            | distributed locks are a feasible solution.
           | 
           | > I too hate this. Not just because the edge cases exist, but
           | also because of the related property: it makes the system
           | very hard to reason about.
           | 
           | I think this is a huge problem with the way we're developing
           | software now. Distributed systems are extremely difficult for
            | a lot of reasons, yet they're often our first choice when
            | developing even small systems!
           | 
           | At $COMPANY we have hundreds of lambdas, DocumentDB (btw,
           | that is hell in case you're considering it) and other cloud
            | storage and queuing components. On-call and bug hunts
            | basically become quests to find some corner-case race
            | condition, timing problem, read-after-write assumption, etc.
           | 
           | I'm ashamed to say, we have reads wrapped in retry loops
           | everywhere.
           | 
            | The whole thing could have been a Rails app with a fraction
            | of the team size, a massive increase in reliability, and an
            | easier time reasoning about it and delivering features.
           | 
           | You could say we're doing it wrong, and you'd probably be
           | partly right for sure, but I've done consulting for a decade
           | at dozens of other places and it always seems like this.
        
             | icedchai wrote:
             | I see the same. All this complexity to handle a few
             | requests/second... but at least we can say it's "cloud
             | native."
        
             | giovannibonetti wrote:
             | Same thing where I work now. Many experienced developers
             | waste a huge chunk of their time trying to wrap their heads
              | around their Django microservices' communication patterns
             | and edge cases. Much more complex than an equivalent Rails
             | monolith, even though Ruby and Rails both have their issues
             | and could be replaced by more modern tech in 2024.
        
             | everforward wrote:
             | > You could say we're doing it wrong, and you'd probably be
             | partly right for sure, but I've done consulting for a
             | decade at dozens of other places and it always seems like
             | this.
             | 
             | The older I get, the more I think this is a result of
             | Conway's law and that a lot of this architectural cruft
             | stems from designing systems around communication
             | boundaries rather than things that make technical sense.
             | 
             | Monolithic apps like Rails only happen under a single team
             | or teams that are so tightly coupled people wonder whether
             | they should just merge.
             | 
             | Distributed apps are very loosely coupled, so it's what you
             | would expect to get from two teams that are far apart on
             | the org chart.
             | 
             | Anecdotally, it mirrors what I've seen in practice. Closely
             | related teams trust each other and are willing to make a
             | monolith under an assumption that their partner team won't
             | make it a mess. Distantly related teams play games around
             | ensuring that their portion is loosely coupled enough that
             | it can have its own due dates, reliability, etc.
             | 
             | Queues are the king of distantly coupled systems. A team's
             | part of a queue-based app can be declared "done" before the
             | rest of it is even stood up. "We're dumping stuff into the
             | queue, they just need to consume it" or the inverse "we're
             | consuming, they just need to produce". Both sides of the
             | queue are basically blind to each other. That's not to say
             | that all queues are bad, but I have seen a fair few queues
             | that existed basically just to create an ownership
             | boundary.
             | 
             | I once saw an app that did bidirectional RPC over message
             | queues because one team didn't believe the other
             | could/would do retries, on an app that handled single digit
             | QPS. It still boggles my mind that they thought it was
             | easier to invent a paradigm to match responses to requests
             | than it was to remind the other team to do retries, or
             | write them a library with retries built in, or just
             | participate in bleeping code reviews.
        
         | silasdavis wrote:
          | This is rather misleading; the FLP theorem talks about fully
         | asynchronous networks with unbounded delay. Partial synchrony
         | is a perfectly reasonable assumption and allows atomic
         | broadcast and locking to work perfectly well even if there is
         | an unknown but finite bound on network delay.
        
         | gunnarmorling wrote:
         | Had you read on, you'd have seen that I am discussing this very
         | point:
         | 
         | > leader election will only ever be eventually correct... So
         | you'll always need to be prepared to detect and fence off work
         | done by a previous leader.
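          | 
          | To illustrate the fencing part, a generic sketch (not code
          | from the post): give each new leader a strictly larger epoch
          | and have the protected resource reject writes that carry a
          | stale one.
          | 
          |     class FencedStore:
          |         # Stands in for any resource a leader writes to.
          |         def __init__(self):
          |             self.highest_epoch = 0
          |             self.data = {}
          | 
          |         def write(self, epoch, key, value):
          |             # A deposed leader still holds an old epoch,
          |             # so its writes get rejected here.
          |             if epoch < self.highest_epoch:
          |                 raise PermissionError("stale epoch")
          |             self.highest_epoch = epoch
          |             self.data[key] = value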
        
         | setheron wrote:
         | I remember at Oracle they built systems to shut down the
         | previous presumed leader to definitively know it wasn't
         | ghosting.
        
           | tanelpoder wrote:
           | Yep, the "STONITH" technique [1]. But programmatically
           | resetting one node over a network/RPC call might not work, if
           | internode-network comms are down for that node, but it can
           | still access shared storage via other networks... The
           | Oracle's HA fencing doc mentions other methods too, like IPMI
           | LAN fencing and SCSI persistent reservations [2].
           | 
           | [1] https://en.wikipedia.org/wiki/STONITH
           | 
           | [2] https://docs.oracle.com/en/operating-systems/oracle-
           | linux/8/...
        
             | setheron wrote:
              | They had access to the ILOM and had a much more durable
              | way to STONITH. Of course every link can "technically"
              | fail, but it brought reliability to so many 9s that the
              | remaining failure modes felt unwarranted to consider.
        
               | tanelpoder wrote:
               | Yep and ILOM access probably happens over the management
               | network and can hardware-reset the machine, so the
               | dataplane internode network issues and any OS level
               | brownouts won't get in the way.
        
         | woooooo wrote:
          | They both reduce to a Paxos-style atomic broadcast, which is in
          | fact possible, although the legend is that Leslie Lamport was
          | trying to prove it impossible and accidentally found a way.
        
         | butterisgood wrote:
         | Indeed a slow node looks like a dead node for a while until it
         | isn't.
         | 
          | At some point, what separates distributed systems that work
          | well from those that don't is the "art of tuning timeouts and
          | retries".
          | 
          | Also, nothing in production is perfect - so we should always
          | consider failures and their impact when writing code for
          | distributed systems.
         | 
         | And we will still make mistakes...
        
         | p1necone wrote:
         | > which is proven to be impossible.
         | 
         | 'Technically' intractable problems are solvable just fine in a
         | way that is almost as useful as solving them completely if you
         | can achieve one of two things:
         | 
         | * Reliably identify when you've encountered an unsolvable case
         | (usefulness of this approach depends on the exact problem
         | you're solving).
         | 
         | or
         | 
         | * Reduce the probability of unsolvable cases/incorrect
         | solutions to a level low enough to not actually happen in
         | practice.
         | 
         | 'Technically' GUIDs are impossible, reliable network
          | communication (TCP) is impossible, O(n^2) time complexity
         | functions will grow to unusably large running times - but in
         | practice all of these things are used constantly to solve real
         | problems.
        
       | Spivak wrote:
        | Why are people going for this guy's throat-- distributed locking
        | might be impossible, but what is being described, distributed
        | leasing, is totally possible and useful. There are no sins being
        | committed here.
       | 
       | I might choose DynamoDB over S3 to implement this but both are
       | fine.
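        | 
        | For what it's worth, a hedged sketch of the DynamoDB variant: a
        | lease item that can only be taken if it's absent or expired,
        | enforced by a conditional write. Table and attribute names are
        | made up, and the expiry check leans on reasonably synchronized
        | clocks, which is exactly the caveat discussed upthread.
        | 
        |     import time, uuid
        |     import boto3
        | 
        |     ddb = boto3.client("dynamodb")
        |     Conflict = ddb.exceptions.ConditionalCheckFailedException
        |     TABLE, LEASE_SECS = "leases", 10  # placeholders
        |     me = str(uuid.uuid4())
        | 
        |     def try_acquire():
        |         now = int(time.time())
        |         item = {
        |             "name":   {"S": "leader"},
        |             "owner":  {"S": me},
        |             "expiry": {"N": str(now + LEASE_SECS)},
        |         }
        |         try:
        |             # take it only if nobody holds a live lease
        |             ddb.put_item(
        |                 TableName=TABLE,
        |                 Item=item,
        |                 ConditionExpression=(
        |                     "attribute_not_exists(#n) OR #e < :now"),
        |                 ExpressionAttributeNames={
        |                     "#n": "name", "#e": "expiry"},
        |                 ExpressionAttributeValues={
        |                     ":now": {"N": str(now)}},
        |             )
        |             return True   # lease is ours until it expires
        |         except Conflict:
        |             return False  # someone else holds a live lease
        | 
        | Renewing the lease is the same conditional write with the
        | condition relaxed to also allow the current owner; that's left
        | out to keep the sketch short.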
        
         | otterley wrote:
         | HN being HN, I guess. Nerds like to nitpick.
         | 
         | That said, if you have a choice between implementing
         | idempotency and using a distributed lock, you should always opt
         | for the former. It's far less complex and less error prone.
        
       | shayonj wrote:
        | Very cool case of using S3 conditional writes for distributed
       | leasing
        
       | mgdev wrote:
       | This is a very expensive way to do leader election, at least from
       | an infrastructure perspective.
       | 
       | This is because you're essentially pushing the problem down to
       | S3, which does its own leader election in a way that is waaaay
       | overbuilt for what we're trying to accomplish here.
       | 
       | But... that doesn't mean it isn't cool. :)
        
         | otterley wrote:
         | Since the user is only paying an infinitesimal fraction of the
         | infrastructure cost, does it really matter? From the user's
         | perspective it's extremely inexpensive.
        
           | mgdev wrote:
           | If everyone started using S3 for this purpose, they would
           | shut it down pretty quickly.
           | 
           | Small objects (especially small hot objects) are actually
           | problematic in S3. They cache them in the keymap instead of
           | retrieving from block storage. But many small objects can
           | quickly blow out a keymap shard. The keymap is designed for
           | high-performance lookups. Because of this, it's also more
           | expensive to scale out this class of 'storage' at the keymap
           | layer than to scale out block storage. And you still have to
           | do quorum writes to block storage and then cache eviction at
           | the keymap.
           | 
           | If you're doing this for slow-moving leader election in a
           | small cluster, fine. But if, for example, you started using
           | this for leader-election among end-user-defined workloads
           | that could scale up/down, you might find yourself on the
           | other side of a call from AWS.
        
             | otterley wrote:
             | > If everyone started using S3 for this purpose, they would
             | shut it down pretty quickly
             | 
             | I work at AWS but not on the S3 service team. (Opinions are
             | entirely my own.)
             | 
             | I have little doubt that the team already considered this
             | possibility before releasing the feature. The choice to
             | make new functionality available in a service is considered
             | a "one-way door" that receives a tremendous amount of
             | scrutiny. Once a feature is made available, they will do
             | everything they can to support it as long as possible, even
             | at the cost of convenience to themselves. And even de-
             | emphasized services (keep the lights on) are rarely
             | deprecated completely - existing accounts and customers
             | already using them can frequently continue to do so.
        
               | mgdev wrote:
               | They introduce economic incentives. :)
               | 
               | By way of example, a little over a decade ago a famous
               | online streaming company used to upload all of their
               | video masters for transcoding. This process involved a
               | huge upload and a huge workload, followed by a
               | significant drop in activity.
               | 
               | The problem was that AWS had to provision for peak usage
               | instead of average usage. This resulted in a situation
               | where the peak-to-average ratio was very high for just
               | one or two customers.
               | 
               | To address this issue, the solution was to incentivize
               | these customers to spread out their workload more evenly
               | over time, at least until they were no longer the largest
               | driver of peak/avg.
               | 
               | This is also why things like Reserved Instances and Spot
               | Instances exist.
        
               | otterley wrote:
               | I'm sure we agree that other means such as changing
               | pricing and reaching out to exceptional customers are not
               | the same thing as killing a feature.
        
               | mgdev wrote:
               | I never said they would kill the feature. I said they
               | would shut down the behavior. They can do that through
               | economic incentive.
               | 
               | (Source: I was on the S3 team. Opinions my own, etc.)
        
               | otterley wrote:
               | You said they would "shut it down," which can reasonably
               | be interpreted as killing the feature. If you meant
               | something more specific, you should have said that.
               | Clarity is the duty of the speaker.
        
               | Terretta wrote:
               | >> _"If everyone started using S3 for this purpose, they
               | would shut it down pretty quickly."_
               | 
               | > _"I never said they would kill the feature. I said they
               | would shut down the behavior."_
               | 
               | Seems "it" is standing in for a lot of nuance here.
               | 
               | With these three sentences side by side, still took a
               | while to see how you might be referring to the "everyone
               | using" behavior instead of the "S3 purpose" feature!
               | Usually given ambiguity the nearest plausible reference
               | wins.
               | 
               | Since in tech "shut it down" is more often systems than
               | humans _and_ that was the nearest reference  "it" could
               | refer to, took some doing to see how your assertion "I
               | said the behavior" could be accurate!
        
         | paulgb wrote:
         | One advantage of this approach is that if you're already using
         | S3 (as in the SlateDB case mentioned in the article), it's
         | essentially "free". And it means that a package like SlateDB
         | just needs to be pointed at an S3 bucket, instead of making you
         | point it to an S3 bucket _and_ a DynamoDB instance.
        
       | dekhn wrote:
        | Is Chubby a filesystem?
        
         | refulgentis wrote:
         | Google's locking service
        
           | dekhn wrote:
            | It's an ongoing joke inside Google: is Chubby a filesystem?
           | It's a locking service which exposes an API that resembles
           | files and paths, and even has a File implementation, but the
           | service owners really never wanted people to treat it like a
           | filesystem.
        
       ___________________________________________________________________
       (page generated 2024-08-30 23:00 UTC)