[HN Gopher] Leader Election with S3 Conditional Writes
___________________________________________________________________
Leader Election with S3 Conditional Writes
Author : gunnarmorling
Score : 103 points
Date : 2024-08-26 13:54 UTC (4 days ago)
(HTM) web link (www.morling.dev)
(TXT) w3m dump (www.morling.dev)
| mprime1 wrote:
| Leader election and distributed locking reduce to the same
| problem... which is proven to be impossible. That means in
| some edge case it will fail on you; is your system handling
| those cases?
|
| I didn't read past this:
|
| > Systems like Apache ZooKeeper or Postgres (via Advisory Locks)
| provide the required building blocks for this
|
| ZooKeeper is the original sin. Convincing a whole generation
| of programmers that distributed locks are a feasible solution.
|
| This is my biggest pet peeve in distributed systems.
|
| ----
|
| And if you don't believe me, maybe you'll trust Kyle K. of Jepsen
| fame:
|
| > However, perfect failure detectors are impossible in
| asynchronous networks.
|
| Links to:
| https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-...
|
| https://jepsen.io/analyses/datomic-pro-1.0.7075
| realaleris149 wrote:
| Reliable network communication is also proven to be impossible
| [1], yet it happens all the time. Yes, sometimes it fails, but
| it still "works".
|
| [1] https://en.wikipedia.org/wiki/Two_Generals%27_Problem
| EGreg wrote:
| I literally had this argument with David Schwartz of Ripple
| about our lack of a consensus protocol.
| akira2501 wrote:
| > sometime it fails
|
| That failures are a possibility does not concern me. That the
| failures are not fully characterized does.
| withinboredom wrote:
| For me, I almost stopped reading at the assertion that clock
| drift doesn't matter. They clearly didn't think through the
| constant fight that would occur over who the leader actually
| was, and just hand-waved it away as 'not an issue.' They need
| to remove time from their equation completely if they want
| clock drift to not matter.
| pluto_modadic wrote:
| part of me thinks that clock drift would be reliably biased
| toward a particular node, not leapfrogging between two nodes.
| sillysaurusx wrote:
| It's deeper than that. See the paper on time in distributed
| systems by Lamport:
| https://lamport.azurewebsites.net/pubs/time-clocks.pdf
|
| There's a relativity-like issue where it's impossible to
| have a globally consistent view of time.
|
| See IR2 for how to synchronize physical time in a
| distributed system.
|
| "(a) If Pg sends a message m at physical time t, then m
| contains a timestamp Tm= C/(t). (b) Upon receiving a
| message m at time t', process P/ sets C/(t') equal to
| maximum (Cj(t' - 0), Tm + /Zm)."
|
| I _think_ the formula is saying that the clocks will only
| ever increase (i.e. drift upwards). If so, then you could
| imagine two processes leapfrogging if one of them sends a
| message that bumps the other's clock, then that one sends a
| message back that bumps the first.
|
| But I'm curious how it behaves if one of the clocks is
| running faster, e.g. a satellite has a physically different
| sense of time than an observer on the ground.
|
| Also note the paper claims you can't get rid of clocks if
| you want to totally order the events.
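|
| A toy sketch of that rule (my own illustration in Python, not
| the paper's code), with the paper's mu_m -- the assumed
| minimum transmission delay -- folded in as a constant. Note
| how a receive can only ever push a clock forward:
|
|   MIN_DELAY = 1  # the paper's mu_m: min transmission delay
|
|   class LamportClock:
|       def __init__(self):
|           self.c = 0
|
|       def tick(self):  # a local event advances the clock
|           self.c += 1
|           return self.c
|
|       def send(self):  # IR2(a): stamp the outgoing message
|           return self.tick()
|
|       def recv(self, t_m):  # IR2(b): never step backwards
|           self.c = max(self.c, t_m + MIN_DELAY)
|           return self.c
|
|   # Two clocks "leapfrogging": each receive bumps the
|   # receiver past the sender.
|   a, b = LamportClock(), LamportClock()
|   b.recv(a.send())  # b jumps ahead of a
|   a.recv(b.send())  # a jumps ahead of b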
| rhaen wrote:
| It's a truly fantastic paper, but as with many things, the
| fact that it's impossible to have a _perfectly_ consistent
| global view of time doesn't mean it's not possible to have a
| near-perfect one. The blog mentions AWS's Time Sync Service,
| which lets you synchronize to within microseconds
| (https://aws.amazon.com/about-aws/whats-
| new/2023/11/amazon-ti...), and Google's TrueTime, which gives
| you time ranges that you're guaranteed to be within
| (https://cloud.google.com/spanner/docs/true-time-
| external-con...).
| withinboredom wrote:
| It's near-perfect ... until it isn't and everything goes
| to shit in a cascading wtafc (what-the-actual-fuck-
| catastrophe).
|
| 1ms is an extremely long time in computer land, where
| decisions are made in nanoseconds. There are 1,000,000
| nanoseconds per millisecond.
| withinboredom wrote:
| > Also note the paper claims you can't get rid of clocks
| if you want to totally order the events.
|
| The order of events isn't important here. We only care
| about the 'last event' which everyone can agree on; they
| just can't agree on when it happened.
|
| In other words, they can all agree that there is a
| leader; we just don't know if that leader is still alive
| or not. The best thing to do is simply make the leader
| election deterministically random:
|
| 1. 'seed' the file with a 'no leader' file.
|
| 2. at randomly short intervals (max latency you want to
| deal with, so like once every few seconds or so), try to
| claim the file by using the hash of the current file as
| your etag. The winner is the leader.
|
| 3. Once there is a leader and you are the leader, every N
| seconds, update the file with a new number, using the
| hash of the current file as the etag.
|
| 4. If you are not the leader, every N*(random jitter)*C
| (where C>N*2, adjust for latency), attempt to take the
| file using the same method above. If you fail, retry
| again in N*(random jitter)*C.
|
| 5. If you do take the file, you are not the leader until
| you've held leadership for at least N seconds.
|
| This doesn't necessarily remove the clock. However, most --
| if not all -- modern clocks agree on the length of a second
| to within a few nanoseconds of any other clock, regardless
| of how accurate their total count of seconds since the epoch
| is. This scheme also probably doesn't work as written, but it
| probably isn't _that_ far away from one that does; a rough
| sketch of steps 2-4 follows.
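|
| A minimal Python sketch of the compare-and-swap loop (steps 2
| and 4), assuming an S3-compatible store that honors If-Match
| on PUT -- S3's conditional writes initially covered only
| If-None-Match -- with made-up bucket/key names and the object
| pre-seeded as in step 1:
|
|   import json
|   import time
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   s3 = boto3.client("s3")
|   BUCKET, KEY = "my-bucket", "leader-lease"  # hypothetical
|
|   def try_claim(me: str) -> bool:
|       """One CAS attempt: read the lease, then PUT conditioned
|       on its etag, so at most one contender's write lands."""
|       cur = s3.get_object(Bucket=BUCKET, Key=KEY)
|       body = json.dumps({"leader": me, "at": time.time()})
|       try:
|           s3.put_object(Bucket=BUCKET, Key=KEY, Body=body,
|                         IfMatch=cur["ETag"])
|           return True  # our conditional write won
|       except ClientError as e:
|           status = e.response["ResponseMetadata"]["HTTPStatusCode"]
|           if status == 412:  # precondition failed: lost the race
|               return False
|           raise
|
| The winner then re-PUTs against its own latest etag every N
| seconds (step 3), while everyone else retries with jitter
| (step 4).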
| orf wrote:
| > ...which is proven to be impossible
|
| For some definition of impossible, given that many systems
| utilise them effectively. Not all corner cases or theoretical
| failure modes are relevant to everyone.
| lijok wrote:
| Many systems are buggy, periodically running into corner
| cases that are borderline impossible to debug.
| ahoka wrote:
| Sometimes it's enough to detect these cases and reset the
| system to its stable state.
|
| For example, bicycle wheels are bistable systems, but they
| usually stay in their useful state, so it does not matter in
| practice.
| Spivak wrote:
| Yes, and yet those systems still work and deliver real
| value all day, every day. If every company's Rollbar I've
| ever seen is the measure, good software can have millions of
| faults and still work for users.
| everforward wrote:
| > Convincing a whole generation of programmers that
| distributed locks are a feasible solution.
|
| I too hate this. Not just because the edge cases exist, but
| also because of the related property: it makes the system very
| hard to reason about.
|
| Questions that should be simple become complicated. What
| happens when the distributed locking system is down? What
| happens when we reboot all the nodes at once? What if they
| don't come down at exactly the same time and there's leader
| churn for like 2 minutes? Etc, etc.
|
| Those questions should be fairly simple, but become something
| where a senior dev is having to trace codepaths and draw on a
| whiteboard to figure it out. It's not even enough to understand
| how a single node works in-depth, they have to figure out how
| this node works but also how this node's state might impact
| another node's.
|
| All of this is much simpler in leaderless systems (where the
| leader system is replaced with idempotency or a scheduler or
| something else).
|
| I very strongly prefer avoiding leader systems; it's a method
| of last resort when literally nothing else will work. I would
| much rather scale a SQL database to support the queries for
| idempotency than deal with a leader system.
|
| I've never seen an idempotent system switch to a leader system,
| but I've sure seen the reverse a few times.
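|
| The idempotency-key pattern is just a unique constraint doing
| the deduplication. A minimal sketch, with SQLite standing in
| for the SQL database and made-up table/key names:
|
|   import sqlite3
|
|   db = sqlite3.connect(":memory:")
|   db.execute("""CREATE TABLE jobs (
|       idempotency_key TEXT PRIMARY KEY,  -- client-chosen key
|       payload         TEXT NOT NULL)""")
|
|   def submit(key: str, payload: str) -> bool:
|       """True only for the first submission of a key; retries
|       and duplicate deliveries become harmless no-ops."""
|       cur = db.execute(
|           "INSERT OR IGNORE INTO jobs VALUES (?, ?)",
|           (key, payload))
|       db.commit()
|       return cur.rowcount == 1
|
|   assert submit("order-42", "charge $10")
|   assert not submit("order-42", "charge $10")  # retry: no-op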
| anothername12 wrote:
| >> Convincing a whole generation of programmers that
| distributed locks are a feasible solution.
|
| > I too hate this. Not just because the edge cases exist, but
| also because of the related property: it makes the system
| very hard to reason about.
|
| I think this is a huge problem with the way we're developing
| software now. Distributed systems are extremely difficult for
| a lot of reasons, yet they're often our first choice when
| developing even small systems!
|
| At $COMPANY we have hundreds of lambdas, DocumentDB (which,
| btw, is hell in case you're considering it), and other cloud
| storage and queuing components. On-call and bug fixing are
| basically quests to find some corner-case race condition or
| timing problem, read-after-write assumption, etc.
|
| I'm ashamed to say, we have reads wrapped in retry loops
| everywhere.
|
| The whole thing could have been a Rails app built by a
| fraction of the team, with a massive increase in reliability
| and an easier time reasoning about it and delivering
| features.
|
| You could say we're doing it wrong, and you'd probably be
| partly right for sure, but I've done consulting for a decade
| at dozens of other places and it always seems like this.
| icedchai wrote:
| I see the same. All this complexity to handle a few
| requests/second... but at least we can say it's "cloud
| native."
| giovannibonetti wrote:
| Same thing where I work now. Many experienced developers
| waste a huge chunk of their time trying to wrap their heads
| around their Django microservices' communication patterns and
| edge cases. Much more complex than an equivalent Rails
| monolith, even though Ruby and Rails both have their issues
| and could be replaced by more modern tech in 2024.
| everforward wrote:
| > You could say we're doing it wrong, and you'd probably be
| partly right for sure, but I've done consulting for a
| decade at dozens of other places and it always seems like
| this.
|
| The older I get, the more I think this is a result of
| Conway's law and that a lot of this architectural cruft
| stems from designing systems around communication
| boundaries rather than things that make technical sense.
|
| Monolithic apps like Rails only happen under a single team
| or teams that are so tightly coupled people wonder whether
| they should just merge.
|
| Distributed apps are very loosely coupled, so it's what you
| would expect to get from two teams that are far apart on
| the org chart.
|
| Anecdotally, it mirrors what I've seen in practice. Closely
| related teams trust each other and are willing to make a
| monolith under an assumption that their partner team won't
| make it a mess. Distantly related teams play games around
| ensuring that their portion is loosely coupled enough that
| it can have its own due dates, reliability, etc.
|
| Queues are the king of distantly coupled systems. A team's
| part of a queue-based app can be declared "done" before the
| rest of it is even stood up. "We're dumping stuff into the
| queue, they just need to consume it" or the inverse "we're
| consuming, they just need to produce". Both sides of the
| queue are basically blind to each other. That's not to say
| that all queues are bad, but I have seen a fair few queues
| that existed basically just to create an ownership
| boundary.
|
| I once saw an app that did bidirectional RPC over message
| queues because one team didn't believe the other
| could/would do retries, on an app that handled single digit
| QPS. It still boggles my mind that they thought it was
| easier to invent a paradigm to match responses to requests
| than it was to remind the other team to do retries, or
| write them a library with retries built in, or just
| participate in bleeping code reviews.
| silasdavis wrote:
| This is rather misleading: the FLP theorem is about fully
| asynchronous networks with unbounded delay. Partial synchrony
| is a perfectly reasonable assumption and allows atomic
| broadcast and locking to work perfectly well, even if there
| is only an unknown but finite bound on network delay.
| gunnarmorling wrote:
| Had you read on, you'd have seen that I am discussing this very
| point:
|
| > leader election will only ever be eventually correct... So
| you'll always need to be prepared to detect and fence off work
| done by a previous leader.
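|
| For anyone unfamiliar with fencing, a minimal illustration
| (my sketch, not the post's code): each elected leader gets a
| monotonically increasing epoch, attaches it to every
| downstream write, and the resource rejects anything older
| than the newest epoch it has seen.
|
|   class FencedResource:
|       def __init__(self):
|           self.highest_epoch = 0
|
|       def write(self, epoch: int, data: str) -> None:
|           # Reject writes from any deposed leader.
|           if epoch < self.highest_epoch:
|               raise PermissionError("stale leader; fenced off")
|           self.highest_epoch = epoch
|           print(f"applied {data!r} at epoch {epoch}")
|
|   r = FencedResource()
|   r.write(1, "from leader A")
|   r.write(2, "from leader B")      # B took over
|   r.write(1, "late write from A")  # raises: A was deposed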
| setheron wrote:
| I remember at Oracle they built systems to shut down the
| previous presumed leader to definitively know it wasn't
| ghosting.
| tanelpoder wrote:
| Yep, the "STONITH" technique [1]. But programmatically
| resetting one node over a network/RPC call might not work, if
| internode-network comms are down for that node, but it can
| still access shared storage via other networks... The
| Oracle's HA fencing doc mentions other methods too, like IPMI
| LAN fencing and SCSI persistent reservations [2].
|
| [1] https://en.wikipedia.org/wiki/STONITH
|
| [2] https://docs.oracle.com/en/operating-systems/oracle-
| linux/8/...
| setheron wrote:
| They had access to the ILOM and so had a much more durable
| way to STONITH. Of course every link can "technically" fail,
| but that brought the odds down to so many 9s that considering
| them felt unwarranted.
| tanelpoder wrote:
| Yep, and ILOM access probably happens over the management
| network and can hardware-reset the machine, so dataplane
| internode network issues and any OS-level brownouts won't
| get in the way.
| woooooo wrote:
| They both reduce to a Paxos-style atomic broadcast, which is
| in fact possible, although legend has it that Leslie Lamport
| was trying to prove it impossible and accidentally found a
| way.
| butterisgood wrote:
| Indeed, a slow node looks like a dead node for a while, until
| it isn't.
|
| At some point, what separates distributed systems that work
| well from those that don't is the "art of tuning timeouts and
| retries".
|
| Also, nothing in production is perfect, so when writing code
| for distributed systems we should always consider failures
| and their impacts.
|
| And we will still make mistakes...
| p1necone wrote:
| > which is proven to be impossible.
|
| 'Technically' intractable problems are solvable just fine in a
| way that is almost as useful as solving them completely if you
| can achieve one of two things:
|
| * Reliably identify when you've encountered an unsolvable case
| (usefulness of this approach depends on the exact problem
| you're solving).
|
| or
|
| * Reduce the probability of unsolvable cases/incorrect
| solutions to a level low enough to not actually happen in
| practice.
|
| 'Technically' GUIDs are impossible, reliable network
| communication (TCP) is impossible, and O(n^2) algorithms grow
| to unusably long running times - but in practice all of these
| things are used constantly to solve real problems.
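|
| For the GUID case, the back-of-envelope birthday bound shows
| how small the collision probability gets (a v4 UUID has 122
| random bits; the trillion-ID count is just an example):
|
|   n = 10**12          # a trillion random v4 UUIDs
|   p = n * n / 2**123  # birthday bound: ~9.4e-14
|   print(p)            # a collision effectively never happens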
| Spivak wrote:
| Why are people going for this guy's throat? Distributed
| locking might be impossible, but what is being described --
| distributed leasing -- is totally possible and useful. There
| are no sins being committed here.
|
| I might choose DynamoDB over S3 to implement this, but both
| are fine.
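|
| For reference, a hedged sketch of the DynamoDB variant: a
| conditional put succeeds only if no live lease exists, so at
| most one holder wins per lease period. The table name,
| attribute names, and TTL are all made up.
|
|   import time
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   table = boto3.resource("dynamodb").Table("leases")
|   TTL = 10  # lease length in seconds
|
|   def acquire(me: str) -> bool:
|       now = int(time.time())
|       try:
|           table.put_item(
|               Item={"name": "leader", "owner": me,
|                     "expires": now + TTL},
|               # Succeed only if no lease exists or it expired;
|               # "name" is a DynamoDB reserved word, so alias it.
|               ConditionExpression=
|                   "attribute_not_exists(#n) OR #e < :now",
|               ExpressionAttributeNames={"#n": "name",
|                                         "#e": "expires"},
|               ExpressionAttributeValues={":now": now})
|           return True
|       except ClientError as e:
|           if (e.response["Error"]["Code"]
|                   == "ConditionalCheckFailedException"):
|               return False  # someone else holds a live lease
|           raise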
| otterley wrote:
| HN being HN, I guess. Nerds like to nitpick.
|
| That said, if you have a choice between implementing
| idempotency and using a distributed lock, you should always opt
| for the former. It's far less complex and less error prone.
| shayonj wrote:
| Very cool case of using S3 conditional writes for distributed
| leasing.
| mgdev wrote:
| This is a very expensive way to do leader election, at least from
| an infrastructure perspective.
|
| This is because you're essentially pushing the problem down to
| S3, which does its own leader election in a way that is waaaay
| overbuilt for what we're trying to accomplish here.
|
| But... that doesn't mean it isn't cool. :)
| otterley wrote:
| Since the user is only paying an infinitesimal fraction of the
| infrastructure cost, does it really matter? From the user's
| perspective it's extremely inexpensive.
| mgdev wrote:
| If everyone started using S3 for this purpose, they would
| shut it down pretty quickly.
|
| Small objects (especially small hot objects) are actually
| problematic in S3. They cache them in the keymap instead of
| retrieving from block storage. But many small objects can
| quickly blow out a keymap shard. The keymap is designed for
| high-performance lookups. Because of this, it's also more
| expensive to scale out this class of 'storage' at the keymap
| layer than to scale out block storage. And you still have to
| do quorum writes to block storage and then cache eviction at
| the keymap.
|
| If you're doing this for slow-moving leader election in a
| small cluster, fine. But if, for example, you started using
| this for leader-election among end-user-defined workloads
| that could scale up/down, you might find yourself on the
| other side of a call from AWS.
| otterley wrote:
| > If everyone started using S3 for this purpose, they would
| shut it down pretty quickly
|
| I work at AWS but not on the S3 service team. (Opinions are
| entirely my own.)
|
| I have little doubt that the team already considered this
| possibility before releasing the feature. The choice to
| make new functionality available in a service is considered
| a "one-way door" that receives a tremendous amount of
| scrutiny. Once a feature is made available, they will do
| everything they can to support it as long as possible, even
| at the cost of convenience to themselves. And even de-
| emphasized ("keep the lights on") services are rarely
| deprecated completely - existing accounts and customers
| already using them can frequently continue to do so.
| mgdev wrote:
| They introduce economic incentives. :)
|
| By way of example, a little over a decade ago a famous
| online streaming company used to upload all of their
| video masters for transcoding. This process involved a
| huge upload and a huge workload, followed by a
| significant drop in activity.
|
| The problem was that AWS had to provision for peak usage
| instead of average usage. This resulted in a situation
| where the peak-to-average ratio was very high for just
| one or two customers.
|
| To address this issue, the solution was to incentivize
| these customers to spread out their workload more evenly
| over time, at least until they were no longer the largest
| driver of peak/avg.
|
| This is also why things like Reserved Instances and Spot
| Instances exist.
| otterley wrote:
| I'm sure we agree that other means such as changing
| pricing and reaching out to exceptional customers are not
| the same thing as killing a feature.
| mgdev wrote:
| I never said they would kill the feature. I said they
| would shut down the behavior. They can do that through
| economic incentive.
|
| (Source: I was on the S3 team. Opinions my own, etc.)
| otterley wrote:
| You said they would "shut it down," which can reasonably
| be interpreted as killing the feature. If you meant
| something more specific, you should have said that.
| Clarity is the duty of the speaker.
| Terretta wrote:
| >> _"If everyone started using S3 for this purpose, they
| would shut it down pretty quickly."_
|
| > _"I never said they would kill the feature. I said they
| would shut down the behavior."_
|
| Seems "it" is standing in for a lot of nuance here.
|
| With these three sentences side by side, it still took a
| while to see how you might be referring to the "everyone
| using" behavior instead of the "S3 purpose" feature! Usually,
| given ambiguity, the nearest plausible reference wins.
|
| Since in tech "shut it down" more often refers to systems
| than to humans, _and_ that was the nearest reference "it"
| could point to, it took some doing to see how your assertion
| "I said the behavior" could be accurate!
| paulgb wrote:
| One advantage of this approach is that if you're already using
| S3 (as in the SlateDB case mentioned in the article), it's
| essentially "free". And it means that a package like SlateDB
| just needs to be pointed at an S3 bucket, instead of making you
| point it to an S3 bucket _and_ a DynamoDB table.
| dekhn wrote:
| Is chubby a filesystem?
| refulgentis wrote:
| Google's locking service
| dekhn wrote:
| It's an ongoing joke inside Google: is Chubby a filesystem?
| It's a locking service which exposes an API that resembles
| files and paths, and even has a File implementation, but the
| service owners really never wanted people to treat it like a
| filesystem.
___________________________________________________________________
(page generated 2024-08-30 23:00 UTC)