[HN Gopher] Just Say No to Paxos Overhead: Replacing Consensus w...
___________________________________________________________________
Just Say No to Paxos Overhead: Replacing Consensus with Network
Ordering (2016)
Author : yagizdegirmenci
Score : 41 points
Date : 2021-05-04 20:35 UTC (2 hours ago)
(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)
| infogulch wrote:
| Kinda neat. Splits the big problem of consensus into ordering and
| replication, and then leans on a network device like a switch to
| 'solve' the ordering problem in the context of a single data
| center. The key observation is that all those packets are going
| through the switch anyways, and the switch has enough spare
| compute to maintain a counter and add it as a header to packets,
| and it can easily be dynamically programmed with SDN...
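|
| In toy form, the hot path could be as small as this (a
| hypothetical Python sketch of the stamping step; the field name
| is made up, and the real thing is switch/P4 code, not Python):
|
|     seq = 0  # per-group counter kept by the sequencer switch
|
|     def stamp(packet: dict) -> dict:
|         # add the counter as an extra header before the packet
|         # is multicast on to the group's replicas
|         global seq
|         seq += 1
|         packet["oum_seq"] = seq
|         return packet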
|
| I bet public clouds could offer this as a specialized 'ordered
| multicast vnet' infrastructure primitive to build intra-dc
| replicated systems on top of.
| strictfp wrote:
| Reminds me of the hack of using auxiliary information such as
| AWS group metadata to determine ordering.
|
| If there's already an ordering system in place, directly or
| indirectly, why not use it?
| jpgvm wrote:
| They aren't wrong, but they are basically just digging up the
| Tandem architecture and adapting it to the RoCE/FCoE-enhanced
| Ethernet we have today.
|
| Ironically, that is where most of our newer interconnects have
| their heritage. PCIe is a descendant of InfiniBand, which in
| turn is a descendant of ServerNet, which was developed at
| Tandem as the internal interconnect for their NonStop systems,
| for the very use cases this paper describes.
|
| i.e. the very ordered networking this relies on, and the
| architectures it enables, were invented ~25 years ago.
|
| Unfortunately, the reality today is that we don't build our own
| data centres, especially not for very highly available systems.
| Instead we pay that latency penalty to build geographically
| redundant systems on cheap rented virtual hardware from 3rd
| parties.
|
| This is in part because the always-on nature of the internet
| changed the availability requirements for most software (which
| is now delivered as SaaS) from business hours in 1 timezone to
| always-on everywhere in the world. "The DC lost power", or some
| variation of that, is no longer an acceptable excuse for a
| business-critical application to be down.
|
| I'm old and grumpy though so whatever, everything old becomes new
| again eventually.
| AceJohnny2 wrote:
| That's funny: the tech lead at my last company was formerly
| from Tandem, and the architecture of our HA product reflected
| that :)
|
| It was a cool system, with architecture completely alien to the
| consumer products I work on nowadays, and I kinda miss QNX.
|
| Did you know a gregk?
| strictfp wrote:
| When working on networked control systems, I realized that
| ordered networking is the reason automation still uses
| proprietary protocols and interconnects such as Profibus and
| Profinet.
|
| If you want good guarantees you need determinism.
| wahern wrote:
| > The first aspect of our design is network serialization, where
| all OUM packets for a particular group are routed through a
| sequencer on the common path.
|
| This solution actually just shunts the problem to a different
| layer. To be robust to sequencer failure and rollover, you will
| need to rely on an inner consensus protocol to choose
| sequencers. Which is basically how Multi-Paxos, Raft, etc. work:
| you use the costly consensus protocol to pick a leader, and
| thereafter simply rely on the leader to serialize and ensure
| consistency.
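|
| To make the layering concrete, here is a hypothetical Python
| sketch (the names and the propose() interface are made up): the
| per-message fast path trusts the sequencer's stamp, and only
| sequencer replacement falls back to a real consensus round.
|
|     def on_sequencer_failure(consensus, group):
|         # one slow consensus decision per failover, not per
|         # message -- but the inner protocol still has to exist
|         epoch = consensus.propose(("new-sequencer", group))
|         return epoch  # receivers drop stamps from older epochs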
|
| It seems like an interesting paper w/ a novel engineering model
| and useful proofs. But from an abstract algorithmic perspective
| it doesn't actually offer anything new, AFAICT. There are an
| infinite number of ways to shuffle things around to minimize the
| components and data paths directly reliant on the consensus
| protocol. None obviate the need for an underlying consensus
| protocol, absent problems where invariants can be maintained
| independently.
| brighton36 wrote:
| These consensus systems are usually solutions in search of a
| problem. It's pretty rare for these consensus systems to offer
| an efficiency gain in practice...
| pfraze wrote:
| To be clear, this paper is referring to non-decentralized
| (strongly consistent, I assume) consensus algorithms, where the
| goal is to operate a cluster of machines as one logical system.
| You use this, for instance, to maintain high uptime in the face
| of individual machines going down.
|
| I suspect you were reacting to decentralized consensus
| (blockchains) which is a pretty different space.
| klodolph wrote:
| Yeah, this paper gets dredged up every now and then. Seems a bit
| like a cheap trick to say "just say no to Paxos overhead" and
| then _sequence network requests._ I'm all for exploring
| alternatives to Paxos, but if you're doing something so novel, my
| response is that I'll believe it when I see it operate at scale.
|
| I'm just not sure how you would sequence requests in a typical
| setup, with Clos fabrics everywhere, possibly when your
| redundancy group is spread across different geographical
| locations. Wouldn't you need some kind of queue to reorder
| messages? That queue could get large, and quickly.
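|
| (To illustrate the worry, a hypothetical Python sketch of a
| receive-side reorder buffer; deliver() is a stand-in for
| handing messages to the application:)
|
|     import heapq
|
|     def deliver(msg):
|         print(msg)  # stand-in for the application
|
|     expected = 1
|     pending = []  # min-heap of (seq, msg) that arrived early
|
|     def receive(seq, msg):
|         global expected
|         heapq.heappush(pending, (seq, msg))
|         # anything past a gap waits here until the gap fills;
|         # this heap is the queue that could get large quickly
|         while pending and pending[0][0] == expected:
|             deliver(heapq.heappop(pending)[1])
|             expected += 1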
|
| Paxos and Raft have the advantage of being simple and easy to
| understand. (Not necessarily easy to incorporate into system
| designs, sure, and the resulting systems can be fairly
| complicated, but Paxos and Raft themselves are simple enough to
| fit on notecards.)
| gfv wrote:
| >in a typical setup, with Clos fabrics everywhere
|
| You choose a single spine switch to carry the multicast
| messages destined to the process group. The paper also
| explicitly notes that different process groups need not share
| the sequencer.
|
| >when your redundancy group is spread across different
| geographical locations
|
| The paper's applicability is limited to data center networks
| with programmable network switching hardware.
___________________________________________________________________
(page generated 2021-05-04 23:00 UTC)