[HN Gopher] About synchronous disk replication
___________________________________________________________________
About synchronous disk replication
Author : tosh
Score : 65 points
Date : 2024-08-31 09:00 UTC (14 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| js4ever wrote:
| DRBD as a service
| rzzzt wrote:
| Most of the small-"c" cloud services are that way. Less
| novelty, more "I don't need to figure out how to configure DRBD
| just right and then have the burden of maintaining it fall on
| my shoulders".
| gregwebs wrote:
| This would be great for a lot of purposes but still seems to
| fall short for operating a DB where you don't want any data
| loss.
|
| Although it is called synchronous replication, it appears to be
| asynchronous in practice: the primary disk can acknowledge
| writes that the secondary has not yet acknowledged. Making this
| properly synchronous usually requires two secondaries, with a
| write acknowledged once it lands on 2 of the 3 disks: one
| secondary is allowed to fall behind, and the primary
| acknowledges a write after replicating to at least one of the
| two secondaries.
|
| This GCP offering lets you monitor when the secondary falls
| behind. With just one secondary, that means there is already a
| potential for data loss by the time you get the alert. If every
| write had reached at least one of two secondaries, then when
| the alert fires you would know you can still fail over to the
| secondary that is caught up, and you wouldn't have to panic
| while dealing with the replica that is falling behind.
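|
| Roughly, the write path I mean looks like the sketch below (a
| minimal illustration, not GCP's implementation; the replica
| clients and their write() calls are hypothetical):
|
|     from concurrent.futures import ThreadPoolExecutor, as_completed
|
|     def quorum_write(block, data, replicas, quorum=2):
|         """Acknowledge a write once `quorum` replicas confirm it.
|
|         `replicas` is a list of hypothetical replica clients that
|         expose a blocking write(block, data) -> bool call.
|         """
|         pool = ThreadPoolExecutor(max_workers=len(replicas))
|         futures = [pool.submit(r.write, block, data) for r in replicas]
|         acks = 0
|         try:
|             for f in as_completed(futures):
|                 if f.result():      # this replica confirmed the write
|                     acks += 1
|                 if acks >= quorum:  # majority reached: ack the caller
|                     return True
|             return False            # quorum not reached: surface an error
|         finally:
|             pool.shutdown(wait=False)  # a lagging replica may finish later
|
| One secondary can lag without blocking acknowledgments, but no
| write is ever acknowledged while it exists on only one disk.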
| karmakaze wrote:
| I see your point, but calling this synchronous doesn't seem any
| farther off than a caching raid controller replying before
| writing.
| kobalsky wrote:
| From linked article:
|
| > If the disk replication status is catching up or degraded,
| then one of the zonal replicas is not updated with all the
| data. Any outage during this time in the zone of the healthy
| replica results in an unavailability of the disk until the
| healthy replica zone is restored.
|
| There isn't a binary log that the replicas can catch up from;
| if the healthy disk goes down you are out of luck.
|
| MySQL's semi-synchronous replication is "more synchronous" than
| this. If it's enabled, the source won't acknowledge a
| transaction until a replica has the transaction saved in its
| relay log. The replica could still be out of sync, but if the
| master exploded, it would eventually catch up to the master by
| applying its own relay log.
|
| Either I'm misunderstanding Google's service here, or the name
| doesn't seem right.
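|
| For reference, a rough sketch of turning that on, using the
| pre-8.0.26 plugin and variable names and a generic DB-API
| connection pair (`source_conn` / `replica_conn` are assumed to
| exist with the right privileges; plugin file names vary by
| platform):
|
|     def enable_semisync(source_conn, replica_conn, timeout_ms=10000):
|         src = source_conn.cursor()
|         # Source: hold the commit ack until >= 1 replica has logged the
|         # transaction, falling back to async after timeout_ms of silence.
|         src.execute("INSTALL PLUGIN rpl_semi_sync_master "
|                     "SONAME 'semisync_master.so'")
|         src.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
|         src.execute(f"SET GLOBAL rpl_semi_sync_master_timeout = {int(timeout_ms)}")
|
|         rep = replica_conn.cursor()
|         rep.execute("INSTALL PLUGIN rpl_semi_sync_slave "
|                     "SONAME 'semisync_slave.so'")
|         rep.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
|         # Restart the replication I/O thread so the setting takes effect.
|         rep.execute("STOP SLAVE IO_THREAD")
|         rep.execute("START SLAVE IO_THREAD")
|
| Note the timeout: if no replica answers in time, the source
| quietly falls back to asynchronous replication, which is the
| same availability-over-durability tradeoff being discussed here.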
| nyrikki wrote:
| MySQL semi-synchronous replication is similar to a Cassandra
| consistency level of ONE. That is typically a lens that helps.
|
| As block devices aren't ACID, their challenges are greater.
|
| But note the following from the MySQL docs; similar problems
| exist when you have to fail over with uncommitted transactions.
|
| > With semisynchronous replication, if the source crashes
| and a failover to a replica is carried out, the failed
| source should not be reused as the replication source, and
| should be discarded. It could have transactions that were
| not acknowledged by any replica, which were therefore not
| committed before the failover.
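|
| For comparison, here is the Cassandra consistency-level knob
| mentioned above, via the Python driver (the contact point,
| keyspace and table are placeholders):
|
|     from cassandra import ConsistencyLevel
|     from cassandra.cluster import Cluster
|     from cassandra.query import SimpleStatement
|
|     session = Cluster(["127.0.0.1"]).connect("demo_ks")
|
|     # ONE: the coordinator acks as soon as a single replica has the
|     # write; the rest catch up in the background. QUORUM would wait
|     # for a majority instead.
|     stmt = SimpleStatement(
|         "INSERT INTO events (id, payload) VALUES (%s, %s)",
|         consistency_level=ConsistencyLevel.ONE,
|     )
|     session.execute(stmt, (1, "hello"))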
| Balinares wrote:
| Nope, this is in fact synchronous. It just degrades gracefully
| when one of the disaster domains is impacted. The main purpose
| here AIUI is availability, rather than durability.
|
| For durability, you'll indeed want something like 3-way
| replication. But that's a distinct problem. If durability is
| your concern but you're fine with the availability SLOs of a
| single disaster domain, then you don't need regional
| replication.
| foota wrote:
| Is the idea that you'll check replication status after making
| each write? Or I guess you could do this at the application
| level, like after completing a transaction to e.g., sqlite
| check that there's at least one other zone caught up before
| acking whatever call the user made?
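|
| Something like this sketch, where disk_replication_state() is a
| stand-in for whatever status the cloud API actually exposes
| (hypothetical name), and the transaction is plain SQLite:
|
|     import sqlite3
|
|     STATE_SYNCED = "synced"  # hypothetical "secondary caught up" value
|
|     def disk_replication_state():
|         """Placeholder: query the regional disk's replication status."""
|         raise NotImplementedError
|
|     def write_then_confirm(db_path, rows):
|         # 1. Commit locally; this hits the replicated disk.
|         conn = sqlite3.connect(db_path)
|         with conn:
|             conn.executemany("INSERT INTO events (payload) VALUES (?)", rows)
|         conn.close()
|
|         # 2. Only ack the user once the secondary zone has caught up;
|         #    if the status reports "in sync" after the commit, the
|         #    commit has reached both replicas.
|         if disk_replication_state() == STATE_SYNCED:
|             return "acked"
|         return "pending-replication"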
| leg wrote:
| Many enterprise storage systems make the same
| durability/availability tradeoff as these replicated disks when
| replicating outside of a single datacenter.
| (Oracle calls it "max availability": try to synchronously
| replicate, but if the remote side is offline, allow
| transactions to commit.) Real world banks run on these
| sorts of systems.
|
| Users don't continuously check replication status. They
| rely on it being synchronous almost all the time.
|
| 3-way quorum replication is great, but then you need to send
| writes to more data centers, potentially affecting performance.
| There's a tradeoff.
|
| (I work on GCP storage)
| gregwebs wrote:
| It's synchronous much of the time but can become asynchronous
| at any moment? I would still call that asynchronous.
|
| The availability story isn't incredible either. Once the
| secondary is behind, an outage of the primary means the system
| is no longer operational until that primary can be restored.
|
| Three-way replication with Raft and 2/3 or 3/5 acks is what
| modern distributed databases use. It provides both availability
| and durability.
| jiggawatts wrote:
| Azure has had zone-redundant disks with 3-way synchronous writes
| since 2021: https://learn.microsoft.com/en-gb/azure/virtual-machines/dis...
|
| The same underlying system is also available for blob storage.
|
| It's strange that GCP and AWS are so far behind on this feature.
| leg wrote:
| Regional Persistent Disk was in beta in 2019. Usability hiccups
| and other annoyances meant it only reached GA in 2023, but it
| has been used under Cloud SQL for quite a while.
|
| (I work on GCP storage)
| jiggawatts wrote:
| To be fair, it's notable that Azure doesn't publish any
| information about consistency guarantees (or lack thereof).
|
| I did notice in an article about using blob witness for SQL
| clusters that they're not all interchangeable.
| londons_explore wrote:
| I really don't know why Google doesn't just let users pick the
| how many copies to keep, and how many writes must be ACK'ed
| before the VM sees a write as complete.
|
| Then users can decide the cost vs. reliability vs. durability
| tradeoff for data written milliseconds before an outage.
|
| Perhaps give users a web-based calculator where you can put in
| the numbers and see how much it would cost in $ per GB per day,
| the mean time to committed data loss based on historical data
| (in years or centuries), and the typical increase in write
| latency (loss of performance) compared to a single replica.
|
| Then the user can decide.
| kccqzy wrote:
| If you were a team working at Google internally, then yes, you
| can pick these parameters. You can even pick between straight-up
| replication and Reed-Solomon codes. I just don't understand why
| this is not exposed in their cloud offering.
| foota wrote:
| This is replication between zones, not within a single zone.
| londons_explore wrote:
| The form would look like this:
|
|     CREATE PERSISTENT DISK
|     ----------------------
|     Replicas: W
|     Zones to split replication across: X
|     How many replicas must complete a write to allow the VM to continue: Y
|     How many zones must complete a write to allow the VM to continue: Z
|
|     With the above settings, you can expect approximately:
|     Write latency 5-95%: 0.5-2.5 milliseconds
|     Mean time to committed data loss: 37 years
|     Mean time to failure to write: 3 years
|     Mean time to data loss of data over 1 minute old: 18327 years
|     Cost: $0.18/GB/month
|
| The user would set those 4 parameters, with help text for
| guidance and a big red warning if you set the parameters to
| something insanely slow/unreliable.
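|
| A toy version of that calculator, purely to show the shape of
| the tradeoff: every constant below is invented for illustration
| and none of the failure rates, latencies or prices are real GCP
| numbers (the zone-split parameter is ignored for brevity).
|
|     def mttdl_years(copies, annual_fail_rate=0.03, repair_years=1 / 365):
|         """Crude independent-failure estimate: data is lost only if
|         all `copies` fail within one ~1 day repair window."""
|         loss_rate = (copies * annual_fail_rate
|                      * (annual_fail_rate * repair_years) ** (copies - 1))
|         return 1 / loss_rate
|
|     def estimate(replicas, replica_acks, zone_acks):
|         BASE_LATENCY_MS = 0.5
|         CROSS_ZONE_PENALTY_MS = 1.0  # extra wait per additional zone ack
|         PRICE_PER_GB_MONTH = 0.08    # invented price per stored copy
|         return {
|             "write_latency_ms": BASE_LATENCY_MS
|                 + CROSS_ZONE_PENALTY_MS * max(zone_acks - 1, 0),
|             # Freshly acked data exists on `replica_acks` copies...
|             "mean_time_to_committed_data_loss_years":
|                 round(mttdl_years(replica_acks)),
|             # ...while older data has had time to reach every replica.
|             "mean_time_to_old_data_loss_years": round(mttdl_years(replicas)),
|             "cost_usd_per_gb_month": round(PRICE_PER_GB_MONTH * replicas, 2),
|         }
|
|     # e.g. 3 replicas, ack after 2 replicas in 2 different zones have it
|     print(estimate(replicas=3, replica_acks=2, zone_acks=2))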
| gtirloni wrote:
| And the customer support teams would waste countless hours
| debugging scenarios where the customer had no idea what they
| were doing. All of that with an SLO breathing down their neck.
| londons_explore wrote:
| Easy enough to say "You had the replication parameters set to 1
| replica, so data loss was expected every month on average, and
| now some of your data has been lost. You can try to recover
| data with these open-source tools, or you can delete the disk
| and start again. We recommend a replication factor of at least
| 1.15 to get a mean time between data loss of 150 years."
| Balinares wrote:
| > I just don't understand why this is not exposed in their
| cloud offering.
|
| Probably because the mental model for the underlying
| implementation that you are basing this idea on does not
| accurately reflect the actual reality of the implementation.
|
| As a very broad rule of thumb, whenever you find yourself
| saying, "Given <thing A>, I don't understand <thing B>," you
| probably want to re-examine Thing A.
| dangoodmanUT wrote:
| The problem with these solutions is how non-native they are to
| workloads. For zonal outages, you have to force-move things
| over. Even in k8s it doesn't seem like you can automatically
| fail over StatefulSets to another zone.
|
| It feels better to me to have the replication at the data store
| level (e.g. the database) rather than trying to hide it under
| APIs that really aren't meant for it.
| time4tea wrote:
| Memories of EMC Symmetrix SRDF. It was a bit more expensive
| back then, but it was quite nice.
___________________________________________________________________
(page generated 2024-08-31 23:01 UTC)