[HN Gopher] About synchronous disk replication
       ___________________________________________________________________
        
       About synchronous disk replication
        
       Author : tosh
       Score  : 65 points
       Date   : 2024-08-31 09:00 UTC (14 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | js4ever wrote:
       | DRBD as a service
        
         | rzzzt wrote:
          | Most of the small-"c" cloud services are that way. Less
          | novelty, more "I don't need to figure out how to configure
          | DRBD just right and then have the burden of maintaining it
          | fall on my shoulders".
        
       | gregwebs wrote:
       | This would be great for a lot of purposes but still seems to fall
       | short for operating a db in which you don't want any data loss.
       | 
        | Although it is called synchronous replication, it appears to be
        | asynchronous: the primary disk will acknowledge writes even when
        | the secondary has not yet acknowledged them. Making something
        | truly synchronous usually requires 2 secondaries, with writes
        | acknowledged once they land on 2 of 3 disks in total. That lets
        | one secondary fall behind, because the primary acknowledges a
        | write after replicating it to at least 1 of the 2 secondaries.
        | 
        | This GCP offering allows for monitoring when a secondary falls
        | behind. With just one secondary, that means when you get your
        | alert there's already a potential for data loss. If writes had
        | been acknowledged by at least 1 of 2 secondaries, then when you
        | get the alert you would know you can still fail over to a
        | secondary that is caught up, and wouldn't have to panic while
        | trying to deal with the replica that is falling behind.
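        | 
        | Very roughly, the quorum-ack idea looks something like this (a
        | toy Python sketch of the general technique, not GCP's actual
        | mechanism or API):
        | 
        |     # Ack a write once the primary plus at least one of two
        |     # secondaries has persisted it; the slowest replica may
        |     # lag, but any single replica can be lost without losing
        |     # acknowledged data.
        |     from concurrent.futures import (ThreadPoolExecutor,
        |                                     FIRST_COMPLETED, wait)
        | 
        |     REPLICAS = ["primary", "secondary-a", "secondary-b"]
        |     ACK_QUORUM = 2   # 2 of 3 disks in total
        |     pool = ThreadPoolExecutor(max_workers=len(REPLICAS))
        | 
        |     def persist(replica, offset, data):
        |         """Placeholder for a blocking, durable write."""
        |         return replica
        | 
        |     def quorum_write(offset, data):
        |         pending = {pool.submit(persist, r, offset, data)
        |                    for r in REPLICAS}
        |         acked = set()
        |         while len(acked) < ACK_QUORUM:
        |             done, pending = wait(pending,
        |                                  return_when=FIRST_COMPLETED)
        |             acked |= done
        |         # Acknowledge to the caller; the straggler keeps
        |         # replicating in the background.
        |         return True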
        
         | karmakaze wrote:
         | I see your point, but calling this synchronous doesn't seem any
          | farther off than a caching RAID controller replying before
         | writing.
        
           | kobalsky wrote:
           | From linked article:
           | 
           | > If the disk replication status is catching up or degraded,
           | then one of the zonal replicas is not updated with all the
           | data. Any outage during this time in the zone of the healthy
           | replica results in an unavailability of the disk until the
           | healthy replica zone is restored.
           | 
            | There isn't a binary log that the replicas can catch up
            | from; if the healthy disk goes down you are out of luck.
            | 
            | MySQL's semi-synchronous replication is "more synchronous"
            | than this. If it's enabled, it won't acknowledge a
            | transaction until a replica has the transaction's events
            | saved in its relay log. The replica could still be behind on
            | applying them, but if the master exploded, it would
            | eventually catch up using the events it already has logged.
            | 
            | Either I'm misunderstanding Google's service here, or the
            | name doesn't seem right.
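            | 
            | A toy model of that acknowledgement rule (illustrative
            | Python only, not the real server or plugin code):
            | 
            |     import queue, threading
            | 
            |     ACK_TIMEOUT_S = 10.0  # then degrade to async
            | 
            |     class Replica:
            |         def __init__(self):
            |             self.relay_log = []
            |         def receive(self, events, acks):
            |             # Durably logged, not necessarily applied yet.
            |             self.relay_log.append(events)
            |             acks.put(self)
            | 
            |     def commit(events, replicas):
            |         acks = queue.Queue()
            |         for r in replicas:
            |             threading.Thread(target=r.receive,
            |                              args=(events, acks)).start()
            |         try:
            |             # Wait for at least one relay-log ack.
            |             acks.get(timeout=ACK_TIMEOUT_S)
            |             return "committed (semi-sync)"
            |         except queue.Empty:
            |             return "committed (fell back to async)"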
        
             | nyrikki wrote:
              | MySQL semisynchronous replication is similar to a
              | Cassandra consistency level of ONE; that's typically a
              | helpful lens.
             | 
             | As block devices aren't ACID, their challenges are greater.
             | 
              | But note the following from the MySQL docs: similar
              | problems exist when you have to fail over with uncommitted
              | transactions.
             | 
             | > With semisynchronous replication, if the source crashes
             | and a failover to a replica is carried out, the failed
             | source should not be reused as the replication source, and
             | should be discarded. It could have transactions that were
             | not acknowledged by any replica, which were therefore not
             | committed before the failover.
        
         | Balinares wrote:
         | Nope, this is in fact synchronous. It just degrades gracefully
         | when one of the disaster domains is impacted. The main purpose
         | here AIUI is availability, rather than durability.
         | 
         | For durability, you'll indeed want something like 3-way
         | replication. But that's a distinct problem. If durability is
         | your concern but you're fine with the availability SLOs of a
         | single disaster domain, then you don't need regional
         | replication.
        
           | foota wrote:
           | Is the idea that you'll check replication status after making
            | each write? Or I guess you could do this at the application
            | level: after committing a transaction to, e.g., SQLite,
            | check that at least one other zone is caught up before
            | acking whatever call the user made?
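            | 
            | Something like the following, maybe (hypothetical sketch;
            | get_replica_state() is a stand-in for whatever the cloud's
            | disk-status API actually reports, not a real endpoint):
            | 
            |     def get_replica_state(disk_name):
            |         """Stand-in: imagine it returns 'in-sync',
            |         'catching-up', or 'degraded'."""
            |         return "in-sync"
            | 
            |     def handle_request(db, disk_name, statement):
            |         db.execute(statement)
            |         db.commit()  # durable on the local replica
            |         if get_replica_state(disk_name) != "in-sync":
            |             # Committed locally, but might not survive a
            |             # zonal outage yet.
            |             raise RuntimeError("not replicated; retry/alert")
            |         return "ok"  # safe to ack the caller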
        
             | leg wrote:
             | Many enterprise storage systems have the
             | durability/availability tradeoff like these replicated
             | disks when replicating outside of a single datacenter.
             | (Oracle calls it "max availability": try to synchronously
             | replicate, but if the remote side is offline, allow
             | transactions to commit.) Real world banks run on these
             | sorts of systems.
             | 
             | Users don't continuously check replication status. They
             | rely on it being synchronous almost all the time.
             | 
             | 3 way quorum replication is great, but you then need to
             | send to more data centers, potentially affecting
             | performance. There's a tradeoff.
             | 
             | (I work on GCP storage)
        
           | gregwebs wrote:
           | It's synchronous much of the time but can become asynchronous
           | at any moment? I would still call that asynchronous.
           | 
            | The availability story isn't incredible either. Once the
            | secondary is behind, an outage of the primary means the
            | system is no longer operational until that primary can be
            | restored.
            | 
            | Three-way replication with Raft and 2/3 or 3/5 acks is what
            | modern distributed databases use. It's for both availability
            | and durability.
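            | 
            | The arithmetic behind those quorums, just to illustrate
            | (plain Python):
            | 
            |     def ack_quorum(n):
            |         # A majority of n replicas must ack a write.
            |         return n // 2 + 1
            | 
            |     for n in (3, 5):
            |         print(f"{n} replicas: ack on {ack_quorum(n)}, "
            |               f"tolerate {n - ack_quorum(n)} failures")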
        
       | jiggawatts wrote:
       | Azure has had zone-redundant disks with 3-way synchronous writes
       | since 2021: https://learn.microsoft.com/en-gb/azure/virtual-
       | machines/dis...
       | 
       | The same underlying system is also available for blob storage.
       | 
       | It's strange that GCP and AWS are so far behind on this feature.
        
         | leg wrote:
         | Regional Persistent Disk was in beta in 2019. Usability hiccups
         | and other annoyances meant it only GA'd in 2023, but it's been
         | used under CloudSQL for quite a while.
         | 
         | (I work on GCP storage)
        
           | jiggawatts wrote:
           | To be fair, it's notable that Azure doesn't publish any
           | information about consistency guarantees (or lack thereof).
           | 
           | I did notice in an article about using blob witness for SQL
           | clusters that they're not all interchangeable.
        
       | londons_explore wrote:
        | I really don't know why Google doesn't just let users pick how
        | many copies to keep, and how many writes must be ACK'ed before
        | the VM sees a write as complete.
       | 
       | Then users can decide the cost vs reliability vs durability of
       | data written milliseconds before an outage.
       | 
        | Perhaps give users a web-based calculator where you can put in
        | the numbers and see how much it would cost in $ per GB per day,
        | the mean time to committed data loss (based on historical data)
        | in years/centuries, and the typical increase in write latency
        | (loss of performance) compared to a single replica.
       | 
       | Then the user can decide.
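        | 
        | Roughly the shape such a calculator could take (toy Python; the
        | zone failure rate, rebuild time, and price below are made-up
        | placeholders, not Google's numbers):
        | 
        |     ZONE_FAIL_PER_YEAR = 0.01  # assumed zone loss probability
        |     REBUILD_HOURS = 6.0        # assumed replica rebuild time
        |     PRICE_GB_MONTH = 0.10      # assumed price per replica
        | 
        |     def estimate(replicas, write_quorum):
        |         # Crude model: acked data is lost only if every
        |         # replica that acked fails within one rebuild window,
        |         # assuming independent zone failures.
        |         windows = (24 * 365) / REBUILD_HOURS
        |         p_window = ZONE_FAIL_PER_YEAR / windows
        |         p_loss_year = windows * p_window ** write_quorum
        |         return {
        |             "cost_usd_per_gb_month": PRICE_GB_MONTH * replicas,
        |             "mean_years_to_data_loss": 1 / p_loss_year,
        |         }
        | 
        |     print(estimate(replicas=3, write_quorum=2))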
        
         | kccqzy wrote:
          | If you're a team working at Google internally, then yes, you
          | can pick these parameters. You can even pick between
          | straight-up replication and Reed-Solomon codes. I just don't
         | understand why this is not exposed in their cloud offering.
        
           | foota wrote:
           | This is replication between zones, not within a single zone.
        
             | londons_explore wrote:
              | The form would look like this:
              | 
              |     CREATE PERSISTENT DISK
              |     ----------------------
              |     Replicas:  W
              |     Zones to split replication across:  X
              |     How many replicas must complete a write to allow
              |       the VM to continue:  Y
              |     How many zones must complete a write to allow
              |       the VM to continue:  Z
              | 
              |     With the above settings, you can expect approximately:
              | 
              |     Write latency 5-95%:  0.5-2.5 milliseconds
              |     Mean time to committed data loss:  37 years
              |     Mean time to failure to write:  3 years
              |     Mean time to data loss of data over 1 minute old:
              |       18327 years
              |     Cost:  $0.18/GB/month
             | 
             | The user would set those 4 parameters, with help text for
             | guidance and a big red warning if you set the parameters to
             | something insanely slow/unreliable.
        
               | gtirloni wrote:
                | And the customer support teams would waste countless
                | hours debugging scenarios where the customer had no idea
                | what they were doing. All of that with an SLO breathing
                | down their neck.
        
               | londons_explore wrote:
               | Easy enough to say "You had the replication parameters
               | set to 1 replica, so data loss was expected every 1 month
               | on average, and now some of your data has been lost. You
               | can try to recover data with these opensource tools, or
               | you can delete the disk and start again. We recommend
               | replication of at least 1.15 to get a mean time between
               | data loss of 150 years."
        
           | Balinares wrote:
           | > I just don't understand why this is not exposed in their
           | cloud offering.
           | 
           | Probably because the mental model for the underlying
           | implementation that you are basing this idea on does not
           | accurately reflect the actual reality of the implementation.
           | 
           | As a very broad rule of thumb, whenever you find yourself
           | saying, "Given <thing A>, I don't understand <thing B>," you
           | probably want to re-examine Thing A.
        
       | dangoodmanUT wrote:
        | The problem with these solutions is how non-native they are to
        | workloads. For zonal outages, you have to force-move things over.
        | Even in k8s it doesn't seem like you can automatically fail over
        | StatefulSets to another zone.
        | 
        | It feels better to me to have the replication at the data store
        | level (e.g. the database) rather than try to hide it under APIs
        | that really aren't meant for it.
        
       | time4tea wrote:
        | Memories of EMC Symmetrix SRDF. It was a bit more expensive back
        | then, but it was quite nice.
        
       ___________________________________________________________________
       (page generated 2024-08-31 23:01 UTC)