[HN Gopher] OpenZFS - dRAID, Finally
       ___________________________________________________________________
        
       OpenZFS - dRAID, Finally
        
       Author : throw0101a
       Score  : 142 points
       Date   : 2021-02-18 13:31 UTC (9 hours ago)
        
 (HTM) web link (klarasystems.com)
 (TXT) w3m dump (klarasystems.com)
        
       | oarsinsync wrote:
       | This is primarily aimed at those running very large vdevs
        | made up of many disks. FTA:
       | 
       | > The dRAID offers a solution for large arrays, vdevs with fewer
       | than 20 spindles will have limited benefits from the new option.
       | The performance and resilver result will be similar to RAIDZ for
       | small numbers of spindles.
       | 
       | If you're doing a 6-8 disk array, this probably isn't relevant to
       | your interests.
       | 
       | If you're doing a 20+ disk array, well firstly, damn, you've been
       | lucky if you've avoided any serious issues. Secondly, you really
       | want to look into this ASAP.
        
         | mbreese wrote:
         | I'm incredibly interested in this. I run multiple 60+ disk
         | arrays, but never with enough of a budget for the extra servers
         | required for a better distributed system. Basically these are
         | single large JBODs that store large genomics data. (Cheap and
          | big are the top priorities; speed is a distant third.)
         | 
         | Right now, I use ZFS with multiple raidz2 or raidz3 vdevs with
         | a number of hot spares waiting to take over. Resilvering can
         | take a day for a 4TB disk. I'm terrified how long a 12TB disk
         | will take (hence the move to raidz3). Disk loss is a normal
         | fact of life, and it's something we prepare for. But this would
         | help me sleep better at night.
         | 
          | I'm seriously excited about this; I'm definitely the target
          | audience.
        
           | secabeen wrote:
           | I'm in the same boat, but I don't run hot-spares. If a drive
           | fails, I want a human in the loop before a resilver begins.
           | Ive seen too many communication/cabling errors over the
           | years, and I'd rather try to confirm the root cause before
           | assuming that the problem is a bad disk.
        
         | tutfbhuf wrote:
         | > If you're doing a 20+ disk array ...
         | 
         | Then you might want to consider ceph.
        
           | secabeen wrote:
            | You might, but you might not. I have a 140-disk array that
           | just works. Ceph would require multiple nodes, lots of
           | tuning, maintenance, etc.
        
             | tutfbhuf wrote:
              | But multiple nodes offer you resilience against node
              | failures, not just drive failures. And with Ceph you have
              | the flexibility to add or remove single disks as you like.
              | You also get multiple storage interfaces: not just a
              | filesystem, but also S3 or block devices if you like.
        
       | fps_doug wrote:
       | I felt a bit stupid; I re-read the important parts several times
       | but failed to visualize what that looks like on the disks, and
       | how blocks are reconstructed on disk failure. A little googling
        | led me to a paper by Intel[0] which has a simple visualization
       | of how this works on page 11, in case I'm not the only one.
       | 
       | [0] https://docplayer.net/117362530-Draid-declustered-raid-
       | for-z...
        
         | aaronax wrote:
         | Yeah that is a good link; I especially liked the text on page
         | 7.
         | 
         | I believe Windows Storage Spaces has long used a similar
         | technique[0] too, though it has been a few years since I was
          | looking into it. I wonder if ZFS dRAID will eventually allow
          | full utilization of disks of mixed sizes too; that was always a
         | very attractive feature of Storage Spaces to me.
         | 
         | [0] https://techcommunity.microsoft.com/t5/storage-at-
         | microsoft/...
        
         | robert_foss wrote:
         | Ah, so a distributed redundancy drive instead of a dedicated
         | one as with normal raid.
        
         | knorker wrote:
          | Oh? When I first saw that illustration, I was hugely
          | frustrated by how poor it is.
          | 
          | The numbers are not explained, nor are the rows or columns.
          | And the numbers repeat.
         | 
         | I saw the diagrams first outside of that pdf, though. Still,
         | the way it goes "4, 8, 2, 16, 1, and so on..." is a boatload of
         | "wat?".
        
       | 7e wrote:
       | Doesn't this violate Isilon's patent on the virtual hot spare?
       | https://patents.google.com/patent/US7146524
       | 
       | It expires in nine months, which perhaps explains the timing.
        
         | pinewurst wrote:
         | I don't think so - 3PAR used a very similar paradigm starting
         | in 2000 or so. Plus declustered RAID in general was invented by
          | Garth Gibson (and used by Panasas) before that patent.
        
       | the8472 wrote:
       | > This will saturate the replacement disk with writes while
       | scattering seeks over the rest of the vdev. For 14 wide RAID-Z2
       | vdevs using 12TB spindles, rebuilds can take weeks.
       | 
       | Didn't sequential resilver already solve the read half of the
       | problem? Because that would already cut it down from weeks.
        
         | magicalhippo wrote:
         | When replacing a disk with a hotspare in a regular RAID-Z vdev,
         | the process is limited to the sequential write speed of that
         | single new disk.
         | 
          | The point with dRAID is that the spare capacity is distributed
         | across all the disks in the vdev, hence the name. Thus all
         | disks are read from _and_ written to when  "replacing" a failed
         | disk with the spare capacity.
         | 
         | The combined throughput of the multiple disks is typically much
         | higher than that of the single disk.
         | 
          | In addition, dRAID imposes a fixed stripe width, which is what
          | allows it to also support sequential resilver.
         | 
         | At least that's my understanding.
         | 
          | There have been several presentations of dRAID on the OpenZFS
          | channel; the latest one is here[1]. There's also a writeup
          | here[2]. I'm not sure it's 100% up to date, but it should be
          | close enough for getting the concept.
         | 
         | [1]: https://www.youtube.com/watch?v=jdXOtEF6Fh0
         | 
         | [2]: https://openzfs.github.io/openzfs-
         | docs/Basic%20Concepts/dRAI...
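          | 
          | A rough back-of-the-envelope sketch of the difference (Python;
          | the per-disk speed and vdev width here are assumptions, and it
          | ignores the read side, fragmentation and I/O scheduling):
          | 
          |   # Rebuild time when the bottleneck is one spare disk's write
          |   # speed, vs. spare capacity spread over the whole vdev.
          |   DISK_TB = 12        # capacity of the failed disk (assumed)
          |   DISK_MB_S = 200     # sustained throughput per disk (assumed)
          |   VDEV_WIDTH = 60     # disks in the dRAID vdev (assumed)
          | 
          |   def hours(tb, mb_s):
          |       return tb * 1e6 / mb_s / 3600
          | 
          |   single = hours(DISK_TB, DISK_MB_S)
          |   spread = hours(DISK_TB, DISK_MB_S * (VDEV_WIDTH - 1))
          |   print(f"classic hot spare : {single:5.1f} h")   # ~16.7 h
          |   print(f"distributed spare : {spread:5.1f} h")   # ~ 0.3 h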
        
           | FeepingCreature wrote:
           | That kinda makes no sense though, because the whole
           | replacement disk needs to be written anyway. Presumably
           | before the failure you were using some part of the old disk,
           | so you need to fill up the new disk to the same level before
           | you're in the same state.
        
             | modderation wrote:
             | To clarify, you're redistributing the missing data across
             | all of the other disks in the array simultaneously, rather
             | than to one disk sequentially.
             | 
             | Let's say you've got an array with 10 disks and 100TB of
             | data. You experience a single-disk failure.
             | 
             | If you're running a RaidZ, you have to write the lost data
             | (100TB / 10 disks) to the single replacement disk. Writing
             | 10TB of data to a single disk takes quite a while,
             | especially on inexpensive SATA disks.
             | 
              | If you were running a dRAID with equivalent redundancy
              | settings, you have to write that same (100 / 10)TB across
              | the 9 remaining disks in the array. Each disk will have to
              | write about 1.1 TB of data, in parallel.
              | 
              | Writing ~1.1TB to each disk should be roughly 5-10x faster
              | than writing 10TB to a single disk.
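              | 
              | Written out as code (Python), with the same toy numbers:
              | 
              |   disks, pool_tb = 10, 100
              |   lost_tb = pool_tb / disks        # 10 TB on the dead disk
              | 
              |   raidz_write = lost_tb            # all of it to one disk
              |   draid_write = lost_tb / (disks - 1)   # ~1.1 TB per disk
              | 
              |   print(f"RAID-Z: {raidz_write:.1f} TB to 1 disk")
              |   print(f"dRAID : {draid_write:.1f} TB to each of "
              |         f"{disks - 1} disks, in parallel")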
        
             | mbreese wrote:
             | I'm thinking of this in terms of hot spares, not cold
             | spares.
             | 
             | If you insert a cold spare to replace a disk -- yes, you
             | have to write the full disk.
             | 
             | If you migrate a disk from a hot spare to an active disk,
             | it's already largely populated, so you only have to write
             | the missing chunk.
             | 
             | At least, that's my limited understanding/reading. I'm not
             | sure I fully grok what is happening as there aren't
             | necessarily dedicated spares in this setup.
             | 
             | But I think the difference in your scenario is hot vs cold
             | spares.
        
             | rincebrain wrote:
             | The primary benefit of dRAID is when you're doing a replace
             | with a virtual "distributed hot spare", where it uses spare
             | capacity distributed across all the disks in the vdev to
             | replace a failed disk, thereby letting the rebuild scale
             | with the total number of disks in the vdev.
             | 
             | When you do a rebuild with a real, distinct physical disk,
             | there's no special benefit, you're still almost certainly
             | going to bottleneck on the write speed of that single disk.
        
               | jsmith45 wrote:
               | While I can see the hot-spare advantage here, I do have
               | to wonder about the impact of replacing the broken drive.
               | 
                | I suppose it works out as follows. With traditional raid,
                | you are rebuilding while degraded from a safety
                | perspective. So obviously, if your configuration
                | guarantees that losing `n` drives while healthy does not
                | cause data loss, you could lose data if `n` more drives
                | failed during the rebuild.
               | 
                | With the distributed hot spare approach, you are no
                | longer degraded from a safety perspective as soon as the
                | distributed spare's data is written. This is faster than
                | traditional raid, as the writes are distributed among
                | many disks, so you get out of the safety-degraded state
                | sooner. (It does seem like it might be riskier though,
                | since you are doing heavy writes to many drives while
                | safety degraded, instead of only writes to the spare.)
               | 
               | So now you replace the failed drive, and the array
               | rebuilds to make this new spare's capacity distributed.
               | This will take a long time, since you will be writing to
               | basically this whole drive as part of distributing its
               | spare space. (Plus probably shuffling chunks around
               | between many other drives). This could possibly take even
               | longer than traditional raid rebuild. However during this
               | whole process, your array is not in degraded state from a
               | data safety point of view.
               | 
                | Or to summarize: during the long, single-drive-write-
                | limited part of replacing the failed drive with a new
                | physical disk, you are not degraded from a data-safety
                | point of view.
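                | 
                | A toy way to put numbers on it (Python; the hour figures
                | are made up purely to illustrate the trade-off):
                | 
                |   # Phase 1: restore redundancy.  Phase 2: return to a
                |   # pristine layout once the replacement disk is in.
                |   spare_fill_h = 2     # dRAID: writes spread over vdev
                |   disk_fill_h = 20     # limited by one disk's writes
                | 
                |   trad_exposed, trad_total = disk_fill_h, disk_fill_h
                |   draid_exposed = spare_fill_h
                |   draid_total = spare_fill_h + disk_fill_h
                | 
                |   print(trad_exposed, trad_total)    # 20 h exposed, 20 h
                |   print(draid_exposed, draid_total)  # 2 h exposed, 22 h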
        
               | mbreese wrote:
               | _> This will take a long time, since you will be writing
               | to basically this whole drive as part of distributing its
               | spare space. (Plus probably shuffling chunks around
               | between many other drives)_
               | 
               | Yes. With the caveat that you'll be *reading* from all of
               | the disks to satisfy those writes. In a traditional
               | setup, you'd have multiple striped raidz{,2,3} vdevs.
               | When you swap in a spare, you're only pulling reads from
                | the disks in the same vdev. This will drive the
                | performance of that vdev into the ground. In this dRAID
                | setup, you'll be pulling from all of the disks, so the
                | entire pool is stressed at the same rate.
               | 
                | The way I'm starting to think about it is that each
                | drive has some amount of spare capacity (buckets), but is
                | otherwise fully in use. When a drive goes out, the data
                | from that drive is redistributed across the array into
                | the already-in-place spare capacity. So, yes, once that
                | data has been rebuilt, the pool is no longer degraded.
                | But, I think the point being made here is that you'll
                | still want to replace the failed drive, and *that*
                | replacement drive will still be fully written.
        
               | bbarnett wrote:
                | Not sure I like this; many of the spinning-disk raids of
                | that scale that I have are mostly read, rarely written.
               | 
               | With raid 6 for example, I pop in a new drive, and all
               | the write/repair hits only that drive.
               | 
               | With a virtual spare, all drives end up with massive
               | writes, meaning I might hit write errors, thus borking a
               | raid which would have been ok for reads.
               | 
               | I realise there are 100 'what ifs' and scenarios here,
                | but when my raids drop a drive, it tends to be during
                | heavy write ops, not reads. I'd say 90% of the time.
               | 
               | I imagine ssd cascading/eol write failures might make
               | this virtual spare failure scenario even more likely.
               | 
               | (yes I am aware of sector reallocation delays on spinning
               | disks, I am referring to real EOL scenarios on drives
               | during writes, being far more common for me than reads)
        
           | the8472 wrote:
            | What I was questioning is the claim that it takes weeks:
            | the write speed of a single disk is enough to fill it in a
            | day or two with sequential IO. I.e. sequential resilver
            | alone should be enough to fix the "can take weeks" problem.
           | 
           | Additional speed on top of that is nice of course, if you're
           | using such wide pools.
        
             | e12e wrote:
              | I get that using hot virtual spares is much better, but I
              | also wonder at a week to write 12TB. 12TB at 200 MB/s is
              | roughly 20 hours.
              | 
              | Is it the fact that you get 12TB of essentially non-
              | sequential IO mixed with random reads during the rebuild?
              | Still, 20 MB/s seems 1990s slow for writing to a mechanical
              | HD?
              | 
              | Edit: ok, ok, from TFA:
              | 
              | > For 14 wide RAID-Z2 vdevs using 12TB spindles, rebuilds
              | can take weeks. Resilver I/O activity is deprioritized when
              | the system has not been idle for a minimum period. Full
              | zpools get fragmented and require additional I/O's to
              | recalculate data during resilvering.
              | 
              | Obviously writing 12 TB at 0 MB/s will take time..
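              | 
              | Putting rough numbers on that (Python; the throughput
              | figures are just assumptions):
              | 
              |   def rebuild_hours(tb, mb_per_s):
              |       return tb * 1e6 / mb_per_s / 3600
              | 
              |   # healthy sequential vs. deprioritized/fragmented I/O
              |   for speed in (200, 50, 20):
              |       h = rebuild_hours(12, speed)
              |       print(f"12 TB at {speed:3d} MB/s -> "
              |             f"{h:6.1f} h ({h / 24:.1f} days)")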
        
             | magicalhippo wrote:
              | With dRAID you'd cut down the time to an hour or so though.
             | 
             | Not sure how much that matters in practice, I've just got a
             | home NAS with 6 disks to play with.
        
       | KaiserPro wrote:
       | Oooo this is exciting. One of the annoying things about
       | "traditional" raid is that making a raid group gives you
        | performance, but it also kneecaps your rebuild speed.
        | 
        | This is because you need to do a coordinated read from each disk
        | in the raid, stripe by stripe; a rebuild is effectively a single-
        | threaded synchronous operation.
       | 
        | We don't really need to have such a rigid mapping of data to
        | raid groups anymore, which is why most large storage systems
        | partition your data up into small chunks (a few megs), apply
        | some forward error correction to it, and smear it all over the
        | place.
       | 
        | This means that you can do multi-threaded rebuilds, so as your
        | array gets bigger, not only does the rebuild time get smaller,
        | but it's much more resilient to data loss.
       | 
       | A good example of this is GPFS's native raid:
       | https://www.usenix.org/legacy/events/lisa11/tech/slides/deen...
       | (https://www.youtube.com/watch?v=2g5rx4gP6yU)
       | 
        | A word of warning though: don't do this for small arrays (<50
        | disks); it's not worth the hassle, yet.
        
       | syoc wrote:
        | dRAID looks super cool; I was not familiar with it. Sadly, it
        | looks like, according to 3 minutes of googling, you will not be
        | able to expand ZFS dRAID vdevs.
        
       | hinkley wrote:
       | After reading about Reed Solomon codes I was a bit surprised to
        | learn that the parity disk was a literal rather than a
        | figurative thing in RAID5 and RAID6.
       | 
        | After diving into consistent hashing I was surprised that, in
        | pretty much any drive array solution I could find architecture
        | docs for, all blocks are striped across all disks instead of
        | across redundancy+1, 2, or 3 disks. Seems like bigger blocks on
        | fewer disks would have better utilization numbers. You could be
        | doing 2 unrelated reads at the same time from different spindles
        | and tracks.
       | 
       | With a dead drive, you could do nothing and wait for a new drive,
       | or start making backup copies across the rest of the array as if
       | the array were now n-1 disks, or some of both, like this solution
       | seems to do.
        
       | cbsmith wrote:
       | Curiously, Facebook's "Australian filter" is blocking posting of
       | this article.
        
       | Naac wrote:
       | Aren't the parity blocks distributed across the disks in raidz2?
       | So resilvering/rebuilding already takes advantage of reading from
        | all the disks? If so, what's the difference between that and dRAID?
        
         | diegocg wrote:
         | It's not about parity
         | 
          | The essence of dRAID is that, instead of keeping a spare drive
          | unused in case one of the working drives fails, it incorporates
          | the spare drive into the array and uses it, but one drive's
          | worth of free space is reserved across the entire array.
         | 
         | That way, if one disk fails, the reserved space is used to
         | write the data necessary to keep the array consistent. Because
         | the free space is distributed randomly across the array, the
         | write performance of a single drive doesn't become a
         | bottleneck.
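          | 
          | A tiny toy sketch of that idea (Python; round-robin placement
          | here is just for illustration, the real on-disk layout uses
          | precomputed permutation maps):
          | 
          |   import itertools
          | 
          |   DISKS, ROWS = 8, 16                 # toy vdev geometry
          |   layout = {d: {"data": [f"d{d}r{r}" for r in range(ROWS)],
          |                 "spare": []}          # reserved rows, empty now
          |             for d in range(DISKS)}
          | 
          |   def rebuild(failed):
          |       # Scatter the failed disk's rows into the reserved spare
          |       # space of the survivors, so every disk shares the writes.
          |       survivors = itertools.cycle(
          |           d for d in layout if d != failed)
          |       for row in layout[failed]["data"]:
          |           layout[next(survivors)]["spare"].append(row)
          | 
          |   rebuild(failed=3)
          |   print({d: len(v["spare"])
          |          for d, v in layout.items() if d != 3})
          |   # -> every surviving disk takes 2-3 rebuild writes instead of
          |   #    one replacement disk taking all 16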
        
           | louwrentius wrote:
           | This is exactly how most 'old-school' enterprise storage
           | solutions work.
        
           | Naac wrote:
           | So if my understanding is correct this is orthogonal to the
           | raidz type one is using?
           | 
           | For example I can have a hot spare with a dRAID setup as part
           | of my raidz2 pool?
        
       | gbrown_ wrote:
       | It's going to be so good to finally have an open source option
        | for large storage building blocks. It's of particular interest to
        | those of us in HPC: if you are building a large parallel file
        | system and you don't want rebuild times to suck, your options are
        | basically one of the following.
       | 
       | - IBM's GPFS
       | 
       | - Panasas' PanFS
       | 
       | - Xyratex's -> Seagate's -> Cray's -> HPE's GridRAID in
        | Clusterstor/Sonexion products
       | 
       | Though ZFS and dRAID will likely show up in that last one in the
       | future.
        
         | [deleted]
        
         | throw0101a wrote:
         | Dell-EMC Isilon OneFS
        
           | iforgotpassword wrote:
           | It's what we're mainly using currently. Apart from a serious
           | bug after an update that kept crashing nfsd on nodes about
           | two years ago, it has been a very pleasant experience
           | overall. We also use some netapp, but it's a relatively small
           | setup. We tried ceph before, and it seemed to support two
           | modes of operation: Completely broken or rebalancing and thus
           | very slow.
        
         | fh973 wrote:
         | Quobyte DCFS does erasure coding of files across machines (no
         | local RAID needed).
        
         | p_l wrote:
         | Lustre+ZFS seems to have its fans, and iirc ZFSonLinux was
          | heavily funded by the HPC use case.
        
           | gbrown_ wrote:
           | Yup in particular Brian Behlendorf from Lawrence Livermore
           | National Laboratory has been a major contributor to
           | ZoL/OpenZFS.
        
           | KaiserPro wrote:
           | It makes sense for lustre, as you can remove the hardware
           | raid controller and just use "dumb" jbods.
           | 
            | As lustre is mostly a network raid0 (it's got better now) it
            | makes sense, as you really shouldn't be using it for
            | unreconstructable data. (Gross simplification alert.)
        
       | matheusmoreira wrote:
       | Is this feature related to the proposed RAID-Z expansion?
       | 
       | https://github.com/openzfs/zfs/pull/8853
       | 
       | It's mentioned in the pull request comments. Maybe it will help?
        
         | rincebrain wrote:
         | Not...really? I think they reused some overlapping work
         | involving more flexible data structures, but that's all, as far
         | as I know.
         | 
         | (cf. slide 11 of
         | https://docs.google.com/presentation/d/1uo0nBfY84HIhEqGWEx-T...
         | )
        
       ___________________________________________________________________
       (page generated 2021-02-18 23:01 UTC)