[HN Gopher] OpenZFS - dRAID, Finally
___________________________________________________________________
OpenZFS - dRAID, Finally
Author : throw0101a
Score : 142 points
Date : 2021-02-18 13:31 UTC (9 hours ago)
(HTM) web link (klarasystems.com)
(TXT) w3m dump (klarasystems.com)
| oarsinsync wrote:
| This is primarily aimed at those running very large vdevs
| comprised of lots of disks. FTA:
|
| > The dRAID offers a solution for large arrays, vdevs with fewer
| than 20 spindles will have limited benefits from the new option.
| The performance and resilver result will be similar to RAIDZ for
| small numbers of spindles.
|
| If you're doing a 6-8 disk array, this probably isn't relevant to
| your interests.
|
| If you're doing a 20+ disk array, well firstly, damn, you've been
| lucky if you've avoided any serious issues. Secondly, you really
| want to look into this ASAP.
| mbreese wrote:
| I'm incredibly interested in this. I run multiple 60+ disk
| arrays, but never with enough of a budget for the extra servers
| required for a better distributed system. Basically these are
| single large JBODs that store large genomics data. (Cheap and
| big are the only priorities, speed is a distant third).
|
| Right now, I use ZFS with multiple raidz2 or raidz3 vdevs with
| a number of hot spares waiting to take over. Resilvering can
| take a day for a 4TB disk. I'm terrified how long a 12TB disk
| will take (hence the move to raidz3). Disk loss is a normal
| fact of life, and it's something we prepare for. But this would
| help me sleep better at night.
|
| I'm seriously excited about this, and I'm definitely the
| target audience.
| secabeen wrote:
| I'm in the same boat, but I don't run hot-spares. If a drive
| fails, I want a human in the loop before a resilver begins.
| I've seen too many communication/cabling errors over the
| years, and I'd rather try to confirm the root cause before
| assuming that the problem is a bad disk.
| tutfbhuf wrote:
| > If you're doing a 20+ disk array ...
|
| Then you might want to consider Ceph.
| secabeen wrote:
| You might, but you might not. I have a 140-disk array that
| just works. Ceph would require multiple nodes, lots of
| tuning, maintenance, etc.
| tutfbhuf wrote:
| But multiple nodes offer you resilience against node
| failures, not just drive failures. And with Ceph you have
| the flexibility to add or remove single disks as you like.
| And you have multiple storage interfaces, not just a
| filesystem but also S3 or block devices if you like.
| fps_doug wrote:
| I felt a bit stupid; I re-read the important parts several times
| but failed to visualize what that looks like on the disks, and
| how blocks are reconstructed on disk failure. A little googling
| led me to a paper by Intel[0] which has a simple visualization
| of how this works on page 11, in case I'm not the only one.
|
| [0] https://docplayer.net/117362530-Draid-declustered-raid-
| for-z...
| aaronax wrote:
| Yeah that is a good link; I especially liked the text on page
| 7.
|
| I believe Windows Storage Spaces has long used a similar
| technique[0] too, though it has been a few years since I was
| looking into it. I wonder if ZFS dRAID will eventually allow
| full utilization of disks of mixed sizes too; that was always a
| very attractive feature of Storage Spaces to me.
|
| [0] https://techcommunity.microsoft.com/t5/storage-at-
| microsoft/...
| robert_foss wrote:
| Ah, so a distributed redundancy drive instead of a dedicated
| one as with normal raid.
| knorker wrote:
| Oh? When I first saw that illustration, I was hugely frustrated
| by how poor it is.
|
| The numbers are not explained, nor are the rows or columns.
| And the numbers repeat.
|
| I saw the diagrams first outside of that pdf, though. Still,
| the way it goes "4, 8, 2, 16, 1, and so on..." is a boatload of
| "wat?".
| 7e wrote:
| Doesn't this violate Isilon's patent on the virtual hot spare?
| https://patents.google.com/patent/US7146524
|
| It expires in nine months, which perhaps explains the timing.
| pinewurst wrote:
| I don't think so - 3PAR used a very similar paradigm starting
| in 2000 or so. Plus declustered RAID in general was invented by
| Garth Gibson (and used by Panasas) before that patent.
| the8472 wrote:
| > This will saturate the replacement disk with writes while
| scattering seeks over the rest of the vdev. For 14 wide RAID-Z2
| vdevs using 12TB spindles, rebuilds can take weeks.
|
| Didn't sequential resilver already solve the read half of the
| problem? Because that would already cut it down from weeks.
| magicalhippo wrote:
| When replacing a disk with a hotspare in a regular RAID-Z vdev,
| the process is limited to the sequential write speed of that
| single new disk.
|
| The point with DRAID is that the spare capacity is distributed
| across all the disks in the vdev, hence the name. Thus all
| disks are read from _and_ written to when "replacing" a failed
| disk with the spare capacity.
|
| The combined throughput of the multiple disks is typically much
| higher than that of the single disk.
|
| In addition, dRAID imposes an allocation constraint that
| allows it to also support sequential resilver.
|
| At least that's my understanding.
|
| There have been several presentations of dRAID on the OpenZFS
| channel; the latest one is here[1]. There's also a writeup
| here[2]. I'm not sure if it's 100% up to date, but it should
| be close enough for getting the concept.
|
| [1]: https://www.youtube.com/watch?v=jdXOtEF6Fh0
|
| [2]: https://openzfs.github.io/openzfs-
| docs/Basic%20Concepts/dRAI...
| FeepingCreature wrote:
| That kinda makes no sense though, because the whole
| replacement disk needs to be written anyway. Presumably
| before the failure you were using some part of the old disk,
| so you need to fill up the new disk to the same level before
| you're in the same state.
| modderation wrote:
| To clarify, you're redistributing the missing data across
| all of the other disks in the array simultaneously, rather
| than to one disk sequentially.
|
| Let's say you've got an array with 10 disks and 100TB of
| data. You experience a single-disk failure.
|
| If you're running a RaidZ, you have to write the lost data
| (100TB / 10 disks) to the single replacement disk. Writing
| 10TB of data to a single disk takes quite a while,
| especially on inexpensive SATA disks.
|
| If you were running a dRAID with equivalent redundancy
| settings, you have to write that (100 / 10)TB to the 9
| remaining disks in the array. Each disk will have to write
| about 1.1TB of data in parallel.
|
| Writing ~1.1TB to each disk should be roughly 5-10x faster
| than writing 10TB to a single disk.
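|
| A rough back-of-the-envelope sketch of that arithmetic in
| Python, assuming a hypothetical 200MB/s of sustained write
| throughput per disk:
|
|     # Illustrative rebuild-time estimate (made-up speeds).
|     DISKS = 10           # disks in the array
|     DATA_TB = 100        # total data stored
|     WRITE_MBPS = 200     # assumed sustained write speed per disk
|
|     def hours(tb, mbps):
|         return tb * 1e6 / mbps / 3600
|
|     lost_tb = DATA_TB / DISKS   # data that lived on the failed disk
|
|     # RAID-Z + physical spare: everything funnels onto one new disk.
|     raidz_h = hours(lost_tb, WRITE_MBPS)
|     # dRAID distributed spare: rewritten across the 9 survivors.
|     draid_h = hours(lost_tb / (DISKS - 1), WRITE_MBPS)
|
|     print(f"RAID-Z spare rebuild: ~{raidz_h:.0f} h")   # ~14 h
|     print(f"dRAID spare rebuild:  ~{draid_h:.1f} h")   # ~1.5 h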
| mbreese wrote:
| I'm thinking of this in terms of hot spares, not cold
| spares.
|
| If you insert a cold spare to replace a disk -- yes, you
| have to write the full disk.
|
| If you migrate a disk from a hot spare to an active disk,
| it's already largely populated, so you only have to write
| the missing chunk.
|
| At least, that's my limited understanding/reading. I'm not
| sure I fully grok what is happening as there aren't
| necessarily dedicated spares in this setup.
|
| But I think the difference in your scenario is hot vs cold
| spares.
| rincebrain wrote:
| The primary benefit of dRAID is when you're doing a replace
| with a virtual "distributed hot spare", where it uses spare
| capacity distributed across all the disks in the vdev to
| replace a failed disk, thereby letting the rebuild scale
| with the total number of disks in the vdev.
|
| When you do a rebuild with a real, distinct physical disk,
| there's no special benefit, you're still almost certainly
| going to bottleneck on the write speed of that single disk.
| jsmith45 wrote:
| While I can see the hot-spare advantage here, I do have
| to wonder about the impact of replacing the broken drive.
|
| I suppose it works out as follows. With traditional RAID,
| you are rebuilding while degraded from a safety
| perspective. So obviously, if your configuration
| guarantees that losing `n` drives while healthy does not
| cause data loss, you could lose data during the rebuild if
| another `n` drives failed.
|
| With the distributed hot spare approach, you are no
| longer degraded from a safety perspective as soon as the
| distributed spare's data is written. This is faster than
| traditional raid, as the writes are distributed among
| many disks. So time-wise you get out of data safety
| degraded state faster. (It does seem like it might be
| riskier though, since you are doing heavy writes to many
| drives while safety degraded, instead of only writes to
| the spare).
|
| So now you replace the failed drive, and the array
| rebuilds to make this new spare's capacity distributed.
| This will take a long time, since you will be writing to
| basically this whole drive as part of distributing its
| spare space. (Plus probably shuffling chunks around
| between many other drives). This could possibly take even
| longer than traditional raid rebuild. However during this
| whole process, your array is not in degraded state from a
| data safety point of view.
|
| Or to summarize: during the long single-drive-write-
| limited part of replacing the failed drive with a new hot
| spare, you are not degraded from a safety of data point
| of view.
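|
| A sketch of that trade-off with made-up numbers (a
| hypothetical 200MB/s per disk, 10TB on the failed drive, 9
| surviving disks):
|
|     # Hours at reduced redundancy vs. not (illustrative figures).
|     LOST_TB, SURVIVORS = 10, 9
|
|     def hours(tb, mbps=200):
|         return tb * 1e6 / mbps / 3600
|
|     # Traditional: degraded until the physical spare is written.
|     print(f"degraded, traditional: ~{hours(LOST_TB):.0f} h")
|     # dRAID: degraded only until the distributed spare fills up.
|     print(f"degraded, dRAID: ~{hours(LOST_TB / SURVIVORS):.1f} h")
|     # Rebalancing onto the replacement disk later is still a
|     # long, single-disk-limited write, but at full redundancy.
|     print(f"dRAID rebalance, not degraded: ~{hours(LOST_TB):.0f} h")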
| mbreese wrote:
| _> This will take a long time, since you will be writing
| to basically this whole drive as part of distributing its
| spare space. (Plus probably shuffling chunks around
| between many other drives)_
|
| Yes. With the caveat that you'll be *reading* from all of
| the disks to satisfy those writes. In a traditional
| setup, you'd have multiple striped raidz{,2,3} vdevs.
| When you swap in a spare, you're only pulling reads from
| the disks in the same vdev. This will drive the
| performance of this vdev into the ground. In this dRAID
| setup, you'll be pulling from all of the disks, so the
| entire pool is stressed at the same rate.
|
| I think the way I'm starting to think about it is that
| each drive has X amount of spare capacity or buckets. But
| they are otherwise used. When a drive goes out, the
| primary data from that drive is distributed across the
| array to the already-in-place spare capacity. So, yes,
| after the primary data is mirrored out, the pool is no
| longer degraded. But, I think the point being made here
| is that you'll still want to replace the drive, which
| will still cause *that* drive to be fully written.
| bbarnett wrote:
| Not sure I like this; many spinning disk raids of that
| scale that I have are mostly read, rarely written.
|
| With raid 6 for example, I pop in a new drive, and all
| the write/repair hits only that drive.
|
| With a virtual spare, all drives end up with massive
| writes, meaning I might hit write errors, thus borking a
| raid which would have been ok for reads.
|
| I realise there are 100 'what ifs' and scenarios here,
| but when my raids tend to drop a drive, it's on heavy
| write ops, not reads. I'd say 90% of the time.
|
| I imagine ssd cascading/eol write failures might make
| this virtual spare failure scenario even more likely.
|
| (yes, I am aware of sector reallocation delays on spinning
| disks; I am referring to real EOL scenarios on drives
| during writes, which are far more common for me than
| during reads)
| the8472 wrote:
| What I was questioning was that it shouldn't be taking weeks
| because the write speed of a single disk is enough to fill it
| in a day or two with sequential IO. I.e. the sequential
| resilver alone should be enough to fix the "can take weeks"
| problem.
|
| Additional speed on top of that is nice of course, if you're
| using such wide pools.
| e12e wrote:
| I get that using hot virtual spares is much better, I also
| wonder at a week to write 12tb. 12tb at 200 mb/s is roughly
| 20 hours.
|
| Is it the fact that you get 12tb of essentially non
| sequencial io mixed with random reads during the rebuild?
| Still 20 mb/s seems 1990s slow writing tio a mechanical HD?
|
| Ed: ok, ok, from tfa:
|
| > For 14 wide RAID-Z2 vdevs using 12TB spindles, rebuilds
| can take weeks. Resilver I/O activity is deprioritized when
| the system has not been idle for a minimum period. Full
| zpools get fragmented and require additional I/O's to
| recalculate data during resilvering.
|
| Obviously writing 12TB at 0MB/s will take time...
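|
| For what it's worth, the same arithmetic with a few assumed
| effective throughputs shows how far a fragmented, deprioritized
| resilver can stretch:
|
|     # Rebuild time for one 12TB disk at assumed effective speeds.
|     for mbps in (200, 50, 20):   # sequential / fragmented / starved
|         days = 12 * 1e6 / mbps / 3600 / 24
|         print(f"{mbps:>3} MB/s -> {days:.1f} days")
|     # 200 MB/s -> 0.7 days; 50 MB/s -> 2.8 days; 20 MB/s -> 6.9 days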
| magicalhippo wrote:
| With DRAID you'd cut down the time to an hour or so though.
|
| Not sure how much that matters in practice, I've just got a
| home NAS with 6 disks to play with.
| KaiserPro wrote:
| Oooo this is exciting. One of the annoying things about
| "traditional" raid is that making a raid group gives you
| performance, but it also kneecaps your rebuild time.
|
| This is because you need to do a coordinated read from each disk
| in the RAID, stripe by stripe. A rebuild is effectively a single-
| threaded synchronous operation.
|
| We don't really need to have such a rigid mapping of data to RAID
| groups anymore, which is why most large storage systems partition
| your data up into small chunks (a few megs), apply some forward
| error correction to it, and smear it all over the place.
|
| This means that you can do multi-threaded rebuilds, and that as
| your array gets bigger, not only does the rebuild time get
| smaller, but it's also much more resilient to data loss.
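|
| A toy sketch of that idea (not the actual dRAID or GPFS layout):
| each redundancy group lands on a pseudo-random subset of all the
| disks, so losing one disk means a little rebuild work spread over
| every other disk rather than a lot of work for one small group:
|
|     import random
|
|     DISKS, GROUP, STRIPES = 50, 10, 10_000   # e.g. 8 data + 2 parity
|     rng = random.Random(0)
|     layout = [rng.sample(range(DISKS), GROUP) for _ in range(STRIPES)]
|
|     # If disk 0 dies, how many survivors take part in the rebuild?
|     hit = [g for g in layout if 0 in g]
|     helpers = {d for g in hit for d in g if d != 0}
|     print(len(hit), "stripes touched, rebuilt by",
|           len(helpers), "disks in parallel")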
|
| A good example of this is GPFS's native raid:
| https://www.usenix.org/legacy/events/lisa11/tech/slides/deen...
| (https://www.youtube.com/watch?v=2g5rx4gP6yU)
|
| A word of warning though: don't do this for small arrays (<50
| disks); it's not worth the hassle, yet.
| syoc wrote:
| dRAID looks super cool; I was not familiar with it. It sadly
| looks like, after three minutes of googling, you will not be able
| to expand ZFS dRAID vdevs.
| hinkley wrote:
| After reading about Reed-Solomon codes I was a bit surprised to
| learn that a parity disk was a literal rather than a figurative
| thing in RAID5 and RAID6.
|
| After diving into consistent hashing I was surprised that all
| blocks are striped across all disks instead of redundancy+1,2,3
| disks, in pretty much any drive array solution I could find
| architecture docs for. Seems like bigger blocks on fewer disks
| would have better utilization numbers. You could be doing 2
| unrelated reads at the same time from different spindles and
| tracks.
|
| With a dead drive, you could do nothing and wait for a new drive,
| or start making backup copies across the rest of the array as if
| the array were now n-1 disks, or some of both, like this solution
| seems to do.
| cbsmith wrote:
| Curiously, Facebook's "Australian filter" is blocking posting of
| this article.
| Naac wrote:
| Aren't the parity blocks distributed across the disks in raidz2?
| So resilvering/rebuilding already takes advantage of reading from
| all the disks? If so, what's the difference between that and
| dRAID?
| diegocg wrote:
| It's not about parity.
|
| The essence of dRAID is that, instead of keeping a spare drive
| unused in case one of the working drives fails, it incorporates
| the spare drive into the array and uses it, but one drive's
| worth of free space is reserved randomly across the entire
| array.
|
| That way, if one disk fails, the reserved space is used to
| write the data necessary to keep the array consistent. Because
| the free space is distributed randomly across the array, the
| write performance of a single drive doesn't become a
| bottleneck.
| louwrentius wrote:
| This is exactly how most 'old-school' enterprise storage
| solutions work.
| Naac wrote:
| So if my understanding is correct this is orthogonal to the
| raidz type one is using?
|
| For example I can have a hot spare with a dRAID setup as part
| of my raidz2 pool?
| gbrown_ wrote:
| It's going to be so good to finally have an open source option
| for large storage building blocks. It's of particular interest to
| those of us in HPC: if you are building a large parallel file
| system and you don't want rebuild times to suck, your options are
| basically one of the following.
|
| - IBM's GPFS
|
| - Panasas' PanFS
|
| - Xyratex's -> Seagate's -> Cray's -> HPE's GridRAID in
| Clusterstor/Sonexion products
|
| Though ZFS and dRAID will likely show up in that last one in the
| future.
| [deleted]
| throw0101a wrote:
| Dell-EMC Isilon OneFS
| iforgotpassword wrote:
| It's what we're mainly using currently. Apart from a serious
| bug after an update that kept crashing nfsd on nodes about
| two years ago, it has been a very pleasant experience
| overall. We also use some netapp, but it's a relatively small
| setup. We tried Ceph before, and it seemed to support two
| modes of operation: completely broken, or rebalancing and
| thus very slow.
| fh973 wrote:
| Quobyte DCFS does erasure coding of files across machines (no
| local RAID needed).
| p_l wrote:
| Lustre+ZFS seems to have its fans, and IIRC ZFSonLinux was
| heavily funded by the HPC use case.
| gbrown_ wrote:
| Yup in particular Brian Behlendorf from Lawrence Livermore
| National Laboratory has been a major contributor to
| ZoL/OpenZFS.
| KaiserPro wrote:
| It makes sense for Lustre, as you can remove the hardware
| RAID controller and just use "dumb" JBODs.
|
| As Lustre is mostly a network RAID0 (it's got better now), it
| makes sense as you really shouldn't be using it for
| unreconstructable data. (Gross simplification alert.)
| matheusmoreira wrote:
| Is this feature related to the proposed RAID-Z expansion?
|
| https://github.com/openzfs/zfs/pull/8853
|
| It's mentioned in the pull request comments. Maybe it will help?
| rincebrain wrote:
| Not...really? I think they reused some overlapping work
| involving more flexible data structures, but that's all, as far
| as I know.
|
| (cf. slide 11 of
| https://docs.google.com/presentation/d/1uo0nBfY84HIhEqGWEx-T...
| )
___________________________________________________________________
(page generated 2021-02-18 23:01 UTC)