[HN Gopher] OpenZFS - add disks to existing RAIDZ
___________________________________________________________________
OpenZFS - add disks to existing RAIDZ
Author : shrubble
Score : 158 points
Date : 2023-08-19 16:35 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| atmosx wrote:
| Finally :-)
| lyjia wrote:
| Wow, I've been hearing this has been in the works for a while. I
| am glad to see it released! My RAIDZ array awaits new disks!
| cassianoleal wrote:
| What do you mean by released? It hasn't even been merged yet.
| :)
| znpy wrote:
| It bothers me so much that ZFS is not in mainline Linux. I know
| it's due to the license incompatibility... :(
| Nextgrid wrote:
| Has the whole license incompatibility thing actually been
| tested/litigated in court? I heard Canonical has (at least at
| some point) shipped prebuilt ZFS. I understand that in-tree
| inclusion brings its own set of problems, but I'm just asking
| about redistribution of binaries - the same binaries you are
| allowed to build locally.
|
| It would be nice to have a precedent deciding on this bullshit
| argument once and for all so distros can freely ship prebuilt
| binary modules and bring Linux into modern times when it
| comes to filesystems.
|
| The whole situation is ridiculous. I'd understand if this was
| about money, but who exactly gets hurt by users getting a
| prebuilt module from somewhere, vs building exactly the same
| thing locally from freely-available source?
| 2OEH8eoCRo0 wrote:
| Because it will be hella fun backing ZFS out of the kernel
| source once it's been there for a few years.
| rincebrain wrote:
| I don't think Linux would like to ship a mass of code that
| size that's not GPLed, even if court cases say CDDL is GPL-
| compatible.
| jonhohle wrote:
| It isn't compatible because it adds additional restrictions
| regarding patented code. The CDDL was made to be used in
| mixed license distributions and only affects individual
| files, but the GPL taints anything linked (which is why
| LGPL exists). Since the terms of the CDDL can't be
| respected in a GPL'd distribution, I can't see a way for it
| to ever be included in the kernel repo.
|
| I don't think there's any issue with Canonical shipping a
| kmod, but similar to 3rd-party binary drivers, it would need
| to be treated as a different "work".
| askiiart wrote:
| I can confirm that Ubuntu ships prebuilt ZFS; I used a live
| Ubuntu USB to copy some data off a ZFS pool just a couple of
| weeks ago.
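|
| (A rough sketch of what that looks like from a live session,
| assuming a pool named "tank"; the pool name and mountpoint are
| illustrative:)
|
|   # install the ZFS userland tools (the live image may already
|   # have the kernel module)
|   sudo apt install zfsutils-linux
|   # force-import the pool read-only under /mnt, since it was
|   # last used by a different host
|   sudo zpool import -f -o readonly=on -R /mnt tank
|   zpool status tank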
| chungy wrote:
| The only entity likely to start a litigation cycle is Oracle, and
| they've either been uninterested or know they can't win.
| Canonical is the only entity they have any chance of going
| after; Canonical's lawyers already decided there was no
| license conflict.
|
| Linus Torvalds doesn't feel like being the guinea pig by
| risking ZFS in the mainline kernel. A totally reasonable
| position while the CDDL+GPL resolution is still ultimately
| unknown. (And honestly, with OpenZFS supporting a very wide
| range of Linux versions and FreeBSD at the same time, I have
| the feeling that mainline inclusion in Linux might not be the
| best outcome anyway.)
| Nextgrid wrote:
| I wonder what Oracle would litigate over though? My
| understanding is that licenses are generally used by
| copyright holders to restrict what others can do with a
| work so the holder can profit off the work and/or keep a
| competitive advantage.
|
| Here I do not see this argument applying since the source
| is freely available to use and extend; the license
| explicitly allows someone to compile it and use it. In this
| case providing prebuilt binaries is more akin to providing
| a "cache" for something you can (and are allowed to) build
| locally (using ZFS-DKMS for example) using source you are
| once again allowed to acquire and use.
|
| What prejudice does it cause to Oracle that the "make"
| command is run on Ubuntu's build servers as opposed to
| users' individual machines? Have similar cases been
| litigated before where the argument was about who runs the
| make command, with source that either party otherwise has a
| right to download & use?
| Dylan16807 wrote:
| Copying and distributing those binaries is covered by
| copyright. You have to follow the license if you want to
| do it. It doesn't matter that end users could legally get
| the same files in some other manner. Distribution itself
| is part of the legal framework.
| bubblethink wrote:
| Not to mention, bcachefs is making progress towards
| mainline.
| gigatexal wrote:
| Same. Maybe one day. Though I don't know what will happen
| first: it gets mainlined into the kernel or we get HL3
| Filligree wrote:
| So long as the kernel developers are actively hostile to
| ZFS...
|
| You will take your Btrfs and you will like it.
| gigatexal wrote:
| I'm holding out for bcachefs and still building ZFS via
| dkms on current kernels like a madman.
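|
| (On Debian/Ubuntu that boils down to something like the
| following; the package names assume Debian/Ubuntu and the
| module is rebuilt automatically on kernel upgrades:)
|
|   # build the ZFS module against the running kernel via DKMS
|   sudo apt install linux-headers-$(uname -r) zfs-dkms \
|       zfsutils-linux
|   # confirm the module builds and loads
|   sudo modprobe zfs && zfs version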
| Filligree wrote:
| I'm running bcachefs on my desktop right now.
|
| It's promising, but there's... bugs. Right now only
| performance-oriented ones, as far as I've noticed, but I'd
| wait a bit longer.
| tux3 wrote:
| We may get a successor filesystem before that particular
| situation is sorted out...
|
| By all accounts mainline is at best not interested in, if not
| actively against, ZFS on Linux. The last few kerfuffles around
| symbols used by the out-of-tree module laid out the position
| rather unambiguously.
| Nextgrid wrote:
| > The last few kerfuffles around symbols used by the out-of-
| tree module laid out the position rather unambiguously.
|
| Source, for someone who isn't following kernel mailing lists?
| tux3 wrote:
| I was thinking of this thread from 5.0 in particular (2019,
| time flies!)
|
| https://lore.kernel.org/all/20190110182413.GA6932@kroah.com/
| Nextgrid wrote:
| It's sad to see that free software under a license (and
| movement) that was born out of someone's frustration with
| closed-source printer drivers (acting as DRM, albeit
| inadvertently) appears to include similar DRM whose sole
| purpose is to restrict usage of a (seemingly arbitrary)
| selection of symbols.
| tux3 wrote:
| It is. That said, I can't fault people too much for being
| afraid of the lawnmower. People have been mowed for much
| less.
| betaby wrote:
| > successor filesystem
|
| Which one?
| pa7ch wrote:
| bcachefs presumably
| j16sdiz wrote:
| we have heard the same with btrfs.
| chungy wrote:
| bcachefs has had time to actually mature instead of being
| kneecapped early on by an angry Linus Torvalds when
| btrfs's on-disk format changed and broke his Fedora
| install.
| matheusmoreira wrote:
| So happy to see this. Incremental expansion is extremely
| important for consumers and homelabs. Now we can gradually expand
| capacity.
| shrubble wrote:
| "This feature allows disks to be added one at a time to a RAID-Z
| group, expanding its capacity incrementally. This feature is
| especially useful for small pools (typically with only one RAID-Z
| group), where there isn't sufficient hardware to add capacity by
| adding a whole new RAID-Z group (typically doubling the number of
| disks)."
|
| A feature I am excited to see being added!
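|
| (For the curious, the PR extends "zpool attach" to work on
| RAID-Z vdevs. A rough sketch of what usage looks like, assuming
| a pool named "tank" with a single raidz2 vdev; the device name
| is illustrative and the exact syntax may still change before
| this is merged:)
|
|   # attach one new disk to the existing raidz2 vdev; the
|   # expansion then proceeds in the background
|   sudo zpool attach tank raidz2-0 /dev/sdf
|   # watch the expansion progress
|   zpool status tank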
| Dwedit wrote:
| So is it safe to use btrfs for a basic RAID-1 yet?
| wtallis wrote:
| Yeah, the non-parity RAID modes have been safe for a pretty
| long time, as long as you RTFM when something goes wrong
| instead of assuming the recovery procedures match what you'd
| expect coming from a background of traditional RAID or how ZFS
| does it. I've been using RAID-1 (with RAID1c3 for metadata
| since that feature became available) on a NAS for over a decade
| now without losing data, despite losing more drives over the
| years than the array started out with.
| scheme271 wrote:
| This has been floating around for 2 years at this point, so it
| might be a long while until it gets in. Interestingly, QNAP
| somehow added this feature to the code that their QuTS Hero
| NASes use. I'm
| not sure how solid or tested the QNAP code is but it's solid
| enough that they're shipping it in production.
| jtriangle wrote:
| Qnap does recommend a full backup before doing so, which tells
| me it's not exactly production ready as you and I would think
| of it.
| rincebrain wrote:
| QNAP's source drops are also kind of wild, in that they
| branched a looooooooong time ago and have been implementing
| their own versions of features they wanted since, AFAICT.
| phpisthebest wrote:
| I would expect any storage array to recommend a full backup
| any time you are messing with the physical disks. Even with
| "production ready" features, one would not add, remove, or do
| anything with the array without a full backup.
|
| "Production" systems should not even be considered production
| unless you have a backup of them.
| crote wrote:
| > After the expansion completes, old blocks remain with their old
| data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2
| parity), but distributed among the larger set of disks. New
| blocks will be written with the new data-to-parity ratio (e.g. a
| 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data
| to 2 parity).
|
| Does anyone know why this is the case? When expanding an array
| which is getting full this will result in a far smaller capacity
| gain than desired.
|
| Let's assume we are using 5x 10TB disks which are 90% full.
| Before the process, each disk will contain 5.4TB of data, 3.6TB
| of parity, and 1TB of free space. After the process and
| converting it to 6x 10TB, each disk will contain 4.5TB of data,
| 3TB of parity, and 2.5TB of free space. We can fill this free
| space with 1.66TB of data and 0.83TB of parity per disk - after
| which our entire array will contain 36.96TB of data.
|
| If we made a new 6-wide Z2 array, it would be able to contain
| 40TB of data - so adding a disk this way made us lose over 3TB in
| capacity! Considering the process is already reading and
| rewriting basically the entire array, why not recalculate the
| parity as well?
| louwrentius wrote:
| The key issue is that you basically have to rewrite all
| existing data to regain that lost 3TB. This takes a huge amount
| of time and the ZFS developers have decided not to automate
| this as part of this feature.
|
| You can do this yourself though when convenient to get those
| lost TB back.
|
| The RAID-Z vdev expansion feature had actually gone quite stale
| and wasn't being worked on, AFAIK, until this sponsorship.
| allanjude wrote:
| That is not how this will work.
|
| The reason the parity ratio stays the same, is that all of the
| references to the data are by DVA (Data Virtual Address,
| effectively the LBA within the RAID-Z vdev).
|
| So the data will occupy the same amount of space and parity as
| it did before.
|
| All stripes in RAID-Z are dynamic, so if your stripe is 5 wide
| and your array is 6 wide, the 2nd stripe will start on the last
| disk and wrap around.
|
| So if your 5x10 TB disks are 90% full, after the expansion they
| will contain the same 5.4 TB of data and 3.6 TB of parity, and
| the pool will now be 10 TB bigger.
|
| New writes will be 4+2 instead, but the old data won't change
| (that is how this feature is able to work without needing
| block-pointer rewrite).
|
| See this presentation:
| https://www.youtube.com/watch?v=yF2KgQGmUic
| magicalhippo wrote:
| > Considering the process is already reading and rewriting
| basically the entire array, why not recalculate the parity as
| well?
|
| Because snapshots might refer to the old blocks. Sure you could
| recompute, but then any snapshots would mean those old blocks
| would have to stay around so now you've taken up ~twice the
| space.
| mustache_kimono wrote:
| > Does anyone know why this is the case?
|
| > Considering the process is already reading and rewriting
| basically the entire array, why not recalculate the parity as
| well?
|
| IANA expert but my guess is -- because, here, you don't have to
| modify block pointers, etc.
|
| ZFS RAIDZ is not like traditional RAID, as it's not just a
| sequence of arbitrary bits, data plus parity. RAIDZ stripe
| width is variable/dynamic, written in blocks (imagine a 128K
| block, compressed to ~88K), and there is no way to quickly tell
| where the parity data is within a written block, where the end
| of any written block is, etc.
|
| If you instead had to modify the block pointers, I'd assume
| you would also have to change each block in the live tree and
| all dependent (including snapshot) blocks at the same time? That
| sounds extraordinarily complicated (and this is the data
| integrity FS!), and much slower, than just blasting through the
| data, in order.
|
| To do what you want, you can do what one could always do -- zfs
| send/recv from the old filesystem to a new one.
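|
| (A minimal sketch of that, assuming a pool "tank" with a
| dataset "data"; names are illustrative, and you need enough
| free space to hold the second copy while both exist:)
|
|   # snapshot the old dataset and replicate it into a new one;
|   # the copy is written at the new data-to-parity ratio
|   zfs snapshot -r tank/data@migrate
|   zfs send -R tank/data@migrate | zfs recv tank/data_new
|   # once verified, drop the old copy and rename the new one
|   zfs destroy -r tank/data
|   zfs rename tank/data_new tank/data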
| jedberg wrote:
| I wish Apple and Oracle would have just sorted things out and
| made ZFS the main filesystem for the Mac. Way back in the day
| when they first designed Time Machine, it was supposed to just be
| a GUI for ZFS snapshots.
|
| How cool would it be if we had a great GUI for ZFS (snapshots,
| volume management, etc.). I could buy a new external disk, add it
| to a pool, have seamless storage expansion.
|
| It would be great. Ah, what could have been.
| xoa wrote:
| Yes, this will always depress me; to me personally it is one
| of the ultimate evils of the court-invented idea of "software
| patents". I've used ZFS with Macs since 2011, but without Apple
| onboard it's never been as smooth as it should have been and
| has gotten more difficult in some respects. There was a small
| window where we might have had a really universal, really solid
| FS with great data guarantees and features. All sorts of things
| in the OS could be so much better: trivial (and even automated)
| update rollbacks, for example. Sigh :(. I hope zvols finally
| someday get some performance attention so that if nothing else
| running APFS (or any other OS of course) on top of one is a
| better experience.
| phs318u wrote:
| I had the same thoughts about Apple and (at the time) Sun, 15
| years ago.
|
| https://macoverdrive.blogspot.com/2008/10/using-zfs-to-manag...
| risho wrote:
| Yeah, imagine a world where you could use Time Machine to back
| up 500GB Parallels volumes where only the diff was stored
| between snapshots, rather than needing to back up the whole
| 500GB volume every single time.
| jedberg wrote:
| Right, that would be nice wouldn't it?
|
| As a workaround, you can create a sparse bundle to store your
| Parallels volume. Sparse bundles are stored in bands, and only
| bands that change get backed up. It might be slightly more
| efficient.
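|
| (A rough sketch of how that could be set up, assuming APFS and
| a 500 GB ceiling; the size, volume name, and path are
| illustrative:)
|
|   # create a growable, band-based disk image capped at 500 GB
|   hdiutil create -size 500g -type SPARSEBUNDLE -fs APFS \
|       -volname ParallelsVMs ~/ParallelsVMs.sparsebundle
|   # mount it and keep the VM files inside the mounted volume
|   hdiutil attach ~/ParallelsVMs.sparsebundle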
| risho wrote:
| wow that sounds like an interesting solution!
| globular-toast wrote:
| TrueNAS? You can manage the volumes with a web UI. It has a
| thing you can enable for SMB shares that lets you see the
| previous versions of files on Windows. Or perhaps I don't
| understand what you're after.
| jedberg wrote:
| I'm looking for that kind of featureset for my root
| filesystem on macOS. :/
| Terretta wrote:
| > _How cool would it be if we had a great GUI for ZFS
| (snapshots, volume management, etc.). I could buy a new
| external disk, add it to a pool, have seamless storage
| expansion._
|
| See QNAP HERO 5:
|
| https://www.qnap.com/static/landing/2021/quts-hero-5.0/en/in...
|
| NAS Options:
|
| https://www.qnap.com/en-us/product/?conditions=4-3
| jedberg wrote:
| As far as I can tell I can't use that as my root filesystem,
| right?
| mustache_kimono wrote:
| > How cool would it be if we had a great GUI for ZFS
| (snapshots, volume management, etc.).
|
| How cool would it be if we had a great _TUI_ for ZFS...
|
| Live in the now: https://github.com/kimono-koans/httm
| ErneX wrote:
| I'm not an expert whatsoever but what I've been doing for my NAS
| is using mirrored VDEVs. Started with one and later on added a
| couple more drives for a second mirror.
|
| Coincidentally, one of the drives of my 1st mirror died a few
| days ago after rebooting the host machine for updates. I
| replaced it today, and it's been resilvering for a while.
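|
| (For anyone doing the same, a minimal sketch assuming a pool
| named "tank"; the device paths are illustrative:)
|
|   # replace the failed disk with the new one and resilver
|   sudo zpool replace tank /dev/disk/by-id/old-dead-disk \
|       /dev/disk/by-id/new-disk
|   # watch resilver progress
|   zpool status tank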
| edmundsauto wrote:
| I've read this is suboptimal because you are now stressing the
| drive that has the only copy of your data to rebuild. What are
| your thoughts?
| deadbunny wrote:
| Mirrored vdevs resilver a lot faster than zX vdevs. Much less
| chance of the remaining drive dying during a resilver if it
| takes hours rather than days.
| tambourine_man wrote:
| If you don't need real-time redundancy, you may be better served
| by something like SnapRAID. It's more flexible, can handle
| mismatched disk sizes, its performance requirements are much
| lower, etc.
| hinkley wrote:
| I'm frustrated because this feature was mentioned by Schwartz
| when it was still in beta. I thought a new era of home computing
| was about to start. It didn't, and instead we got The Cloud,
| which feels like decentralization but is in fact massive
| centralization (organizational, rather than geographical).
|
| Some of us think people should be hosting stuff from home,
| accessible from their mobile devices. But the first and to me one
| of the biggest hurdles is managing storage. And that requires a
| storage appliance that is simpler than using a laptop, not
| requiring the skills of an IT professional.
|
| Drobo tried to make a storage appliance, but once you got to the
| fine print it had the same set of problems that ZFS still does.
|
| All professional storage solutions are built on an assumption of
| symmetry of hardware. I have n identical (except not the same
| batch?) drives which I will smear files out across.
|
| Consumers will _never_ have drive symmetry. That's a huge
| expenditure that few can justify, much less afford. My Synology
| didn't like most of my old drives so by the time I had a working
| array I'd spent practically a laptop on it. For a weirdly shaped
| computer I couldn't actually use directly. I'm a developer, I can
| afford it. None of my friends can. Mom definitely can't.
|
| A consumer solution needs to assume drive asymmetry. The day it
| is first plugged in, it will contain a couple of new drives, and
| every hard drive the consumer can scrounge up from junk drawers -
| save two: their current backup drive and an extra copy. Once the
| array (with one open slot) is built and verified, then one of the
| backups can go into the array for additional space and speed.
|
| From then on, the owner will likely buy one or two new drives
| every year, at whatever price point they're willing to pay, and
| swap out the smallest or slowest drive in the array. Meaning the
| array will always contain 2-3 different generation of hard
| drives. Never the same speed and never the same capacity. And
| they expect that if a rebuild fails, some of their data will
| still be retrievable. Without a professional data recovery
| company.
|
| That rules out all RAID levels except 0, which is nuts. An
| algorithm that can handle this scenario is consistent hashing.
| Weighted consistent hashing can handle disparate resources, by
| assigning more buckets to faster or larger machines. And it can
| grow and shrink (in a drive array, the two are sequential or
| simultaneous).
|
| Small and old businesses begin to resemble consumer purchasing
| patterns. They can't afford a shiny new array all at once. It's
| scrounging and piecemeal. So this isn't strictly about chasing
| consumers.
|
| I thought ZFS was on a similar path, but the delays in sprouting
| these features make me wonder.
| Osiris wrote:
| I completely agree. To build my array I had to buy several
| drives at the same time. To expand, I had to buy a new drive,
| move the data onto the array, and then I was left with the
| extra drive I had to buy to temporarily store the data, because
| I couldn't add it to the array.
|
| I would love to have more options for expandable redundancy.
| gregmac wrote:
| I can afford it, but have a hard time justifying the costs, not
| to mention scrapped (working) hardware and inconvenience (of
| swapping to a whole new array).
|
| I started using snapraid [1] several years ago, after finding
| zfs couldn't expand. Often when I went to add space the "sweet
| spot" disk size (best $/TB) was 2-3x the size of the previous
| biggest disk I ran. This was very economical compared to
| replacing the whole array every couple years.
|
| It works by having "data" and "parity" drives. Data drives are
| totally normal filesystems, and joined with unionfs. In fact
| you can mount them independently and access whatever files are
| on it. Parity drives are just a big file that snapraid updates
| nightly.
|
| The big downside is it's not realtime redundant: you can lose a
| day's worth of data from a (data) drive failure. For my use
| case this is acceptable.
|
| A huge upside is rebuilds are fairly painless. Rebuilding a
| parity drive has zero downtime, just degraded performance.
| Rebuilding a data drive leaves it offline, but the rest work
| fine (I think the individual files are actually accessible as
| they're restored though). In the worst case you can mount each
| data drive independently on any system and recover its
| contents.
|
| I've been running the "same" array for a decade, but at this
| point every disk has been swapped out at least once (for a
| larger one), and it's been in at least two different host
| systems.
|
| [1] https://www.snapraid.it/
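|
| (A minimal sketch of what the config looks like; the paths and
| drive names here are illustrative, not from the parent's
| setup:)
|
|   # /etc/snapraid.conf
|   parity /mnt/parity1/snapraid.parity
|   content /var/snapraid/snapraid.content
|   content /mnt/disk1/snapraid.content
|   data d1 /mnt/disk1/
|   data d2 /mnt/disk2/
|   exclude *.tmp
|
|   # then, e.g. nightly from cron:
|   snapraid sync
|
| The data drives stay plain filesystems, so any one of them can
| be pulled and read on its own, exactly as described above.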
| mrighele wrote:
| > I'm a developer, I can afford it. None of my friends can. Mom
| definitely can't.
|
| The only thing that can work for your mom and your friends is,
| in my opinion, a pair of disks in a mirror. When the space runs
| out, buy another box with two more disks in a mirror.
| Anything more than this is not only too complex for the average
| user but also too expensive.
| hinkley wrote:
| Keeping track of a heterogeneous drive array is just as big an
| imposition.
| bartvk wrote:
| Even that sounds complex.
|
| If they want network-attached storage, I'd just use a
| single-disk NAS and remotely back it up.
| Quekid5 wrote:
| I think I would advise against direct mirroring -- instead,
| I'd do sync-every-24-hours or something similar.
|
| Both schemes are vulnerable to the (admittedly rarer) errors
| where both drives fail simultaneously (e.g. mobo fried them)
| or are just ... destroyed by a fire or whatever.
|
| A periodic sync (while harder to set up) _will_ occasionally
| save you from deleting the wrong files, which mirroring
| doesn't.
|
| Either way: Any truly important data (family photos/videos,
| etc.) needs to be saved periodically to remote storage.
| There's no getting around that if you _really_ care about the
| data.
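|
| (A rough sketch of the periodic-sync idea, assuming the second
| disk is mounted at /mnt/backup; the paths and schedule are
| illustrative. Leaving out --delete means an accidental deletion
| isn't propagated until you choose to clean up:)
|
|   # crontab entry: copy new/changed files to the second disk
|   # every night at 03:00, without propagating deletions
|   0 3 * * * rsync -a /tank/important/ /mnt/backup/important/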
| canvascritic wrote:
| This is a really neat addition to RAID-Z. I recall setting up my
| ZFS pool in the early 2000s and grappling with disk counts
| because of how rigid expansion was. Good times. This would've
| made things so much simpler. Small nitpick: in the "during
| expansion" bit, I thought he could have elaborated a touch on
| restoring the "health of the raidz vdev" part; I didn't really
| follow his reasoning there. But overall, looking forward to this
| update. Nice work.
| eminence32 wrote:
| The big news here seems to be that iXsystems (the company behind
| FreeNAS/TrueNAS) is sponsoring this work now. This PR supersedes
| one that was opened back in 2021
|
| https://github.com/openzfs/zfs/pull/12225#issuecomment-16101...
| seltzered_ wrote:
| Yep, see also https://freebsdfoundation.org/blog/raid-z-
| expansion-feature-... (2022)
|
| (Via
| https://lobste.rs/s/5ahxj1/raid_z_expansion_feature_for_zfs_...
| )
___________________________________________________________________
(page generated 2023-08-19 23:00 UTC)