[HN Gopher] ZFS 2.3 released with ZFS raidz expansion
___________________________________________________________________
ZFS 2.3 released with ZFS raidz expansion
Author : scrp
Score : 345 points
Date : 2025-01-14 07:08 UTC (15 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| scrp wrote:
| After years in the making, ZFS raidz expansion is finally here.
|
| Major features added in release:
|
| - RAIDZ Expansion: Add new devices to an existing RAIDZ pool,
| increasing storage capacity without downtime.
|
| - Fast Dedup: A major performance upgrade to the original
| OpenZFS deduplication functionality.
|
| - Direct IO: Allows bypassing the ARC for reads/writes,
| improving performance in scenarios like NVMe devices where
| caching may hinder efficiency.
|
| - JSON: Optional JSON output for the most used commands.
|
| - Long names: Support for file and directory names up to 1023
| characters.
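|
| For a rough idea of how a couple of these surface on the
| command line (pool/vdev/device names here are placeholders;
| check the 2.3 man pages for the exact flags):
|
|       # attach a new disk to an existing raidz vdev to expand it
|       zpool attach tank raidz1-0 /dev/sdd
|
|       # machine-readable output for scripting
|       zpool status -j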
| cm2187 wrote:
| But I presume it is still not possible to remove a vdev.
| mustache_kimono wrote:
| Is this possible elsewhere (re: other filesystems)?
| cm2187 wrote:
| It is possible with windows storage space (remove drive
| from a pool) and mdadm/lvm (remove disk from a RAID array,
| remove volume from lvm), which to me are the two major
| alternatives. Don't know about unraid.
| mustache_kimono wrote:
| > It is possible with windows storage space (remove drive
| from a pool) and mdadm/lvm (remove disk from a RAID
| array, remove volume from lvm), which to me are the two
| major alternatives. Don't know about unraid.
|
| Perhaps I am misunderstanding you, but you can offline
| and remove drives from a ZFS pool.
|
| Do you mean WSS and mdadm/lvm will allow an automatic
| live rebalance and then reconfigure the drive topology?
| cm2187 wrote:
| So for instance I have a ZFS pool with 3 HDD data vdevs,
| and 2 SSD special vdevs. I want to convert the two SSD
| vdevs into a single one (or possibly remove one of them).
| From what I read the only way to do that is to destroy
| the entire pool and recreate it (it's in a server in a
| datacentre, don't want to reupload that much data).
|
| In windows, you can set a disk for removal, and as long
| as the other disks have enough space and are compatible
| with the virtual disks (eg you need at least 5 disks if
| you have parity with number of columns=5), it will
| rebalance the blocks onto the other disks until you can
| safely remove the disk. If you use thin provisioning, you
| can also change your mind about the settings of a virtual
| disk, create a new one on the same pool, and move the
| data from one to the other.
|
| Mdadm/lvm will do the same, albeit with more of a pain in
| the arse, as the RAID has to resilver not just the occupied
| space but also the free space, so it takes a lot more time
| and IO than it should.
|
| It's one of my beefs with ZFS: there are lots of no-return
| decisions. That, and I ran into some race conditions with
| loading a ZFS array on boot with nvme drives on ubuntu.
| They seem to not be ready, resulting in randomly degraded
| arrays. Fixed by loading the pool with a delay.
| ryao wrote:
| The man page says that your example is doable with zpool
| remove:
|
| https://openzfs.github.io/openzfs-
| docs/man/master/8/zpool-re...
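|
| A minimal sketch of that, with hypothetical pool/vdev names (the
| special vdev to drop might be a single disk or a mirror):
|
|       zpool remove tank mirror-1   # name of the special vdev
|       zpool status tank            # shows evacuation progress
|
| with the caveat, noted further down the thread, that this does
| not work if the pool also contains top-level raidz vdevs.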
| formerly_proven wrote:
| My understanding is that ZFS does virtual <-> physical
| translation in the vdev layer, i.e. all block references
| in ZFS contain a (vdev, vblock) tuple, and the vdev knows
| how to translate that virtual block offset into actual
| on-disk block offset(s).
|
| This kinda implies that you can't actually _remove_ data
| vdevs, because in practice you can't rewrite all
| references. You also can't do offline deduplication
| without rewriting references (i.e. actually touching the
| files in the filesystem). And that's why ZFS can't
| deduplicate snapshots after the fact.
|
| On the other hand, reshaping a vdev is possible, because
| that "just" requires shuffling the vblock -> physical
| block associations _inside_ the vdev.
| ryao wrote:
| There is a clever trick that is used to make top level
| removal work. The code will make the vdev readonly. Then
| it will copy its contents into free space on other vdevs
| (essentially, the contents will be stored behind the
| scenes in a file). Finally, it will redirect reads on
| that vdev into the stored vdev. This indirection allows
| you to remove the vdev. It is not implemented for raid-z
| at present though.
| formerly_proven wrote:
| Though the vdev itself still exists after doing that? It
| just happens to be backed by, essentially, a "file" in
| the pool, instead of the original physical block devices,
| right?
| ryao wrote:
| Yes.
| Sesse__ wrote:
| > Do you mean WSS and mdadm/lvm will allow an automatic
| live rebalance and then reconfigure the drive topology?
|
| mdadm can convert RAID-5 to a larger or smaller RAID-5,
| RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or
| the other way around, RAID-0 to a degraded RAID-5, and
| many other fairly reasonable operations, while the array
| is online, resistant to power loss and the likes.
|
| I wrote the first version of this md code in 2005
| (against kernel 2.6.13), and Neil Brown rewrote and
| mainlined it at some point in 2006. ZFS is... a bit late
| to the party.
| ryao wrote:
| Doing this with the on-disk data in a Merkle tree is much
| harder than doing it on more conventional forms of
| storage.
|
| By the way, what does MD do when there is corrupt data on
| disk that makes it impossible to know what the correct
| reconstruction is during a reshape operation? ZFS will
| know what file was damaged and proceed with the undamaged
| parts. ZFS might even be able to repair the damaged data
| from ditto blocks. I don't know what the MD behavior is,
| but its options for handling this are likely far more
| limited.
| Sesse__ wrote:
| Well, then they made a design choice in their RAID
| implementation that made fairly reasonable things hard.
|
| I don't know what md does if the parity doesn't match up,
| no. (I've never ever had that happen, in more than 25
| years of pretty heavy md use on various disks.)
| ryao wrote:
| I am not sure if reshaping is a reasonable thing. It is
| not so reasonable in other fields. In architecture, if
| you build a bridge and then want more lanes, you usually
| build a new bridge, rather than reshape the bridge. The
| idea of reshaping a bridge while cars are using it would
| sound insane there, yet that is what people want from
| storage stacks.
|
| Reshaping traditional storage stacks does not consider
| all of the ways things can go wrong. Handling all of them
| well is hard, if not impossible to do in traditional
| RAID. There is a long history of hardware analogs to MD
| RAID killing parity arrays when they encounter silent
| corruption that makes it impossible to know what is
| supposed to be stored there. There is also the case where
| things are corrupted such that there is a valid
| reconstruction, but the reconstruction produces something
| wrong silently.
|
| Reshaping certainly is easier to do with MD RAID, but the
| feature has the trade off that edge cases are not handled
| well. For most people, I imagine that risk is fine until
| it bites them. Then it is not fine anymore. ZFS made an
| effort to handle all of the edge cases so that they do
| not bite people and doing that took time.
| Sesse__ wrote:
| > I am not sure if reshaping is a reasonable thing.
|
| Yet people are celebrating when ZFS adds it. Was it all
| for nothing?
| ryao wrote:
| People wanted it, but it was very hard to do safely.
| While ZFS now can do it safely, many other storage
| solutions cannot.
|
| Those corruption issues I mentioned, where the RAID
| controller has no idea what to do, affect far more than
| just reshaping. They affect traditional RAID arrays when
| disks die and when patrol scrubs are done. I have not
| tested MD RAID on edge cases lately, but the last time I
| did, I found MD RAID ignored corruption whenever
| possible. It would not detect corruption in normal
| operation because it assumed all data blocks are good
| unless SMART said otherwise. Thus, it would randomly
| serve bad data from corrupted mirror members and always
| serve bad data from RAID 5/6 members whenever the data
| blocks were corrupted. This was particularly tragic on
| RAID 6, where MD RAID is hypothetically able to detect
| and correct the corruption if it tried. Doing that would
| come with such a huge performance overhead that it is
| clear why it was not done.
|
| Getting back to reshaping, while I did not explicitly
| test it, I would expect that unless a disk is missing or
| disappears during a reshape, MD RAID would ignore any
| corruption that can be detected using parity and assume
| all data blocks are good just like it does in normal
| operation. It does not make sense for MD RAID to look for
| corruption during a reshape operation, since not only
| would it be slower, but even if it finds corruption, it
| has no clue how to correct the corruption unless RAID 6
| is used, there are no missing/failed members and the
| affected stripe does not have any read errors from SMART
| detecting a bad sector that would effectively make it as
| if there was a missing disk.
|
| You could do your own tests. You should find that ZFS
| gracefully handles edge cases where the wrong thing is in a
| spot where something important should be, while MD RAID
| does not. MD RAID is a reimplementation of a
| technology from the 1960s. If 1960s storage technology
| handled these edge cases well, Sun Microsystems would not
| have made ZFS to get away from older technologies.
| amluto wrote:
| I've experienced bit rot on md. It was not fun, and the
| tooling was of approximately no help recovering.
| TiredOfLife wrote:
| Storage Spaces doesn't dedicate a drive to a single purpose.
| It operates in chunks (256MB, I think). So one drive can,
| at the same time, be part of a mirror and raid-5 and
| raid-0. This allows fully using drives with various
| sizes. And choosing to remove drive will cause it to
| redistribute the chunks to other available drives,
| without going offline.
| cm2187 wrote:
| And as a user it seems to me to be the most elegant
| design. The quality of the implementation (parity write
| performance in particular) is another matter.
| lloeki wrote:
| IIUC the ask (I have a hard time wrapping my head around
| zfs vernacular), btrfs allows this at least in some
| cases.
|
| If you can convince _btrfs balance_ to not use the dev to
| remove, it will simply rebalance data to the other devs
| and then you can _btrfs device remove_.
| c45y wrote:
| Bcachefs allows it
| eptcyka wrote:
| Cool, just have to wait before it is stable enough for
| daily use of mission critical data. I am personally
| optimistic about bcachefs, but incredibly pessimistic
| about changing filesystems.
| ryao wrote:
| It seems easier to copy data to a new ZFS pool if you
| need to remove RAID-Z top level vdevs. Another
| possibility is to just wait for someone to implement it
| in ZFS. ZFS already has top level vdev removal for other
| types of vdevs. Support for top level raid-z vdev removal
| just needs to be implemented on top of that.
| unixhero wrote:
| Btrfs
| tw04 wrote:
| Except you shouldn't use btrfs for any parity based raid
| if you value your data at all. In fact, I'm not aware of
| any vendor that has implemented btrfs with parity based
| raid; they all resort to btrfs on md.
| ryao wrote:
| That was added a while ago:
|
| https://openzfs.github.io/openzfs-docs/man/master/8/zpool-
| re...
|
| It works by making a readonly copy of the vdev being removed
| inside the remaining space. The existing vdev is then
| removed. Data can still be accessed from the copy, but new
| writes will go to an actual vdev, while space on the copy is
| gradually reclaimed as the old data is no longer needed.
| lutorm wrote:
| Although "Top-level vdevs can only be removed if the
| primary pool storage does not contain a top-level raidz
| vdev, all top-level vdevs have the same sector size, and
| the keys for all encrypted datasets are loaded."
| ryao wrote:
| I forgot we still did not have that last bit implemented.
| However, it is less important now that we have expansion.
| cm2187 wrote:
| And in my case all the vdevs are raidz
| jdboyd wrote:
| The first 4 seem like really big deals.
| snvzz wrote:
| The fifth is too, once you consider non-ASCII names.
| GeorgeTirebiter wrote:
| Could someone show a legit reason to use 1000-character
| filenames? Seems to me, when filenames are long like that,
| they are actually capturing several KEYS that can be easily
| searched via ls & re's. e.g.
|
| 2025-Jan-14-1258.93743_Experiment-2345_Gas-
| Flow-375.3_etc_etc.dat
|
| But to me this stuff should be in metadata. It's just that
| we don't have great tools for grepping the metadata.
|
| Heck, the original Macintosh FS had no subdirectories -
| they were faked by burying subdirectory names in the (flat
| filesystem) filename. The original Macintosh File System
| (MFS) did not support true hierarchical subdirectories.
| Instead, the illusion of subdirectories was created by
| embedding folder-like names into the filenames themselves.
|
| This was done by using colons (:) as separators in
| filenames. A file named Folder:Subfolder:File would appear
| to belong to a subfolder within a folder. This was entirely
| a user interface convention managed by the Finder.
| Internally, MFS stored all files in a flat namespace, with
| no actual directory hierarchy in the filesystem structure.
|
| So, there is 'utility' in "overloading the filename space".
| But...
| p_l wrote:
| > Could someone show a legit reason to use 1000-character
| filenames?
|
| 1023 _byte_ names can mean less than 250 characters due
| to use of unicode and utf-8. Add to it unicode
| normalization which might "expand" some characters into
| two or more combining characters, deliberate use of
| combining characters, emoji, rare characters, and you
| might end up with many "characters" taking more than 4
| bytes. A single "country flag" character will usually be
| 8 bytes, most emoji will be at least 4 bytes,
| skin tone modifiers will add 4 bytes, etc.
|
| this ' ' takes 27 bytes in my terminal, '' takes 28,
| another combo I found is 35 bytes.
|
| And that's on top of just getting a long title using
| let's say one of CJK or other less common scripts - an
| early manuscript of a somewhat successful Japanese novel
| has a non-normalized filename of 119 bytes, and it's
| nowhere close to actually long titles, something that
| someone might reasonably have on disk. A random find on
| the internet easily points to a book title that takes
| over 300 bytes in non-normalized utf8.
|
| P.S. proper title of "Robinson Crusoe" if used as
| filename takes at least 395 bytes...
| p_l wrote:
| hah. Apparently HN eradicated the carefully pasted
| complex unicode emojis.
|
| The first was "man+woman kissing" with skin tone
| modifier, then there were a few flags
| eatbitseveryday wrote:
| > RAIDZ Expansion: Add new devices to an existing RAIDZ pool,
| increasing storage capacity without downtime.
|
| More specifically:
|
| > A new device (disk) can be attached to an existing RAIDZ vdev
| BodyCulture wrote:
| How well tested is this in combination with encryption?
|
| Is the ZFS team handling encryption as a first class priority
| at all?
|
| ZFS on Linux inherited a lot of fame from ZFS on Solaris, but
| everyone using it in production should study the issue tracker
| very well for a realistic impression of the situation.
| p_l wrote:
| Main issue with encryption is occasional attempts by a certain
| (specific) Linux kernel developer to lock ZFS out of
| access to advanced instruction set extensions (far from the
| only weird idea of that specific developer).
|
| The way ZFS encryption is layered, the features should be
| pretty much orthogonal from each other, but I'll admit that
| there's a bit lacking with ZFS native encryption (though
| mainly in upper layer tooling in my experience rather than
| actual on-disk encryption parts)
| uniqueuid wrote:
| It's good to see that they were pretty conservative about the
| expansion.
|
| Not only is expansion completely transparent and resumable, it
| also maintains redundancy throughout the process.
|
| That said, there is one tiny caveat people should be aware of:
|
| > After the expansion completes, old blocks remain with their old
| data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2
| parity), but distributed among the larger set of disks. New
| blocks will be written with the new data-to-parity ratio (e.g. a
| 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data
| to 2 parity).
| chungy wrote:
| I'm not sure that's really a caveat, it just means old data
| might be in an inoptimal layout. Even with that, you still get
| the full benefits of raidzN, where up to N disks can completely
| fail and the pool will remain functional.
| stavros wrote:
| Is that the case? What if I expand a 3-1 array to 3-2? Won't
| the old blocks remain 3-1?
| Timshel wrote:
| I don't believe it supports adding parity drives only data
| drives.
| stavros wrote:
| Ahh interesting, thanks.
| bmicraft wrote:
| Since preexisting blocks are kept at their current parity
| ratio and not modified (only redistributed among all
| devices), increasing the parity level of new blocks won't
| really be useful in practice anyway.
| crote wrote:
| I think it's a huge caveat, because it makes upgrades a lot
| less efficient than you'd expect.
|
| For example, home users generally don't want to buy all of
| their storage up front. They want to add additional disks as
| the array fills up. Being able to start with a 2-disk raidz1
| and later upgrade that to a 3-disk and eventually 4-disk
| array is amazing. It's a lot less amazing if you end up with
| a 55% storage efficiency rather than 66% you'd ideally get
| from a 2-disk to 3-disk upgrade. That's 11% of your total
| disk capacity wasted, without any benefit whatsoever.
| chungy wrote:
| It still seems pretty minor. If you want extreme
| optimization, feel free to destroy the pool and create it
| new, or create it with the ideal layout from the beginning.
|
| Old data still works fine, the same guarantees RAID-Z
| provides still hold. New data will be written with the new
| data layout.
| bmicraft wrote:
| Well, when you start a raidz with 2 devices you've already
| done goofed. Start with a mirror or at least 3 devices.
|
| Also, if you don't wait to upgrade until the disks are at
| 100% utilization (which you should never do! you're
| creating massive fragmentation upwards of ~85%) efficiency
| in the real world will be better.
| ryao wrote:
| You have a couple options:
|
| 1. Delete the snapshots and rewrite the files in place like
| how people do when they want to rebalance a pool.
|
| 2. Use send/receive inside the pool.
|
| Either one will make the data use the new layout. They both
| carry the caveat that reflinks will not survive the
| operation, such that if you used reflinks to deduplicate
| storage, you will find the deduplication effect is gone
| afterward.
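|
| Option 2 might look roughly like this (dataset names are made
| up; double-check before destroying anything):
|
|       zfs snapshot -r tank/data@rebalance
|       zfs send -R tank/data@rebalance | zfs recv tank/data.new
|       zfs destroy -r tank/data
|       zfs rename tank/data.new tank/data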
| rekoil wrote:
| Yeah, it's a pretty huge caveat to be honest.
|       Da1 Db1 Dc1 Pa1 Pb1
|       Da2 Db2 Dc2 Pa2 Pb2
|       Da3 Db3 Dc3 Pa3 Pb3
|       ___ ___ ___ Pa4 Pb4
|
| ___ represents free space. After expansion by one disk you
| would logically expect something like:
|
|       Da1 Db1 Dc1 Da2 Pa1 Pb1
|       Db2 Dc2 Da3 Db3 Pa2 Pb2
|       Dc3 ___ ___ ___ Pa3 Pb3
|       ___ ___ ___ ___ Pa4 Pb4
|
| But as I understand it, it would actually expand to:
|
|       Da1 Db1 Dc1 Dd1 Pa1 Pb1
|       Da2 Db2 Dc2 Dd2 Pa2 Pb2
|       Da3 Db3 Dc3 Dd3 Pa3 Pb3
|       ___ ___ ___ ___ Pa4 Pb4
|
| Where the Dd1-3 blocks are just wasted. Meaning by adding a new
| disk to the array you're only expanding _free storage_ by
| 25%... So say you have 8TB disks for a total of 24TB of usable
| storage originally, and you have 4TB free before expansion, you
| would have 5TB free after expansion.
|
| Please tell me I've misunderstood this, because to me it is a
| pretty useless implementation if I haven't.
| ryao wrote:
| ZFS RAID-Z does not have parity disks. The parity and data are
| interleaved to allow data reads to be done from all disks
| rather than just the data disks.
|
| The slides here explain how it works:
|
| https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
|
| Anyway, you are not entirely wrong. The old data will have
| the old parity:data ratio while new data will have the new
| parity:data ratio. As old data is freed from the vdev, new
| writes will use the new parity:data ratio. You can speed this
| up by doing send/receive, or by deleting all snapshots and
| then rewriting the files in place. This has the caveat that
| reflinks will not survive the operation, such that if you
| used reflinks to deduplicate storage, you will find the
| deduplication effect is gone afterward.
| chungy wrote:
| To be fair, RAID5/6 don't have parity disks either. RAID2,
| RAID3, and RAID4 do, but they're all effectively dead
| technology for good reason.
|
| I think it's easy for a lot of people to conceptualize
| RAID5/6 and RAID-Zn as having "data disks" and "parity
| disks" to wrap around the complicated topic of how it
| works, but all of them truly interleave and compute parity
| data across all disks, allowing any single disk to die.
|
| I've been of two minds on the persistent myth of "parity
| disks" but I usually ignore it, because it's a convenient
| lie to understand your data is safe, at least. It's also a
| little bit the same way that raidz1 and raidz2 are
| sometimes talked about as "RAID5" and "RAID6"; the
| effective benefits are the same, but the implementation is
| totally different.
| magicalhippo wrote:
| Unless I misunderstood you, you're describing more how
| classical RAID would work. The RAID-Z expansion works like
| you note you would logically expect. You added a drive with
| four blocks of free space, and you end up with four blocks
| more of free space afterwards.
|
| You can see this in the presentation[1] slides[2].
|
| The reason this is sub-optimal post-expansion is because, in
| your example, the old maximal stripe width is lower than the
| post-expansion maximal stripe width.
|
| Your example is a bit unfortunate in terms of allocated
| blocks vs layout, but if we tweak it slightly, then
|
|       Da1 Db1 Dc1 Pa1 Pb1
|       Da2 Db2 Dc2 Pa2 Pb2
|       Da3 Db3 Pa3 Pb3 ___
|
| would after RAID-Z expansion become
|
|       Da1 Db1 Dc1 Pa1 Pb1 Da2
|       Db2 Dc2 Pa2 Pb2 Da3 Db3
|       Pa3 Pb3 ___ ___ ___ ___
|
| Ie you added a disk with 3 new blocks, and so total free
| space after is 1+3 = 4 blocks.
|
| However if the same data was written in the post-expanded
| vdev configuration, it would have become
|
|       Da1 Db1 Dc1 Dd1 Pa1 Pb1
|       Da2 Db2 Dc2 Dd2 Pa2 Pb2
|       ___ ___ ___ ___ ___ ___
|
| Ie, you'd have 6 free blocks not just 4 blocks.
|
| Of course this doesn't count for writes which end up taking
| less than the maximal stripe width.
|
| [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
|
| [2]:
| https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
| ryao wrote:
| Your diagrams have some flaws too. ZFS has a variable
| stripe size. Let's say you have a 10 disk raid-z2 vdev that
| is ashift=12 for 4K columns. If you have a 4K file, 1 data
| block and 2 parity blocks will be written. Even if you
| expand the raid-z vdev, there is no savings to be had from
| the new data:parity ratio. Now, let's assume that you have
| a 72K file. Here, you have 18 data blocks and 6 parity
| blocks. You would benefit from rewriting this to use the
| new data:parity ratio. In this case, you would only need 4
| parity blocks. ZFS does not rewrite it as part of the
| expansion, however.
|
| There are already good diagrams in your links, so I will
| refrain from drawing my own with ASCII. Also, ZFS will vary
| which columns get parity, which is why the slides you
| linked have the parity at pseudo-random locations. It was
| not a quirk of the slide's author. The data is really laid
| out that way.
| magicalhippo wrote:
| What are the errors? I tried to show exactly what you
| talk about.
|
| edit: ok, I didn't consider the exact locations of the
| parity, I was only concerned with space usage.
|
| The 8 data blocks need three stripes on a 3+2 RAID-Z2
| setup both pre and post expansion, the last being a
| partial stripe, but when written in the 4+2 setup only
| need 2 full stripes, leading to more total free space.
| wjdp wrote:
| The caveat is very much expected; you should expect ZFS features
| not to rewrite blocks. Changes to settings, for example, only
| apply to new data.
| cgeier wrote:
| This is huge news for ZFS users (probably mostly those in the
| hobbyist/home use space, but still). raidz expansion has been one
| of the most requested features for years.
| jfreax wrote:
| I'm not yet familiar with zfs and couldn't find it in the
| release notes: Does expansion only work with disks of the same
| size? Or is adding bigger/smaller disks possible, or do all
| disks need to have the same size?
| shiroiushi wrote:
| As far as I understand, ZFS doesn't work at all with disks of
| differing sizes (in the same array). So if you try it, it
| just finds the size of the smallest disk, and uses that for
| all disks. So if you put an 8TB drive in an array with a
| bunch of 10TB drives, they'll all be treated as 8TB drives,
| and the extra 2TB will be ignored on those disks.
|
| However, if you replace the smallest disk with a new, larger
| drive, and resilver, then it'll now use the new smallest disk
| as the baseline, and use that extra space on the other
| drives.
|
| (Someone please correct me if I'm wrong.)
| mustache_kimono wrote:
| > As far as I understand, ZFS doesn't work at all with
| disks of differing sizes (in the same array).
|
| This might be misleading, however, it may only be my
| understanding of word "array".
|
| You can use 2x10TB mirrors as vdev0, and 6x12TB in RAIDZ2
| as vdev1 in the same pool/array. You can also stack as many
| unevenly sized disks as you want in a pool. The actual
| problem is when you want a different drive topology within
| a pool or vdev, or you want to mismatch, say, 3 oddly sized
| drives to create some synthetic redundancy level (2x4TB and
| 1x8TB to achieve two copies on two disks) like btrfs
| does/tries to do.
| tw04 wrote:
| This is the case with any parity based raid, they just hide
| it or lie to you in various ways. If you have two 6TB drives
| and two 12TB drives in a single raid-6 array, it is
| physically impossible to have two drive parity once you
| exceed 12TB of written capacity. BTRFS and bcachefs can't
| magically create more space where none exists on your 6TB
| drives. They resort to dropping to mirror protection for
| the excess capacity which you could also do manually with
| ZFS by giving it partitions instead of the whole drive.
| chasil wrote:
| IIRC, you could always replace drives in a raidset with
| larger devices. When the last drive is replaced, then the new
| space is recognized.
|
| This new operation seems somewhat more sophisticated.
| zelcon wrote:
| You need to buy the same exact drive with the same capacity
| and speed. Your raidz vdev will be as small and as slow as your
| smallest and slowest drive.
|
| btrfs and the new bcachefs can do RAID with mixed drives, but
| I can't trust either of them with my data yet.
| Mashimo wrote:
| > You need to buy the same exact drive
|
| AFAIK you can add larger and faster drives, you will just
| not get any benefits from it.
| bpye wrote:
| You can get read speed benefits with faster drives, but
| your writes will be limited by your slowest.
| hda111 wrote:
| It doesn't have to be the same exact drive. Mixing drives
| from different manufacturers (with the same capacity) is
| often used to prevent correlated failure. ZFS is not using
| the whole disk, so different disks can be mixed, because
| disks often have slightly varying capacity.
| unixhero wrote:
| Just have backups. I used btrfs and zfs for different
| purposes. Never had any lost data or downtime with btrfs
| since 2016. I only use raid 0 and raid 1 and compression.
| Btrfs does not have a hungry RAM requirement.
| tw04 wrote:
| Neither does zfs, that's a widely repeated red herring
| from people trying to do dedup in the very early days,
| and people who misunderstood how it used ram to do
| caching.
| tw04 wrote:
| You can run raid-z across partitions to utilize the full
| drive just like synology does with their "hybrid raid" -
| you just shouldn't.
| ryao wrote:
| You can use different sized disks, but RAID-Z will truncate
| the space it uses to the lowest common denominator. If you
| increase the lowest common denominator, RAID-Z should auto-
| expand to use the additional space. All parity RAID
| technologies truncate members to the lowest common
| denominator, rather than just ZFS.
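|
| The usual grow-by-replacement flow, sketched with hypothetical
| names:
|
|       zpool set autoexpand=on tank
|       zpool replace tank old-small-disk new-big-disk
|       # or, if autoexpand was off when the last disk was swapped:
|       zpool online -e tank new-big-disk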
| zelcon wrote:
| Been running it since rc2. It's insane how long this took to
| finally ship.
| senectus1 wrote:
| Would love to use ZFS, but unfortunately Fedora just can't keep up
| with it...
| vedranm wrote:
| If you delay upgrading the kernel on occasions, it is more or
| less fine.
| kawsper wrote:
| Not sure if it helps you at all, but I have a simple Ruby
| script that I use to build kernels on Fedora with a specified
| ZFS version.
|
| https://github.com/kaspergrubbe/fedora-kernel-compilation/bl...
|
| It builds on top of the exploded fedora kernel tree, adds zfs
| and spits out a .rpm that you can install with rpm -ivh.
|
| It doesn't play well with dkms because it tries to interfere,
| so I disable it on my system.
| _factor wrote:
| I could never get it working on rpm-ostree distros.
| klauserc wrote:
| I've been running Fedora on top of the excellent ZFSBootMenu[1]
| for about a year. You need to pay attention to the kernel
| versions supported by OpenZFS and might have to wait for
| support for a couple of weeks. The setup works fine otherwise.
|
| [1] https://docs.zfsbootmenu.org
| endorphine wrote:
| Can someone describe why they would use ZFS (or similar) for home
| usage?
| lutorm wrote:
| Apart from just peace of mind from bitrot, I use it for the
| snapshotting capability which makes it super easy to do
| backups. You can snapshot and send the snapshots to other
| storage with e.g zfs-autobackup and it's trivial and you can't
| screw it up. If the snapshots exist on the other drive, you
| know you have a backup.
| vedranm wrote:
| Several reasons, but major ones (for me) are reliability
| (checksums and self-healing) and portability (no other modern
| filesystem can be read and written on Linux, FreeBSD, Windows,
| and macOS).
|
| Snapshots ("boot environments") are also supported by Btrfs (my
| Linux installations use that so I don't have to worry about
| having the 3rd party kernel module to read my rootfs).
| Performance isn't that great either and, assuming Linux, XFS is
| a better choice if that is your main concern.
| nesarkvechnep wrote:
| I'm trying to find a reason not to use ZFS at home.
| dizhn wrote:
| Requirement for enterprise quality disks, huge RAM (1 gig per
| TB), ECC, at least x5 disks of redundancy. None of these are
| things, but people will try to educate you anyway. So use it
| but keep it to yourself. :)
| craftkiller wrote:
| No need to keep it to yourself. As you've mentioned, all of
| these requirements are misinformation so you can ignore
| people who repeat them (or even better, tell them to stop
| spreading misinformation).
|
| For those not in the know:
|
| You don't need to use enterprise quality disks. There is
| nothing in the ZFS design that requires enterprise quality
| disks any more than any other file system. In fact, ZFS has
| saved my data through multiple consumer-grade HDD failures
| over the years thanks to raidz.
|
| The 1 gig per TB figure is ONLY for when using the ZFS
| dedup feature, which is widely regarded as a bad idea
| except in VERY specific use cases.
| 99.9% of ZFS users should not and will not use dedup and
| therefore they do not need ridiculous piles of ram.
|
| There is nothing in the design of ZFS any more dangerous to
| run without ECC than any other filesystem. ECC is a good
| idea regardless of filesystem but it's certainly not a
| requirement.
|
| And you don't need x5 disks of redundancy. It runs great
| and has benefits even on single-disk systems like laptops.
| Naturally, having parity drives is better in case a drive
| fails but on single disk systems you still benefit from the
| checksumming, snapshotting, boot environments, transparent
| compression, incremental zfs send/recv, and cross-platform
| native encryption.
| bbatha wrote:
| > The 1 gig per TB figure is ONLY for when using the ZFS
| dedup feature, which the ZFS dedup feature is widely
| regarded as a bad idea except in VERY specific use cases.
| 99.9% of ZFS users should not and will not use dedup and
| therefore they do not need ridiculous piles of ram.
|
| You also really don't need 1GB of RAM per TB unless you have
| a very high write volume. YMMV but my experience is that
| it's closer to 1GB for 10TB.
| tpetry wrote:
| The interesting part about the enterprise quality disk
| misinformation is how wrong it is. The core idea of ZFS
| was to detect issues when drives or their drivers are
| faulty. And this happened more with cheap non-enterprise
| disks at that time.
| Mashimo wrote:
| It's relatively easy, and yet powerful. Before that I had MDADM
| + LVM + dm-crypt + ext4, which also worked but all the layers
| got me into a headache.
|
| Automated snapshots are super easy and fast. Also easy to
| access if you deleted a file, you don't have to restore the
| whole snapshot, you can just cp from the hidden .zfs/ folder.
|
| I have run it on 6x 8TB disks for a couple of years now. I run it
| in a raidz2, which means up to 2 disks can die. Would I use it on a
| single disk on a Desktop? Probably not.
| redundantly wrote:
| > Would I use it on a single disk on a Desktop? Probably not.
|
| I do. Snapshots and replication and checksumming are awesome.
| chromakode wrote:
| I replicate my entire filesystem to a local NAS every 10
| minutes using zrepl. This has already saved my bacon once when
| a WD_BLACK SN850 suddenly died on me [1]. It's also recovered
| code from some classic git blunders. It shouldn't be possible
| any more to lose data to user error or single device failure.
| We have the technology.
|
| [1]: https://chromakode.com/post/zfs-recovery-with-zrepl/
| mrighele wrote:
| Good reasons for me:
|
| Checksums: this is even more important in home usage as the
| hardware is usually of lower quality. Faulty controllers,
| crappy cables, hard disks stored in a higher than advised
| temperature... many reasons for bogus data to be saved, and zfs
| handles that well and automatically (if you have redundancy)
|
| Snapshots: very useful to make backups and quickly go back to
| an older version of a file when mistakes are made
|
| Ease of mind: compared to the alternatives, I find that zfs is
| easier to use and makes it harder to make a mistake that could
| bring data loss (e.g. remove by mistake the wrong drive when
| replacing a faulty one, pool becomes unusable, "ops!", put the
| disk back, pool goes back to work as nothing happened). Maybe
| it is different now with mdadm, ma when I used it years ago I
| was always worried to make a destructive mistake.
| EvanAnderson wrote:
| > Snapshots: very useful to make backups and quickly go back
| to an older version of a file when mistakes are made
|
| Piling on here: Sending snapshots to remote machines (or
| removable drives) is very easy. That makes snapshots viable
| as a backup mechanism (because they can exist off-site and
| offline).
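|
| A minimal sketch of that kind of off-site replication, with
| hypothetical dataset/host names:
|
|       zfs snapshot tank/data@2025-01-14
|       zfs send -i @2025-01-07 tank/data@2025-01-14 \
|           | ssh backuphost zfs recv -u backup/data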
| ryao wrote:
| To give an answer that nobody else has given, ZFS is great for
| storing Steam games. Set recordsize=1M and compression=zstd and
| you can often store about 33% more games in the same space.
|
| A friend uses ZFS to store his Steam games on a couple of hard
| drives. He gave ZFS a SSD to use as L2ARC. ZFS automatically
| caches the games he likes to run on the SSD so that they load
| quickly. If he changes which games he likes to run, ZFS will
| automatically adapt to cache those on the SSD instead.
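|
| Roughly the setup being described, with made-up names:
|
|       zfs create -o recordsize=1M -o compression=zstd tank/games
|       zpool add tank cache /dev/nvme0n1p1   # SSD becomes L2ARC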
| chillfox wrote:
| The compression and ARC will make games load much faster than
| they would on NTFS even without having a separate drive for
| the ARC.
| bmicraft wrote:
| As I understand, L2ARC doesn't work across reboots which
| unfortunately makes it almost useless for systems that get
| rebooted regularly, like desktops.
| NamTaf wrote:
| I used it on my home NAS (4x3TB drives, holding all of my
| family's backups, etc.) for the data security / checksumming
| features. IMO it's performant, robust and well-designed in ways
| that give me reassurance regarding data integrity and help
| prevent me shooting myself in the foot.
| PaulKeeble wrote:
| I have a home built NAS that uses ZFS for the storage array and
| the checksumming has been really quite useful in detecting and
| correcting bit rot. In the past I used MDADM and EXT over the
| top and that worked but it didn't defend against bit rot. I
| have considered BTRFS since it would get me the same
| checksumming without the rest of ZFS but it's not considered
| reliable for systems with parity yet (although I think it
| likely is more than reliable enough now).
|
| I do occasionally use snapshots and the compression feature is
| handy on quite a lot of my data set but I don't use the user
| and group limitations or remote send and receive etc. ZFS does
| a lot more than I need but it also works really well and I
| wouldn't move away from a checksumming filesystem now.
| mshroyer wrote:
| I use it on a NAS for:
|
| - Confidence in my long-term storage of some data I care about,
| as zpool scrub protects against bit rot
|
| - Cheap snapshots that provide both easy checkpoints for work
| saved to my network share, and resilience against ransomware
| attacks against my other computers' backups to my NAS
|
| - Easy and efficient (zfs send) replication to external hard
| drives for storage pool backup
|
| - Built-in and ergonomic encryption
|
| And it's really pretty easy to use. I started with FreeNAS (now
| TrueNAS), but eventually switched to just running FreeBSD + ZFS
| + Samba on my file server because it's not that complicated.
| zbentley wrote:
| I use ZFS for boot and storage volumes on my main workstation,
| which is primarily that--a workstation, not a server or NAS.
| Some benefits:
|
| - Excellent filesystem level backup facility. I can transfer
| snapshots to a spare drive, or send/receive to a remote (at
| present a spare computer, but rsync.net looks better every year
| I have to fix up the spare).
|
| - Unlike other fs-level backup solutions, the flexibility of
| zvols means I can easily expand or shrink the scope of what's
| backed up.
|
| - It's incredibly easy to test (and restore) backups. Pointing
| my to-be-backed-up volume, or my backup volume, to a previous
| backup snapshot is instant, and provides a complete view of the
| filesystem at that point in time. No "which files do you want
| to restore" hassles or any of that, and then I can re-point
| back to latest and keep stacking backups. Only Time Machine has
| even approached that level of simplicity in my experience, and
| I have tried a _lot_ of backup tools. In general, backup
| tools/workflows that uphold "the test process _is_ the
| restoration process, so we made the restoration process as easy
| and reversible as possible" are the best ones.
|
| - Dedup occasionally comes in useful (if e.g. I'm messing
| around with copies of really large AI training datasets or many
| terabytes of media file organization work). It's RAM-expensive,
| yes, but what's often not mentioned is that you can turn it on
| and off for a volume--if you rewrite data. So if I'm looking
| ahead to a week of high-volume file wrangling, I can turn dedup
| on where I need it, start a snapshot-and-immediately-restore of
| my data (or if it's not that many files, just cp them back and
| forth), and by the next day or so it'll be ready. Turning it
| off when I'm done is even simpler. I imagine that the copy cost
| and unpredictable memory usage mean that this kind of "toggled"
| approach to dedup isn't that useful for folks driving servers
| with ZFS, but it's outstanding on a workstation.
|
| - Using ZFSBootMenu outside of my OS means I can be extremely
| cavalier with my boot volume. Not sure if an experimental
| kernel upgrade is going to wreck my graphics driver? Take a
| snapshot and try it! Not sure if a curl | bash invocation from
| the internet is going to rm -rf /? Take a snapshot and try it!
| If my boot volume gets ruined, I can roll it back to a snapshot
| _in the bootloader_ from outside of the OS. For extra paranoia
| I have a ZFSBootMenu EFI partition on a USB drive if I ever
| wreck the bootloader as well, but the odds are that if I ever
| break the system that bad the boot volume is damaged at the
| block level and can't restore local snapshots. In that case,
| I'd plug in the USB drive and restore a snapshot from the
| adjacent data volume, or my backup volume ... all without
| installing an OS or leaving the bootloader. The benefits of
| this to mental health are huge; I can tend towards a more
| "college me" approach to trying random shit from StackOverflow
| for tweaking my system without having to worry about "adult
| professional me" being concerned that I don't _know_ what
| running some random garbage will do to my system. Being able to
| experiment first, and then learn what's really going on once I
| find what works, is very relieving and makes tinkering a much
| less fraught endeavor.
|
| - Being able to per-dataset enable/disable ARC and ZIL means
| that I can selectively make some actions really fast. My Steam
| games, for example, are in a high-ARC-bias dataset that starts
| prewarming (with throttled IO) in the background on boot. Game
| load times are extremely fast--sometimes at better than single-
| ext4-SSD levels--and I'm storing all my game installs on
| spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)!
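|
| The per-dataset knobs referred to above are roughly these
| (dataset names hypothetical; see zfsprops(7) for the values):
|
|       zfs set primarycache=all tank/games      # cache data in ARC
|       zfs set secondarycache=all tank/games    # and in L2ARC
|       zfs set logbias=throughput tank/scratch  # de-emphasize ZIL
|       zfs set sync=disabled tank/scratch       # skip ZIL (risky)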
| tbrownaw wrote:
| > _describe why they would use ZFS (or similar) for home usage_
|
| Mostly because it's there, but also the snapshots have a `diff`
| feature that's occasionally useful.
| klauserc wrote:
| I use it on my work laptop. Reasons:
|
| - a single solution that covers the entire storage domain (I
| don't have to learn multiple layers, like logical volume
| manager vs. ext4 vs. physical partitions)
|
| - cheap/free snapshots. I have been glad to have been able to
| revert individual files or entire file systems to an earlier
| state. E.g., create a snapshot before doing a major distro
| update.
|
| - easy to configure/well documented
|
| Like others have said, at this point I would need a good
| reason, NOT to use ZFS on a system.
| FrostKiwi wrote:
| FINALLY!
|
| You can do borderline insane single-vdev setups like RAID-Z3 with
| 4 disks (3 disks worth of redundancy) of the most expensive and
| highest density hard drives money can buy right now, for an
| initial effective space usage of 25%, and then keep buying and
| expanding disk by disk as the space demand grows, up to
| something like 12ish disks. Disk prices drop as time goes on,
| and you get a spread-out failure chance with disks being added
| at different times.
| uniqueuid wrote:
| Yes but see my sibling comment.
|
| When you expand your array, your _existing data_ will not be
| stored any more efficiently.
|
| To get the new parity/data ratios, you would have to force
| copies of the data and delete the old, inefficient versions,
| e.g. with something like this [1]
|
| My personal take is that it's a much better idea to buy
| individual complete raid-z configurations and add new ones /
| replace old ones (disk by disk!) as you go.
|
| [1] https://github.com/markusressel/zfs-inplace-rebalancing
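|
| Conceptually, [1] just rewrites each file in place so new blocks
| get the current layout, something like (the real script is more
| careful about checksums, attributes and reflinks):
|
|       cp -p somefile somefile.rebalance
|       mv somefile.rebalance somefile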
| Mashimo wrote:
| I wish something like this would be built into ZFS, so
| snapshots and current access would not be broken.
| uniqueuid wrote:
| True, but I have a gut feeling that a lot of these thorny
| issues would come up again:
|
| https://github.com/openzfs/zfs/issues/3582
| averageRoyalty wrote:
| Worth noting that TrueNAS already supports this[0] (I assume
| using 2.3.0rc3?). Not sure about the stability, but very
| exciting.
|
| https://www.truenas.com/blog/electric-eel-openzfs-23/
| poisonborz wrote:
| I just don't get how the Windows world - by far the largest PC
| platform per userbase - still doesn't have any answer to ZFS.
| Microsoft had WinFS and then ReFS, but it's on the backburner and,
| while there is active development (Win11 ships some bits from time
| to time), release is nowhere in sight. There are some lone warriors
| trying the giant task of creating a ZFS compatibility layer with
| some projects, but they are far from being mature/usable.
|
| How come that Windows still uses a 32 year old file system?
| mustache_kimono wrote:
| > I just don't get it how the Windows world - by far the
| largest PC platform per userbase - still doesn't have any
| answer to ZFS.
|
| The mainline Linux kernel doesn't either, and I think the
| answer is because it's hard and high risk with a return mostly
| measured in technical respect?
| ffsm8 wrote:
| Technically speaking, bcachefs has been merged into the Linux
| Kernel - that makes your initial assertion wrong.
|
| But considering it's had two drama events within 1 year of
| getting merged... I think we can safely confirm your
| conclusion of it being really hard
| mustache_kimono wrote:
| > Technically speaking, bcachefs has been merged into the
| Linux Kernel - that makes your initial assertion wrong.
|
| bcachefs doesn't implement its erasure coding/RAID yet?
| Doesn't implement send/receive. Doesn't implement
| scrub/fsck. See: https://bcachefs.org/Roadmap,
| https://bcachefs.org/Wishlist/
|
| btrfs is still more of a legit competitor to ZFS these days
| and it isn't close to touching ZFS where it matters. If the
| perpetually half-finished bcachefs and btrfs are the
| "answer" to ZFS that seems like too little, too late to me.
| koverstreet wrote:
| Erasure coding is almost done; all that's missing is some
| of the device evacuate and reconstruct paths, and people
| have been testing it and giving positive feedback
| (especially w.r.t. performance).
|
| It most definitely does have fsck and has since the
| beginning, and it's a much more robust and dependable
| fsck than btrfs's. Scrub isn't quite done - I actually
| was going to have it ready for this upcoming merge window
| except for a nasty bout of salmonella :)
|
| Send/recv is a long ways off, there might be some low
| level database improvements needed before that lands.
|
| Short term (next year or two) priorities are finishing
| off online fsck, more scalability work (upcoming version
| for this merge window will do 50PB, but now we need to up
| the limit on number of drives), and quashing bugs.
| ryao wrote:
| Hearing that it is missing some code for reconstruction
| makes it sound like it is missing something fairly
| important. The original purpose of parity RAID is to
| support reconstruction.
| koverstreet wrote:
| We can do reconstruct reads, what's missing is the code
| to rewrite missing blocks in a stripe after a drive dies.
|
| In general, due to the scope of the project, I've been
| prioritizing the functionality that's needed to validate
| the design and the parts that are needed for getting the
| relationships between different components correct.
|
| e.g. recently I've been doing a bunch of work on
| backpointers scalability, and that plus scrub are leading
| to more back and forth iteration on minor interactions
| with erasure coding.
|
| So: erasure coding is complete enough to know that it
| works and for people to torture test it, but yes you
| shouldn't be running it in production yet (and it's
| explicitly marked as such). What's remaining is trivial
| but slightly tedious stuff that's outside the critical
| path of the rest of the design.
|
| Some of the code I've been writing for scrub is turning
| out to also be what we want for reconstruct, so maybe
| we'll get there sooner rather than later...
| BSDobelix wrote:
| >except for a nasty bout of salmonella
|
| Did the Linux Foundation send you some "free" sushi? ;)
|
| However keep the good work rolling, super happy about a
| good, usable and modern Filesystem native to Linux.
| pdimitar wrote:
| FYI: the main reason I gave up on bcachefs is that I
| can't use devices with native 16K blocks.
|
| Hope that's coming this year. I have a bunch of old HDDs
| and SSDs and I could very easily assemble a spare storage
| server with about 4TB capacity. Already tested bcachefs
| with most of the drives and it performed very well.
|
| Also lack of ability to reconstruct seems like another
| worrying omission.
| koverstreet wrote:
| I wasn't aware there were actual users needing bs > ps
| yet. Cool :)
|
| That should be completely trivial for bcachefs to
| support, it'll mostly just be a matter of finding or
| writing the tests.
| pdimitar wrote:
| Seriously? But... NVMe drives! I stopped testing because
| I only have one spare NVMe and couldn't use it with
| bcachefs.
|
| If you or others can get it done I'm absolutely starting
| to use bcachefs the month after. I do need fast storage
| servers in my home office.
| ryao wrote:
| You can do this on ZFS today with `zpool create -o
| ashift=14 ...`.
| pdimitar wrote:
| Yeah I know, thanks. But ZFS still mostly requires drives
| with the same sizes. My main NAS is like that but I can't
| expand it even though I want to, with drives of different
| sizes I have lying around, and I am not keen on spending
| for new HDDs right now. So I thought I'll make a
| secondary NAS with bcachefs and all the spare drives I
| have.
|
| As for ZFS, I'll be buying some extra drives later this
| year and will make use of direct_io so I can use another
| NVMe spare for faster access.
| ryao wrote:
| If you don't care about redundancy, you could add all of
| them as top level vdevs and then ZFS will happily use all
| of the space on them until one fails. Performance should
| be great until there is a failure. Just have good
| backups.
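|
| i.e. something along the lines of (device names made up):
|
|       zpool create scratch /dev/sda /dev/sdb /dev/sdc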
| mafuy wrote:
| Thank you, looking forward to it!
| kwanbix wrote:
| Honest question. As an end user that uses Windows and Linux and
| does not use ZFS, what am I missing?
| madeofpalk wrote:
| I'm missing file clones/copy-on-write.
| poisonborz wrote:
| Way better data security, resilience against file rotting.
| This goes for both HDDs or SSDs. Copy-on-write, snapshots,
| end to end integrity. Also easier to extend the storage for
| safety/drive failure (and SSDs corrupt in a more sneaky way)
| with pools.
| wil421 wrote:
| How many of us are using single disks on our laptops? I
| have a NAS and use all of the above but that doesn't help
| people with single drive systems. Or help me understand why
| I would want it on my laptop.
| ekianjo wrote:
| It provides encryption by default without having to deal
| with LUKS. And no need to ever do fsck again.
| Twey wrote:
| Except that swap on OpenZFS still deadlocks 7 years later
| (https://github.com/openzfs/zfs/issues/7734) so you're
| still going to need LUKS for your swap anyway.
| ryao wrote:
| Another option is to go without swap. I avoid swap on my
| machines unless I want hibernation support.
| ryao wrote:
| My thinkpad from college uses ZFS as its rootfs. The benefits are:
|
| * If the hard drive / SSD corrupted blocks, the corruption
| would be identified.
|
| * Ditto blocks allow for self healing. Usually, this only
| applies to metadata, but if you set copies=2, you can get this
| on data too. It is a poor man's RAID.
|
| * ARC made the desktop environment very responsive since
| unlike the LRU cache, ARC resists cold cache effects from
| transient IO workloads.
|
| * Transparent compression allowed me to store more on the
| laptop than otherwise possible.
|
| * Snapshots and rollback allowed me to do risky experiments
| and undo them as if nothing happened.
|
| * Backups were easy via send/receive of snapshots.
|
| * If the battery dies while you are doing things, you can boot
| without any damage to the filesystem.
|
| That said, I use a MacBook these days when I need to go
| outside. While I miss ZFS on it, I have not felt
| motivated to try to get a ZFS rootfs on it since the last
| I checked, Apple hardcoded the assumption that the rootfs
| is one of its own filesystems into the XNU kernel and
| other parts of the system.
| CoolCold wrote:
| NTFS has had compression since, not even sure when.
|
| For other stuff, let that nerdy CorpIT handle your
| system.
| ryao wrote:
| NTFS compression is slow and has a low compression ratio.
| ZFS has both zstd and lz4.
| adgjlsfhk1 wrote:
| yes but NTFS is bad enough that no one needs to be told
| how bad it is.
| rabf wrote:
| Not ever having to deal with partitions and instead using
| datasets, each of which can have its own properties such as
| compression, size quota, encryption, etc., is another
| benefit. Also using zfsbootmenu instead of grub
| enables booting from different datasets or snapshots as
| well as mounting and fixing data sets all from the
| bootloader!
| yjftsjthsd-h wrote:
| If the single drive in your laptop corrupts data, you
| won't know. ZFS can't _fix_ corruption without extra
| copies, but it's still useful to catch the problem and
| notify the user.
|
| Also snapshots are great regardless.
| Polizeiposaune wrote:
| In some circumstances it can.
|
| Every ZFS block pointer has room for 3 disk addresses; by
| default, the extras are used only for redundant metadata,
| but they can also be used for user data.
|
| When you turn on ditto blocks for data (zfs set copies=2
| rpool/foo), zfs can fix corruption even on single-drive
| systems at the cost of using double or triple the space.
| Note that (like compression), this only affects blocks
| written after the setting is in place, but (if you can
| pause writes to the filesystem) you can use zfs send|zfs
| recv to rewrite all blocks to ensure all blocks are
| redundant.
| jeroenhd wrote:
| The data security and rot resilience only goes for systems
| with ECC memory. Correct data with a faulty checksum will
| be treated the same as incorrect data with a correct
| checksum.
|
| Windows has its own extended filesystem through Storage
| Spaces, with many ZFS features added as lesser used Storage
| Spaces options, especially when combined with ReFS.
| abrookewood wrote:
| Please stop repeating this, it is incorrect. ECC helps
| with any system, but it isn't necessary for ZFS checksums
| to work.
| BSDobelix wrote:
| On zfs there is the ARC (adaptive read cache), on non-zfs
| systems this "read cache" is called buffer, both reside
| in memory, so ECC is equally important for both systems.
|
| Rot means changing bits without accessing those bits, and
| that's ~not possible with zfs, additionally you can
| enable check-summing IN the ARC (disabled by default),
| and with that you can say that ECC and "enterprise"
| quality hardware is even more important for non-ZFS
| systems.
|
| >Correct data with a faulty checksum will be treated the
| same as incorrect data with a correct checksum.
|
| There is no such thing as "correct" data, only a block
| with a correct checksum, if the checksum is not correct,
| the block is not ok.
| _factor wrote:
| This has nothing to do with ZFS as a filesystem. It has
| integrity verification on duplicated raid configurations.
| If the system memory flips a bit, it will get written to
| disk like all filesystems. If a bit flips on a disk,
| however, it can be detected and repaired. Without ECC,
| your source of truth can corrupt, but this true of any
| system.
| mrb wrote:
| _" data security and rot resilience only goes for systems
| with ECC memory."_
|
| No. Bad HDDs/SSDs or bad SATA cables/ports cause a lot
| more data corruption than bad RAM. And ZFS will correct
| these cases even without ECC memory. It's a myth that the
| data healing properties of ZFS are useless without ECC
| memory.
| elseless wrote:
| Precisely this. And don't forget about bugs in
| virtualization layers/drivers -- ZFS can very often save
| your data in those cases, too.
| ryao wrote:
| I once managed to use ZFS to detect a bit flip on a
| machine that did not have ECC RAM. All python programs
| started crashing in libpython.so on my old desktop one
| day. I thought it was a bug in ZFS, so I started
| debugging. I compared the in-memory buffer from ARC with
| the on-disk buffer for libpython.so and found a bit flip.
| At the time, accessing a snapshot through .zfs would
| duplicate the buffer in ARC, which made it really easy to
| compare the in-memory buffer against the on-disk buffer.
| I was in shock as I did not expect to ever see one in
| person. Since then, I always insist on my computers
| having ECC.
| e12e wrote:
| Cross platform native encryption with sane fs for removable
| media.
| lazide wrote:
| Who would that help?
|
| MacOS also defaults to a non-portable FS for likely similar
| reasons, if one was being cynical.
| chillfox wrote:
| Much faster launches of applications/files you use
| regularly. The ability to roll back updates in seconds if
| they cause issues, thanks to snapshots. Fast backups with
| snapshots + zfs send/receive to a remote machine.
| Compressed disks, which both let you store more on a drive
| and make accessing files faster. Easy encryption. The
| ability to mirror 2 large USB disks so you are far less
| likely to lose data to corruption or drive failure. You
| can move your data or an entire OS install to a new
| computer easily by booting a live disk and doing a
| send/receive to the new PC.
|
| (I have never used dedup, but it's there if you want it, I
| guess)
| wkat4242 wrote:
| Snapshots (note: NTFS does have this in the form of Volume
| Shadow Copy, but it's not as easily accessible to the end
| user as it is in ZFS). Copy-on-write for reliability under
| crashes. Block checksumming for protection against data
| corruption (bitrot).
| johannes1234321 wrote:
| For a while I ran Open Solaris with ZFS as root filesystem.
|
| The key feature for me, which I miss, is the snapshotting
| integrated into the package manager.
|
| ZFS allows snapshots more or less for free (due to copy on
| write), including cron-based snapshotting every 15
| minutes. So if I made a mistake anywhere, there was a way
| to recover.
|
| And that, integrated with the update manager and boot
| manager, means that on an update a snapshot is created and
| during boot one can switch between states. I never had a
| broken update, but it gave a good feeling.
|
| On my home server I like the raid features, and on Solaris
| it was nicely integrated with NFS etc. so that one could
| easily create volumes, export them, and set restrictions
| (max size etc.) on them.
| hoherd wrote:
| Online filesystem checking and repair.
|
| Reading any file will tell you with 100% guarantee if it is
| corrupt or not.
|
| Snapshots that you can `cd` into, so you can compare any
| prior version of your FS with the live version of your FS.
|
| Block level compression.
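|
| To make the snapshot browsing concrete, a small example
| (pool/dataset and snapshot names are made up):
|
|     # every dataset exposes its snapshots under a hidden .zfs dir
|     ls /tank/data/.zfs/snapshot/
|     # compare yesterday's state against the live filesystem
|     diff -r /tank/data/.zfs/snapshot/2025-01-13/ /tank/data/
|
| Whether .zfs shows up in directory listings is controlled
| by the snapdir property (hidden by default), but you can
| cd into it either way.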
| badgersnake wrote:
| NTFS is good enough for most people, who have a laptop with one
| SSD in it.
| wkat4242 wrote:
| The benefits of ZFS don't need multiple drives to be useful.
| I'm running ZFS on root for years now and snapshots have
| saved my bacon several times. Also with block checksums you
| can at least detect bitrot. And COW is always useful.
| zamadatix wrote:
| Windows manages volume snapshots on NTFS through VSS. I
| think ZFS snapshots are a bit "cleaner" of a design, and
| the tooling is a bit friendlier IMO, but the functionality
| to snapshot, rollback, and save your bacon is there
| regardless. Outside of the automatically enabled "System
| Restore" (which only uses VSS to snapshot specific system
| files during updates) I don't think anyone bothers to use
| it though.
|
| CoW, advanced parity, and checksumming are the big ones
| NTFS lacks. CoW is just inherently not how NTFS is designed
| and checksumming isn't there. Anything else (encryption,
| compression, snapshots, ACLs, large scale, virtual devices,
| basic parity) is done through NTFS on Windows.
| wkat4242 wrote:
| Yes I know that NTFS has snapshots, I mentioned that in
| another comment. I don't think NTFS is as relevant in
| comparison though. People who choose Windows will have no
| interest in ZFS and vice versa (someone considering ZFS
| will not pick Windows).
|
| And I don't think anyone bothers to use it due to the
| lack of user-facing tooling around it. If it were as easy
| to create snapshots as it is on ZFS, more people would use
| it, I'm sure. It's just so amazing to try something out,
| screw up my system and just revert :P But VSS is more of a
| system API than a user-facing feature.
|
| VSS is also used by backup software to quiesce the
| filesystem, by the way.
|
| But yeah the others are great features. My main point was
| though that almost all the features of ZFS are very
| beneficial even on a single drive. You don't need an
| array to take advantage of Snapshots, the crash
| reliability that CoW offers, and checksumming (though you
| will lack the repair option obviously)
| EvanAnderson wrote:
| > I don't think NTFS is as relevant in comparison though.
| People who choose Windows will have no interest in ZFS
| and vice versa (someone considering ZFS will not pick
| Windows).
|
| ZFS on Windows, as a first-class supported-by-Microsoft
| option would be killer. It won't ever happen, but it
| would be great. (NTFS / VSS with filesystem/snapshot
| send/receive would "scratch" a lot of that "itch", too.)
|
| > And I don't think anyone bothers to use it due to the
| lack of user-facing tooling around it. If it were as easy
| to create snapshots as it is on ZFS, more people would use
| it, I'm sure. It's just so amazing to try something out,
| screw up my system and just revert :P But VSS is more of a
| system API than a user-facing feature.
|
| VSS on NTFS is handy and useful but in my experience
| brittle compared to ZFS snapshots. Sometimes VSS just
| doesn't work. I've had repeated cases over the years
| where accessing a snapshot failed (with traditional
| unhelpful Microsoft error messages) until the host
| machine was rebooted. Losing VSS snapshots on a volume is
| much easier than trashing a ZFS volume.
|
| VSS straddles the filesystem and application layers in a
| way that ZFS doesn't. I think that contributes to some of
| the jank (VSS writers becoming "unstable", for example).
| It also straddles hardware interfaces in a novel way that
| ZFS doesn't (using hardware snapshot functionality--
| somewhat like using a GPU versus "software rendering"). I
| think that also opens up a lot of opportunity for jank,
| as compared to ZFS treating storage as dumb blocks.
| GuB-42 wrote:
| To be honest, the situation with Linux is barely better.
|
| ZFS has license issues with Linux, preventing full integration,
| and Btrfs is 15 years in the making and still doesn't match ZFS
| in features and stability.
|
| Most Linux distros still use ext4 by default, which is 19 years
| old, but ext4 is little more than a series of extensions on top
| of ext2, which is the same age as NTFS.
|
| In all fairness, there are few OS components that are as
| critical as the filesystem, and many wouldn't touch filesystems
| that have less than a decade of proven track record in
| production.
| lousken wrote:
| As far as stability goes, btrfs is used by Meta, Synology,
| and many others, so I wouldn't say it's not stable, but
| some features are lacking.
| fourfour3 wrote:
| Do Synology actually use the multi-device options of btrfs,
| or are they using linux softraid + lvm underneath?
|
| I know Synology Hybrid RAID is a clever use of LVM + MD
| raid, for example.
| phs2501 wrote:
| I believe Synology runs btrfs on top of regular mdraid +
| lvm, possibly with patches to let btrfs checksum failures
| reach into the underlying layers to find the right data
| to recover.
|
| Related blog post: https://daltondur.st/syno_btrfs_1/
| azalemeth wrote:
| My understanding is that single-disk btrfs is good, but
| raid is decidedly dodgy;
| https://btrfs.readthedocs.io/en/latest/btrfs-
| man5.html#raid5... states that:
|
| > The RAID56 feature provides striping and parity over
| several devices, same as the traditional RAID5/6.
|
| > There are some implementation and design deficiencies
| that make it unreliable for some corner cases and *the
| feature should not be used in production, only for
| evaluation or testing*.
|
| > The power failure safety for metadata with RAID56 is not
| 100%.
|
| I have personally been bitten once (about 10 years ago) by
| btrfs just failing horribly on a single desktop drive. I've
| used either mdadm + ext4 (for /) or zfs (for large /data
| mounts) ever since. Zfs is fantastic and I genuinely don't
| understand why it's not used more widely.
| lousken wrote:
| I was assuming OP wanted to highlight filesystem use on a
| workstation/desktop, not on a file server/NAS. I had a
| similar experience a decade ago, but these days single
| drives just work, and the same goes for mirroring. For
| such setups btrfs should be stable. I've never seen a
| workstation with a raid5/6 setup. Secondly, filesystems
| and volume managers are different things, even if e.g.
| btrfs and ZFS are essentially both.
|
| For a NAS setup I would still prefer ZFS with TrueNAS
| Scale (or Proxmox if virtualization is needed), just
| because all these scenarios are supported as well. And as
| far as ZFS goes, encryption is still something I am not
| sure about, especially since I want to use snapshots and
| send them as backups to a remote machine.
| hooli_gan wrote:
| RAID5/6 is not needed with btrfs. One should use RAID1,
| which stores the same data on multiple drives in a
| redundant way.
| johnmaguire wrote:
| How can you achieve 2-disk fault tolerance using btrfs
| and RAID 1?
| Dalewyn wrote:
| By using three drives.
|
| RAID1 is just making literal copies, so each additional
| drive in a RAID1 is a self-sufficient copy. You want two
| drives of fault tolerance? Use three drives, so if you
| lose two copies you still have one left.
|
| This is of course hideously inefficient as you scale
| larger, but that is not the question posed.
| ryao wrote:
| Btrfs did not support that until Linux 5.5 when it added
| RAID1c3. With btrfs RAID1, instead of mirroring across
| all devices, it just stores 2 copies, no matter how many
| members the array has.
| johnmaguire wrote:
| > This is of course hideously inefficient as you scale
| larger, but that is not the question posed.
|
| It's not just inefficient, you literally can't scale
| larger. Mirroring is all that RAID 1 allows for. To
| scale, you'd have to switch to RAID 10, which doesn't
| allow two-disk fault tolerance (you can get lucky if the
| failures land in different mirror pairs, but that isn't
| guaranteed fault tolerance.)
|
| But you're right - RAID 1 also scales terribly compared
| to RAID 6, even before introducing striping. Imagine you
| have 6 x 16 TB disks:
|
| With RAID 6, usable space of 64 TB, two-drive fault
| tolerance.
|
| With RAID 1, usable space of 16 TB, five-drive fault
| tolerance.
|
| With RAID 10, usable space of 48 TB, one-drive fault
| tolerance.
| crest wrote:
| One problem with your setup is that ZFS by design can't
| use a traditional *nix filesystem buffer cache. Instead
| it has to use its own ARC (adaptive replacement cache)
| with end-to-end checksumming, transparent compression,
| and copy-on-write semantics. This can lead to annoying
| performance problems when the two types of file system
| caches compete for available memory. There is a back
| pressure mechanism, but it effectively pauses other
| writes while evicting dirty cache entries to release
| memory.
| ryao wrote:
| Traditionally, you have the page cache on top of the FS
| and the buffer cache below the FS, with the two being
| unified such that double caching is avoided in
| traditional UNIX filesystems.
|
| ZFS goes out of its way to avoid the buffer cache,
| although Linux does not give it the option to fully opt
| out of it since the block layer will buffer reads done by
| userland to disks underneath ZFS. That is why ZFS began
| to purge the buffer cache on every flush 11 years ago:
|
| https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c
| 3b6...
|
| That is how it still works today:
|
| https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4
| cef...
|
| If I recall correctly, the page cache is also still above
| ZFS when mmap() is used. There was talk about fixing it
| by having mmap() work out of ARC instead, but I don't
| believe it was ever done, so there is technically double
| caching done there.
| taskforcegemini wrote:
| What's the best way to deal with this, then? Disable the
| Linux file cache? I've tried disabling/minimizing the ARC
| in the past to avoid the OOM reaper, but the ARC was
| stubborn and its RAM usage remained as-is.
| ssl-3 wrote:
| I didn't have any trouble limiting zfs_arc_max to 3GB on
| one system where I felt that it was important. I ran it
| that way for a fair number of years and it always stayed
| close to that bound (if it was ever exceeded, it wasn't
| by a noteworthy amount at any time when I was looking).
|
| At the time, I had it this way because I had fear of OOM
| events causing [at least] unexpected weirdness.
|
| A few months ago I discovered weird issues with a fairly
| big, persistent L2ARC being ignored at boot due to
| insufficient ARC. So I stopped arbitrarily limiting
| zfs_arc_max and just let it do its default self-managed
| thing.
|
| So far, no issues. For me. With my workload.
|
| Are you having issues with this, or is it a theoretical
| problem?
| ryao wrote:
| These days, ZFS frees memory fast enough when Linux
| requests memory to be freed that you generally do not see
| OOM because of ZFS, but if you have a workload where it
| is not fast enough, you can limit the maximum arc size to
| try to help:
|
| https://openzfs.github.io/openzfs-
| docs/Performance%20and%20T...
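|
| For reference, a sketch of what that tuning looks like
| (the 3 GiB value is just an example; run as root):
|
|     # runtime change, value in bytes
|     echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max
|     # persistent across reboots: append to the modprobe config
|     echo "options zfs zfs_arc_max=3221225472" \
|         >> /etc/modprobe.d/zfs.conf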
| brian_cunnie wrote:
| > I have personally been bitten once (about 10 years ago)
| by btrfs just failing horribly on a single desktop drive.
|
| Me, too. The drive was unrecoverable. I had to reinstall
| from scratch.
| worthless-trash wrote:
| Licensing incompatibilities.
| _joel wrote:
| I'm similar to some other people here, I guess once they've
| been bitten by data loss due to btrfs, it's difficult to
| advocate for it.
| lousken wrote:
| I am assuming almost everybody at some point experienced
| data loss because they pulled out a flash drive too
| early. Is it safe to assume that we stopped using flash
| drives because of it?
| _joel wrote:
| I'm not sure we have stopped using flash, judging by the
| pile of USB sticks on my desk :) In relation to the fs
| analogy if you used a flash drive that you know corrupted
| your data, you'd throw it away for one you know works.
| ryao wrote:
| I once purchased a bunch of flash drives from Google's
| online swag store and just unplugging them was often
| enough to put them in a state where they claimed to be
| 8MB devices and nothing I wrote to them was ever possible
| to read back in my limited tests. I stopped using those
| fast.
| jeltz wrote:
| It is possible to corrupt the file system from user space
| as a normal user with Btrfs. The PostgreSQL devs found that
| when working on async IO. And as far as I know that issue
| has not been fixed.
|
| https://www.postgresql.org/message-id/CA%2BhUKGL-
| sZrfwcdme8j...
| curt15 wrote:
| LMDB users also unearthed a btrfs data corruption bug
| last year:
| https://bugzilla.redhat.com/show_bug.cgi?id=2169947
| xattt wrote:
| ZFS on OS X was killed because of Oracle licensing drama. I
| don't expect anything better on Windows either.
| ryao wrote:
| There is a third party port here:
|
| https://openzfsonosx.org/wiki/Main_Page
|
| It was actually the NetApp lawsuit that caused problems for
| Apple's adoption of ZFS. Apple wanted indemnification from
| Sun because of the lawsuit, Sun's CEO did not sign the
| agreement before Oracle's acquisition of Sun happened and
| Oracle had no interest in granting that, so the official
| Apple port was cancelled.
|
| I heard this second hand years later from people who were
| insiders at Sun.
| xattt wrote:
| That's a shame re: NetApp/ZFS.
|
| While third-party ports are great, they lack deep
| integration that first-party support would have brought
| (non-kludgy Time Machine which is technically fixed with
| APFS).
| BSDobelix wrote:
| >ZFS on OS X was killed because of Oracle licensing drama.
|
| Nah, it was Jobs' ego, not the license:
|
| >>Only one person at Steve Jobs' company announces new
| products: Steve Jobs.
|
| https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-
| ap...
| bolognafairy wrote:
| It's a cute story that plays into the same old assertions
| about Steve Jobs, but the conclusion is mostly baseless.
| There are many other, more credible, less conspiratorial,
| possible explanations.
| wkat4242 wrote:
| It could have played into it, though I agree the support
| contract that couldn't be worked out, mentioned elsewhere
| in the thread, is the more likely explanation.
|
| But I think these things are usually a combination. When
| a business relationship sours, agreements are suddenly
| much harder to work out. The negotiators are still people
| and they have feelings that will affect their
| decision-making.
| throw0101a wrote:
| > _ZFS on OS X was killed because of Oracle licensing
| drama._
|
| It was killed because Apple and Sun couldn't agree on a
| 'support contract'. From Jeff Bonwick, one of the co-
| creators ZFS:
|
| >> _Apple can currently just take the ZFS CDDL code and
| incorporate it (like they did with DTrace), but it may be
| that they wanted a "private license" from Sun (with
| appropriate technical support and indemnification), and the
| two entities couldn't come to mutually agreeable terms._
|
| > _I cannot disclose details, but that is the essence of
| it._
|
| * https://archive.is/http://mail.opensolaris.org/pipermail/
| zfs...
|
| Apple took DTrace, licensed via CDDL--just like ZFS--and put
| it into the kernel without issue. Of course a file system
| is much more central to an operating system, so they wanted
| much more of a CYA for that.
| mogoh wrote:
| ZFS might be better than any other FS on Linux (I don't
| judge that).
|
| But you must admit that the situation on Linux is quite a
| bit better than on Windows. Linux has so many filesystems
| in the main branch. There is a lot of development. BTRFS
| had a rocky start, but it got better.
| stephen_g wrote:
| I'm interested to know what 'full integration' does look
| like, I use ZFS in Proxmox (Debian-based) and it's really
| great and super solid, but I haven't used ZFS in more vanilla
| Linux distros. Does Proxmox have things that regular Linux is
| missing out on, or are there shortcomings and things I just
| don't realise about Proxmox?
| whataguy wrote:
| The difference is that the ZFS kernel module is included by
| default with Proxmox, whereas with e.g. Debian, you would
| need to install it manually.
| pimeys wrote:
| And you can't follow the latest kernel until the ZFS
| module supports it.
| blibble wrote:
| for Debian that's not exactly a problem
| oarsinsync wrote:
| Unless you're using Debian backports, and they backport a
| new kernel a week before the zfs backport package update
| happens.
|
| Happened to me more than once. I ended up manually
| changing the kernel version limitations the second time
| just to get me back online, but I don't recall if that
| ended up hurting me in the long run or not.
| BSDobelix wrote:
| Try CachyOS https://cachyos.org/ , you can even swap from
| an existing Arch installation:
|
| https://wiki-
| dev.cachyos.org/sk/cachyos_repositories/how_to_...
| ryao wrote:
| There is a trick for this:
|     * Step 1: Make friends with a ZFS developer.
|     * Step 2: Guilt him into writing patches to add
|       support as soon as a new kernel is released.
|     * Step 3: Enjoy.
|
| Adding support for a new kernel release to ZFS is usually
| only a few hours of work. I have done it in the past more
| than a dozen times.
| BodyCulture wrote:
| You probably don't realise how important encryption is.
|
| It's still not supported by Proxmox. Yes, you can do it
| yourself somehow, but then you are on your own, you miss
| features, and people report problems with double or
| triple filesystem layers.
|
| I do not understand how they do not have encryption out
| of the box; this seems to be a problem.
| kevinmgranger wrote:
| I'm not sure about proxmox, but ZFS on Linux does have
| encryption.
| BSDobelix wrote:
| >ZFS has license issues with Linux, preventing full
| integration
|
| No one wants that; OpenZFS is much healthier without
| Linux and its "Foundation/Politics".
| bhaney wrote:
| > No one wants that
|
| I want that
| BSDobelix wrote:
| Then let me tell you that FreeBSD or OmniOS is what you
| really want ;)
| bhaney wrote:
| You're now 0 for 2 at telling me what I want
| BSDobelix wrote:
| The customer is not always right, however a good/modern
| Filesystem really would be something for Linux ;)
| ruthmarx wrote:
| > The customer is not always right,
|
| An uninvited door-to-door salesman is rarely, if ever
| right.
| nabla9 wrote:
| The license is not a real issue. It just has to be
| distributed as a separate module. No big hurdle.
| Jnr wrote:
| From my point of view it is a real usability issue.
|
| The zfs modules are not in the official repos. You either
| have to compile them on each machine or use unofficial
| repos, which is not exactly ideal and can break things if
| those repos are not up to date. And I guess it also needs
| some additional steps for a Secure Boot setup on some
| distros?
|
| I really want to try zfs because btrfs has some issues with
| RAID5 and RAID6 (it is not recommended so I don't use it)
| but I am not sure I want to risk the overall system
| stability, I would not want to end up in a situation where
| my machines don't boot and I have to fix it manually.
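|
| For what it's worth, where a DKMS package is available the
| setup is roughly this (Debian-style package names from the
| contrib section, assumed here; other distros differ):
|
|     # kernel headers are needed so DKMS can build the module
|     apt install linux-headers-amd64 zfs-dkms zfsutils-linux
|     # the module then rebuilds automatically on kernel updates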
| chillfox wrote:
| I have been using ZFS on Mint and Alpine Linux for years
| for all drives (including root) and have never had an
| issue. It's been fantastic and is super fast. My
| linux/zfs laptop loads games much faster than an
| identical machine running Windows.
|
| I have never had data corruption issues with ZFS, but I
| have had both xfs and ext4 destroy entire disks.
| harshreality wrote:
| Why are you considering raid5/6? Are you considering
| building a large storage array? If the data will fit
| comfortably (50-60% utilization) on one drive, all you
| need is raid1. Btrfs is fine for raid1 (raid1c3 for extra
| redundancy); it might have hidden bugs, but no filesystem
| is immune from those; zfs had a data loss bug (it was
| rare, but it happened) a year ago.
|
| Why use zfs for a boot partition? Unless you're using
| every disk mounting point and nvme slot for a single
| large raid array, you can use a cheap 512GB nvme drive or
| old spare 2.5" ssd for the boot volume. Or two, in btrfs
| raid1 if you absolutely must... but do you even need
| redundancy or datasum (which can hurt performance) to
| protect OS files? Do you really care if static package
| files get corrupted? Those are easily reinstalled, and
| modern quality brand SSDs are quite reliable.
| crest wrote:
| The main hurdle is hostile Linux kernel developers who
| aren't held accountable intentionally breaking ZFS for
| their own petty ideological reasons e.g. removing the in-
| kernel FPU/SIMD register save/restore API and replacing it
| with a "new" API to do the the same.
|
| What's "new" about the "new" API? Its symbols are GPL2 only
| to deny its use to non-GPL2 modules (like ZFS). Guess
| that's an easy way to make sure that BTRFS is faster than
| ZFS or set yourself up as the (to be) injured party.
|
| Of course a reimplementation of the old API in terms of
| the new one is an evil "GPL condom" violating the kernel
| license, right? Why can't you see ZFS's CDDL license is
| the real problem here for being the wrong flavour of
| copyleft license? Way to claim the moral high ground, you
| short-sighted, bigoted pricks. _sigh_
| GuB-42 wrote:
| It is a problem because most of the internal kernel APIs
| are GPL-only, which limits the abilities of the ZFS
| module.
| It is a common source of argument between the Linux guys
| and the ZFS on Linux guys.
|
| The reason for this is not just to piss off non-GPL module
| developers. GPL-only internal APIs are subject to change
| without notice, even more so than the rest of the kernel.
| And because the licence may not allow the Linux kernel
| developers to make the necessary changes to the module when
| it happens, there is a good chance it breaks without
| warning.
|
| And even with that, _all_ internal APIs may change, it is
| just a bit less likely than for the GPL-only ones, and
| because ZFS on Linux is a separate module, there is no
| guarantee for it to not break with successive Linux
| versions, in fact, it is more like a guarantee that it will
| break.
|
| Linux is proudly monolithic, and in a constantly evolving
| monolithic kernel, the developers need to have control
| over the entire project. It is also community-driven.
| Combined, you
| need rules to have the community work together, or
| everything will break down, and that's what the GPL is for.
| bayindirh wrote:
| > Most Linux distros still use ext4 by default, which is 19
| years old, but ext4 is little more than a series of
| extensions on top of ext2, which is the same age as NTFS.
|
| However, ext4 and XFS are much simpler and more
| performant than BTRFS & ZFS as root filesystems on
| personal systems and small servers.
|
| I personally won't use either on a single disk system as root
| FS, regardless of how fast my storage subsystem is.
| ryao wrote:
| ZFS will outscale ext4 in parallel workloads with ease. XFS
| will often scale better than ext4, but if you use L2ARC and
| SLOG devices, it is no contest. On top of that, you can use
| compression for an additional boost.
|
| You might also find ZFS outperforms both of them in read
| workloads on single disks where ARC minimizes cold cache
| effects. When I began using ZFS for my rootfs, I noticed my
| desktop environment became more responsive and I attributed
| that to ARC.
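|
| A minimal sketch of the setup I am describing (pool and
| device names are made up):
|
|     # add an NVMe read cache (L2ARC) and a mirrored SLOG
|     zpool add tank cache nvme0n1
|     zpool add tank log mirror nvme1n1 nvme2n1
|     # transparent compression for the extra boost
|     zfs set compression=lz4 tank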
| bayindirh wrote:
| No doubt. I want to reiterate my point. Citing myself:
|
| > "I personally won't use either on a _single disk system
| as root FS_ , regardless of how fast my storage subsystem
| is." (emphasis mine)
|
| We are no strangers to filesystems. I personally
| benchmarked a ZFS7320 extensively, writing a
| characterization report, plus we have had a ZFS7420 for a
| very long time, complete with separate log SSDs for read
| and write on every box.
|
| However, ZFS is not saturation-proof, and it is nowhere
| near a Lustre cluster performance-wise when scaled.
|
| What kills ZFS and BTRFS on desktop systems is write
| performance, esp. on heavy workloads like system updates.
| If I need a desktop server (performance-wise), I'd
| configure it accordingly and use these, but I'd _never_
| use BTRFS or ZFS on a _single root disk_ due to their
| overhead, to reiterate myself thrice.
| ryao wrote:
| I am generally happy with the write performance of ZFS. I
| have not noticed slow system updates on ZFS (although I
| run Gentoo, so slow is relative here). In what ways is
| the write performance bad?
|
| I am one of the OpenZFS contributors (although I am less
| active as late). If you bring some deficiency to my
| attention, there is a chance I might spend the time
| needed to improve upon it.
|
| By the way, ZFS limits the outstanding IO queue depth to
| try to keep latencies down as a type of QoS, but you can
| tune it to allow larger IO queue depths, which should
| improve write performance. If your issue is related to
| that, it is an area that could use improvement in certain
| situations:
|
| https://openzfs.github.io/openzfs-
| docs/Performance%20and%20T...
|
| https://openzfs.github.io/openzfs-
| docs/Performance%20and%20T...
|
| https://openzfs.github.io/openzfs-
| docs/Performance%20and%20T...
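|
| As a hedged example of the kind of knob I mean (the
| parameter name is from the module-parameters docs;
| defaults vary by version, so treat the value as
| illustrative only):
|
|     # allow deeper async write queues per vdev
|     echo 30 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active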
| jeltz wrote:
| Not on most database workloads. There zfs does not scale
| very well.
| ryao wrote:
| Percona and many others who benchmarked this properly
| would disagree with you. Percona found that ext4 and ZFS
| performed similarly when given identical hardware (with
| proper tuning of ZFS):
|
| https://www.percona.com/blog/mysql-zfs-performance-
| update/
|
| In this older comparison where they did not initially
| tune ZFS properly for the database, they found XFS to
| perform better, only for ZFS to outperform it when tuning
| was done and a L2ARC was added:
|
| https://www.percona.com/blog/about-zfs-performance/
|
| This is roughly what others find when they take the time
| to do proper tuning and benchmarks. ZFS outscales both
| ext4 and XFS, since it is a multiple block device
| filesystem that supports tiered storage while ext4 and
| XFS are single block device filesystems (with the
| exception of supporting journals on external drives).
| They need other things to provide them with scaling to
| multiple block devices and there is no block device level
| substitute for supporting tiered storage at the
| filesystem level.
|
| That said, ZFS has a killer feature that ext4 and XFS do
| not have, which is low cost replication. You can snapshot
| and send/recv without affecting system performance very
| much, so even in situations where ZFS is not at the top
| in every benchmark such as being on equal hardware, it
| still wins, since the performance penalty of database
| backups on ext4 and XFS is huge.
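|
| The replication workflow is essentially this (names are
| made up, and the target must already hold the previous
| snapshot for the incremental to apply):
|
|     zfs snapshot tank/db@monday
|     zfs send -i tank/db@sunday tank/db@monday | \
|         ssh backuphost zfs recv -u backup/db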
| cesarb wrote:
| > Btrfs [...] still doesn't match ZFS in features [...]
|
| Isn't the feature in question (array expansion) precisely one
| which btrfs already had for a long time? Does ZFS have the
| opposite feature (shrinking the array), which AFAIK btrfs
| also already had for a long time?
|
| (And there's one feature which is important to many, "being
| in the upstream Linux kernel", that ZFS most likely will
| never have.)
| wkat4242 wrote:
| ZFS also had expansion for a long time but it was offline
| expansion. I don't know if btrfs has also had online
| expansion for a long time?
|
| And shrinking no, that is a big missing feature in ZFS IMO.
| Understandable considering its heritage (large scale
| datacenters) but nevertheless an issue for home use.
|
| But raidz is rock-solid. Btrfs' raid is not.
| unsnap_biceps wrote:
| Raidz wasn't able to be expanded in place before this.
| You were able to add to a pool that included a raidz
| vdev, but that raidz vdev was immutable.
| wkat4242 wrote:
| Oh ok, I've never done this, but I thought it was already
| there. Maybe this was the original ZFS from Sun? But
| maybe I just remember it incorrectly, sorry.
|
| I've used it on multi-drive arrays but I never had the
| need for expansion.
| ryao wrote:
| You could add top level raidz vdevs or replace the
| members of a raid-z vdev with larger disks to increase
| storage space back then. You still have those options
| now.
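|
| Roughly, those two pre-2.3 options look like this (pool
| and device names invented for the example):
|
|     # option 1: add another top-level raidz vdev
|     zpool add tank raidz2 sdg sdh sdi sdj
|     # option 2: swap members for larger disks, one at a time;
|     # capacity grows once every member has been replaced
|     zpool replace tank sdb sdk
|     zpool online -e tank sdk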
| honestSysAdmin wrote:
| https://openzfs.github.io/openzfs-
| docs/Getting%20Started/index.html
|
| ZFS runs on all major Linux distros, the source is compiled
| locally and there is no meaningful license problem. In
| datacenter and "enterprise" environments we compile ZFS
| "statically" with other kernel modules all the time.
|
| For over six years now, there has been an "experimental"
| option presented by the graphical Ubuntu installer to install the
| root filesystem on ZFS. Almost everyone I personally know
| (just my anecdote) chooses this "experimental" option. There
| has been an occasion here and there of ZFS snapshots taking
| up too much space, but other than this there have not been
| any problems.
|
| I statically compile ZFS into a kernel that intentionally
| does not support loading modules on some of my personal
| laptops. My experience has been great, others' mileage may
| (certainly will) vary.
| ryao wrote:
| What do you mean by a ZFS compatibility layer? There is a
| Windows port:
|
| https://github.com/openzfsonwindows/openzfs
|
| Note that it is a beta.
| MauritsVB wrote:
| There is occasional talk of moving the Windows implementation
| of OpenZFS
| (https://github.com/openzfsonwindows/openzfs/releases) into an
| officially supported tier, though that will probably come after
| the MacOS version (https://github.com/openzfsonosx) is
| officially supported.
| bayindirh wrote:
| > How come that Windows still uses a 32 year old file system?
|
| Simple. Because most of the burden is taken by the (enterprise)
| storage hardware hosting the FS. Snapshots, block level
| deduplication, object storage technologies, RAID/Resiliency,
| size changes, you name it.
|
| Modern storage appliances are black magic, and you don't need
| many more features from NTFS. You either transparently access
| via NAS/SAN or store your NTFS volumes on capable disk boxes.
|
| On the Linux world, at the higher end, there's Lustre and GPFS.
| ZFS is mostly for resilient, but not performance critical
| needs.
| BSDobelix wrote:
| >ZFS is mostly for resilient, but not performance critical
| needs.
|
| Los Alamos disagrees ;)
|
| https://www.lanl.gov/media/news/0321-computational-storage
|
| But yes, in general you are right, Cern for example uses
| Ceph:
|
| https://indico.cern.ch/event/1457076/attachments/2934445/515.
| ..
| bayindirh wrote:
| I think what LLNL did predates GPUDirect and the other
| new technologies that came after 2022, but that's a good
| start.
|
| CERN's Ceph is also for their "General IT" needs. Their
| clusters are independent from that. Also, most of CERN's
| processing is distributed across Europe. We are part of
| that network.
|
| Many, if not all, of the HPC centers we talk with use
| Lustre as their "immediate" storage. Also, there's Weka
| now, a closed-source storage system supporting _insane_
| speeds and tons of protocols at the same time. It is
| mostly used for and by GPU clusters around the world. You
| connect _terabits_ to such a cluster _casually_. It's all
| flash, and flat out fast.
| ryao wrote:
| Did you confuse LANL for LLNL?
| bayindirh wrote:
| It's just a typo, not a confusion, and I'm well beyond
| the edit window.
| poisonborz wrote:
| So private consumers should just pay cloud subscription if
| they want safer/modern data storage for their PC? (without
| NAS)
| bayindirh wrote:
| I think Microsoft has discontinued Windows 7 backup to
| force people to buy OneDrive subscriptions. They also
| forcefully enabled the feature when they first introduced
| it.
|
| So, I think that your answer for this question is
| "unfortunately, yes".
|
| Not that I support the situation.
| BSDobelix wrote:
| If you need Windows, you can use something like restic
| (checksums and compression) and external drives (more
| than one, stored in more than one place) to make backups.
| Plus, "maybe" but not strictly needed, ReFS (on a
| partition other than your Windows system partition),
| which is included in the Workstation/Enterprise editions
| of Windows.
|
| I trust my own backups much more than any subscription,
| not so much from a technical point of view but from an
| access point of view (e.g. losing access to your Google
| account).
|
| EDIT: You have to enable check-summing and/or compression
| for data on ReFS manually
|
| https://learn.microsoft.com/en-us/windows-
| server/storage/ref...
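|
| A bare-bones restic example of that kind of setup (the
| repository path is made up; repeat for each external
| drive):
|
|     restic -r /mnt/usb1/restic init
|     restic -r /mnt/usb1/restic backup ~/Pictures
|     # verify the repository's checksums from time to time
|     restic -r /mnt/usb1/restic check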
| bayindirh wrote:
| > I trust my own backups much more than any subscription,
| not from a technical standpoint but from an access one
| (for example, losing access to your google account).
|
| I personally use cloud storage extensively, but I keep a
| local copy with periodic rclone/borg runs. It gives me
| access from everywhere and lets me sleep well at night.
| qwertox wrote:
| NTFS has Volume Shadow Copy, which is "good enough" for
| private users if they want to create image backups while
| their system is running.
| BSDobelix wrote:
| First of all, that's not a backup, that's a snapshot, and
| NO, that's not "good enough". Tell your grandma that all
| her digitised pictures are gone because her hard drive
| exploded, or that her single most important JPEG is now
| unviewable because of bitrot.
|
| Just because someone is a private user doesn't mean the
| data is less important; often it's quite the opposite,
| for example a family album vs. your cloned git
| repository.
| tjoff wrote:
| ... VSS is used to create backups. Re-read parent.
| BSDobelix wrote:
| Not good enough: you can make 10000 backups of bitrotten
| data; if you don't have checksums on your blocks (zfs) or
| files (restic), nothing can help you. That's the same
| level of integrity as copying stuff onto your thumb
| drive.
| shrubble wrote:
| No, private consumers have a choice, since Linux and
| FreeBSD runs well on their hardware. Microsoft is too busy
| shoveling their crappy AI and convincing OEMs to put a
| second Windows button (the CoPilot button) on their
| keyboards.
| bluGill wrote:
| Probably. There are levels of backups, and a cloud
| subscription SHOULD give you copies in geographically
| separate locations with someone to help you (who probably
| isn't into computers and doesn't want to learn the complex
| details) restore when (NOT IF!) needed.
|
| I have all my backups on a NAS in the next room. This
| covers the vast majority of use cases for backups, but if
| my house burns down everything is lost. I know I'm taking
| that risk, but really I should have better. Just paying
| someone to do it all in the cloud should be better for me
| as well and I keep thinking I should do this.
|
| Of course paying someone assumes they will do their job.
| There are always incompetent companies out there to take
| your money.
| pdimitar wrote:
| My setup is similar to yours, but I also distribute my
| most important data in compressed (<5GB) encrypted
| backups to several free-tier cloud storage accounts. I
| could restore it by copying one key and running one
| script.
|
| I lost faith in most paid operators. "Whoops, this thing
| that absolutely can happen to home users, and that we're
| supposed to protect them from, actually happened to us
| and we were not prepared. We're so sorry!"
|
| Nah. Give me access to 5-15 cloud storage accounts, I'll
| handle it myself. Have done so for years.
| NoMoreNicksLeft wrote:
| Having a NAS is life-changing. Doesn't have to be some
| large 20-bay monstrosity, just something that will give you
| redundancy and has an ethernet jack.
| zamadatix wrote:
| NTFS has been extended in various ways over the years, to
| the point that what you could do with an NTFS drive 32
| years ago feels like a completely different filesystem
| compared to what you can do with it on current Windows.
|
| Honestly I really like ReFS, particularly in context of storage
| spaces, but I don't think it's relevant to Microsoft's consumer
| desktop OS where users don't have 6 drives they need to pool
| together. Don't get me wrong, I use ZFS because that's what I
| can get running on a Linux server and I'm not going to go run
| Windows Server just for the storage pooling... but ReFS +
| Storage Spaces wins my heart with the 256 MB slab approach.
| This means you can add+remove mixed sized drives and get the
| maximum space utilization for the parity settings of the pool.
| Here ZFS is only now, 10 years later, getting online
| additions of same-size or larger drives.
| nickdothutton wrote:
| OS development pretty much stopped around 2000. ZFS is from
| 2001. I don't count a new way to organise my photos or
| integrate with a search engine as "OS" though.
| doctorpangloss wrote:
| The same reason file deduplication is not enabled for client
| Windows: greed.
|
| For example, there are numerous new file systems people use:
| OneDrive, Google Drive, iCloud Storage. Do you get it?
| happosai wrote:
| The annual reminder that if Oracle wanted to contribute
| positively to the Linux ecosystem, they would update the CDDL
| license ZFS uses to be GPL compatible.
| ryao wrote:
| This is the annual reply that Oracle cannot change the OpenZFS
| license because OpenZFS contributors removed the "or any later
| version" part of the license from their contributions.
|
| By the way, comments such as yours seem to assume that Oracle
| is somehow involved with OpenZFS. Oracle has no connection with
| OpenZFS outside of owning copyright on the original OpenSolaris
| sources and a few tiny commits their employees contributed
| before Oracle purchased Sun. Oracle has its own internal ZFS
| fork and they have zero interest in bringing it to Linux. They
| want people to either go on their cloud or buy this:
|
| https://www.oracle.com/storage/nas/
| jeroenhd wrote:
| Is there a reason the OpenZFS contributors don't want to
| dual-license their code? I'm not too familiar with the CDDL
| but I'm not sure what advantage it brings to an open source
| project compared to something like GPL? Having to deal with
| DKMS is one of the reasons why I'm sticking with BTRFS for
| doing ZFS-like stuff.
| ryao wrote:
| The OpenZFS code is based on the original OpenSolaris code,
| and the license used is the CDDL because that is what
| OpenSolaris used. Dual licensing would require the current
| OpenSolaris copyright holder to agree. That is unlikely
| without writing a very big check. Further speculation is
| not a productive thing to do, but since I know a number of
| people assume that OpenSolaris copyright holder is the only
| one preventing this, let me preemptively say that it is not
| so simple. Different groups have different preferred
| licenses. Some groups cannot stand certain licenses. Other
| groups might detest the idea of dual licensing in general
| since it causes community fragmentation whenever
| contributors decide to publish changes only under 1 of the
| 2 licenses.
|
| The CDDL was designed to ensure that if Sun Microsystems
| were acquired by a company hostile to OSS, people could
| still use Sun's open source software. In particular, the
| CDDL has an explicit software patent grant. Some consider
| that to have been invaluable in preempting lawsuits from a
| certain company that would rather have ZFS be closed source
| software.
| abrookewood wrote:
| The only thing Oracle wants to "contribute positively to" is
| Larry's next yacht.
| MauritsVB wrote:
| Oracle changing the license would not make a huge difference to
| OpenZFS.
|
| Oracle only owns the copyright to the original Sun Microsystems
| code. It doesn't apply to all ZFS implementations (probably not
| OracleZFS, perhaps not IllumosZFS) but in the specific case of
| OpenZFS the majority of the code is no longer Sun code.
|
| Don't forget that SunZFS was open sourced in 2005 before Oracle
| bought Sun Microsystems in 2009. Oracle have created their own
| closed source version of ZFS but outside some Oracle shops
| nobody uses it (some people say Oracle has stopped working on
| OracleZFS altogether some time ago).
|
| Considering the forks (first from Sun to the various open
| source implementations and later the fork from open source into
| Oracle's closed source version) were such a long time ago,
| there is not that much original code left. A lot of storage
| tech, or even entire storage concepts, did not exist when Sun
| open sourced ZFS. Various ZFS implementations developed their
| own support for TRIM, or Sequential Resilvering, or Zstd
| compression, or Persistent L2ARC, or Native ZFS Encryption, or
| Fusion Pools, or Allocation Classes, or dRAID, or RAIDZ
| expansion long after 2005. That is why the majority of the
| code in OpenZFS 2 is from long after the fork from Sun code
| twenty years ago.
|
| Modern OpenZFS contains new code contributions from Nexenta
| Systems, Delphix, Intel, iXsystems, Datto, Klara Systems and a
| whole bunch of other companies that have voluntarily offered
| their code when most of the non-Oracle ZFS implementations
| merged to become OpenZFS 2.0.
|
| If you'd want to relicense OpenZFS you could get Oracle to
| agree for the bit under Sun copyright but for the majority of
| the code you'd have to get a dozen or so companies to agree to
| relicensing their contributions (probably not that hard) and
| many hundreds of individual contributors over two decades (a
| big task and probably not worth it).
| abrookewood wrote:
| Can someone provide details on this bit please? "Direct IO:
| Allows bypassing the ARC for reads/writes, improving performance
| in scenarios like NVMe devices where caching may hinder
| efficiency".
|
| ARC is based in RAM, so how could it reduce performance when used
| with NVMe devices? They are fast, but they aren't RAM-fast ...
| nolist_policy wrote:
| Because with a cache (the ARC) you have to copy from the
| app to the cache and then DMA to disk. With Direct IO you
| can DMA directly from the app's RAM to the disk.
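|
| If I read the 2.3 release notes right, this is exposed as
| a per-dataset property (treat the exact names and values
| below as my reading of the docs rather than gospel):
|
|     # honor O_DIRECT only when an app asks for it (default)
|     zfs set direct=standard tank/general
|     # force ARC bypass for a fast NVMe scratch dataset
|     zfs set direct=always tank/nvme-scratch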
| philjohn wrote:
| Yes - interested in this too. Is this for both ARC and L2ARC,
| or just L2ARC?
| jakedata wrote:
| Happy to see the ARC bypass for NVMe performance. ZFS really
| fails to exploit NVMe's potential. Online expansion might be
| interesting. I tried to use ZFS for some very busy databases and
| ended up getting bitten badly by the fragmentation bug. The only
| way to restore performance appears to be copying the data off the
| volume, nuking it and then copying it back. Now -perhaps- if I
| expand the zpool then I might be able to reduce fragmentation by
| copying the tablespace on the same volume.
| bitmagier wrote:
| Marvelous!
| wkat4242 wrote:
| Note: This is online expansion. Expansion was always possible but
| you did need to take the array down to do it. You could also move
| to bigger drives but you also had to do that one at a time (and
| only gain the new capacity once all drives were upgraded of
| course)
|
| As far as I know shrinking a pool is still not possible though.
| So if you have a pool with 5 drives and add a 6th, you can't go
| back to 5 drives even if there is very little data in it.
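|
| For reference, the new expansion is driven by attaching a
| disk to an existing raidz vdev; a sketch with made-up
| pool, vdev and device names:
|
|     # OpenZFS 2.3+: grow a raidz vdev by one disk, online
|     zpool attach tank raidz1-0 sdf
|     # progress is reported by zpool status while it rewrites
|     zpool status tank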
| shepherdjerred wrote:
| How does ZFS compare to btrfs? I'm currently using btrfs for my
| home server, but I've had some strange troubles with it. I'm
| thinking about switching to ZFS, but I don't want to end up in
| the same situation.
| ryao wrote:
| I first tried btrfs 15 years ago with Linux 2.6.33-rc4 if I
| recall. It developed an unlinkable file within 3 days, so I
| stopped using it. Later, I found ZFS. It had a few less
| significant problems, but I was a CS student at the time and I
| thought I could fix them since they seemed minor in comparison
| to the issue I had with btrfs, so over the next 18 months, I
| solved all of the problems that it had that bothered me and
| sent the patches to be included in the then ZFSOnLinux
| repository. My effort helped make it production ready on Linux.
| I have used ZFS ever since and it has worked well for me.
|
| If btrfs had been in better shape, I would have been a btrfs
| contributor. Unfortunately for btrfs, it not only was in bad
| shape back then, but other btrfs issues continued to bite me
| every time I tried it over the years for anything serious (e.g.
| frequent ENOSPC errors when there is still space). ZFS on the
| other hand just works. Myself and many others did a great deal
| of work to ensure it works well.
|
| The main reason for the difference is that ZFS had a very solid
| foundation, which was achieved by having some fantastic
| regression testing facilities. It has a userland version that
| randomly exercises the code to find bugs before they occur in
| production and a test suite that is run on every proposed
| change to help shake out bugs.
|
| ZFS also has more people reviewing proposed changes than other
| filesystems. The Btrfs developers will often state that there
| is a significant man power difference between the two file
| systems. I vaguely recall them claiming the difference was a
| factor of 6.
|
| Anyway, few people who use ZFS regret it, so I think you will
| find you like it too.
| parshimers wrote:
| btrfs has similar aims to ZFS, but is far less mature. I
| used it for my root partitions because it doesn't need
| DKMS, but had many troubles. I used it in a fairly simple
| way, just a mirror. One day, one of the drives in the
| array started to have issues, and btrfs fell on its face.
| It remounted everything read-only, if I remember
| correctly, and would not run in degraded mode by default.
| Even mdraid would do better than this, without
| checksumming and so forth. ZFS likewise says that the
| array is degraded, but of course still allows it to be
| used. The fact that the default behavior was not really
| RAID, because it's literally missing the R part of
| reading the data back, made me lose any faith in it. I
| moved to ZFS and haven't had issues since. There is much
| more of a community and lots of good tooling around it.
___________________________________________________________________
(page generated 2025-01-14 23:01 UTC)