[HN Gopher] ZFS 2.3 released with ZFS raidz expansion
       ___________________________________________________________________
        
       ZFS 2.3 released with ZFS raidz expansion
        
       Author : scrp
       Score  : 345 points
       Date   : 2025-01-14 07:08 UTC (15 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | scrp wrote:
        | After years in the making, ZFS raidz expansion is finally here.
        | 
        | Major features added in this release:
        | 
        |   - RAIDZ Expansion: Add new devices to an existing RAIDZ pool,
        |     increasing storage capacity without downtime.
        |   - Fast Dedup: A major performance upgrade to the original
        |     OpenZFS deduplication functionality.
        |   - Direct IO: Allows bypassing the ARC for reads/writes,
        |     improving performance in scenarios like NVMe devices where
        |     caching may hinder efficiency.
        |   - JSON: Optional JSON output for the most used commands.
        |   - Long names: Support for file and directory names up to 1023
        |     characters.
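        | 
        | For the expansion itself, the new disk is attached to an
        | existing raidz vdev with the usual attach command, and the JSON
        | output is a new flag on the existing tools. A rough sketch
        | (pool/device names are made up; check the man pages for exact
        | syntax):
        | 
        |     # grow an existing raidz2 vdev by one disk
        |     zpool attach tank raidz2-0 /dev/sdf
        |     # watch the expansion, now with optional JSON output
        |     zpool status -j tank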
        
         | cm2187 wrote:
         | But I presume it is still not possible to remove a vdev.
        
           | mustache_kimono wrote:
           | Is this possible elsewhere (re: other filesystems)?
        
             | cm2187 wrote:
             | It is possible with windows storage space (remove drive
             | from a pool) and mdadm/lvm (remove disk from a RAID array,
             | remove volume from lvm), which to me are the two major
             | alternatives. Don't know about unraid.
        
               | mustache_kimono wrote:
               | > It is possible with windows storage space (remove drive
               | from a pool) and mdadm/lvm (remove disk from a RAID
               | array, remove volume from lvm), which to me are the two
               | major alternatives. Don't know about unraid.
               | 
               | Perhaps I am misunderstanding you, but you can offline
               | and remove drives from a ZFS pool.
               | 
               | Do you mean WSS and mdadm/lvm will allow an automatic
               | live rebalance and then reconfigure the drive topology?
        
               | cm2187 wrote:
               | So for instance I have a ZFS pool with 3 HDD data vdevs,
               | and 2 SSD special vdevs. I want to convert the two SSD
               | vdevs into a single one (or possibly remove one of them).
               | From what I read the only way to do that is to destroy
               | the entire pool and recreate it (it's in a server in a
               | datacentre, don't want to reupload that much data).
               | 
               | In windows, you can set a disk for removal, and as long
               | as the other disks have enough space and are compatible
               | with the virtual disks (eg you need at least 5 disks if
               | you have parity with number of columns=5), it will
               | rebalance the blocks onto the other disks until you can
               | safely remove the disk. If you use thin provisioning, you
               | can also change your mind about the settings of a virtual
               | disk, create a new one on the same pool, and move the
               | data from one to the other.
               | 
                | Mdadm/lvm will do the same, albeit with more of a pain in
                | the arse, as RAID requires resilvering not just the
                | occupied space but also the free space, so it takes a lot
                | more time and IO than it should.
                | 
                | It's one of my beefs with ZFS: there are lots of
                | no-return decisions. That, and I ran into some race
                | conditions with loading a ZFS array on boot with NVMe
                | drives on Ubuntu. The drives seem to not be ready yet,
                | resulting in randomly degraded arrays. Fixed by loading
                | the pool with a delay.
        
               | ryao wrote:
               | The man page says that your example is doable with zpool
               | remove:
               | 
               | https://openzfs.github.io/openzfs-
               | docs/man/master/8/zpool-re...
        
               | formerly_proven wrote:
               | My understanding is that ZFS does virtual <-> physical
               | translation in the vdev layer, i.e. all block references
               | in ZFS contain a (vdev, vblock) tuple, and the vdev knows
               | how to translate that virtual block offset into actual
               | on-disk block offset(s).
               | 
               | This kinda implies that you can't actually _remove_ data
                | vdevs, because in practice you can't rewrite all
               | references. You also can't do offline deduplication
               | without rewriting references (i.e. actually touching the
               | files in the filesystem). And that's why ZFS can't
               | deduplicate snapshots after the fact.
               | 
               | On the other hand, reshaping a vdev is possible, because
               | that "just" requires shuffling the vblock -> physical
               | block associations _inside_ the vdev.
        
               | ryao wrote:
               | There is a clever trick that is used to make top level
               | removal work. The code will make the vdev readonly. Then
               | it will copy its contents into free space on other vdevs
               | (essentially, the contents will be stored behind the
                | scenes in a file). Finally, it will redirect reads of
                | that vdev to the stored copy. This indirection allows
               | you to remove the vdev. It is not implemented for raid-z
               | at present though.
        
               | formerly_proven wrote:
               | Though the vdev itself still exists after doing that? It
               | just happens to be backed by, essentially, a "file" in
               | the pool, instead of the original physical block devices,
               | right?
        
               | ryao wrote:
               | Yes.
        
               | Sesse__ wrote:
               | > Do you mean WSS and mdadm/lvm will allow an automatic
                | live rebalance and then reconfigure the drive topology?
               | 
               | mdadm can convert RAID-5 to a larger or smaller RAID-5,
               | RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or
               | the other way around, RAID-0 to a degraded RAID-5, and
               | many other fairly reasonable operations, while the array
               | is online, resistant to power loss and the likes.
               | 
               | I wrote the first version of this md code in 2005
               | (against kernel 2.6.13), and Neil Brown rewrote and
               | mainlined it at some point in 2006. ZFS is... a bit late
               | to the party.
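                | 
                | For reference, a typical md grow/reshape looks something
                | like this (illustrative device names; some conversions
                | also need a --backup-file):
                | 
                |     mdadm --add /dev/md0 /dev/sde
                |     mdadm --grow /dev/md0 --raid-devices=5
                |     # or convert RAID-5 to RAID-6
                |     mdadm --grow /dev/md0 --level=6 --raid-devices=6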
        
               | ryao wrote:
                | Doing this with the on-disk data in a Merkle tree is much
                | harder than doing it on more conventional forms of
               | storage.
               | 
               | By the way, what does MD do when there is corrupt data on
               | disk that makes it impossible to know what the correct
               | reconstruction is during a reshape operation? ZFS will
               | know what file was damaged and proceed with the undamaged
               | parts. ZFS might even be able to repair the damaged data
               | from ditto blocks. I don't know what the MD behavior is,
               | but its options for handling this are likely far more
               | limited.
        
               | Sesse__ wrote:
               | Well, then they made a design choice in their RAID
               | implementation that made fairly reasonable things hard.
               | 
               | I don't know what md does if the parity doesn't match up,
               | no. (I've never ever had that happen, in more than 25
               | years of pretty heavy md use on various disks.)
        
               | ryao wrote:
               | I am not sure if reshaping is a reasonable thing. It is
               | not so reasonable in other fields. In architecture, if
               | you build a bridge and then want more lanes, you usually
               | build a new bridge, rather than reshape the bridge. The
               | idea of reshaping a bridge while cars are using it would
               | sound insane there, yet that is what people want from
               | storage stacks.
               | 
               | Reshaping traditional storage stacks does not consider
               | all of the ways things can go wrong. Handling all of them
               | well is hard, if not impossible to do in traditional
               | RAID. There is a long history of hardware analogs to MD
               | RAID killing parity arrays when they encounter silent
               | corruption that makes it impossible to know what is
               | supposed to be stored there. There is also the case where
               | things are corrupted such that there is a valid
               | reconstruction, but the reconstruction produces something
               | wrong silently.
               | 
               | Reshaping certainly is easier to do with MD RAID, but the
               | feature has the trade off that edge cases are not handled
               | well. For most people, I imagine that risk is fine until
               | it bites them. Then it is not fine anymore. ZFS made an
               | effort to handle all of the edge cases so that they do
               | not bite people and doing that took time.
        
               | Sesse__ wrote:
               | > I am not sure if reshaping is a reasonable thing.
               | 
               | Yet people are celebrating when ZFS adds it. Was it all
               | for nothing?
        
               | ryao wrote:
               | People wanted it, but it was very hard to do safely.
               | While ZFS now can do it safely, many other storage
               | solutions cannot.
               | 
               | Those corruption issues I mentioned, where the RAID
               | controller has no idea what to do, affect far more than
               | just reshaping. They affect traditional RAID arrays when
               | disks die and when patrol scrubs are done. I have not
               | tested MD RAID on edge cases lately, but the last time I
               | did, I found MD RAID ignored corruption whenever
               | possible. It would not detect corruption in normal
               | operation because it assumed all data blocks are good
               | unless SMART said otherwise. Thus, it would randomly
               | serve bad data from corrupted mirror members and always
               | serve bad data from RAID 5/6 members whenever the data
               | blocks were corrupted. This was particularly tragic on
               | RAID 6, where MD RAID is hypothetically able to detect
               | and correct the corruption if it tried. Doing that would
               | come with such a huge performance overhead that it is
               | clear why it was not done.
               | 
               | Getting back to reshaping, while I did not explicitly
               | test it, I would expect that unless a disk is missing or
               | disappears during a reshape, MD RAID would ignore any
               | corruption that can be detected using parity and assume
               | all data blocks are good just like it does in normal
               | operation. It does not make sense for MD RAID to look for
               | corruption during a reshape operation, since not only
               | would it be slower, but even if it finds corruption, it
               | has no clue how to correct the corruption unless RAID 6
               | is used, there are no missing/failed members and the
               | affected stripe does not have any read errors from SMART
               | detecting a bad sector that would effectively make it as
               | if there was a missing disk.
               | 
                | You could do your own tests. You should find that ZFS
                | gracefully handles edge cases where the wrong thing is in
                | a spot where something important should be, while MD
                | RAID does not. MD RAID is a reimplementation of a
               | technology from the 1960s. If 1960s storage technology
               | handled these edge cases well, Sun Microsystems would not
               | have made ZFS to get away from older technologies.
        
               | amluto wrote:
               | I've experienced bit rot on md. It was not fun, and the
               | tooling was of approximately no help recovering.
        
               | TiredOfLife wrote:
                | Storage Spaces doesn't dedicate a drive to a single
                | purpose. It operates in chunks (256MB, I think). So one
                | drive can, at the same time, be part of a mirror and
                | raid-5 and raid-0. This allows fully using drives of
                | various sizes. And choosing to remove a drive will cause
                | it to redistribute the chunks to other available drives,
                | without going offline.
        
               | cm2187 wrote:
               | And as a user it seems to me to be the most elegant
               | design. The quality of the implementation (parity write
               | performance in particular) is another matter.
        
               | lloeki wrote:
               | IIUC the ask (I have a hard time wrapping my head around
               | zfs vernacular), btrfs allows this at least in some
               | cases.
               | 
               | If you can convince _btrfs balance_ to not use the dev to
               | remove it will simply rebalance data to the other devs
               | and then you can _btrfs device remove_.
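                | 
                | Something like this (mountpoint and device names are
                | placeholders); the remove command relocates the data off
                | the device by itself:
                | 
                |     btrfs device remove /dev/sdd /mnt/pool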
        
             | c45y wrote:
             | Bcachefs allows it
        
               | eptcyka wrote:
               | Cool, just have to wait before it is stable enough for
               | daily use of mission critical data. I am personally
               | optimistic about bcachefs, but incredibly pessimistic
               | about changing filesystems.
        
               | ryao wrote:
               | It seems easier to copy data to a new ZFS pool if you
               | need to remove RAID-Z top level vdevs. Another
               | possibility is to just wait for someone to implement it
               | in ZFS. ZFS already has top level vdev removal for other
               | types of vdevs. Support for top level raid-z vdev removal
               | just needs to be implemented on top of that.
        
             | unixhero wrote:
             | Btrfs
        
               | tw04 wrote:
                | Except you shouldn't use btrfs for any parity-based raid
                | if you value your data at all. In fact, I'm not aware of
                | any vendor that has implemented btrfs with parity-based
                | raid; they all resort to btrfs on md.
        
           | ryao wrote:
           | That was added a while ago:
           | 
           | https://openzfs.github.io/openzfs-docs/man/master/8/zpool-
           | re...
           | 
           | It works by making a readonly copy of the vdev being removed
           | inside the remaining space. The existing vdev is then
            | removed. Data can still be accessed from the copy, but new
            | writes will go to an actual vdev, while space on the copy is
            | gradually reclaimed as the old data is no longer needed.
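            | 
            | In practice that's just (pool/vdev names are examples):
            | 
            |     zpool remove tank mirror-1
            |     # the removed vdev should show up as an "indirect-N"
            |     # entry afterwards
            |     zpool status tank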
        
             | lutorm wrote:
             | Although "Top-level vdevs can only be removed if the
             | primary pool storage does not contain a top-level raidz
             | vdev, all top-level vdevs have the same sector size, and
             | the keys for all encrypted datasets are loaded."
        
               | ryao wrote:
               | I forgot we still did not have that last bit implemented.
               | However, it is less important now that we have expansion.
        
               | cm2187 wrote:
               | And in my case all the vdevs are raidz
        
         | jdboyd wrote:
         | The first 4 seem like really big deals.
        
           | snvzz wrote:
            | The fifth is too, once you consider non-ASCII names.
        
             | GeorgeTirebiter wrote:
             | Could someone show a legit reason to use 1000-character
             | filenames? Seems to me, when filenames are long like that,
             | they are actually capturing several KEYS that can be easily
             | searched via ls & re's. e.g.
             | 
             | 2025-Jan-14-1258.93743_Experiment-2345_Gas-
             | Flow-375.3_etc_etc.dat
             | 
             | But to me this stuff should be in metadata. It's just that
             | we don't have great tools for grepping the metadata.
             | 
              | Heck, the original Macintosh File System (MFS) had no true
              | hierarchical subdirectories. Instead, the illusion of
              | subdirectories was created by embedding folder-like names
              | into the filenames themselves.
              | 
              | This was done by using colons (:) as separators in
              | filenames. A file named Folder:Subfolder:File would appear
              | to belong to a subfolder within a folder. This was entirely
              | a user interface convention managed by the Finder.
              | Internally, MFS stored all files in a flat namespace, with
              | no actual directory hierarchy in the filesystem structure.
             | 
             | So, there is 'utility' in "overloading the filename space".
             | But...
        
               | p_l wrote:
               | > Could someone show a legit reason to use 1000-character
               | filenames?
               | 
                | 1023 _byte_ names can mean less than 250 characters due
                | to the use of Unicode and UTF-8. Add to that Unicode
                | normalization, which might "expand" some characters into
                | two or more combining characters, deliberate use of
                | combining characters, emoji, rare characters, and you
                | might end up with many "characters" taking more than 4
                | bytes. A single "country flag" character will usually be
                | 8 bytes, most emoji will be at least 4 bytes, skin tone
                | modifiers will add 4 bytes, etc.
               | 
               | this ' ' takes 27 bytes in my terminal, '' takes 28,
               | another combo I found is 35 bytes.
               | 
                | And that's on top of just getting a long title using,
                | let's say, CJK or another less common script - an early
                | manuscript of a somewhat successful Japanese novel has a
                | non-normalized filename of 119 bytes, and it's nowhere
                | close to actually long titles, something that someone
                | might reasonably have on disk. A random find on the
                | internet easily points to a book title that takes over
                | 300 bytes in non-normalized UTF-8.
                | 
                | P.S. The proper title of "Robinson Crusoe", if used as a
                | filename, takes at least 395 bytes...
        
               | p_l wrote:
               | hah. Apparently HN eradicated the carefully pasted
               | complex unicode emojis.
               | 
                | The first was "man+woman kissing" with skin tone
                | modifier, then there were a few flags.
        
         | eatbitseveryday wrote:
         | > RAIDZ Expansion: Add new devices to an existing RAIDZ pool,
         | increasing storage capacity without downtime.
         | 
         | More specifically:
         | 
         | > A new device (disk) can be attached to an existing RAIDZ vdev
        
         | BodyCulture wrote:
         | How well tested is this in combination with encryption?
         | 
         | Is the ZFS team handling encryption as a first class priority
         | at all?
         | 
         | ZFS on Linux inherited a lot of fame from ZFS on Solaris, but
         | everyone using it in production should study the issue tracker
         | very well for a realistic impression of the situation.
        
           | p_l wrote:
            | Main issue with encryption is occasional attempts by a
            | certain (specific) Linux kernel developer to lock ZFS out of
            | access to advanced instruction set extensions (far from the
            | only weird idea of that specific developer).
            | 
            | The way ZFS encryption is layered, the features should be
            | pretty much orthogonal to each other, but I'll admit that
            | there's a bit lacking with ZFS native encryption (though
            | mainly in upper-layer tooling, in my experience, rather than
            | the actual on-disk encryption parts).
        
       | uniqueuid wrote:
       | It's good to see that they were pretty conservative about the
       | expansion.
       | 
       | Not only is expansion completely transparent and resumable, it
       | also maintains redundancy throughout the process.
       | 
       | That said, there is one tiny caveat people should be aware of:
       | 
       | > After the expansion completes, old blocks remain with their old
       | data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2
       | parity), but distributed among the larger set of disks. New
       | blocks will be written with the new data-to-parity ratio (e.g. a
       | 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data
       | to 2 parity).
        
         | chungy wrote:
         | I'm not sure that's really a caveat, it just means old data
         | might be in an inoptimal layout. Even with that, you still get
         | the full benefits of raidzN, where up to N disks can completely
         | fail and the pool will remain functional.
        
           | stavros wrote:
           | Is that the case? What if I expand a 3-1 array to 3-2? Won't
           | the old blocks remain 3-1?
        
             | Timshel wrote:
              | I don't believe it supports adding parity drives, only data
              | drives.
        
               | stavros wrote:
               | Ahh interesting, thanks.
        
               | bmicraft wrote:
               | Since preexisting blocks are kept at their current parity
               | ratio and not modified (only redistributed among all
               | devices), increasing the parity level of new blocks won't
               | really be useful in practice anyway.
        
           | crote wrote:
           | I think it's a huge caveat, because it makes upgrades a lot
           | less efficient than you'd expect.
           | 
           | For example, home users generally don't want to buy all of
           | their storage up front. They want to add additional disks as
           | the array fills up. Being able to start with a 2-disk raidz1
           | and later upgrade that to a 3-disk and eventually 4-disk
           | array is amazing. It's a lot less amazing if you end up with
            | 55% storage efficiency rather than the 66% you'd ideally get
           | from a 2-disk to 3-disk upgrade. That's 11% of your total
           | disk capacity wasted, without any benefit whatsoever.
        
             | chungy wrote:
             | It still seems pretty minor. If you want extreme
             | optimization, feel free to destroy the pool and create it
             | new, or create it with the ideal layout from the beginning.
             | 
             | Old data still works fine, the same guarantees RAID-Z
             | provides still hold. New data will be written with the new
             | data layout.
        
             | bmicraft wrote:
             | Well, when you start a raidz with 2 devices you've already
             | done goofed. Start with a mirror or at least 3 devices.
             | 
              | Also, if you don't wait to upgrade until the disks are at
              | 100% utilization (which you should never do! upwards of
              | ~85% utilization you're creating massive fragmentation),
              | efficiency in the real world will be better.
        
             | ryao wrote:
             | You have a couple options:
             | 
             | 1. Delete the snapshots and rewrite the files in place like
             | how people do when they want to rebalance a pool.
             | 
             | 2. Use send/receive inside the pool.
             | 
             | Either one will make the data use the new layout. They both
             | carry the caveat that reflinks will not survive the
             | operation, such that if you used reflinks to deduplicate
             | storage, you will find the deduplication effect is gone
             | afterward.
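              | 
              | A rough sketch of option 2 (dataset names are just
              | placeholders; double-check snapshots and mountpoints before
              | destroying anything):
              | 
              |     zfs snapshot -r tank/data@rebalance
              |     zfs send -R tank/data@rebalance | zfs receive tank/data.new
              |     zfs destroy -r tank/data
              |     zfs rename tank/data.new tank/data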
        
         | rekoil wrote:
          | Yeah, it's a pretty huge caveat to be honest.
          | 
          |     Da1 Db1 Dc1 Pa1 Pb1
          |     Da2 Db2 Dc2 Pa2 Pb2
          |     Da3 Db3 Dc3 Pa3 Pb3
          |     ___ ___ ___ Pa4 Pb4
          | 
          | ___ represents free space. After expansion by one disk you
          | would logically expect something like:
          | 
          |     Da1 Db1 Dc1 Da2 Pa1 Pb1
          |     Db2 Dc2 Da3 Db3 Pa2 Pb2
          |     Dc3 ___ ___ ___ Pa3 Pb3
          |     ___ ___ ___ ___ Pa4 Pb4
          | 
          | But as I understand it, it would actually expand to:
          | 
          |     Da1 Db1 Dc1 Dd1 Pa1 Pb1
          |     Da2 Db2 Dc2 Dd2 Pa2 Pb2
          |     Da3 Db3 Dc3 Dd3 Pa3 Pb3
          |     ___ ___ ___ ___ Pa4 Pb4
         | 
         | Where the Dd1-3 blocks are just wasted. Meaning by adding a new
         | disk to the array you're only expanding _free storage_ by
         | 25%... So say you have 8TB disks for a total of 24TB of storage
         | free originally, and you have 4TB free before expansion, you
         | would have 5TB free after expansion.
         | 
         | Please tell me I've misunderstood this, because to me it is a
         | pretty useless implementation if I haven't.
        
           | ryao wrote:
            | ZFS RAID-Z does not have parity disks. The parity and data
            | are interleaved to allow data reads to be done from all disks
            | rather than just the data disks.
           | 
           | The slides here explain how it works:
           | 
           | https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
           | 
           | Anyway, you are not entirely wrong. The old data will have
           | the old parity:data ratio while new data will have the new
           | parity:data ratio. As old data is freed from the vdev, new
           | writes will use the new parity:data ratio. You can speed this
           | up by doing send/receive, or by deleting all snapshots and
           | then rewriting the files in place. This has the caveat that
           | reflinks will not survive the operation, such that if you
           | used reflinks to deduplicate storage, you will find the
           | deduplication effect is gone afterward.
        
             | chungy wrote:
             | To be fair, RAID5/6 don't have parity disks either. RAID2,
             | RAID3, and RAID4 do, but they're all effectively dead
             | technology for good reason.
             | 
             | I think it's easy for a lot of people to conceptualize
             | RAID5/6 and RAID-Zn as having "data disks" and "parity
             | disks" to wrap around the complicated topic of how it
             | works, but all of them truly interleave and compute parity
             | data across all disks, allowing any single disk to die.
             | 
             | I've been of two minds on the persistent myth of "parity
             | disks" but I usually ignore it, because it's a convenient
             | lie to understand your data is safe, at least. It's also a
             | little bit the same way that raidz1 and raidz2 are
             | sometimes talked about as "RAID5" and "RAID6"; the
             | effective benefits are the same, but the implementation is
             | totally different.
        
           | magicalhippo wrote:
           | Unless I misunderstood you, you're describing more how
           | classical RAID would work. The RAID-Z expansion works like
           | you note you would logically expect. You added a drive with
           | four blocks of free space, and you end up with four blocks
           | more of free space afterwards.
           | 
           | You can see this in the presentation[1] slides[2].
           | 
           | The reason this is sub-optimal post-expansion is because, in
           | your example, the old maximal stripe width is lower than the
           | post-expansion maximal stripe width.
           | 
            | Your example is a bit unfortunate in terms of allocated
            | blocks vs layout, but if we tweak it slightly, then
            | 
            |     Da1 Db1 Dc1 Pa1 Pb1
            |     Da2 Db2 Dc2 Pa2 Pb2
            |     Da3 Db3 Pa3 Pb3 ___
            | 
            | would after RAID-Z expansion become
            | 
            |     Da1 Db1 Dc1 Pa1 Pb1 Da2
            |     Db2 Dc2 Pa2 Pb2 Da3 Db3
            |     Pa3 Pb3 ___ ___ ___ ___
            | 
            | I.e. you added a disk with 3 new blocks, and so total free
            | space after is 1+3 = 4 blocks.
            | 
            | However, if the same data was written in the post-expanded
            | vdev configuration, it would have become
            | 
            |     Da1 Db1 Dc1 Dd1 Pa1 Pb1
            |     Da2 Db2 Dc2 Dd2 Pa2 Pb2
            |     ___ ___ ___ ___ ___ ___
            | 
            | I.e. you'd have 6 free blocks, not just 4.
           | 
           | Of course this doesn't count for writes which end up taking
           | less than the maximal stripe width.
           | 
           | [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
           | 
           | [2]:
           | https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
        
             | ryao wrote:
             | Your diagrams have some flaws too. ZFS has a variable
             | stripe size. Let's say you have a 10 disk raid-z2 vdev that
             | is ashift=12 for 4K columns. If you have a 4K file, 1 data
             | block and 2 parity blocks will be written. Even if you
             | expand the raid-z vdev, there is no savings to be had from
             | the new data:parity ratio. Now, let's assume that you have
             | a 72K file. Here, you have 18 data blocks and 6 parity
             | blocks. You would benefit from rewriting this to use the
             | new data:parity ratio. In this case, you would only need 4
             | parity blocks. ZFS does not rewrite it as part of the
             | expansion, however.
             | 
             | There are already good diagrams in your links, so I will
             | refrain from drawing my own with ASCII. Also, ZFS will vary
             | which columns get parity, which is why the slides you
             | linked have the parity at pseudo-random locations. It was
             | not a quirk of the slide's author. The data is really laid
             | out that way.
        
               | magicalhippo wrote:
               | What are the errors? I tried to show exactly what you
               | talk about.
               | 
               | edit: ok, I didn't consider the exact locations of the
               | parity, I was only concerned with space usage.
               | 
               | The 8 data blocks need three stripes on a 3+2 RAID-Z2
               | setup both pre and post expansion, the last being a
               | partial stripe, but when written in the 4+2 setup only
               | needs 2 full stripes, leading to more total free space.
        
         | wjdp wrote:
          | The caveat is very much expected; you should expect ZFS
          | features to not rewrite existing blocks. Changes to settings
          | only apply to new data, for example.
        
       | cgeier wrote:
       | This is huge news for ZFS users (probably mostly those in the
       | hobbyist/home use space, but still). raidz expansion has been one
       | of the most requested features for years.
        
         | jfreax wrote:
          | I'm not yet familiar with ZFS and couldn't find it in the
          | release notes: does expansion only work with disks of the same
          | size? Or is adding bigger/smaller disks possible, or do all
          | disks need to have the same size?
        
           | shiroiushi wrote:
           | As far as I understand, ZFS doesn't work at all with disks of
           | differing sizes (in the same array). So if you try it, it
           | just finds the size of the smallest disk, and uses that for
           | all disks. So if you put an 8TB drive in an array with a
           | bunch of 10TB drives, they'll all be treated as 8TB drives,
           | and the extra 2TB will be ignored on those disks.
           | 
           | However, if you replace the smallest disk with a new, larger
           | drive, and resilver, then it'll now use the new smallest disk
           | as the baseline, and use that extra space on the other
           | drives.
           | 
           | (Someone please correct me if I'm wrong.)
        
             | mustache_kimono wrote:
             | > As far as I understand, ZFS doesn't work at all with
             | disks of differing sizes (in the same array).
             | 
             | This might be misleading, however, it may only be my
             | understanding of word "array".
             | 
             | You can use 2x10TB mirrors as vdev0, and 6x12TB in RAIDZ2
             | as vdev1 in the same pool/array. You can also stack as many
             | unevenly sized disks as you want in a pool. The actual
             | problem is when you want a different drive topology within
             | a pool or vdev, or you want to mismatch, say, 3 oddly sized
             | drives to create some synthetic redundancy level (2x4TB and
             | 1x8TB to achieve two copies on two disks) like btrfs
             | does/tries to do.
        
             | tw04 wrote:
             | This is the case with any parity based raid, they just hide
              | it or lie to you in various ways. If you have two 6TB
              | drives and two 12TB drives in a single raid-6 array, it is
             | physically impossible to have two drive parity once you
             | exceed 12TB of written capacity. BTRFS and bcachefs can't
             | magically create more space where none exists on your 6TB
             | drives. They resort to dropping to mirror protection for
             | the excess capacity which you could also do manually with
             | ZFS by giving it partitions instead of the whole drive.
        
           | chasil wrote:
           | IIRC, you could always replace drives in a raidset with
           | larger devices. When the last drive is replaced, then the new
           | space is recognized.
           | 
           | This new operation seems somewhat more sophisticated.
        
           | zelcon wrote:
            | You need to buy the same exact drive with the same capacity
            | and speed. Your raidz vdev will be as small and as slow as
            | your smallest and slowest drive.
           | 
           | btrfs and the new bcachefs can do RAID with mixed drives, but
           | I can't trust either of them with my data yet.
        
             | Mashimo wrote:
             | > You need to buy the same exact drive
             | 
             | AFAIK you can add larger and faster drives, you will just
             | not get any benefits from it.
        
               | bpye wrote:
               | You can get read speed benefits with faster drives, but
               | your writes will be limited by your slowest.
        
             | hda111 wrote:
              | It doesn't have to be the same exact drive. Mixing drives
              | from different manufacturers (with the same capacity) is
              | often used to prevent correlated failures. ZFS does not use
              | the whole disk, so different disks can be mixed, because
              | disks often have slightly varying capacities.
        
             | unixhero wrote:
              | Just have backups. I've used btrfs and zfs for different
              | purposes. Never had any lost data or downtime with btrfs
              | since 2016. I only use raid 0 and raid 1 and compression.
              | Btrfs does not have a hungry RAM requirement.
        
               | tw04 wrote:
                | Neither does ZFS; that's a widely repeated red herring
                | from people trying to do dedup in the very early days,
                | and people who misunderstood how it used RAM to do
                | caching.
        
             | tw04 wrote:
             | You can run raid-z across partitions to utilize the full
             | drive just like synology does with their "hybrid raid" -
             | you just shouldn't.
        
           | ryao wrote:
           | You can use different sized disks, but RAID-Z will truncate
           | the space it uses to the lowest common denominator. If you
           | increase the lowest common denominator, RAID-Z should auto-
            | expand to use the additional space. All parity RAID
            | technologies truncate members to the lowest common
            | denominator, not just ZFS.
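            | 
            | The knobs for that are roughly (pool/device names made up):
            | 
            |     zpool set autoexpand=on tank
            |     # after replacing a member with a larger disk
            |     zpool online -e tank /dev/sdc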
        
       | zelcon wrote:
       | Been running it since rc2. It's insane how long this took to
       | finally ship.
        
       | senectus1 wrote:
        | Would love to use ZFS, but unfortunately Fedora just can't keep
        | up with it...
        
         | vedranm wrote:
          | If you delay upgrading the kernel on occasion, it is more or
         | less fine.
        
         | kawsper wrote:
         | Not sure if it helps you at all, but I have a simple Ruby
         | script that I use to build kernels on Fedora with a specified
         | ZFS version.
         | 
         | https://github.com/kaspergrubbe/fedora-kernel-compilation/bl...
         | 
         | It builds on top of the exploded fedora kernel tree, adds zfs
         | and spits out a .rpm that you can install with rpm -ivh.
         | 
         | It doesn't play well with dkms because it tries to interfere,
         | so I disable it on my system.
        
           | _factor wrote:
            | I could never get it working on rpm-ostree distros.
        
         | klauserc wrote:
         | I've been running Fedora on top of the excellent ZFSBootMenu[1]
         | for about a year. You need to pay attention to the kernel
         | versions supported by OpenZFS and might have to wait for
         | support for a couple of weeks. The setup works fine otherwise.
         | 
         | [1] https://docs.zfsbootmenu.org
        
       | endorphine wrote:
       | Can someone describe why they would use ZFS (or similar) for home
       | usage?
        
         | lutorm wrote:
         | Apart from just peace of mind from bitrot, I use it for the
         | snapshotting capability which makes it super easy to do
         | backups. You can snapshot and send the snapshots to other
         | storage with e.g zfs-autobackup and it's trivial and you can't
         | screw it up. If the snapshots exist on the other drive, you
         | know you have a backup.
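          | 
          | The manual equivalent is roughly this (names are placeholders;
          | zfs-autobackup automates the bookkeeping):
          | 
          |     zfs snapshot -r tank/home@2025-01-14
          |     zfs send -i @2025-01-07 tank/home@2025-01-14 \
          |         | ssh backupbox zfs receive -u backup/home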
        
         | vedranm wrote:
         | Several reasons, but major ones (for me) are reliability
         | (checksums and self-healing) and portability (no other modern
         | filesystem can be read and written on Linux, FreeBSD, Windows,
         | and macOS).
         | 
         | Snapshots ("boot environments") are also supported by Btrfs (my
         | Linux installations use that so I don't have to worry about
         | having the 3rd party kernel module to read my rootfs).
         | Performance isn't that great either and, assuming Linux, XFS is
         | a better choice if that is your main concern.
        
         | nesarkvechnep wrote:
         | I'm trying to find a reason not to use ZFS at home.
        
           | dizhn wrote:
           | Requirement for enterprise quality disks, huge RAM (1 gig per
           | TB), ECC, at least x5 disks of redundancy. None of these are
           | things, but people will try to educate you anyway. So use it
           | but keep it to yourself. :)
        
             | craftkiller wrote:
             | No need to keep it to yourself. As you've mentioned, all of
             | these requirements are misinformation so you can ignore
             | people who repeat them (or even better, tell them to stop
             | spreading misinformation).
             | 
             | For those not in the know:
             | 
             | You don't need to use enterprise quality disks. There is
             | nothing in the ZFS design that requires enterprise quality
             | disks any more than any other file system. In fact, ZFS has
             | saved my data through multiple consumer-grade HDD failures
             | over the years thanks to raidz.
             | 
              | The 1 gig per TB figure is ONLY for when using the ZFS
              | dedup feature, which is widely regarded as a bad idea
              | except in VERY specific use cases. 99.9% of ZFS users
              | should not and will not use dedup and therefore they do not
              | need ridiculous piles of RAM.
             | 
              | There is nothing in the design of ZFS that makes it any
              | more dangerous to run without ECC than any other
              | filesystem. ECC is a good idea regardless of filesystem but
              | it's certainly not a requirement.
             | 
             | And you don't need x5 disks of redundancy. It runs great
             | and has benefits even on single-disk systems like laptops.
             | Naturally, having parity drives is better in case a drive
             | fails but on single disk systems you still benefit from the
             | checksumming, snapshotting, boot environments, transparent
             | compression, incremental zfs send/recv, and cross-platform
             | native encryption.
        
               | bbatha wrote:
               | > The 1 gig per TB figure is ONLY for when using the ZFS
               | dedup feature, which the ZFS dedup feature is widely
               | regarded as a bad idea except in VERY specific use cases.
               | 99.9% of ZFS users should not and will not use dedup and
               | therefore they do not need ridiculous piles of ram.
               | 
                | You also really don't need 1GB of RAM per TB unless you
                | have a very high write volume. YMMV but my experience is
                | that it's closer to 1GB per 10TB.
        
             | tpetry wrote:
                | The interesting part about the enterprise-quality-disk
                | misinformation is how wrong it is. The core idea of ZFS
                | was to detect issues when drives or their drivers are
                | faulty. And those issues happened more often with cheap
                | non-enterprise disks at the time.
        
         | Mashimo wrote:
          | It's relatively easy, and yet powerful. Before that I had MDADM
          | + LVM + dm-crypt + ext4, which also worked, but all the layers
          | gave me a headache.
          | 
          | Automated snapshots are super easy and fast. They are also easy
          | to access: if you deleted a file, you don't have to restore the
          | whole snapshot, you can just cp it from the hidden .zfs/
          | folder.
          | 
          | I've run it on 6x 8TB disks for a couple of years now, in a
          | raidz2, which means up to 2 disks can die. Would I use it on a
          | single disk on a desktop? Probably not.
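          | 
          | For example (snapshot and path names are made up):
          | 
          |     ls /tank/data/.zfs/snapshot/
          |     cp /tank/data/.zfs/snapshot/autosnap_2025-01-14/report.odt ~/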
        
           | redundantly wrote:
           | > Would I use it on a single disk on a Desktop? Probably not.
           | 
           | I do. Snapshots and replication and checksumming are awesome.
        
         | chromakode wrote:
         | I replicate my entire filesystem to a local NAS every 10
         | minutes using zrepl. This has already saved my bacon once when
         | a WD_BLACK SN850 suddenly died on me [1]. It's also recovered
         | code from some classic git blunders. It shouldn't be possible
         | any more to lose data to user error or single device failure.
         | We have the technology.
         | 
         | [1]: https://chromakode.com/post/zfs-recovery-with-zrepl/
        
         | mrighele wrote:
         | Good reasons for me:
         | 
         | Checksums: this is even more important in home usage as the
         | hardware is usually of lower quality. Faulty controllers,
         | crappy cables, hard disks stored in a higher than advised
         | temperature... many reasons for bogus data to be saved, and zfs
         | handles that well and automatically (if you have redundancy)
         | 
         | Snapshots: very useful to make backups and quickly go back to
         | an older version of a file when mistakes are made
         | 
          | Peace of mind: compared to the alternatives, I find that zfs is
          | easier to use and makes it harder to make a mistake that could
          | bring data loss (e.g. remove the wrong drive by mistake when
          | replacing a faulty one, pool becomes unusable, "oops!", put the
          | disk back, pool goes back to work as if nothing happened).
          | Maybe it is different now with mdadm, but when I used it years
          | ago I was always worried about making a destructive mistake.
        
           | EvanAnderson wrote:
           | > Snapshots: very useful to make backups and quickly go back
           | to an older version of a file when mistakes are made
           | 
           | Piling on here: Sending snapshots to remote machines (or
           | removable drives) is very easy. That makes snapshots viable
           | as a backup mechanism (because they can exist off-site and
           | offline).
        
         | ryao wrote:
         | To give an answer that nobody else has given, ZFS is great for
         | storing Steam games. Set recordsize=1M and compression=zstd and
         | you can often store about 33% more games in the same space.
         | 
         | A friend uses ZFS to store his Steam games on a couple of hard
         | drives. He gave ZFS a SSD to use as L2ARC. ZFS automatically
         | caches the games he likes to run on the SSD so that they load
         | quickly. If he changes which games he likes to run, ZFS will
         | automatically adapt to cache those on the SSD instead.
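          | 
          | Roughly (pool/dataset/device names are made up):
          | 
          |     zfs create -o recordsize=1M -o compression=zstd tank/games
          |     # optional L2ARC device
          |     zpool add tank cache /dev/nvme0n1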
        
           | chillfox wrote:
            | The compression and ARC will make games load much faster than
            | they would on NTFS, even without having a separate drive for
            | the ARC.
        
           | bmicraft wrote:
           | As I understand, L2ARC doesn't work across reboots which
           | unfortunately makes it almost useless for systems that get
           | rebooted regularly, like desktops.
        
         | NamTaf wrote:
         | I used it on my home NAS (4x3TB drives, holding all of my
         | family's backups, etc.) for the data security / checksumming
         | features. IMO it's performant, robust and well-designed in ways
         | that give me reassurance regarding data integrity and help
         | prevent me shooting myself in the foot.
        
         | PaulKeeble wrote:
         | I have a home built NAS that uses ZFS for the storage array and
         | the checksumming has been really quite useful in detecting and
         | correcting bit rot. In the past I used MDADM and EXT over the
         | top and that worked but it didn't defend against bit rot. I
          | have considered BTRFS since it would get me the same
          | checksumming without the rest of ZFS, but it's not considered
          | reliable for systems with parity yet (although I think it is
          | likely more than reliable enough now).
         | 
         | I do occasionally use snapshots and the compression feature is
         | handy on quite a lot of my data set but I don't use the user
         | and group limitations or remote send and receive etc. ZFS does
         | a lot more than I need but it also works really well and I
         | wouldn't move away from a checksumming filesystem now.
        
         | mshroyer wrote:
         | I use it on a NAS for:
         | 
         | - Confidence in my long-term storage of some data I care about,
         | as zpool scrub protects against bit rot
         | 
         | - Cheap snapshots that provide both easy checkpoints for work
         | saved to my network share, and resilience against ransomware
         | attacks against my other computers' backups to my NAS
         | 
         | - Easy and efficient (zfs send) replication to external hard
         | drives for storage pool backup
         | 
         | - Built-in and ergonomic encryption
         | 
         | And it's really pretty easy to use. I started with FreeNAS (now
         | TrueNAS), but eventually switched to just running FreeBSD + ZFS
         | + Samba on my file server because it's not that complicated.
        
         | zbentley wrote:
         | I use ZFS for boot and storage volumes on my main workstation,
         | which is primarily that--a workstation, not a server or NAS.
         | Some benefits:
         | 
         | - Excellent filesystem level backup facility. I can transfer
         | snapshots to a spare drive, or send/receive to a remote (at
         | present a spare computer, but rsync.net looks better every year
         | I have to fix up the spare).
         | 
         | - Unlike other fs-level backup solutions, the flexibility of
         | zvols means I can easily expand or shrink the scope of what's
         | backed up.
         | 
         | - It's incredibly easy to test (and restore) backups. Pointing
         | my to-be-backed-up volume, or my backup volume, to a previous
         | backup snapshot is instant, and provides a complete view of the
         | filesystem at that point in time. No "which files do you want
         | to restore" hassles or any of that, and then I can re-point
         | back to latest and keep stacking backups. Only Time Machine has
         | even approached that level of simplicity in my experience, and
          | I have tried a _lot_ of backup tools. In general, backup
          | tools/workflows that uphold "the test process _is_ the
          | restoration process, so we made the restoration process as easy
          | and reversible as possible" are the best ones.
         | 
         | - Dedup occasionally comes in useful (if e.g. I'm messing
         | around with copies of really large AI training datasets or many
         | terabytes of media file organization work). It's RAM-expensive,
         | yes, but what's often not mentioned is that you can turn it on
         | and off for a volume--if you rewrite data. So if I'm looking
         | ahead to a week of high-volume file wrangling, I can turn dedup
         | on where I need it, start a snapshot-and-immediately-restore of
         | my data (or if it's not that many files, just cp them back and
         | forth), and by the next day or so it'll be ready. Turning it
         | off when I'm done is even simpler. I imagine that the copy cost
         | and unpredictable memory usage mean that this kind of "toggled"
         | approach to dedup isn't that useful for folks driving servers
         | with ZFS, but it's outstanding on a workstation.
         | 
         | - Using ZFSBootMenu outside of my OS means I can be extremely
         | cavalier with my boot volume. Not sure if an experimental
         | kernel upgrade is going to wreck my graphics driver? Take a
         | snapshot and try it! Not sure if a curl | bash invocation from
         | the internet is going to rm -rf /? Take a snapshot and try it!
         | If my boot volume gets ruined, I can roll it back to a snapshot
         | _in the bootloader_ from outside of the OS. For extra paranoia
         | I have a ZFSBootMenu EFI partition on a USB drive if I ever
         | wreck the bootloader as well, but the odds are that if I ever
          | break the system that badly, the boot volume is damaged at the
          | block level and can't restore local snapshots. In that case,
         | I'd plug in the USB drive and restore a snapshot from the
         | adjacent data volume, or my backup volume ... all without
         | installing an OS or leaving the bootloader. The benefits of
         | this to mental health are huge; I can tend towards a more
         | "college me" approach to trying random shit from StackOverflow
         | for tweaking my system without having to worry about "adult
         | professional me" being concerned that I don't _know_ what
         | running some random garbage will do to my system. Being able to
          | experiment first, and then learn what's really going on once I
         | find what works, is very relieving and makes tinkering a much
         | less fraught endeavor.
         | 
         | - Being able to per-dataset enable/disable ARC and ZIL means
         | that I can selectively make some actions really fast. My Steam
         | games, for example, are in a high-ARC-bias dataset that starts
         | prewarming (with throttled IO) in the background on boot. Game
         | load times are extremely fast--sometimes at better than single-
         | ext4-SSD levels--and I'm storing all my game installs on
         | spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)!
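          | 
          | The per-dataset ARC/ZIL control mentioned above maps to the
          | primarycache/secondarycache and logbias properties, something
          | like this (dataset names made up):
          | 
          |     zfs set primarycache=all tank/games
          |     zfs set secondarycache=all tank/games
          |     zfs set logbias=throughput tank/scratch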
        
         | tbrownaw wrote:
         | > _describe why they would use ZFS (or similar) for home usage_
         | 
         | Mostly because it's there, but also the snapshots have a `diff`
         | feature that's occasionally useful.
        
         | klauserc wrote:
         | I use it on my work laptop. Reasons:
         | 
        | - a single solution that covers the entire storage domain (I
        |   don't have to learn multiple layers, like logical volume
        |   manager vs. ext4 vs. physical partitions)
        | 
        | - cheap/free snapshots. I have been glad to have been able to
        |   revert individual files or entire file systems to an earlier
        |   state. E.g., create a snapshot before doing a major distro
        |   update.
        | 
        | - easy to configure/well documented
         | 
         | Like others have said, at this point I would need a good
         | reason, NOT to use ZFS on a system.
        
       | FrostKiwi wrote:
       | FINALLY!
       | 
        | You can do borderline insane single-vdev setups like RAID-Z3
        | with 4 disks (3 disks' worth of redundancy) of the most expensive
        | and highest-density hard drives money can buy right now, for an
        | initial effective space usage of 25%, and then keep buying and
        | expanding disk by disk as the space demand grows, up to something
        | like 12ish disks. Disk prices drop as time goes on, and the
        | failure chance is spread out because disks are added at different
        | times.
        
         | uniqueuid wrote:
         | Yes but see my sibling comment.
         | 
         | When you expand your array, your _existing data_ will not be
         | stored any more efficiently.
         | 
         | To get the new parity/data ratios, you would have to force
         | copies of the data and delete the old, inefficient versions,
         | e.g. with something like this [1]
         | 
         | My personal take is that it's a much better idea to buy
         | individual complete raid-z configurations and add new ones /
         | replace old ones (disk by disk!) as you go.
         | 
         | [1] https://github.com/markusressel/zfs-inplace-rebalancing
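         | 
         | The linked script essentially rewrites every file so its blocks
         | land under the new data/parity geometry. A bare-bones sketch of
         | the idea (the real script also verifies checksums and handles
         | edge cases; --reflink=never matters once block cloning is in
         | play):
         | 
         |     find /tank/data -type f | while read -r f; do
         |       cp -a --reflink=never "$f" "$f.rebalance" && mv "$f.rebalance" "$f"
         |     done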
        
           | Mashimo wrote:
           | I wish something like this were built into ZFS, so snapshots
           | and current access would not be broken.
        
             | uniqueuid wrote:
             | True, but I have a gut feeling that a lot of these thorny
             | issues would come up again:
             | 
             | https://github.com/openzfs/zfs/issues/3582
        
       | averageRoyalty wrote:
       | Worth noting that TrueNAS already supports this[0] (I assume
       | using 2.3.0rc3?). Not sure about the stability, but very
       | exciting.
       | 
       | https://www.truenas.com/blog/electric-eel-openzfs-23/
        
       | poisonborz wrote:
       | I just don't get how the Windows world - by far the largest PC
       | platform by userbase - still doesn't have any answer to ZFS.
       | Microsoft had WinFS and then ReFS, but the latter is on the back
       | burner, and while there is active development (Win11 ships some
       | bits from time to time), a release is nowhere in sight. There are
       | some lone warriors attempting the giant task of creating a ZFS
       | compatibility layer in a few projects, but those are far from
       | mature/usable.
       | 
       | How come Windows still uses a 32-year-old file system?
        
         | mustache_kimono wrote:
         | > I just don't get it how the Windows world - by far the
         | largest PC platform per userbase - still doesn't have any
         | answer to ZFS.
         | 
         | The mainline Linux kernel doesn't either, and I think the
         | answer is because it's hard and high risk with a return mostly
         | measured in technical respect?
        
           | ffsm8 wrote:
           | Technically speaking, bcachefs has been merged into the Linux
           | Kernel - that makes your initial assertion wrong.
           | 
           | But considering it's had two drama events within 1 year of
           | getting merged... I think we can safely confirm your
           | conclusion of it being really hard
        
             | mustache_kimono wrote:
             | > Technically speaking, bcachefs has been merged into the
             | Linux Kernel - that makes your initial assertion wrong.
             | 
             | bcachefs doesn't implement its erasure coding/RAID yet?
             | Doesn't implement send/receive. Doesn't implement
             | scrub/fsck. See: https://bcachefs.org/Roadmap,
             | https://bcachefs.org/Wishlist/
             | 
             | btrfs is still more of a legit competitor to ZFS these days
             | and it isn't close to touching ZFS where it matters. If the
             | perpetually half-finished bcachefs and btrfs are the
             | "answer" to ZFS that seems like too little, too late to me.
        
               | koverstreet wrote:
               | Erasure coding is almost done; all that's missing is some
               | of the device evacuate and reconstruct paths, and people
               | have been testing it and giving positive feedback
               | (especially w.r.t. performance).
               | 
               | It most definitely does have fsck and has since the
               | beginning, and it's a much more robust and dependable
               | fsck than btrfs's. Scrub isn't quite done - I actually
               | was going to have it ready for this upcoming merge window
               | except for a nasty bout of salmonella :)
               | 
               | Send/recv is a long ways off, there might be some low
               | level database improvements needed before that lands.
               | 
               | Short term (next year or two) priorities are finishing
               | off online fsck, more scalability work (upcoming version
               | for this merge window will do 50PB, but now we need to up
               | the limit on number of drives), and quashing bugs.
        
               | ryao wrote:
               | Hearing that it is missing some code for reconstruction
               | makes it sound like it is missing something fairly
               | important. The original purpose of parity RAID is to
               | support reconstruction.
        
               | koverstreet wrote:
               | We can do reconstruct reads, what's missing is the code
               | to rewrite missing blocks in a stripe after a drive dies.
               | 
               | In general, due to the scope of the project, I've been
               | prioritizing the functionality that's needed to validate
               | the design and the parts that are needed for getting the
               | relationships between different components correct.
               | 
               | e.g. recently I've been doing a bunch of work on
               | backpointers scalability, and that plus scrub are leading
               | to more back and forth iteration on minor interactions
               | with erasure coding.
               | 
               | So: erasure coding is complete enough to know that it
               | works and for people to torture test it, but yes you
               | shouldn't be running it in production yet (and it's
               | explicitly marked as such). What's remaining is trivial
               | but slightly tedious stuff that's outside the critical
               | path of the rest of the design.
               | 
               | Some of the code I've been writing for scrub is turning
               | out to also be what we want for reconstruct, so maybe
               | we'll get there sooner rather than later...
        
               | BSDobelix wrote:
               | >except for a nasty bout of salmonella
               | 
               | Did the Linux Foundation send you some "free" sushi? ;)
               | 
                | However, keep the good work rolling; super happy about a
                | good, usable and modern filesystem native to Linux.
        
               | pdimitar wrote:
               | FYI: the main reason I gave up on bcachefs is that I
               | can't use devices with native 16K blocks.
               | 
               | Hope that's coming this year. I have a bunch of old HDDs
               | and SSDs and I could very easily assemble a spare storage
               | server with about 4TB capacity. Already tested bcachefs
               | with most of the drives and it performed very well.
               | 
               | Also lack of ability to reconstruct seems like another
               | worrying omission.
        
               | koverstreet wrote:
               | I wasn't aware there were actual users needing bs > ps
               | yet. Cool :)
               | 
                | That should be completely trivial for bcachefs to
                | support; it'll mostly just be a matter of finding or
                | writing the tests.
        
               | pdimitar wrote:
               | Seriously? But... NVMe drives! I stopped testing because
               | I only have one spare NVMe and couldn't use it with
               | bcachefs.
               | 
               | If you or others can get it done I'm absolutely starting
               | to use bcachefs the month after. I do need fast storage
               | servers in my home office.
        
               | ryao wrote:
               | You can do this on ZFS today with `zpool create -o
               | ashift=14 ...`.
        
               | pdimitar wrote:
                | Yeah I know, thanks. But ZFS still mostly requires drives
                | of the same size. My main NAS is like that, but I can't
                | expand it the way I want to - with the drives of
                | different sizes I have lying around - and I am not keen
                | on spending on new HDDs right now. So I thought I'd make
                | a secondary NAS with bcachefs and all the spare drives I
                | have.
               | 
               | As for ZFS, I'll be buying some extra drives later this
               | year and will make use of direct_io so I can use another
               | NVMe spare for faster access.
        
               | ryao wrote:
               | If you don't care about redundancy, you could add all of
               | them as top level vdevs and then ZFS will happily use all
               | of the space on them until one fails. Performance should
               | be great until there is a failure. Just have good
               | backups.
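                | 
                | A sketch, with made-up device names:
                | 
                |     # each disk becomes its own top-level vdev; all of the
                |     # mismatched space is usable, but one failure loses the pool
                |     zpool create scratch sdb sdc sdd nvme0n1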
        
               | mafuy wrote:
               | Thank you, looking forward to it!
        
         | kwanbix wrote:
         | Honest question. As an end user who uses Windows and Linux and
         | does not use ZFS, what am I missing?
        
           | madeofpalk wrote:
           | I'm missing file clones/copy-on-write.
        
           | poisonborz wrote:
           | Way better data security and resilience against bit rot.
           | This goes for both HDDs and SSDs. Copy-on-write, snapshots,
           | end-to-end integrity. Pools also make it easier to extend
           | storage and to survive drive failure (and SSDs tend to
           | corrupt in a sneakier way).
        
             | wil421 wrote:
             | How many of us are using single disks on our laptops? I
             | have a NAS and use all of the above but that doesn't help
             | people with single drive systems. Or help me understand why
             | I would want it on my laptop.
        
               | ekianjo wrote:
               | It provides encryption by default without having to deal
               | with LUKS. And no need to ever do fsck again.
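                | 
                | E.g., native encryption on a dataset (names are
                | hypothetical):
                | 
                |     zfs create -o encryption=on -o keyformat=passphrase rpool/private
                |     zfs load-key rpool/private    # after reboot/import
                |     zfs mount rpool/private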
        
               | Twey wrote:
               | Except that swap on OpenZFS still deadlocks 7 years later
               | (https://github.com/openzfs/zfs/issues/7734) so you're
               | still going to need LUKS for your swap anyway.
        
               | ryao wrote:
               | Another option is to go without swap. I avoid swap on my
               | machines unless I want hibernation support.
        
               | ryao wrote:
                | My thinkpad from college uses ZFS as its rootfs. The
                | benefits are:
                | 
                | * If the hard drive / SSD corrupted blocks, the
                | corruption would be identified.
                | 
                | * Ditto blocks allow for self healing. Usually, this only
                | applies to metadata, but if you set copies=2, you can get
                | this on data too. It is a poor man's RAID.
                | 
                | * ARC made the desktop environment very responsive since,
                | unlike the LRU cache, ARC resists cold cache effects from
                | transient IO workloads.
                | 
                | * Transparent compression allowed me to store more on the
                | laptop than otherwise possible.
                | 
                | * Snapshots and rollback allowed me to do risky
                | experiments and undo them as if nothing happened.
                | 
                | * Backups were easy via send/receive of snapshots.
                | 
                | * If the battery dies while you are doing things, you can
                | boot without any damage to the filesystem.
               | 
               | That said, I use a MacBook these days when I need to go
               | outside. While I miss ZFS on it, I have not felt
               | motivated to try to get a ZFS rootfs on it since the last
               | I checked, Apple hardcoded the assumption that the rootfs
               | is one of its own filesystems into the XNU kernel and
               | other parts of the system.
        
               | CoolCold wrote:
                | NTFS has had compression since... not even sure when.
               | 
               | For other stuff, let that nerdy CorpIT handle your
               | system.
        
               | ryao wrote:
               | NTFS compression is slow and has a low compression ratio.
               | ZFS has both zstd and lz4.
        
               | adgjlsfhk1 wrote:
               | yes but NTFS is bad enough that no one needs to be told
               | how bad it is.
        
               | rabf wrote:
               | Not ever having to deal with partitions and instead using
               | data sets each of which can have their own properties
               | such as compression, size quota, encryption etc is
               | another benefit. Also using zfsbootmenu instead of grub
               | enables booting from different datasets or snapshots as
               | well as mounting and fixing data sets all from the
               | bootloader!
        
               | yjftsjthsd-h wrote:
               | If the single drive in your laptop corrupts data, you
               | won't know. ZFS can't _fix_ corruption without extra
                | copies, but it's still useful to catch the problem and
               | notify the user.
               | 
               | Also snapshots are great regardless.
        
               | Polizeiposaune wrote:
               | In some circumstances it can.
               | 
               | Every ZFS block pointer has room for 3 disk addresses; by
               | default, the extras are used only for redundant metadata,
               | but they can also be used for user data.
               | 
               | When you turn on ditto blocks for data (zfs set copies=2
               | rpool/foo), zfs can fix corruption even on single-drive
               | systems at the cost of using double or triple the space.
               | Note that (like compression), this only affects blocks
               | written after the setting is in place, but (if you can
               | pause writes to the filesystem) you can use zfs send|zfs
               | recv to rewrite all blocks to ensure all blocks are
               | redundant.
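              | 
              | A sketch of that rewrite trick (dataset names are made up):
              | 
              |     zfs set copies=2 rpool/foo
              |     # only newly written blocks get the extra copy; to
              |     # rewrite the existing data into a fresh dataset:
              |     zfs snapshot rpool/foo@ditto
              |     zfs send rpool/foo@ditto | zfs recv -o copies=2 rpool/foo_new
              |     # verify, then rename rpool/foo_new into place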
        
             | jeroenhd wrote:
             | The data security and rot resilience only goes for systems
             | with ECC memory. Correct data with a faulty checksum will
             | be treated the same as incorrect data with a correct
             | checksum.
             | 
             | Windows has its own extended filesystem through Storage
             | Spaces, with many ZFS features added as lesser used Storage
             | Spaces options, especially when combined with ReFS.
        
               | abrookewood wrote:
               | Please stop repeating this, it is incorrect. ECC helps
               | with any system, but it isn't necessary for ZFS checksums
               | to work.
        
               | BSDobelix wrote:
                | On ZFS there is the ARC (adaptive replacement cache); on
                | non-ZFS systems this read cache is called the buffer
                | cache. Both reside in memory, so ECC is equally important
                | for both kinds of system.
                | 
                | Rot means bits changing without those bits being
                | accessed, and that's ~not possible with ZFS. Additionally
                | you can enable checksumming IN the ARC (disabled by
                | default), and with that you could argue that ECC and
                | "enterprise" quality hardware are even more important for
                | non-ZFS systems.
                | 
                | >Correct data with a faulty checksum will be treated the
                | same as incorrect data with a correct checksum.
                | 
                | There is no such thing as "correct" data, only a block
                | with a correct checksum; if the checksum is not correct,
                | the block is not ok.
        
               | _factor wrote:
                | This has nothing to do with ZFS as a filesystem. ZFS does
                | integrity verification on redundant RAID configurations.
                | If the system memory flips a bit, it will get written to
                | disk as with any filesystem. If a bit flips on a disk,
                | however, it can be detected and repaired. Without ECC,
                | your source of truth can corrupt, but this is true of any
                | system.
        
               | mrb wrote:
               | _" data security and rot resilience only goes for systems
               | with ECC memory."_
               | 
               | No. Bad HDDs/SSDs or bad SATA cables/ports cause a lot
               | more data corruption than bad RAM. And ZFS will correct
               | these cases even without ECC memory. It's a myth that the
               | data healing properties of ZFS are useless without ECC
               | memory.
        
               | elseless wrote:
               | Precisely this. And don't forget about bugs in
               | virtualization layers/drivers -- ZFS can very often save
               | your data in those cases, too.
        
               | ryao wrote:
               | I once managed to use ZFS to detect a bit flip on a
               | machine that did not have ECC RAM. All python programs
               | started crashing in libpython.so on my old desktop one
               | day. I thought it was a bug in ZFS, so I started
               | debugging. I compared the in-memory buffer from ARC with
               | the on-disk buffer for libpython.so and found a bit flip.
               | At the time, accessing a snapshot through .zfs would
               | duplicate the buffer in ARC, which made it really easy to
               | compare the in-memory buffer against the on-disk buffer.
               | I was in shock as I did not expect to ever see one in
               | person. Since then, I always insist on my computers
               | having ECC.
        
           | e12e wrote:
           | Cross platform native encryption with sane fs for removable
           | media.
        
             | lazide wrote:
             | Who would that help?
             | 
             | MacOS also defaults to a non-portable FS for likely similar
             | reasons, if one was being cynical.
        
           | chillfox wrote:
            | Much faster launch of applications/files you use regularly.
            | The ability to always roll back updates in seconds if they
            | cause issues, thanks to snapshots. Fast backups with
            | snapshots + zfs send/receive to a remote machine. Compressed
            | disks, which both let you store more on a drive and make
            | accessing files faster. Easy encryption. The ability to
            | mirror 2 large USB disks so you never have your data
            | corrupted or lose it to drive failures. You can move your
            | data or entire OS install to a new computer easily by using
            | a live disk and just doing a send/receive to the new PC.
           | 
           | (I have never used dedup, but it's there if you want I guess)
        
           | wkat4242 wrote:
           | Snapshots (Note: NTFS does have this in the way of Volume
           | Shadow Copy but it's not as easily accessible as a feature to
           | the end user as it is in ZFS). Copy on Write for reliability
           | under crashes. Block checksumming for data protection
           | (bitrot)
        
           | johannes1234321 wrote:
           | For a while I ran Open Solaris with ZFS as root filesystem.
           | 
           | The key feature for me, which I miss, is the snapshotting
           | integrated into the package manager.
           | 
            | ZFS allows snapshots more or less for free (due to copy-on-
            | write), including cron-based snapshotting every 15 minutes.
            | So if I made a mistake anywhere, there was a way to recover.
            | 
            | And that, integrated with the update manager and boot
            | manager, means that on an update a snapshot is created and
            | during boot one can switch between states. I never had a
            | broken update, but it gave a good feeling.
           | 
           | On my home server I like the raid features and on Solaris it
           | was nicely integrated with NFS etc so that one can easily
           | create volumes and export them and set restrictions (max size
           | etc.) on it.
        
           | hoherd wrote:
           | Online filesystem checking and repair.
           | 
           | Reading any file will tell you with 100% guarantee if it is
           | corrupt or not.
           | 
           | Snapshots that you can `cd` into, so you can compare any
           | prior version of your FS with the live version of your FS.
           | 
           | Block level compression.
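            | 
            | E.g. (pool and path names are hypothetical):
            | 
            |     zpool scrub rpool            # online check/repair pass
            |     zpool status -v rpool        # progress and any errors
            |     ls /home/.zfs/snapshot/      # snapshots, browsable read-only
            |     diff -r /home/.zfs/snapshot/monday/me /home/me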
        
         | badgersnake wrote:
         | NTFS is good enough for most people, who have a laptop with one
         | SSD in it.
        
           | wkat4242 wrote:
           | The benefits of ZFS don't need multiple drives to be useful.
           | I'm running ZFS on root for years now and snapshots have
           | saved my bacon several times. Also with block checksums you
           | can at least detect bitrot. And COW is always useful.
        
             | zamadatix wrote:
             | Windows manages volume snapshots on NTFS through VSS. I
             | think ZFS snapshots are a bit "cleaner" of a design, and
             | the tooling is a bit friendlier IMO, but the functionality
             | to snapshot, rollback, and save your bacon is there
             | regardless. Outside of the automatically enabled "System
             | Restore" (which only uses VSS to snapshot specific system
             | files during updates) I don't think anyone bothers to use
             | it though.
             | 
             | CoW, advanced parity, and checksumming are the big ones
             | NTFS lacks. CoW is just inherently not how NTFS is designed
             | and checksumming isn't there. Anything else (encryption,
             | compression, snapshots, ACLs, large scale, virtual devices,
             | basic parity) is done through NTFS on Windows.
        
               | wkat4242 wrote:
               | Yes I know that NTFS has snapshots, I mentioned that in
               | another comment. I don't think NTFS is as relevant in
               | comparison though. People who choose windows will have no
               | interest in ZFS and vice versa (someone considering ZFS
               | will not pick Windows).
               | 
                | And I don't think anyone bothers to use it due to the
                | lack of user-facing tooling around it. If it were as
                | easy to create snapshots as it is on ZFS, more people
                | would use it, I'm sure. It's just so amazing to try
                | something out, screw up my system and just revert :P But
                | VSS is more of a system API than a user-facing feature.
                | 
                | VSS is also used by backup software to quiesce the
                | filesystem, by the way.
               | 
               | But yeah the others are great features. My main point was
               | though that almost all the features of ZFS are very
               | beneficial even on a single drive. You don't need an
               | array to take advantage of Snapshots, the crash
               | reliability that CoW offers, and checksumming (though you
               | will lack the repair option obviously)
        
               | EvanAnderson wrote:
               | > I don't think NTFS is as relevant in comparison though.
               | People who choose windows will have no interest in ZFS
               | and vice versa (someone considering ZFS will not pick
               | Windows).
               | 
               | ZFS on Windows, as a first-class supported-by-Microsoft
               | option would be killer. It won't ever happen, but it
               | would be great. (NTFS / VSS with filesystem/snapshot
               | send/receive would "scratch" a lot of that "itch", too.)
               | 
               | > And I don't think anyone bothers to use it due to the
                | lack of user-facing tooling around it. If it were as
               | easy to create snapshots as it is on ZFS, more people
               | would use it, I'm sure. It's just so amazing to try
               | something out, screw up my system and just revert :P But
                | VSS is more of a system API than a user-facing feature.
               | 
               | VSS on NTFS is handy and useful but in my experience
               | brittle compared to ZFS snapshots. Sometimes VSS just
               | doesn't work. I've had repeated cases over the years
               | where accessing a snapshot failed (with traditional
               | unhelpful Microsoft error messages) until the host
               | machine was rebooted. Losing VSS snapshots on a volume is
               | much easier than trashing a ZFS volume.
               | 
               | VSS straddles the filesystem and application layers in a
               | way that ZFS doesn't. I think that contributes to some of
               | the jank (VSS writers becoming "unstable", for example).
               | It also straddles hardware interfaces in a novel way that
               | ZFS doesn't (using hardware snapshot functionality--
               | somewhat like using a GPU versus "software rendering"). I
               | think that also opens up a lot of opportunity for jank,
               | as compared to ZFS treating storage as dumb blocks.
        
         | GuB-42 wrote:
         | To be honest, the situation with Linux is barely better.
         | 
         | ZFS has license issues with Linux, preventing full integration,
         | and Btrfs is 15 years in the making and still doesn't match ZFS
         | in features and stability.
         | 
         | Most Linux distros still use ext4 by default, which is 19 years
         | old, but ext4 is little more than a series of extensions on top
         | of ext2, which is the same age as NTFS.
         | 
         | In all fairness, there are few OS components that are as
         | critical as the filesystem, and many wouldn't touch filesystems
         | that have less than a decade of proven track record in
         | production.
        
           | lousken wrote:
            | As far as stability goes, btrfs is used by Meta, Synology and
            | many others, so I wouldn't say it's not stable, but some
            | features are lacking.
        
             | fourfour3 wrote:
             | Do Synology actually use the multi-device options of btrfs,
             | or are they using linux softraid + lvm underneath?
             | 
             | I know Synology Hybrid RAID is a clever use of LVM + MD
             | raid, for example.
        
               | phs2501 wrote:
               | I believe Synology runs btrfs on top of regular mdraid +
               | lvm, possibly with patches to let btrfs checksum failures
               | reach into the underlying layers to find the right data
               | to recover.
               | 
               | Related blog post: https://daltondur.st/syno_btrfs_1/
        
             | azalemeth wrote:
             | My understanding is that single-disk btrfs is good, but
             | raid is decidedly dodgy;
             | https://btrfs.readthedocs.io/en/latest/btrfs-
             | man5.html#raid5... states that:
             | 
             | > The RAID56 feature provides striping and parity over
             | several devices, same as the traditional RAID5/6.
             | 
             | > There are some implementation and design deficiencies
             | that make it unreliable for some corner cases and *the
             | feature should not be used in production, only for
             | evaluation or testing*.
             | 
             | > The power failure safety for metadata with RAID56 is not
             | 100%.
             | 
             | I have personally been bitten once (about 10 years ago) by
             | btrfs just failing horribly on a single desktop drive. I've
             | used either mdadm + ext4 (for /) or zfs (for large /data
             | mounts) ever since. Zfs is fantastic and I genuinely don't
             | understand why it's not used more widely.
        
               | lousken wrote:
               | I was assuming OP wants to highlight filesystem use on a
               | workstation/desktop, not for a file server/NAS. I had
               | similar experience decade ago, but these days single
               | drives just work, same with mirroring. For such setups
               | btrfs should be stable. I've never seen a workstation
               | with raid5/6 setup. Secondly, filesystems and volume
               | managers are something else, even if e.g. btrfs and ZFS
               | are essentialy both.
               | 
               | For a NAS setup I would still prefer ZFS with truenas
               | scale (or proxmox if virtualization is needed), just
               | because all these scenarios are supported as well. And as
               | far as ZFS goes, encryption is still something I am not
               | sure about especially since I want to use snapshots
               | sending those as a backup to remote machine.
        
               | hooli_gan wrote:
               | RAID5/6 is not needed with btrfs. One should use RAID1,
               | which supports striping the same data onto multiple
               | drives in a redundant way.
        
               | johnmaguire wrote:
               | How can you achieve 2-disk fault tolerance using btrfs
               | and RAID 1?
        
               | Dalewyn wrote:
               | By using three drives.
               | 
               | RAID1 is just making literal copies, so each additional
               | drive in a RAID1 is a self-sufficient copy. You want two
               | drives of fault tolerance? Use three drives, so if you
               | lose two copies you still have one left.
               | 
               | This is of course hideously inefficient as you scale
               | larger, but that is not the question posed.
        
               | ryao wrote:
                | Btrfs did not support that until Linux 5.5, when it added
                | RAID1c3. Its regular RAID1 profile, rather than mirroring
                | across all members, just stores 2 copies, no matter how
                | many mirror members you have.
        
               | johnmaguire wrote:
               | > This is of course hideously inefficient as you scale
               | larger, but that is not the question posed.
               | 
               | It's not just inefficient, you literally can't scale
               | larger. Mirroring is all that RAID 1 allows for. To
               | scale, you'd have to switch to RAID 10, which doesn't
               | allow two-disk fault tolerance (you can get lucky if they
               | are in different stripes, but this isn't fault
               | tolerance.)
               | 
               | But you're right - RAID 1 also scales terribly compared
               | to RAID 6, even before introducing striping. Imagine you
               | have 6 x 16 TB disks:
               | 
               | With RAID 6, usable space of 64 TB, two-drive fault
               | tolerance.
               | 
               | With RAID 1, usable space of 16 TB, five-drive fault
               | tolerance.
               | 
                | With RAID 10, usable space of 48 TB, one-drive fault
               | tolerance.
        
               | crest wrote:
               | One problem with your setup is that ZFS by design can't
               | use a traditional *nix filesystem buffer cache. Instead
               | it has to use its own ARC (adaptive replacement cache)
               | with end-to-end checksumming, transparent compression,
               | and copy-on-write semantics. This can lead to annoying
               | performance problems when the two types of file system
                | caches contend for available memory. There is a back
               | pressure mechanism, but it effectively pauses other
               | writes while evicting dirty cache entries to release
               | memory.
        
               | ryao wrote:
               | Traditionally, you have the page cache on top of the FS
               | and the buffer cache below the FS, with the two being
               | unified such that double caching is avoided in
               | traditional UNIX filesystems.
               | 
               | ZFS goes out of its way to avoid the buffer cache,
               | although Linux does not give it the option to fully opt
               | out of it since the block layer will buffer reads done by
               | userland to disks underneath ZFS. That is why ZFS began
               | to purge the buffer cache on every flush 11 years ago:
               | 
               | https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c
               | 3b6...
               | 
               | That is how it still works today:
               | 
               | https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4
               | cef...
               | 
               | If I recall correctly, the page cache is also still above
               | ZFS when mmap() is used. There was talk about fixing it
               | by having mmap() work out of ARC instead, but I don't
               | believe it was ever done, so there is technically double
               | caching done there.
        
               | taskforcegemini wrote:
                | What's the best way to deal with this then? Disable the
                | Linux file cache? I've tried disabling/minimizing the ARC
                | in the past to avoid the OOM reaper, but the ARC was
                | stubborn and its RAM usage remained as is.
        
               | ssl-3 wrote:
               | I didn't have any trouble limiting zfs_arc_max to 3GB on
               | one system where I felt that it was important. I ran it
               | that way for a fair number of years and it always stayed
               | close to that bound (if it was ever exceeded, it wasn't
               | by a noteworthy amount at any time when I was looking).
               | 
               | At the time, I had it this way because I had fear of OOM
               | events causing [at least] unexpected weirdness.
               | 
               | A few months ago I discovered weird issues with a fairly
               | big, persistent L2ARC being ignored at boot due to
               | insufficient ARC. So I stopped arbitrarily limiting
               | zfs_arc_max and just let it do its default self-managed
               | thing.
               | 
               | So far, no issues. For me. With my workload.
               | 
               | Are you having issues with this, or is it a theoretical
               | problem?
        
               | ryao wrote:
               | These days, ZFS frees memory fast enough when Linux
               | requests memory to be freed that you generally do not see
               | OOM because of ZFS, but if you have a workload where it
               | is not fast enough, you can limit the maximum arc size to
               | try to help:
               | 
               | https://openzfs.github.io/openzfs-
               | docs/Performance%20and%20T...
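                | 
                | Concretely, capping the ARC looks roughly like this (a
                | sketch; the value is in bytes, here 3 GiB):
                | 
                |     echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max
                |     # persist across reboots:
                |     echo "options zfs zfs_arc_max=3221225472" >> /etc/modprobe.d/zfs.conf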
        
               | brian_cunnie wrote:
               | > I have personally been bitten once (about 10 years ago)
               | by btrfs just failing horribly on a single desktop drive.
               | 
               | Me, too. The drive was unrecoverable. I had to reinstall
               | from scratch.
        
               | worthless-trash wrote:
               | Licensing incompatibilities.
        
             | _joel wrote:
             | I'm similar to some other people here, I guess once they've
             | been bitten by data loss due to btrfs, it's difficult to
             | advocate for it.
        
               | lousken wrote:
               | I am assuming almost everybody at some point experienced
               | data loss because they pulled out a flash drive too
               | early. Is it safe to assume that we stopped using flash
               | drives because of it?
        
               | _joel wrote:
               | I'm not sure we have stopped using flash, judging by the
               | pile of USB sticks on my desk :) In relation to the fs
               | analogy if you used a flash drive that you know corrupted
               | your data, you'd throw it away for one you know works.
        
               | ryao wrote:
               | I once purchased a bunch of flash drives from Google's
               | online swag store and just unplugging them was often
                | enough to put them in a state where they claimed to be
               | 8MB devices and nothing I wrote to them was ever possible
               | to read back in my limited tests. I stopped using those
               | fast.
        
             | jeltz wrote:
             | It is possible to corrupt the file system from user space
             | as a normal user with Btrfs. The PostgreSQL devs found that
              | when working on async IO. And as far as I know that issue
             | has not been fixed.
             | 
             | https://www.postgresql.org/message-id/CA%2BhUKGL-
             | sZrfwcdme8j...
        
               | curt15 wrote:
               | LMDB users also unearthed a btrfs data corruption bug
               | last year:
               | https://bugzilla.redhat.com/show_bug.cgi?id=2169947
        
           | xattt wrote:
           | ZFS on OS X was killed because of Oracle licensing drama. I
           | don't expect anything better on Windows either.
        
             | ryao wrote:
             | There is a third party port here:
             | 
             | https://openzfsonosx.org/wiki/Main_Page
             | 
             | It was actually the NetApp lawsuit that caused problems for
             | Apple's adoption of ZFS. Apple wanted indemnification from
             | Sun because of the lawsuit, Sun's CEO did not sign the
             | agreement before Oracle's acquisition of Sun happened and
             | Oracle had no interest in granting that, so the official
             | Apple port was cancelled.
             | 
             | I heard this second hand years later from people who were
             | insiders at Sun.
        
               | xattt wrote:
               | That's a shame re: NetApp/ZFS.
               | 
               | While third-party ports are great, they lack deep
               | integration that first-party support would have brought
               | (non-kludgy Time Machine which is technically fixed with
               | APFS).
        
             | BSDobelix wrote:
             | >ZFS on OS X was killed because of Oracle licensing drama.
             | 
              | Naa, it was Jobs' ego, not the license:
             | 
             | >>Only one person at Steve Jobs' company announces new
             | products: Steve Jobs.
             | 
             | https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-
             | ap...
        
               | bolognafairy wrote:
               | It's a cute story that plays into the same old assertions
               | about Steve Jobs, but the conclusion is mostly baseless.
               | There are many other, more credible, less conspiratorial,
               | possible explanations.
        
               | wkat4242 wrote:
                | It could have played into it, though. But I agree the
                | support contract that couldn't be worked out (mentioned
                | elsewhere in the thread) is the more likely reason.
               | 
               | But I think these things are usually a combination. When
               | a business relationship sours, agreements are suddenly
               | much harder to work out. The negotiators are still people
               | and they have feelings that will affect their
               | decisionmaking.
        
             | throw0101a wrote:
             | > _ZFS on OS X was killed because of Oracle licensing
             | drama._
             | 
             | It was killed because Apple and Sun couldn't agree on a
             | 'support contract'. From Jeff Bonwick, one of the co-
             | creators ZFS:
             | 
             | >> _Apple can currently just take the ZFS CDDL code and
             | incorporate it (like they did with DTrace), but it may be
             | that they wanted a "private license" from Sun (with
             | appropriate technical support and indemnification), and the
             | two entities couldn't come to mutually agreeable terms._
             | 
             | > _I cannot disclose details, but that is the essence of
             | it._
             | 
             | * https://archive.is/http://mail.opensolaris.org/pipermail/
             | zfs...
             | 
              | Apple took DTrace, licensed via CDDL--just like ZFS--and
              | put it into the kernel without issue. Of course a file
              | system is much more central to an operating system, so they
              | wanted much more of a CYA for that.
        
           | mogoh wrote:
            | ZFS might be better than any other FS on Linux (I don't judge
            | that).
            | 
            | But you must admit that the situation on Linux is quite a bit
            | better than on Windows. Linux has so many filesystems in the
            | main branch. There is a lot of development. BTRFS had a rocky
            | start, but it got better.
        
           | stephen_g wrote:
            | I'm interested to know what 'full integration' would look
            | like. I use ZFS in Proxmox (Debian-based) and it's really
           | great and super solid, but I haven't used ZFS in more vanilla
           | Linux distros. Does Proxmox have things that regular Linux is
           | missing out on, or are there shortcomings and things I just
           | don't realise about Proxmox?
        
             | whataguy wrote:
             | The difference is that the ZFS kernel module is included by
             | default with Proxmox, whereas with e.g. Debian, you would
             | need to install it manually.
        
               | pimeys wrote:
               | And you can't follow the latest kernel before the ZFS
               | module supports it.
        
               | blibble wrote:
               | for Debian that's not exactly a problem
        
               | oarsinsync wrote:
               | Unless you're using Debian backports, and they backport a
               | new kernel a week before the zfs backport package update
               | happens.
               | 
               | Happened to me more than once. I ended up manually
               | changing the kernel version limitations the second time
               | just to get me back online, but I don't recall if that
               | ended up hurting me in the long run or not.
        
               | BSDobelix wrote:
               | Try CachyOS https://cachyos.org/ , you can even swap from
               | an existing Arch installation:
               | 
               | https://wiki-
               | dev.cachyos.org/sk/cachyos_repositories/how_to_...
        
               | ryao wrote:
               | There is a trick for this:                 * Step 1: Make
               | friends with a ZFS developer.       * Step 2: Guilt him
               | into writing patches to add support as soon as a new
               | kernel is released.       * Step 3: Enjoy
               | 
               | Adding support for a new kernel release to ZFS is usually
               | only a few hours of work. I have done it in the past more
               | than a dozen times.
        
             | BodyCulture wrote:
             | You probably don't realise how important encryption is.
             | 
              | It's still not supported by Proxmox. Yes, you can do it
              | yourself somehow, but you are on your own then, you miss
              | features, and people report problems with double or triple
              | file system layers.
              | 
              | I do not understand how they do not have encryption out of
              | the box; this seems to be a problem.
        
               | kevinmgranger wrote:
               | I'm not sure about proxmox, but ZFS on Linux does have
               | encryption.
        
           | BSDobelix wrote:
           | >ZFS has license issues with Linux, preventing full
           | integration
           | 
            | No one wants that, OpenZFS is much healthier without Linux
            | and its "Foundation/Politics".
        
             | bhaney wrote:
             | > No one wants that
             | 
             | I want that
        
               | BSDobelix wrote:
               | Then let me tell you that FreeBSD or OmniOS is what you
               | really want ;)
        
               | bhaney wrote:
               | You're now 0 for 2 at telling me what I want
        
               | BSDobelix wrote:
               | The customer is not always right, however a good/modern
               | Filesystem really would be something for Linux ;)
        
               | ruthmarx wrote:
               | > The customer is not always right,
               | 
               | An uninvited door-to-door salesman is rarely, if ever
               | right.
        
           | nabla9 wrote:
            | The license is not a real issue. It just has to be
            | distributed as a separate module. No big hurdle.
        
             | Jnr wrote:
             | From my point of view it is a real usability issue.
             | 
              | The zfs modules are not in the official repos. You either
              | have to compile them on each machine or use unofficial
              | repos, which is not exactly ideal and can break things if
              | those repos are not up to date. And I guess it also needs
              | some additional steps for a secure-boot setup on some
              | distros?
              | 
              | I really want to try zfs because btrfs has some issues with
              | RAID5 and RAID6 (it is not recommended, so I don't use it),
              | but I am not sure I want to risk the overall system
              | stability. I would not want to end up in a situation where
              | my machines don't boot and I have to fix them manually.
        
               | chillfox wrote:
               | I have been using ZFS on Mint and Alpine Linux for years
               | for all drives (including root) and have never had an
               | issue. It's been fantastic and is super fast. My
               | linux/zfs laptop loads games much faster than an
               | identical machine running Windows.
               | 
               | I have never had data corruption issues with ZFS, but I
               | have had both xfs and ext4 destroy entire discs.
        
               | harshreality wrote:
               | Why are you considering raid5/6? Are you considering
               | building a large storage array? If the data will fit
               | comfortably (50-60% utilization) on one drive, all you
               | need is raid1. Btrfs is fine for raid1 (raid1c3 for extra
               | redundancy); it might have hidden bugs, but no filesystem
               | is immune from those; zfs had a data loss bug (it was
               | rare, but it happened) a year ago.
               | 
               | Why use zfs for a boot partition? Unless you're using
               | every disk mounting point and nvme slot for a single
               | large raid array, you can use a cheap 512GB nvme drive or
               | old spare 2.5" ssd for the boot volume. Or two, in btrfs
               | raid1 if you absolutely must... but do you even need
               | redundancy or datasum (which can hurt performance) to
               | protect OS files? Do you really care if static package
               | files get corrupted? Those are easily reinstalled, and
               | modern quality brand SSDs are quite reliable.
        
             | crest wrote:
              | The main hurdle is hostile Linux kernel developers who
              | aren't held accountable for intentionally breaking ZFS for
              | their own petty ideological reasons, e.g. removing the in-
              | kernel FPU/SIMD register save/restore API and replacing it
              | with a "new" API that does the same thing.
              | 
              | What's "new" about the "new" API? Its symbols are GPL2-only
              | to deny its use to non-GPL2 modules (like ZFS). Guess
              | that's an easy way to make sure that BTRFS is faster than
              | ZFS, or to set yourself up as the (to be) injured party.
              | 
              | Of course a reimplementation of the old API in terms of the
              | new is an evil "GPL condom" violating the kernel license,
              | right? Why can't you see ZFS's CDDL license is the real
              | problem here for being the wrong flavour of copyleft
              | license. Way to claim the moral high ground, you short-
              | sighted, bigoted pricks. _sigh_
        
             | GuB-42 wrote:
              | It is a problem because most of the internal kernel APIs
              | are GPL-only, which limits the abilities of the ZFS module.
             | It is a common source of argument between the Linux guys
             | and the ZFS on Linux guys.
             | 
             | The reason for this is not just to piss off non-GPL module
             | developers. GPL-only internal APIs are subject to change
             | without notice, even more so than the rest of the kernel.
             | And because the licence may not allow the Linux kernel
             | developers to make the necessary changes to the module when
             | it happens, there is a good chance it breaks without
             | warning.
             | 
             | And even with that, _all_ internal APIs may change, it is
             | just a bit less likely than for the GPL-only ones, and
             | because ZFS on Linux is a separate module, there is no
             | guarantee for it to not break with successive Linux
             | versions, in fact, it is more like a guarantee that it will
             | break.
             | 
             | Linux is proudly monolithic, and as constantly evolving a
             | monolithic kernel, developers need to have control over the
             | entire project. It is also community-driven. Combined, you
             | need rules to have the community work together, or
             | everything will break down, and that's what the GPL is for.
        
           | bayindirh wrote:
           | > Most Linux distros still use ext4 by default, which is 19
           | years old, but ext4 is little more than a series of
           | extensions on top of ext2, which is the same age as NTFS.
           | 
            | However, ext4 and XFS are much simpler and more performant
            | than BTRFS & ZFS as root drives on personal systems and small
            | servers.
           | 
           | I personally won't use either on a single disk system as root
           | FS, regardless of how fast my storage subsystem is.
        
             | ryao wrote:
             | ZFS will outscale ext4 in parallel workloads with ease. XFS
             | will often scale better than ext4, but if you use L2ARC and
             | SLOG devices, it is no contest. On top of that, you can use
             | compression for an additional boost.
             | 
             | You might also find ZFS outperforms both of them in read
             | workloads on single disks where ARC minimizes cold cache
             | effects. When I began using ZFS for my rootfs, I noticed my
             | desktop environment became more responsive and I attributed
             | that to ARC.
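              | 
              | For context, adding the SLOG and L2ARC devices mentioned
              | above is one command each (device names are made up):
              | 
              |     zpool add tank log nvme0n1      # separate intent log (SLOG)
              |     zpool add tank cache nvme1n1    # L2ARC read cache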
        
               | bayindirh wrote:
               | No doubt. I want to reiterate my point. Citing myself:
               | 
               | > "I personally won't use either on a _single disk system
               | as root FS_ , regardless of how fast my storage subsystem
               | is." (emphasis mine)
               | 
               | We are no strangers to filesystems. I personally
               | benchmarked a ZFS7320 extensively, writing a
               | characterization report, plus we have a ZFS7420 for a
               | very long time, complete with separate log SSDs for read
               | and write on every box.
               | 
               | However, ZFS is not saturation proof, plus is nowhere
               | near a Lustre cluster performance wise, when scaled.
               | 
                | What kills ZFS and BTRFS on desktop systems is write
                | performance, esp. on heavy workloads like system updates.
               | If I need a desktop server (performance-wise), I'd
               | configure it accordingly and use these, but I'd _never_
               | use BTRFS or ZFS on a _single root disk_ due to their
               | overhead, to reiterate myself thrice.
        
               | ryao wrote:
               | I am generally happy with the write performance of ZFS. I
               | have not noticed slow system updates on ZFS (although I
               | run Gentoo, so slow is relative here). In what ways is
               | the write performance bad?
               | 
               | I am one of the OpenZFS contributors (although I am less
               | active as late). If you bring some deficiency to my
               | attention, there is a chance I might spend the time
               | needed to improve upon it.
               | 
               | By the way, ZFS limits the outstanding IO queue depth to
               | try to keep latencies down as a type of QoS, but you can
               | tune it to allow larger IO queue depths, which should
               | improve write performance. If your issue is related to
               | that, it is an area that could use improvement in certain
               | situations:
               | 
               | https://openzfs.github.io/openzfs-
               | docs/Performance%20and%20T...
               | 
               | https://openzfs.github.io/openzfs-
               | docs/Performance%20and%20T...
               | 
               | https://openzfs.github.io/openzfs-
               | docs/Performance%20and%20T...
        
               | jeltz wrote:
               | Not on most database workloads. There zfs does not scale
               | very well.
        
               | ryao wrote:
               | Percona and many others who benchmarked this properly
               | would disagree with you. Percona found that ext4 and ZFS
               | performed similarly when given identical hardware (with
               | proper tuning of ZFS):
               | 
               | https://www.percona.com/blog/mysql-zfs-performance-
               | update/
               | 
               | In this older comparison where they did not initially
               | tune ZFS properly for the database, they found XFS to
               | perform better, only for ZFS to outperform it when tuning
               | was done and a L2ARC was added:
               | 
               | https://www.percona.com/blog/about-zfs-performance/
               | 
               | This is roughly what others find when they take the time
               | to do proper tuning and benchmarks. ZFS outscales both
               | ext4 and XFS, since it is a multiple block device
               | filesystem that supports tiered storage while ext4 and
               | XFS are single block device filesystems (with the
               | exception of supporting journals on external drives).
               | They need other things to provide them with scaling to
               | multiple block devices and there is no block device level
               | substitute for supporting tiered storage at the
               | filesystem level.
               | 
                | That said, ZFS has a killer feature that ext4 and XFS
                | do not have: low-cost replication. You can snapshot and
                | send/recv without affecting system performance very
                | much, so even in situations where ZFS does not top
                | every benchmark on equal hardware, it still wins, since
                | the performance penalty of database backups on ext4 and
                | XFS is huge.
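                | 
                | For reference, that replication workflow is roughly the
                | following (pool, dataset, and host names are all
                | hypothetical):
                | 
                |     zfs snapshot tank/db@2025-01-14
                |     # incremental send relative to the previous snapshot
                |     zfs send -i tank/db@2025-01-13 tank/db@2025-01-14 \
                |         | ssh backuphost zfs recv backup/db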
        
           | cesarb wrote:
           | > Btrfs [...] still doesn't match ZFS in features [...]
           | 
           | Isn't the feature in question (array expansion) precisely one
           | which btrfs already had for a long time? Does ZFS have the
           | opposite feature (shrinking the array), which AFAIK btrfs
           | also already had for a long time?
           | 
           | (And there's one feature which is important to many, "being
           | in the upstream Linux kernel", that ZFS most likely will
           | never have.)
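            | 
            | For comparison, the btrfs operations in question look
            | roughly like this (device names and mount point are
            | hypothetical):
            | 
            |     btrfs device add /dev/sdd /mnt/pool     # grow the array
            |     btrfs balance start /mnt/pool           # spread existing data onto it
            |     btrfs device remove /dev/sdb /mnt/pool  # shrink: migrate data off, then drop it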
        
             | wkat4242 wrote:
              | ZFS also had expansion for a long time, but it was offline
              | expansion. I don't know if btrfs has also had online
              | expansion for a long time?
             | 
             | And shrinking no, that is a big missing feature in ZFS IMO.
             | Understandable considering its heritage (large scale
             | datacenters) but nevertheless an issue for home use.
             | 
             | But raidz is rock-solid. Btrfs' raid is not.
        
               | unsnap_biceps wrote:
               | Raidz wasn't able to be expanded in place before this.
               | You were able to add to a pool that included a raidz
               | vdev, but that raidz vdev was immutable.
        
               | wkat4242 wrote:
               | Oh ok, I've never done this, but I thought it was already
               | there. Maybe this was the original ZFS from Sun? But
               | maybe I just remember it incorrectly, sorry.
               | 
               | I've used it on multi-drive arrays but I never had the
               | need for expansion.
        
               | ryao wrote:
               | You could add top level raidz vdevs or replace the
               | members of a raid-z vdev with larger disks to increase
               | storage space back then. You still have those options
               | now.
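                | 
                | A minimal sketch of both options (pool and disk names
                | are hypothetical):
                | 
                |     # option 1: add another top-level raidz vdev
                |     zpool add tank raidz2 sde sdf sdg sdh
                | 
                |     # option 2: swap in larger disks one at a time,
                |     # waiting for each resilver to finish
                |     zpool set autoexpand=on tank
                |     zpool replace tank sda sdi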
        
           | honestSysAdmin wrote:
           | https://openzfs.github.io/openzfs-
           | docs/Getting%20Started/index.html
           | 
            | ZFS runs on all major Linux distros; the source is compiled
            | locally, and there is no meaningful license problem. In
            | datacenter and "enterprise" environments we compile ZFS
            | "statically" alongside other kernel modules all the time.
           | 
           | For over six years now, there is an "experimental" option
           | presented by the graphical Ubuntu installer to install the
           | root filesystem on ZFS. Almost everyone I personally know
           | (just my anecdote) chooses this "experimental" option. There
           | has been an occasion here and there of ZFS snapshots taking
           | up too much space, but other than this there have not been
           | any problems.
           | 
           | I statically compile ZFS into a kernel that intentionally
           | does not support loading modules on some of my personal
           | laptops. My experience has been great, others' mileage may
           | (certainly will) vary.
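            | 
            | For the curious, the builtin workflow is roughly this
            | (paths are illustrative; check the OpenZFS docs for the
            | exact flags for your release):
            | 
            |     ./configure --enable-linux-builtin --with-linux=/usr/src/linux
            |     ./copy-builtin /usr/src/linux
            |     # then set CONFIG_ZFS=y in the kernel config and build as usual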
        
         | ryao wrote:
         | What do you mean by a ZFS compatibility layer? There is a
         | Windows port:
         | 
         | https://github.com/openzfsonwindows/openzfs
         | 
         | Note that it is a beta.
        
         | MauritsVB wrote:
         | There is occasional talk of moving the Windows implementation
         | of OpenZFS
         | (https://github.com/openzfsonwindows/openzfs/releases) into an
          | officially supported tier, though that will probably come
          | after the macOS version (https://github.com/openzfsonosx) is
          | officially supported.
        
         | bayindirh wrote:
         | > How come that Windows still uses a 32 year old file system?
         | 
         | Simple. Because most of the burden is taken by the (enterprise)
         | storage hardware hosting the FS. Snapshots, block level
         | deduplication, object storage technologies, RAID/Resiliency,
         | size changes, you name it.
         | 
         | Modern storage appliances are black magic, and you don't need
         | much more features from NTFS. You either transparently access
         | via NAS/SAN or store your NTFS volumes on capable disk boxes.
         | 
          | In the Linux world, at the higher end, there are Lustre
          | and GPFS. ZFS is mostly for resilient, but not
          | performance-critical, needs.
        
           | BSDobelix wrote:
           | >ZFS is mostly for resilient, but not performance critical
           | needs.
           | 
           | Los Alamos disagrees ;)
           | 
           | https://www.lanl.gov/media/news/0321-computational-storage
           | 
            | But yes, in general you are right; CERN for example uses
            | Ceph:
           | 
           | https://indico.cern.ch/event/1457076/attachments/2934445/515.
           | ..
        
             | bayindirh wrote:
              | I think what LLNL did predates GPUDirect and the other
              | new technologies that came after 2022, but that's a good
              | start.
              | 
              | CERN's Ceph is also for their "General IT" needs. Their
              | clusters are independent from that. Also, most of CERN's
              | processing is distributed across Europe. We are part of
              | that network.
             | 
              | Many, if not all, of the HPC centers we talk with use
              | Lustre as their "immediate" storage. Also, there's Weka
              | now, a closed-source storage system supporting _insane_
              | speeds and tons of protocols at the same time. It is
              | mostly used for and by GPU clusters around the world.
              | You connect _terabits_ to such a cluster _casually_.
              | It's all flash, and flat-out fast.
        
               | ryao wrote:
               | Did you confuse LANL for LLNL?
        
               | bayindirh wrote:
               | It's just a typo, not a confusion, and I'm well beyond
               | the edit window.
        
           | poisonborz wrote:
           | So private consumers should just pay cloud subscription if
           | they want safer/modern data storage for their PC? (without
           | NAS)
        
             | bayindirh wrote:
              | I think Microsoft discontinued the Windows 7-era backup
              | tool to push people toward buying OneDrive subscriptions.
              | They also forcibly enabled the feature when they first
              | introduced it.
             | 
             | So, I think that your answer for this question is
             | "unfortunately, yes".
             | 
             | Not that I support the situation.
        
             | BSDobelix wrote:
              | If you need Windows, you can use something like restic
              | (checksums and compression) and external drives (more
              | than one, stored in more than one place) to make a
              | backup. Plus, "maybe" but not strictly needed, ReFS on
              | your non-system data partition, which is included in the
              | Workstation/Enterprise editions of Windows.
             | 
              | I trust my own backups much more than any subscription,
              | not so much from a technical point of view as from an
              | access point of view (e.g. losing access to your Google
              | account).
             | 
             | EDIT: You have to enable check-summing and/or compression
             | for data on ReFS manually
             | 
             | https://learn.microsoft.com/en-us/windows-
             | server/storage/ref...
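              | 
              | A minimal restic sketch of that setup (repository path
              | and source directory are hypothetical):
              | 
              |     restic -r E:\backups init
              |     restic -r E:\backups backup C:\Users\me\Documents
              |     # verify the checksums of everything in the repository
              |     restic -r E:\backups check --read-data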
        
               | bayindirh wrote:
               | > I trust my own backups much more than any subscription,
               | not from a technical standpoint but from an access one
               | (for example, losing access to your google account).
               | 
               | I personally use cloud storage extensively, but I keep a
               | local version with periodic rclone/borg. It allows me
               | access from everywhere and sleep well at night.
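                | 
                | Roughly what that periodic job looks like, with the
                | remote and repository names being hypothetical:
                | 
                |     rclone sync gdrive:Documents /srv/mirror/documents
                |     borg create --compression zstd \
                |         /srv/backup/borg::documents-{now} /srv/mirror/documents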
        
               | qwertox wrote:
               | NTFS has Volume Shadow Copy, which is "good enough" for
               | private users if they want to create image backups while
               | their system is running.
        
               | BSDobelix wrote:
                | First of all, that's not a backup, that's a snapshot,
                | and NO, that's not "good enough": tell your grandma
                | that all her digitised pictures are gone because her
                | hard drive exploded, or that her single most important
                | JPEG is now unviewable because of bitrot.
               | 
               | Just because someone is a private user doesn't mean that
               | the data is less important, often it's quite the
               | opposite, for example a family album vs your cloned git
               | repository.
        
               | tjoff wrote:
               | ... VSS is used to create backups. Re-read parent.
        
               | BSDobelix wrote:
                | Not good enough: you can make 10,000 backups of
                | bitrotted data, but if you don't have checksums on your
                | blocks (ZFS) or files (restic), nothing can help you.
                | That's the same integrity as copying stuff onto your
                | thumb drive.
        
             | shrubble wrote:
              | No, private consumers have a choice, since Linux and
              | FreeBSD run well on their hardware. Microsoft is too busy
             | shoveling their crappy AI and convincing OEMs to put a
             | second Windows button (the CoPilot button) on their
             | keyboards.
        
             | bluGill wrote:
              | Probably. There are levels of backups, and a cloud
              | subscription SHOULD give you copies in geographically
              | separate locations, with someone to help you restore
              | when (NOT IF!) it's needed, especially if you aren't
              | into computers and don't want to learn the complex
              | details.
             | 
              | I have all my backups on a NAS in the next room. This
              | covers the vast majority of use cases for backups, but if
              | my house burns down, everything is lost. I know I'm
              | taking that risk, but really I should do better. Just
              | paying someone to do it all in the cloud would probably
              | be better for me as well, and I keep thinking I should do
              | this.
             | 
             | Of course paying someone assumes they will do their job.
             | There are always incompetent companies out there to take
             | your money.
        
               | pdimitar wrote:
               | My setup is similar to yours, but I also distribute my
               | most important data in compressed (<5GB) encrypted
               | backups to several free-tier cloud storage accounts. I
               | could restore it by copying one key and running one
               | script.
               | 
                | I lost faith in most paid operators. "Whoops, this
                | thing that absolutely can happen to home users, and
                | that we're supposed to protect them from, actually
                | happened to us and we were not prepared. We're so
                | sorry!"
               | 
               | Nah. Give me access to 5-15 cloud storage accounts, I'll
               | handle it myself. Have done so for years.
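                | 
                | Roughly the shape of that script, with every name in it
                | being hypothetical:
                | 
                |     f=important-$(date +%F).tar.gz.gpg
                |     tar -cz /home/me/important \
                |       | gpg --symmetric --cipher-algo AES256 -o "$f"
                |     for remote in gdrive1 dropbox1 box1; do
                |         rclone copy "$f" "$remote:backups/"
                |     done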
        
             | NoMoreNicksLeft wrote:
             | Having a NAS is life-changing. Doesn't have to be some
             | large 20-bay monstrosity, just something that will give you
             | redundancy and has an ethernet jack.
        
         | zamadatix wrote:
          | NTFS has been extended in various ways over the years, to
          | the point that what you could do with an NTFS drive 32 years
          | ago feels like a completely different filesystem compared to
          | what you can do with it on current Windows.
         | 
          | Honestly I really like ReFS, particularly in the context of
          | Storage Spaces, but I don't think it's relevant to Microsoft's
          | consumer
         | desktop OS where users don't have 6 drives they need to pool
         | together. Don't get me wrong, I use ZFS because that's what I
         | can get running on a Linux server and I'm not going to go run
         | Windows Server just for the storage pooling... but ReFS +
         | Storage Spaces wins my heart with the 256 MB slab approach.
          | This means you can add and remove mixed-size drives and get
          | the maximum space utilization for the parity settings of the
          | pool. ZFS, by contrast, is only now getting online additions
          | of same-size or larger drives, 10 years later.
        
         | nickdothutton wrote:
         | OS development pretty much stopped around 2000. ZFS is from
         | 2001. I don't count a new way to organise my photos or
         | integrate with a search engine as "OS" though.
        
         | doctorpangloss wrote:
         | The same reason file deduplication is not enabled for client
         | Windows: greed.
         | 
         | For example, there are numerous new file systems people use:
         | OneDrive, Google Drive, iCloud Storage. Do you get it?
        
       | happosai wrote:
       | The annual reminder that if Oracle wanted to contribute
        | positively to the Linux ecosystem, they would update the CDDL
        | license ZFS uses to be GPL-compatible.
        
         | ryao wrote:
         | This is the annual reply that Oracle cannot change the OpenZFS
         | license because OpenZFS contributors removed the "or any later
         | version" part of the license from their contributions.
         | 
         | By the way, comments such as yours seem to assume that Oracle
         | is somehow involved with OpenZFS. Oracle has no connection with
         | OpenZFS outside of owning copyright on the original OpenSolaris
         | sources and a few tiny commits their employees contributed
         | before Oracle purchased Sun. Oracle has its own internal ZFS
         | fork and they have zero interest in bringing it to Linux. They
         | want people to either go on their cloud or buy this:
         | 
         | https://www.oracle.com/storage/nas/
        
           | jeroenhd wrote:
           | Is there a reason the OpenZFS contributors don't want to
           | dual-license their code? I'm not too familiar with the CDDL
           | but I'm not sure what advantage it brings to an open source
           | project compared to something like GPL? Having to deal with
           | DKMS is one of the reasons why I'm sticking with BTRFS for
           | doing ZFS-like stuff.
        
             | ryao wrote:
             | The OpenZFS code is based on the original OpenSolaris code,
             | and the license used is the CDDL because that is what
              | OpenSolaris used. Dual licensing would require the current
              | OpenSolaris copyright holder to agree, which is unlikely
              | without writing a very big check. Further speculation is
              | not productive, but since I know a number of people
              | assume that the OpenSolaris copyright holder is the only
              | one preventing this, let me preemptively say that it is not
             | so simple. Different groups have different preferred
             | licenses. Some groups cannot stand certain licenses. Other
             | groups might detest the idea of dual licensing in general
             | since it causes community fragmentation whenever
             | contributors decide to publish changes only under 1 of the
             | 2 licenses.
             | 
             | The CDDL was designed to ensure that if Sun Microsystems
             | were acquired by a company hostile to OSS, people could
             | still use Sun's open source software. In particular, the
             | CDDL has an explicit software patent grant. Some consider
             | that to have been invaluable in preempting lawsuits from a
             | certain company that would rather have ZFS be closed source
             | software.
        
         | abrookewood wrote:
         | The only thing Oracle wants to "contribute positively to" is
         | Larry's next yacht.
        
         | MauritsVB wrote:
         | Oracle changing the license would not make a huge difference to
         | OpenZFS.
         | 
         | Oracle only owns the copyright to the original Sun Microsystems
         | code. It doesn't apply to all ZFS implementations (probably not
         | OracleZFS, perhaps not IllumosZFS) but in the specific case of
         | OpenZFS the majority of the code is no longer Sun code.
         | 
         | Don't forget that SunZFS was open sourced in 2005 before Oracle
         | bought Sun Microsystems in 2009. Oracle have created their own
         | closed source version of ZFS but outside some Oracle shops
          | nobody uses it (some people say Oracle stopped working on
          | OracleZFS altogether some time ago).
         | 
         | Considering the forks (first from Sun to the various open
         | source implementations and later the fork from open source into
         | Oracle's closed source version) were such a long time ago,
         | there is not that much original code left. A lot of storage
         | tech, or even entire storage concepts, did not exist when Sun
         | open sourced ZFS. Various ZFS implementations developed their
         | own support for TRIM, or Sequential Resilvering, or Zstd
         | compression, or Persistent L2ARC, or Native ZFS Encryption, or
         | Fusion Pools, or Allocation Classes, or dRAID, or RAIDZ
          | expansion long after 2005. That's why the majority of the
         | code in OpenZFS 2 is from long after the fork from Sun code
         | twenty years ago.
         | 
         | Modern OpenZFS contains new code contributions from Nexenta
         | Systems, Delphix, Intel, iXsystems, Datto, Klara Systems and a
         | whole bunch of other companies that have voluntarily offered
         | their code when most of the non-Oracle ZFS implementations
         | merged to become OpenZFS 2.0.
         | 
         | If you'd want to relicense OpenZFS you could get Oracle to
         | agree for the bit under Sun copyright but for the majority of
         | the code you'd have to get a dozen or so companies to agree to
         | relicensing their contributions (probably not that hard) and
         | many hundreds of individual contributors over two decades (a
         | big task and probably not worth it).
        
       | abrookewood wrote:
       | Can someone provide details on this bit please? "Direct IO:
       | Allows bypassing the ARC for reads/writes, improving performance
       | in scenarios like NVMe devices where caching may hinder
       | efficiency".
       | 
       | ARC is based in RAM, so how could it reduce performance when used
       | with NVMe devices? They are fast, but they aren't RAM-fast ...
        
         | nolist_policy wrote:
          | Because with a cache (the ARC) you have to copy from the app
          | to the cache and then DMA to disk. With Direct IO you can DMA
          | directly from the app's RAM to the disk.
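          | 
          | If I read the 2.3 notes right, applications get this by
          | opening files with O_DIRECT, and there is a new per-dataset
          | property to control the behaviour; something along these
          | lines, with the dataset name being hypothetical:
          | 
          |     zfs set direct=standard tank/nvme   # honour O_DIRECT when the app asks for it
          |     zfs set direct=always tank/nvme     # force Direct IO for all reads/writes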
        
         | philjohn wrote:
         | Yes - interested in this too. Is this for both ARC and L2ARC,
         | or just L2ARC?
        
       | jakedata wrote:
       | Happy to see the ARC bypass for NVMe performance. ZFS really
       | fails to exploit NVMe's potential. Online expansion might be
       | interesting. I tried to use ZFS for some very busy databases and
       | ended up getting bitten badly by the fragmentation bug. The only
       | way to restore performance appears to be copying the data off the
        | volume, nuking it, and then copying it back. Now, perhaps, if
        | I expand the zpool I might be able to reduce fragmentation by
        | copying the tablespace within the same volume.
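        | 
        | For anyone wanting to watch for that, the fragmentation figure
        | ZFS reports (free-space fragmentation, not file fragmentation)
        | can be checked per pool, with the pool name being hypothetical:
        | 
        |     zpool list -o name,size,allocated,capacity,fragmentation tank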
        
       | bitmagier wrote:
       | Marvelous!
        
       | wkat4242 wrote:
       | Note: This is online expansion. Expansion was always possible but
        | you did need to take the array down to do it. You could also
        | move to bigger drives, but you had to do that one at a time
        | (and you only gained the new capacity once all drives were
        | upgraded, of course).
       | 
       | As far as I know shrinking a pool is still not possible though.
       | So if you have a pool with 5 drives and add a 6th, you can't go
       | back to 5 drives even if there is very little data in it.
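        | 
        | For anyone looking for it, my understanding is that the new
        | in-place expansion in 2.3 is driven by zpool attach against the
        | raidz vdev itself (pool, vdev, and disk names are hypothetical):
        | 
        |     zpool attach tank raidz1-0 sdd   # adds sdd to the raidz1 vdev and reflows data online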
        
       | shepherdjerred wrote:
       | How does ZFS compare to btrfs? I'm currently using btrfs for my
       | home server, but I've had some strange troubles with it. I'm
       | thinking about switching to ZFS, but I don't want to end up in
       | the same situation.
        
         | ryao wrote:
          | I first tried btrfs 15 years ago with Linux 2.6.33-rc4, if I
          | recall correctly. It developed an unlinkable file within 3
          | days, so I
         | stopped using it. Later, I found ZFS. It had a few less
         | significant problems, but I was a CS student at the time and I
         | thought I could fix them since they seemed minor in comparison
         | to the issue I had with btrfs, so over the next 18 months, I
         | solved all of the problems that it had that bothered me and
         | sent the patches to be included in the then ZFSOnLinux
         | repository. My effort helped make it production ready on Linux.
         | I have used ZFS ever since and it has worked well for me.
         | 
         | If btrfs had been in better shape, I would have been a btrfs
         | contributor. Unfortunately for btrfs, it not only was in bad
         | shape back then, but other btrfs issues continued to bite me
         | every time I tried it over the years for anything serious (e.g.
         | frequent ENOSPC errors when there is still space). ZFS on the
         | other hand just works. Myself and many others did a great deal
         | of work to ensure it works well.
         | 
         | The main reason for the difference is that ZFS had a very solid
         | foundation, which was achieved by having some fantastic
         | regression testing facilities. It has a userland version that
         | randomly exercises the code to find bugs before they occur in
         | production and a test suite that is run on every proposed
         | change to help shake out bugs.
         | 
         | ZFS also has more people reviewing proposed changes than other
         | filesystems. The Btrfs developers will often state that there
          | is a significant manpower difference between the two
          | filesystems. I vaguely recall them claiming the difference
          | was a
         | factor of 6.
         | 
         | Anyway, few people who use ZFS regret it, so I think you will
         | find you like it too.
        
         | parshimers wrote:
         | btrfs has similar aims to ZFS, but is far less mature. i used
         | it for my root partitions due to it not needing DKMS, but had
          | many troubles. i used it in a fairly simple way, just a
          | mirror. one day, one of the drives in the array started to
          | have issues and btrfs fell on its face. it remounted
          | everything read-only if i remember correctly, and would not
          | run in degraded mode by default. even mdraid would do better
          | than this, without checksumming and so forth. ZFS likewise
          | says that the array is faulted, but of course still allows it
          | to be used. the fact that the default behavior was not really
          | RAID, because it's literally missing the R part of reading
          | the data back, made me lose any faith in it. i moved to ZFS
          | and haven't had issues since. there
         | is much more of a community and lots of good tooling around it.
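          | 
          | for context, the workaround when that happens is an explicit
          | degraded mount, roughly like this (device and mountpoint are
          | hypothetical):
          | 
          |     mount -o degraded /dev/sdb /mnt   # btrfs refuses a normal mount with a member missing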
        
       ___________________________________________________________________
       (page generated 2025-01-14 23:01 UTC)