[HN Gopher] ZFS: Apple's New Filesystem that wasn't (2016)
___________________________________________________________________
ZFS: Apple's New Filesystem that wasn't (2016)
Author : jitl
Score : 128 points
Date : 2025-04-27 09:25 UTC (13 hours ago)
(HTM) web link (ahl.dtrace.org)
(TXT) w3m dump (ahl.dtrace.org)
| jitl wrote:
| Besides the licensing issue, I wonder if optimizing ZFS for low
| latency + low RAM + low power on iPhone would have been an uphill
| battle or relatively easy. My experience running ZFS years ago was
| poor latency and large RAM use with my NAS, but that hardware and
| drive configuration was optimized for low $ per GB stored and
| used parity RAID.
| zoky wrote:
| If it were an issue it would hardly be an insurmountable one. I
| just can't imagine a scenario where Apple engineers go "Yep,
| we've eked out all of the performance we possibly can from this
| phone, the only thing left to do is change out the filesystem."
| klodolph wrote:
| Does it matter if it's insurmountable? At some point, the
| benefits of a new FS outweigh the drawbacks. This happens
| earlier than you might think, because of weird factors like
| "this lets us retain top filesystem experts on staff".
| karlgkk wrote:
| It's worth remembering that the filesystem they were
| looking to replace was HFS+. It was introduced in the 90s
| as a modernization of HFS, itself introduced in the 80s.
|
| Now, old does not necessarily mean bad, but in this
| case....
| twoodfin wrote:
| This seems like an early application of the Tim Cook doctrine:
| Why would Apple want to surrender control of this key bit of
| technology for their platforms?
|
| The rollout of APFS a decade later validated this concern.
| There's just no way that flawless transition happens so rapidly
| without a filesystem fit to order for Apple's needs from Day 0.
| TheNewsIsHere wrote:
| (Edit: My comment is simply about the logistics and work
| involved in a very well executed filesystem migration. Not
| about whether ZFS is good for embedded or memory constrained
| devices.)
|
| What you describe hits my ear as more NIH syndrome than
| technical reality.
|
| Apple's transition to APFS was managed like you'd manage any
| kind of mass-scale filesystem migration. I can't imagine
| they'd have done anything differently if they'd adopted ZFS.
|
| Which isn't to say they wouldn't have modified ZFS.
|
| But with proper driver support and testing it wouldn't have
| made much difference whether they wrote their own file system
| or adopted an existing one. They have done a fantastic job of
| compartmentalizing and rationalizing their OS and user data
| partitions and structures. It's not like every iPhone model
| has a production run that has different filesystem needs that
| they'd have to sort out.
|
| There was an interesting talk given at WWDC a few years ago
| on this. The roll out of APFS came after they'd already
| tested the filesystem conversion for randomized groups of
| devices and then eventually every single device that upgraded
| to one of the point releases prior to iOS 10.3. The way they
| did this was to basically run the conversion in memory as a
| logic test against real data. At the end they'd have the
| superblock for the new APFS volume, and on a successful exit
| they simply discarded it instead of writing it to persistent
| storage. If it errored it would send a trace back to Apple.
|
| Extensive testing, plus that consistency in OS and user data
| partitioning and directory structures, is a big part of why
| that migration worked so flawlessly.
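|
| A rough sketch in Python of that "convert, verify, then discard"
| dry-run pattern (everything here is hypothetical, not Apple's
| actual code):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class OldFile:
|         name: str
|         data: bytes
|
|     @dataclass
|     class NewSuperblock:
|         entries: dict = field(default_factory=dict)
|         def add(self, f: OldFile):
|             # fake "converted" metadata entry
|             self.entries[f.name] = len(f.data)
|         def verify(self, source):
|             assert len(self.entries) == len(source), "lost files"
|
|     def dry_run_convert(old_volume, report_error=print):
|         sb = NewSuperblock()            # built purely in memory
|         try:
|             for f in old_volume:
|                 sb.add(f)
|             sb.verify(old_volume)
|         except Exception as e:
|             report_error(e)             # i.e. send a trace home
|         # On success the superblock is simply discarded,
|         # never written to persistent storage.
|
|     dry_run_convert([OldFile("a.txt", b"hi"), OldFile("b.txt", b"yo")])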
| jeroenhd wrote:
| I don't see why ZFS wouldn't have gone over equally
| flawlessly. None of the features that make ZFS special were
| in HFS(+), so conversion wouldn't be too hard. The only
| challenge would be maintaining the legacy compression
| algorithms, but ZFS is configurable enough that Apple
| could've added their custom compression to it quite easily.
|
| There are probably good reasons for Apple to reinvent ZFS as
| APFS a decade later, but none of them technical.
|
| I also wouldn't call the rollout of APFS flawless, per se.
| It's still a terrible fit for (external) hard drives and
| their own products don't auto convert to APFS in some cases.
| There was also plenty of breakage when case-sensitivity
| flipped on people and software, but as far as I can tell
| Apple just never bothered to address that.
| jonhohle wrote:
| HFS compression, AFAICT, is all done in user space with
| metadata and extended attributes.
| kmeisthax wrote:
| To be clear, BTRFS _also_ supports in-place upgrade. It's
| not a uniquely Apple feature; any copy-on-write filesystem
| with flexibility as to where data is located can be made to
| fit inside of the free blocks of another filesystem. Once you
| can do that, then you can do test runs[0] of the filesystem
| upgrade before committing to wiping the superblock.
|
| I don't know for certain if they could have done it with ZFS,
| but I can imagine it would at least have been doable with some
| Apple extensions that would only have to exist during test /
| upgrade time.
|
| [0] Part of why the APFS upgrade was so flawless was that
| Apple had done a test upgrade in a prior iOS update. They'd
| run the updater, log any errors, and then revert the upgrade
| and ship the error log back to Apple for analysis.
| toast0 wrote:
| Using ZFS isn't surrendering control. Same as using parts of
| FreeBSD. Apple retains control because they don't have an
| obligation (or track record) of following the upstream.
|
| For ZFS, there have been a lot of improvements over the years,
| but if they had forked it, adapted it, and then left it
| alone, their fork would continue to work without outside
| control. They could pull in things from outside if they want,
| when they want; some parts easier than others.
| hs86 wrote:
| While its deduplication feature clearly demands more memory, my
| understanding is that the ZFS ARC is treated by the kernel as a
| driver with a massive, persistent memory allocation that cannot
| be swapped out ("wired" pages). Unlike the regular file system
| cache, ARC's eviction is not directly managed by the kernel.
| Instead, ZFS itself is responsible for deciding when and how to
| shrink the ARC.
|
| This can lead to problems under sudden memory pressure. Because
| the ARC does not immediately release memory when the system
| needs it, userland pages might get swapped out instead. This
| behavior is more noticeable on personal computers, where memory
| usage patterns are highly dynamic (applications are constantly
| being started, used, and closed). On servers, where workloads
| are more static and predictable, the impact is usually less
| severe.
|
| I do wonder if this is also the case on Solaris or illumos,
| where there is no intermediate SPL between ZFS and the kernel.
| If so, I don't think that a hypothetical native integration of
| ZFS on macOS (or even Linux) would adopt the ARC in its current
| form.
| dizhn wrote:
| Max ARC size is configurable and it does not need the
| mythical 1GB per TB to function well.
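|
| For the Linux OpenZFS port, a minimal sketch of checking the
| current ARC size and cap (the paths are the standard kstat and
| module parameter locations on Linux; other platforms differ):
|
|     # Read ARC size and target max from the kstats.
|     def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
|         stats = {}
|         with open(path) as f:
|             for line in f.readlines()[2:]:   # skip kstat header
|                 name, _type, value = line.split()
|                 stats[name] = int(value)
|         return stats
|
|     s = read_arcstats()
|     print(f"ARC size:  {s['size'] / 2**30:.1f} GiB")
|     print(f"ARC c_max: {s['c_max'] / 2**30:.1f} GiB")
|     # To cap the ARC at 1 GiB (as root):
|     #   echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max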
| ryao wrote:
| The ZFS driver will release memory if the kernel requests it.
| The only integration level issue is that the free command
| does not show ARC as a buffer/cache, so it misrepresents
| reality, but as far as I know, this is an issue with caches
| used by various filesystems (e.g. extent caches). It is only
| obvious in the case of ZFS because the ARC can be so large.
| That is a feature, not a bug, since unused memory is wasted
| memory.
| pseudalopex wrote:
| > The ZFS driver will release memory if the kernel requests
| it.
|
| Not always fast enough.
| netbsdusers wrote:
| Solaris achieved some kind of integration between the ARC and
| the VM subsystem as part of the VM2 project. I don't know any
| more details than that.
| ryao wrote:
| I assume that the VM2 project achieved something similar to
| the ABD changes that were done in OpenZFS. ABD replaced the
| use of SLAB buffers for ARC with lists of pages. The issue
| with SLAB buffers is that absurd amounts of work could be
| done to free memory, and a single long-lived SLAB object
| would prevent any of it from mattering. Long-lived SLAB
| objects caused excessive reclaim, slowed down the process
| of freeing enough memory to satisfy system needs, and in
| some cases prevented enough memory from being freed at all.
| Switching to linked lists of
| pages fixed that since the memory being freed from ARC upon
| request would immediately become free rather than be
| deferred to when all of the objects in the SLAB had been
| freed.
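|
| A toy model (not ZFS code) of why that matters: a slab is only
| returned to the system when every object in it is freed, while
| per-page lists give memory back object by object.
|
|     SLAB_OBJECTS = 8
|
|     def slab_reclaimable(live_flags):
|         # One slab of SLAB_OBJECTS buffers; nothing comes back
|         # until all of them are dead.
|         return 0 if any(live_flags) else SLAB_OBJECTS
|
|     def pagelist_reclaimable(live_flags):
|         # Page-list (ABD-style) buffers: each freed buffer's
|         # pages are returned immediately.
|         return sum(1 for live in live_flags if not live)
|
|     # 7 of 8 ARC buffers evicted, 1 long-lived buffer remains:
|     flags = [False] * 7 + [True]
|     print("slab model frees:     ", slab_reclaimable(flags))      # 0
|     print("page-list model frees:", pagelist_reclaimable(flags))  # 7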
| fweimer wrote:
| If I recall correctly, ZFS error recovery was still "restore
| from backup" at the time, and iCloud acceptance was more
| limited. (ZFS basically gave up if an error was encountered
| after the checksum showed that the data was read correctly from
| storage media.) That's fine for deployments where the
| individual system does not matter (or you have dedicated staff
| to recover systems if necessary), but phones aren't like that.
| At least not from the user perspective.
| ryao wrote:
| ZFS has ditto blocks that allow it to self-heal in the case
| of corrupt metadata as long as a good copy remains (and there
| would be at least 2 copies by default). ZFS only ever needs
| you to restore from backup if the damage is so severe that
| there is no making sense of things.
|
| Minor things like the indirect blocks being missing for a
| regular file only affect that file. Major things like all 3
| copies of the MOS (the equivalent to a superblock) being gone
| for all uberblock entries would require recovery from backup.
|
| If all copies of any other filesystem's superblock were gone
| too, that filesystem would be equally irrecoverable and would
| require restoring from backup.
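|
| A conceptual sketch (not ZFS internals) of how redundant copies
| plus checksums let a read self-heal:
|
|     import hashlib
|
|     def checksum(data):
|         return hashlib.sha256(data).digest()
|
|     def read_with_self_heal(copies, expected):
|         for copy in copies:
|             if checksum(copy) == expected:
|                 for other in copies:          # repair bad siblings
|                     if checksum(other) != expected:
|                         other[:] = copy
|                 return bytes(copy)
|         raise IOError("all copies corrupt: restore from backup")
|
|     good = bytearray(b"metadata block")
|     bad = bytearray(b"metaXata block")        # simulated bit rot
|     print(read_with_self_heal([bad, good], checksum(b"metadata block")))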
| fweimer wrote:
| As far as I understand it, ditto blocks were only used if
| the corruption was detected due to checksum mismatch. If
| the checksum was correct, but metadata turned out to be
| unusable later (say because it was corrupted in memory, and
| the checksum was computed after the corruption
| happened), that was treated as a fatal error.
| alwillis wrote:
| Apple wanted one operating system that ran on everything from a
| Mac Pro to an Apple Watch and there's no way ZFS could have
| done that.
| ryao wrote:
| ZFS would be quite comfortable with the 512MB of RAM on an
| Apple Watch:
|
| https://iosref.com/ram-processor
|
| People have run operating systems using ZFS on less.
| volemo wrote:
| It was just yesterday I relistened to the contemporary
| Hypercritical episode on the topic:
| https://hypercritical.fireside.fm/56
| mrkstu wrote:
| Wow, John's voice has changed a LOT from back then
| jeroenhd wrote:
| I wonder what ZFS in the iPhone would've looked like. As far as I
| recall, the iPhone didn't have error correcting memory, and ZFS
| is notorious for corrupting itself when bit flips hit it and
| break the checksum on disk. ZFS' RAM-hungry nature would've also
| forced Apple to add more memory to their phone.
| amarshall wrote:
| > ZFS is notorious for corrupting itself when bit flips hit it
| and break the checksum on disk
|
| ZFS does not need or benefit from ECC memory any more than any
| other FS. The bitflip corrupted the data, regardless of ZFS.
| Any other FS is just oblivious; ZFS will at least tell you your
| data is corrupt but happily keep operating.
|
| > ZFS' RAM-hungry nature
|
| ZFS is not really RAM-hungry, unless one uses deduplication
| (which is not enabled by default, nor generally recommended).
| It can often seem RAM hungry on Linux because the ARC is not
| counted as "cache" like the page cache is.
|
| ---
|
| ZFS docs say as much as well:
| https://openzfs.github.io/openzfs-docs/Project%20and%20Commu...
| williamstein wrote:
| And even dedup was finally rewritten to be significantly more
| memory efficient, as of the new 2.3 release of ZFS:
| https://github.com/openzfs/zfs/discussions/15896
| Dylan16807 wrote:
| > ZFS is notorious for corrupting itself when bit flips hit it
| and break the checksum on disk
|
| I don't think it is. I've never heard of that happening, or
| seen any evidence ZFS is more likely to break than any random
| filesystem. I've only seen people spreading paranoid rumors
| based on a couple pages saying ECC memory is important to fully
| get the benefits of ZFS.
| thfuran wrote:
| They also insist that you need about 10 TB RAM per TB disk
| space or something like that.
| yjftsjthsd-h wrote:
| There is a rule of thumb that you should have at least 1 GB
| of RAM per TB of disk _when using deduplication_. That's....
| Different.
| williamstein wrote:
| Fortunately, this has significantly improved since dedup
| was rewritten as part of the new ZFS 2.3 release. Search
| for zfs "fast dedup".
| thfuran wrote:
| So you've never seen the people saying you should steer
| clear of ZFS unless you're going to have an enormous ARC
| even when talking about personal media servers?
| amarshall wrote:
| Even then you obviously need L2ARC as well!! /s
| thfuran wrote:
| But on optane. Because obviously you need an all flash
| main array for streaming a movie.
| toast0 wrote:
| People, especially those on the Internet, say a lot of
| things.
|
| Some of the things they say aren't credible, even if
| they're said often.
|
| You don't need an enormous amount of ram to run zfs
| unless you have dedupe enabled. A lot of people thought
| they wanted dedupe enabled though. (2024's fast dedupe
| may help, but probably the right answer for most people
| is not to use dedupe)
|
| It's the same thing with the "need" for ECC. If your ram
| is bad, you're going to end up with bad data in your
| filesystem. With ZFS, you're likely to find out your
| filesystem is corrupt (although, if the data is corrupted
| before the checksum is calculated, then the checksum
| doesn't help); with a non-checksumming filesystem, you
| may get lucky and not have metadata get corrupted and
| the OS keeps going, just some of your files are wrong.
| Having ECC would be better, but there are tradeoffs so it
| never made sense for me to use it at home; zfs still
| works and is protecting me from disk contents changing,
| even if what was written could be wrong.
| yjftsjthsd-h wrote:
| Not that I recall? And it's worked fine for me...
| ryao wrote:
| I have seen people say such things, and none of it was
| based on reality. They just misinterpreted the
| performance cliff that data deduplication had to mean you
| must have absurd amounts of memory even though data
| deduplication is off by default. I suspect few of the
| people peddling this nonsense even used ZFS and the few
| who did, had not looked very deeply into it.
| amarshall wrote:
| It's unfortunate some folks are missing the tongue-in-cheek
| nature of your comment.
| mrkeen wrote:
| > ZFS is notorious for corrupting itself when bit flips hit it
| and break the checksum on disk.
|
| What's a bit flip?
| zie wrote:
| Basically it's that memory changes out from under you. As we
| know, computers use binary, so everything boils down to it
| being a 0 or a 1. A bit flip is changing what was, say, a 0
| into a 1.
|
| Usually attributed to "cosmic rays", but really can happen
| for any number of less exciting sounding reasons.
|
| Basically, there is zero double-checking in your computer for
| almost _everything_ except stuff that goes across the
| network. Memory and disks are not checked for correctness,
| basically ever, on any machine anywhere. Many servers (but
| certainly not all) are the rare exception when it comes to
| memory safety. They usually have ECC (Error Correcting Code)
| memory, basically a checksum on the memory to ensure that if
| memory is corrupted, it's noticed and fixed.
|
| Essentially every filesystem everywhere does zero data
| integrity checking:
|     macOS APFS: Nope
|     Windows NTFS: Nope
|     Linux EXT4: Nope
|     BSD's UFS: Nope
|     Your mobile phone: Nope
|
| ZFS is the rare exception among filesystems in that it actually
| double-checks that the data you save to it is the data you get
| back from it. Every other filesystem is just a big ball of
| unknown data. You probably get back what you put in, but there
| are zero promises or guarantees.
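|
| A sketch of the idea (not any real filesystem's on-disk format):
| the checksum lives in the parent block pointer, so corrupted
| data can't vouch for itself.
|
|     import hashlib
|
|     def cksum(b):
|         return hashlib.sha256(b).digest()
|
|     class Disk:
|         def __init__(self):
|             self.blocks = {}
|         def write(self, addr, data):
|             self.blocks[addr] = bytearray(data)
|         def read(self, addr):
|             return bytes(self.blocks[addr])
|
|     disk = Disk()
|     data = b"important file contents"
|     disk.write(7, data)
|     pointer = {"addr": 7, "cksum": cksum(data)}  # kept in the parent
|
|     disk.blocks[7][3] ^= 0x01                    # simulate a bit flip
|
|     read_back = disk.read(pointer["addr"])
|     if cksum(read_back) != pointer["cksum"]:
|         print("checksum mismatch: corruption detected")  # ZFS-style
|     else:
|         print("data silently accepted")          # most other FSes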
| crazygringo wrote:
| > _disks are not checked for correctness, basically ever on
| any machine anywhere._
|
| I'm not sure that's really accurate -- all modern hard
| drives and SSDs use error-correcting codes, as far as I
| know.
|
| That's different from implementing _additional_ integrity
| checking at the filesystem level. But it's definitely
| there to begin with.
| tpetry wrote:
| But SSDs (to my knowledge) only implement checksums for
| the data transfer. It's a requirement of the protocol. So
| you can be sure that the stuff in memory and the checksum
| computed by the CPU arrive exactly like that at the SSD.
| In the past this was a common error source with hardware
| RAID which was faulty.
|
| But there is ABSOLUTELY NO checksum for the bits stored
| on an SSD. So bit rot in the cells of the SSD goes
| undetected.
| lgg wrote:
| That is ABSOLUTELY incorrect. SSDs have enormous amounts
| of error detection and correction built in, explicitly
| because errors on the raw medium are so common that
| without it you would never be able to read correct data
| from the device.
|
| It has been years since I was familiar enough with the
| insides of SSDs to tell you exactly what they are doing
| now, but even ~10-15 years ago it was normal for each raw
| 2k block to actually be ~2176+ bytes and use at least 128
| bytes for LDPC codes. Since then the block sizes have
| gone up (which reduces the number of bytes you need to
| achieve equivalent protection) and the lithography has
| shrunk (which increases the raw error rate).
|
| Where exactly the error correction is implemented
| (individual dies, SSD controller, etc) and how it is
| reported can vary depending on the application, but I can
| say with assurance that there is no chance your OS sees
| uncorrected bits from your flash dies.
| zie wrote:
| > I can say with assurance that there is no chance your
| OS sees uncorrected bits from your flash dies.
|
| While true, there are no promises that what you meant to
| save and what gets saved are the same things. All the
| drive mostly promises is that if the drive safely wrote
| XYZ to the disk and you come back later, you should
| expect to get XYZ back.
|
| There are lots of weasel words there on purpose. There is
| generally zero guarantee in reality and drives lie all
| the time about data being safely written to disk, even if
| it wasn't actually safely written to disk yet. This means
| on power failure/interruption the outcome of being able
| to read XYZ back is 100% unknown. Drive Manufacturers
| make zero promises here.
|
| On most consumer compute, there are no promises or
| guarantees that what you wrote on day 1 will be there on
| day 2+. It mostly works, and the chances are better than
| even that your data will be mostly safe on day 2+, but
| there are no promises or guarantees. We know how to
| guarantee it, we just don't bother (usually).
|
| You can buy laptops and desktops with ECC RAM and use
| ZFS (or another checksumming FS), but basically nobody does.
| I'm not aware of any mobile phones that offer either
| option.
| crazygringo wrote:
| > _While true, there is zero promises that what you meant
| to save and what gets saved are the same things. All the
| drive mostly promises is that if the drive safely wrote
| XYZ to the disk and you come back later, you should
| expect to get XYZ back._
|
| I'm not really sure what point you're trying to make.
| It's using ECC, so they should be the same.
|
| There isn't infinite reliability, but nothing has
| infinite reliability. File checksums don't provide
| infinite reliability either, because the checksum itself
| can be corrupted.
|
| You keep talking about promises and guarantees, but there
| aren't any. All there is are statistical rates of
| reliability. Even ECC RAM or file checksums don't offer
| perfect guarantees.
|
| For daily consumer use, the level of ECC built into disks
| is generally far more than sufficient.
| o11c wrote:
| All MLC SSDs absolutely do data checksums and error
| recovery, otherwise they would lose your data much more
| often than they do.
|
| You can see some stats using `smartctl`.
| zie wrote:
| Yes, the disk mostly promises what you write there will
| be read back correctly, but that's at the disk level
| only. The OS, Filesystem and Memory generally do no
| checking, so any errors at those levels will propagate.
| We know it happens, we just mostly choose to not do
| anything about it.
|
| My point was, on most consumer compute, there are no
| promises or guarantees that what you see on day 1 will be
| there on day 2. It mostly works, and the chances are
| better than even that your data will be mostly safe on
| day 2, but there are no promises or guarantees, even
| though we know how to do it. Some systems do: those with
| ECC memory and ZFS, for example. Other filesystems also
| support checksumming, BTRFS being the most common
| counter-example to ZFS, even though parts of BTRFS are
| still completely broken (see their status page for
| details).
| amarshall wrote:
| Btrfs and bcachefs both have data checksumming. I think
| ReFS does as well.
| zie wrote:
| Yes, ZFS is not the only filesystem with data
| checksumming and guarantees, but it's one of the very
| rare exceptions that do.
|
| ZFS has been in production workloads since 2005, 20
| years now. It's proven to be very safe.
|
| BTRFS has known fundamental issues past one disk. It is,
| however, improving. I will say BTRFS is fine for a single
| drive. Even the developers, last I checked (a few years
| ago), don't really recommend it past a single drive,
| though hopefully that's changing over time.
|
| I'm not familiar enough with bcachefs to comment.
| ahl wrote:
| Sometimes data on disk and in memory are randomly corrupted.
| For a pretty amazing example, check out
| "bitsquatting"[1]--it's like domain name squatting, but
| instead of typos, you squat on domains that would be looked
| up in the case of random bit flips. These can occur due to,
| e.g., cosmic rays. On-disk, HDDs and SSDs can produce the wrong
| data. It's uncommon to see actual invalid data rather than
| have an IO fail on ECC, but it certainly can happen (e.g. due
| to firmware bugs).
|
| [1]: https://en.wikipedia.org/wiki/Bitsquatting
| terlisimo wrote:
| > ZFS is notorious for corrupting itself when bit flips
|
| That is a notorious myth.
|
| https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...
| ahl wrote:
| It's very amusing that this kind of legend has persisted! ZFS
| is notorious for *noticing* when bits flip, something APFS
| designers claimed was rare given the robustness of Apple
| hardware.[1][2] What would ZFS on iPhone have looked like? Hard
| to know, and that certainly wasn't the design center.
|
| Neither here nor there, but DTrace _was_ ported to iPhone--it
| was shown to me in hushed tones in the back of an auditorium
| once...
|
| [1]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-
| ana...
|
| [2]: https://ahl.dtrace.org/2016/06/19/apfs-part5/#checksums
| ryao wrote:
| I did early ZFSOnLinux development on hardware that did not
| have ECC memory. I once had a situation where a bit flip
| happened in the ARC buffer for libpython.so and all python
| software started crashing. Initially, I thought I had hit
| some sort of bizarre bug in ZFS, so I started debugging. At
| that time, opening a ZFS snapshot would fetch a duplicate
| from disk into a redundant ARC buffer, so while debugging, I
| ran cmp on libpython.so between the live copy and a snapshot
| copy. It showed the exact bit that had flipped. After seeing
| that and convincing myself the bitflip was not actually on
| stable storage, I did a reboot, and all was well. Soon
| afterward, I got a new development machine that had ECC so
| that I would not waste my time chasing phantom bugs caused by
| bit flips.
| Modified3019 wrote:
| ZFS _detects_ corruption.
|
| A very long time ago, someone named cyberjock was a prolific
| and opinionated proponent of ZFS who wrote many things about it
| during a time when the hobbyist community was tiny and not very
| familiar with how to use it or how it worked. Unfortunately,
| some of their most misguided and/or outdated thoughts still
| haunt modern consciousness like an egregore.
|
| What you are probably thinking of is the proposed doomsday
| scenario where bad ram could _theoretically_ kill a ZFS pool
| during a scrub.
|
| This article does a good job of explaining how that might
| happen, and why being concerned about it is tilting at
| windmills: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-
| ram-kill-y...
|
| I have _never once_ heard of this happening in real life.
|
| Hell, I've never even had bad ram. I have had bad sata/sas
| cables, and a bad disk though. ZFS faithfully informed me there
| was a problem, which no other file system would have done. I've
| seen other people that start getting corruption when sata/sas
| controllers go bad or overheat, which again is detected by ZFS.
|
| What actually destroys pools is user error, followed very
| distantly by plain old fashioned ZFS bugs that someone with an
| unlucky edge case ran into.
| tmoertel wrote:
| > Hell, I've never even had bad ram.
|
| To what degree can you separate this claim from "I've never
| _noticed_ RAM failures"?
| wtallis wrote:
| It isn't hard to run memtest on all your computers, and
| that _will_ catch the kind of bad RAM that the
| aforementioned doomsday scenario requires.
| Modified3019 wrote:
| You can take that as meaning "I've never had a noticed
| issue that was detected by extensive ram testing, or solved
| by replacing ram".
|
| I got into overclocking both regular and ECC DDR4 RAM for a
| while when AMD's 1st gen Ryzen stuff came out, thanks to
| ASRock's X399 motherboard unofficially supporting ECC,
| allowing both its function and the reporting of errors
| (produced when overclocking).
|
| Based on my own testing and issues seen from others,
| regular memory has quite a bit of leeway before it becomes
| unstable, and memory that's generating errors tends to
| constantly crash the system, or do so under certain
| workloads.
|
| Of course, without ECC you can't _prove_ every single
| operation has been fault-free, but at some point you call
| it close enough.
|
| I am of the opinion that ECC memory is _the best_ memory to
| overclock, precisely because you can prove stability simply
| by using the system.
|
| All that said, as things become smaller with tighter
| specifications to squeeze out faster performance, I do grow
| more leery of intermittent single errors that occur on the
| order of weeks or months in newer generations of hardware.
| I was once able to overclock my memory to the edge of what
| I thought was stability as it passed all tests for days,
| but about every month or two there'd be a few corrected
| errors show up in my logs. Typically, any sort of instability
| is caught by manual tests within minutes or an hour.
| wtallis wrote:
| To me, the most implausible thing about ZFS-without-ECC
| doomsaying is the presumption that the failure mode of RAM is
| a persistently stuck bit. That's _way_ less common than
| transient errors, and way more likely to be noticed, since it
| will destabilize any piece of software that uses that address
| range. And now that all modern high-density DRAM includes on-
| die ECC, transient data corruption on the link between DRAM
| and CPU seems overwhelmingly more likely than a stuck bit.
| rrdharan wrote:
| Kind of odd that the blog states that "The architect for ZFS at
| Apple had left" and links to the LinkedIn profile of someone who
| doesn't have any Apple work experience listed on their resume. I
| assume the author linked to the wrong profile?
| nikhizzle wrote:
| Ex-Apple File System engineer here who shared an office with
| the other ZFS lead at the time. Can confirm they link to the
| wrong profile for Don Brady.
|
| This is the correct person: https://github.com/don-brady
|
| Also can confirm Don is one of the kindest, nicest principal
| engineer level people I've worked with in my career. Always had
| time to mentor and assist.
| ahl wrote:
| Not sure how I fat-fingered Don's LinkedIn, but I'm updating
| that 9-year-old typo. Agreed that Don is a delight. In the
| years after this article I got to collaborate more with him,
| but left Delphix before he joined to work on ZFS.
| whitepoplar wrote:
| Given your expertise, any chance you can comment on the risk
| of data corruption on APFS given that it only checksums
| metadata?
| nikhizzle wrote:
| I moved out of the kernel in 2008 and never went back, so
| don't have a wise opinion here which would be current.
| smittywerben wrote:
| Thanks for sharing. I was just looking into what happened to Sun.
| I like the second-hand quote comparing IBM and HP to "garbage
| trucks colliding", plus the inclusion of blog posts with links to
| the court filings.
|
| Is it fair to say ZFS made the most sense on Solaris using Solaris
| Containers on SPARC?
| ahl wrote:
| ZFS was developed in Solaris, and at the time we were mostly
| selling SPARC systems. That changed rapidly and the biggest
| commercial push was in the form of the ZFS Storage Appliance
| that our team (known as Fishworks) built at Sun. Those systems
| were based on AMD servers that Sun was making at the time such
| as Thumper [1]. Also in 2016, Ubuntu leaned into the use of
| ZFS for containers [2]. There was nothing that specific about
| Solaris that made sense for ZFS, and even less of a connection
| to the SPARC architecture.
|
| [1]: https://www.theregister.com/2005/11/16/sun_thumper/
|
| [2]: https://ubuntu.com/blog/zfs-is-the-fs-for-containers-in-
| ubun...
| ghaff wrote:
| Yeah, I think if it hadn't been for the combination of Oracle
| and CDDL, Red Hat would have been more interested in it for
| Linux. As it was, they basically went with XFS and volume
| management. Fedora did eventually go with btrfs, but I don't
| know if there are any plans for a copy-on-write FS for RHEL
| at any point.
| m4rtink wrote:
| Fedora Server uses XFS on LVM by default & you can do CoW
| with any modern filesystem on top of an LVM thin pool.
|
| And there is also the Stratis project Red Hat is involved
| in: https://stratis-storage.github.io/
| ghaff wrote:
| It looks like btrfs is/was the default for just Fedora
| Workstation. I'm less connected to Red Hat filesystem
| details than I used to be.
| curt15 wrote:
| TIL Stratis is still alive. I thought it basically went
| on life support after the lead dev left Red Hat.
|
| Still no checksumming though...
| ryao wrote:
| RedHat's policy is no out of tree kernel modules, so it
| would not have made a difference.
| ghaff wrote:
| It's not like Red Hat had/has no influence over what
| makes it into mainline. But the options for copy on write
| were either relatively immature or had license issues in
| their view.
| ryao wrote:
| Their view is that if it is out of tree, they will not
| support it. This supersedes any discussion of license.
| Even out of tree GPL drivers are not supported by RedHat.
| thyristan wrote:
| We had those things at work as fileservers, so no containers
| or anything fancy.
|
| Sun salespeople tried to sell us the idea of "zfs filesystems
| are very cheap, you can create many of them, you don't need
| quota" (which ZFS didn't have at the time), which we tried
| out. It was abysmally slow. It was even slow with just one
| filesystem on it. We scrapped the whole idea, just put Linux
| on them and suddenly fileserver performance doubled. Which is
| something we weren't used to with older Solaris/SPARC/UFS or
| VxFS systems.
|
| We never tried another generation of those, and soon after
| Sun was bought by Oracle anyways.
| kjellsbells wrote:
| I had a combination uh-oh/wow! moment back in those days
| when the hacked up NFS server I built on a Dell with Linux
| and XFS absolutely torched the Solaris and UFS system we'd
| been using for development. Yeah, it wasn't apples to
| apples. Yes, maybe ZFS would have helped. But XFS was
| proven at SGI and it was obvious that the business would
| save thousands overnight by moving to Linux on Dell instead
| of sticking with Sun E450s. That was the death knell for my
| time as a Solaris sysadmin, to be honest.
| thyristan wrote:
| ZFS probably wouldn't have helped. One of my points is,
| ZFS was slower than UFS in our setup. And both were
| slower than Linux on the same hardware.
| ryao wrote:
| > There was nothing that specific about Solaris that made
| sense for ZFS, and even less of a connection to the SPARC
| architecture.
|
| Although it does not change the answer to the original
| question, I have long been under the impression that part of
| the design of ZFS had been influenced by the Niagara
| processor. The heavily threaded ZIO pipeline had been so
| forward thinking that it is difficult to imagine anyone
| devising it unless they were thinking of the future that the
| Niagara processor represented.
|
| Am I correct to think that or did knowledge of the upcoming
| Niagara processor not shape design decisions at all?
|
| By the way, why did Thumper use an AMD Opteron over the
| UltraSPARC T1 (Niagara)? That decision seems contrary to the
| idea of putting all of the wood behind one arrow.
| ahl wrote:
| I don't recall that being the case. Bonwick had been
| thinking about ZFS for at least a couple of years. Matt
| Ahrens joined Sun (with me) in 2001. The Afara acquisition
| didn't close until 2002. Niagara certainly was tantalizing
| but it wasn't a primary design consideration. As I recall,
| AMD was head and shoulders above everything else in terms
| of IO capacity. Sun was never very good (during my tenure
| there) at coordination or holistic strategy.
| bcantrill wrote:
| Niagara did not shape design decisions at all -- remember
| that Niagara was really only doing on a single socket what
| we had already done on large SMP machines (e.g.,
| Starfire/Starcat). What _did_ shape design decisions -- or
| at least informed thinking -- was a belief that all main
| memory would be non-volatile within the lifespan of ZFS.
| (Still possible, of course!) I don't know that there are
| any true artifacts of that within ZFS, but I would say that
| it affected thinking much more than Niagara.
|
| As for Thumper using Opteron over Niagara: that was due to
| many reasons, both technological (Niagara was interesting
| but not world-beating) and organizational (Thumper was a
| result of the acquisition of Kealia, which was
| independently developing on AMD).
| ryao wrote:
| Thanks. I had been unaware of the Starfire/Starcat
| machines.
| smittywerben wrote:
| Thanks. Also, the Thumper looks awesome, like a max-level
| MMORPG character that would kill the level-1 consumer
| Synology NAS character in one hit.
| jFriedensreich wrote:
| The death of ZFS in macOS was a huge shift in the industry. It
| has to be seen in combination with Microsoft killing its hugely
| ambitious WinFS; together, the two felt like the death of
| desktop innovation.
| thyristan wrote:
| Both are imho linked to "offline desktop use cases are not
| important anymore". Both companies saw their future gains
| elsewhere, in internet-related functions and what became known
| as "cloud". No need to have a fancy, featurefull and expensive
| filesystem when it is only to be used as a cache for remote
| cloud stuff.
| em500 wrote:
| Linux or FreeBSD developers are free to adopt ZFS as their
| primary file systems. But it appears that practical benefits
| are not really evident to most users.
| thyristan wrote:
| Lots of ZFS users are enthusiasts who heard about that one
| magic thing that does it all in one tidy box. Whereas
| usually you would have to known all the minutiae of
| LVM/mdadm/cryptsetup/nbd and mkfs.whatever to get to the
| same point. So while ZFS is the nicer-dicer of volume
| management and filesystems, the latter is your whole chef's
| knife set. And while you can dice with both, the user
| groups are not the same. And enthusiasts with the right
| usecases are very few.
|
| And for the thin-provisioned snapshotted subvolume usecase,
| btrfs is currently eating ZFS's lunch due to far better
| Linux integration. Think snapshots at every update, and
| having a/b boot to get back to a known-working config after
| an update. So widespread adoption through the distro route
| is out of the question.
| queenkjuul wrote:
| Ubuntu's ZFS-on-root with zsys auto snapshots have been
| working excellently on my server for 5 years. It
| automatically takes snapshots on every update and adds
| entries to grub so rolling back to the last good state is
| just a reboot away.
| wkat4242 wrote:
| That's called marketing. Give it a snazzy name, like, say,
| "Time Machine", and users will jump on it.
|
| Also, ZFS has a bad name within the Linux community due to
| some licensing stuff. I find that most BSD users don't
| really care about such legalese and most people I know that
| run FreeBSD are running ZFS on root. Which works amazingly
| well I might add.
|
| Especially with something like sanoid added to it, it
| basically does the same as Time Machine on the Mac, a feature
| that users love. Albeit stored on the same drive (but with
| syncoid or just manually rolled zfs send/recv scripts you
| can do that on another location too).
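|
| A minimal hand-rolled send/recv sketch in Python (the pool,
| dataset, and host names are made up; sanoid/syncoid do this far
| more robustly):
|
|     import subprocess, datetime
|
|     DATASET = "tank/home"     # hypothetical local dataset
|     TARGET  = "backup/home"   # hypothetical dataset on the backup box
|     HOST    = "backuphost"    # hypothetical ssh-reachable receiver
|
|     snap = f"{DATASET}@{datetime.date.today():%Y-%m-%d}"
|     subprocess.run(["zfs", "snapshot", snap], check=True)
|
|     # Stream the snapshot into `zfs recv` on the remote side.
|     send = subprocess.Popen(["zfs", "send", snap],
|                             stdout=subprocess.PIPE)
|     subprocess.run(["ssh", HOST, "zfs", "recv", "-F", TARGET],
|                    stdin=send.stdout, check=True)
|     send.stdout.close()
|     send.wait()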
| cherryteastain wrote:
| > ZFS has a bad name within the Linux community due to
| some licensing stuff
|
| This is out of an abundance of caution. Canonical bundle
| ZFS in the Ubuntu kernel and no one sued them (yet).
| wkat4242 wrote:
| True, and I understand the caution considering Oracle is
| involved, which is an awful company to deal with (and
| their takeover of Sun was a disaster).
|
| But really, this is a concern for distros. Not for end
| users. Yet many of the Linux users I speak to are somehow
| worried about this. Most can't even describe the
| provisions of the GPL so I don't really know what that's
| about. Just something they picked up, I guess.
| thyristan wrote:
| Licensing concerns that prevent distros from using ZFS
| will sooner or later also have adverse effects on end
| users. Actually those effects are already there: The
| constant need to adapt a large patchset to the current
| kernel, meaning updates are a hassle. The lack of
| packaging in distributions, meaning updates are a hassle.
| And the lack of integration and related tooling, meaning
| many features can not be used (like a/b boots from
| snapshots after updates) easily, and installers won't
| know about ZFS so you have to install manually.
|
| None of this is a worry about being sued as an end user.
| But all of those are worries that your life will be harder
| with ZFS, and a lot harder as soon as the first lawsuits
| hit anyone, because all the current (small) efforts to
| keep it working will cease immediately.
| ryao wrote:
| Unlike other out of tree filesystems such as Reiser4, the
| ZFS driver does not patch the kernel sources.
| thyristan wrote:
| That is due to licensing reasons, yes. It makes
| maintaining the codebase even more complicated because
| when the kernel module API changes (which it very
| frequently does) you cannot just adapt it to your needs,
| you have to work around all the new changes that are
| there in the new version.
| ryao wrote:
| You have things backward. Licensing has nothing to do
| with it. Changes to the kernel are unnecessary.
| Maintaining the code base is also simplified by
| supporting the various kernel versions the way that they
| are currently supported.
| yjftsjthsd-h wrote:
| > I find that most BSD users don't really care about such
| legalese and most people I know that run FreeBSD are
| running ZFS on root.
|
| I don't think it's that they don't _care_, it's that the
| CDDL and BSD-ish licenses are generally believed to just
| not have the conflict that CDDL and GPL might. (IANAL,
| make your own conclusions about whether either of those
| are true)
| gruturo wrote:
| >I find that most BSD users don't really care about such
| legalese and most people I know that run FreeBSD are
| running ZFS on root.
|
| What a weird take. BSD's license is compatible with ZFS,
| that's why. "Don't really care?" Really? Come on.
| badc0ffee wrote:
| Time Machine was released 17 years ago, and I wish
| Windows had anything that good. And they're on their 3rd
| backup system since then.
| mdaniel wrote:
| > I wish Windows had anything that good
|
| I can't readily tell how much of the dumbness is from the
| filesystem and how much from the kernel, but the end
| result is that until it gets away from its 1980s-style
| file locking there's no prayer. Imagine having to explain
| to your boss that your .docx wasn't backed up because you
| left Word open over the weekend. A just catastrophically
| idiotic design
| m4rtink wrote:
| The ZFS license makes it impossible to include in upstream
| Linux kernel, which makes it much less usable as primary
| filesystem.
| ryao wrote:
| Linux's sign-off policy makes that impossible. Linus
| Torvalds would need Larry Ellison's sign-off before even
| considering it. Linus told me this by email around 2013
| (if I recall correctly) when I emailed him to discuss
| user requests for upstream inclusion. He had no concerns
| about the license being different at the time.
| Gud wrote:
| ZFS is a first class citizen in FreeBSD and has been for at
| least a decade(probably longer). Not at all like in most
| Linux distros.
| toast0 wrote:
| ZFS on FreeBSD is quite nice. System tools like freebsd-
| update integrate well. UFS continues to work as well, and
| may be more appropriate for some use cases where ZFS isn't
| a good fit, copy on write is sometimes very expensive.
|
| Afaik, the FreeBSD position is both ZFS and UFS are fully
| supported and neither is secondary to the other; the
| installer asks what you want from ZFS, UFS, Manual (with a
| menu-based tool), or Shell and you do whatever; in that
| order, so maybe a slight preference towards ZFS.
| lotharcable wrote:
| OpenZFS exists and there is a port of it for Mac OS X.
|
| The problem is that it is still owned by Oracle. And
| Solaris ZFS is incompatible with OpenZFS. Not that people
| really use Solaris anymore.
|
| It is really unfortunate. Linux has adopted filesystems
| from other operating systems before. It is just that nobody
| trusts Oracle.
| 8fingerlouie wrote:
| Exactly this.
|
| The business case for providing a robust desktop filesystem
| simply doesn't exist anymore.
|
| 20 years ago, (regular) people stored their data on computers
| and those needed to be dependable. Phones existed, but not to
| the extent they do today.
|
| Fast forward 20 years, and many people don't even own a
| computer (in the traditional sense, many have consoles).
| People now have their entire life on their phones, backed up
| and/or stored in the cloud.
|
| SSDs also became "large enough" that HDDs are mostly a thing
| of the past in consumer computers.
|
| Instead you today have high reliability hardware and software
| in the cloud, which arguably is much more resilient than
| anything you could reasonably cook up at home. Besides the
| hardware (power, internet, fire suppression, physical
| security, etc.), you're also typically looking at
| geographic redundancy across multiple data centers using
| Reed-Solomon erasure coding, but that's nothing the ordinary
| user needs to know about.
|
| Most cloud services also offer some kind of snapshot
| functionality as malware protection (e.g. OneDrive offers
| unlimited snapshots on a rolling 30-day basis).
|
| Truth is that most people are way better off just storing
| their data in the cloud and making a backup at home, though
| many people seem to ignore the latter, and Apple makes it
| exceptionally hard to automate.
| ryao wrote:
| What do you do when you discover that some thing you have
| not touched in a long time, but suddenly need, is corrupted
| and all of your backups are corrupt because the corruption
| happened prior to your 30 day window at OneDrive?
|
| You would have early warning with ZFS. You have data loss
| with your plan.
| sho_hn wrote:
| Workstation use cases exist. Data archival is not the only
| application of file systems.
| GeekyBear wrote:
| Internet connections of the day didn't yet offer enough speed
| for cloud storage.
|
| Apple was already working to integrate ZFS when Oracle bought
| Sun.
|
| From TFA:
|
| > ZFS was featured in the keynotes, it was on the developer
| disc handed out to attendees, and it was even mentioned on
| the Mac OS X Server website. Apple had been working on its
| port since 2006 and now it was functional enough to be put on
| full display.
|
| However, once Oracle bought Sun, the deal was off.
|
| Again from TFA:
|
| > The Apple-ZFS deal was brought for Larry Ellison's
| approval, the first-born child of the conquered land brought
| to be blessed by the new king. "I'll tell you about doing
| business with my best friend Steve Jobs," he apparently said,
| "I don't do business with my best friend Steve Jobs."
|
| And that was the end.
| mixmastamyk wrote:
| Was it not open source at that point?
| ahl wrote:
| It was! And Apple seemed fine with including DTrace under
| the CDDL. I'm not sure why Apple wanted some additional
| arrangement but they did.
| wenc wrote:
| I remember eagerly anticipating ZFS for desktop hard disks. I
| seem to remember it never took off because memory
| requirements were too high and payoffs were insufficient to
| justify the trade off.
| deburo wrote:
| APFS? That still happened.
| ahl wrote:
| Back in 2016, Ars Technica picked up this piece from my blog [1]
| as well as a longer piece reviewing the newly announced APFS [2]
| [3]. Glad it's still finding an audience!
|
| [1]: https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-
| ap...
|
| [2]: https://ahl.dtrace.org/2016/06/19/apfs-part1/
|
| [3]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-
| ana...
| throw0101b wrote:
| Apple and Sun couldn't agree on a 'support contract'. From Jeff
| Bonwick, one of the co-creators of ZFS:
|
| >> _Apple can currently just take the ZFS CDDL code and
| incorporate it (like they did with DTrace), but it may be that
| they wanted a "private license" from Sun (with appropriate
| technical support and indemnification), and the two entities
| couldn't come to mutually agreeable terms._
|
| > _I cannot disclose details, but that is the essence of it._
|
| * https://archive.is/http://mail.opensolaris.org/pipermail/zfs...
|
| Apple took DTrace, licensed via CDDL--just like ZFS--and put it
| into the kernel without issue. Of course a file system is much
| more central to an operating system, so they wanted much more of
| a CYA for that.
| secabeen wrote:
| ZFS remains an excellent filesystem for bulk storage on rust, but
| were I Apple at the time, I would probably want to focus on
| something built for the coming era of flash and NVMe storage.
| There are a number of axioms built into ZFS that come out of the
| spinning disk era that still hold it back on flash-only
| storage.
| ahl wrote:
| Certainly one would build something different starting in 2025
| rather than 2001, but do you have specific examples of how
| ZFS's design holds it back? I think it has been adapted
| extremely well for the changing ecosystem.
| whartung wrote:
| As a desktop user, I am content with APFS. The only feature from
| ZFS that I would like is the corruption detection. I honestly
| don't know how robust the image and video formats are to bit
| corruption. On the one hand, potentially, "very" robust. But on
| the other, I would think that there are some very special bits
| that if toggled can potentially "ruin" the entire file. But I
| don't know.
|
| However, I can say, every time I've tried ZFS on my iMac, it was
| simply a disaster.
|
| Just trying to set it up on a single USB drive, or setting it up
| to mirror a pair. The net effect was that it CRUSHED the
| performance on my machine. It became unusable. We're talking
| "move the mouse, watch the pointer crawl behind" unusable. "Let's
| type at 300 baud" unusable. Interactive performance was shot.
|
| After I remove it, all is right again.
| ryao wrote:
| > I honestly don't know how robust the image and video formats
| are to bit corruption.
|
| It depends on the format. A BMP image format would limit the
| damage to 1 pixel, while a JPEG could propagate the damage to
| potentially the entire image. There is an example of a bitflip
| damaging a picture here:
|
| https://arstechnica.com/information-technology/2014/01/bitro...
|
| That single bit flip ruined about half of the image.
|
| As for video, that depends on how far apart I-frames are. Any
| damage from a bit flip would likely be isolated to the section
| of video from the bitflip until the next I-frame occurs. As for
| how bad it could be, it depends on how the encoding works.
|
| > On the one hand, potentially, "very" robust.
|
| Only in uncompressed files.
|
| > But on the other, I would think that there are some very
| special bits that if toggled can potentially "ruin" the entire
| file. But I don't know.
|
| The way that image compression works means that a single bit
| flip prior to decompression can affect a great many pixels, as
| shown at Ars Technica.
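|
| A quick way to see this for yourself (you supply the test image;
| the filenames here are just placeholders): copy the file, flip
| one random bit, and open both copies in a viewer.
|
|     import shutil, random
|
|     src, dst = "photo.jpg", "photo_flipped.jpg"
|     shutil.copyfile(src, dst)
|
|     with open(dst, "r+b") as f:
|         data = bytearray(f.read())
|         i = random.randrange(len(data))
|         data[i] ^= 1 << random.randrange(8)   # flip one random bit
|         f.seek(0)
|         f.write(data)
|
|     print(f"flipped one bit at byte offset {i}")
|     # A BMP usually shows a single wrong pixel; a JPEG can smear
|     # the damage across a large region or fail to decode at all.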
|
| > However, I can say, every time I've tried ZFS on my iMac, it
| was simply a disaster.
|
| Did you file an issue? I am not sure what the current status of
| the macOS driver's production readiness is, but it will be
| difficult to see it improve if people do not report issues that
| they have.
| mdaniel wrote:
| > Just trying to set it up on a single USB drive
|
| That's the fault of macOS; I also experienced 100% CPU and load
| off the charts, and it was kernel_task jammed up by USB. Once I
| used a Thunderbolt enclosure it started to be sane. This
| experience was the same across multiple non-Apple filesystems,
| as I was trying a bunch to see which one was the best at cross-
| OS compatibility.
|
| Also, separately, ZFS says "don't run ZFS on USB". I didn't
| have problems with it, but I knew I was rolling the dice
| queenkjuul wrote:
| Yeah they do say that but anecdotally my Plex server has been
| ZFS over USB 3 since 2020 with zero problems (using Ubuntu
| 20.04)
|
| Anyway only bringing it up to reinforce that it is probably a
| macOS problem.
| ewuhic wrote:
| What's the current state of ZFS on Macos? As far as I'm aware
| there's a supported fork.
| ein0p wrote:
| ZFS sort of moved inside the NVMe controller - it also checksums
| and scrubs things all the time, you just don't see it. This does
| not, however, support multi-device redundant storage, but that is
| not a concern for Apple - the vast majority of their devices have
| only one storage device.
___________________________________________________________________
(page generated 2025-04-27 23:01 UTC)