[HN Gopher] ZFS 2.2.0 (RC): Block Cloning merged
___________________________________________________________________
ZFS 2.2.0 (RC): Block Cloning merged
Author : turrini
Score : 176 points
Date : 2023-07-04 15:46 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| miohtama wrote:
| What are applications that benefit from block cloning?
| aardvark179 wrote:
| It can be a really convenient way to snapshot something if you
| can arrange some point at which everything is synced to disk.
| Get to that point, make your new files that start sharing all
| their blocks, and then let your main db process (or whatever)
| continue on as normal.
| ikiris wrote:
| I think the big piece is native overlayfs so k8s setups get a
| bit simpler.
| philsnow wrote:
| It seems kind of like hard linking but with copy-on-write for
| the underlying data, so you'll get near-instant file copies and
| writing into the middle of them will also be near-instant.
|
| All of this happens under the covers already if you have dedup
| turned on, but this allows utilities (gnu cp might be taught to
| opportunistically and transparently use the new clone zfs
| syscalls, because there is no downside and only upside) and
| applications to tell zfs that "these blocks are going to be the
| same as those" without zfs needing to hash all the new blocks
| and compare them.
|
| Additionally, for finer control, ranges of blocks can be
| cloned, not just entire files.
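|
| A rough sketch of what that looks like from userspace once the
| copy path is wired up (paths are hypothetical; at the time of
| this RC the Linux side isn't hooked up yet):
|     # write a large test file on a dataset with block cloning
|     dd if=/dev/urandom of=/tank/ds/big.img bs=1M count=1024
|     # 'copy' it by cloning: only new metadata is written
|     cp --reflink=always /tank/ds/big.img /tank/ds/big.clone
|     # logicalused roughly doubles, allocated space barely moves
|     zfs get -p used,logicalused tank/ds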
|
| I can't tell from the github issue, can this manual dedup /
| block cloning be turned on if you're not already using dedup on
| a dataset? Last time I set up zfs, I was warned that dedup took
| gobs of memory, so I didn't turn it on.
| rincebrain wrote:
| It's orthogonal to dedup being on or off, and as someone else
| said, it's more or less the same underlying semantics you
| would expect from cp --reflink anywhere.
|
| Also, as mentioned, on Linux, it's not wired up with any
| interface to be used at all right now.
| nabla9 wrote:
| Gnu cp --reflink.
|
| > "When --reflink[=always] is specified, perform a lightweight
| copy, where the data blocks are copied only when modified. If
| this is not possible the copy fails, or if --reflink=auto is
| specified, fall back to a standard copy. Use --reflink=never
| to ensure a standard copy is performed."
| danudey wrote:
| As others have said: block cloning (the mechanism behind
| copy-on-write file copies, a.k.a. reflinks) allows you to
| 'copy' a file without reading all of the data and re-writing
| it.
|
| For example, if you have a 1 GB file and you want to make a
| copy of it, you need to read the whole file (all at once or in
| parts) and then write the whole new file (all at once or in
| parts). This results in 1 GB of reads and 1 GB of writes.
| Obviously the slower (or more overloaded) your storage media
| is, the longer this takes.
|
| With block cloning, you simply tell the OS "I want this file A
| to be a copy of this file B" and it creates a new "file" that
| references all the blocks in the old "file". Given that a
| "file" on a filesystem is just a list of blocks that make up
| the data in that file, you can create a new "file" which has
| pointers to the same blocks as the old "file". This is a simple
| system call (or a few system calls), and as such isn't much
| more intensive than simply renaming a file instead of copying
| it.
|
| At my previous job we did builds for our software. This
| required building the BIOS, kernel, userspace, generating the
| UI, and so on. These builds required pulling down 10+ GB of git
| repositories (the git data itself, the checkout, the LFS binary
| files, external vendor SDKs), and then a large amount of build
| artifacts on top of that. We also needed to do this build for
| 80-100 different product models, for both release and debug
| versions. This meant 200+ copies of the source code alone (not
| to mention build artifacts and intermediate products), and
| because of disk space limitations this meant we had to
| dramatically reduce the number of concurrent builds we could
| run. The solution we came up with was something like:
|
| 1. Check out the source code
|
| 2. Create an overlayfs filesystem to mount into each build
| space
|
| 3. Do the build
|
| 4. Tear down the overlayfs filesystem
|
| This was problematic if we weren't able to mount the
| filesystem, if we weren't able to unmount the filesystem
| (because of hanging file descriptors or processes), and so on.
| Lots of moving parts, lots of `sudo` commands in the scripts,
| and so on.
|
| Copy-on-write would have solved this for us by accomplishing
| the same thing; we could simply do the following:
|
| 1. Check out the source code
|
| 2. Have each build process simply `cp -R --reflink=always
| source/ build_root/`; this would be near-instantaneous and use
| essentially no new disk space.
|
| 3. Do the build
|
| 4. `rm -rf build_root`
|
| Fewer moving parts, no root access required, generally simpler
| all around.
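|
| A minimal sketch of that second flow (repo URL, paths, and make
| targets are hypothetical; assumes a filesystem and cp that
| support --reflink):
|     git clone https://example.com/firmware.git source/
|     for model in modelA modelB modelC; do
|         # near-instant; shares all data blocks with source/
|         cp -R --reflink=always source/ "build_$model"
|         ( cd "build_$model" && make "PRODUCT=$model" )
|     done
|     rm -rf build_model*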
| thrill wrote:
| FTFA: "Block Cloning allows to clone a file (or a subset of its
| blocks) into another (or the same) file by just creating
| additional references to the data blocks without copying the
| data itself. Block Cloning can be described as a fast, manual
| deduplication."
| the8472 wrote:
| Any copy command. On-demand deduplication managed by userspace.
|
| https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2....
| https://man7.org/linux/man-pages/man2/copy_file_range.2.html
| https://github.com/markfasheh/duperemove
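|
| A sketch of the userspace-managed flavour, using duperemove
| (flags as documented in its README; paths are placeholders):
|     # hash files under /data, then ask the kernel to share
|     # identical extents via FIDEDUPERANGE
|     duperemove -dr --hashfile=/var/tmp/dupes.db /data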
| vovin wrote:
| This is huge. One practical application is fast recovery of a
| file from a past snapshot without using any additional space. I
| use a ZFS dataset for my vCenter datastore (storing my vmdk
| files). If I need to launch a clone from a past state, block
| cloning lets me bring back the old vmdk file without actually
| copying it, which saves both the space and the time needed to
| make such a clone.
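|
| A sketch of that recovery (dataset, snapshot, and file names are
| hypothetical; assumes the clone path is wired up for your
| platform and tools):
|     # make snapshots browsable under .zfs/snapshot
|     zfs set snapdir=visible tank/vmstore
|     # bring back yesterday's vmdk without rewriting its blocks
|     cp --reflink=always \
|       /tank/vmstore/.zfs/snapshot/daily-0703/vm01.vmdk \
|       /tank/vmstore/vm01-restored.vmdk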
| bithavoc wrote:
| Can you elaborate a bit more on how you use ZFS with vCenter?
| How do you mount it?
| mustache_kimono wrote:
| Excited, because in addition to ref copies/clones, httm will
| use this feature, if available (I've already done some work to
| implement it), for its `--roll-forward` operation and for
| faster file recoveries from snapshots [0].
|
| As I understand it, there will be no need to copy any data from
| the same dataset, and _this includes all snapshots_. Blocks
| written to the live dataset can just be references to the
| underlying blocks, and no additional space will need to be used.
|
| Imagine being able to continuously switch a file or a dataset
| back to a previous state extremely quickly without a heavy
| weight clone, or a rollback, etc.
|
| Right now, httm simply diff copies the blocks for file recovery
| and roll-forward. For further details, see the man page entry
| for `--roll-forward`, and the link to the httm GitHub below:
|     --roll-forward="snap_name"
|         Traditionally 'zfs rollback' is a destructive operation,
|         whereas httm roll-forward is non-destructive. httm will
|         copy only the blocks and file metadata that have changed
|         since a specified snapshot, from that snapshot, to its
|         live dataset. httm will also take two precautionary
|         snapshots, one before and one after the copy. Should the
|         roll forward fail for any reason, httm will roll back to
|         the pre-execution state. Note: This is a ZFS-only option
|         which requires super user privileges.
|
| [0]: https://github.com/kimono-koans/httm
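|
| A usage sketch (dataset and snapshot names are hypothetical; see
| the httm docs for exact semantics):
|     # snapshot a known-good state, then later roll the live
|     # dataset forward to it without a destructive rollback
|     sudo zfs snapshot rpool/home@known-good
|     sudo httm --roll-forward="rpool/home@known-good"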
| rossmohax wrote:
| Does ZFS or any other FS offer special operations which DB engine
| like RocksDB, SQLite or PostgreSQL could benefit from if they
| decided to target that FS specifically?
| magicalhippo wrote:
| Internally, ZFS is kinda like an object store[1], and there was
| a project trying to expose the ZFS internals through an object
| store API rather than through a filesystem API.
|
| Sadly I can't seem to find the presentation or recall the name
| of the project.
|
| On the other hand, looking at for example RocksDB[2]:
|
| _File system operations are not atomic, and are susceptible to
| inconsistencies in the event of system failure. Even with
| journaling turned on, file systems do not guarantee consistency
| on unclean restart. POSIX file system does not support atomic
| batching of operations either. Hence, it is not possible to
| rely on metadata embedded in RocksDB datastore files to
| reconstruct the last consistent state of the RocksDB on
| restart. RocksDB has a built-in mechanism to overcome these
| limitations of POSIX file system [...]_
|
| ZFS _does_ provide atomic operations internally[1], so if
| exposed it seems something like RocksDB could take advantage of
| that and forego all the complexity mentioned above.
|
| How much that would help I don't know though, but seems
| potentially interesting at first glance.
|
| [1]: https://youtu.be/MsY-BafQgj4?t=442
|
| [2]: https://github.com/facebook/rocksdb/wiki/MANIFEST
| ludde wrote:
| A M A Z I N G
|
| Have been looking forward to this for years!
|
| This is so much better than automatically doing dedup and the RAM
| overhead that entails.
|
| Doing dedup as an offline/on-demand operation rather than as an
| always-in-memory table seems like a really good optimization
| path, in the spirit of paying only for what you use and not the
| rest.
|
| Edit: What's the RAM overhead of this? Is it ~64B per 128kB
| deduped block or what's the magnitude of things?
| mlyle wrote:
| > Edit: What's the RAM overhead of this? Is it ~64B per 128kB
| deduped block or what's the magnitude of things?
|
| No real memory impact. There's a regions table that uses 128k
| of memory per terabyte of total storage (and may be a bit more
| in the future). So for your 10 petabyte pool using deduping,
| you'd better have an extra gigabyte of RAM.
|
| But freeing files can potentially be twice as expensive in
| IOPS, even if the blocks were never cloned; the implementation
| tries to mitigate this.
| uvatbc wrote:
| Technically, yes: through the use of TrueNAS, which gives us
| API access to iSCSI on ZFS.
| GauntletWizard wrote:
| > Note: currently it is not possible to clone blocks between
| encrypted datasets, even if those datasets use the same
| encryption key (this includes snapshots of encrypted datasets).
| Cloning blocks between datasets that use the same keys should be
| possible and should be implemented in the future.
|
| Once this is ready, I am going to subdivide my user homedir much
| more than it already is. The biggest obstacle in the way of this
| has been that it would waste a bunch of space until the snapshots
| were done rolling over, which for me is a long time (I keep
| weekly snapshots of my homedir for a year).
| yjftsjthsd-h wrote:
| Is there a benefit to breaking up your home directory?
| GauntletWizard wrote:
| Controlling the rate and location of snapshots, mostly. I've
| broken out some kinds of datasets (video archives) but not
| others historically (music). It doesn't matter that much, but
| I want to split some more chunks out.
| yjftsjthsd-h wrote:
| Fair enough. I've personally slowly moved to a smaller
| number of filesystems, but if you're actually handling
| snapshots differently per-area then it makes sense (indeed,
| one of the reasons I'm consolidating is the realization
| that _personally_ I'm almost never going to
| snapshot/restore things separately).
| someplaceguy wrote:
| Not sure why you'd want to do that to your home directory
| usually, but it depends on what you store in it and how you
| use it, really.
|
| In general, breaking up a filesystem into multiple ones in
| ZFS is mostly useful for making filesystem management more
| fine-grained, as a filesystem/dataset in ZFS is the unit of
| management for most properties and operations (snapshots,
| clones, compression and checksum algorithms, quotas,
| encryption, dedup, send/recv, ditto copies, etc) as well as
| their inheritance and space accounting.
|
| In terms of filesystem management, there aren't many
| downsides to breaking up a filesystem (within reason), as
| most properties and the most common operations can be shared
| between all sub-filesystems if they are part of the same
| inherited tree (which doesn't necessarily have to correspond
| to the mountpoint tree!).
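|
| A small sketch of that per-dataset management and inheritance
| (pool and dataset names are hypothetical):
|     zfs create tank/home
|     zfs set compression=zstd tank/home    # set once on the parent
|     zfs create tank/home/projects         # children inherit it
|     zfs create tank/home/media
|     zfs get -r compression tank/home      # shows 'inherited from'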
|
| As far as I know, the major downsides by far were that 1) you
| couldn't quickly move a file from one dataset to another,
| i.e. `mv` would be forced to do a full copy of the file
| contents rather than just do a cheap rename, and 2) in terms
| of disk space, moving a file between filesystems would be
| equivalent to copying the file and deleting the original,
| which could be terrible if you use snapshots as it would lead
| to an additional space consumption of a full new file's worth
| of disk space.
|
| In principle, both of these downsides should be fixed with
| this new block cloning feature and AFAIU the only tradeoffs
| would be some amount of increased overhead when freeing data
| (which should be zero overhead if you don't have many of
| these cloned blocks being shared anymore), and the low
| maturity of this code (i.e. higher chance of running into
| bugs) due to being so new.
| dark-star wrote:
| Wow, I was under the impression that this had long been
| implemented already (as it's already in btrfs and various
| commercial file systems).
|
| Awesome!
| mgerdts wrote:
| It has been in the Solaris version of zfs for a long time as
| well. This came a few years after the Oracle-imposed fork.
|
| https://blogs.oracle.com/solaris/post/reflink3c-what-is-it-w...
| Pxtl wrote:
| Filesystem-level de-duplication is scary as hell as a concept,
| but also sounds amazing, especially doing it at copy-time so you
| don't have to opt-in to scanning to deduplicate. Is this common
| in filesystems? Or is ZFS striking out new ground here? I'm not
| really an under-the-OS-hood kinda guy.
| yjftsjthsd-h wrote:
| > Filesystem-level de-duplication is scary as hell as a concept
|
| What's scary about it? You have to track references, but it
| doesn't seem _that_ hard compared to everything else going on
| in ZFS et al.
|
| > Is this common in filesystems? Or is ZFS striking out new
| ground here?
|
| At least BTRFS does approximately the same.
| cesarb wrote:
| > > Filesystem-level de-duplication is scary as hell as a
| concept
|
| > What's scary about it?
|
| It's scary because there's only one copy when you might have
| expected two. A single bad block could lose both "copies" at
| once.
| phpisthebest wrote:
| 3-2-1..
|
| 3 Copies
|
| 2 Media
|
| 1 offsite.
|
| If you follow that then you would have no fear of data
| loss. if you are putting 2 copies on the same filesystem
| you are already doing backups wrongs
| magicalhippo wrote:
| Besides what all the others have mentioned, you can force
| ZFS to keep up to 3 copies of data blocks on a dataset. ZFS
| uses this internally for important metadata and will try to
| spread them around to maximize the chance of recovery,
| though don't rely on this feature alone for redundancy.
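|
| The user-facing knob is the per-dataset 'copies' property
| (sketch only; it multiplies the space used on that dataset and
| is no substitute for real redundancy):
|     zfs set copies=2 tank/important
|     zfs get copies tank/important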
| Filligree wrote:
| Disks die all the time anyway. If you want to keep your
| data, you should have at least two-disk redundancy. In
| which case bad blocks won't kill anything.
| ForkMeOnTinder wrote:
| Copying a file isn't great protection against bad blocks.
| Modern SSDs, when they fail, tend to fail catastrophically
| (the whole device dies all at once), rather than dying one
| block at a time. If you care about the data, back it up on
| a separate piece of hardware.
| grepfru_it wrote:
| The file system metadata is redundant and on a correctly
| configured ZFS system your error correction is isolated and
| can be redundant as well
| Pxtl wrote:
| > What's scary about it?
|
| Just that I'm trusting the OS to re-duplicate it at block
| level on file write. The idea that block by block you've got
| "okay, this block is shared by files XYZ, this next block is
| unique to file Z, then the next block is back to XYZ... oh
| we're editing that one? Then it's a new block that's now
| unique to file Z too".
|
| I guess I'm not used to trusting filesystems to do anything
| but dumb write and read. I know they abstract away a crapload
| of amazing complexity in reality, I'm just used to thinking
| of them as dumb bags of bits.
| Dylan16807 wrote:
| If you're on ZFS you're probably using snapshots, so all
| that work is already happening.
| wrs wrote:
| macOS and BTRFS have had it for several years. In fact I
| believe it's the default behavior when copying a file in macOS
| using the Finder (you have to specify `cp -c` in the shell).
| nijave wrote:
| Windows Server has had dedupe for at least 10 years, too
| lockhouse wrote:
| Anyone here using ZFS in production these days? If so what OS and
| implementation? What have been your experiences or gotchas you
| experienced?
| Drybones wrote:
| We use ZFS on every server we deploy
|
| We typically use Proxmox. It's a convenient node-host setup,
| usually ships a very up-to-date ZFS, and it's stable.
|
| I just wouldn't use the Proxmox web UI for ZFS configuration;
| it doesn't expose the up-to-date options. Always configure ZFS
| on the CLI.
| shrubble wrote:
| The gotcha on Proxmox is that you can't do swapfiles on ZFS, so
| if your swap isn't made big enough when installing and you
| format everything as ZFS you have to live with it or do odd
| workarounds.
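|
| One of those odd workarounds is a swap zvol, roughly the old
| ZoL FAQ recipe (sketch only; swap on zvols has known deadlock
| caveats under memory pressure, so treat with care):
|     zfs create -V 8G -b $(getconf PAGESIZE) \
|         -o logbias=throughput -o sync=always \
|         -o primarycache=metadata \
|         -o com.sun:auto-snapshot=false rpool/swap
|     mkswap /dev/zvol/rpool/swap
|     swapon /dev/zvol/rpool/swap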
| trws wrote:
| I'm not a filesystem admin, but we at LLNL use OpenZFS as the
| storage layer for all of our Lustre file systems in production,
| including raid-z for resilience in each pool (on the order of
| 100 disks each), and have for most of a decade. That, combined
| with improvements in Lustre, has taken the rate of data loss,
| or of needing to clear large-scale shared file systems, down to
| nearly zero. There's a reason we spend as many engineer hours
| as we do maintaining it: it's worth it.
|
| LLNL openzfs project:
| https://computing.llnl.gov/projects/openzfs Old presentation
| from intel with info on what was one of our bigger deployments
| in 2016 (~50pb):
| https://www.intel.com/content/dam/www/public/us/en/documents...
| muxator wrote:
| If I'm not mistaken the linux port of ZFS that later became
| OpenZFS started at LLNL and was a port from FreeBSD (it may
| have been release ~9).
|
| I believe it was called ZFS On Linux or something like that.
|
| Nice how things have evolved: from FreeBSD to linux and back.
| In my mind this has always been a very inspiring example of a
| public institution working for the public good.
| rincebrain wrote:
| FreeBSD had its own ZFS port.
|
| ZoL, if my ancient memory serves, was at LLNL, not based on
| the FreeBSD port (if you go _very_ far back in the commit
| history you can see Brian rebasing against OpenSolaris
| revisions), but like 2 or 3 different orgs originally
| announced Linux ports at the same time and then all pooled
| together, since originally only one of the three was going
| to have a POSIX layer (the other two didn't need a working
| POSIX filesystem layer). (I'm not actually sure how much
| came of this collaboration, I just remember being very
| amused when within the span of a week or two, three
| different orgs announced ports, looked at each other, and
| went "...wait.")
|
| Then for a while people developed on either the FreeBSD
| port, the illumos fork called OpenZFS, or the Linux port,
| but because (among other reasons) a bunch of development
| kept happening on the Linux port, it became the defacto
| upstream and got renamed "OpenZFS", and then FreeBSD more
| or less got a fresh port from the OpenZFS codebase that is
| now what it's based on.
|
| The macOS port got a fresh sync against that codebase
| recently and is slowly trying to merge in, and then from
| there, ???
| justinclift wrote:
| TrueNAS (www.truenas.com) uses ZFS for the storage layer across
| its product range (storage software).
|
| They have both FreeBSD and Linux based stuff, targeting
| different use cases.
| quags wrote:
| I have been using ZFS for years, since Ubuntu 18. Easy
| snapshots, monitoring, a choice of raid levels, and the ability
| to very easily copy a dataset remotely with resume or
| incremental support are awesome. I mainly use it for KVM
| systems, each with their own dataset. Coming from mdadm + LVM
| in my previous setup, it is night and day for doing snapshots
| and backups. I do not use ZFS on root for Ubuntu; instead I do
| a software RAID1 setup for the OS and then a ZFS setup on other
| disks - ZFS on root was the only gotcha. On FreeBSD, ZFS on
| root works fine.
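|
| The remote-copy part looks roughly like this (host and dataset
| names are hypothetical):
|     # initial full send; -s on receive keeps resumable state
|     zfs snapshot tank/vm/web@base
|     zfs send tank/vm/web@base | ssh backup zfs receive -s pool/web
|     # later: incremental send of changes since @base
|     zfs snapshot tank/vm/web@daily1
|     zfs send -i @base tank/vm/web@daily1 | \
|         ssh backup zfs receive pool/web
|     # if interrupted, grab the resume token and continue
|     ssh backup zfs get -H -o value receive_resume_token pool/web
|     zfs send -t <token-from-above> | \
|         ssh backup zfs receive -s pool/web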
| noaheverett wrote:
| Running in production for about 3 years with Ubuntu 20.04 / zfs
| 0.8.3. ZFS is being used as the datastore for a cluster of
| LXD/LXC instances over multiple physical hosts. I have the OS
| setup on its own dedicated drive and ZFS striped/cloned over 4
| NVMe drives.
|
| No gotchas / issues, works well, easy to setup.
|
| I am looking forward to the Direct IO speed improvements for
| NVMe drives with https://github.com/openzfs/zfs/pull/10018
|
| edit: one thing I forgot to mention is, when creating your pool
| make sure to import your drives by ID (zpool import -d
| /dev/disk/by-id/ <poolname>) instead of name in case name
| assignments change somehow [1]
|
| [1] https://superuser.com/questions/1732532/zfs-disk-drive-
| lette...
| throw0101c wrote:
| > _I am looking forward to the Direct IO speed improvements
| for NVMe drives with
| https://github.com/openzfs/zfs/pull/10018_
|
| See also "Scaling ZFS for NVMe" by Allan Jude at EuroBSDcon
| 2022:
|
| * https://www.youtube.com/watch?v=v8sl8gj9UnA
| noaheverett wrote:
| Sweet, I appreciate the link!
| gigatexal wrote:
| Any of the enterprise customers of klara Systems are likely ZFS
| production folks.
|
| https://klarasystems.com/?amp
| drewg123 wrote:
| We use ZFS in production for the non-content filesystems on a
| large, ever increasing, percentage of our Netflix Open Connect
| CDN nodes, replacing geom's gmirror. We had one gotcha (caught
| on a very limited set of canaries) where a buggy bootloader
| crashed part-way through boot, leaving the ZFS "bootonce" stuff
| in a funky state requiring manual recovery (the nodes with
| gmirror were fine, and fell back to the old image without
| fuss). This has since been fixed.
|
| Note that we _do not_ use ZFS for content, since it is
| incompatible with efficient use of sendfile (both because there
| is no async handler for ZFS, so no async sendfile, and because
| the ARC is not integrated with the page cache, so content would
| require an extra memory copy to be served).
| lifty wrote:
| what do you use for the content filesystem?
| drewg123 wrote:
| FreeBSD's UFS
| throw0101c wrote:
| Any use of boot environments for easy(er?) rollbacks of OS
| updates?
| drewg123 wrote:
| Yes. That's the bootonce thing I was talking about. When we
| update the OS, we set the "bootonce" flag via bectl
| activate -t to ensure we fall back to the previous BE if
| the current BE is borked and not bootable. This is the same
| functionality we had by keeping a primary and secondary
| root partition in geom a and toggling the bootable
| partition via the bootonce flag in gpart.
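|
| For anyone unfamiliar, the FreeBSD side looks roughly like this
| (BE name is hypothetical; the exact install step varies):
|     bectl create 14.0-p1-update        # clone the current BE
|     # ...install the update into that BE...
|     bectl activate -t 14.0-p1-update   # boot it exactly once
|     # if the new BE comes up healthy, make it permanent:
|     bectl activate 14.0-p1-update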
| bakul wrote:
| Is integrating ARC with the page cache a lost cause? If not,
| may be Netflix can fund it!
| postmodest wrote:
| I would expect page cache to be one of those platform-
| dependent things that prevent OpenZFS from doing that.
| Especially on Linux, and especially because AFAIK the Linux
| version has the most eyes on it.
| rincebrain wrote:
| My understanding, not having dug into it, is that it's
| possible but just work nobody has done yet, though I'm
| not sure what the relevant interfaces are in the Linux
| kernel.
|
| One thing that makes the interfaces in Linux much messier
| than the FreeBSD ones is that a lot of the core
| functionality you might like to leverage in Linux
| (workqueues, basically anything more complicated than
| just calling kmalloc, and of course any SIMD save/restore
| state, to name three examples I've stumbled over
| recently) are marked EXPORT_SYMBOL_GPL or just entirely
| not exported in newer releases, so you get to reimplement
| the wheel for those, whereas on FreeBSD it's trivial to
| just use their implementations of such things and shim
| them to the Solaris-ish interfaces the non-platform-
| specific code expects.
|
| So that makes the Linux-specific code a lot heavier,
| because upstream is actively hostile.
| Dylan16807 wrote:
| > SIMD save/restore state
|
| I wish someone would come in and convince the kernel devs
| that "hey, if you want EXPORT_SYMBOL_GPL to have legal
| weight in a copyleft sense then you can't just slap it
| onto interfaces for political reasons"
| rincebrain wrote:
| I don't think they care about it having legal weight,
| that ship sailed long ago when they started advocating
| for just slapping SYMBOL_GPL on things out of spite; I
| think they care about excluding people from using their
| software.
|
| IMO Linus should stop being half-and-half about it and
| either mark everything SYMBOL_GPL and see how well that
| goes or stop this nonsense.
| Filligree wrote:
| I just don't understand why they're so anti-ZFS. I want
| my data to survive, please...
| rincebrain wrote:
| My impression is that some of the Linux kernel devs are
| anti-anything that's not GPL-compatible, of any sort,
| regardless of the particulars.
|
| Linus himself also made remarks about ZFS at one point
| that were pretty...hostile. [1] [2]
|
| > The fact is, the whole point of the GPL is that you're
| being "paid" in terms of tit-for-tat: we give source code
| to you for free, but we want source code improvements
| back. If you don't do that but instead say "I think this
| is _legal_, but I'm not going to help you" you certainly
| don't get any help from us.
|
| > So things that are outside the kernel tree simply do
| not matter to us. They get absolutely zero attention. We
| simply don't care. It's that simple.
|
| > And things that don't do that "give back" have no
| business talking about us being assholes when we don't
| care about them.
|
| > See?
|
| Note that there's at least one unfixed Linux kernel bug
| that was found by OpenZFS users, reproducible without
| using OpenZFS in any way, reported with a patch, and
| ignored. [3]
|
| So "not giving back" is a dubious claim.
|
| [1] - https://arstechnica.com/gadgets/2020/01/linus-
| torvalds-zfs-s...
|
| [2] - https://www.realworldtech.com/forum/?threadid=18971
| 1&curpost...
|
| [3] - https://bugzilla.kernel.org/show_bug.cgi?id=212295
| colonwqbang wrote:
| Why don't you think it has legal weight? Or did you mean
| something else?
|
| As far as know the point of EXPORT_SYMBOL_GPL was to push
| back on companies like Nvidia who wanted to exploit
| loopholes in the GPL. That seems to me like a reasonable
| objective.
|
| Relevant Torvalds quote:
| https://yarchive.net/comp/linux/export_symbol_gpl.html
| rincebrain wrote:
| Sure, and that alone isn't an unreasonable premise - as
| he says, intent matters.
|
| But if you're marking interfaces as GPL-only, or
| implementing taint detection so that using a non-
| SYMBOL_GPL kernel symbol which calls a GPL-only function
| treats the non-SYMBOL_GPL symbol as GPL-only and blocks
| your linking, it gets a bit out of hand.
|
| Building the kernel with certain kernel options makes
| modules like OpenZFS or OpenAFS not link because of that
| taint propagation - because things like the lockdep
| checker turn uninfringing calls into infringing ones.
|
| Or a little while ago, there was a change which broke
| building on PPC because a change made a non-SYMBOL_GPL
| call on POWER into a SYMBOL_GPL one indirectly, and when
| the original author was contacted, he sent a patch
| reverting the changed symbol, and GregKH refused to pull
| it into stable, suggesting distros could carry it if they
| wanted to. (Of course, he had happily merged a change
| into -stable earlier that just implemented more
| aggressive GPL tainting and thereby broke things like the
| aforementioned...)
| PlutoIsAPlanet wrote:
| The Linux kernel has never supported out of tree modules
| like how ZFS works out of tree.
|
| All ZFS needs to do is just have one of Oracle's many
| lawyers say "CDDL is compatible with GPL". Yet Oracle
| doesn't.
| rincebrain wrote:
| "All."
|
| It's explicitly not compatible with GPL, though. It has
| clauses that are more restrictive than GPL, and IIRC some
| people who contributed to the OpenZFS project did so
| explicitly without allowing later CDDL license revisions,
| which removes Oracle's ability to say CDDL-2 or whatever
| is GPL-compatible.
|
| So even if someone rolled up dumptrucks of cash and
| convinced Oracle that everything was great, they don't
| have all the control needed to do that.
| Dylan16807 wrote:
| To have legal weight, it has to be a signal that you're
| implementing something that is derivative of kernel code.
| That's the directly stated intent of EXPORT_SYMBOL_GPL.
|
| But "call an opaque function that saves SIMD state" is
| obviously not derivative of the kernel code in any way.
| The more exports that get badly marked this way, the more
| EXPORT_SYMBOL_GPL becomes indistinguishable from
| EXPORT_SYMBOL.
| colonwqbang wrote:
| I see it as just a kind of "warranty void if seal
| broken". Don't do this or you _may_ be in violation of
| the GPL. Maybe a legal court in $country would find in
| your favour (I 'm not convinced it's as clear cut as you
| imply). Maybe they would find that you willfully
| infringed, despite the kernel devs clearly warning you
| not to do it.
|
| The main "legal effect" I see is that you are not willing
| to take that risk, just like Oracle isn't.
| bakul wrote:
| I suspect the underlying issues for not unifying the two
| have more to do with the ZFS design than anything to do
| with Linux. It may be the codebase is far too large at
| this stage to make such a fundamental change.
| rincebrain wrote:
| I don't think so. The memory management stuff is pretty
| well abstracted; on FBSD it just glues into UMA pretty
| transparently, it's just on Linux there's a lot of
| machinery for implementing our own little cache
| allocator, because Linux's kernel cache allocator is very
| limited in what sizes it will give you, and sometimes ZFS
| wants 16M (not necessarily contiguous) regions because
| someone said they wanted 16M records.
|
| The ZoL project lead said at one point there were a
| variety of reasons this wasn't initially done for the
| Linux integration [1], but that it was worth taking
| another look at since that was a decade ago now. Having
| looked at the Linux memory subsystems recently for
| various reasons, I would suspect the limiting factor is
| that almost all the Linux memory management functions
| that involve details beyond "give me X pages" are
| SYMBOL_GPL, so I suspect we couldn't access whatever
| functionality would be needed to do this.
|
| I could be wrong, though, as I wasn't looking at the code
| for that specific purpose, so I might have missed
| functionality that would provide this.
|
| [1] - https://github.com/openzfs/zfs/issues/10255#issueco
| mment-620...
| bakul wrote:
| Behlendorf's comment in that thread seems to be talking
| about linux integration. My point was this is an older
| issue, going back to the Sun days. See for instance this
| thread where McVoy complains about the same issue: htt
| ps://www.tuhs.org/pipermail/tuhs/2021-February/023013.htm
| ...
| rincebrain wrote:
| That seems more like it's complaining about it not being
| the actual page cache, not it not being counted as
| "cache", which is a larger set in at least Linux than
| just the page cache itself.
|
| But sure, it's certainly an older issue, and given that
| the ABD rework happened, I wouldn't put anything past
| being "feasible" if the benefits were great enough.
|
| (Look at the O_DIRECT zvol rework stuff that's pending (I
| believe not merged) for how a more cut-through memory
| model could be done, though that has all the tradeoffs
| you might expect of skipping the abstractions ZFS uses to
| minimize the ability of applications to poke holes in the
| abstraction model and violate consistency, I believe...)
| the8472 wrote:
| Could the linux integration use dax[0] to bypass the page
| cache and go straight to ARC?
|
| [0] https://www.kernel.org/doc/Documentation/filesystems/
| dax.txt
| gigatexal wrote:
| This is amazing. A detail of Netflix that, I a plebe,
| wouldn't know if not for this site.
| ComputerGuru wrote:
| Actually, Drew's presentations about Netflix, FreeBSD, ZFS,
| saturating high-bandwidth network adapters, etc. are
| legendary and have been posted far and wide. But having him
| available to answer questions on HN just takes it to a
| whole 'nother level.
| drewg123 wrote:
| You're making me blush.. But, to set the record straight:
| I actually know very little about ZFS, beyond basic
| user/admin knowledge (from having run it for ~15 years).
| I've never spoken about it, and other members of the team
| I work for at Netflix are far more knowledgeable about
| ZFS, and are the ones who have managed the conversion of
| our fleet to ZFS for non-content partitions.
| gigatexal wrote:
| Have they ever blogged or spoken at conferences about it?
| I soak up all that content -- at least I try to.
| ComputerGuru wrote:
| I've devoured your FreeBSD networking presentations but I
| guess I must have confused a post about tracking down a
| ZFS bug in production written by someone else with all
| the other content you've produced.
|
| Back to the topic at hand, it's actually scary how little
| software exposes control over whether or not sendfile is
| used, assuming support is only a matter of OS and kernel
| version without taking filesystem limitations into
| account. I ran into a terrible Samba on FreeBSD bug
| (shares remotely disconnected and connections reset with
| moderate levels of concurrent ro access from even a
| single client) that I ultimately tracked down to sendfile
| being enabled in the (default?) config - so it wasn't
| just the expected "performance requirements not being
| met" with sendfile on ZFS but even other reliability
| issues (almost certainly exposing a different underlying
| bug, tbh). Imagine if Samba didn't have a tunable to
| set/override sendfile support, though.
| xmodem wrote:
| If you can share, what type of non-content data do the nodes
| store? Is this just OS+application+logs?
| nightfly wrote:
| Yes, Ubuntu 20.04 and 22.04. But we've been running ZFS in some
| form or other for 10+ years. ACL support not as good/easy to
| use as Solaris/FreeBSD. Not having weird pathological
| performance issues with kernel memory allocation like we had
| with FreeBSD though. Sometimes we have issues with automatic
| pool import on boot, so that's something to be careful with.
| The tooling is great though, and we've never had catastrophic
| failure that was due to ZFS, only due to failing hardware.
| DvdGiessen wrote:
| In production on SmartOS (illumos) servers running applications
| and VM's, on TrueNAS and plain FreeBSD for various storage and
| backups, and on a few Linux-based workstations. Using mirrors
| and raidz2 depending on the needs of the machines.
|
| We've successfully survived numerous disk failures (a broken
| batch of HDD's giving all kinds of small read errors, an SSD
| that completely failed and disappeared, etc), and were in most
| cases able to replace them without a second of downtime (would
| have been all cases if not for disks placed in hard-to-reach
| places, now only a few minutes downtime to physically swap the
| disk).
|
| Snapshots work perfectly as well. Systems are set up to
| automatically make snapshots using [1], on boot, on a timer,
| and right before potentially dangerous operations such as
| package manager commands as well. I've rolled back after
| botched OS updates without problems; after a reboot the machine
| was back in its old state. Also rolled back a live system a
| few times after a broken package update, restoring the
| filesystem state without any issues. Easily accessing old
| versions of a file is an added bonus which has been helpful a
| few times.
|
| Send/receive is ideal for backups. We are able to send
| snapshots between machines, even across different OSes, without
| issues. We've also moved entire pools from one OS to another
| without problems.
|
| Knowing we have automatic snapshots and external backups
| configured also allows me to be very liberal with giving root
| access to inexperienced people to various (non-critical)
| machines, knowing that if anything breaks it will always be
| easy to roll back, and encouraging them to learn by
| experimenting a bit, to the point where we can even diff
| between snapshots to inspect what changed and learn from that.
|
| Biggest gotchas so far have been on my personal Arch Linux
| setup, where the out-of-tree nature of ZFS has caused some
| issues like an incompatible kernel being installed, the ZFS
| module failing to compile, and my workstation subsequently
| being unable to boot. But even that was solved by my entire
| system running on ZFS: a single rollback from my bootloader [2]
| and all was back the way it was before.
|
| Having good tooling set up definitely helped a lot. My monkey
| brain has the tendency to think "surely I got it right this
| time, so no need to make a snapshot before trying out X!",
| especially when experimenting on my own workstation. Automating
| snapshots using a systemd timer and hooks added to my package
| manager saved me a number of times.
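|
| The pre-change snapshot part is a tiny script (dataset name is
| hypothetical) that a package-manager hook or systemd unit can
| call:
|     #!/bin/sh
|     # timestamped safety snapshot of the root dataset
|     zfs snapshot \
|         "zroot/ROOT/default@pre-change-$(date +%Y%m%d-%H%M%S)"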
|
| [1]: https://github.com/psy0rz/zfs_autobackup [2]:
| https://zfsbootmenu.org/
| enneff wrote:
| I use ZFS on Debian for my home file server. The setup is just
| a tiny NUC with a couple of large USB hard drives, mirrored
| with ZFS. I've had drives fail and painlessly replaced and
| resilvered them. This is easily the most hassle free file
| storage setup I've owned; been going strong over 10 years now
| with little to no maintenance.
| crest wrote:
| I use it as my default file system on FreeBSD. It was rough in
| FreeBSD 7.x (around 2009), but starting with FreeBSD 8.x it has
| been rock solid to this day. The only gotcha (which the
| documentation warns about) has been that automatic block level
| deduplication is only useful in a few special applications and
| has a large main memory overhead unless you can accept terrible
| performance for normal operations (e.g. a bandwidth limited
| offsite backup).
| yjftsjthsd-h wrote:
| Sure; we get good mileage out of compression and snapshots
| (well, mostly send-recv for moving data around rather than
| snapshots in-place). I think the only problems have been very
| specific to our install process (non-standard kernel in the
| live environment; if we used the normal distro install process
| it would be fine).
| mattjaynes wrote:
| ZFS on Linux has improved a lot in the last few years. We
| (prematurely) moved to using it in production for our MySQL
| data about 5 years ago and initially it was a nightmare due to
| unexplained stalling which would hang MySQL for 15-30 minutes
| at random times. I'm sure it shortened my life a few years
| trying to figure out what was wrong when everything was on
| fire. Fortunately, they have resolved those issues in the
| subsequent releases and it's been much more pleasant after
| that.
| SkyMarshal wrote:
| Not in production, but using ZoL on my personal workstations.
| https://zfsonlinux.org/
|
| Some discussion:
| https://www.reddit.com/r/NixOS/comments/ops0n0/big_shoutout_...
| szundi wrote:
| Yes and it is awesome, no issues.
| unixhero wrote:
| Yes. Using the latest ZFS On Linux release on Debian, via
| Proxmox. Never had any problems, ever.
| benlivengood wrote:
| "production" at home on Debian 11, previously on FreeBSD 10-13.
| The weirdest gotcha has been related to sending encrypted raw
| snapshots to remote machines[0],[1]. These have been the first
| instabilities I had with ZFS in roughly 15 years around the
| filesystem since switching to native encryption this year.
| Native encryption seems to be barely stable for production use;
| no actual data corruption but automatic synchronization (I use
| znapzend) was breaking frequently. Recent kernel updates fixed
| my problem although some of the bug reports are still open. I
| only moved on from FreeBSD because of more familiarity with
| Linux.
|
| A slightly annoying property of snapshots and clones is the
| inability to fully re-root a tree of snapshots, e.g.
| permanently split a clone from its original source and allow
| first-class send/receive from that clone. The snapshot which
| originated the clone needs to stick around forever[2]. This
| prevents a typical virtual machine imaging process: keeping a
| base image up to date over time, cloning VMs from it when
| desired, and eventually removing the storage used by the
| original base image after e.g. several OS upgrades.
|
| I don't have any big performance requirements and most file
| storage is throughput based on spinning disks which can easily
| saturate the gigabit network.
|
| I also use ZFS on my laptop's SSD under Ubuntu with about 1GB/s
| performance and no shortage of IOPS and the ability to send
| snapshots off to the backup system which is pretty nice. Ubuntu
| is going backwards on support for ZFS and native encryption
| uses a hacky intermediate key under LUKS, but it works.
|
| [0] https://github.com/openzfs/zfs/issues/12014 [1]
| https://github.com/openzfs/zfs/issues/12594
| [2]https://serverfault.com/questions/265779/split-a-zfs-clone
| albertzeyer wrote:
| Also see this issue: https://github.com/openzfs/zfs/issues/405
|
| > It is in FreeBSD main branch now, but disabled by default just
| to be safe till after 14.0 released, where it will be included.
| Can be enabled with loader tunable there.
|
| > more code is needed on the ZFS side for Linux integration. A
| few people are looking at it AFAIK.
| vlovich123 wrote:
| Do Btrfs or ext4 offer this?
| thrtythreeforty wrote:
| Btrfs yes, ext4 no (but I believe xfs does).
|
| This should end up being exposed through cp --reflink=always,
| so you could look up filesystem support for that.
| danudey wrote:
| XFS does, I've used it for specifically this feature before.
| wtallis wrote:
| This feature is basically the same as what underpins the
| reflink feature that btrfs has supported approximately forever
| and xfs has supported for at least several years.
| mustache_kimono wrote:
| Does anyone know whether btrfs or XFS support reflinks from
| snapshot datasets?
| Dylan16807 wrote:
| I can confirm BTRFS yes, but note that source and
| destination need to be on the same mount point before
| kernel 5.18
| ComputerGuru wrote:
| XFS doesn't have native snapshot support, though?
| danudey wrote:
| XFS doesn't have snapshot support, so the short answer
| there is no.
| mustache_kimono wrote:
| Shows what I know about XFS. Thanks!
| PlutoIsAPlanet wrote:
| You can get pseudo-snapshots on XFS with a tool like
| https://github.com/aravindavk/reflink-snapshot
|
| But it still has to duplicate metadata, which, depending
| on the number of files, may cause inconsistency in the
| snapshot.
| plq wrote:
| This is only a tangent given we are talking about
| snapshots and reflink, but just wanted to mention that
| LVM has snapshots, so if you need XFS snapshots, create
| the XFS filesystem on top of an LVM logical volume.
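|
| Roughly (VG/LV names are hypothetical; the snapshot LV needs
| its own space to absorb changed blocks, and an XFS snapshot
| must be mounted with nouuid):
|     lvcreate -L 100G -n data vg0
|     mkfs.xfs /dev/vg0/data
|     mount /dev/vg0/data /srv/data
|     # later: point-in-time snapshot, mounted read-only
|     lvcreate -s -L 10G -n data-snap vg0/data
|     mount -o ro,nouuid /dev/vg0/data-snap /mnt/data-snap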
| dsr_ wrote:
| You can get a similar effect on top of any file system that
| supports hard links with rdfind ( https://rdfind.pauldreik.se/
| ) -- but it's pretty slow.
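|
| Typical rdfind invocations (flags per its documentation; treat
| as a sketch):
|     rdfind -dryrun true /data          # just report duplicates
|     rdfind -makehardlinks true /data   # dupes -> hard links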
|
| The Arch wiki says:
|
| "Tools dedicated to deduplicate a Btrfs formatted partition
| include duperemove, bees, bedup and btrfs-dedup. One may also
| want to merely deduplicate data on a file based level instead
| using e.g. rmlint, jdupes or dduper-git. For an overview of
| available features of those programs and additional
| information, have a look at the upstream Wiki entry.
|
| Furthermore, Btrfs developers are working on inband (also known
| as synchronous or inline) deduplication, meaning deduplication
| done when writing new data to the filesystem. Currently, it is
| still an experiment which is developed out-of-tree. Users
| willing to test the new feature should read the appropriate
| kernel wiki page."
| someplaceguy wrote:
| > You can get a similar effect on top of any file system that
| supports hard links with rdfind (
| https://rdfind.pauldreik.se/ ) -- but it's pretty slow.
|
| It's a similar effect only if you don't modify the files, I
| think.
|
| If you "clone" a file with a hard link and you modify the
| contents of one copy, the other copy would also be equally
| modified.
|
| As far as I understand this wouldn't happen with this type of
| block cloning: each copy of the file would be completely
| separate, except that they may (transparently) share data
| blocks on disk.
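|
| A quick way to see the difference (paths are hypothetical; run
| on a filesystem with reflink support):
|     echo hello > a
|     ln a hardlinked                  # same inode as 'a'
|     cp --reflink=always a refcopy    # new file, shared blocks
|     echo world >> a
|     cat hardlinked                   # prints "hello" and "world"
|     cat refcopy                      # still prints just "hello"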
___________________________________________________________________
(page generated 2023-07-04 23:00 UTC)