[HN Gopher] The Btrfs inode-number epic (part 1: the problem)
___________________________________________________________________
The Btrfs inode-number epic (part 1: the problem)
Author : Deeg9rie9usi
Score : 58 points
Date : 2021-08-23 10:36 UTC (1 days ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| londons_explore wrote:
| The Linux file interface was not designed for subvolumes to
| masquerade as directories.
|
| The correct course of action would have been for Linus to not
| allow btrfs into mainline with such an interface. Each subvolume
| should have been separately mounted, or the stat interface should
| have been extended to make subvolumes first class.
| crest wrote:
| An ugly workaround would be to embed the subvolume id into the
| inode number, but it would almost require 64 bit inode numbers
| and some tools make daring assumptions about the stability of
| inode numbers across reboots.
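|
| A minimal sketch of the workaround described above, assuming a
| 64-bit inode number is split into a subvolume field and a
| per-subvolume inode field. The 24/40 split and the names pack_ino
| and unpack_ino are hypothetical, chosen only for illustration; this
| is not what Btrfs does.

```python
from typing import Tuple

# Hypothetical split of a 64-bit inode number. The 24/40 division is an
# assumption made for illustration, not something Btrfs implements.
SUBVOL_BITS = 24
INODE_BITS = 64 - SUBVOL_BITS
INODE_MASK = (1 << INODE_BITS) - 1


def pack_ino(subvol_id: int, ino: int) -> int:
    """Combine a subvolume ID and a per-subvolume inode number."""
    if not 0 <= subvol_id < (1 << SUBVOL_BITS):
        raise OverflowError("subvolume ID does not fit in SUBVOL_BITS")
    if not 0 <= ino <= INODE_MASK:
        raise OverflowError("inode number does not fit in INODE_BITS")
    return (subvol_id << INODE_BITS) | ino


def unpack_ino(packed: int) -> Tuple[int, int]:
    """Recover (subvol_id, ino) from a packed 64-bit value."""
    return packed >> INODE_BITS, packed & INODE_MASK
```

| With this scheme the same per-subvolume inode number in two
| subvolumes yields two distinct packed values, but either field can
| overflow, which is why the split only works if both ID spaces stay
| small enough.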
| Muromec wrote:
| Don't hardlinks have the same inode, as they are essentially the
| same file, included in a different directory? Is this also why we
| don't have hardlinks to directories?
| turminal wrote:
| They do have the same inode. Directories have inodes too, so
| that's not an obstacle.
|
| We don't have hardlinks to directories because that is a
| terrible idea functionality-wise.
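|
| A small sketch that checks the claim above, assuming the temporary
| directory lives on a filesystem that supports hard links. The
| function name hardlink_shares_inode is made up for this example.

```python
import os
import tempfile


def hardlink_shares_inode():
    """Create a file and a hard link to it in another directory,
    then compare what stat() reports for the two names."""
    with tempfile.TemporaryDirectory() as top:
        original = os.path.join(top, "a.txt")
        subdir = os.path.join(top, "sub")
        os.mkdir(subdir)
        link = os.path.join(subdir, "b.txt")

        with open(original, "w") as f:
            f.write("hello")
        os.link(original, link)  # second directory entry, same file

        st_a, st_b = os.stat(original), os.stat(link)
        same_file = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
        return same_file, st_a.st_nlink
```

| Both names report the same (st_dev, st_ino) pair, and the link count
| reflects the two directory entries.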
| ansible wrote:
| In our current setup, we only NFS export a btrfs subvolume, with
| no subvolumes inside that one. So that should be OK, if I'm
| reading this right.
|
| I'm a big fan of btrfs snapshots, but with some other recent
| issues, I'm wondering if we should migrate away from btrfs.
| baggy_trough wrote:
| I've run a service for 10 years and during that time, there
| have only been 2 frightening outages (that were difficult to
| understand and debug) - both of them were due to btrfs.
| aaron_m04 wrote:
| Perhaps inode numbers should never have been numbers in the first
| place. If they were strings it would be trivial to prepend a
| volume ID.
| turminal wrote:
| That would create a whole lot of other problems and probably
| some performance penalty, because strings are much, much more
| difficult to work with in the context of a kernel filesystem
| driver.
| NovemberWhiskey wrote:
| Inode non-uniqueness is a hell of a problem to debug: it's not
| necessarily a problem most of the time, but then you'll run into a
| corner case where something depends on it.
|
| e.g. we had a big issue with this back around 2008 where we
| accidentally had two AFS volumes with overlapping inode ranges
| (AFS was represented as a single device, even if there were
| multiple volumes mounted).
|
| Linux's dynamic linker had an optimization that cached
| device/inode for DSOs that had already been loaded, and wouldn't
| attempt to load them again.
|
| If you had a binary that depended on linking two DSOs from
| different volumes that had different names, but the same inode
| number, the first one would get linked OK but the linker would
| then ignore the second one and you'd get missing dynamic linker
| dependencies out of nowhere.
|
| This was the first and hopefully last time reading the source to
| ld.so during a production outage.
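|
| A toy model of the failure described above, assuming a loader that
| remembers libraries by their (device, inode) pair. This is not the
| actual ld.so code; the Loader class, the paths and the numbers are
| all invented for illustration.

```python
from typing import Dict, NamedTuple


class FileId(NamedTuple):
    dev: int  # device number, as in st_dev
    ino: int  # inode number, as in st_ino


class Loader:
    """Toy loader that skips any file whose (dev, ino) was seen before."""

    def __init__(self) -> None:
        self._loaded: Dict[FileId, str] = {}

    def load(self, path: str, file_id: FileId) -> bool:
        """Return True if the library was loaded, False if it was
        treated as already loaded because its identity matched."""
        if file_id in self._loaded:
            return False
        self._loaded[file_id] = path
        return True

    def loaded_paths(self):
        return list(self._loaded.values())


# Two different libraries on different volumes that happen to report
# the same device and inode number (simulated values).
loader = Loader()
first = loader.load("/afs/vol1/libfoo.so", FileId(dev=7, ino=1234))
second = loader.load("/afs/vol2/libbar.so", FileId(dev=7, ino=1234))
```

| In this model the second library is silently skipped, so anything
| that needs its symbols fails later even though the file exists.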
| mjevans wrote:
| From the first part alone, much of the 'sin' is likely non-unique
| inode numbers across a pool of BTRFS devices. I could understand
| partitioning ranges of new numbers for performance; the non-
| uniqueness seems just silly.
| aidenn0 wrote:
| Ranges aren't possible because there aren't enough bits in a
| 64-bit inode.
|
| ZFS doesn't have this problem because each dataset (the
| equivalent of subvolume) is treated as a separate mountpoint.
| This can be annoying with NFS because you need to export and
| mount each dataset separately.
| normaler wrote:
| At least when you use the built-in ZFS NFS sharing, it also
| exports "subvolumes".
| aidenn0 wrote:
| Well on fairly recent TrueNAS at least you need to export
| each dataset separately, and they show up in /proc/mounts.
| normaler wrote:
| I have the zpool "data" with ZFS volumes like this:
|
| data/media/video/movies
| data/media/video/series
|
| I set sharenfs with read-write access for
|
| data/media/video
|
| which gives me access via NFS to all the subvolumes.
| mbreese wrote:
| _> This can be annoying with NFS because you need to export
| and mount each dataset separately._
|
| I use ZFS on multiple servers, and appreciate this approach.
| The problem is -- you're going to feel pain one way or the
| other. In that case, make sure that the pain is obvious and
| predictable.
|
| If you're using subvolumes/datasets, you will have to deal
| with a problem at some point. Either you have to manually
| export multiple NFS volumes (ZFS), or _potentially_ have
| inode uniqueness issues (Btrfs).
|
| I'd much rather have the problems made excruciatingly
| obvious. I can script generating a config file for exporting
| many datasets (and I have done so). I can't really deal with
| non-unique inodes in the same manner.
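|
| A rough sketch of the kind of script mentioned above, assuming each
| dataset is mounted at /<dataset name> and exported in the usual
| /etc/exports "path client(options)" form. The function name, the
| client network and the option set are placeholders, and the dataset
| names are borrowed from the example earlier in the thread.

```python
from typing import Iterable, List


def exports_lines(
    datasets: Iterable[str],
    client: str = "192.0.2.0/24",  # placeholder documentation network
    options: str = "rw,sync,no_subtree_check",
) -> List[str]:
    """Build one /etc/exports-style line per dataset.

    Assumes every dataset is mounted at "/" + its name, which is only
    true if the mountpoints were left at their defaults.
    """
    lines = []
    for name in sorted(datasets):
        lines.append(f"/{name} {client}({options})")
    return lines


if __name__ == "__main__":
    for line in exports_lines(
        ["data/media/video/series", "data/media/video/movies"]
    ):
        print(line)
```

| In practice the dataset list would come from the storage host
| rather than being hard-coded.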
| aidenn0 wrote:
| I overall agree, but on our build-server the person who can
| add new branches is not the same person who can add new
| entries to the autofs map, so we don't use one dataset per
| branch, which would make other things a lot easier.
| darknavi wrote:
| I have seen too many posts with users who have had their btrfs
| pools corrupted for me to try and dabble with it. (Using Unraid
| specifically).
|
| I'll stick with xfs.
| __david__ wrote:
| Similarly, after I got stuck with a corrupted xfs volume that
| couldn't be self-fixed or manually fixed (the process would
| take literal days and then fail in some unhelpful way) I gave
| up and went back to using ext4 for everything.
| SkyMarshal wrote:
| Why not ZFS or ZoL (https://zfsonlinux.org/)?
|
| B/c it's not built into the kernel, or some other reason?
| yjftsjthsd-h wrote:
| So how does ZFS handle the same problem? (I would _guess_ that it
| just goes through with making each dataset a full filesystem and
| just deals with the overhead for NFS, but I don't know that)
| Deeg9rie9usi wrote:
| Since ZFS does not really have the concept of sub-volumes like
| btrfs, it does not suffer from it.
| magicalhippo wrote:
| I'm no expert so maybe I'm missing something.
|
| To me it seems that ZFS snapshots represent a similar
| challenge[1] given that the snapshots can be accessed through
| the .zfs directory, and that ZFS also plays tricks[2] with
| the inodes.
|
| [1]: https://docs.oracle.com/cd/E19253-01/819-5461/gbiqe/index.ht...
|
| [2]: http://mikeboers.com/blog/2019/02/21/zfs-inode-generations
| rincebrain wrote:
| They're always explicit mounts on ZFS. (Even the .zfs
| directory triggers mounting on access of things under
| snapshot/.)
| yjftsjthsd-h wrote:
| How is a BTRFS subvolume different from a ZFS child
| dataset/filesystem?
| speed_spread wrote:
| BTRFS has volumes/subvolumes while ZFS has pools/filesystems.
| I suspect they have considerably different implementations
| but the userspace implications are similar.
| Deeg9rie9usi wrote:
| Here you can find part 2: the solutions:
| https://lwn.net/SubscriberLink/866709/671690ea60c1cb37/
| nextaccountic wrote:
| Thank you
___________________________________________________________________
(page generated 2021-08-24 23:01 UTC)