[HN Gopher] The Btrfs inode-number epic (part 1: the problem)
       ___________________________________________________________________
        
       The Btrfs inode-number epic (part 1: the problem)
        
       Author : Deeg9rie9usi
       Score  : 58 points
       Date   : 2021-08-23 10:36 UTC (1 day ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | londons_explore wrote:
       | The Linux file interface was not designed for subvolumes to
       | masquerade as directories.
       | 
        | The correct course of action would have been for Linus not to
        | allow btrfs into mainline with such an interface. Each
        | subvolume should have been separately mounted, or the stat
        | interface should have been extended to make subvolumes
        | first-class.
        
         | crest wrote:
          | An ugly workaround would be to embed the subvolume id into
          | the inode number, but that would all but require 64-bit
          | inode numbers, and some tools make daring assumptions about
          | the stability of inode numbers across reboots.
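          | 
          | A minimal sketch of that packing (the bit split is a made-up
          | example: 24 bits of subvolume id, 40 bits of per-subvolume
          | inode number):
          | 
          |     SUBVOL_BITS = 24
          |     INO_BITS = 40
          |     
          |     def pack_ino(subvol_id, ino):
          |         # Refuse values that don't fit the chosen split.
          |         assert subvol_id < (1 << SUBVOL_BITS)
          |         assert ino < (1 << INO_BITS)
          |         return (subvol_id << INO_BITS) | ino
          |     
          |     def unpack_ino(packed):
          |         mask = (1 << INO_BITS) - 1
          |         return packed >> INO_BITS, packed & mask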
        
       | Muromec wrote:
        | Don't hardlinks have the same inode, since they are
        | essentially the same file included in a different directory?
        | Is this also why we don't have hardlinks to directories?
        
         | turminal wrote:
          | They do have the same inode. Directories have inodes too, so
          | that's not an obstacle.
          | 
          | We don't have hardlinks to directories because it is a
          | terrible idea functionality-wise: it would allow cycles in
          | the directory tree and make ".." ambiguous.
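          | 
          | Easy to check from userspace; a quick sketch:
          | 
          |     import os, tempfile
          |     
          |     d = tempfile.mkdtemp()
          |     a = os.path.join(d, "a")
          |     b = os.path.join(d, "b")
          |     open(a, "w").close()
          |     os.link(a, b)  # b is a hard link to a
          |     
          |     sa, sb = os.stat(a), os.stat(b)
          |     assert sa.st_ino == sb.st_ino  # same inode number
          |     assert sa.st_nlink == 2        # two directory entries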
        
       | ansible wrote:
       | In our current setup, we only NFS export a btrfs subvolume, with
       | no subvolumes inside that one. So that should be OK, if I'm
       | reading this right.
       | 
       | I'm a big fan of btrfs snapshots, but with some other recent
       | issues, I'm wondering if we should migrate away from btrfs.
        
         | baggy_trough wrote:
         | I've run a service for 10 years and during that time, there
         | have only been 2 frightening outages (that were difficult to
         | understand and debug) - both of them were due to btrfs.
        
       | aaron_m04 wrote:
       | Perhaps inode numbers should never have been numbers in the first
       | place. If they were strings it would be trivial to prepend a
       | volume ID.
        
         | turminal wrote:
         | That would create a whole lot of other problems and probably
         | some performance penalty, because strings are much, much more
         | difficult to work with in the context of a kernel filesystem
         | driver.
        
       | NovemberWhiskey wrote:
        | Inode non-uniqueness is a hell of a problem to debug: it's not
        | necessarily a problem most of the time, but then you'll run
        | into a corner case where something depends on it.
       | 
       | e.g. we had a big issue with this back around 2008 where we
       | accidentally had two AFS volumes with overlapping inode ranges
       | (AFS was represented as a single device, even if there were
       | multiple volumes mounted).
       | 
       | Linux's dynamic linker had an optimization that cached
       | device/inode for DSOs that had already been loaded, and wouldn't
       | attempt to load them again.
       | 
        | If you had a binary that depended on two DSOs from different
        | volumes with different names but the same inode number, the
        | first one would get linked OK, but the linker would then
        | ignore the second one and you'd get missing dynamic linker
        | dependencies out of nowhere.
        | 
        | This was the first and hopefully last time I read the source
        | to ld.so during a production outage.
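        | 
        | Roughly the logic described above, sketched in Python (an
        | illustration of the caching, not glibc's actual code):
        | 
        |     import os
        |     
        |     loaded = {}  # (st_dev, st_ino) -> DSO already mapped
        |     
        |     def load_dso(path):
        |         st = os.stat(path)
        |         key = (st.st_dev, st.st_ino)
        |         if key in loaded:
        |             # Same device+inode pair: assume the object is
        |             # already mapped and skip it. With colliding
        |             # inodes on one device, this wrongly drops a
        |             # distinct library, as described above.
        |             return loaded[key]
        |         loaded[key] = path
        |         return path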
        
       | mjevans wrote:
       | From the first part alone, much of the 'sin' is likely non-unique
       | inode numbers across a pool of BTRFS devices. I could understand
       | partitioning ranges of new numbers for performance; the non-
       | uniqueness seems just silly.
        
         | aidenn0 wrote:
         | Ranges aren't possible because there aren't enough bits in a
         | 64-bit inode.
         | 
          | ZFS doesn't have this problem because each dataset (the
          | equivalent of a subvolume) is treated as a separate
          | mountpoint. This can be annoying with NFS because you need
          | to export and mount each dataset separately.
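          | 
          | That separation is visible from userspace: each mounted
          | dataset gets its own st_dev, so the device number changes at
          | the dataset boundary. A sketch (the paths are hypothetical;
          | substitute your own pool and dataset):
          | 
          |     import os
          |     
          |     parent = os.stat("/tank")          # pool root (made up)
          |     child = os.stat("/tank/dataset")   # dataset (made up)
          |     
          |     # Each dataset is its own mount, so the device numbers
          |     # differ across the boundary.
          |     print(parent.st_dev, child.st_dev)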
        
           | normaler wrote:
            | At least when you use the built-in ZFS NFS support, it
            | also exports "subvolumes".
        
             | aidenn0 wrote:
              | Well, on fairly recent TrueNAS at least, you need to
              | export each dataset separately, and they show up in
              | /proc/mounts.
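              | 
              | A quick way to see them (a sketch that just filters
              | /proc/mounts by filesystem type):
              | 
              |     # List mounted ZFS datasets from /proc/mounts.
              |     with open("/proc/mounts") as f:
              |         for line in f:
              |             src, mnt, fstype = line.split()[:3]
              |             if fstype == "zfs":
              |                 print(src, "->", mnt)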
        
               | normaler wrote:
                | I have the zpool "data" with ZFS volumes like this:
                | 
                | data/media/video/movies data/media/video/series
                | 
                | I set sharenfs with read-write access for
                | 
                | data/media/video
                | 
                | which gives me access via NFS to all the subvolumes.
        
           | mbreese wrote:
           | _> This can be annoying with NFS because you need to export
           | and mount each dataset separately._
           | 
           | I use ZFS on multiple servers, and appreciate this approach.
           | The problem is -- you're going to feel pain one way or the
           | other. In that case, make sure that the pain is obvious and
           | predictable.
           | 
           | If you're using subvolumes/datasets, you will have to deal
           | with a problem at some point. Either you have to manually
           | export multiple NFS volumes (ZFS), or _potentially_ have
           | inode uniqueness issues (Btrfs).
           | 
            | I'd much rather have the problems be excruciatingly
            | obvious. I can script generating a config file for
            | exporting many datasets (and I have done so). I can't
            | really deal with non-unique inodes in the same manner.
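            | 
            | A sketch of that kind of generation (the dataset names,
            | client, and options are made up):
            | 
            |     # Emit /etc/exports lines for a list of datasets,
            |     # assuming the default ZFS mountpoint of /<dataset>.
            |     datasets = ["tank/builds/main", "tank/builds/release"]
            |     options = "10.0.0.0/24(rw,no_subtree_check)"
            |     
            |     for ds in datasets:
            |         print("/%s %s" % (ds, options))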
        
             | aidenn0 wrote:
              | I overall agree, but on our build server the person who
              | can add new branches is not the same person who can add
              | new entries to the autofs map, so we don't use one
              | dataset per branch, even though that would make other
              | things a lot easier.
        
       | darknavi wrote:
        | I have seen too many posts from users who have had their btrfs
        | pools corrupted for me to dabble with it (using Unraid
        | specifically).
       | 
       | I'll stick with xfs.
        
         | __david__ wrote:
          | Similarly, after I got stuck with a corrupted xfs volume
          | that couldn't be repaired automatically or manually (the
          | process would take literal days and then fail in some
          | unhelpful way), I gave up and went back to using ext4 for
          | everything.
        
         | SkyMarshal wrote:
          | _> I'll stick with xfs_
          | 
          | Why not ZFS or ZoL (https://zfsonlinux.org/)?
          | 
          | Because it's not built into the kernel, or some other
          | reason?
        
       | yjftsjthsd-h wrote:
        | So how does ZFS handle the same problem? (I would _guess_ that
        | it just goes through with making each dataset a full
        | filesystem and deals with the overhead for NFS, but I don't
        | know that.)
        
         | Deeg9rie9usi wrote:
          | Since ZFS does not really have the concept of subvolumes
          | like btrfs, it does not suffer from this.
        
           | magicalhippo wrote:
            | I'm no expert, so maybe I'm missing something.
            | 
            | To me it seems that ZFS snapshots represent a similar
            | challenge[1], given that the snapshots can be accessed
            | through the .zfs directory, and that ZFS also plays
            | tricks[2] with the inodes.
            | 
            | [1]: https://docs.oracle.com/cd/E19253-01/819-5461/gbiqe/index.ht...
            | 
            | [2]: http://mikeboers.com/blog/2019/02/21/zfs-inode-generations
        
             | rincebrain wrote:
             | They're always explicit mounts on ZFS. (Even the .zfs
             | directory triggers mounting on access of things under
             | snapshot/.)
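              | 
              | A way to see it (a sketch; the dataset path is made up):
              | 
              |     import os
              |     
              |     # Listing under .zfs/snapshot and statting an entry
              |     # triggers the automount; each mounted snapshot then
              |     # shows up with its own st_dev.
              |     snapdir = "/tank/dataset/.zfs/snapshot"
              |     for name in os.listdir(snapdir):
              |         st = os.stat(os.path.join(snapdir, name))
              |         print(name, st.st_dev)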
        
           | yjftsjthsd-h wrote:
           | How is a BTRFS sub volume different from a ZFS child
           | dataset/filesystem?
        
           | speed_spread wrote:
           | BTRFS has volumes/subvolumes while ZFS has pools/filesystems.
           | I suspect they have considerably different implementations
           | but the userspace implications are similar.
        
       | Deeg9rie9usi wrote:
        | Here you can find part 2 (the solutions):
        | https://lwn.net/SubscriberLink/866709/671690ea60c1cb37/
        
         | nextaccountic wrote:
         | Thank you
        
       ___________________________________________________________________
       (page generated 2021-08-24 23:01 UTC)