[HN Gopher] Bcachefs may be headed out of the kernel
       ___________________________________________________________________
        
       Bcachefs may be headed out of the kernel
        
       Author : ksec
       Score  : 83 points
       Date   : 2025-07-04 13:32 UTC (9 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | guerrilla wrote:
       | The drama with Linux filesystems is just nuts... It never ends.
        
         | mschuster91 wrote:
         | The stakes are the highest across the entire kernel. Data
         | that's corrupt cannot (easily) be uncorrupted.
        
           | tpolzer wrote:
           | Bad drivers could brick (parts of) your hardware permanently.
           | 
           | While you should have a backup of your data anyway.
        
         | quotemstr wrote:
         | At least Kent hasn't murdered his wife
        
           | Tostino wrote:
           | First thing that came to mind when I saw this drama.
        
         | msgodel wrote:
         | It's crazy people spend so much time paying attention to
         | Hollywood celebrity drama.
         | 
         |  _Opens LKML archive hoping for another Linus rant._
        
         | rendaw wrote:
          | I'm sure there's just as much political all-star programmer
          | fighting at google/apple/microsoft/whatever too; it's just
          | that this is done in public.
        
       | chasil wrote:
       | So the assertion is that users with (critical) data loss bugs
       | need complete solutions for recovery and damage containment with
       | all possible speed, and without this "last mile" effort,
       | stability will never be achieved.
       | 
        | The objection is that even the tiniest bug-fix windows get
        | everything but the kitchen sink.
       | 
       | These are both uncomfortable positions to occupy, without doubt.
        
         | koverstreet wrote:
         | No, the assertion is that the proper response to a bug often
         | (and if it's high impact - always) involves a lot more than
         | just the bugfix.
         | 
         | And the whole reason for a filesystem's existence is to store
          | and maintain your data, so if that is what the patch is for,
         | yes, it should be under consideration as a hotfix.
         | 
         | There's also the broader context: it's a major problem for
         | stabilization if we can't properly support the people using it
         | so they can keep testing.
         | 
         | More context: the kernel as a whole is based on fixed time
         | tables and code review, which it needs because QA (especially
         | automated testing) is extremely spotty. bcachefs's QA, both
         | automated testing and community testing, is extremely good, and
         | we've had bugfix patchsets either held up or turn into
         | flamewars because of this mismatch entirely too many times.
        
           | WesolyKubeczek wrote:
           | > No, the assertion is that the proper response to a bug
           | often (and if it's high impact - always) involves a lot more
           | than just the bugfix.
           | 
           | Then what you do is you try to split your work in two. You
           | could think of a stopgap measure or a workaround which is
           | small, can be reviewed easily, and will reduce the impact of
           | the bug while not being a "proper" fix, and prepare the
           | "properer" fix when the merge window opens.
           | 
           | I would ask, since the bug probably lived since the last
            | stable release, how come it fell through the cracks and was
            | only noticed recently? Could it be that not all setups
           | are affected? If so, can't they live with it until the next
           | merge window?
           | 
           | By making a "feature that fixes the bug for real", you
           | greatly expand the area in which new, unknown bugs may land,
           | with very little time to give it proper testing. This is
            | inevitable, as evidenced by the simple fact that the bug you
            | were trying to fix exists. You can be good, but not _that_
            | good. Nobody is that good. If anybody were that good, they
            | wouldn't have the bug in the first place.
           | 
           | If you have commercial clients who use your filesystem and
           | you have contractual obligations to fix their bugs and keep
           | their data intact, you could (I'd even say "should") maintain
           | an out-of-tree version with its own release and bugfix
           | schedule. This is IMO the only reasonable way to have it,
           | because the kernel is a huge administrative machine with lots
           | of people, and by mainlining stuff, you necessarily become
           | co-dependent on the release schedule for the whole kernel. I
           | think a conflict between kernel's release schedule and
           | contractual obligations, if you have any, is only a matter of
           | time.
        
             | koverstreet wrote:
             | > Then what you do is you try to split your work in two.
             | You could think of a stopgap measure or a workaround which
             | is small, can be reviewed easily, and will reduce the
             | impact of the bug while not being a "proper" fix, and
             | prepare the "properer" fix when the merge window opens.
             | 
             | That is indeed what I normally do. For example, 6.14 and
             | 6.15 had people discovering btree iterator locking bugs
             | (manifesting as assertion pops) while running evacuates on
             | large filesystems (it's hard to test a sufficiently deep
             | tree depth in virtual machine tests with our large btree
             | nodes); some small hotfixes went out in rc kernels, but the
             | majority of the work (a whole project to add assertions for
             | path->should_be_locked, which should shut these down for
             | good) waited until the 6.16 merge window.
             | 
             | That was for a less critical bug - your machine crashing is
             | somewhat less severe than losing a filesystem.
             | 
             | In this case, we had a bug pop up in 6.15 where the link
             | count in the VFS inode getting screwed up caused an inode
             | to be deleted that shouldn't have been - a subvolume root -
             | and then an untested repair path took out the entire
             | subvolume.
             | 
             | Ouuuuch.
             | 
             | That's why the repair code was rushed; it had already
             | gotten one filesystem back, and I'd just gotten another
             | report of someone else hitting it - and for every bug
             | report there are almost always more people who hit it and
             | don't report it.
             | 
             | And considering that a lot of people running bcachefs now
             | are getting it from distro kernels and don't know how to
             | build kernels - that is why it was important to get this
             | out quickly through the normal channels.
             | 
             | In addition, the patch wasn't risky, contrary to what Ted
             | was saying. It's a code path that's very well covered by
             | automated tests, including KASAN/UBSAN/lockdep variants -
              | those would have exploded if this patch were incorrect.
             | 
             | When to ship a patch is always a judgement call, and part
             | of how you make that call is how well your QA process can
             | guarantee the patch is correct. Part of what was going on
             | here is a disconnect between those of us who do make heavy
             | use of modern QA infrastructure and those who do it the old
             | school way, relying heavily on manual review and long
             | testing periods for rc kernels.
        
           | magicalhippo wrote:
            | While I absolutely think you're picking the wrong fights - I
            | don't see why you needed to push so hard on this hill in
            | particular - I am sympathetic to your argument
           | that experimental kernel modules like filesystems might need
           | a different release approach at times.
           | 
           | At work we have our main application which also contains a
           | lot of customer integrations. Our policy has been new
           | features in trunk only, except if it's entirely contained
           | inside a customer-specific integration module.
           | 
           | We do try to avoid it, but this does allow us to be flexible
           | with regards to customer needs, while keeping the base
           | application stable.
           | 
           | This new recovery feature was, as far as I could see,
           | entirely contained within the bcachefs kernel code. Given the
           | experimental status, as long as it was clearly communicated
           | to users, I don't see a huge problem allowing such self-
           | contained features during the RC phase.
           | 
           | Obviously a requirement must be that it doesn't break the
           | build.
        
         | jethro_tell wrote:
         | Who's using an experimental filesystem and risking critical
          | data loss? Rule one of experimental file systems is to have a
          | copy on a non-experimental file system.
        
       | shmerl wrote:
        | Maybe bcachefs should have been governed by a group of people,
       | not a single person.
        
         | mananaysiempre wrote:
         | Committees are good-to-acceptable for keeping things going, but
         | bad for initial design or anything requiring a coherent vision
         | and taste. There are some examples of groups that straddled the
         | boundary between a committee and a creative collaboration and
         | produced good designs (Algol 60; RnRS for n <= 5; IIRC the
         | design of ZFS was produced by a three-person team), but they
         | are more of an exception, and the secret of tying together such
         | groups remotely doesn't seem to have been cracked. Even in the
         | keeping things going department, a committee's inbuilt and
         | implicit self-preservation mechanisms can lead it to keep
         | fiddling with things far longer than would be advisable.
        
           | shmerl wrote:
            | In this case it's more about keeping things in check and not
            | letting one person's tendency to ignore kernel development
            | rules derail the whole project.
           | 
            | I'm not saying those concerns are wrong, but when it causes
            | fallout like being kicked out of the kernel, the downsides
            | are clearly more severe than any potential benefits.
        
           | koverstreet wrote:
           | Actually, I think remote collaboration can work with the
           | right medium and tools. For bcachefs, that's been IRC; we
           | have an extremely active channel where we do a lot of
           | collaborative debugging, design discussion, helping new
           | users, etc.
           | 
           | I know a lot of people heavily use slack/discord these days,
           | but personally I find the web interfaces way too busy. IRC
           | all the way, for me.
           | 
           | But the problem of communicating effectively enough to
           | produce a coherent design is very real - this goes back to
           | Fred Brooks (Mythical Man Month). I think bcachefs turned out
           | very well with the way the process has gone to date, and now
           | that it's gotten bigger, with more distinct subsystems, I am
           | very eagerly looking forward to the date when I can hand off
           | ownership of some of those subsystems. Lately we've had some
           | sharp developers getting involved - for the past several
           | years it's been mainly users testing it (and some of them
           | have gotten very good at debugging at this point).
           | 
           | So it's happening.
        
       | charcircuit wrote:
        | If Linux added a stable kernel module API this wouldn't be such
        | a huge problem, and it would be easy for bcachefs to ship as a
        | kernel module with its own independent release schedule.
        
         | josephcsible wrote:
         | The slight benefit for out-of-tree module authors wouldn't be
         | worth the negative effects on the rest of the kernel to
         | everyone else.
        
           | charcircuit wrote:
           | "slight benefit"? Having a working system after upgrading
           | your kernel is not just a slight benefit. It's table stakes.
           | Especially for something critical like a filesystem it should
           | never break.
           | 
           | >negative effects on the rest of the kernel
           | 
            | Needing to design and support an API is not purely negative
            | for kernel developers. It also gives a chance to have a
            | proper interface for drivers to use and follow. Take a look
            | at Rust for Linux, which keeps running into undocumented
            | APIs that make little sense and are just whatever <insert
            | most popular driver> does.
        
             | josephcsible wrote:
             | > Having a working system after upgrading your kernel is
             | not just a slight benefit. It's table stakes.
             | 
             | We already have that, with the "don't break userspace"
             | policy combined with all of the modules being in-tree.
             | 
             | > Needing to design and support an API is not purely
             | negative for kernel developers.
             | 
              | Sure, it's not _purely_ negative, but it's overall a big
             | _net_ negative.
             | 
             | > Take a look at the Rust for Linux which keeps running
             | into undocumented APIs that make little sense and are just
             | whatever <insert most popular driver> does.
             | 
             | That's an argument _against_ a stable module API! Those
             | things are getting fixed as they get found, but if we had a
              | stable module API, we'd be stuck with them forever.
             | 
              | I recommend reading
              | https://docs.kernel.org/process/stable-api-nonsense.html
        
               | charcircuit wrote:
               | >We already have that, with the "don't break userspace"
               | 
               | Bcachefs is not user space.
               | 
               | >with all of the modules being in-tree.
               | 
               | That is not true. There are out of tree modules such as
               | ZFS.
               | 
               | >That's an argument against a stable module API!
               | 
                | My point was that there was zero thought put into
                | creating a good API. Additionally, the API could be
                | evolved over time, with a support period, if you care
                | about being able to evolve it and deprecate the old
                | one. And likely even with a better interface there is
                | probably a way to make the old API still function.
        
               | josephcsible wrote:
               | > Bcachefs is not user space.
               | 
               | bcachefs is still in-tree.
               | 
               | > That is not true. There are out of tree modules such as
               | ZFS.
               | 
               | ZFS could be in-tree in no time at all if Oracle would
               | fix its license. And until they do that, it's not safe to
               | use ZFS-on-Linux anyway, since Oracle could sue you for
               | it.
               | 
               | > My point was that there was 0 thought put into creating
               | a good API.
               | 
               | There is thought put into it: it's exactly what we need
               | right now, because if what we need ever changes, we'll
               | change the API too, thus avoiding YAGNI and similar
               | problems.
               | 
               | > Additionally API could be evolved over time and have a
               | support period if you care about being able to evolve it.
               | 
               | If a temporary "support period" is what you want, then
               | just use the LTS kernels. That's already exactly what
               | they give you.
               | 
               | > And likely even with a better interface there is
               | probably a way to make the old API still function.
               | 
               | That's the big net negative I was mentioning and that
               | https://docs.kernel.org/process/stable-api-nonsense.html
               | talks about too. Sometimes there isn't a feasible way to
               | support part of an old API anymore, and it's not worth
               | holding the whole kernel back just for the out-of-tree
               | modules.
        
               | yjftsjthsd-h wrote:
               | > ZFS could be in-tree in no time at all if Oracle would
               | fix its license. And until they do that, it's not safe to
               | use ZFS-on-Linux anyway, since Oracle could sue you for
               | it.
               | 
                | IANAL, but I don't believe either of these things is
                | true.
               | 
               | OpenZFS contains enough code not authored by Sun/Oracle
               | that relicensing it now is effectively impossible.
               | 
               | OTOH, it _is_ under the CDDL, which is a perfectly good
               | open source license; AFAICT the problem, if one exists at
               | all[0], only manifests when _distributing_ the
               | combination of CDDL (OpenZFS) and GPL (Linux) software.
               | If you download CDDL software and compile it into GPL
               | software yourself (say, with DKMS) then it should be fine
                | because you aren't distributing it.
               | 
               | [0] This is a case where I'm going to really emphasize
               | that I'm really not a lawyer and merely point out that
               | ex. Canonical's lawyers _do_ seem to think CDDL+GPL is
               | okay.
        
               | timschmidt wrote:
               | > it should be fine because you aren't distributing it.
               | 
               | Which excludes a vast amount of activity one might want
               | to use Linux for which is otherwise allowed. Like selling
               | a device with a Linux installation, distributing VM or
               | system restore images, etc.
        
               | yjftsjthsd-h wrote:
               | Sure, I happily grant that the licensing situation is
               | really annoying and restricts the set of safe actions. I
               | only object to claims that all use of ZFS is legally
               | risky.
        
               | charcircuit wrote:
               | >it's not safe to use ZFS-on-Linux anyway, since Oracle
               | could sue you for it.
               | 
               | It's not against the license to use them together.
               | 
               | >If a temporary "support period" is what you want, then
               | just use the LTS kernels. That's already exactly what
               | they give you.
               | 
               | Only the Android one does. The regular LTS one has no
               | such guarantee.
        
             | msgodel wrote:
              | Does your system have some critical out-of-tree driver?
              | It should have been recompiled with the new kernel; that
              | sounds like a failure of whoever maintains the
              | driver/kernel/distro (which may be you if you're building
              | it yourself).
        
         | homebrewer wrote:
          | It would also mean a lot fewer FOSS drivers; neither we nor
          | FreeBSD (which is often invoked in these complaints) would
          | have amdgpu, for example.
        
           | charcircuit wrote:
            | I would actually posit that making it easier to write
            | drivers would have the opposite effect and result in
            | more FOSS drivers.
           | 
           | >FreeBSD (which is often invoked in these complaints) would
           | have amdgpu for example.
           | 
           | In such a hypothetical FreeBSD could reimplement the stable
           | API of Linux.
        
             | throw0101d wrote:
             | > _In such a hypothetical FreeBSD could reimplement the
             | stable API of Linux._
             | 
             | Like it does with the userland API of Linux, which is
             | stable:
             | 
             | * https://wiki.freebsd.org/Linuxulator
        
             | smcameron wrote:
              | No, every GPU vendor out there would prefer proprietary
              | drivers, and with a stable ABI they could ship them - and
              | would; there is no question about it.
             | 
             | I worked for HP on storage drivers for a decade or so, and
              | had there been a stable ABI, HP would have shipped
             | proprietary storage drivers for everything. Even without a
             | stable ABI, they shipped proprietary drivers at
             | considerable effort, compiling for myriad different distro
              | kernels. It was a nightmare, and a good thing too, or there
             | wouldn't be any open source drivers.
        
               | charcircuit wrote:
               | I never said they wouldn't. Having more and better
               | drivers is a good thing for Linux users. It's okay for
               | proprietary drivers to exist. The kernel isn't meant to
               | be a vehicle to push the free software agenda.
        
             | msgodel wrote:
             | It's plenty easy to make drivers now, it's just hard to
             | distribute them without sharing the source.
             | 
             | There is absolutely no good reason not to share driver
             | source though so that's a terrible use case to optimize
             | for.
        
           | Nextgrid wrote:
           | What's so bad about it? Windows to this day doesn't have FOSS
           | drivers as standard and despite that is pretty successful. In
           | practice, as long as a driver works it's fine for the vast
           | majority of users, and you can always disassemble and binary-
           | patch if really needed.
           | 
           | (it's not obvious that having to occasionally
           | disassemble/patch closed-source drivers is worse than the
           | collective effort wasted trying to get every single thing in
           | the kernel and keep it up to date).
        
         | heavyset_go wrote:
         | The unstable interface is Linux's moat, and IMO, is the reason
         | we're able to enjoy such a large ecosystem of hardware via open
         | source operating systems.
        
           | zahlman wrote:
           | I'm afraid I don't follow your reasoning.
        
       | dralley wrote:
       | I donate to Kent's patreon and I'm very enthusiastic about
       | bcachefs.
       | 
       | However, Kent, if you read this: please just settle down and
       | follow the rules. Quit deliberately antagonizing Linus. The
        | constant drama is incredibly off-putting. Don't jeopardize the
       | entire future of bcachefs over the silliest and most temporary
       | concerns.
       | 
       | If you absolutely must argue about some rule or other, then make
       | that argument without having your opening move be to blatantly
       | violate them and then complain when people call you out.
       | 
       | You were the one who wanted into the kernel despite many
       | suggestions that it was too early. That comes with tradeoffs. You
       | need to figure out how to live with that, at least for a year or
       | two. Stop making your self-imposed problems everyone else's
       | problems.
        
         | NewJazz wrote:
          | Seriously, how hard is it to say "I'm unhappy users won't have
         | access to this data recovery option but will postpone its
         | inclusion until the next merge window". Yeah, maybe it sucks
         | for users who want the new option or what have you, but like
         | you said it is a temporary concern.
        
           | vbezhenar wrote:
            | Why does it suck for users? Those brave enough to use a new
            | filesystem can surely run a custom kernel for the time
            | being, while the merge effort is underway and the vanilla
            | kernel might not be the most stable option.
        
         | thrtythreeforty wrote:
         | I _did_ subscribe to his Patreon but I stopped because of this
         | - vote with your wallet and all that. I would happily
         | resubscribe if he can demonstrate he can work within the Linux
          | development process. This isn't the first time this flavor of
         | personality clash has come up.
         | 
         | Kent is absolutely technically capable of, and has the vision
         | to, finally displace ext4, xfs, and zfs with a new filesystem
         | that Does Not Lose Data. To jeopardize that by refusing to work
         | within the well-established structure is madness.
        
       | baggy_trough wrote:
       | No matter how good the code is, Overstreet's behavior and the
       | apparent bus factor of 1 leave me reluctant to investigate this
       | technology.
        
         | dsp_person wrote:
         | Curious about this process. Can anyone submit patches to
         | bcachefs and Kent is just the only one doing it? Is there a
         | community with multiple contributors hacking on the features,
         | or just Kent? If not, what could he do to grow this? And how
         | does a single person receiving patreon donations affect the
            | ability of a project like this to get past a bus factor of 1?
        
           | nolist_policy wrote:
           | Generally you need a maintainer for your subsystem who sends
           | pull requests to Linus.
        
           | koverstreet wrote:
           | I take patches from quite a few people. If the patch looks
           | good, I'll generally apply it.
           | 
           | And I encourage anyone who wants to contribute to join the
           | IRC channel. It's not a one man show, I work with a lot of
           | people there.
        
       | devwastaken wrote:
       | Good. There is no place for unstable developers in a stable
       | kernel.
        
       | msgodel wrote:
       | The older I get the more I feel like anything other than the
       | ExtantFS family is just silly.
       | 
       | The filesystem should do files, if you want something more
       | complex do it in userspace. We even have FUSE if you want to use
       | the Filesystem API with your crazy network database thing.
        
         | anonnon wrote:
         | > The older I get the more I feel like anything other than the
         | ExtantFS family is just silly.
         | 
          | The extended (not extant) family (including ext4) doesn't support
         | copy-on-write. Using them as your primary FS after 2020 (or
         | even 2010) is like using a non-journaling file system after
         | 2010 (or even 2001)--it's a non-negotiable feature at this
         | point. Btrfs has been stable for a decade, and if you don't
         | like or trust it, there's always ZFS, which has been stable 20
            | years now. Apple now has APFS, with CoW, on _all_ their
         | devices, while MSFT still treats ReFS as unstable, and Windows
         | servers still rely heavily on NTFS.
        
           | msgodel wrote:
            | Again, I don't really want the kernel managing a database
            | for me like that; the few applications that need it can do
            | it themselves just fine. (IME mostly just RDBMSs and Qemu.)
        
           | robotnikman wrote:
           | >Windows will at some point have ReFS
           | 
           | They seem to be slowly introducing it to the masses, Dev
           | drives you set up on Windows automatically use ReFS
        
           | milkey_mouse wrote:
           | Hell, there's XFS if you love stability but want CoW.
        
             | josephcsible wrote:
             | XFS doesn't support whole-volume snapshots, which is the
             | main reason I want CoW filesystems. And it also stands out
             | as being basically the only filesystem that you can't
             | arbitrarily shrink without needing to wipe and reformat.
        
               | leogao wrote:
               | you can always have an LVM layer for atomic snapshots
        
               | josephcsible wrote:
               | There are advantages to having the filesystem do the
               | snapshots itself. For example, if you have a really big
               | file that you keep deleting and restoring from a
               | snapshot, you'll only pay the cost of the space once with
               | Btrfs, but will pay it every time over with LVM.
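[Editor's note: the space argument above can be sketched with a toy refcounted-extent model in Python. This is purely illustrative - the names and structure are assumptions, not btrfs's actual on-disk format. Restoring a file from a filesystem-level snapshot just re-references an existing extent, so repeated delete-and-restore cycles never pay for the data twice:]

```python
# Toy refcounted-extent model (illustrative only, not btrfs's real format).
# A snapshot shares extents with the live tree; restoring a file from it
# re-references the same extent instead of copying the data back.
class Fs:
    def __init__(self):
        self.extents = {}    # extent id -> [data, refcount]
        self.files = {}      # filename -> extent id
        self.snapshots = {}  # snapshot name -> {filename: extent id}
        self.next_id = 0

    def write(self, name, data):
        self.extents[self.next_id] = [data, 1]
        self.files[name] = self.next_id
        self.next_id += 1

    def snapshot(self, snap):
        for ext in self.files.values():
            self.extents[ext][1] += 1        # share the extents, don't copy
        self.snapshots[snap] = dict(self.files)

    def delete(self, name):
        ext = self.files.pop(name)
        self.extents[ext][1] -= 1
        if self.extents[ext][1] == 0:        # free only when nothing references it
            del self.extents[ext]

    def restore(self, snap, name):
        ext = self.snapshots[snap][name]
        self.extents[ext][1] += 1            # no data is copied on restore
        self.files[name] = ext

    def space_used(self):
        return sum(len(data) for data, _ in self.extents.values())

fs = Fs()
fs.write("big.img", "x" * 1_000_000)
fs.snapshot("snap1")
before = fs.space_used()
for _ in range(5):                   # delete and restore the big file repeatedly
    fs.delete("big.img")
    fs.restore("snap1", "big.img")
assert fs.space_used() == before     # space is paid once, not once per restore
```

[A block-layer (LVM) restore, by contrast, copies the bytes back through the filesystem, allocating fresh space on each round trip.]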
        
               | kzrdude wrote:
                | There was the "old dog new tricks" XFS talk a long time
                | ago, but I suppose it was for fun and exploration and
                | not really a sneak peek at snapshots.
        
               | MertsA wrote:
               | You can shrink XFS, but only the realtime volume. All you
               | need is xfs_db and a steady hand. I once had to pull this
               | off for a shortened test program for a new server
               | platform at Meta. Works great except some of those
               | filesystems did somehow get this weird corruption around
               | used space tracking that xfs_repair couldn't detect... It
               | was mostly fine.
        
           | leogao wrote:
           | btrfs has eaten my data within the last decade. (not even
           | because of the broken erasure coding, which I was careful to
           | avoid!) not sure I'm willing to give it another chance. I'd
           | much rather use zfs.
        
             | bombcar wrote:
              | I used reiserfs for a while after I noticed it eating data
              | (tail packing during power loss) but quickly switched to
              | xfs when it became available.
             | 
             | Speed is sometimes more important than absolute
             | reliability, but it's still an undesirable tradeoff.
        
           | NewJazz wrote:
           | CoW is an efficiency gain. Does it do anything to ensure data
           | integrity, like journaling does? I think it is an
           | unreasonable comparison you are making.
        
             | webstrand wrote:
             | I use CoW a lot just managing files. It's only an
             | efficiency gain if you have enough space to do the data-
             | copying operation. And that's not necessarily true in all
             | cases.
             | 
             | Being able to quickly take a "backup" copy of some multi-gb
             | directory tree before performing some potentially
             | destructive operation on it is such a nice safety net to
             | have.
             | 
             | It's also a handy way to backup file metadata, like mtime,
             | without having to design a file format for mapping saved
             | mtimes back to their host files.
        
             | anonnon wrote:
             | > CoW is an efficiency gain.
             | 
             | You're thinking of the _optimization technique_ of CoW, as
             | in what Linux does when spawning a new thread or forking a
              | process. I'm talking about it in the context of only ever
             | modifying _copies_ of file system data and metadata blocks,
             | for the purpose of ensuring file system integrity, even in
             | the context of sudden power loss (EDIT: wrong link): https:
             | //www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
             | 
             | If anything, ordinary file IO is likely to be _slightly
             | slower_ on a CoW file system, due to it always having to
             | copy a block before said block can be modified and updating
             | block pointers.
        
             | throw0101d wrote:
             | > _Does it do anything to ensure data integrity, like
             | journaling does?_
             | 
             | What kind of journaling though? By default ext4 only uses
             | journaling for metadata updates, not data updates (see
             | "ordered" mode in _ext4(5)_ ).
             | 
             | So if you have a (e.g.) 1000MB file, and you update 200MB
             | in the middle of it, you can have a situation where the
             | first 100MB is written out and the system dies with the
             | other 100MB vanishing.
             | 
             | With a CoW, if the second 100MB is not written out and the
             | file sync'd, then on system recovery you're back to the
             | original file being completely intact. With ext4 in the
             | default configuration you have a file that has both
             | new-100MB and stale-100MB in the middle of it.
             | 
             | The updating of the file data and the metadata are two
             | separate steps (by default) in ext4:
             | 
             | * https://www.baeldung.com/linux/ext-journal-modes
             | 
             | * https://michael.kjorling.se/blog/2024/ext4-defaulting-to-
             | dat...
             | 
             | * https://fy.blackhats.net.au/blog/2024-08-13-linux-
             | filesystem...
             | 
             | Whereas with a proper CoW (like ZFS), updates are ACID.
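[Editor's note] The difference described above can be sketched as a toy model in Python (purely illustrative; not real ext4 or ZFS internals): an interrupted in-place update leaves a mix of new and stale blocks, while a CoW update only "commits" the new copy once every block has been written.

```python
# Toy model of the difference (purely illustrative -- not real ext4 or
# ZFS internals). An in-place update overwrites live blocks, so a crash
# mid-write leaves a mix of new and stale data; a CoW update writes a
# full copy first and only "commits" it once every block is written.

def update_in_place(blocks, new_data, crash_after):
    """Overwrite blocks directly; crash_after simulates power loss."""
    for i, b in enumerate(new_data):
        if i == crash_after:
            return blocks        # crash: torn write is now visible
        blocks[i] = b
    return blocks

def update_cow(blocks, new_data, crash_after):
    """Write a complete copy, then atomically swap to it on commit."""
    copy = list(blocks)
    for i, b in enumerate(new_data):
        if i == crash_after:
            return blocks        # crash before commit: original intact
        copy[i] = b
    return copy                  # commit: pointer swap to new version

old = ["old"] * 4
new = ["new"] * 4
print(update_in_place(list(old), new, crash_after=2))  # mix of new/old
print(update_cow(list(old), new, crash_after=2))       # all old, intact
```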
        
           | tbrownaw wrote:
           | > _The extended (not extant) family (including ext4)_
           | 
           | I read that more as "we have filesystems at home, and also
           | get off my lawn".
        
           | zahlman wrote:
           | ... NTFS does copy-on-write?
           | 
           | ... It does hard links? After checking: It does hard links.
           | 
           | ... Why didn't any programs I had noticeably take advantage
           | of that?
        
         | yjftsjthsd-h wrote:
         | I mean, I'd really like some sort of data error detection (and
         | ideally correction). If a disk bitflips one of my files, ext*
         | won't do anything about it.
        
           | timewizard wrote:
           | > some sort of data error detection (and ideally correction).
           | 
           | That's pretty much built into most mass storage devices
           | already.
           | 
           | > If a disk bitflips one of my files
           | 
           | The likelihood and consequence of this occurring is in many
           | situations not worth the overhead of adding additional ECC on
           | top of what the drive does.
           | 
           | > ext* won't do anything about it.
           | 
           | What should it do? Blindly hand you the data without any
           | indication that there's a problem with the underlying block?
           | Without an fsck what mechanism do you suppose would manage
           | these errors as they're discovered?
        
             | throw0101d wrote:
             | >> _> some sort of data error detection (and ideally
             | correction)._
             | 
              | > _That's pretty much built into most mass storage devices
             | already._
             | 
             | And ZFS has shown that it is not sufficient (at least for
             | some use-cases, perhaps less of a big deal for
             | 'residential' users).
             | 
             | > _The likelihood and consequence of this occurring is in
             | many situations not worth the overhead of adding additional
             | ECC on top of what the drive does._
             | 
             | Not worth it to whom? Not having the option available _at
              | all_ is the problem. I can do a _zfs set checksum=off
              | pool_name/dataset_name_ if I really want that extra couple
              | percentage points of performance.
             | 
             | > _Without an fsck what mechanism do you suppose would
              | manage these errors as they're discovered?_
             | 
             | Depends on the data involved: if it's part of the file
             | system tree metadata there are often multiple copies even
             | for a single disk on ZFS. So instead of the kernel
              | consuming corrupted data and potentially panicking (or going
             | off into the weeds) it can find a correct copy elsewhere.
             | 
             | If you're in a fancier configuration with some level of
             | RAID, then there could be other copies of the data, or it
             | could be rebuilt through ECC.
             | 
             | With ext*, LVM, and mdadm no such possibility exists
             | because there are no checksums at any of those layers
             | (perhaps if you glom on dm-integrity?).
             | 
             | And with ZFS one can _set copies=2_ on a per-dataset basis
             | (perhaps just for  /home?), and get multiple copies strewn
             | across the disk: won't save you from a drive dying, but
             | could save you from corruption.
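[Editor's note] The self-healing idea behind checksums plus copies=2 can be sketched in Python (a toy model, not ZFS's actual on-disk format): each copy carries its own checksum, and a read returns the first copy that still verifies, failing loudly only when every copy is bad.

```python
import hashlib

# Toy sketch of self-healing reads via checksums plus redundancy
# (illustrative only -- not ZFS's actual on-disk format). Each stored
# copy carries its own checksum; a read returns the first copy that
# still verifies and fails loudly only when every copy is corrupt.

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def write_block(data: bytes, copies: int = 2):
    # Mimics "copies=2": store N independent copies with checksums.
    return [{"data": data, "sum": checksum(data)} for _ in range(copies)]

def read_block(stored):
    for copy in stored:
        if checksum(copy["data"]) == copy["sum"]:
            return copy["data"]  # first verifying copy wins
    raise IOError("unrecoverable: all copies failed their checksums")

stored = write_block(b"important data")
stored[0]["data"] = b"bitflipped data"  # corrupt the first copy
print(read_block(stored))               # healed from the second copy
```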
        
               | yjftsjthsd-h wrote:
               | > (perhaps if you glom on dm-integrity?).
               | 
               | I looked at that, in hopes of being able to protect my
               | data. Unfortunately, I considered this something of a
               | fatal flaw:
               | 
               | > It uses journaling for guaranteeing write atomicity by
               | default, which effectively halves the write speed.
               | 
               | - https://wiki.archlinux.org/title/Dm-integrity
        
             | ars wrote:
             | > The likelihood .. of this occurring
             | 
              | That's one unrecoverable read error per 10^14 bits for a
              | consumer drive, which is just ~12.5TB read. A heavy user
              | (lots of videos or games) would see a bit flip a couple
              | times a year.
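[Editor's note] The arithmetic behind that figure, assuming the usual consumer-drive datasheet rating of one unrecoverable read error (URE) per 10^14 bits:

```python
# One unrecoverable read error (URE) per 10^14 bits, per the typical
# consumer-drive datasheet figure:
bits_per_ure = 10**14
tb_per_ure = bits_per_ure / 8 / 1e12  # bits -> bytes -> terabytes
print(tb_per_ure)  # 12.5: one expected URE per ~12.5 TB read
```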
        
               | magicalhippo wrote:
               | I do monthly scrubs on my NAS, I have 8 14-20TB drives
               | that are quite full.
               | 
               | According to that 10^14 metric I should see read errors
               | just about every month. Except I have just about zero.
               | 
               | Current disks are ~4 years, runs 24/7, and excluding a
               | bad cable incident I've had a single case of a read error
               | (recoverable, thanks ZFS).
               | 
                | I suspect those URE numbers come from the manufacturers
                | establishing that the disk will do at least 10^14; they
                | don't actually try to find the real number, because
                | 10^14 is good enough.
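[Editor's note] A back-of-the-envelope check of that expectation, using the drive count and sizes from the comment above (assuming the drives are near-full and the rated 10^14 figure were the true error rate):

```python
# Expected UREs per full monthly scrub if the 10^14 spec were the real
# error rate (8 drives of ~14-20 TB each, taken here as ~17 TB each):
tb_per_ure = 10**14 / 8 / 1e12      # ~12.5 TB read per expected URE
tb_scrubbed = 8 * 17                # ~136 TB read per scrub
expected_ures = tb_scrubbed / tb_per_ure
print(round(expected_ures, 1))      # ~10.9 expected errors per scrub
```

Roughly eleven expected errors per scrub versus one observed in four years, which supports the suspicion that the rated figure is a conservative floor rather than a measured rate.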
        
             | yjftsjthsd-h wrote:
             | To your first couple points: I trust hardware less than
             | you.
             | 
             | > What should it do? Blindly hand you the data without any
             | indication that there's a problem with the underlying
             | block?
             | 
             | Well, that's what it does now, and I think that's a
             | problem.
             | 
             | > Without an fsck what mechanism do you suppose would
             | manage these errors as they're discovered?
             | 
             | Linux can fail a read, and IMHO _should_ do so if it cannot
             | return _correct_ data. (I support the ability to override
             | this and tell it to give you the corrupted data, but
             | certainly not by default.) On ZFS, if a read fails its
             | checksum, the OS will first try to get a valid copy (ex.
              | from a mirror or if you've set copies=2), and then if the
             | error can't be recovered then the file read fails and the
             | system reports/records the failure, at which point the user
             | should probably go do a full scrub (which for our purposes
             | should probably count as fsck) and restore the affected
             | file(s) from backup. (Or possibly go buy a new hard drive,
             | depending on the extent of the problem.) I would consider
             | that ideal.
        
           | eptcyka wrote:
           | Bitflips in my files? Well, there's a high likelihood that
           | the corruption won't be too bad. Bit flips in the filesystem
           | metadata? There's a significant chance all of the data is
           | lost.
        
         | heavyset_go wrote:
         | Transparent compression, checksumming, copy-on-write, snapshots
         | and virtual subvolumes should be considered the minimum default
         | feature set for new OS installations in TYOOL 2025.
         | 
         | You get that with APFS by default on macOS these days and those
         | features come for free in btrfs, some in XFS, etc on Linux.
        
           | riobard wrote:
            | APFS checksums only fs metadata, not user data, which is a
            | PITA. Presumably that's because APFS is used on single-drive
            | systems, where there's no redundancy to recover from anyway.
            | Still, not ideal.
        
             | vbezhenar wrote:
              | Apple trusts their hardware to do its own checksums
              | properly. Modern SSDs use checksums and parity codes for
              | blocks. SATA/NVMe include checksums for protocol frames.
              | The only unreliable component is RAM, but FS checksums
              | can't help there, because a RAM bit will likely be flipped
              | before the checksum is calculated or after it is verified.
        
               | riobard wrote:
                | If they really trusted their hardware, APFS wouldn't
                | need to checksum fs metadata either, so I guess they
                | don't trust it well enough? Also, I have external drives
                | that are not Apple-sanctioned to store files and that I
                | don't trust enough either, and there's no option for
                | user-data checksumming at all.
        
               | londons_explore wrote:
                | Most SSDs can't be trusted to maintain proper data
               | ordering in the case of a sudden power off.
               | 
               | That makes checksums and journals of only marginal
               | usefulness.
               | 
               | I wish some review website would have a robot plug and
               | unplug the power cable in a test rig for a few weeks and
               | rate which SSD manufacturers are robust to this stuff.
        
       | criticalfault wrote:
       | I've been following this for a while now.
       | 
        | Kent is in the wrong. If I had a lead position in development,
        | I would kick Kent off the team.
       | 
        | It's one thing to challenge things. What Kent is doing is
        | something completely different. It is obvious he introduced a
        | feature, not only a bug fix.
       | 
        | If the rules say that rc1+ gets only bug fixes, then it is
        | absolutely clear what happens with a feature. Tolerating this
        | once or twice is OK, but Kent does this all the time, testing
        | Linus.
       | 
       | Linus is absolutely in the right to kick this out and it's Kent's
       | fault if he does so.
        
         | Pet_Ant wrote:
         | Why take it out of the kernel? Why not just make someone
         | responsible the maintainer so they can say "no, next release"
         | to his shenanigans? It can't be the license.
        
           | nolist_policy wrote:
           | Kent can appoint a suitable maintainer if he wishes. That's
           | his job, not Linus'.
        
           | criticalfault wrote:
           | This is for me unclear as well, but I'm saying I wouldn't
           | hold it against Linus if he did this. And based on Kent's
           | behavior he has full right to do so.
           | 
           | A way to handle this would be with one person (or more) in
           | between Kent and Linus. And maybe a separate tree only for
           | changes and fixes from bcachefs that those people in between
           | would forward to Linus. A staging of sorts.
        
           | tliltocatl wrote:
           | Maintainers aren't getting paid and so cannot be "appointed".
           | Someone must volunteer - and most people qualified and
           | motivated enough are already doing something else.
        
             | timewizard wrote:
             | Presumably there would be an open call where people would
             | nominate themselves for consideration. These are problems
             | that have come up and been solved in human organizations
             | for hundreds of years before the kernel even existed.
        
               | xorcist wrote:
               | There is no call. Anyone can volunteer at any time.
               | 
                | Software takes up no space and there is no scarcity.
               | Theoretically there could be any number of maintainers
               | and what gets uptake is the de facto upstream. That's
               | what people refer to when they talk about free software
               | development in terms of meritocracy.
        
         | pmarreck wrote:
          | This can happen with prima donna devs who haven't had to
         | collaborate in a team environment for a long time.
         | 
          | It's a damn shame too, because bcachefs has some unique
          | features/potential.
        
         | bgwalter wrote:
         | bcachefs is experimental and Kent writes in the LWN comments
         | that nothing would get done if he didn't develop it this way.
         | Filesystems are a massive undertaking and you can have all the
         | rules you want. It doesn't help if nothing gets developed.
         | 
          | It would be interesting to know how strict the rules are in
          | the Linux kernel for other people. Other projects have nepotistic
         | structures where some developers can do what they want but
         | others cannot.
         | 
         | Anyway, if Linus had developed the kernel with this kind of
         | strictness from the beginning, maybe it wouldn't have taken
         | off. I don't see why experimental features should follow the
         | rules for stable features.
        
           | yjftsjthsd-h wrote:
           | If it's an experimental feature, then why not let changes go
           | into the next version?
        
             | bgwalter wrote:
             | That is a valid objection, but I still think that for some
              | huge and difficult features the month-long pauses imposed
             | by release cycles are absolutely detrimental.
             | 
             | Ideally they'd be developed outside the kernel until they
             | are perfect, but Kent addresses this in his LWN comment:
             | There is no funding/time to make that ideal scenario
             | possible.
        
               | jethro_tell wrote:
               | He could release a patch that can be pulled by the people
               | that need it.
               | 
               | If you're using experimental file systems, I'd expect you
               | to be pretty competent in being able to hold your own in
               | a storage emergency, like compiling a kernel if that's
               | the way out.
               | 
                | This is a made-up emergency, to break the rules.
        
       | layer8 wrote:
       | For some reason I always read this as "BCA chefs".
        
       | kzrdude wrote:
        | _today_ Kent posted another rc patch with a new filesystem
        | option. But it was merged...
        
       | ajb wrote:
       | Yeah.. the thing is, suppose Kent was 100% right that this needed
       | to be merged in a bugfix phase, even though it's not a bug fix.
        | It's _still_ a massive trust issue that he didn't flag up that
       | the contents of his PR was well outside the expected.
       | 
       | That means Linus has to check each of his PRs assuming that it
       | might be pushing the boundaries without warning.
       | 
       | No amount of post hoc justification gets you that trust back, not
       | when this has happened multiple times now.
        
         | NewJazz wrote:
         | He mentioned it in his PR summary as a new option. About half
         | of the summary of the original PR was talking about the new
         | option and why it was important.
         | 
         | https://lore.kernel.org/linux-fsdevel/4xkggoquxqprvphz2hwnir...
        
           | ajb wrote:
           | I'm not saying he made a PR just saying "Fixes" like a
           | rookie. What I'm saying is that in there should have been
           | something along the lines of "heads up - I know this doesn't
           | comply with the usual process for the following commits,
           | here's why I think they should be given a waiver under these
           | circumstances" followed by the justifications that appeared
           | _after_ Linus got upset.
           | 
           | The PR description would have been fine - if it had been in
           | the right stage of the process.
        
       | gdgghhhhh wrote:
       | In this context, this is worth a read:
       | https://hachyderm.io/@josefbacik/114755106269205960
        
         | wmf wrote:
         | A lot of open source volunteers can't really be replaced
         | because there is no one willing to volunteer to maintain that
         | thing. This is complicated by the fact that people mostly get
         | credit for creating new projects and no credit for maintenance.
         | Anyone who could take over bcachefs would probably be better
         | off creating their own new filesystem.
        
         | ajb wrote:
          | Ehh. I don't think Kent is an arsehole. The problem with terms
          | like "arsehole" is that they conflate a bunch of different
          | issues. It doesn't really have much explanatory power. Someone
          | who is difficult to work with can be that way for loads of
          | different reasons: ego, tunnel vision, stress, neurodivergence
          | (of various kinds), commercial pressures, greed, etc.
         | 
         | There is always a point where you have to say "no I can't work
         | with this person any more", but while you are still trying to
         | it's worth trying to figure out why someone is behaving as they
         | do.
        
       | ars wrote:
       | This happened about a year ago as well:
       | https://news.ycombinator.com/item?id=41407768
        
       | jagged-chisel wrote:
       | For the uninitiated:
       | 
       | bCacheFS, not BCA Chefs. I'm not clued into the kernel at this
       | level so I racked my brain a bit.
        
         | zahlman wrote:
         | I had to think about it the first time, too.
        
       | anonfordays wrote:
       | Linux needs a true answer to ZFS that's not btrfs. Sadly the ship
        | has sailed for btrfs: after 15+ years it's still not something
       | trustable.
       | 
       | Apparently bcachefs won't be the successor. Filesystem
       | development for Linux needs a big shakeup.
        
         | bombcar wrote:
         | ZFS is good enough for 90% of people who need that so no real
         | money is available for anything new.
         | 
         | Maybe a university could do it.
        
           | anonfordays wrote:
            | Indeed, and its inclusion in Ubuntu is fantastic. It's also
            | showing its age, though: 20 years now. Ts'o, where are you
            | when we need you most!?
        
             | bombcar wrote:
             | Or someday a file system will somehow piss off Linus and
             | he'll write one in a weekend or something ;)
        
             | XorNot wrote:
             | I mean, is it? It's a filesystem and it works. How is it
             | "showing its age"?
        
         | em-bee wrote:
          | Several people I know have been using btrfs without problems
          | for years now. I use it on half a dozen devices. What's your
          | evidence that it is not trustable?
        
           | anonfordays wrote:
           | https://btrfs.readthedocs.io/en/latest/Status.html
           | 
            | The number of "mostly OK" entries, and a RAID6
            | implementation that is still "unstable". I'm not going to
            | trust a file system with "mostly OK" device replace.
            | Anecdotally, you can search the LKML and here for tons of
            | data-loss stories.
        
       | zahlman wrote:
       | Does the filesystem actually need to be part of the kernel
       | project to work? I can see where you'd need that _for the root
       | filesystem_ , but even then, couldn't one migrate an existing
       | installation to a new partition with a different filesystem?
        
         | teekert wrote:
          | We have ZFS for that. What we want is something in-kernel,
          | ready to go, 100% supported on root on any Linux system, with
          | no license ambiguity. We want to replace ext4. Maybe btrfs can
          | do it. I hear it has outgrown its rocky puberty.
        
       ___________________________________________________________________
       (page generated 2025-07-04 23:00 UTC)