[HN Gopher] Bcachefs may be headed out of the kernel
___________________________________________________________________
Bcachefs may be headed out of the kernel
Author : ksec
Score : 83 points
Date : 2025-07-04 13:32 UTC (9 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| guerrilla wrote:
| The drama with Linux filesystems is just nuts... It never ends.
| mschuster91 wrote:
| The stakes are the highest across the entire kernel. Data
| that's corrupt cannot (easily) be uncorrupted.
| tpolzer wrote:
| Bad drivers could brick (parts of) your hardware permanently.
|
| While you should have a backup of your data anyway.
| quotemstr wrote:
| At least Kent hasn't murdered his wife
| Tostino wrote:
| First thing that came to mind when I saw this drama.
| msgodel wrote:
| It's crazy people spend so much time paying attention to
| Hollywood celebrity drama.
|
| _Opens LKML archive hoping for another Linus rant._
| rendaw wrote:
| I'm sure there's just as much political all-star programmer
| infighting at google/apple/microsoft/whatever too; it's just
| that this is done in public.
| chasil wrote:
| So the assertion is that users with (critical) data loss bugs
| need complete solutions for recovery and damage containment with
| all possible speed, and without this "last mile" effort,
| stability will never be achieved.
|
| The objection is the tiniest bug-fix windows get everything but
| the kitchen sink.
|
| These are both uncomfortable positions to occupy, without doubt.
| koverstreet wrote:
| No, the assertion is that the proper response to a bug often
| (and if it's high impact - always) involves a lot more than
| just the bugfix.
|
| And the whole reason for a filesystem's existence is to store
| and maintain your data, so if that is what the patch is for,
| yes, it should be under consideration as a hotfix.
|
| There's also the broader context: it's a major problem for
| stabilization if we can't properly support the people using it
| so they can keep testing.
|
| More context: the kernel as a whole is based on fixed time
| tables and code review, which it needs because QA (especially
| automated testing) is extremely spotty. bcachefs's QA, both
| automated testing and community testing, is extremely good, and
| we've had bugfix patchsets either held up or turn into
| flamewars because of this mismatch entirely too many times.
| WesolyKubeczek wrote:
| > No, the assertion is that the proper response to a bug
| often (and if it's high impact - always) involves a lot more
| than just the bugfix.
|
| Then what you do is you try to split your work in two. You
| could think of a stopgap measure or a workaround which is
| small, can be reviewed easily, and will reduce the impact of
| the bug while not being a "proper" fix, and prepare the
| "properer" fix when the merge window opens.
|
| I would ask, since the bug probably lived since the last
| stable release, how come it fell through the crack and had
| only been noticed recently? Could it be that not all setups
| are affected? If so, can't they live with it until the next
| merge window?
|
| By making a "feature that fixes the bug for real", you
| greatly expand the area in which new, unknown bugs may land,
| with very little time to give it proper testing. This is
| inevitable, as evidenced by the simple fact that the bug you
| were trying to fix exists. You can be good, but not _that_
| good. Nobody is that good. If anybody were that good, they
| wouldn't have the bug in the first place.
|
| If you have commercial clients who use your filesystem and
| you have contractual obligations to fix their bugs and keep
| their data intact, you could (I'd even say "should") maintain
| an out-of-tree version with its own release and bugfix
| schedule. This is IMO the only reasonable way to have it,
| because the kernel is a huge administrative machine with lots
| of people, and by mainlining stuff, you necessarily become
| co-dependent on the release schedule for the whole kernel. I
| think a conflict between kernel's release schedule and
| contractual obligations, if you have any, is only a matter of
| time.
| koverstreet wrote:
| > Then what you do is you try to split your work in two.
| You could think of a stopgap measure or a workaround which
| is small, can be reviewed easily, and will reduce the
| impact of the bug while not being a "proper" fix, and
| prepare the "properer" fix when the merge window opens.
|
| That is indeed what I normally do. For example, 6.14 and
| 6.15 had people discovering btree iterator locking bugs
| (manifesting as assertion pops) while running evacuates on
| large filesystems (it's hard to test a sufficiently deep
| tree depth in virtual machine tests with our large btree
| nodes); some small hotfixes went out in rc kernels, but the
| majority of the work (a whole project to add assertions for
| path->should_be_locked, which should shut these down for
| good) waited until the 6.16 merge window.
|
| That was for a less critical bug - your machine crashing is
| somewhat less severe than losing a filesystem.
|
| In this case, we had a bug pop up in 6.15 where the link
| count in the VFS inode getting screwed up caused an inode
| to be deleted that shouldn't have been - a subvolume root -
| and then an untested repair path took out the entire
| subvolume.
|
| Ouuuuch.
|
| That's why the repair code was rushed; it had already
| gotten one filesystem back, and I'd just gotten another
| report of someone else hitting it - and for every bug
| report there are almost always more people who hit it and
| don't report it.
|
| And considering that a lot of people running bcachefs now
| are getting it from distro kernels and don't know how to
| build kernels - that is why it was important to get this
| out quickly through the normal channels.
|
| In addition, the patch wasn't risky, contrary to what Ted
| was saying. It's a code path that's very well covered by
| automated tests, including KASAN/UBSAN/lockdep variants -
| those would have exploded if this patch were incorrect.
|
| When to ship a patch is always a judgement call, and part
| of how you make that call is how well your QA process can
| guarantee the patch is correct. Part of what was going on
| here is a disconnect between those of us who do make heavy
| use of modern QA infrastructure and those who do it the old
| school way, relying heavily on manual review and long
| testing periods for rc kernels.
| magicalhippo wrote:
| While I absolutely think you're taking a stand in the wrong
| fights - I don't see why you needed to push it so far on
| this hill in particular - I am sympathetic to your argument
| that experimental kernel modules like filesystems might need
| a different release approach at times.
|
| At work we have our main application which also contains a
| lot of customer integrations. Our policy has been new
| features in trunk only, except if it's entirely contained
| inside a customer-specific integration module.
|
| We do try to avoid it, but this does allow us to be flexible
| with regards to customer needs, while keeping the base
| application stable.
|
| This new recovery feature was, as far as I could see,
| entirely contained within the bcachefs kernel code. Given the
| experimental status, as long as it was clearly communicated
| to users, I don't see a huge problem allowing such self-
| contained features during the RC phase.
|
| Obviously a requirement must be that it doesn't break the
| build.
| jethro_tell wrote:
| Who's using an experimental filesystem and risking critical
| data loss? Rule one of experimental file systems is have a copy
| on a not experimental file system.
| shmerl wrote:
| Maybe bcachefs should have been governed by a group of
| people, not a single person.
| mananaysiempre wrote:
| Committees are good-to-acceptable for keeping things going, but
| bad for initial design or anything requiring a coherent vision
| and taste. There are some examples of groups that straddled the
| boundary between a committee and a creative collaboration and
| produced good designs (Algol 60; RnRS for n <= 5; IIRC the
| design of ZFS was produced by a three-person team), but they
| are more of an exception, and the secret of tying together such
| groups remotely doesn't seem to have been cracked. Even in the
| keeping things going department, a committee's inbuilt and
| implicit self-preservation mechanisms can lead it to keep
| fiddling with things far longer than would be advisable.
| shmerl wrote:
| In this case it's more about keeping things in check and not
| letting one person with an attitude to ignore kernel
| development rules derail the whole project.
|
| I'm not saying those concerns are wrong, but when it's
| causing a fallout like being kicked out from the kernel, the
| downsides clearly are more severe than any potential
| benefits.
| koverstreet wrote:
| Actually, I think remote collaboration can work with the
| right medium and tools. For bcachefs, that's been IRC; we
| have an extremely active channel where we do a lot of
| collaborative debugging, design discussion, helping new
| users, etc.
|
| I know a lot of people heavily use slack/discord these days,
| but personally I find the web interfaces way too busy. IRC
| all the way, for me.
|
| But the problem of communicating effectively enough to
| produce a coherent design is very real - this goes back to
| Fred Brooks (Mythical Man Month). I think bcachefs turned out
| very well with the way the process has gone to date, and now
| that it's gotten bigger, with more distinct subsystems, I am
| very eagerly looking forward to the date when I can hand off
| ownership of some of those subsystems. Lately we've had some
| sharp developers getting involved - for the past several
| years it's been mainly users testing it (and some of them
| have gotten very good at debugging at this point).
|
| So it's happening.
| charcircuit wrote:
| If Linux would add a stable kernel module API this wouldn't be a
| huge a problem and it would be easy for bcachefs to ship as a
| kernel module with his own independent release schedule.
| josephcsible wrote:
| The slight benefit for out-of-tree module authors wouldn't be
| worth the negative effects on the rest of the kernel to
| everyone else.
| charcircuit wrote:
| "slight benefit"? Having a working system after upgrading
| your kernel is not just a slight benefit. It's table stakes.
| Especially for something critical like a filesystem it should
| never break.
|
| >negative effects on the rest of the kernel
|
| Needing to design and support an API is not purely negative
| for kernel developers. It also gives a chance to have a
| proper interface for drivers to use and follow. Take a look
| at Rust for Linux, which keeps running into undocumented
| APIs that make little sense and are just whatever <insert
| most popular driver> does.
| josephcsible wrote:
| > Having a working system after upgrading your kernel is
| not just a slight benefit. It's table stakes.
|
| We already have that, with the "don't break userspace"
| policy combined with all of the modules being in-tree.
|
| > Needing to design and support an API is not purely
| negative for kernel developers.
|
| Sure, it's not _purely_ negative, but it's overall a big
| _net_ negative.
|
| > Take a look at the Rust for Linux which keeps running
| into undocumented APIs that make little sense and are just
| whatever <insert most popular driver> does.
|
| That's an argument _against_ a stable module API! Those
| things are getting fixed as they get found, but if we had a
| stable module API, we'd be stuck with them forever.
|
| I recommend reading
| https://docs.kernel.org/process/stable-api-nonsense.html
| charcircuit wrote:
| >We already have that, with the "don't break userspace"
|
| Bcachefs is not user space.
|
| >with all of the modules being in-tree.
|
| That is not true. There are out of tree modules such as
| ZFS.
|
| >That's an argument against a stable module API!
|
| My point was that there was zero thought put into creating
| a good API. Additionally, an API could be evolved over time
| and have a support period, if you care about being able to
| evolve it and deprecate the old one. And likely, even with
| a better interface, there is probably a way to keep the old
| API functioning.
| josephcsible wrote:
| > Bcachefs is not user space.
|
| bcachefs is still in-tree.
|
| > That is not true. There are out of tree modules such as
| ZFS.
|
| ZFS could be in-tree in no time at all if Oracle would
| fix its license. And until they do that, it's not safe to
| use ZFS-on-Linux anyway, since Oracle could sue you for
| it.
|
| > My point was that there was 0 thought put into creating
| a good API.
|
| There is thought put into it: it's exactly what we need
| right now, because if what we need ever changes, we'll
| change the API too, thus avoiding YAGNI and similar
| problems.
|
| > Additionally API could be evolved over time and have a
| support period if you care about being able to evolve it.
|
| If a temporary "support period" is what you want, then
| just use the LTS kernels. That's already exactly what
| they give you.
|
| > And likely even with a better interface there is
| probably a way to make the old API still function.
|
| That's the big net negative I was mentioning and that
| https://docs.kernel.org/process/stable-api-nonsense.html
| talks about too. Sometimes there isn't a feasible way to
| support part of an old API anymore, and it's not worth
| holding the whole kernel back just for the out-of-tree
| modules.
| yjftsjthsd-h wrote:
| > ZFS could be in-tree in no time at all if Oracle would
| fix its license. And until they do that, it's not safe to
| use ZFS-on-Linux anyway, since Oracle could sue you for
| it.
|
| IANAL, but I don't believe either of these things are
| true.
|
| OpenZFS contains enough code not authored by Sun/Oracle
| that relicensing it now is effectively impossible.
|
| OTOH, it _is_ under the CDDL, which is a perfectly good
| open source license; AFAICT the problem, if one exists at
| all[0], only manifests when _distributing_ the
| combination of CDDL (OpenZFS) and GPL (Linux) software.
| If you download CDDL software and compile it into GPL
| software yourself (say, with DKMS) then it should be fine
| because you aren't distributing it.
|
| [0] This is a case where I'm going to really emphasize
| that I'm really not a lawyer and merely point out that
| ex. Canonical's lawyers _do_ seem to think CDDL+GPL is
| okay.
| timschmidt wrote:
| > it should be fine because you aren't distributing it.
|
| Which excludes a vast amount of activity one might want
| to use Linux for which is otherwise allowed. Like selling
| a device with a Linux installation, distributing VM or
| system restore images, etc.
| yjftsjthsd-h wrote:
| Sure, I happily grant that the licensing situation is
| really annoying and restricts the set of safe actions. I
| only object to claims that all use of ZFS is legally
| risky.
| charcircuit wrote:
| >it's not safe to use ZFS-on-Linux anyway, since Oracle
| could sue you for it.
|
| It's not against the license to use them together.
|
| >If a temporary "support period" is what you want, then
| just use the LTS kernels. That's already exactly what
| they give you.
|
| Only the Android one does. The regular LTS one has no
| such guarantee.
| msgodel wrote:
| Does your system have some critical out of tree driver?
| That should have been recompiled with the new kernel, that
| sounds like a failure of whoever maintains the
| driver/kernel/distro (which may be you if you're building
| it yourself.)
| homebrewer wrote:
| It would also have a lot less FOSS drivers, neither we nor
| FreeBSD (which is often invoked in these complaints) would have
| amdgpu for example.
| charcircuit wrote:
| I would actually posit that making it easier to write
| drivers would have the opposite effect and result in
| more FOSS drivers.
|
| >FreeBSD (which is often invoked in these complaints) would
| have amdgpu for example.
|
| In such a hypothetical FreeBSD could reimplement the stable
| API of Linux.
| throw0101d wrote:
| > _In such a hypothetical FreeBSD could reimplement the
| stable API of Linux._
|
| Like it does with the userland API of Linux, which is
| stable:
|
| * https://wiki.freebsd.org/Linuxulator
| smcameron wrote:
| No, every gpu vendor out there would prefer proprietary
| drivers and with a stable ABI, they could do it, and would
| do, there is no question about it.
|
| I worked for HP on storage drivers for a decade or so, and
| had their been a stable ABI, HP would have shipped
| proprietary storage drivers for everything. Even without a
| stable ABI, they shipped proprietary drivers at
| considerable effort, compiling for myriad different distro
| kernels. It was a nightmare, and good thing too, or there
| wouldn't be any open source drivers.
| charcircuit wrote:
| I never said they wouldn't. Having more and better
| drivers is a good thing for Linux users. It's okay for
| proprietary drivers to exist. The kernel isn't meant to
| be a vehicle to push the free software agenda.
| msgodel wrote:
| It's plenty easy to make drivers now, it's just hard to
| distribute them without sharing the source.
|
| There is absolutely no good reason not to share driver
| source though so that's a terrible use case to optimize
| for.
| Nextgrid wrote:
| What's so bad about it? Windows to this day doesn't have FOSS
| drivers as standard and despite that is pretty successful. In
| practice, as long as a driver works it's fine for the vast
| majority of users, and you can always disassemble and binary-
| patch if really needed.
|
| (it's not obvious that having to occasionally
| disassemble/patch closed-source drivers is worse than the
| collective effort wasted trying to get every single thing in
| the kernel and keep it up to date).
| heavyset_go wrote:
| The unstable interface is Linux's moat, and IMO, is the reason
| we're able to enjoy such a large ecosystem of hardware via open
| source operating systems.
| zahlman wrote:
| I'm afraid I don't follow your reasoning.
| dralley wrote:
| I donate to Kent's patreon and I'm very enthusiastic about
| bcachefs.
|
| However, Kent, if you read this: please just settle down and
| follow the rules. Quit deliberately antagonizing Linus. The
| constant drama is incredibly offputting. Don't jeopardize the
| entire future of bcachefs over the silliest and most temporary
| concerns.
|
| If you absolutely must argue about some rule or other, then make
| that argument without having your opening move be to blatantly
| violate them and then complain when people call you out.
|
| You were the one who wanted into the kernel despite many
| suggestions that it was too early. That comes with tradeoffs. You
| need to figure out how to live with that, at least for a year or
| two. Stop making your self-imposed problems everyone else's
| problems.
| NewJazz wrote:
| Seriously how hard is it to say "I'm unhappy users won't have
| access to this data recovery option but will postpone its
| inclusion until the next merge window". Yeah, maybe it sucks
| for users who want the new option or what have you, but like
| you said it is a temporary concern.
| vbezhenar wrote:
| Why does it suck for users? Those brave enough to use new
| filesystem, surely can use custom kernel for the time being,
| while merge effort is underway and vanilla kernel might not
| be the most stable option.
| thrtythreeforty wrote:
| I _did_ subscribe to his Patreon but I stopped because of this
| - vote with your wallet and all that. I would happily
| resubscribe if he can demonstrate he can work within the Linux
| development process. This isn't the first time this flavor of
| personality clash has come up.
|
| Kent is absolutely technically capable of, and has the vision
| to, finally displace ext4, xfs, and zfs with a new filesystem
| that Does Not Lose Data. To jeopardize that by refusing to work
| within the well-established structure is madness.
| baggy_trough wrote:
| No matter how good the code is, Overstreet's behavior and the
| apparent bus factor of 1 leave me reluctant to investigate this
| technology.
| dsp_person wrote:
| Curious about this process. Can anyone submit patches to
| bcachefs and Kent is just the only one doing it? Is there a
| community with multiple contributors hacking on the features,
| or just Kent? If not, what could he do to grow this? And how
| does a single person receiving patreon donations affect the
| ability of a project like this to get past a bus factor of 1?
| nolist_policy wrote:
| Generally you need a maintainer for your subsystem who sends
| pull requests to Linus.
| koverstreet wrote:
| I take patches from quite a few people. If the patch looks
| good, I'll generally apply it.
|
| And I encourage anyone who wants to contribute to join the
| IRC channel. It's not a one man show, I work with a lot of
| people there.
| devwastaken wrote:
| Good. There is no place for unstable developers in a stable
| kernel.
| msgodel wrote:
| The older I get the more I feel like anything other than the
| ExtantFS family is just silly.
|
| The filesystem should do files, if you want something more
| complex do it in userspace. We even have FUSE if you want to use
| the Filesystem API with your crazy network database thing.
| anonnon wrote:
| > The older I get the more I feel like anything other than the
| ExtantFS family is just silly.
|
| The extended (not extant) family (including ext4) don't support
| copy-on-write. Using them as your primary FS after 2020 (or
| even 2010) is like using a non-journaling file system after
| 2010 (or even 2001)--it's a non-negotiable feature at this
| point. Btrfs has been stable for a decade, and if you don't
| like or trust it, there's always ZFS, which has been stable 20
| years now. Apple now has APFS, with CoW, on _all_ their
| devices, while MSFT still treats ReFS as unstable, and Windows
| servers still rely heavily on NTFS.
| msgodel wrote:
| Again I don't really want the kernel managing a database for
| me like that, the few applications that need that can do it
| themselves just fine. (IME mostly just RDBMSs and Qemu.)
| robotnikman wrote:
| >Windows will at some point have ReFS
|
| They seem to be slowly introducing it to the masses, Dev
| drives you set up on Windows automatically use ReFS
| milkey_mouse wrote:
| Hell, there's XFS if you love stability but want CoW.
| josephcsible wrote:
| XFS doesn't support whole-volume snapshots, which is the
| main reason I want CoW filesystems. And it also stands out
| as being basically the only filesystem that you can't
| arbitrarily shrink without needing to wipe and reformat.
| leogao wrote:
| you can always have an LVM layer for atomic snapshots
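The LVM approach mentioned here can be sketched as follows (the volume-group and volume names are hypothetical; this assumes root access and free extents in the volume group):

```shell
# Take a 5 GiB copy-on-write snapshot of logical volume "data" in
# volume group "vg0"; writes to the origin after this point are
# tracked in the snapshot's exception store.
lvcreate --size 5G --snapshot --name data_snap /dev/vg0/data

# Later, roll the origin back to the snapshot's state; the merge
# completes when the origin volume is next activated.
lvconvert --merge /dev/vg0/data_snap
```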
| josephcsible wrote:
| There are advantages to having the filesystem do the
| snapshots itself. For example, if you have a really big
| file that you keep deleting and restoring from a
| snapshot, you'll only pay the cost of the space once with
| Btrfs, but will pay it every time over with LVM.
| kzrdude wrote:
| there was the "old dog new tricks" xfs talk long time
| ago, but I suppose it was for fun and exploration and not
| really a sneak peek into snapshots
| MertsA wrote:
| You can shrink XFS, but only the realtime volume. All you
| need is xfs_db and a steady hand. I once had to pull this
| off for a shortened test program for a new server
| platform at Meta. Works great except some of those
| filesystems did somehow get this weird corruption around
| used space tracking that xfs_repair couldn't detect... It
| was mostly fine.
| leogao wrote:
| btrfs has eaten my data within the last decade. (not even
| because of the broken erasure coding, which I was careful to
| avoid!) not sure I'm willing to give it another chance. I'd
| much rather use zfs.
| bombcar wrote:
| I used reiserfs for a while even after I noticed it eating
| data (tail packing plus power loss), but quickly switched to
| xfs when it became available.
|
| Speed is sometimes more important than absolute
| reliability, but it's still an undesirable tradeoff.
| NewJazz wrote:
| CoW is an efficiency gain. Does it do anything to ensure data
| integrity, like journaling does? I think it is an
| unreasonable comparison you are making.
| webstrand wrote:
| I use CoW a lot just managing files. It's only an
| efficiency gain if you have enough space to do the data-
| copying operation. And that's not necessarily true in all
| cases.
|
| Being able to quickly take a "backup" copy of some multi-gb
| directory tree before performing some potentially
| destructive operation on it is such a nice safety net to
| have.
|
| It's also a handy way to backup file metadata, like mtime,
| without having to design a file format for mapping saved
| mtimes back to their host files.
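On btrfs or XFS, the cheap pre-operation copy described above is a reflink copy, which shares all data blocks until either side is modified (a sketch; the paths are hypothetical, and the command fails on filesystems without reflink support):

```shell
# Reflink copy: roughly O(metadata) time, and no extra data blocks
# are consumed until the two trees diverge.
cp --reflink=always -a project/ project.bak/

# ...run the risky operation on project/ ; if it goes wrong, the
# original contents (including mtimes, preserved by -a) are still
# in project.bak/.
```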
| anonnon wrote:
| > CoW is an efficiency gain.
|
| You're thinking of the _optimization technique_ of CoW, as
| in what Linux does when spawning a new thread or forking a
| process. I'm talking about it in the context of only ever
| modifying _copies_ of file system data and metadata blocks,
| for the purpose of ensuring file system integrity, even in
| the context of sudden power loss (EDIT: wrong link):
| https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
|
| If anything, ordinary file IO is likely to be _slightly
| slower_ on a CoW file system, due to it always having to
| copy a block before said block can be modified and updating
| block pointers.
| throw0101d wrote:
| > _Does it do anything to ensure data integrity, like
| journaling does?_
|
| What kind of journaling though? By default ext4 only uses
| journaling for metadata updates, not data updates (see
| "ordered" mode in _ext4(5)_ ).
|
| So if you have a (e.g.) 1000MB file, and you update 200MB
| in the middle of it, you can have a situation where the
| first 100MB is written out and the system dies with the
| other 100MB vanishing.
|
| With a CoW, if the second 100MB is not written out and the
| file sync'd, then on system recovery you're back to the
| original file being completely intact. With ext4 in the
| default configuration you have a file that has both
| new-100MB and stale-100MB in the middle of it.
|
| The updating of the file data and the metadata are two
| separate steps (by default) in ext4:
|
| * https://www.baeldung.com/linux/ext-journal-modes
|
| * https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...
|
| * https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...
|
| Whereas with a proper CoW (like ZFS), updates are ACID.
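The ext4 behavior described above is selected with the data= mount option; only full data journaling closes the torn-write window, at a throughput cost (a sketch; the device and mount point are hypothetical, and ext4 does not allow changing the data mode on a remount):

```shell
# Default "ordered" mode: only metadata goes through the journal;
# data blocks are flushed before the metadata that references them.
mount -t ext4 -o data=ordered /dev/sdb1 /mnt

# Full data journaling: data and metadata both go through the
# journal, so a crash never leaves a half-new, half-stale region.
umount /mnt
mount -t ext4 -o data=journal /dev/sdb1 /mnt
```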
| tbrownaw wrote:
| > _The extended (not extant) family (including ext4)_
|
| I read that more as "we have filesystems at home, and also
| get off my lawn".
| zahlman wrote:
| ... NTFS does copy-on-write?
|
| ... It does hard links? After checking: It does hard links.
|
| ... Why didn't any programs I had noticeably take advantage
| of that?
| yjftsjthsd-h wrote:
| I mean, I'd really like some sort of data error detection (and
| ideally correction). If a disk bitflips one of my files, ext*
| won't do anything about it.
| timewizard wrote:
| > some sort of data error detection (and ideally correction).
|
| That's pretty much built into most mass storage devices
| already.
|
| > If a disk bitflips one of my files
|
| The likelihood and consequence of this occurring is in many
| situations not worth the overhead of adding additional ECC on
| top of what the drive does.
|
| > ext* won't do anything about it.
|
| What should it do? Blindly hand you the data without any
| indication that there's a problem with the underlying block?
| Without an fsck what mechanism do you suppose would manage
| these errors as they're discovered?
| throw0101d wrote:
| >> _> some sort of data error detection (and ideally
| correction)._
|
| > _That's pretty much built into most mass storage devices
| already._
|
| And ZFS has shown that it is not sufficient (at least for
| some use-cases, perhaps less of a big deal for
| 'residential' users).
|
| > _The likelihood and consequence of this occurring is in
| many situations not worth the overhead of adding additional
| ECC on top of what the drive does._
|
| Not worth it to whom? Not having the option available _at
| all_ is the problem. I can do a _zfs set checksum=off
| pool_name/dataset_name_ if I really want that extra couple
| percentage points of performance.
|
| > _Without an fsck what mechanism do you suppose would
| manage these errors as they're discovered?_
|
| Depends on the data involved: if it's part of the file
| system tree metadata there are often multiple copies even
| for a single disk on ZFS. So instead of the kernel
| consuming corrupted data and potentially panicing (or going
| off into the weeds) it can find a correct copy elsewhere.
|
| If you're in a fancier configuration with some level of
| RAID, then there could be other copies of the data, or it
| could be rebuilt through ECC.
|
| With ext*, LVM, and mdadm no such possibility exists
| because there are no checksums at any of those layers
| (perhaps if you glom on dm-integrity?).
|
| And with ZFS one can _set copies=2_ on a per-dataset basis
| (perhaps just for /home?), and get multiple copies strewn
| across the disk: won't save you from a drive dying, but
| could save you from corruption.
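The per-dataset knobs mentioned above look roughly like this (pool and dataset names are hypothetical; this assumes an existing ZFS pool):

```shell
# Store two copies of every block under tank/home, so single-disk
# corruption can often be self-healed from the duplicate copy.
zfs set copies=2 tank/home

# Checksumming is on by default; confirm the current setting.
zfs get checksum tank/home

# Scrub: read every block, verify checksums, and repair from
# redundancy where possible.
zpool scrub tank
zpool status tank
```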
| yjftsjthsd-h wrote:
| > (perhaps if you glom on dm-integrity?).
|
| I looked at that, in hopes of being able to protect my
| data. Unfortunately, I considered this something of a
| fatal flaw:
|
| > It uses journaling for guaranteeing write atomicity by
| default, which effectively halves the write speed.
|
| - https://wiki.archlinux.org/title/Dm-integrity
| ars wrote:
| > The likelihood .. of this occurring
|
| That's 10^14 bits for a consumer drive. That's just 12TB. A
| heavy user (lots of videos or games) would see a bit flip a
| couple times a year.
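As a quick sanity check on that figure: a 10^14-bit URE spec corresponds to one expected unrecoverable read error per 12.5 TB read:

```python
# Consumer HDD datasheets commonly quote one unrecoverable read
# error (URE) per 10^14 bits read.
bits_per_ure = 10**14

bytes_per_ure = bits_per_ure // 8        # 1.25e13 bytes
tb_per_ure = bytes_per_ure / 10**12      # decimal terabytes

print(f"~{tb_per_ure} TB read per expected URE")  # ~12.5 TB
```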
| magicalhippo wrote:
| I do monthly scrubs on my NAS, I have 8 14-20TB drives
| that are quite full.
|
| According to that 10^14 metric I should see read errors
| just about every month. Except I have just about zero.
|
| Current disks are ~4 years, runs 24/7, and excluding a
| bad cable incident I've had a single case of a read error
| (recoverable, thanks ZFS).
|
| I suspect those URE numbers come from the manufacturers
| figuring out they can be sure the disk will do 10^14, but
| they don't actually try to find the real number because
| 10^14 is good enough.
| yjftsjthsd-h wrote:
| To your first couple points: I trust hardware less than
| you.
|
| > What should it do? Blindly hand you the data without any
| indication that there's a problem with the underlying
| block?
|
| Well, that's what it does now, and I think that's a
| problem.
|
| > Without an fsck what mechanism do you suppose would
| manage these errors as they're discovered?
|
| Linux can fail a read, and IMHO _should_ do so if it cannot
| return _correct_ data. (I support the ability to override
| this and tell it to give you the corrupted data, but
| certainly not by default.) On ZFS, if a read fails its
| checksum, the OS will first try to get a valid copy (ex.
| from a mirror or if you've set copies=2), and then if the
| error can't be recovered then the file read fails and the
| system reports/records the failure, at which point the user
| should probably go do a full scrub (which for our purposes
| should probably count as fsck) and restore the affected
| file(s) from backup. (Or possibly go buy a new hard drive,
| depending on the extent of the problem.) I would consider
| that ideal.
| eptcyka wrote:
| Bitflips in my files? Well, there's a high likelihood that
| the corruption won't be too bad. Bit flips in the filesystem
| metadata? There's a significant chance all of the data is
| lost.
| heavyset_go wrote:
| Transparent compression, checksumming, copy-on-write, snapshots
| and virtual subvolumes should be considered the minimum default
| feature set for new OS installations in TYOOL 2025.
|
| You get that with APFS by default on macOS these days and those
| features come for free in btrfs, some in XFS, etc on Linux.
| riobard wrote:
| APFS checksums only fs metadata, not user data, which is a
| pain. Presumably that's because APFS is used on single-drive
| systems where there's no redundancy to recover from anyway.
| Still, not ideal.
| vbezhenar wrote:
| Apple trusts their hardware to do its own checksums
| properly. Modern SSDs use checksums and parity codes for
| blocks, and SATA/NVMe include checksums for protocol
| frames. The only unreliable component is RAM, but FS
| checksums can't help there, because a RAM bit will likely
| be flipped before the checksum is calculated or after it
| is verified.
| riobard wrote:
| If they really trusted their hardware, APFS wouldn't need
| to checksum fs metadata either, so I guess they don't
| trust it well enough? Also, I have external drives that
| are not Apple-sanctioned to store files, and I don't trust
| them enough either; there's no option for user data
| checksumming at all.
| londons_explore wrote:
| Most SSDs can't be trusted to maintain proper data
| ordering in the case of a sudden power off.
|
| That makes checksums and journals of only marginal
| usefulness.
|
| I wish some review website would have a robot plug and
| unplug the power cable in a test rig for a few weeks and
| rate which SSD manufacturers are robust to this stuff.
| criticalfault wrote:
| I've been following this for a while now.
|
| Kent is in the wrong. If I had a lead position in
| development, I would kick Kent off the team.
|
| It's one thing to challenge things; what Kent is doing is
| something completely different. He obviously introduced a
| feature, not just a bugfix.
|
| If the rules say that rc1+ gets only bugfixes, then it is
| absolutely clear what happens to a feature. Tolerating this
| once or twice is OK, but Kent does this all the time,
| testing Linus.
|
| Linus is absolutely within his rights to kick this out, and
| it's Kent's fault if he does so.
| Pet_Ant wrote:
| Why take it out of the kernel? Why not just make someone
| responsible the maintainer so they can say "no, next release"
| to his shenanigans? It can't be the license.
| nolist_policy wrote:
| Kent can appoint a suitable maintainer if he wishes. That's
| his job, not Linus'.
| criticalfault wrote:
| This is unclear to me as well, but I'm saying I wouldn't
| hold it against Linus if he did this. Based on Kent's
| behavior, Linus has every right to do so.
|
| A way to handle this would be with one person (or more) in
| between Kent and Linus. And maybe a separate tree only for
| changes and fixes from bcachefs that those people in between
| would forward to Linus. A staging of sorts.
| tliltocatl wrote:
| Maintainers aren't getting paid and so cannot be "appointed".
| Someone must volunteer - and most people qualified and
| motivated enough are already doing something else.
| timewizard wrote:
| Presumably there would be an open call where people would
| nominate themselves for consideration. These are problems
| that have come up and been solved in human organizations
| for hundreds of years before the kernel even existed.
| xorcist wrote:
| There is no call. Anyone can volunteer at any time.
|
| Software take up no space and there is no scarcity.
| Theoretically there could be any number of maintainers
| and what gets uptake is the de facto upstream. That's
| what people refer to when they talk about free software
| development in terms of meritocracy.
| pmarreck wrote:
| This can happen with prima donna devs who haven't had to
| collaborate in a team environment for a long time.
|
| It's a damn shame too, because bcachefs has some unique
| features/potential.
| bgwalter wrote:
| bcachefs is experimental and Kent writes in the LWN comments
| that nothing would get done if he didn't develop it this way.
| Filesystems are a massive undertaking; you can have all the
| rules you want, but they don't help if nothing gets
| developed.
|
| It would be interesting to know how strict the rules are in
| the Linux kernel for other people. Other projects have
| nepotistic structures where some developers can do what
| they want but others cannot.
|
| Anyway, if Linus had developed the kernel with this kind of
| strictness from the beginning, maybe it wouldn't have taken
| off. I don't see why experimental features should follow the
| rules for stable features.
| yjftsjthsd-h wrote:
| If it's an experimental feature, then why not let changes go
| into the next version?
| bgwalter wrote:
| That is a valid objection, but I still think that for some
| huge and difficult features the month-long pauses imposed
| by release cycles are absolutely detrimental.
|
| Ideally they'd be developed outside the kernel until they
| are perfect, but Kent addresses this in his LWN comment:
| There is no funding/time to make that ideal scenario
| possible.
| jethro_tell wrote:
| He could release a patch that can be pulled by the people
| that need it.
|
| If you're using experimental file systems, I'd expect you
| to be pretty competent in being able to hold your own in
| a storage emergency, like compiling a kernel if that's
| the way out.
|
| This is a made-up emergency, to break the rules.
| layer8 wrote:
| For some reason I always read this as "BCA chefs".
| kzrdude wrote:
| _today_ Kent posted another rc patch with a new filesystem
| option. But it was merged...
| ajb wrote:
| Yeah.. the thing is, suppose Kent was 100% right that this needed
| to be merged in a bugfix phase, even though it's not a bug fix.
| It's _still_ a massive trust issue that he didn't flag that
| the contents of his PR were well outside the expected.
|
| That means Linus has to check each of his PRs assuming that it
| might be pushing the boundaries without warning.
|
| No amount of post hoc justification gets you that trust back, not
| when this has happened multiple times now.
| NewJazz wrote:
| He mentioned it in his PR summary as a new option. About half
| of the summary of the original PR was talking about the new
| option and why it was important.
|
| https://lore.kernel.org/linux-fsdevel/4xkggoquxqprvphz2hwnir...
| ajb wrote:
| I'm not saying he made a PR just saying "Fixes" like a
| rookie. What I'm saying is that in there should have been
| something along the lines of "heads up - I know this doesn't
| comply with the usual process for the following commits,
| here's why I think they should be given a waiver under these
| circumstances" followed by the justifications that appeared
| _after_ Linus got upset.
|
| The PR description would have been fine - if it had been in
| the right stage of the process.
| gdgghhhhh wrote:
| In this context, this is worth a read:
| https://hachyderm.io/@josefbacik/114755106269205960
| wmf wrote:
| A lot of open source volunteers can't really be replaced
| because there is no one willing to volunteer to maintain that
| thing. This is complicated by the fact that people mostly get
| credit for creating new projects and no credit for maintenance.
| Anyone who could take over bcachefs would probably be better
| off creating their own new filesystem.
| ajb wrote:
| Ehh. I don't think Kent is an arsehole. The problem with
| terms like "arsehole" is that they conflate a bunch of
| different issues and don't really have much explanatory
| power. Someone who is difficult to work with can be that
| way for loads of different reasons: ego, tunnel vision,
| stress, neurodivergence (of various kinds), commercial
| pressures, greed, etc.
|
| There is always a point where you have to say "no, I can't
| work with this person any more", but while you're still
| trying, it's worth figuring out why someone is behaving as
| they do.
| ars wrote:
| This happened about a year ago as well:
| https://news.ycombinator.com/item?id=41407768
| jagged-chisel wrote:
| For the uninitiated:
|
| bCacheFS, not BCA Chefs. I'm not clued into the kernel at
| this level, so I racked my brain a bit.
| zahlman wrote:
| I had to think about it the first time, too.
| anonfordays wrote:
| Linux needs a true answer to ZFS that isn't btrfs. Sadly
| that ship has sailed for btrfs; after 15+ years it's still
| not trustworthy.
|
| Apparently bcachefs won't be the successor either.
| Filesystem development for Linux needs a big shakeup.
| bombcar wrote:
| ZFS is good enough for the 90% of people who need those
| features, so no real money is available for anything new.
|
| Maybe a university could do it.
| anonfordays wrote:
| Indeed, and its inclusion in Ubuntu is fantastic. It's also
| showing its age, 20 years now. Ts'o, where are you when we
| need you most!?
| bombcar wrote:
| Or someday a file system will somehow piss off Linus and
| he'll write one in a weekend or something ;)
| XorNot wrote:
| I mean, is it? It's a filesystem and it works. How is it
| "showing its age"?
| em-bee wrote:
| Several people I know have been using btrfs without
| problems for years now. I use it on half a dozen devices.
| What's your evidence that it is not trustable?
| anonfordays wrote:
| https://btrfs.readthedocs.io/en/latest/Status.html
|
| The amount of "mostly OK", and a still-unstable RAID6
| implementation. I'm not going to trust a filesystem whose
| device replace is "mostly OK". Anecdotally, you can search
| the LKML and here for tons of data-loss stories.
| zahlman wrote:
| Does the filesystem actually need to be part of the kernel
| project to work? I can see where you'd need that _for the
| root filesystem_, but even then, couldn't one migrate an
| existing installation to a new partition with a different
| filesystem?
| teekert wrote:
| We have ZFS for that. What we want is something in-kernel,
| ready to go, 100% supported on root on any Linux system,
| with no license ambiguity. We want to replace ext4. Maybe
| btrfs can do it; I hear it has outgrown its rocky puberty.
___________________________________________________________________
(page generated 2025-07-04 23:00 UTC)