[HN Gopher] Bug hunting in Btrfs
___________________________________________________________________
Bug hunting in Btrfs
Author : todsacerdoti
Score : 137 points
Date : 2024-03-20 12:51 UTC (10 hours ago)
(HTM) web link (tavianator.com)
(TXT) w3m dump (tavianator.com)
| throw0101d wrote:
| Given that (Open)ZFS[1] is quite mature, and Bcachefs[2] seems to
| be gaining popularity, how much of a future does Btrfs have?
|
| [1] https://en.wikipedia.org/wiki/ZFS
|
| [2] https://en.wikipedia.org/wiki/Bcachefs
| opengears wrote:
| A big future (unless the ZFS licensing incompatibility is solved)
| maxloh wrote:
| Maybe replace Oracle-owned code with a clean room
| implementation?
| DaSHacka wrote:
| Would you not basically be starting over at that point,
| though?
| arp242 wrote:
| They did that. It's called "btrfs".
|
| A stable clean-room ZFS with on-disk compatibility would be
| a huge task. How long did stable NTFS write capability
| take? And NTFS is a much simpler filesystem. It would also
| be a huge waste of time given that btrfs and bcachefs
| exist, and that ZFS is fine to use license-wise - it's just
| distribution that's a tad awkward (but not overly so).
| rascul wrote:
| Interesting to note here that btrfs came from Oracle.
| tadfisher wrote:
| "Chris Mason is the founding developer of btrfs, which he
| began working on in 2007 while working at Oracle. This
| leads many people to believe that btrfs is an Oracle
| project--it is not. The project belonged to Mason, not to
| his employer, and it remains a community project
| unencumbered by corporate ownership to this day."
|
| https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...
| yjftsjthsd-h wrote:
| I don't think that would work; all of the changes since the
| fork are also CDDL and they aren't owned by any one
| entity/person. (IANAL)
| maxloh wrote:
| GPL only forbids ZFS to be distributed alongside Linux, it
| doesn't prevent users from installing it manually. (IANAL)
| mustache_kimono wrote:
| > GPL only forbids ZFS to be distributed alongside Linux
|
| But does it even do that? You might be surprised when/if
| you read a little more widely. The position of the OpenZFS
| project[0], which I find persuasive, is (my emphasis added):
| In the case of the Linux Kernel, this prevents us from
| distributing OpenZFS as part of the Linux Kernel binary.
| *However, there is nothing in either license that prevents
| distributing it in the form of a binary module* or in the
| form of source code.
|
| [0]: https://openzfs.github.io/openzfs-docs/License.html
|
| You might see also:
|
| [1]: https://www.networkworld.com/article/836039/smb-encouraging-...
|
| [2]: https://law.resource.org/pub/us/case/reporter/F2/977/977.F2d...
| lmz wrote:
| It doesn't matter what the project thinks as long as
| there is code owned by Oracle in the FS.
| mustache_kimono wrote:
| > It doesn't matter what the project thinks as long as
| there is code owned by Oracle in the FS.
|
| Might we agree then that the only thing that really
| matters is the law? And we should ignore other opinions
| _cough_ like from the FSF/the SFC _cough_ which don't
| make reference to the law? Or which ignore long held
| copyright law principles, like fair use?
|
| Please take a look at the case law. The so far
| theoretical claim of OpenZFS/Linux incompatibility is
| especially _weak_ re: a binary kernel module.
| wtallis wrote:
| What matters in practice isn't the law, but how much
| trouble Oracle could cause should the lawnmower veer in
| that direction. Even a complete rewrite of ZFS would have
| some risk associated with it given Oracle's history.
| mustache_kimono wrote:
| > What matters in practice isn't the law, but how much
| trouble Oracle could cause should the lawnmower veer in
| that direction.
|
| Veer in what direction? The current state of affairs re:
| Canonical and Oracle is a ZFS binary kernel module
| shipped with the Linux kernel. Canonical has _literally_
| done the very thing which you are speculating is
| impermissible. And, for 8 years, Oracle has done nothing.
| Oracle's attorneys have even publicly disagreed _with
| the SFC_ that a ZFS binary kernel module violates the GPL
| or the CDDL.[0]
|
| Given this state of affairs, the level of legal certainty
| re: this question is far greater than the legal certainty
| we have re: pretty much any other open IP question in
| tech.
|
| What matters in practice is that you stop your/the
| SFC's/the FSF's torrent of FUD.
|
| > Even a complete rewrite of ZFS would have some risk
| associated with it given Oracle's history.
|
| I'd ask "How?", but it'd be another torrent of "What
| if..."s.
|
| [0]: https://youtu.be/PFMPjt_RgXA?t=2260
| vladvasiliu wrote:
| In addition to what mustache_kimono said, there seem to be
| issues with kernel functions moving / changing and breaking
| ZFS, which then needs a while to catch up. For the latest
| recurrence of this, see
| https://github.com/openzfs/zfs/pull/15931 for Linux 6.8
| compatibility.
|
| There's also the fact that not everything works with ZFS out
| of the box. For example, newer versions of systemd have been
| able to use the TPM to unlock drives encrypted with LUKS.
| AFAIK it doesn't work with ZFS.
|
| I use ZFS on my daily driver Linux box and mostly love it,
| but as long as these things happen, I can see why people
| may want to try to find an in-kernel solution. I personally
| use Arch, so I expect bleeding edge updates not to work
| perfectly right away. But I recall seeing folks complaining
| about issues on Fedora, too, which I expect to be a bit
| more conservative than Arch.
| didntcheck wrote:
| LUKS and the associated systemd hook shouldn't care about
| the filesystem, right? It's just the block layer.
|
| But presumably you meant native ZFS encryption, which
| unfortunately is considered to be somewhat experimental
| and neglected, as I understand. Which surprised me, since
| I thought data at rest encryption would be pretty
| important for an "enterprise" filesystem
|
| Still, apparently lots of people successfully run ZFS on
| LUKS. It does mean you don't get zero-trust zfs send
| backups, but apparently that's where a lot of the bugs
| have been anyway
| vladvasiliu wrote:
| Yes, I was talking about ZFS native encryption.
|
| > But presumably you meant native ZFS encryption, which
| unfortunately is considered to be somewhat experimental
| and neglected, as I understand. Which surprised me, since
| I thought data at rest encryption would be pretty
| important for an "enterprise" filesystem.
|
| Yeah, I've happened upon someone saying something
| similar, but I've never seen anything about that from
| "official" sources. Wouldn't mind a link or something if
| you have one on hand. But there is the fact that this
| encryption scheme seems limited when compared to LUKS:
| there's no support for multiple passphrases or using
| anything more convenient, like say, a U2F token.
|
| > Still, apparently lots of people successfully run ZFS
| on LUKS. It does mean you don't get zero-trust zfs send
| backups, but apparently that's where a lot of the bugs
| have been anyway
|
| I'd say I'm one of those people, never had any issue with
| this in ~ten years of use. And, indeed, the main reason
| for using this on my laptop is being able to send the
| snapshots around without having to deal with another tool
| for managing encryption. Also, on my servers on which I
| run RAIDZ, having to configure LUKS on top of each drive
| is a PITA.
| throw0101d wrote:
| Why would Btrfs have a big(ger) future than Bcachefs when the
| latter seems to have the same functionality and less
| 'historical baggage'+?
|
| I remember when Btrfs was initially released (I was doing
| Solaris sysadmining, and it was billed as "ZFS for Linux"),
| and yet here we are all these years later and it still seems
| to be 'meh'.
|
| + E.g., RAID5+ that's less likely to eat data:
|
| * https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...
|
| * https://bcachefs.org/ErasureCoding/ /
| https://github.com/koverstreet/bcachefs/issues/657
| bhaney wrote:
| By the time future-bcachefs has feature parity with
| present-btrfs, who knows what more will be in future-btrfs?
|
| For your specific example, bcachefs's erasure coding is
| very experimental and currently pretty much unusable, while
| btrfs is actively working towards fixing the raid56 write
| hole with the recent addition of the raid-stripe-tree. By
| the time bcachefs has a trustworthy parity profile, btrfs's
| _may_ be just as good.
| curt15 wrote:
| Bcachefs has not advertised erasure coding as production
| ready only to renege on that claim later. So nobody has
| been unwittingly burned yet.
| throw0101d wrote:
| > _For your specific example_
|
| My specific example says that the bcachefs devs are "actively
| working towards fixing the raid56 write hole" as well--or
| rather, their way of doing things doesn't have one in the
| first place.
| bhaney wrote:
| > bcachefs devs are "actively working towards fixing the
| raid56 write hole" as well
|
| Yep, that's my point. Neither btrfs nor bcachefs have a
| write-hole-less parity raid profile implementation yet,
| and both are working towards one. We don't know if one
| will be finished and battle tested significantly before
| the other, or if one will prove to be more performant or
| reliable. Just have to wait and see.
| e145bc455f1 wrote:
| When can we expect Debian to ship Bcachefs?
| throw0101d wrote:
| Bcachefs is in the Linux 6.7 kernel, and that is available in
| Debian _unstable_ and _experimental_ :
|
| * https://packages.debian.org/search?keywords=linux-image-6.7
|
| * Search "6.7.1-1~exp1": https://metadata.ftp-
| master.debian.org/changelogs//main/l/li...
| candiddevmike wrote:
| Bcachefs is not a drop-in replacement for btrfs yet. It's
| still missing critical things like scrub:
|
| https://bcachefs.org/Roadmap/
| bhaney wrote:
| As a heavy btrfs user, I do expect bcachefs to fully replace it
| eventually. But that's still many years off.
| nolist_policy wrote:
| I fully expect bcachefs will initially hit issues similar to
| the ones btrfs did.
|
| Until now (bcachefs has been merged) it has only been used by
| people running custom kernels. As more people try it, it will
| hit more and more drives with buggy firmware and whatnot.
| viraptor wrote:
| They're in a slightly different position though. Bcache
| itself has existed for many years and has been used in
| production. Bcachefs changes it quite a bit, but I wouldn't
| expect the basic issues we've seen elsewhere.
| e145bc455f1 wrote:
| Just last week my btrfs filesystem got irrecoverably corrupted.
| This is like the fourth time it has happened to me in the last 10
| years. Do not use it on consumer-grade hardware. Compared to
| this, ext4 is rock solid. It was even able to survive me
| accidentally passing the currently running host's hard disk to a
| VM guest, which booted from it.
| londons_explore wrote:
| > last week my btrfs filesystem got irrecoverably corrupted.
|
| This is really two bugs. 1, the filesystem got corrupted. 2,
| tooling didn't exist to automatically scan through the disk
| data structures and recover as much of your drive as possible
| from whatever fragments of metadata and data were left.
|
| For 2, recovery should happen by default. Most users don't want a
| 'disk is corrupt, refusing to mount' error. Most users want any
| errors to auto-correct if possible and get on with their day.
| Keep a recovery logfile with all the info needed to reverse any
| repairs for that small percentage of users who want to use a
| hex editor to dive into data corruption by hand.
| mrob wrote:
| Where is that log file supposed to be stored? It can't be on
| the same filesystem it was created for or it negates the
| purpose of its creation.
| londons_explore wrote:
| If I were designing it, the recovery process would:
|
| * scan through the whole disk and, for every sector, decide
| if it is "definitely free space (part of the free space
| table, not referenced by any metadata)", "definitely
| metadata/file data", "unknown/unsure (ie. perhaps
| referenced by some dangling metadata/an old version of some
| tree nodes)".
|
| * I would then make a new file containing a complete image
| of the whole filesystem pre-repair, but leaving out the
| 'definitely free space' parts.
|
| * such a file takes nearly zero space, considering btrfs's
| copy-on-write and sparse-file abilities.
|
| * I would then repair the filesystem to make everything
| consistent. The pre-repair file would still be available
| for any tooling wanting to see what the filesystem looked
| like before it was repaired. You could even loopmount it or
| try other repair options on it.
|
| * I would probably encourage distros to auto-delete this
| recovery file if disk space is low/after some time, since
| otherwise the recovery image will end up pinning user data
| to using up disk space for years and users will be unhappy.
|
| The above fails in only one case: Free space on the drive
| is very low. In that case, I would probably just do the
| repairs in-RAM and mount the filesystem readonly, and have
| a link to a wiki page on possible manual repair routes.
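|
| A rough userspace sketch of the sparse pre-repair image idea
| (illustrative only: the block classifier below is a stub, since
| walking the btrfs trees is the actual hard part, and a real
| tool would reflink in-use ranges via copy-on-write rather than
| copy them):
|
|   # save_prerepair_image.py (hypothetical sketch): copy every
|   # block that is not "definitely free space" into a sparse
|   # image file, so the pre-repair state stays inspectable.
|   BLOCK = 4096
|
|   def definitely_free(block: bytes) -> bool:
|       # Stub: a real implementation would consult the free
|       # space tree / chunk tree, not just look for zeroes.
|       return block == b"\x00" * BLOCK
|
|   def save_image(device: str, out: str) -> None:
|       with open(device, "rb") as src, open(out, "wb") as dst:
|           offset = 0
|           while True:
|               block = src.read(BLOCK)
|               if not block:
|                   break
|               if not definitely_free(block):
|                   dst.seek(offset)  # seeking past free blocks
|                   dst.write(block)  # leaves holes (sparse file)
|               offset += BLOCK
|
|   save_image("/dev/sdX1", "/mnt/rescue/pre-repair.img")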
| TillE wrote:
| Yeah the last time I had a btrfs volume die, there were a few
| troubleshooting/recovery steps on the wiki which I dutifully
| followed. Complete failure, no data recoverable. The last
| step was "I dunno, go ask someone on IRC." Great.
|
| It's understandable that corruption can happen due to bugs or
| hardware failure or user insanity, but my experience was that
| the recovery tools are useless, and that's a big problem.
| rcthompson wrote:
| Writing to a corrupted filesystem by default is bad design.
| The corruption could be caused by a hardware problem that is
| exacerbated by further writes, leading to additional data
| loss.
| londons_explore wrote:
| > It was even able to survive me accidentally passing the
| currently running host's hard disk to a VM guest, which booted
| from it.
|
| I have also done this, and was also happy that the only
| corruption was to a handful of unimportant log files. Part of a
| robust filesystem is that when the user does something stupid,
| the blast radius is small.
|
| Other less-smart filesystems could easily have said "root of
| btree version mismatch, deleting bad btree node, deleting a
| bunch of now unused btree nodes, your filesystem is now empty,
| have a nice day".
| nolist_policy wrote:
| Best send a bug report to the btrfs mailing list at
| linux-btrfs@vger.kernel.org.
|
| If possible include the last kernel log entries from before
| the corruption. Include kernel version, drive model and drive
| firmware version.
| riku_iki wrote:
| how do you know it was an issue with the FS and not the
| hardware/disk?..
| viraptor wrote:
| Yeah, that's the fun part of the ext/btrfs corruption posts.
| If you got repeating corruption on btrfs on the same drive
| but not on ext, how do you know it's not just a drive failure
| that ext is not able to notice? What would happen if you
| tried ext with dm-integrity?
| gmokki wrote:
| I have had the same btrfs filesystem in use for 15+ years,
| with 6 disks of various sizes, and all hardware components
| changed at least once during the filesystem's lifetime.
|
| Worst corruption was when one DIMM started corrupting data. As
| a result computer kept crashing and eventually refused to mount
| because of btrfs checksum mismatches.
|
| The fix was to buy new HW, then run btrfs filesystem repairs,
| which failed at some point but at least got the filesystem
| running as long as I did not touch the most corrupted
| locations. Luckily it was RAID1, so most checksums had a
| correct value on another disk. Unfortunately the checksum tree
| had corruption on both copies in two locations. I had to open
| the raw disks with a hex editor and change the offending bytes
| to the correct values, after
| which the filesystem has been running again smoothly for 5
| years.
|
| And to find the location to modify on the disks I built a
| custom kernel that printed the expected value and absolute disk
| position when it detected the specific corruption. Plus had to
| ask a friend to double check my changes since I did not have
| any backups.
| matja wrote:
| > running again smoothly for 5 years
|
| So did you bite the bullet and get ECC, or are you just
| waiting for the next corruption caused by memory errors? :)
| matheusmoreira wrote:
| > Compared to this, ext4 is rock solid.
|
| Ext4 is the most reliable file system I have ever used. Just
| works and has never failed on me, not even once. No idea why
| btrfs can't match its quality despite over a decade of
| development.
| Daunk wrote:
| I recently tried (for the first time) Btrfs on my low-end laptop
| (no snapshots), and I was surprised to see that the laptop ran
| even worse than it usually does! Turns out there was something
| like a "btrfs-cleaner" (or similar) running in the background,
| eating up almost all the CPU at all times. After about 2 days I
| jumped over to ext4 and everything ran just fine.
| eru wrote:
| Interesting that the 'cleaner' doesn't run niced?
| rcthompson wrote:
| I'm pretty sure it's a kernel thread, not a process, since
| it's part of the filesystem. So it can't be renice'd.
| mustache_kimono wrote:
| > I recently tried Btrfs on my low-end laptop (no snapshots)
|
| Do snapshots degrade the performance of btrfs?
| nolist_policy wrote:
| What was your workload? Do you have quotas enabled?
| Compression? Are you running OpenSuse by any chance?
| Daunk wrote:
| Workload was literally zero, I just logged into XFCE and
| could barely do anything for 2 days straight. No quotas and
| no compression, but it was indeed openSUSE!
| nolist_policy wrote:
| That explains it, because openSUSE uses snapshots and
| quotas by default. It creates a snapshot before and one
| after every package manager interaction and cleans up old
| snapshots once per day.
|
| Unfortunately, deleting snapshots with quotas is an
| expensive operation that needs to rescan some structures to
| keep the quota information consistent and that is what
| you're seeing.
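|
| (For reference: qgroup tracking can be switched off with the
| stock tooling, e.g. `btrfs quota disable <mountpoint>`, at the
| cost of the space-aware snapshot cleanup that snapper's quota
| support relies on.)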
| Daunk wrote:
| I'm not sure that's correct. When you install openSUSE
| (this was a clean install) there's a checkbox that asks
| if you want snapshots, which I did not enable. But either
| way, a fresh openSUSE install with XFCE on Btrfs
| rendering the computer unusable for, at least, 2 days is
| not okay in my book, even if snapshots were enabled.
| mritzmann wrote:
| Had a similar problem but can't remember the Btrfs process.
| Anyway, after I switched off Btrfs quotas, everything was fine.
| londons_explore wrote:
| One btrfs bug which is 100% reproducible:
|
| * Start with an ext3 filesystem 70% full.
|
| * Convert to btrfs using btrfs-convert.
|
| * Delete the ext3_saved snapshot of the original filesystem as
| recommended by the convert utility.
|
| * Enable compression (-o compress) and defrag the filesystem as
| recommended by the man page for how to compress all existing
| data.
|
| It fails with out of disk space, leaving a filesystem which isn't
| repairable - deleting files will not free any space.
|
| The fact that such a bug seems to have existed for years, hit
| by simply following the man pages for a common use case
| (migrating to btrfs to use its compression abilities to gain
| free space), tells me that it isn't yet ready for primetime.
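|
| For concreteness, the steps above as a script -- destructive,
| with placeholder device/mount paths; note that current
| btrfs-progs names the rollback snapshot "ext2_saved" regardless
| of the source filesystem version:
|
|   # reproduce_convert_enospc.py (sketch) -- DESTRUCTIVE: wipes
|   # the target device. Expect ENOSPC at the defrag step.
|   import subprocess
|
|   def run(*cmd):
|       print("+", " ".join(cmd))
|       subprocess.run(cmd, check=True)
|
|   dev, mnt = "/dev/sdX1", "/mnt/test"
|   run("mkfs.ext3", dev)
|   # ... mount and fill the filesystem to ~70% here ...
|   run("btrfs-convert", dev)
|   run("mount", "-o", "compress", dev, mnt)
|   run("btrfs", "subvolume", "delete", mnt + "/ext2_saved")
|   run("btrfs", "filesystem", "defragment", "-r", "-czstd", mnt)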
| cesarb wrote:
| > deleting files will not free any space.
|
| Does a rebalance fix it? I have once (and only once, back when
| it was new) hit a "out of disk space" situation with btrfs, and
| IIRC rebalancing was enough to fix it.
|
| > for a common use case
|
| It might have been a common use case back when btrfs was new
| (though I doubt it, most users of btrfs probably created the
| filesystem from scratch even back then), but I doubt it's a
| common use case nowadays.
| eru wrote:
| > It might have been a common use case back when btrfs was
| new (though I doubt it, most users of btrfs probably created
| the filesystem from scratch even back then), but I doubt it's
| a common use case nowadays.
|
| It's perhaps not as common as it once was, but you'd expect
| it to be common enough to work, and not some obscure corner
| case.
| bayindirh wrote:
| From my perspective, a filesystem is _critical
| infrastructure_ in an OS, and failing here and there and not
| fixing these bugs because they're not common is not
| acceptable.
|
| Same for the RAID5/6 bugs in BTRFS. What's their solution? A
| simple warning in the docs:
|
| > RAID5/6 has known problems and should not be used in
| production. [0]
|
| Also the CLI _discourages you_ from creating these things.
| Brilliant.
|
| This is why I don't use BTRFS anywhere. An FS shall be
| bulletproof. Errors must only come from hardware problems,
| not random bugs in the filesystem.
|
| [0]: https://btrfs.readthedocs.io/en/latest/mkfs.btrfs.html#multi...
| chronid wrote:
| Machines die. Hardware has bugs, or is broken. Things just
| bork. It's a fact of life.
|
| Would I build a file storage system around btrfs? No, not
| without proper redundancy at least. But I'm told at least
| Synology does.
|
| I'm pretty sure there's plenty of cases where it's
| perfectly usable - the feature set it has today is plenty
| useful and the worst case scenario is a host reimage.
|
| I can live with that. Applications will generally break
| production ten billion times before btrfs does.
| bayindirh wrote:
| > Machines die. Hardware has bugs, or is broken. Things
| just bork. It's a fact of life.
|
| I know, I'm a sysadmin. I care for hardware, mend it,
| heal it, and sometimes donate, cannibalize or bury it. I'm
| used to it.
|
| > worst case scenario is a host reimage...
|
| While hosting PBs of data on it? No, thanks.
|
| > Would I build a file storage system around btrfs? No -
| without proper redundancy at least.
|
| Everything is _easy_ for small n. When you store 20TB on
| 4x5TB drives, everything can be done. When you have
| >5PB of storage on racks, you need at least a copy of
| that system running hot-standby. That's not cheap in any
| sense.
|
| Instead, I'd use ZFS, Lustre, anything, but not BTRFS.
|
| > I can live with that - applications will generally
| break production ten billion times before btrfs does.
|
| In our case, no. Our systems don't stop because a
| daemon decided to stop after a server among many fried
| itself.
| wongarsu wrote:
| My impression of btrfs is that it's very useful and
| stable if you stay away from the sharp edges. Until you
| run into some random scenario that leads you to an
| unrecoverable file system.
|
| But it has been that way for 14 years now. Sure, there
| are far fewer sharp edges now than there were back then.
| For a host you can just reimage, it's fine; for a well-
| tested, fairly restricted system, it's fine. I stay far
| away from it for personal computers and my home-built
| NAS, because just about any other fs seems to be more
| stable.
| nolist_policy wrote:
| You do you.
|
| Personally, btrfs just works and the features are worth it.
|
| Btrfs raid always gets brought up in these discussions, but
| you can just not use it. The reality is that it didn't have
| a commercial backer until now with Western Digital.
| lxgr wrote:
| There's literally no way I could migrate my NAS other than
| through an in-place FS conversion since it's >>50% full.
|
| The same probably applies to many consumer devices.
| chasil wrote:
| A rebalance means that every file on the filesystem will be
| rewritten.
|
| This is drastic, and I'd rather perform such an operation on
| an image copy.
|
| This is one case where ZFS is absolutely superior; if a drive
| goes offline, and is returned to a set at a later date, the
| resilver only touches the changed/needed blocks. Btrfs forces
| the entire filesystem to be rewritten in a rebalance, which
| is much more drastic.
|
| I am very willing to allow ZFS mirrors to be degraded; I
| would never, ever let this happen to btrfs if at all
| avoidable.
| o11c wrote:
| The desired "compress every file" operation will also cause
| every file on the filesystem to be rewritten though ...
| orev wrote:
| As a Linux user since kernel 0.96, I have never once considered
| doing an in-place migration to a new file system. That seems
| like a crazy thing to try to do, and I would hope it's only
| done as a last resort and with all data fully backed up before
| trying it.
|
| I would agree that if this is presented in the documentation as
| something it supports, then it should work as expected. If it
| doesn't work, then a pull request to remove it from the docs
| might be the best course of action.
| tuyiown wrote:
| I don't know if you're talking only about Linux or meant your
| comment as a generalization, but have you heard of the
| in-place APFS migration from HFS+?
| bayindirh wrote:
| Similarly, Microsoft offered FAT32 to NTFS in-place
| migration, and it did the required checks _before starting_
| to ensure it would complete successfully. That was more than
| 20 years ago, IIRC.
| londons_explore wrote:
| The design of the convert utility is pretty good - the
| convert is effectively atomic - at any point during
| conversion, if you kill power midway, the disk is either a
| valid ext4 filesystem, or a valid btrfs filesystem.
| thfuran wrote:
| >That seems like a crazy thing to try to do
|
| It seems like a reasonable thing to want to do. Would you
| never update an installed application or the kernel to get
| new features or fixes? I don't really think there's much
| fundamental difference. If what you mean is "it seems likely
| to fail catastrophically", well that seems like an indication
| that either the converter or the target filesystem isn't in a
| good state.
| bartvk wrote:
| > It seems like a reasonable thing to want to do
|
| This was actually a routine thing to do, under iOS and
| macOS, with the transition from HFS to APFS.
| thijsvandien wrote:
| Don't forget about FAT32 to NTFS long before that.
| KronisLV wrote:
| > Would you never update an installed application or the
| kernel to get new features or fixes?
|
| Honestly, if that was an _option_ without the certainty of
| getting pwned due to some RCE down the road, then yes,
| there are cases where I absolutely wouldn't want to update
| some software and just have it chugging away for years in
| its present functional state.
| thfuran wrote:
| And there are no cases where you actually want a new
| feature?
| yjftsjthsd-h wrote:
| There's a world of difference between a version update and
| completely switching software. Enabling new features on an
| existing ext4 file system is something I would expect to be
| perfectly safe. In-place converting an ext4 file system
| into btrfs... in an ideal world that would work, of course,
| but it sounds vastly more tricky even in the best case.
| orev wrote:
| Live migration of a file system is like replacing the
| foundation of your house while still living in it. It's not
| the same thing as updating apps which would be more like
| redecorating a room. Sure it's possible, but it's very
| risky and a lot more work than simply doing a full backup
| and restore.
| jcalvinowens wrote:
| It's open source: if things don't get fixed, it's because no
| user cares enough to fix it. I'm certainly not wasting my time
| on that, nobody uses ext3 anymore!
|
| You are as empowered to fix this as anybody else, if it
| presents a real problem for you.
| bayindirh wrote:
| You can reliably replace ext3 with ext4 in the GP's comment.
| It's not an extX problem, it's a BTRFS problem.
|
| If nobody cares that much, maybe deprecate the tool, then?
|
| Also, just because it's Open Source (TM) doesn't mean
| developers will accept any patch regardless of its quality.
| Like everything, FOSS is 85% people, 15% code.
|
| > You are as empowered to fix this as anybody else, if it
| presents a real problem for you.
|
| I have reported many bugs in Open Source software. If I had
| the time to study the code and author a fix, I'd submit the
| patch itself, which I have been able to do a couple of times.
| jcalvinowens wrote:
| > If nobody cares that much, maybe deprecate the tool,
| then?
|
| If you think that's the right thing to do, you're as free
| as anybody else to send documentation patches. I doubt
| anybody would argue with you here, but who knows :)
|
| > Also, just because it's Open Source (TM) doesn't mean
| developers will accept any patch regardless of its quality.
|
| Of course not. If you want to make a difference, you have
| to put in the work. It's worth it IMHO.
| bayindirh wrote:
| > If you think that's the right thing to do, you're as
| free as anybody else to send documentation patches :)
|
| That won't do anything. Instead I can start a small
| commotion by sending a small request (to the mailing
| lists) to deprecate the tool, which I don't want to do.
| Because I'm busy. :)
|
| Also, I don't like commotions, and prefer civilized
| discussions.
|
| > Of course not. If you want to make a difference, you
| have to put in the work. It's worth it IMHO.
|
| Of course. This is what I do. For example, I have a
| one-liner in Debian Installer. I had a big patch in GDM, but
| after coordinating with the developers, they decided not to
| merge the fix + new feature.
| jcalvinowens wrote:
| > For example, I have a one liner in Debian Installer. I
| had a big patch in GDM, but after coordinating with the
| developers, they decided to not merge the fix + new
| feature, for example
|
| Surely you were given some justification as to why they
| didn't want to merge it? I realize sometimes these things
| are intractable, but in my experience that's rare...
| usually things can iterate towards a mutually agreeable
| solution.
| bayindirh wrote:
| The sad part is they didn't.
|
| GTK has (or had) a sliding infoline widget, which is used
| to show notifications. GDM used it for password related
| prompts. It actually relayed PAM messages to that widget.
|
| We were doing mass installations backed by an LDAP server
| which had password policies, including expiration
| enabled.
|
| That widget had a bug that prevented it from displaying a
| new message when it was in the middle of an animation, which
| effectively ate the messages related to LDAP (Your password
| will expire in X days, etc.).
|
| Also we needed a keyboard selector in that window, which
| was absent.
|
| I gave a heads-up to the GDM team, and they sent a "go
| ahead" as a reply. I wrote an elaborate patch, which they
| rejected, wanting a simpler one. I iterated the way they
| wanted; they said it passed muster and would be merged.
|
| A further couple of mails were never answered. I was
| basically ghosted.
|
| But the merge never came, and I moved on.
| baq wrote:
| This sort of thing happens _all the time_ in big orgs
| which develop a single product internally with a closed
| source... there isn't any fix for this part of human
| nature, apparently.
| Zardoz84 wrote:
| For me it's a problem of the tool converting EXT to BTRFS
| in place, not necessarily a problem of BTRFS.
| bayindirh wrote:
| I tried to say it's a problem of BTRFS the project. Not
| BTRFS the file system.
| noncoml wrote:
| > You are as empowered to fix this as anybody else, if it
| presents a real problem for you.
|
| I'm getting sick and tired of this argument. It's very user
| unfriendly
|
| Do you work in software? If yes, do you own a component? Have
| you tried using this argument with your colleagues that have
| no idea about your component?
|
| Of course they theoretically are empowered to fix the bug,
| but that doesn't make it easier. They may have no idea about
| filesystem internals. Or they may be just users with no
| programming background at all.
| romanows wrote:
| I think the parent comment's fix is to not use btrfs and warn
| others about how risky they estimate it is.
| zaggynl wrote:
| For what it's worth, I'm happy with using Btrfs on OpenSUSE
| Tumbleweed and the provided tooling like snapper and restoring
| Btrfs snapshots from grub, saved me a few times.
|
| SSD used: Samsung SSD 970 PRO 1T, same installation since
| 2020-05-02.
| xyzzy_plugh wrote:
| I'm so sad. I was in the btrfs corner for over a decade, and it
| saddens me to say that ZFS has won. But it has.
|
| And ZFS is actually good. I'm happy with it. I don't think about
| it. I've moved on.
|
| Sorry, btrfs, but I don't think it's ever going to work out
| between us. Maybe in a different life.
| doublepg23 wrote:
| I'm personally cheering for Bcachefs now.
| matheusmoreira wrote:
| What saddens me is the fact they _still_ haven't managed
| put ZFS into the Linux kernel tree because of licensing
| nonsense.
| west0n wrote:
| The vast majority of databases currently recommend using XFS.
| west0n wrote:
| I'm very curious if there are any databases running in
| production environments on Btrfs or ZFS.
| yjftsjthsd-h wrote:
| It wasn't a company you've heard of, but I can absolutely
| tell you that there are :) We _really_ benefited from
| compressing our postgres databases; not only did it save
| space, but in some cases it actually improved throughput
| because data could be read and uncompressed faster than the
| disks were physically capable of.
|
| Edit: Also there was a time when it was normal for databases
| to be running on Solaris, in which case ZFS is expected; my
| recollection is that Sun even marketed how great of a
| combination that was.
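|
| (For anyone curious, compression is per-dataset in ZFS and a
| one-liner to enable -- pool/dataset name is a placeholder:
| `zfs set compression=lz4 tank/pgdata`. Note it only applies
| to newly written blocks; existing data stays uncompressed
| until rewritten.)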
| eru wrote:
| Interesting. The compression options that postgres offers
| itself were not enough?
| yjftsjthsd-h wrote:
| It went the other way; we were using ZFS to protect
| against data errors, then found that it did compression
| as well. But looking now, the only native postgres
| options I see are TOAST, which seems to only work for
| certain data types in the database, and WAL compression
| (which has only existed since pg 15), so unless I've missed
| something I would tend to say yes it's far superior to
| the native options.
| riku_iki wrote:
| postgres only allows compressing large text and blob
| column values.
| dfox wrote:
| Sun marketed the combination of PostgreSQL and ZFS quite
| heavily. But IIRC most of the success stories they
| presented involved what was decidedly non-OLTP read heavy
| workload with somewhat ridiculously large databases. One of
| the case studies even involved using PostgreSQL on ZFS as
| an archival layer for Oracle, with the observation that
| what is a large Oracle database is medium-sized one for
| PostgreSQL (and probably even more so if it is read mostly
| and on ZFS).
|
| Sun had some recommendations for tuning PostgreSQL
| performance on ZFS, but the recommendations seemed weird or
| even wrong for OLTP on top of a CoW filesystem (I assume
| these recommendations were meant for the above-mentioned
| DWH/archival scenarios).
| chasil wrote:
| Only 12 pages. Oracle database on ZFS best practices.
|
| https://www.oracle.com/technetwork/server-storage/solaris10/...
| dilyevsky wrote:
| afaik meta (who are a very large btrfs user) do not use it with
| mysql but do use it with rocksdb backends.
|
| i think you can tweak it to make high random write load less
| painful but it generally will struggle with that.
| londons_explore wrote:
| All modern databases do large-block streaming append writes,
| and small random reads, usually of just a handful of files.
|
| It ought to be easy to design a filesystem which can have
| pretty much zero overhead for those two operations. I'm kinda
| disappointed that every filesystem doesn't perform identically
| for the database workload.
|
| I totally understand that different file systems would do
| different tradeoffs affecting directory listing, tiny file
| creation/deletion, traversing deep directory trees, etc. But
| random reads of a huge file ought to perform near identically
| to the underlying storage medium.
| rubiquity wrote:
| What you're describing is basically ext4. I do know there are
| some proprietary databases that use raw block devices. The
| downside being that you also need to make your own user space
| utilities for inspecting and managing it.
| applied_heat wrote:
| Btrfs seems to work perfectly well in Synology NASes. It must
| be some other combination of options or features, beyond what
| Synology uses, that garners the bad reputation.
| gh02t wrote:
| Synology uses BTRFS weirdly, on top of dmraid, which negates
| the most well-known BTRFS bugs in their RAID5/6
| implementation*. Best I can tell they also have some custom
| modifications in their implementation, though it's hard
| to find much info on it.
|
| * FWIW, I used native BTRFS RAID5 for years and never had an
| issue but that's just anecdata
| jcalvinowens wrote:
| This bug is very very rare in practice: all my dev and testing
| machines run btrfs, and I haven't hit it once in 100+ machine-
| hours of running on 6.8-rc.
|
| The actual patch is buried at the end of the article:
| https://lore.kernel.org/linux-btrfs/1ca6e688950ee82b1526bb30...
| tavianator wrote:
| Do they run btrfs on top of dm-crypt? I suspect it's impossible
| to reproduce on a regular block device.
| jcalvinowens wrote:
| Yes, most of them do, precisely because I'm trying to catch
| more bugs :)
| jcalvinowens wrote:
| Did this ever trip any of the debugging ASSERT stuff for you?
| I'm really curious if some more generic debugging instrument
| might be able to flag this failure mode more explicitly, it's
| far from the first ABA problem with ->bflags.
|
| Also, if you don't mind sharing your kconfig that would be
| interesting to see too.
| tavianator wrote:
| No, but I may have been missing some debugging CONFIGs. I
| was just using the stock Arch kconfig.
|
| I did submit a patch to add a WARN_ON() for this case:
| https://lore.kernel.org/linux-btrfs/d4a055317bdb8ecbd7e6d9bd...
|
| But to catch the general class of bugs you'd need something
| like KCSAN. I did try that but the kernel is not KCSAN-
| clean so it was hard to know if any reports were relevant.
| jcalvinowens wrote:
| The WARN is great.
|
| Something as general as KCSAN isn't necessary: it's a
| classic ABA problem; double-checking the ->bflags value
| on transitions is sufficient to catch it. Like a lockdep-
| style thing where *_bit() are optionally replaced by
| helpers that check the current value and WARN if the
| transition is unexpected.
|
| Using the event numbering from your patch description,
| such a thing would have flagged seeing UPTODATE at (2).
| But the space of invalid transitions is much larger
| than the space of valid ones, which is why I think it
| might help catch other future bugs sooner.
|
| Dunno if it's actually worth it, but I definitely recall
| bugs of this flavor in the past. It'll take some work to
| unearth them all, alas...
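|
| A toy model of that idea (Python standing in for the kernel's
| C bitops; the flag values and the transition table are made
| up for illustration, not taken from btrfs):
|
|   # Each checked set is validated against a whitelist of
|   # states the transition may legally start from, the way
|   # lockdep validates lock ordering.
|   UPTODATE, READING, DIRTY = 0x1, 0x2, 0x4
|
|   # flag being set -> states the transition is valid from
|   VALID_FROM = {
|       READING: {0, DIRTY},   # never while already UPTODATE
|       UPTODATE: {READING},
|   }
|
|   def checked_set_bit(state: int, flag: int) -> int:
|       allowed = VALID_FROM.get(flag)
|       if allowed is not None and state not in allowed:
|           print(f"WARN: set {flag:#x} from state {state:#x}")
|       return state | flag
|
|   # Event (2) from the patch description: a read starting on
|   # a buffer that is already UPTODATE would now get flagged.
|   checked_set_bit(UPTODATE, READING)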
| tavianator wrote:
| I like that idea, but it isn't really compatible with my
| fix. `bflags` still makes the same transitions, I just
| skip the actual read in the `UPTODATE | READING` case.
| jcalvinowens wrote:
| I think after your fix the set of valid states in that
| spot would just include UPTODATE. But this is all very
| poorly thought out on my part...
| lxgr wrote:
| 100 error-free machine hours isn't exactly evidence of anything
| when it comes to FS bugs, though.
| jcalvinowens wrote:
| Of course it is: it's evidence the average person will never
| hit this bug. Statistics, and all that.
|
| Having worked on huge deployments of Linux servers before, I
| can tell you that modelling race condition bugs as having a
| uniform random chance of happening per unit time is
| _shockingly_ predictive. But it's not proof, obviously.
| lxgr wrote:
| > modelling race condition bugs as having a uniform random
| chance of happening per unit time is shockingly predictive
|
| I don't generally disagree with that methodology, but 100
| hours is just not a lot of time.
|
| If you have a condition that takes, on average, 1000 hours
| to occur, you have a 9 in 10 chance of missing it based on
| 100 error-free hours observed, and yet it will still affect
| nearly 100% of all of your users after a bit more than a
| month!
|
| For file systems, the aim should be (much more than) five
| nines, not nine fives.
| jcalvinowens wrote:
| > If you have a condition that takes, on average, 1000
| hours to occur, you have a 9 in 10 chance of missing it
| based on 100 error-free hours observed
|
| Yes. Which means 9/10 users who used their machines 100
| or fewer hours on the new kernel will never hit the
| hypothetical bug. Thank you for proving my point!
|
| I'm not a filesystem developer, I'm a user: as a user, I
| don't care about the long tail, I only care about the
| average case as it relates to my deployment size. As you
| correctly point out, my deployment is of negligible size,
| and the long tail is far far beyond my reach.
|
| Aside: your hypothetical event has a 0.1% chance of
| happening each hour. That means it has a 99.9% chance of
| not happening each hour. The odds it doesn't happen after
| 100 hours is 0.999^100, or 90.5%. I think you know that,
| I just don't want a casual reader to infer it's 90%
| because 1-(100/1000) is 0.9.
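|
| (Checking that arithmetic, plus the fleet-level flip side:)
|
|   p = 1 / 1000                # hypothetical hourly hit rate
|   print((1 - p) ** 100)       # ~0.905: one machine sees
|                               # nothing in 100 hours
|   print(1 - (1 - p) ** 1000)  # ~0.632: one machine hits it
|                               # within 1000 hours
|   print(1 - ((1 - p) ** 100) ** 1000)  # ~1.0: someone in a
|                               # 1000-machine fleet hits it
|                               # within their first 100 hours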
| lxgr wrote:
| > Which means 9/10 users who used their machines 100 or
| fewer hours on the new kernel will never hit the bug.
|
| No, that's not how probabilities work at all for a bug
| that happens with uniform probability (i.e. _not_ bugs
| that deterministically happen after n hours since boot).
| If you have millions of users, some of them will hit it
| within hours or even minutes after boot!
|
| > As you correctly point out, my deployment is of
| negligible size, and the long tail is far far beyond my
| reach.
|
| So you don't expect to accrue on the order of 1000
| machine-hours in your deployment? That's only a month for
| a single machine, or half a week for 10. That would be
| way too much for me even for my home server RPi, let
| alone anything that holds customer data.
|
| > I'm not a filesystem developer, I'm a user: I don't
| care about the long tail, I only care about the average
| case as it relates to my deployment size.
|
| Yes, but unfortunately you seem to either have the math
| completely wrong or I'm not understanding your deployment
| properly.
| jcalvinowens wrote:
| > So you don't expect to accrue on the order of 1000
| machine-hours in your deployment?
|
| The 1000 number came from you. I have no idea where you
| got it from. I suspect the "real number" is several
| orders of magnitude higher, but I have no idea, and it's
| sort of artificial in the first place.
|
| My overarching point is that mine is such a vanishingly
| small portion of the universe of machines running btrfs
| that I am virtually guaranteed that bugs will be found
| and fixed before they affect me, exactly as happened
| here. Unless you run a rather large business, that's
| probably true for you too.
|
| The filesystem with the most users has the least bugs.
| Nothing with the feature set of btrfs has even 1% the
| real world deployment footprint it does.
|
| > If you have millions of users, some of them will hit it
| within hours or even minutes after boot!
|
| This is weirdly sensationalist: I don't get it. Nobody
| dies when their filesystem gets corrupted. Nobody even
| loses money, unless they've been negligent. At worst it's
| a nuisance to restore a backup.
| lxgr wrote:
| > The 1000 number came from you. I have no idea where you
| got it from,
|
| It's an arbitrary example of an error rate you'd have a
| 90% chance of missing in your sample size of 100 machine-
| hours, yet much too high for almost any meaningful
| application.
|
| I have no idea what the actual error rate of that btrfs
| bug is; my only point is that your original assertion of
| "I've experienced 100 error-free hours, so this is a non-
| issue for me and my users" is a non sequitur.
|
| > This is weirdly sensationalist: I don't get it. Nobody
| dies when their filesystem gets corrupted. Nobody even
| loses money, unless they've been negligent.
|
| I don't know what to say to that other than that I wish I
| had your optimism on reliable system design practices
| across various industries.
|
| Maybe there's a parallel universe where people treat
| every file system as having an error rate of something
| like "data corruption/loss once every four days", but
| it's not the one I'm familiar with.
|
| For better or worse, the bar for file system reliability
| is much, much, much, much higher than anything you could
| reasonably produce empirical data for unless you're
| operating at Google/AWS etc. scale.
| jcalvinowens wrote:
| > "I've experienced 100 error-free hours, so this is a
| non-issue for me and my users"
|
| It's a statement of fact: it has been a non-issue for me.
| If you're like me, it's statistically reasonable to
| assume it will be a non-issue for you too. Also, no
| users, just me. "Probably okay" is more than good enough
| for me, and I'm sure many people have similar
| requirements (clearly not you).
|
| I have no optimism, just no empathy for the negligent: I
| learned my lesson with backups a long time ago. Some
| people blame the filesystem instead of their backup
| practices when their data is corrupted, but I think
| that's naive. The filesystem did you a favor, fix your
| shit. Next time it will be your NAS power supply frying
| your storage.
|
| It's also a double edged sword: the more reliable a
| filesystem is, the longer users can get away without
| backups before being bitten, and the greater their
| ultimate loss will be.
| lxgr wrote:
| > It's a statement of fact: it has been a non-issue for
| me.
|
| Yes...
|
| > If you're like me, it's statistically reasonable to
| assume it will be a non-issue for you too.
|
| No! This simply does not follow from the first statement,
| statistically or otherwise.
|
| You and I might or might not be fine; you having been
| fine for 100 hours on the same configuration just offers
| next-to-zero predictive power for that.
| jcalvinowens wrote:
| > No! This simply does not follow from the first
| statement, statistically or otherwise.
|
| > You and I might or might not be fine; you having been
| fine for 100 hours on the same configuration just offers
| next-to-zero predictive power for that.
|
| You're missing the forest for the trees here.
|
| It is predictive _ON AVERAGE_. I don't care about the
| worst case like you do: I only care about the _expected
| case_. If I died when my filesystem got corrupted... I
| would hope it's obvious I wouldn't approach it this way.
|
| Adding to this: my laptop has this btrfs bug right now.
| I'm not going to do anything about it, because it's not
| worth 20 minutes of my time to rebuild my kernel for a
| bug that is unlikely to bite before I get the fix in
| 6.9-rc1, and would only cost me 30 minutes of time in the
| worst case if it did.
|
| I'll update if it bites me. I've bet on much worse poker
| hands :)
| lxgr wrote:
| Well, from your data (100 error-free hours, sample size
| 1) alone, we can only conclude this: "The bug probably
| happens less frequently than every few hours".
|
| Is that reliable enough for you? Great! Is that "very
| rare"? Absolutely not for almost any type of
| user/scenario I can imagine.
|
| If you're making any statistical arguments beyond that
| data, or are implying more data than that, please provide
| either, otherwise this will lead nowhere.
| kbolino wrote:
| A single 9 in reliability over 100 hours would be
| colossally bad for a filesystem. For the average office
| user, 100 hours is not even a month's worth of daily use.
|
| Even as an anecdote this is completely useless. A couple
| thousand hours and dozens of mount/unmount cycles would
| just be a good start.
| paulddraper wrote:
| > Yes. Which means 9/10 users who used their machines 100
| or fewer hours on the new kernel will never hit the
| hypothetical bug. Thank you for proving my point!
|
| So....that's really bad.
| jcalvinowens wrote:
| You edited this in after I replied, or maybe I missed it:
|
| > If you have a condition that takes, on average, 1000
| hours to occur, you have a 9 in 10 chance of missing it
| based on 100 error-free hours observed, and yet it will
| still affect nearly 100% of all of your users after a bit
| more than a month!
|
| I understand the point you're trying to make here, but
| 1000 is just an incredibly unrealistically small number
| if we're modelling bugs like that. The real number might
| be on the order of millions. The effect you're describing
| in real life might take decades: weeks is unrealistic.
| lxgr wrote:
| I agree that a realistic error rate for this particular
| bug is much lower than 1 in 1000 hours (or it would have
| long been caught by others).
|
| But that makes your evidence of 100 error-free hours
| _even less useful_ to make any predictions about
| stability!
| 7bit wrote:
| > Of course it is: it's evidence the average person will
| never hit this bug. Statistics, and all that.
|
| Anecdotal statistics maybe.
| BenjiWiebe wrote:
| If I'm running btrfs on my NAS, that's only ~4 days of
| runtime. If there's a bug that trashes the filesystem every
| month on average, that's really bad and yet is very
| unlikely to get caught in 4 days of running.
| valicord wrote:
| You do understand that 100 hours is not even 5 days, right?
| streb-lo wrote:
| As someone who uses a rolling release, I use btrfs because I
| don't want to deal with keeping ZFS up to date.
|
| It's been really good for me. And btrbk is the best backup
| solution I've had on Linux, btrfs send/receive is a lot faster
| than rsync even when sending non-incremental snapshots.
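|
| (The flow in question, sketched with placeholder paths --
| btrbk automates this kind of snapshot/send pipeline:)
|
|   # incremental_send.py (sketch): take a read-only snapshot,
|   # then send only the delta against the previous snapshot.
|   import subprocess
|
|   prev = "/home/.snapshots/home.prev"
|   new = "/home/.snapshots/home.new"
|   subprocess.run(["btrfs", "subvolume", "snapshot", "-r",
|                   "/home", new], check=True)
|   send = subprocess.Popen(["btrfs", "send", "-p", prev, new],
|                           stdout=subprocess.PIPE)
|   subprocess.run(["btrfs", "receive", "/mnt/backup"],
|                  stdin=send.stdout, check=True)
|   send.wait()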
| kccqzy wrote:
| Same here: I use a rolling release and btrfs. Personally I
| really enjoy btrfs's snapshot feature. Most of the time when I
| need backups it's not because of a hardware failure but because
| of a fat finger mistake where I rm'ed a file I need. Periodic
| snapshots completely solved that problem for me.
|
| (Of course, backing up to another disk is absolutely still
| needed, but you probably need it less than you think.)
| matja wrote:
| > I don't want to deal with keeping ZFS up to date
|
| That's what DKMS is for, which most distros use for ZFS.
| Install and forget.
| streb-lo wrote:
| I think the larger issue is that openzfs doesn't move in sync
| with the kernel so you have to check
| https://raw.githubusercontent.com/openzfs/zfs/master/META to
| make sure you can actually upgrade each time. On a rolling
| distro this is a pretty common breakage AFAIK. It's not the
| end of the world, but it is annoying.
| yjftsjthsd-h wrote:
| Depends on the rolling release; some distros specifically
| provide a kernel package that is still rolling, but is also
| always the latest version compatible with ZFS.
| TZubiri wrote:
| reminder that btrfs stands for butter filesystem and not better
| filesystem
| cies wrote:
| I had my /home on a subvolume (great as the sub and super share
| the same space).
|
| When I wanted to reinstall I naively thought I could format the
| root super volume and keep the /home subvolume -- but this was
| impossible: I had to format /home as well according to the
| OpenSUSE Tumbleweed installer.
|
| Major problem for me. I now have separate root (btrfs) and home
| (ext4) partitions.
| stryan wrote:
| You can do it but it's not a very happy path. Easiest way is
| probably to map the old home subvolume into a different path
| and either re-label the subvolumes once you're installed or
| just copy everything over.
|
| Separate BTRFS root and ext4 home partitions are either the
| default filesystem layout now (if you're not doing FDE) or
| the second recommended one.
| apitman wrote:
| I really wish you could use custom filesystems such as btrfs with
| WSL2. I don't think there's currently any way to do snapshotting,
| which means you can never be sure a backup taken within WSL
| isn't corrupt.
| chasil wrote:
| Can this be used? I know ReactOS uses it natively.
|
| https://github.com/maharmstone/btrfs
| apitman wrote:
| Maybe that could be adapted, but I don't think it would solve
| my problem currently. Basically I want to be able to do btrfs
| snapshots from within my WSL distros, so that I can run
| restic or similar on the snapshots.
| aseipp wrote:
| Hell, even just being able to use XFS would be an improvement,
| because ext4 has painful degradation scenarios when you hit
| cases like exhausting the inode count.
|
| (Somewhat related, but there has been a WIP 6.1 kernel for WSL2
| "in preview" for a while now... I wonder why it hasn't become
| the default considering both it and 5.15 are LTS... For
| filesystems like btrfs I often want a newer kernel to pick up
| every bugfix.)
| cogman10 wrote:
| You can, it's just a bit of a pain.
|
| https://blog.bryanroessler.com/2020-12-14-btrfs-on-wsl2/
| apitman wrote:
| Nice. Unfortunately for my use case I can't use physical
| storage devices. I need something similar to a qcow2 that
| can be created and completely managed by my app.
| mappu wrote:
| Hope that https://github.com/veeam/blksnap/issues/2 becomes
| available soon. It's on v7 of posting to linux-block and will
| make snapshotting available for all block devices.
| apitman wrote:
| This is the first I've heard of blksnap. Looks very
| interesting. There's not much documentation in the repo. Am I
| understanding correctly that if I were to build a custom WSL
| kernel with those patches, I would be able to do snapshotting
| in WSL today?
|
| Can you give or link to a brief description of how blksnap
| works?
| JonChesterfield wrote:
| The visualisation of the data race in this post is superb. Worth
| reading just for that.
|
| Handrolled concurrency using atomic integers in C. Without the
| proof system or state machine model to support it. Seems likely
| that's not going to be their only race condition.
| re wrote:
| The animations also stood out to me. I took a look at the
| source to see whether they were built with a library or
| written completely by hand, and was surprised to see that
| they were entirely CSS-based, no JavaScript required
| (although the "metadata"
| overwrite animation is glitched if you disable JS, for some
| reason not immediately apparent to me).
___________________________________________________________________
(page generated 2024-03-20 23:00 UTC)