[HN Gopher] Bcachefs - A New COW Filesystem
___________________________________________________________________
Bcachefs - A New COW Filesystem
Author : jlpcsl
Score : 263 points
Date : 2023-05-11 08:50 UTC (14 hours ago)
(HTM) web link (lore.kernel.org)
(TXT) w3m dump (lore.kernel.org)
| graderjs wrote:
| Is there an optimal filesystem or is it all just trade-offs? And
| how far have we come since we were first creating file systems
| (Plan 9 or whatever) to now? Has there been any sort of
| technological leap, like a killer algorithm, that really improved
| things?
| dsr_ wrote:
| An optimal filesystem... for what?
|
| There is no single filesystem which is optimized for
| everything, so you need to specify things like
|
| cross-platform transportability, network transparency, hardware
| interfaces, hardware capability, reliability requirements,
| cost-effectiveness, required features, expected workload,
| licensing
|
| and what the track record is in the real world.
| the8472 wrote:
| It's all tradeoffs.
| jacknews wrote:
| "The COW filesystem for Linux that won't eat your data"
|
| LOL, they know what the problem is at least. I will try it out on
| some old hard disks. The others (esp. looking at you btrfs) are
| not good at not losing your entire volumes when disks start to go
| bad.
| eis wrote:
| I really hope Linux can get a modern FS into common usage (as in
| default FS for most distros). After more than a decade ZFS and
| BTRFS didn't go anywhere. Something that's just there as a
| default, is stable, performs decently (at least on ext4 level)
| and brings modern features like snapshots. Bcachefs seems to have
| a decent shot.
|
| What I'd like to see even more though would be a switch from the
| existing POSIX-based filesystem APIs to a transaction-based
| system. It is way too complicated to do filesystem operations in
| a way that isn't prone to data corruption if anything goes wrong.
| viraptor wrote:
| Btrfs is the default on a few systems already, like Fedora,
| SUSE, Garuda, EasyNAS, Rockstor, and some others. It's not the
| default in Ubuntu and Debian, but I wouldn't say it didn't go
| anywhere either.
| jadbox wrote:
| I'm using Btrfs on Fedora (the default install) and it's been
| great over the last year.
|
| The only thing to be aware of is to disable CoW/Hashing on
| database stores or streaming download folders. Otherwise
| it'll rehash each file update, which isn't needed.
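|
| For example (a rough sketch, assuming the database lives under
| /var/lib/postgresql; on btrfs, new files created in a directory
| marked with chattr +C are nodatacow, existing files are not
| converted):
|
|     mkdir -p /var/lib/postgresql
|     chattr +C /var/lib/postgresql    # disable CoW for new files here
|     lsattr -d /var/lib/postgresql    # verify the 'C' attribute is set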
| giantrobot wrote:
| Why is it hashing files and not blocks? If a block is
| hashed and written there's no need to touch it again.
| dnzm wrote:
| I've been running it on my NAS-slash-homeserver for... 5 or
| 6 years now, I think. Root on a single SSD, data on a few
| HDDs in RAID1. It's been great so far. My desktops are all
| btrfs too, and the integration between OpenSUSE's package
| manager and btrfs snapshots has been useful more than once.
| curt15 wrote:
| It looks like Fedora's adoption of btrfs unearthed another
| data corruption bug recently:
| https://bugzilla.redhat.com/show_bug.cgi?id=2169947
| hackernudes wrote:
| Wow, that's funny - almost looks like bcachefs explaining a
| similar issue here https://lore.kernel.org/lkml/20230509165
| 657.1735798-7-kent.o...
| johnisgood wrote:
| HAMMER2 supports snapshots. I do not have any experience with it
| though.
| aidenn0 wrote:
| Is HAMMER2 supported on Linux? I thought it was Dragonfly
| only.
| joshbaptiste wrote:
| yup Dragonfly and soon NetBSD
| https://www.phoronix.com/news/NetBSD-HAMMER2-Port
| renewiltord wrote:
| Is there a high-performance in-kernel FS that acts as a
| hierarchical cache that I can export over NFS? Presently I use
| `catfs` over `goofys` and then I export the `catfs` mount.
| cdavid wrote:
| Not sure I understand your use case, but if you have to use
| nfs, cachefilesd is very effective for read heavy workload:
| https://access.redhat.com/documentation/en-us/red_hat_enterp...
| throw0101b wrote:
| > _These are RW btrfs-style snapshots_
|
| There's a word for 'RW snapshots': clones. E.g.
|
| * https://docs.netapp.com/us-en/ontap/task_admin_clone_data.ht...
|
| * http://doc.isilon.com/onefs/9.4.0/help/en-us/ifs_t_clone_a_f...
|
| * https://openzfs.github.io/openzfs-docs/man/8/zfs-clone.8.htm...
|
| * http://www.voleg.info/lvm2-clone-logical-volume.html
|
| In every other implementation I've come across the word
| "snapshot" is about read-only copies. I'm not sure why btrfs (and
| now bcachefs?) thinks it needs to muddy the nomenclature waters.
| webstrand wrote:
| Cloning can also mean simple duplication. I think calling it a
| RW snapshot is clearer because a snapshot generally doesn't
| mean simple duplication.
| throw0101a wrote:
| > _I think calling it a RW snapshot_ [...]
|
| So what do you call a RO snapshot? Or do you now need to
| write the prefix "RO" and "RW" _everywhere_ when referring to
| a "snapshot"?
|
| How do CLI commands work? Will you have "btrfs snapshot" and
| then have to always define whether you want RO or RW on every
| invocation? This smells like git's bad front-end CLI
| porcelain all over again (regardless of how nice the back-end
| plumbing may be).
|
| This is a solved problem with an established nomenclature
| IMHO: just use the already-existing nouns/CLI-verbs of
| "snapshot" and "clone".
|
| > [...] _is clearer because a snapshot generally doesn't
| mean simple duplication._
|
| A snapshot generally means a static copy of the data; with
| bcachefs (and ZFS and btrfs) being CoW, new copies are not
| needed unless/until the source is altered.
|
| If you want deduplication use "dedupe" in your CLI.
| nextaccountic wrote:
| > So what do you call a RO snapshot
|
| There should be no read-only snapshot: it's just a writable
| snapshot where you don't happen to perform a write
| throw0101a wrote:
| > _There should be no read-only snapshot: it's just a
| writable snapshot where you don't happen to perform a
| write_
|
| So when malware comes along and goes after the live copy,
| and happens to find the 'snapshot', but is able to hose
| that snapshot data as well, the solution is to go to
| tape?
|
| As opposed to any other file system that implements read-
| only snapshots, if the live copy is hosed, one can simply
| clone/revert to the read-only copy. (This is not a
| hypothetical: I've done this personally.)
|
| (Certainly one should have off-device/site backups, but
| being able to do a quick revert is great for MTTR.)
| londons_explore wrote:
| I would like to see filesystems benchmarked for robustness.
|
| Specifically, robustness to everything around them not performing
| as required. For example, imagine an evil SSD which had a 1%
| chance of rolling a sector back to a previous version, a 1%
| chance of saying a write failed when it didn't, a 1% chance of
| writing data to the wrong sector number, a 1% chance of flipping
| some bits in the written data, and a 1% chance of disconnecting
| and reconnecting a few seconds later.
|
| Real SSDs have bugs that make them do all of these things.
|
| Given this evil SSD, I want to know how long the filesystem can
| keep going while serving the user's use case.
| crest wrote:
| A 1% error rate for corrupting other blocks is prohibitively
| high. A file system would have to do extensive forward error
| correction in addition to checksumming to have a chance of
| working with this. It would also have to perform a lot of
| background scrubbing to stay ahead of the rot. While it is
| interesting to model, and maybe even relevant as a research
| problem given the steadily worsening bandwidth-to-capacity ratio
| of affordable bulk storage, I don't expect there are many users
| willing to accept the overhead required to come even close to a
| usable file system on a device as bad as the one you described.
| comex wrote:
| Well, it's sort of redundant. According to [1], the raw bit
| error rate of the flash memory inside today's SSDs is already
| in the 0.1%-1% range. And so the controllers inside the SSDs
| already do forward error correction, more efficiently than
| the host CPU could do it since they have dedicated hardware
| for it. Adding another layer of error correction at the
| filesystem level could help with some of the remaining
| failure modes, but you would still have to worry about RAM
| bitflips after the data has already been read into RAM and
| validated.
|
| [1] https://ieeexplore.ieee.org/document/9251942
| simcop2387 wrote:
| ZFS will do this. Give it a RAIDz-{1..3} setup and you've got
| the FEC/parity calculations that happen. Every read has its
| checksum checked, and if a read finds issues it'll start
| resilvering them ASAP. You are of course right that it will
| eventually see worse and worse performance as it has to do much
| more rewriting and full-on scrubbing if errors are happening
| constantly, but it can generally handle things pretty well.
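|
| A minimal sketch of that kind of setup (device names are
| illustrative):
|
|     zpool create tank raidz2 sda sdb sdc sdd sde sdf
|     zpool scrub tank        # walk every block and verify checksums
|     zpool status -v tank    # shows checksum errors and resilver progress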
| Dylan16807 wrote:
| I don't know...
|
| Let's say you have 6 drives in raidz2. If you have a 1%
| silent failure chance per block, then writing a set of 6
| blocks has a 0.002% silent failure rate. And ZFS doesn't
| immediately verify writes, so it won't try again.
|
| If that's applied to 4KB blocks, then we have a 0.002%
| failure rate per 16KB of data. It will take about 36
| thousand sets of blocks to reach 50% odds of losing data,
| which is only half a gigabyte. If we look at the larger
| block ZFS uses internally then it's a handful of gigabytes.
|
| And that's without even adding the feature where writing
| one block will corrupt other blocks.
| toxik wrote:
| With this level of adversarial problems, you'd better formulate
| your whole IO stack as a two-player minimax game.
| throw0101a wrote:
| > _For example, imagine an evil SSD which had a 1% chance of
| rolling a sector back to a previous version, a 1% chance of
| saying a write failed when it didn't, a 1% chance of writing
| data to the wrong sector number, a 1% chance of flipping some
| bits in the written data, and a 1% chance of disconnecting and
| reconnecting a few seconds later._
|
| There are stories from the ZFS folks of dealing with these
| issues and things ran just fine.
|
| While not directly involved with ZFS development (IIRC), Bryan
| Cantrill was very 'ZFS-adjacent' since he used Solaris-based
| systems for a lot of his career, and he has several rants about
| firmware that you can find online.
|
| A video that went viral many years ago, with Cantrill and
| Brendan Gregg, is "Shouting in the Datacenter":
|
| * https://www.youtube.com/watch?v=tDacjrSCeq4
|
| * https://www.youtube.com/watch?v=lMPozJFC8g0 (making of)
| amluto wrote:
| ISTM one could design a filesystem as a Byzantine-fault-
| tolerant distributed system that happens to have many nodes
| (disks) partially sharing hardware (CPU, memory, etc). The
| result would not look that much like RAID, but would look quite
| a bit like Ceph and its relatives.
|
| Bonus points for making the result efficiently support multiple
| nodes, each with multiple disks.
| PhilipRoman wrote:
| I have basically the opposite problem. I've been looking for a
| filesystem that maximizes performance (and minimizes actual
| disk writes) at the cost of reliability. As long as it loses
| all my data less than once a week, I can live with it.
| viraptor wrote:
| Have you tried allowing ext4 to ignore all safety?
| data=writeback, barrier=0, bump up dirty_ratio, tune2fs -O
| ^has_journal, maybe disable flushes with
| https://github.com/stewartsmith/libeatmydata
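|
| Roughly, as a sketch (device and mount point are placeholders,
| and this deliberately trades away crash safety):
|
|     mount -o data=writeback,barrier=0,noatime /dev/sdX /scratch
|     sysctl vm.dirty_ratio=80 vm.dirty_background_ratio=50
|
| or go further and drop the journal entirely with
| tune2fs -O ^has_journal (run on the unmounted device) instead of
| data=writeback.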
| PhilipRoman wrote:
| Thanks, this looks promising
| the8472 wrote:
| You can also add journal_async_commit,noauto_da_alloc
|
| > maybe disable flushes with
| https://github.com/stewartsmith/libeatmydata
|
| overlayfs has a volatile mount option that has that effect.
| So stacking a volatile overlayfs with the upper and lower
| on the same ext4 could provide that behavior even for
| applications that can't be intercepted with LD_PRELOAD
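|
| Something along these lines (paths are made up; "volatile" skips
| all syncs on the overlay, so its contents are disposable after a
| crash):
|
|     mkdir -p /data/lower /data/upper /data/work /scratch
|     mount -t overlay overlay \
|       -o lowerdir=/data/lower,upperdir=/data/upper,workdir=/data/work,volatile \
|       /scratch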
| seunosewa wrote:
| How would you cope with losing all you data once a week?
| PhilipRoman wrote:
| "once a week" was maybe a too extreme example. For my case
| specifically: lost data can be recomputed. Basically a
| bunch of compiler outputs, indexes and analysis results on
| the input files, typically an order of magnitude larger
| than the original files themselves.
|
| Any files that are important would go to a separate, more
| reliable filesystem (or uploaded elsewhere).
| kadoban wrote:
| On top of other suggestions I've seen you get already,
| raid0 might be worth looking at. That has some good speed
| vs reliability tradeoffs (in the direction you want).
| dur-randir wrote:
| Some video production workflows are run on 4xraid0 just for
| the speed - it fails rarely enough and intermediate output
| is just re-created.
| desro wrote:
| Can confirm. When I can't work off my internal MacBook
| storage, my working drive is a RAID0 NVME array over
| Thunderbolt. Jobs setup in Carbon Copy Cloner make
| incremental hourly backups to a NAS on site as well as a
| locally-attached RAID6 HDD array.
|
| If the stripe dies, worst case is I lose up to one hour
| of work, plus let's say another hour copying assets back
| to a rebuilt stripe.
|
| There are _so many_ MASSIVE files created in intermediate
| stages of audiovisual [post] production.
| sph wrote:
| It all depends on how much reliability you are willing to
| give up for performance.
|
| Because I have the best storage performance you'll ever find
| anywhere, 100% money-back guaranteed: write to /dev/null. It
| comes with the downside of 0% reliability.
|
| You can write to a disk without a file-system, sequentially,
| until space ends. Quite fast actually, and reliable, until
| you reach the end, then reliability drops dramatically.
| [deleted]
| PhilipRoman wrote:
| Yeah, I've had good experience with bypassing fs layer in
| the past, especially on a HDD the gains can be insane. But
| it won't help as I still need a more-or-less posixy
| read/write API.
|
| P.S. I'm fairly certain that /dev/null would lose my data a
| bit more often than once a week.
| jasomill wrote:
| Trouble is you can't use /dev/null as a filesystem, even
| for testing.
|
| On a related note, though, I've considered the idea of
| creating a "minimally POSIX-compliant" filesystem that
| randomly reorders and delays I/O operations whenever
| standards permit it to do so, along with any other odd
| behavior I can find that remains within the _letter_ of
| published standards (unusual path limitations, support for
| exactly two hard links per file, sparse files that require
| holes to be aligned on 4,099-byte boundaries in spite of
| the filesystem's reported 509-byte block size, etc., all
| properly reported by applicable APIs).
| dralley wrote:
| Cue MongoDB memes
| ilyt wrote:
| Probably just using it for cache of some kind
| magicalhippo wrote:
| Tongue-in-cheek solution: use a ramdisk[1] for dm-writecache
| in writeback mode[2]?
|
| [1]: https://www.kernel.org/doc/Documentation/blockdev/ramdis
| k.tx...
|
| [2]: https://blog.delouw.ch/2020/01/29/using-lvm-cache-for-
| storag...
| bionade24 wrote:
| Not sure if this is feasible, but have you considered dumping
| binary data onto the raw disk, like is done with tapes?
| crabbone wrote:
| You could even use partitions as files. You could only have
| 128 files, but maybe that's enough for OP?
| rwmj wrote:
| Don't you want a RAM disk for this? It'll lose all your data
| (reliably!) when you reboot.
|
| You could also look at this:
| https://rwmj.wordpress.com/2020/03/21/new-nbdkit-remote-
| tmpf... We use it for Koji builds, where we actually don't
| care about keeping the build tree around (we persist only the
| built objects and artifacts elsewhere). This plugin is pretty
| fast for this use case because it ignores FUA requests from
| the filesystem. Obviously don't use it where you care about
| your data.
| ilyt wrote:
| > Don't you want a RAM disk for this? It'll lose all your
| data (reliably!) when you reboot.
|
| Uhh hello, pricing ?
| fwip wrote:
| Depends on how much space you need.
| mastax wrote:
| You should be able to do this with basically any file system
| by using the mount options `async` (the default) and `noatime`,
| disabling journalling, and massively increasing
| vm.dirty_background_ratio, vm.dirty_ratio, and
| vm.dirty_expire_centisecs.
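|
| As a rough sketch of those sysctls (values are arbitrary
| examples, not recommendations):
|
|     # /etc/sysctl.d/99-lazy-writeback.conf
|     # let dirty pages pile up and sit for ~10 minutes before writeback
|     vm.dirty_background_ratio = 50
|     vm.dirty_ratio = 90
|     vm.dirty_expire_centisecs = 60000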
| rwmj wrote:
| nbdkit memory 10G --filter=error error-rate=1%
|
| ... and then nbd-loop-mount that as a block device and create
| your filesystem on top.
|
| Notes:
|
| We're working on making a ublk interface to nbdkit plugins so
| the loop mount wouldn't be needed.
|
| There are actually better ways to use the error filter, such as
| triggering it from a file, see the manual:
| https://www.libguestfs.org/nbdkit-error-filter.1.html
|
| It's an interesting idea to have an "evil" filter that flips
| bits at random. I might write that!
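|
| A rough sketch of the whole stack (exact client flags may vary;
| mkfs target and mount point are placeholders):
|
|     nbdkit memory 10G --filter=error error-rate=1%
|     modprobe nbd
|     nbd-client localhost /dev/nbd0    # attach the export as a block device
|     mkfs.ext4 /dev/nbd0
|     mount /dev/nbd0 /mnt/faulty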
| antongribok wrote:
| How does this compare with dm-flakey [0] ?
|
| [0]: https://www.kernel.org/doc/html/latest/admin-
| guide/device-ma...
| Dwedit wrote:
| Suddenly I'm reminded of the time someone made a Bad Internet
| simulator (causing packet loss or other problems) and named
| the program "Comcast".
| seized wrote:
| I've had ZFS pools survive (at different times over the span of
| years):
|
| - A RAIDz1 (RAID5) pool with a second disk start failing while
| rebuilding from an earlier disk failure (data was fine)
|
| - A water to air CPU cooler leaking, CPU overheated and the
| water shorted and killed the HBA running a pool (data was fine)
|
| - An SFF-8088 cable half plugged in for months, pool would
| sometimes hiccup, throw off errors, take a while to list files,
| but worked fine after plugging it in properly (data was fine
| after)
|
| Then the usual disk failures which are a non-event with ZFS.
| gigatexal wrote:
| This is why I always opt for ZFS.
| ilyt wrote:
| I recovered from a 3-disk RAID 6 failure (which was itself driven
| by organizational failure...) in Linux's mdadm... ddrescue to the
| rescue. I guess I got "lucky" that the bad blocks didn't land in
| the same place on all drives (one died, another started returning
| bad blocks), but the chance of that happening is infinitesimally
| small in the first place
|
| So _shrug_
| avianlyric wrote:
| How do you know you got lucky with corrupted blocks?
|
| mdadm doesn't checksum data, and just trusts the HDD to
| either return correct data, or an error. But HDDs return
| incorrect data all the time, their specs even tell you how
| much incorrect data they'll return, and for anything over
| about 8TB you're basically guaranteed some silent
| corruption if you read every byte.
| johnmaguire wrote:
| Yes, I also had a "1 disk failed, second disk failed during
| rebuild" event (like the parent, not your story) with mdadm
| & RAID 6 with no issues.
|
| People seem to love ZFS but I had no issues running mdadm.
| I'm now running a ZFS pool and so far it's been more work
| (and things to learn), requires a lot more RAM, and the
| benefits are... escaping me.
| ysleepy wrote:
| Are you sure the data survived? ZFS is sure and proves it
| with checksums over metadata and data. I don't know mdadm
| well enough to know if it does this too.
| 112233 wrote:
| Please, where can I read more about this? I remember bricking an
| OCZ drive by setting an ATA password, as was fashionable to do
| back then, but 1% of writes going to the wrong sector - what are
| these drives, fake SD cards from AliExpress?
|
| Like, which manufacturer goes, like "tests show we cannot write
| more than 400kB without corrupting the drive, let us ship this!"?
| jlokier wrote:
| _> Like, which manufacturer goes, like "tests show we cannot
| write more than 400kB without corrupting the drive, let us ship
| this!"?_
|
| According to https://www.sqlite.org/howtocorrupt.html there
| are such drives:
|
| _4.2. Fake capacity USB sticks_
|
| _There are many fraudulent USB sticks in circulation that
| report to have a high capacity (ex: 8GB) but are really only
| capable of storing a much smaller amount (ex: 1GB). Attempts
| to write on these devices will often result in unrelated
| files being overwritten. Any use of a fraudulent flash memory
| device can easily lead to database corruption, therefore.
| Internet searches such as "fake capacity usb" will turn up
| lots of disturbing information about this problem._
|
| Bit flips and overwriting of wrong sectors from unrelated files
| being written are also mentioned. You might think this sort
| of thing is just cheap USB flash drives, but I've been told
| about NVMe SSDs violating their guarantees and causing very
| strange corruption patterns too. Unfortunately when the cause
| is a bug in the storage's device own algorithms, the rare
| corruption event is not necessarily limited to a few bytes or
| just 1 sector here or there, nor to just the sectors being
| written.
|
| I don't know how prevalent any of these things are really.
| The sqlite.org page says "most" consumer HDDs lie about
| committing data to the platter before reporting they've done so, but
| when I worked with ext3 barriers back in the mid 2000s, the
| HDDs I tested had timing consistent with flushing write cache
| correctly, and turning off barriers did in fact lead to
| observable filesystem corruption on power loss, which was
| prevented by turning on barriers. The barriers were so
| important they made the difference between embedded devices
| that could be reliably power cycled, vs those which didn't
| reliably recover on boot.
| Dylan16807 wrote:
| > Fake capacity USB sticks
|
| Those drives have a sudden flip from 0% corruption to
| 95-100% corruption when you hit their limits. I wouldn't
| count that as the same thing. And you can't reasonably
| expect anything to work on those.
|
| > The sqlite.org page says "most" consumer HDDs lie about
| committing data to the platter before reporting they've done
| so, but when I worked with ext3 barriers back in the mid
| 2000s, the HDDs I tested had timing consistent with
| flushing write cache correctly
|
| Losing a burst of writes every once in a while also
| manifests extremely differently from steady 1% loss and
| needs to be handled in a very different way. And if it's at
| power loss it might be as simple as rolling back the last
| checkpoint during mount if verification fails.
| londons_explore wrote:
| Writes going to the wrong sector are usually wear levelling
| algorithms gone wrong. Specifically, it normally means the
| information about which logical sector maps to which physical
| sector was updated not in sync with the actual writing of the
| data. This is a common performance 'trick' - by delaying and
| aggregating these bookkeeping writes, and taking them off the
| critical path, you avoid writing so much data and the user
| sees lower latency.
|
| However, if something like a power failure or firmware crash
| happens, and the bookkeeping writes never happen, then the
| end result that the user sees after a reboot is their data
| written to the wrong sector.
| jeffbee wrote:
| But that would require Linux hackers to read and understand the
| literature, and absorb the lessons of industry practice,
| instead of just blurting out their aesthetic ideal of a
| filesystem.
| sangnoir wrote:
| Isn't Linux the most deployed OS in industry (by
| practitioners)? Are the Linux hyperscalers hiding their FS
| secret-sauce, or perhaps the "aesthetic ideal" filesystems
| available to Linux good enough?
| jeffbee wrote:
| I imagine the hyperscalers are all handling integrity at a
| higher level where individual filesystems on single hosts
| are irrelevant to the outcome. In such applications, any
| old filesystem will do.
|
| For people who do not have application-level integrity, the
| systems that offer robustness in the face of imperfect
| storage devices are sold by companies like NetApp, which a
| lot of people would sneer at but they've done the math.
| Datagenerator wrote:
| Seen NetApp's boot messages, it's FreeBSD under the hood
| seized wrote:
| As is EMC Isilon.
| jeffbee wrote:
| They have their own filesystem with all manner of
| integrity protection.
| https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout
| deathanatos wrote:
| On the whole, I'm not sure that a FS can work around such a
| byzantine drive, at least not if it's the only such drive in the
| system. I'd rather FSes not try to pave over these: these disks
| are faulty, and we need to demand, with our wallets, better
| quality hardware.
|
| > _1% chance of disconnecting and reconnecting a few seconds
| later_
|
| I actually have such an SSD. It's unusable, when it is in that
| state. The FS doesn't corrupt the data, but it's hard for the
| OS to make forward progress, and obviously a lot of writes fail
| at the application level. (It's a shitty USB implementation on
| the disk: it disconnects if it's on a USB-3 capable port, and
| too much transfer occurs. It's USB-2, though; connecting it to
| a USB-2-only port makes it work just fine.)
| mprovost wrote:
| At the point where it disappears for seconds it's a
| distributed system not an attached disk. At this point you
| have to start applying the CAP theorem.
|
| At least in Unix the assumption is that disks are always
| attached (and reliable...) so write errors don't typically
| bubble up to the application layer. This is why losing an NFS
| mount typically just hangs the system until it recovers.
| deathanatos wrote:
| > _At least in Unix the assumption is that disks are always
| attached (and reliable...)_
|
| I want to say a physical disk being yanked from the system
| (essentially what was happening, as far as the OS could
| tell) does cause I/O errors in Linux? I could be wrong
| though, this isn't exactly something I try to exercise.
|
| As for it being a distributed system ... I suppose? But
| that's what an FS's log is for: when the drive reconnects,
| there will either be a pending WAL entry, or not. If there
| is, the write can be persisted, otherwise, it is lost. But
| consistency should still happen.
|
| Now, an _app_ might not be ready for that, but that's an
| app bug.
|
| But it can always happen that the power goes out, which in
| my situation is equivalent to a disk yank. There's also
| small children, the SO tripping over a cable, etc.
|
| But these are different failure modes from some of what the
| above post listed, such as disks undoing acknowledged
| writes, or lying about having persisted the write. (Some of
| the examples are byzantine, some are not.)
| [deleted]
| brnt wrote:
| Erasure coding at the filesystem level? Finally!
|
| I've not dared try bcachefs out though, I'm quite wary of data
| loss, even on my laptop. Does anyone have experience to share?
| BlackLotus89 wrote:
| Had (have) a laptop that crashed reproducibly when touched
| wrong. Had a few btrfs corruptions on it and after a while had
| enough. It has been running bcachefs as rootfs for a few years
| now and I've had no issue whatsoever with it. Home is still
| btrfs (for reasons) and had no data loss on that either. The
| only problems I had were fixed by booting and mounting it
| through a rescue system (no fsck necessary); that happened
| twice in 2 years or so. Was too lazy to check what the bcachefs
| hook (AUR package) does wrong.
|
| Edit: Reason for home being btrfs: I set this up a long fucking
| time ago and it was more or less meant as a stress test for
| bcachefs. Since I didn't want data loss on important data (like
| my home) I left my home as btrfs
| orra wrote:
| Oh! This is very exciting. Bcachefs could be the next gen
| filesystem that Linux needs[1].
|
| Advantages over other filesystems:
|
| * ext4 or xfs -- these two don't checksum your data, only the
| filesystem metadata
|
| * zfs -- zfs is technically great, but binary distribution of the
| zfs code is tricky, because the CDDL is GPL incompatible
|
| * btrfs -- btrfs still doesn't have reliable RAID5
|
| [1] It's been in development for a number of years. It now being
| proposed for inclusion in the mainline kernel is a major
| milestone.
| vladvasiliu wrote:
| > * zfs -- zfs is technically great, but binary distribution of
| the zfs code is tricky, because the CDDL is GPL incompatible
|
| Building your own ZFS module is easy enough, for example on
| Arch with zfs-dkms.
|
| But there's also the issue of compatibility. Sometimes kernel
| updates will break ZFS. Even minor ones, 6.2.13 IIRC broke it,
| whereas 6.2.12 was fine.
|
| Right now, 6.3 seems to introduce major compatibility problems.
|
| ---
|
| edit: looking through the openzfs issues, I was likely thinking
| of 6.2.8 breaking it, where 6.2.7 was fine. Point stands,
| though. https://github.com/openzfs/zfs/issues/14658
|
| Regarding 6.3 support, it apparently is merged in the master
| branch, but no release as of yet.
| https://github.com/openzfs/zfs/issues/14622
| kaba0 wrote:
| It might help someone: nixos can be configured to always use
| the latest kernel version that is compatible with zfs, I
| believe it's
| config.boot.zfs.package.latestCompatibleLinuxPackages .
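|
| In configuration.nix that presumably looks something like this
| (a sketch based on the option named above):
|
|     boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;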
| bjoli wrote:
| What is the legal situation of doing that? If I had a company
| I wouldn't want to get in trouble with any litigious
| companies.
| boomboomsubban wrote:
| Unless you're distributing, I don't see how anybody could
| do anything. Personal (or company wide) use has always
| allowed the mixing of basically any licenses.
|
| The worst case scenarios would be something like Ubuntu
| being unable to provide compiled modules, but dkms would
| still be fine. Or the very unlikely ZFS on Linux getting
| sued, but that would involve a lengthy trial that would
| allow you to move away from Open ZFS.
| chasil wrote:
| The danger is specifically to the copyright holders of
| Linux - the authors who have code in the kernel. If they
| do not defend their copyright, then it is not strong and
| can be broken in certain scenarios.
|
| "Linux copyright holders in the GPL Compliance Project
| for Linux Developers believe that distribution of ZFS
| binaries is a GPL violation and infringes Linux's
| copyright."
|
| Linux bundling ZFS code would bring this text against the
| GPL: "You may not offer or impose any terms on any
| Covered Software in Source Code form that alters or
| restricts the applicable version of [the CDDL]."
|
| Ubuntu distributes ZFS as an out-of-tree module, which
| taints the kernel immediately at installation.
| Hopefully, this is enough to prevent a great legal
| challenge.
|
| https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/
| boomboomsubban wrote:
| Yes, distribution has legal risks. Use does not, it only
| has the risk that they are unable to get ZFS distributed.
| vladvasiliu wrote:
| The ArchZFS project distributes binary kernel images with
| ZFS integrated. I don't know what the legal situation is
| for that.
|
| In my case, the Arch package is more of a "recipe maker".
| It fetches the Linux headers and the zfs source code and
| compiles this for local use. As far as they are concerned,
| there is no distribution of the resulting artifact. IANAL,
| but I think if there's an issue with that, then OpenZFS is
| basically never usable under Linux.
|
| Other companies distributed kernels with zfs support
| directly, such as Ubuntu. I don't recall there being news
| of them being sued over this, but maybe they managed to
| work something out.
| 5e92cb50239222b wrote:
| archzfs does not distribute any kernel images, they only
| provide pre-built modules for the officially supported
| kernels.
| dsr_ wrote:
| IANAL.
|
| Oracle is very litigious. However, OpenZFS has been
| releasing code for more than a decade. Ubuntu shipped
| integrated ZFS/Linux in 2016. It's certain that Oracle
| knows all about it and has decided that being vague is more
| in their interests than actually settling the matter.
|
| On my list of potential legal worries, this is not a
| priority for me.
| cduzz wrote:
| I would add to this "IANAL But" list
|
| https://aws.amazon.com/fsx/openzfs/
|
| So -- AWS / Amazon are certainly big enough to have
| reviewed the licenses and have some understanding of
| potential legal risks of this.
| orra wrote:
| You're right that DKMS is fairly easy (at least until you
| enable secure boot).
|
| > Even minor ones, 6.2.13 IIRC broke it, whereas 6.2.12 was
| fine.
|
| Interesting!
|
| It's just a shame the license has hindered adoption. Ubuntu
| were shipping binary ZFS modules at one point, but they have
| walked back from that.
| vladvasiliu wrote:
| > You're right that DKMS is fairly easy (at least until you
| enable secure boot).
|
| Still easy. Under Arch, the kernel image isn't signed, so
| if you enable secure boot you need to fiddle with signing
| on your own. At that point, you can just sign the kernel
| once the module is built. Works fine for me.
| mustache_kimono wrote:
| > Ubuntu were shipping binary ZFS modules at one point, but
| they have walked back from that.
|
| This is incorrect? Ubuntu is still shipping binary modules.
| orra wrote:
| Right, but various things point to ZFS being de facto
| deprecated: https://www.omgubuntu.co.uk/2023/01/ubuntu-
| zfs-support-statu...
| mustache_kimono wrote:
| > various things point to ZFS being de facto deprecated
|
| I'm not sure that's the case? Your link points to the ZFS
| _on root install_ being deprecated on the _desktop_. I'm
| not sure what inference you/we can draw from that
| considering ZFS is a major component to LXD, and Ubuntu
| and Linux's sweet spot is as a server OS.
|
| > Ubuntu were shipping binary ZFS modules at one point,
| but they have walked back from that.
|
| Not to be persnickety, but this was your claim, and
| Ubuntu is still shipping ZFS binary modules on all its
| current releases.
| orra wrote:
| Yeah, my wording was clumsy, but thanks for assuming good
| faith. I essentially meant their enthusiasm had waned.
|
| It's good you can give reasons ZFS is still important to
| Ubuntu on the server, although as a desktop user I'm sad
| nobody wants to ship ZFS for the desktop.
| ilyt wrote:
| > [1] It's been in development for a number of years. It now
| being proposed for inclusion in the mainline kernel is a major
| milestone.
|
| not a measure of quality in the slightest. btrfs had some
| serious bugs over the years despite being in mainline
| orra wrote:
| True, but bcachefs gives the impression of being better
| designed, and not rushed upstream. I think it helps that
| bcachefs evolved from bcache.
| the8472 wrote:
| > zfs is technically great
|
| It's only great due to the lack of competitors in the
| checksummed-CoW-raid category. It lacks a bunch of things:
| Defrag, Reflinks, On-Demand Dedup, Rebalance (online raid
| geometry change, device removal, device shrink). It also wastes
| RAM due to page cache + ARC.
| macdice wrote:
| Reflinks and copy_file_range() are just landing in OpenZFS
| now I think? (Block cloning)
| pongo1231 wrote:
| Block cloning support has indeed recently landed in git and
| already allows for reflinks under FreeBSD. Still has to be
| wired up for Linux though.
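|
| Once wired up, that presumably means the usual reflink path
| works on ZFS too, e.g. (illustrative file names):
|
|     cp --reflink=always big.img big-clone.img   # shares blocks instead of copying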
| mustache_kimono wrote:
| Really excited about this.
|
| Once support hits in Linux, a little app of mine[0] will
| support block cloning for its "roll forward" operation,
| where all previous snapshots are preserved, but a
| particular snapshot is rolled forward to the live
| dataset. Right now, data is simply diff copied in chunks.
| When this support hits, there will be no need to copy any
| data. Blocks written to the live dataset can just be
| references to the underlying snapshot blocks, and no
| extra space will need to be used.
|
| [0]: https://github.com/kimono-koans/httm
| nextaccountic wrote:
| What does it mean to roll forward? I read the linked
| Github and I don't get what is happening
|
| > Roll forward to a previous ZFS snapshot, instead of rolling
| back (this avoids destroying interstitial snapshots):
|
|     sudo httm --roll-forward=rpool/scratch@snap_2023-04-01-15:26:06_httmSnapFileMount
|     [sudo] password for kimono:
|     httm took a pre-execution snapshot named: rpool/scratch@snap_pre_2023-04-01-15:27:38_httmSnapRollForward
|     ...
|     httm roll forward completed successfully.
|     httm took a post-execution snapshot named: rpool/scratch@snap_post_2023-04-01-15:28:40_:snap_2023-04-01-15:26:06_httmSnapFileMount:_httmSnapRollForward
| mustache_kimono wrote:
| From the help and man page[0]:
|
|     --roll-forward="snap_name"
|         traditionally 'zfs rollback' is a destructive operation,
|         whereas httm roll-forward is non-destructive. httm will
|         copy only files and their attributes that have changed
|         since a specified snapshot, from that snapshot, to its
|         live dataset. httm will also take two precautionary
|         snapshots, one before and one after the copy. Should the
|         roll forward fail for any reason, httm will roll back to
|         the pre-execution state. Note: This is a ZFS only option
|         which requires super user privileges.
|
| I might also add 'zfs rollback' is a destructive
| operation because it destroys snapshots between the
| current live version of the filesystem and the rollback
| snapshot target (the 'interstitial' snapshots). Imagine
| you have ransomware installed and you _need_ to roll
| back, but you want to view the ransomware's operations
| through snapshots for forensic purposes. You can do that.
|
| It's also faster than a checksummed rsync, because it
| makes its determination based on the underlying ZFS
| checksums, and more accurate than a non-checksummed rsync.
|
| This is a relatively minor feature re: httm. I recommend
| installing and playing around with it a bit.
|
| [0]: https://github.com/kimono-
| koans/httm/blob/master/httm.1
| nextaccountic wrote:
| What I don't understand is: aren't zfs snapshots
| writable, like in btrfs?
|
| If I wanted to rollback the live filesystem into a
| previous snapshot, why couldn't I just start writing into
| the snapshot instead? (Or create another snapshot that is
| a clone of the old one, and write into it)
| throw0101a wrote:
| > _What I don 't understand is: aren't zfs snapshots
| writable, like in btrfs?_
|
| ZFS snapshots, following the historic meaning of
| "snapshot", are read-only. ZFS supports _cloning_ of a
| read-only snapshot to a writable volume/file system.
|
| * https://openzfs.github.io/openzfs-docs/man/8/zfs-
| clone.8.htm...
|
| Btrfs is actually the one 'corrupting' the already-
| accepted nomenclature of snapshots meaning a read-only
| copy of the data.
|
| I would assume the etymology of the file system concept
| of a "snapshot" derives from photography, where something
| is frozen at a particular moment of time:
|
| > _In computer systems, a snapshot is the state of a
| system at a particular point in time. The term was coined
| as an analogy to that in photography._ [...] _To avoid
| downtime, high-availability systems may instead perform
| the backup on a snapshot--a read-only copy of the data
| set frozen at a point in time--and allow applications to
| continue writing to their data. Most snapshot
| implementations are efficient and can create snapshots in
| O(1)._
|
| *
| https://en.wikipedia.org/wiki/Snapshot_(computer_storage)
|
| * https://en.wikipedia.org/wiki/Snapshot_(photography)
| orra wrote:
| Sure, there's lots of room for improvement. IIRC, rebalancing
| might be a WIP, finally?
|
| But credit where credit is due: for a long time, ZFS has been
| the only fit for purpose filesystem, if you care about the
| integrity of your data.
| the8472 wrote:
| Afaik true rebalancing isn't in the works. Some limited
| add-device and remove-vdev features are in progress but
| AIUI they come with additional overhead and aren't as
| flexible.
|
| btrfs and bcachefs rebalance leave your pool as if you had
| created it from scratch with the existing data and the new
| layout.
| e12e wrote:
| > [ZFS is] only great due to the lack of competitors in the
| checksummed-CoW-raid category.
|
| You forgot robust native encryption, network transparent
| dump/restore (ZFS send/receive) - and broad platform support
| (not so much anymore).
|
| For a while you could have a solid FS with encryption support
| for your USB hd that could be safely used with Linux, *BSD,
| Windows, Open/FOSS Solaris and MacOS.
| josephg wrote:
| Is it just the implementation of zfs which is owned by
| oracle now? I wonder how hard it would be to write a
| compatible clean room reimplementation of zfs in rust or
| something, from the spec.
|
| Even if it doesn't implement every feature from the real
| zfs, it would still be handy for OS compatibility reasons.
| nine_k wrote:
| I would suppose it would take years of effort, and a lot
| of testing in search of performance enhancements and
| elimination of corner cases. Even if the code of the FS
| itself is created in a provably correct manner (a very
| tall order even with Rust), real hardware has a lot of
| quirks which need to be addressed.
| chasil wrote:
| I wish the btrfs (and perhaps bcachefs) projects would
| collaborate with OpenZFS to rewrite equivalent code that
| they all used.
|
| It might take years, but washing Sun out of OpenZFS is
| the only thing that will free it.
| mustache_kimono wrote:
| OpenZFS is already free and open source. Linux kernel
| developers should just stop punching themselves in face.
|
| One way to solve the ZFS issue, Linus Torvalds could call
| a meeting of project leadership, and say, "Can we all
| agree that OpenZFS is not a derived work of Linux? It
| seems pretty obvious to anyone who understands the
| meaning of copyright term of art 'derived work' and the
| origin of ZFS ... Good. We shall add a commit which
| indicates such to the COPYING file [0], like we have for
| programs that interface at the syscall boundary to clear
| up any further confusion."
|
| Can you imagine trying to bring a copyright infringement
| suit (with no damages!) in such an instance?
|
| The ZFS hair shirt is self-imposed by semi-religious
| Linux wackadoos.
|
| [0]: See, https://github.com/torvalds/linux/blob/master/L
| ICENSES/excep...
| AshamedCaptain wrote:
| Even if you were to be able to say that OpenZFS is not a
| derived work of Linux, all it would allow you to do is to
| distribute OpenZFS. You would _still_ not be able to
| distribute OpenZFS + Linux as a combined work.
|
| (I am one of these guys who thinks what Ubuntu is doing
| is crossing the line. To package two pieces of software
| whose license forbids you from distributing their
| combination in a way that "they are not combined but can
| be combined with a single click" is stretching it too
| much. )
|
| It would be much simpler for Oracle to simply relicense
| older versions of ZFS under another license.
| mustache_kimono wrote:
| > Even if you were to be able to say that OpenZFS is not
| a derived work of Linux, all it would allow you to do is
| to distribute OpenZFS. You would _still_ not be able to
| distribute OpenZFS + Linux as a combined work.
|
| Why? Linus said such modules and distribution were
| acceptable re: AFS, _an instance which is directly on
| point_. See: https://lkml.org/lkml/2003/12/3/228
| AshamedCaptain wrote:
| Where is he saying that you can distribute the combined
| work? That would not only violate the GPL, it would also
| violate AFS's license...
|
| The only thing he's saying there is that he's not even
| 100% sure whether the AFS module is a derived work or
| not (if it was, it would be a violation _just to
| distribute the module by itself_!). Go imagine what his
| opinion will be on someone distributing a kernel already
| almost pre-linked with ZFS.
|
| Not that it matters, since he's not the license author
| nor even the copyright holder these days...
| mustache_kimono wrote:
| > Where is he saying that you can distribute the combined
| work?
|
| What's your reasoning as to why one couldn't, if we grant
| Linus's reasoning re: AFS as it applies to ZFS?
|
| > Not that it matters, since he's not the license author
| not even the copyright holder these days...
|
| Linux kernel community has seen fit to give its
| assurances re: other clarifications/exceptions. See the
| COPYING file.
| rascul wrote:
| Linus has some words on this matter:
|
| > And honestly, there is no way I can merge any of the
| ZFS efforts until I get an official letter from Oracle
| that is signed by their main legal counsel or preferably
| by Larry Ellison himself that says that yes, it's ok to
| do so and treat the end result as GPL'd.
|
| > Other people think it can be ok to merge ZFS code into
| the kernel and that the module interface makes it ok, and
| that's their decision. But considering Oracle's litigious
| nature, and the questions over licensing, there's no way
| I can feel safe in ever doing so.
|
| > And I'm not at all interested in some "ZFS shim layer"
| thing either that some people seem to think would isolate
| the two projects. That adds no value to our side, and
| given Oracle's interface copyright suits (see Java), I
| don't think it's any real licensing win either.
|
| https://www.realworldtech.com/forum/?threadid=189711&curp
| ost...
| mustache_kimono wrote:
| > Linus has some words on this matter:
|
| I hate to point this out, but this only demonstrates
| Linus Torvalds doesn't know much about copyright law.
| Linus could just as easily say "I was wrong. Sorry! As
| you all know -- IANAL. It's time we remedied this stupid
| chapter in our history. After all, _I gave similar
| assurances to the AFS module_ when it was open sourced
| under a GPL incompatible license in 2003. "
|
| Linus's other words on the matter[0]:
|
| > But one gray area in particular is something like a
| driver that was originally written for another operating
| system (ie clearly not a derived work of Linux in
| origin). At exactly what point does it become a derived
| work of the kernel (and thus fall under the GPL)?
|
| > THAT is a gray area, and _that_ is the area where I
| personally believe that some modules may be considered to
| not be derived works simply because they weren't designed
| for Linux and don't depend on any special Linux
| behaviour.
|
| [0]: https://lkml.org/lkml/2003/12/3/228
| kaba0 wrote:
| > wonder how hard it would be to write a compatible clean
| room reimplementation of zfs in rust or something, from
| the spec
|
| As for every non-trivial application - almost impossible.
| 0x457 wrote:
| Not exactly ZFS in Rust, but more like a replacement for
| ZFS in Rust: https://github.com/redox-os/tfs
|
| Work stalled, though. Not compatible, but I was working
| on overlayfs for freebsd in rust, and it was not pleasant
| at all. Can't imagine making an entire "real" file system
| in Rust.
| gigatexal wrote:
| "Wastes" ram? That's a tunable my friend.
| viraptor wrote:
| https://github.com/openzfs/zfs/issues/10516
|
| The data goes through two caches instead of just page cache
| or just arc as far as I understand it.
| quotemstr wrote:
| Can I totally disable ARC yet?
| throw0101a wrote:
| zfs set primarycache=none foo/bar
|
| ?
|
| Though this will amplify reads as even metadata will need
| to be fetched from disk, so perhaps "=metadata" may be
| better.
|
| * https://openzfs.github.io/openzfs-
| docs/man/7/zfsprops.7.html...
| vluft wrote:
| I'm curious what your workflow is that not having any
| disk caching would have acceptable performance.
| 0x457 wrote:
| A workflow where the person doesn't understand that RAM
| isn't wasted and it's just that their utility for showing
| usage is wrong. Imagine being mad at the file system cache
| being stored in RAM.
| quotemstr wrote:
| The problem with ARC in ZFS on Linux is the double
| caching. Linux already has a page cache. It doesn't need
| ZFS to provide a second page cache. I want to store
| things in the Linux page cache once, not once in the page
| cache and once in ZFS's special-sauce cache.
|
| If ARC is so good, it should be the general Linux page
| cache algorithm.
| mustache_kimono wrote:
| > It's only great due to the lack of competitors in the
| checksummed-CoW-raid category.
|
| _blinks eyes, shakes head_
|
| "It's only great because it's the only thing that's figured
| out how to do a hard thing really well" may be peak FOSS
| entitlement syndrome.
|
| Meanwhile, btrfs has rapidly gone nowhere, and, if you read
| the comments to this PR, bcachefs would love to get to simply
| nowhere/btrfs status, but is still years away.
|
| ZFS fulfills the core requirement of a filesystem, which is
| to store your data, such that when you read it back you can
| be assured it was the data you stored. It's amazing we
| continue to countenance systems that don't do this, simply
| because not fulfilling this core requirement was once
| considered acceptable.
| Dylan16807 wrote:
| I don't see what's entitled about the idea that "it
| fulfills the core requirements" is enough to get it "good"
| status but not "great" status. Even if that's really rare
| among filesystems.
| throw0101a wrote:
| > _Meanwhile, btrfs has rapidly gone nowhere_ [...]
|
| A reminder that it came out in 2009:
|
| * https://en.wikipedia.org/wiki/Btrfs
|
| (ext4 was declared stable in 2008.)
| deepspace wrote:
| Yes! File systems are hard. My prediction is that it will
| be *at least* 10 years before this newfangled FS gains
| both feature- and stability parity with BTRFS and ZFS.
|
| Also, BTRFS (albeit a modified version) has been used
| successfully in at least one commercial NAS (Synology),
| for many years. I don't see how that counts as "gone
| nowhere".
| throw0101a wrote:
| Have all the foot guns described in 2021 been fixed?
|
| * https://arstechnica.com/gadgets/2021/09/examining-
| btrfs-linu...
| dnzm wrote:
| Not sure about "all", but apart from that article being
| more pissy than strictly necessary, RAID1 can now, in
| fact, survive losing more than one disk. That is, provided
| you use RAID1C3 or C4 (which keeps 3 or 4 copies, rather
| than the default 2). Also, not really sure how RAID1 not
| surviving >1 disk failure is a slight against btrfs, I
| think most filesystems would have issues there...
|
| As for the rest of the article -- the tone rubs me the
| wrong way, and somehow considering a FS shit because you
| couldn't be bothered to use the correct commands (the
| scrub vs balance ranty bit) doesn't instill confidence in
| me that the article is written in good faith.
|
| I believe the writer's biggest hangup/footgunnage with
| btrfs is still there: it's not zfs. Ymmv.
| mustache_kimono wrote:
| > Also, BTRFS (albeit a modified version) has been used
| successfully in at least one commercial NAS (Synology),
| for many years. I don't see how that counts as "gone
| nowhere".
|
| Excuse me for sounding glib. My point was btrfs isn't
| considered a serious competitor to ZFS in many of the
| spaces ZFS operates. Moreover, its inability to do
| RAID5/6 after years of effort is just weird now.
| ilyt wrote:
| Yeah, the world decided that just replicating data somewhere
| is far preferable if you want resilience, instead of making
| the separate nodes more resilient.
| rektide wrote:
| Btrfs still highly recommends a raid1 mode for metadata, but
| for data itself, raid-5 is fine.
|
| I somewhat recall there being a little progress on trying to
| fix the remaining "write hole" issues, in the past year or two.
| But in general, I think there's very little pressure to do so
| because so very many people run raid-5 for data already &
| it works great. Getting metadata off raid1 is low priority, a
| nice to have.
| kiririn wrote:
| Raid5 works ok until you scrub. Even scrubbing one device at
| a time is a barrage of random reads sustained for days at a
| time
|
| I'll very happily move back from MD raid 5 when linear scrub
| for parity raid lands
| tremon wrote:
| Still, even with raid1 for metadata and raid5 for data, the
| kernel still shouts at you about it being EXPERIMENTAL every
| time you mount such a filesystem. I understand that it's best
| to err on the side of caution, but that notice does a good
| job of persisting the idea that btrfs isn't ready for prime-
| time use.
|
| I use btrfs on most of my Linux systems now (though only one
| with raid5), except for backup disks and backup volumes:
| those I intend to keep on ext4 indefinitely.
| sedatk wrote:
| > btrfs still doesn't have reliable RAID5
|
| Synology offers btrfs + RAID5 without warning the user. I
| wonder why they're so confident with it.
| bestham wrote:
| They are running btrfs on top of DM.
| https://kb.synology.com/en-
| nz/DSM/tutorial/What_was_the_RAID...
| sedatk wrote:
| Thanks for the link!
| sporkle-feet wrote:
| Synology doesn't use the btrfs raid - AIUI they layer non-
| raid btrfs over raid LVM
| IAmLiterallyAB wrote:
| Here's a link to the Bcachefs site https://bcachefs.org/
|
| I think it summarizes its features and strengths pretty well, and
| it has a lot of good technical information.
| sumtechguy wrote:
| Does anyone know if there any good links to current benchmarks
| between the diff types? My googlefu is only finding stuff from
| 2019.
| anentropic wrote:
| I can't help reading this name as Bca-chefs
|
| (...I realise it must be B-cache-fs)
| p1mrx wrote:
| Maybe we could call it b$fs
| baobrien wrote:
| huh, this is fun:
| https://lore.kernel.org/lkml/ZFrBEsjrfseCUzqV@moria.home.lan...
|
| There's a little x86-64 code generator in bcachefs to generate
| some sort of btree unpacking code.
| dathinab wrote:
| This is also the point most likely to cause problems for this
| patch series (which is only fixes and utils added to the
| kernel) and for bcachefs in general.
|
| Like when you have an entry like "bring back a function which
| could make developing viruses easier (though not a
| vulnerability by itself), related to memory management and
| code execution", the default answer is nop .. nooop .. never.
| (Which doesn't mean that it won't come back).
|
| It seems that while it's not necessary to have this, it makes a
| non-negligible performance difference.
| viraptor wrote:
| It would be really nice if he posted the difference
| with/without the optimisation for context. I hope it's going
| to be included in the explanation post he's planning.
| kzrdude wrote:
| It looks like the code generator is only available for x86
| anyway, so it seems niche that way. I am all about baseline
| being good performance, not the special case.
| BenjiWiebe wrote:
| He mentions he wants to make the same type of
| optimization for ARM, so ARM+x86 certainly wouldn't be
| niche.
|
| I wouldn't even call x86 alone niche...
| Permik wrote:
| I'll be eagerly waiting for the upcoming optimization writeup
| mentioned here:
| https://lore.kernel.org/lkml/ZFyAr%2F9L3neIWpF8@moria.home.l...
| mastax wrote:
| Please post it on HN because I won't remember to go looking
| for it.
| dontlaugh wrote:
| It's bad enough that the kernel includes a JIT for eBPF. Adding
| more of them without hardware constraints and/or formal
| verification seems like a bad idea to me.
| baobrien wrote:
| yeah, most of the kernel maintainers in that thread seem to
| be against it. bcachefs does seem to also have a non-code-
| generating implementation of this, as it runs on
| architectures other than x86-64.
| sporkle-feet wrote:
| The feature that caught my eye is the concept of having different
| targets.
|
| A fast SSD can be set as the target for foreground writes, but
| that data will be transparently copied in the background to a
| "background" target, i.e. a large/slow disk.
|
| If this works, it will be awesome.
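|
| Per the bcachefs docs, the setup looks roughly like this (device
| names and labels are examples; exact flags may have changed):
|
|     bcachefs format \
|       --label=ssd.ssd1 /dev/nvme0n1 \
|       --label=hdd.hdd1 /dev/sda \
|       --foreground_target=ssd \
|       --promote_target=ssd \
|       --background_target=hdd
|     mount -t bcachefs /dev/nvme0n1:/dev/sda /mnt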
| viraptor wrote:
| You can also have that at block level (which is where bcache
| itself comes from). Facebook used it years ago and I had it on
| an SSD+HDD laptop... a decade ago at least? Unless you want the
| filesystem to know about it, it's ready to go now.
| jwilk wrote:
| Look up --write-mostly and --write-behind options in mdadm(8)
| man page.
|
| I can't recommend such a setup though. It works very poorly
| for me.
| saltcured wrote:
| See the lvmcache(7) manpage, which I think may be what the
| earlier poster was thinking of. It isn't an asymmetric RAID
| mode, but a tiered caching scheme where you can, for
| example, put a faster and smaller enterprise SSD in front
| of a larger and slower bulk store. So you can have a large
| bulk volume but the recently/frequently used blocks get the
| performance of the fast cache volume.
|
| I set it up in the past with an mdadm RAID1 array over SSDs
| as a caching layer in front of another mdadm array over
| HDDs. It performed quite well in a developer/compute
| workstation environment.
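|
| The setup is roughly (VG/LV/device names are placeholders; see
| lvmcache(7) for the current syntax):
|
|     # slow bulk LV "data" already exists in VG "vg0"; the SSD was
|     # added to the VG as another PV
|     lvcreate -n fastcache -L 100G vg0 /dev/nvme0n1
|     lvconvert --type cache --cachevol fastcache vg0/data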
| viraptor wrote:
| I did mean bcache specifically.
| https://www.kernel.org/doc/Documentation/bcache.txt
| throw0101a wrote:
| > _A fast SSD can be set as the target for foreground writes,
| but that data will be transparently copied in the background to
| a "background" target, i.e. a large/slow disk._
|
| This is very similar in concept to (or an evolution of?) ZFS's
| ZIL:
|
| * https://www.servethehome.com/what-is-the-zfs-zil-slog-and-
| wh...
|
| * https://www.truenas.com/docs/references/zilandslog/
|
| * https://www.45drives.com/community/articles/zfs-caching/
|
| When this feature was first introduced to ZFS in the Solaris 10
| days there was an interesting demo from a person at Sun that I
| ran across: he was based in a Sun office on the US East Coast
| where he did stuff, but had access to Sun lab equipment across
| the US. He mounted iSCSI drives that were based in (IIRC)
| Colorado as a ZFS poool, and was using them for Postgres stuff:
| the performance was unsurprisingly not good. He then add a
| local ZIL to the ZFS pool and got I/O that was not too far off
| from some local (near-LAN) disks he was using for another pool.
| seized wrote:
| ZIL is just a fast place to write the data for sync
| operations. If everything is working then the ZIL is never
| read from, ZFS uses RAM as that foreground bit.
|
| Async writes on a default configuration don't hit the ZIL,
| only RAM for a few seconds then disk. Sync writes are RAM to
| ZIL, confirm write, then RAM to pool.
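|
| For reference, adding a fast separate log device and forcing a
| dataset's writes through it looks roughly like this (pool,
| dataset and device names are examples):
|
|     zpool add tank log /dev/nvme0n1    # SLOG: dedicated device for the ZIL
|     zfs set sync=always tank/db        # treat all writes as sync, so they hit the log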
| ThatPlayer wrote:
| But ZIL is a cache, and not usable for long-term storage. If
| I combine a 1TB SSD with a 1TB HDD, I get 1TB of usable
| space. In bcachefs, that's 2TB of usable space.
|
| Bcache (not bcachefs) is more equivalent to ZIL.
| harvie wrote:
| What I really miss when compared to ZFS is the ability to create
| datasets. I really like to use ZFS subvolumes for LXC containers.
| That way I can have a separate sub-tree for each container with
| its own size limit without having to create partitions or LVs,
| format the filesystem, and then resize everything when I need to
| grow the partition, or even defragment the fs before shrinking it.
| With ZFS I can easily give and take disk capacity to my
| containers without having to do any multi-step operation that
| requires close attention to prevent accidental data loss.
|
| Basically I just state what size I want that subtree to be and it
| happens without having to touch the underlying block devices. I
| can also change it anytime during runtime extremely easily. E.g.:
|
| zfs set quota=42G tank/vps/my_vps
|
| zfs set quota=32G tank/vps/my_vps
|
| zfs set quota=23G tank/vps/my_other_vps
|
| btrfs can kinda do this as well, but the commands are not as
| straightforward as in zfs.
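|
| For comparison, the btrfs equivalent is roughly (paths are
| examples):
|
|     btrfs subvolume create /tank/vps/my_vps
|     btrfs quota enable /tank
|     btrfs qgroup limit 42G /tank/vps/my_vps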
|
| update: My bad. bcachefs seems to have subvolumes now. There is
| also some quota support, but so far the documentation is a bit
| lacking, so I'm not yet sure how to use that and whether it can
| be configured per dataset.
| layer8 wrote:
| I parsed this as "BCA chefs" at first.
| curt15 wrote:
| For some reason VM and DB workloads are btrfs's Achilles heel but
| ZFS seems to handle them pretty well (provided that a suitable
| recordsize is set). How do they perform on bcachefs?
| candiddevmike wrote:
| I've never had a problem with these on BTRFS with COW disabled
| on their directories...
| pongo1231 wrote:
| The issue is that also disables many of the interesting
| features of BTRFS for those files. No checksumming, no
| snapshots and no compression. In comparison ZFS handles these
| features just fine for those kinds of files without the
| enormous performance / fragmentation issues of BTRFS (without
| nodatacow).
| [deleted]
| MisterTea wrote:
| Another File system I am interested in is GEFS - good enough fs
| (rather - "great experimental file shredder" until stable ;-).
| It's based on B-epsilon trees, a data structure which wasn't
| around when ZFS was designed. The idea is to build a ZFS like fs
| without the size and complexity of ZFS. So far it's Plan 9 only
| and not production ready though there is a chance it could be
| ported to OpenBSD and a talk was given at NYC*BUG:
| https://www.nycbug.org/index?action=view&id=10688
|
| Code: http://shithub.us/ori/gefs/HEAD/info.html
| voxadam wrote:
| If you're interested in more detailed information about bcachefs,
| I highly recommend checking out _bcachefs: Principles of
| Operation_.[0]
|
| Also, the original developer of bcachefs (as well as bcache),
| Kent Overstreet posts status updates from time to time on his
| Patreon page.[1]
|
| [0] https://bcachefs.org/bcachefs-principles-of-operation.pdf
|
| [1] https://www.patreon.com/bcachefs
| AceJohnny2 wrote:
| Thanks for the links!
|
| I was wondering if bcachefs is architectured with NAND-flash
| SSD hardware in mind (as recently highlighted on HN in the "Is
| Sequential IO Dead In The Era Of The NVMe Drive" article [1]
| [2]), to optimize IO and hardware lifecycle.
|
| Skimming through the "bcachefs: Principles Of Operation" PDF,
| it appears the answer is no.
|
| [1] https://jack-vanlightly.com/blog/2023/5/9/is-sequential-
| io-d...
|
| [2] https://news.ycombinator.com/item?id=35878961
| koverstreet wrote:
| It is. There's also plans for ZNS SSD support.
___________________________________________________________________
(page generated 2023-05-11 23:01 UTC)