[HN Gopher] Bcachefs Status Update
___________________________________________________________________
Bcachefs Status Update
Author : pantalaimon
Score : 81 points
Date : 2022-10-29 08:22 UTC (14 hours ago)
(HTM) web link (lore.kernel.org)
(TXT) w3m dump (lore.kernel.org)
| 2pEXgD0fZ5cF wrote:
| Always excited to hear news about Bcachefs, can't wait for it to
| hit upstream!
| Quekid5 wrote:
| Just curious about the mention of persistent data structures. Is
| this about the FP notion of persistent data structures (which I
| suspect) or on-disk data structures? Or both? :)
|
| I find that (FP) persistent data structures are like a super-
| power in lots of ways. (And it only costs a log-N slowdown in the
| worst case.)
| koverstreet wrote:
| On disk, transactionally-updated data structures
| Quekid5 wrote:
| Thanks for the clarification.
| pdimitar wrote:
| I didn't understand even half of that but am still excited. I
| have no less than 6 external HDDs and SSDs lying around and it's
| been a pain to make a good stable ZFS pool out of them.
| `bcachefs` seems like a perfect match.
| insanitybit wrote:
| Great to see this progress. This is one of the projects I'm happy
| to sponsor.
| tmulcahy wrote:
| > The lock ordering rules had become too complicated and this was
| getting us too many transaction restarts, so I stole the standard
| technique from databases
|
| What is the standard technique from databases?
| koverstreet wrote:
| You detect the deadlocks when they happen and abort
| tmulcahy wrote:
| Any details/links that explain this? The article seems to
| suggest that they used to detect the deadlock as it was about
| to happen, and then abort everything and retry. This doesn't
| seem too different from "when they happen". What is the
| optimization?
| koverstreet wrote:
| Previously, we were checking for lock ordering violations.
| But a lock ordering violation isn't a deadlock, just a
| potential deadlock.
|
| Checking for a lock ordering violation is simpler because
| you only have to look at data structures for the current
| thread - you're just looking at what other locks the
| current thread is holding.
|
| To check for a deadlock you have to do a full DFS of
| threads blocked on locks held by other threads, looking for
| cycles. Quite a bit trickier, but it was well worth it :)
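|
| To make it concrete, here's a minimal sketch of the cycle check
| in C. It's illustrative only - not the actual bcachefs code: the
| names are made up, each task here waits on at most one lock with
| a single holder, and the real version has to branch into a
| genuine DFS because locks can have multiple holders.
|
|   #include <stdbool.h>
|   #include <stdio.h>
|
|   struct lock;
|
|   struct task {
|       const char  *name;
|       struct lock *blocked_on; /* lock we're waiting for, or NULL */
|   };
|
|   struct lock {
|       const char  *name;
|       struct task *holder;     /* current exclusive holder, or NULL */
|   };
|
|   /* Follow the wait-for edges: the lock I'm blocked on is held by
|    * a task, which may itself be blocked on a lock held by another
|    * task, and so on. If the walk comes back to the task that
|    * started it, that's a cycle - an actual deadlock - and the
|    * starting task aborts and retries its transaction. max_tasks
|    * bounds the walk so a cycle elsewhere can't loop forever. */
|   static bool would_deadlock(struct task *self, int max_tasks)
|   {
|       struct task *t = self;
|
|       for (int i = 0; i < max_tasks && t->blocked_on; i++) {
|           struct task *holder = t->blocked_on->holder;
|
|           if (!holder)
|               return false;  /* lock is free, nothing to wait on */
|           if (holder == self)
|               return true;   /* cycle back to us: deadlock */
|           t = holder;
|       }
|       return false;
|   }
|
|   int main(void)
|   {
|       struct task a = { .name = "A" }, b = { .name = "B" };
|       struct lock l1 = { .name = "l1", .holder = &a };
|       struct lock l2 = { .name = "l2", .holder = &b };
|
|       /* A waits on l2 (held by B), B waits on l1 (held by A) */
|       a.blocked_on = &l2;
|       b.blocked_on = &l1;
|
|       printf("deadlock: %s\n",
|              would_deadlock(&a, 2) ? "yes" : "no");
|       return 0;
|   }
|
| The trade-off versus the old scheme: the ordering check only had
| to look at locks the current thread already holds, while this
| walk needs the global holder/waiter information - but it only
| fires on real cycles instead of merely suspicious lock orders.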
| tmulcahy wrote:
| Thanks for the explanation. Very interesting! I assume
| you only do the DFS when you detect a lock ordering
| violation? So if there is the potential for a deadlock
| you'll do extra work to make sure that you're actually
| deadlocked before aborting?
| mastax wrote:
| The talk linked at the end is really interesting, even for
| someone whose filesystem knowledge goes no further than the
| classic Unix design they teach in school and a lot of deep
| discussions about tuning ZFS. (None of my ZFS users are
| strenuous enough to require any tuning, but I can't help but
| learn and optimize when there are levers to pull.)
|
| It's interesting that it implements all these modern features on
| top of extent-based, non-COW transactional btrees (though it
| looks like APFS works the same way). With the very database-like
| architecture at the low level, it would be fun to stick a SQL
| interface on top. You should be able to do advanced queries very
| efficiently without having to fall back to scanning every file in
| the directory like you would normally have to.
|
| I'm salivating looking at the caching/tiering features. People
| often try to implement a tiered filesystem with ZFS L2ARC, SLOG,
| and now the special vdev. But none of those features were
| intended for that and it shows.
| justinlloyd wrote:
| As a solo developer I was able to build a reasonably complex
| CI/CD system for a company that relied heavily on bcachefs in a
| tiering setup: RAM --> SSD --> HDD, to run hundreds of builds
| per day across dozens of SKUs measuring in the tens to hundreds
| of gigabytes per build directory. Later I was able to port
| bcache (not the fs) to a vmx driver for VMware ESXi because of
| the understanding I had gained earlier in the project, again RAM
| to SSD to HDD. That then led to making a down'n'dirty
| FusionDrive script (it's on GitHub) for macOS that tiers RAM and
| SSD for fast CI/CD agent builds.
| koverstreet wrote:
| I'd love to hear more about that
| soopurman wrote:
| I eagerly await this work stabilizing to the point of merging to
| mainline. I'm impressed with how much progress Kent has made, but
| I'm frustrated by how often he seems to say some feature is
| basically ready except for this bug and except for that problem.
| I understand great designs often come from the mind of a single
| inspired individual, but I hope that he's ready to accept
| contributions from others and that more developers see the value
| in helping to get this across the finish line.
| slavapestov wrote:
| Unfortunately, file system development is a pretty niche skill
| set these days, and the majority of the experts in the field
| are employed maintaining existing file systems (ext4, xfs,
| apfs, etc).
|
| One thing I've been bugging Kent to do is to write
| documentation about the design and internal workings of
| bcachefs; very little about modern file system design is
| actually written down anywhere, and a detailed reference manual
| would attract more people to work in this area.
| koverstreet wrote:
| That exists! I just always forget to link to it (and I do
| need to do more work on it):
| https://bcachefs.org/bcachefs-principles-of-operation.pdf
| koverstreet wrote:
| I'm _always_ happy for other people to jump in :)
|
| But the reality is that filesystem development is hard, and
| there aren't that many people clamoring to work on this stuff -
| but that's ok. Slow and steady wins the race, in the end.
|
| I'm a big believer in steady incremental development, and being
| up front about what works and what doesn't - and also making
| sure the core is a solid foundation for everything we want to
| do.
|
| For anyone who does want to get involved - come join us!
| irc.oftc.net#bcache
| metadat wrote:
| More info and highlights about bcachefs:
|
| https://bcachefs.org/
|
| Bcachefs is a new filesystem for Linux.
|
| Copy on write (COW, like zfs or btrfs)
| Full data and metadata checksumming
| Multiple device support
| Replication
| Compression
| Encryption
| Snapshots
| Caching
|
| Erasure coding (not quite stable)
|
| Scalable - tested to 50+ TB
|
| Already working and available today
| brnt wrote:
| It sounds great, especially erasure coding: bit rot resistance
| up to a point? Yes please!
|
| I'm a bit hesitant to give it a go though; a stable,
| calamity-resistant filesystem is such a core thing to have.
|
| Is there anyone with larger scale experience with bcachefs?
| slavapestov wrote:
| Bcachefs is one of the most interesting projects going on in
| Linux kernel land these days, all the more remarkable because it
| is mostly the work of one person over the last 12 years. I hope
| it goes upstream soon; with a bit of polish it should be ready
| for widespread adoption.
| pedrocr wrote:
| Does anyone know if any of these next-generation filesystems
| can do raid6 with more flexibility than normal Linux software
| raid?
|
| Say you build a 6x4TB raid6 array in Linux and get a 16TB
| virtual drive. Now if you upgrade 4 of the drives to 8TB you
| will still get 16TB, with 4x4TB of fully unused space. But that
| 4x4TB is itself enough to do another 8TB of raid6, which would
| get you a total of 24TB of usable space where any two drives
| can fail. Do any of these filesystems just do that automatically
| and continuously upgrade the capacity as you phase in new disks?
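|
| Rough sketch of that capacity math as a toy greedy allocator
| (illustrative only - this is just the 16TB + 8TB arithmetic
| above, not how any real filesystem lays out its stripes):
|
|   #include <stdio.h>
|
|   /* Toy model: keep carving raid6 stripe groups across whatever
|    * drives still have free space, with two members of each group
|    * going to parity. Classic md fixes the geometry up front and
|    * gets (N - 2) * smallest_drive instead. */
|   static double flexible_raid6_tb(double free_tb[], int n)
|   {
|       double usable = 0;
|
|       for (;;) {
|           int    width = 0;
|           double chunk = 0;
|
|           /* drives with space left, and the smallest such amount */
|           for (int i = 0; i < n; i++) {
|               if (free_tb[i] <= 0)
|                   continue;
|               if (width == 0 || free_tb[i] < chunk)
|                   chunk = free_tb[i];
|               width++;
|           }
|           if (width < 4)       /* raid6 wants at least 4 drives */
|               break;
|
|           usable += (width - 2) * chunk;
|           for (int i = 0; i < n; i++)
|               if (free_tb[i] > 0)
|                   free_tb[i] -= chunk;
|       }
|       return usable;
|   }
|
|   int main(void)
|   {
|       double drives[] = { 8, 8, 8, 8, 4, 4 }; /* 4 new, 2 old */
|
|       printf("%.0f TB usable\n", flexible_raid6_tb(drives, 6));
|       return 0;          /* prints 24; md gives (6-2)*4 = 16 */
|   }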
| RealStickman_ wrote:
| BTRFS can do that, but RAID 5/6 use is discouraged to say the
| least.
| kzrdude wrote:
| Synology seems to use it, doesn't it?
| KAMSPioneer wrote:
| Synology uses mdadm to create the array, then formats the
| resultant block device as btrfs. Btrfs's RAID code never
| touches your data.
| mastax wrote:
| ZFS doesn't do it as you phase in disks, but once you've
| replaced all of them you get the extra capacity.
| pedrocr wrote:
| Linux raid does that already. You can also add disks to grow an
| array or change it from raid5 to raid6. The gradual phase-in is
| what's missing. That makes sense, since what you'd effectively
| have in the example is a raid0 of a 4-drive raid6 and a 6-drive
| raid6, which is a more complex configuration with uneven
| performance across the array. But for a home NAS, meeting the
| reliability target with maximum usable space ends up being all
| that counts.
___________________________________________________________________
(page generated 2022-10-29 23:01 UTC)