[HN Gopher] Bcachefs Status Update
___________________________________________________________________
Bcachefs Status Update
Author : pantalaimon
Score : 81 points
Date : 2022-10-29 08:22 UTC (14 hours ago)
(HTM) web link (lore.kernel.org)
(TXT) w3m dump (lore.kernel.org)
| 2pEXgD0fZ5cF wrote:
| Always excited to hear news about Bcachefs, can't wait for it to
| hit upstream!
| Quekid5 wrote:
| Just curious about the mention of persistent data structures. Is
| this about the FP notion of persistent data structures (which I
| suspect) or on-disk data structures? Or both? :)
|
| I find that (FP) persistent data structures are like a super-
| power in lots of ways. (And it only costs a log-N slowdown in the
| worst case.)
| koverstreet wrote:
| On disk, transactionally-updated data structures
| Quekid5 wrote:
| Thanks for the clarification.
| pdimitar wrote:
| I didn't understand even half of that but am still excited. I
| have no less than 6 external HDDs and SSDs lying around and it's
| been a pain to make a good stable ZFS pool out of them.
| `bcachefs` seems like a perfect match.
| insanitybit wrote:
| Great to see this progress. This is one of the projects I'm happy
| to sponsor.
| tmulcahy wrote:
| > The lock ordering rules had become too complicated and this was
| getting us too many transaction restarts, so I stole the standard
| technique from databases
|
| What is the standard technique from databases?
| koverstreet wrote:
| You detect the deadlocks when they happen and abort
| tmulcahy wrote:
| Any details/links that explain this? The article seems to
| suggest that they used to detect the deadlock as it was about
| to happen, and then abort everything and retry. This doesn't
| seem too different from "when they happen". What is the
| optimization?
| koverstreet wrote:
| Previously, we were checking for lock ordering violations.
| But a lock ordering violation isn't a deadlock, just a
| potential deadlock.
|
| Checking for a lock ordering violation is simpler because
| you only have to look at data structures for the current
| thread - you're just looking at what other locks the
| current thread is holding.
|
| To check for a deadlock you have to do a full DFS of
| threads blocked on locks held by other threads, looking for
| cycles. Quite a bit trickier, but it was well worth it :)
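|
| To make it concrete, here's a minimal sketch of the cycle check
| in C. It's illustrative only - not the actual bcachefs code: the
| names are made up, each task here waits on at most one lock with
| a single holder, and the real version has to branch into a
| genuine DFS because locks can have multiple holders.
|
|   #include <stdbool.h>
|   #include <stdio.h>
|
|   struct lock;
|
|   struct task {
|       const char  *name;
|       struct lock *blocked_on; /* lock we're waiting for, or NULL */
|   };
|
|   struct lock {
|       const char  *name;
|       struct task *holder;     /* current exclusive holder, or NULL */
|   };
|
|   /* Follow the wait-for edges: the lock I'm blocked on is held by
|    * a task, which may itself be blocked on a lock held by another
|    * task, and so on. If the walk comes back to the task that
|    * started it, that's a cycle - an actual deadlock - and the
|    * starting task aborts and retries its transaction. max_tasks
|    * bounds the walk so a cycle elsewhere can't loop forever. */
|   static bool would_deadlock(struct task *self, int max_tasks)
|   {
|       struct task *t = self;
|
|       for (int i = 0; i < max_tasks && t->blocked_on; i++) {
|           struct task *holder = t->blocked_on->holder;
|
|           if (!holder)
|               return false;  /* lock is free, nothing to wait on */
|           if (holder == self)
|               return true;   /* cycle back to us: deadlock */
|           t = holder;
|       }
|       return false;
|   }
|
|   int main(void)
|   {
|       struct task a = { .name = "A" }, b = { .name = "B" };
|       struct lock l1 = { .name = "l1", .holder = &a };
|       struct lock l2 = { .name = "l2", .holder = &b };
|
|       /* A waits on l2 (held by B), B waits on l1 (held by A) */
|       a.blocked_on = &l2;
|       b.blocked_on = &l1;
|
|       printf("deadlock: %s\n",
|              would_deadlock(&a, 2) ? "yes" : "no");
|       return 0;
|   }
|
| The trade-off versus the old scheme: the ordering check only had
| to look at locks the current thread already holds, while this
| walk needs the global holder/waiter information - but it only
| fires on real cycles instead of merely suspicious lock orders.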
| tmulcahy wrote:
| Thanks for the explanation. Very interesting! I assume
| you only do the DFS when you detect a lock ordering
| violation? So if there is the potential for a deadlock
| you'll do extra work to make sure that you're actually
| deadlocked before aborting?
| mastax wrote:
| The talk linked at the end is really interesting, even for
| someone whose filesystem knowledge goes no further than the
| classic Unix design they teach in school and a lot of deep
| discussions about tuning ZFS. (None of my ZFS users are
| strenuous enough to require any tuning, but I can't help but
| learn and optimize when there are levers to pull.)
|
| It's interesting that it implements all these modern features on
| top of extent-based, non-COW transactional btrees (though it
| looks like APFS works the same way). With the very database-like
| architecture at the low level, it would be fun to stick a SQL
| interface on top. You should be able to do advanced queries very
| efficiently without having to fall back to scanning every file in
| the directory like you would normally have to.
|
| I'm salivating looking at the caching/tiering features. People
| often try to implement a tiered filesystem with ZFS L2ARC, SLOG,
| and now the special vdev. But none of those features were
| intended for that and it shows.
| justinlloyd wrote:
| As a solo developer I was able to build a reasonably complex
| CI/CD system for a company that relied heavily on bcachefs in a
| tiering setup: RAM --> SSD --> HDD, to run hundreds of builds
| per day across dozens of SKUs measuring in the tens to hundreds
| of gigabytes per build directory. Later I was able to port
| bcache (not the fs) to a vmx driver for VMware ESXi because of
| the understanding I had gained earlier in the project, again RAM
| to SSD to HDD. That then led to making a down'n'dirty
| FusionDrive script (it's on GitHub) for macOS that tiers RAM and
| SSD for fast CI/CD agent builds.
| koverstreet wrote:
| I'd love to hear more about that
| soopurman wrote:
| I eagerly await this work stabilizing to the point of merging to
| mainline. I'm impressed with how much progress Kent has made, but
| I'm frustrated by how often he seems to say some feature is
| basically ready except for this bug and except for that problem.
| I understand great designs often come from the mind of a single
| inspired individual, but I hope that he's ready to accept
| contributions from others and that more developers see the value
| in helping to get this across the finish line.
| slavapestov wrote:
| Unfortunately, file system development is a pretty niche skill
| set these days, and the majority of the experts in the field
| are employed maintaining existing file systems (ext4, xfs,
| apfs, etc).
|
| One thing I've been bugging Kent to do is to write
| documentation about the design and internal workings of
| bcachefs; very little about modern file system design is
| actually written down anywhere, and a detailed reference manual
| would attract more people to work in this area.
| koverstreet wrote:
| That exists! I just always forget to link to it (and I do
| need to do more work on it):
| https://bcachefs.org/bcachefs-principles-of-operation.pdf
| koverstreet wrote:
| I'm _always_ happy for other people to jump in :)
|
| But the reality is that filesystem development is hard, and
| there aren't that many people clamoring to work on this stuff -
| but that's ok. Slow and steady wins the race, in the end.
|
| I'm a big believer in steady incremental development, and being
| up front about what works and what doesn't - and also making
| sure the core is a solid foundation for everything we want to
| do.
|
| For anyone who does want to get involved - come join us!
| irc.oftc.net#bcache
| metadat wrote:
| More info and highlights about bcachefs:
|
| https://bcachefs.org/
|
| Bcachefs is a new filesystem for Linux.
|
| Copy on write (COW, like zfs or btrfs)
| Full data and metadata checksumming
| Multiple device support
| Replication
| Compression
| Encryption
| Snapshots
| Caching
|
| Erasure coding (not quite stable)
|
| Scalable - tested to 50+ TB
|
| Already working and available today
| brnt wrote:
| It sounds great, especially erasure coding: bit rot resistance
| up to a point? Yes please!
|
| I'm a bit hesitant to give it a go though; a stable,
| calamity-resistant filesystem is such a core thing to have.
|
| Is there anyone with larger scale experience with bcachefs?
| slavapestov wrote:
| Bcachefs is one of the most interesting projects going on in
| Linux kernel land these days, all the more remarkable because it
| is mostly the work of one person over the last 12 years. I hope
| it goes upstream soon; with a bit of polish it should be ready
| for widespread adoption.
| pedrocr wrote:
| Does anyone know if any of these next-generation filesystems
| can do raid6 with more flexibility than normal Linux software
| raid?
|
| Say you build a 6x4TB raid6 array in Linux and get a 16TB
| virtual drive. Now if you upgrade 4 of the drives to 8TB you
| will still get 16TB, with 4x4TB of fully unused space. But that
| 4x4TB is itself enough to do another 8TB of raid6, which would
| get you a total of 24TB of usable space where any two drives
| can fail. Do any of these filesystems just do that automatically
| and continuously upgrade the capacity as you phase in new disks?
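|
| Rough sketch of that capacity math as a toy greedy allocator
| (illustrative only - this is just the 16TB + 8TB arithmetic
| above, not how any real filesystem lays out its stripes):
|
|   #include <stdio.h>
|
|   /* Toy model: keep carving raid6 stripe groups across whatever
|    * drives still have free space, with two members of each group
|    * going to parity. Classic md fixes the geometry up front and
|    * gets (N - 2) * smallest_drive instead. */
|   static double flexible_raid6_tb(double free_tb[], int n)
|   {
|       double usable = 0;
|
|       for (;;) {
|           int    width = 0;
|           double chunk = 0;
|
|           /* drives with space left, and the smallest such amount */
|           for (int i = 0; i < n; i++) {
|               if (free_tb[i] <= 0)
|                   continue;
|               if (width == 0 || free_tb[i] < chunk)
|                   chunk = free_tb[i];
|               width++;
|           }
|           if (width < 4)       /* raid6 wants at least 4 drives */
|               break;
|
|           usable += (width - 2) * chunk;
|           for (int i = 0; i < n; i++)
|               if (free_tb[i] > 0)
|                   free_tb[i] -= chunk;
|       }
|       return usable;
|   }
|
|   int main(void)
|   {
|       double drives[] = { 8, 8, 8, 8, 4, 4 }; /* 4 new, 2 old */
|
|       printf("%.0f TB usable\n", flexible_raid6_tb(drives, 6));
|       return 0;          /* prints 24; md gives (6-2)*4 = 16 */
|   }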
| RealStickman_ wrote:
| BTRFS can do that, but RAID 5/6 use is discouraged to say the
| least.
| kzrdude wrote:
| Synology seems to use it, doesn't it?
| KAMSPioneer wrote:
| Synology uses mdadm to create the array, then formats the
| resultant block device as btrfs. Btrfs's RAID code never
| touches your data.
| mastax wrote:
| ZFS doesn't do it as you phase in disks, but once you've
| replaced all of them you get the extra capacity.
| pedrocr wrote:
| Linux raid does that already. You can also add disks to grow an
| array or change it from raid5 to raid6. The gradual phase-in is
| what's missing. That makes sense, since what you'd effectively
| have in the example is a raid0 of a 4-drive raid6 and a 6-drive
| raid6, which is a more complex configuration with uneven
| performance across the array. But for a home NAS, meeting the
| reliability target with maximum usable space ends up being all
| that counts.
___________________________________________________________________
(page generated 2022-10-29 23:01 UTC)