Fast commits for ext4

January 15, 2021

This article was contributed by Marta Rybczynska

The Linux 5.10 release included a change that is expected to significantly increase the performance of the ext4 filesystem; it goes by the name "fast commits" and introduces a new, lighter-weight journaling method. Let us look into how the feature works, who can benefit from it, and when its use may be appropriate.

Ext4 is a journaling filesystem, designed to ensure that filesystem structures appear consistent on disk at all times. A single filesystem operation (from the user's point of view) may require multiple changes in the filesystem, which will only be coherent after all of those changes are present on the disk. If a power failure or a system crash happens in the middle of those operations, corruption of the data and filesystem structure (including unrelated files) is possible. Journaling prevents corruption by maintaining a log of transactions in a separate journal on disk.
In case of a power failure, the recovery procedure can replay the journal and restore the filesystem to a consistent state.

The ext4 journal includes the metadata changes associated with an operation, but not necessarily the related data changes. Mount options can be used to select one of three journaling modes, as described in the ext4 kernel documentation. The default, data=ordered, causes ext4 to write all data before committing the associated metadata to the journal. It does not put the data itself into the journal. The data=journal option, instead, causes all data to be written to the journal before it is put into the main filesystem; as a side effect, it disables delayed allocation and direct-I/O support. Finally, data=writeback relaxes the constraints, allowing data to be written to the filesystem after the metadata has been committed to the journal.

Another important ext4 feature is delayed allocation, where the filesystem defers the allocation of blocks on disk for data written by applications until that data is actually written to disk. The idea is to wait until the application finishes its operations on the file, then allocate the actual number of data blocks needed on the disk at once. This optimization limits unneeded operations related to short-lived, small files, batches large writes, and helps ensure that data space is allocated contiguously. On the other hand, the writing of data to disk might be delayed (with the default settings) by a minute or so. In the default data=ordered mode, where the journal entry is written only after flushing all pending data, delayed allocation might thus delay the writing of the journal. To ensure that data is actually written to disk, applications use the fsync() or fdatasync() system calls, causing the data (and the journal) to be written immediately.
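As a minimal sketch of the fsync()/fdatasync() pattern described above, the following Python helper writes a buffer and then asks the kernel to flush it before returning. The name durable_write and its data_only parameter are hypothetical, chosen for illustration; os.fsync() and os.fdatasync() are thin wrappers around the system calls of the same name, and os.fdatasync() is not available on every platform, so the sketch falls back to os.fsync().

```python
import os


def durable_write(path: str, payload: bytes, data_only: bool = False) -> None:
    """Write payload to path, then flush it to stable storage before returning.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        # os.write() may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            # Flush the data, plus only the metadata needed to read it back.
            os.fdatasync(fd)
        else:
            # Flush the data and all of the file's metadata.
            os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would invoke durable_write("record.log", b"committed\n") at each point where losing the write in a crash is unacceptable; every such call is a commit that the filesystem has to process immediately, which is the cost fast commits aim to reduce.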

Ext4 journal optimization

One might assume that, in such a situation, there are a number of optimizations that could be made in the commit path; that assumption turns out to be correct. In a USENIX'17 paper, Daejun Park and Dongkun Shin showed that the current ext4 journaling scheme can introduce significant latencies because fsync() causes a lot of unrelated I/O. They proposed a faster scheme, taking into account the fact that some of the metadata written to the journal could instead be derived from changes to the inode being written, and it is possible to commit transactions related to the requested file descriptor only. Their optimization works in the data=ordered mode.

The fast-commit changes, implemented by Harshad Shirwadkar, are based on the work of Park and Shin. This work implements an additional journal for fast commits, but simplifies the commit path. There are now two journals in the filesystem: the fast-commit journal for operations that can be optimized, and the regular journal for "standard commits" whose handling is unchanged. The fast-commit journal contains operations executed since the last standard commit.

Ext4 uses a generic journaling layer called "Journaling Block Device 2" (JBD2), with the exact on-disk format documented in the ext4 wiki. JBD2 operates on blocks, so when it commits a transaction, this transaction includes all changed blocks. One logical change may affect multiple blocks, for example the inode table and the block bitmap.

The fast-commit journal, on the other hand, contains changes at the file level, resulting in a more compact format. Information that can be recreated is left out, as described in the patch posting:

"For example, if a new extent is added to an inode, then corresponding updates to the inode table, the block bitmap, the group descriptor and the superblock can be derived based on just the extent information and the corresponding inode information."
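The idea in that quote can be sketched with a toy model. This is not ext4's on-disk format or replay code: the AddExtent record, the derive_updates function, and the eight-blocks-per-group geometry are all hypothetical, chosen only to show how one compact logical record is enough to recompute the block-level changes that a block-based journal would have to store explicitly.

```python
from dataclasses import dataclass

# Hypothetical toy geometry; real ext4 block groups are far larger.
BLOCKS_PER_GROUP = 8


@dataclass(frozen=True)
class AddExtent:
    """Hypothetical logical record: an inode gained `length` blocks at `start`."""
    inode: int
    logical: int   # offset within the file, in blocks
    start: int     # first physical block of the extent
    length: int    # number of contiguous physical blocks


def derive_updates(rec: AddExtent) -> dict:
    """Recompute the block-level side effects implied by one logical record."""
    blocks = range(rec.start, rec.start + rec.length)
    per_group = {}
    for block in blocks:
        group = block // BLOCKS_PER_GROUP
        per_group[group] = per_group.get(group, 0) + 1
    return {
        # Bits to mark as allocated in the block bitmap.
        "bitmap_bits_to_set": list(blocks),
        # Change to each group descriptor's free-block count.
        "group_free_delta": {g: -n for g, n in per_group.items()},
        # Change to the superblock's free-block count.
        "superblock_free_delta": -rec.length,
        # Change to the inode's block count in the inode table.
        "inode_block_count_delta": rec.length,
    }
```

In this toy model a four-block extent starting at block 6 touches two groups, the superblock, and the inode, yet the record itself is just four integers; the price is that replay needs code that understands each kind of record.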

During recovery from this journal, the filesystem must recalculate all changed blocks from the inode changes, and modify all affected data structures on the disk. This requires specific code paths for each file operation, and not all of them are implemented right now. The fast-commits feature currently supports unlinking and linking a directory entry, creating an inode and a directory entry, adding blocks to and removing blocks from an inode, and recording an inode that should be replayed.

Fast commits are an addition to, not a replacement of, the standard commit path; the two work together. If fast commits cannot handle an operation, the filesystem falls back to the standard commit path. This happens, for example, for changes to extended attributes. During recovery, JBD2 first performs replay of the standard transactions, then lets the filesystem recover fast commits.

fsync() side effects

The fast-commit optimization is designed to work with applications using fsync() frequently to ensure data integrity. When we look at the fsync() and fdatasync() man pages, we see that those system calls only guarantee to write data linked to the given file descriptor. With ext4, as a side effect of the filesystem structure, all pending data and metadata for all file descriptors will be flushed instead. This creates a lot of I/O traffic that is unneeded to satisfy any given fsync() or fdatasync() call.

This side effect leads to a difference between the paper and the implementation: a fast commit may still include changes affecting other files. In a review, Jan Kara asked why unrelated changes are committed. Shirwadkar replied that, in an earlier version of the patch, he did indeed write only the file in question. However, this change broke some existing tests that depend on fsync() working as a global barrier, so he backed it out.

Ted Ts'o commented that the current version of the patch set keeps the existing behavior, but he can see workloads where "not requiring entanglement of unrelated file writes via fsync(2) could be a huge performance win." He added that a future solution could be a new system call taking an array of file descriptors to synchronize together. For now, application developers should base their code on the POSIX definition, and not rely on that specific fsync() side effect, as it might change in the future.

Using fast commits

Fast commits are activated at filesystem creation time, so users will have to recreate their filesystems to use this feature. In addition, the required support in e2fsprogs has not yet been added to the main branch, but is still in development. Interested users will therefore need to compile the tools themselves, or wait until the feature is supported by their distribution. When enabled, information on fast commits shows up in a new /proc/fs/ext4/<dev>/fc_info file, where <dev> is the name of the device holding the filesystem.

On the development side, there are numerous features to be added to fast commits. These include making the operations more fine-grained and supporting more cases that fall back to standard commits today. Shirwadkar is also working on fast commits with byte-granularity (instead of the current block-granularity) support for direct-access (DAX) mode, to be used on persistent memory devices.

The benchmark results given by Shirwadkar in the posted patch set show 20-200% performance improvements with filesystem benchmarks for local filesystems, and 30-75% improvement for NFS workloads. We can assume that the performance gain will be larger in applications doing many fsync() operations than in those doing only a few. Either way, though, the fast-commits feature should lead to better ext4 filesystem performance going forward.
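
As a minimal sketch of checking whether fast commits are reported on a running system, the following Python function collects the fc_info file mentioned above for every device directory under /proc/fs/ext4. The function name read_fc_info and its proc_root parameter are hypothetical, and the file's contents are returned as opaque text because their layout is not described here.

```python
from pathlib import Path


def read_fc_info(proc_root: str = "/proc/fs/ext4") -> dict:
    """Map each device directory under proc_root to the text of its fc_info file.

    Hypothetical helper for illustration only.
    """
    reports = {}
    root = Path(proc_root)
    if not root.is_dir():
        # No mounted ext4 filesystems, or not a Linux system.
        return reports
    for entry in sorted(root.iterdir()):
        try:
            reports[entry.name] = (entry / "fc_info").read_text()
        except OSError:
            # This device has no fc_info file, or it is not readable: skip it.
            continue
    return reports
```

An empty result suggests that no mounted ext4 filesystem exposes fast-commit information; the proc_root parameter exists only so that the function can be pointed at a test directory.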

Comments

Fast commits for ext4
Posted Jan 15, 2021 18:49 UTC (Fri) by NYKevin (subscriber, #129325)

I have some thoughts about this:

1. As usual, Ted's reasoning seems eminently sensible to me. Apps that really want to sync the whole filesystem should *already* be using syncfs(2).
2. Still, I could imagine some apps not knowing about the "and also you have to fsync the directory" rule (see fsync(2)), which is currently not really enforced since fsync in practice flushes "everything." It might be worth special-casing that, or providing a flag. But hard links make this trickier (which directory do you want to fsync?).
3. It would be really nice if rename(2) would function as a write barrier, or at least not be reordered before any writes to the file that is renamed, but I'm not sure where current filesystems stand on that... I know this has definitely been discussed in the past, though (see for example this article: https://lwn.net/Articles/351422/).
4. Maybe if this gets performant enough, we can rename O_DSYNC to O_PONIES.

What about other filesystems?
Posted Jan 15, 2021 19:57 UTC (Fri) by Wol (subscriber, #4433)

All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?

And no, I don't want to offload that onto the glibc guys either. I know it'll be hard, but is there any way we can get the VFS to dictate that certain things are supposed to happen in a certain order?
As an app, I don't care *how* the OS does it, but I want to be able to reason about what's hit the disk and when. And I DON'T want to have to worry about what the filesystem is or how it does it.

I've said it before, but I don't care too much about fsync or fsfsync, and I really don't want to have to worry about how the OS will react to those calls - one only has to go back to the ext3/ext4 transition/debacle where code that was fast on ext3 brought ext4 to its knees... if I can know for certain that stuff hits the disk in the order I write it, and tell the system what it can write in parallel and what it can't ... (basically data can be written in parallel, logs can be written in parallel, just not in parallel with each other. And the OS needs me to tell it the difference between the two, it won't know by itself.)

Cheers,
Wol

What about other filesystems?
Posted Jan 15, 2021 20:19 UTC (Fri) by joib (subscriber, #8541)

Carefully ordering writes was the idea behind soft updates previously(?) used in the BSD FFS. I believe it was replaced by a more traditional journaling approach because modern hardware is dependent on queuing for good performance, and all current hardware command queuing mechanisms are unordered. On such hardware soft updates wasn't able to keep up with journaling, which is better able to batch updates.

Copyright (c) 2021, Eklektix, Inc. Comments and public postings are copyrighted by their creators. Linux is a registered trademark of Linus Torvalds.