https://lwn.net/SubscriberLink/842385/069d98ea9d94f2ed/

Fast commits for ext4

January 15, 2021

This article was contributed by Marta Rybczynska

The Linux 5.10 release included a change that is expected to
significantly increase the performance of the ext4 filesystem; it
goes by the name "fast commits" and introduces a new, lighter-weight
journaling method. Let us look into how the feature works, who can
benefit from it, and when its use may be appropriate.

Ext4 is a journaling filesystem, designed to ensure that filesystem
structures appear consistent on disk at all times. A single
filesystem operation (from the user's point of view) may require
multiple changes in the filesystem, which will only be coherent after
all of those changes are present on the disk. If a power failure or a
system crash happens in the middle of those operations, corruption of
the data and filesystem structure (including unrelated files) is
possible. Journaling prevents corruption by maintaining a log of
transactions in a separate journal on disk. In case of a power
failure, the recovery procedure can replay the journal and restore
the filesystem to a consistent state.

The ext4 journal includes the metadata changes associated with an
operation, but not necessarily the related data changes. Mount
options can be used to select one of three journaling modes, as
described in the ext4 kernel documentation. data=ordered, the
default, causes ext4 to write all data before committing the
associated metadata to the journal. It does not put the data itself
into the journal. The data=journal option, instead, causes all data
to be written to the journal before it is put into the main
filesystem; as a side effect, it disables delayed allocation and
direct-I/O support. Finally, data=writeback relaxes the constraints,
allowing data to be written to the filesystem after the metadata has
been committed to the journal.
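
As a rough illustration of how the selected mode can be inspected, the
following Python sketch parses text in the /proc/mounts layout and
reports the data= mode for each ext4 mount. The function name
ext4_data_modes() and the sample mount table are made up for this
example. The sketch also assumes that a mount listed without an
explicit data= option is running in the default data=ordered mode.

```python
def ext4_data_modes(mounts_text):
    """Map each ext4 mount point to its data= journaling mode.

    mounts_text follows the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        # Assumption: no explicit data= option means the default mode.
        mode = "ordered"
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
        modes[fields[1]] = mode
    return modes


# Hypothetical mount table used only for demonstration.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /var/db ext4 rw,noatime,data=journal 0 0
/dev/sdc1 /scratch ext4 rw,data=writeback 0 0
tmpfs /tmp tmpfs rw 0 0
"""

if __name__ == "__main__":
    for mount_point, mode in ext4_data_modes(SAMPLE).items():
        print(mount_point, mode)
```

On a real system, the same function could be fed the contents of
/proc/mounts instead of the sample string.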

Another important ext4 feature is delayed allocation, where the
filesystem defers the allocation of blocks on disk for data written
by applications until that data is actually written to disk. The idea
is to wait until the application finishes its operations on the file,
then allocate the actual number of data blocks needed on the disk at
once. This optimization limits unneeded operations related to
short-lived, small files, batches large writes, and helps ensure that
data space is allocated contiguously. On the other hand, the writing
of data to disk might be delayed (with the default settings) by a
minute or so. In the default data=ordered mode, where the journal
entry is written only after flushing all pending data, delayed
allocation might thus delay the writing of the journal. To ensure that
data is actually written to disk, applications use the fsync() or
fdatasync() system calls, causing the data (and the journal) to be
written immediately.
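
A minimal sketch of that pattern in Python, assuming a Linux system:
os.fsync() and os.fdatasync() are thin wrappers around the system
calls of the same names. The helper name durable_write() and its
data_only parameter are invented for this example.

```python
import os
import tempfile


def durable_write(path, payload, data_only=False):
    """Write payload to path and return once the kernel reports it synced."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while len(view) > 0:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # fdatasync() may skip metadata that is not needed to read the
        # data back; fall back to fsync() where it is unavailable.
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "record.log")
        durable_write(target, b"committed\n")
        print(os.path.getsize(target))
```

Without the explicit sync call, the data could sit in the page cache
until the kernel decides to write it back on its own schedule.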

Ext4 journal optimization

One might assume that, in such a situation, there are a number of
optimizations that could be made in the commit path; that assumption
turns out to be correct. In a USENIX'17 paper, Daejun Park
and Dongkun Shin showed that the current ext4 journaling scheme can
introduce significant latencies because fsync() causes a lot of
unrelated I/O. They proposed a faster scheme, taking into account the
fact that some of the metadata written to the journal could instead
be derived from changes to the inode being written, and it is
possible to commit transactions related to the requested file
descriptor only. Their optimization works in the data=ordered mode.

The fast-commit changes, implemented by Harshad Shirwadkar, are based
on the work of Park and Shin. This work implements an additional
journal for fast commits, but simplifies the commit path. There are
now two journals in the filesystem: the fast-commit journal for
operations that can be optimized, and the regular journal for
"standard commits" whose handling is unchanged. The fast-commit
journal contains operations executed since the last standard commit.

Ext4 uses a generic journaling layer called "Journaling Block
Device 2" (JBD2), with the exact on-disk format documented in the
ext4 wiki. JBD2 operates on blocks, so when it commits a transaction,
this transaction includes all changed blocks. One logical change may
affect multiple blocks, for example the inode table and the block
bitmap.

The fast-commit journal, on the other hand, contains changes at the
file level, resulting in a more compact format. Information that can
be recreated is left out, as described in the patch posting:

"For example, if a new extent is added to an inode, then corresponding
updates to the inode table, the block bitmap, the group descriptor
and the superblock can be derived based on just the extent
information and the corresponding inode information."

During recovery from this journal, the filesystem must recalculate
all changed blocks from the inode changes, and modify all affected
data structures on the disk. This requires specific code paths for
each file operation, and not all of them are implemented right now.
The fast-commits feature currently supports unlinking and linking a
directory entry, creating an inode and a directory entry, adding
blocks to and removing blocks from an inode, and recording an inode
that should be replayed.
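
To make the idea of derivable metadata concrete, here is a toy model
in Python. It is not the ext4 on-disk format or the kernel's replay
code: the Extent record, the derive_metadata_updates() function, and
the fixed 32768 blocks per group are all assumptions chosen for
illustration, and details such as the first data block offset are
ignored. It only shows that one compact "extent added" record is
enough to recompute which bitmap bits and free-block counters change.

```python
from collections import namedtuple

# Hypothetical record: physical blocks newly mapped into an inode.
Extent = namedtuple("Extent", "inode logical_start physical_start length")

BLOCKS_PER_GROUP = 32768  # assumed value for this toy model


def derive_metadata_updates(extent, blocks_per_group=BLOCKS_PER_GROUP):
    """Recompute the bookkeeping implied by one 'extent added' record."""
    bitmap_bits = {}
    first = extent.physical_start
    for block in range(first, first + extent.length):
        # Each physical block belongs to one group and one bitmap bit.
        group, bit = divmod(block, blocks_per_group)
        bitmap_bits.setdefault(group, []).append(bit)
    # Every newly allocated block lowers its group's free count by one.
    group_free_delta = {g: -len(bits) for g, bits in bitmap_bits.items()}
    return {
        "bitmap_bits": bitmap_bits,
        "group_free_delta": group_free_delta,
        "superblock_free_delta": -extent.length,
    }


if __name__ == "__main__":
    record = Extent(inode=12, logical_start=0,
                    physical_start=32766, length=4)
    print(derive_metadata_updates(record))
```

A block-level journal would have to log each of the touched bitmap,
descriptor and superblock blocks in full; the file-level record
carries only the four numbers in the Extent tuple.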

Fast commits are an addition to -- not a replacement of -- the standard
commit path; the two work together. If fast commits cannot handle an
operation, the filesystem falls back to the standard commit path.
This happens, for example, for changes to extended attributes. During
recovery, JBD2 first performs replay of the standard transactions,
then lets the filesystem recover fast commits.
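
The interplay between the two commit paths can be sketched with a
small in-memory model. Everything here is a simplification under
stated assumptions: the ToyJournal class and the operation names are
hypothetical, and the model assumes that a standard commit covers
everything recorded since the previous standard commit, so the
fast-commit area starts over afterward.

```python
# Hypothetical names for the operations fast commits can handle.
FAST_COMMIT_OPS = {"link", "unlink", "create", "add_blocks", "remove_blocks"}


class ToyJournal:
    def __init__(self):
        self.standard = []  # standard commits, each a list of operations
        self.fast = []      # fast commits since the last standard commit
        self.pending = []   # operations not yet committed

    def record(self, op, target):
        self.pending.append((op, target))

    def commit(self):
        """Commit pending operations and report which path was used."""
        if all(op in FAST_COMMIT_OPS for op, _ in self.pending):
            self.fast.extend(self.pending)
            path = "fast"
        else:
            # Fallback: one standard commit covers everything since the
            # previous standard commit, so the fast area is reset.
            self.standard.append(self.fast + self.pending)
            self.fast = []
            path = "standard"
        self.pending = []
        return path

    def replay(self):
        """Recovery order: standard transactions first, then fast commits."""
        ops = [op for transaction in self.standard for op in transaction]
        ops.extend(self.fast)
        return ops


if __name__ == "__main__":
    journal = ToyJournal()
    journal.record("create", "a")
    print(journal.commit())
    journal.record("set_xattr", "a")  # not supported by fast commits
    print(journal.commit())
    journal.record("add_blocks", "a")
    print(journal.commit())
    print(journal.replay())
```

In this model, the extended-attribute change forces the fallback, and
the later block allocation is again eligible for a fast commit.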

fsync() side effects

The fast-commit optimization is designed to work with applications
using fsync() frequently to ensure data integrity. When we look at
the fsync() and fdatasync() man pages, we see that those system calls
only guarantee to write data linked to the given file descriptor.
With ext4, as a side effect of the filesystem structure, all pending
data and metadata for all file descriptors will be flushed instead.
This creates a lot of I/O traffic that is unneeded to satisfy any
given fsync() or fdatasync() call.

This side effect leads to a difference between the paper and the
implementation: a fast commit may still include changes affecting
other files. In a review, Jan Kara asked why unrelated changes are
committed. Shirwadkar replied that, in an earlier version of the
patch, he did indeed write only the file in question. However, this
change broke some existing tests that depend on fsync() working as a
global barrier, so he backed it out.

Ted Ts'o commented that the current version of the patch set keeps
the existing behavior, but he can see workloads where "not requiring
entanglement of unrelated file writes via fsync(2) could be a huge
performance win." He added that a future solution could be a new
system call taking an array of file descriptors to synchronize
together. For now, application developers should base their code on
the POSIX definition, and not rely on that specific fsync() side
effect, as it might change in the future.
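
In practice, that means naming every file that must be durable rather
than counting on one fsync() call to flush the others. The following
Python sketch, assuming a Linux system, does so; the helper names
fsync_path() and sync_files_explicitly() are invented for this
example. It also syncs the containing directories, since the fsync(2)
man page notes that syncing a file does not necessarily make its
directory entry durable.

```python
import os
import tempfile


def fsync_path(path):
    """Open a file or directory and fsync() it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def sync_files_explicitly(paths):
    """fsync() each file by name, then each containing directory.

    Returns the sorted list of directories that were synced.
    """
    directories = set()
    for path in paths:
        fsync_path(path)
        directories.add(os.path.dirname(os.path.abspath(path)))
    for directory in sorted(directories):
        fsync_path(directory)
    return sorted(directories)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        names = [os.path.join(workdir, n) for n in ("a.dat", "b.dat")]
        for name in names:
            with open(name, "wb") as handle:
                handle.write(b"payload")
        print(sync_files_explicitly(names))
```

Code written this way behaves the same whether or not the filesystem
happens to flush unrelated files as a side effect.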

Using fast commits

Fast commits are activated at filesystem creation time, so users will
have to recreate their filesystems to use this feature. In addition,
the required support in e2fsprogs has not yet been added to the main
branch, but is still in development. So interested users will need to
compile the tool on their own, or wait until the feature is supported
by their distribution. When enabled, information on fast commits
shows up in a new /proc/fs/ext4/<dev>/fc_info file, where <dev> is
the name of the device holding the filesystem.
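
A short Python sketch for locating that file on every mounted ext4
device follows. The function name read_fc_info() and its proc_root
parameter are hypothetical, and since the layout of the file's
contents is not described here, the text is returned unparsed.

```python
import glob
import os


def read_fc_info(proc_root="/proc/fs/ext4"):
    """Return {device_name: raw fc_info text} for devices exposing the file."""
    reports = {}
    pattern = os.path.join(proc_root, "*", "fc_info")
    for path in sorted(glob.glob(pattern)):
        device = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as handle:
                reports[device] = handle.read()
        except OSError:
            # The file may vanish or be unreadable; skip that device.
            continue
    return reports


if __name__ == "__main__":
    for device, text in read_fc_info().items():
        print(device)
        print(text)
```

On a kernel or filesystem without fast commits, the function simply
returns an empty dictionary.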

On the development side, there are numerous features to be added to
fast commits. These include making the operations more fine-grained
and supporting more cases that fall back to standard commits today.
Shirwadkar is also working on fast commits with byte-granularity
(instead of the current block-granularity) support for direct-access
(DAX) mode, to be used on persistent memory devices.

The benchmark results given by Shirwadkar in the posted patch set
show 20-200% performance improvements with filesystem benchmarks for
local filesystems, and 30-75% improvement for NFS workloads. We can
assume that the performance gain will be larger in applications
doing many fsync() operations than in those doing only a
few. Either way, though, the fast-commits feature should lead to
better ext4 filesystem performance going forward.

-----------------------------------------

Fast commits for ext4

Posted Jan 15, 2021 18:49 UTC (Fri) by NYKevin (subscriber, #129325)

I have some thoughts about this:

1. As usual, Ted's reasoning seems eminently sensible to me. Apps
that really want to sync the whole filesystem should *already* be
using syncfs(2).
2. Still, I could imagine some apps not knowing about the "and also
you have to fsync the directory" rule (see fsync(2)), which is
currently not really enforced since fsync in practice flushes
"everything." It might be worth special-casing that, or providing a
flag. But hard links make this trickier (which directory do you want
to fsync?).
3. It would be really nice if rename(2) would function as a write
barrier, or at least not be reordered before any writes to the file
that is renamed, but I'm not sure where current filesystems stand on
that... I know this has definitely been discussed in the past, though
(see for example this article: https://lwn.net/Articles/351422/).
4. Maybe if this gets performant enough, we can rename O_DSYNC to
O_PONIES.

What about other filesystems?

Posted Jan 15, 2021 19:57 UTC (Fri) by Wol (subscriber, #4433)

All this talk about "apps should use fsync" or "apps should use
syncfs" fills me with horror, as someone who wants to write a system
that relies on data integrity. Are you telling me that my app needs
to be filesystem-aware, and not only that but aware of what mount
options were used, so I know which commands to call to make sure that
my data is safe?

And no, I don't want to offload that onto the glibc guys either.

I know it'll be hard, but is there any way we can get the VFS to
dictate that certain things are supposed to happen in a certain
order? As an app, I don't care *how* the OS does it, but I want to be
able to reason about what's hit the disk and when. And I DON'T want
to have to worry about what the filesystem is or how it does it.

I've said it before, but I don't care too much about fsync or
syncfs, and I really don't want to have to worry about how the OS
will react to those calls - one only has to go back to the ext3/ext4
transition/debacle where code that was fast on ext3 brought ext4 to
its knees... if I can know for certain that stuff hits the disk in
the order I write it, and tell the system what it can write in
parallel and what it can't ... (basically data can be written in
parallel, logs can be written in parallel, just not in parallel with
each other. And the OS needs me to tell it the difference between the
two, it won't know by itself.)

Cheers,
Wol

What about other filesystems?

Posted Jan 15, 2021 20:19 UTC (Fri) by joib (subscriber, #8541)

Carefully ordering writes was the idea behind soft updates
(https://www.usenix.org/legacy/publications/library/procee...),
previously(?) used in the BSD FFS. I believe it was replaced by a
more traditional journaling approach because modern hardware is
dependent on queuing for good performance, and all current hardware
command queuing mechanisms are unordered. On such hardware soft
updates wasn't able to keep up with journaling, which is better able
to batch updates.

                  Copyright (c) 2021, Eklektix, Inc.
   Comments and public postings are copyrighted by their creators.
          Linux is a registered trademark of Linus Torvalds