[HN Gopher] Disk I/O bottlenecks in GitHub Actions
___________________________________________________________________
Disk I/O bottlenecks in GitHub Actions
Author : jacobwg
Score : 86 points
Date : 2025-03-28 15:22 UTC (7 hours ago)
(HTM) web link (depot.dev)
(TXT) w3m dump (depot.dev)
| ValdikSS wrote:
| `apt` installation could be easily sped-up with `eatmydata`:
| `dpkg` calls `fsync()` on all the unpacked files, which is very
| slow on HDDs, and `eatmydata` hacks it out.
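|
| For example, a minimal sketch (the `eatmydata` wrapper from
| the Debian/Ubuntu package of the same name LD_PRELOADs no-op
| fsync/fdatasync/sync into whatever command it wraps):
|
|     sudo apt-get install -y eatmydata
|     # dpkg's per-file fsync() calls become no-ops inside the
|     # wrapper, so unpack speed is bound by the page cache
|     sudo eatmydata apt-get install -y build-essential cmake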
| nijave wrote:
| Really, it would help if you could just disable fsync at the OS
| level. A bunch of other common package managers and tools call
| it too. Docker is a big culprit.
|
| If you corrupt a CI node, whatever. Just rerun the step
| wtallis wrote:
| CI containers should probably run entirely from tmpfs.
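|
| For instance, a rough sketch of that idea on an ephemeral
| runner (size and path are illustrative; any preloaded images
| get hidden by the new empty mount):
|
|     sudo systemctl stop docker
|     sudo mount -t tmpfs -o size=16g tmpfs /var/lib/docker
|     sudo systemctl start docker
|     # image layers and container filesystems now live in RAM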
| jacobwg wrote:
| We're having some success with doing this at the block
| level (e.g. in-memory writeback cache).
| yjftsjthsd-h wrote:
| Why do it at the block level (instead of tmpfs)? Or do
| you mean that you're doing actual real persistent disks
| that just have a lot of cache sitting in front of them?
| jacobwg wrote:
| The block level has two advantages: (1) you can
| accelerate access to everything on the whole disk (like
| even OS packages) and (2) everything appears as one
| device to the OS, meaning that build tools that want to
| do things like hardlink files in global caches still work
| without any issue.
| candiddevmike wrote:
| We built EtchaOS for this use case--small, immutable, in
| memory variants of Fedora, Debian, Ubuntu, etc bundled with
| Docker. It makes a great CI runner for GitHub Actions, and
| plays nicely with caching:
|
| https://etcha.dev/etchaos/
| nijave wrote:
| Can tmpfs be backed by persistent storage? Most of the
| recent stuff I've worked on is a little too big to fit in
| memory handily. About 20GiB of scratch space for 4-8GiB of
| working memory would be ideal.
|
| I've had good success with machines that have NVMe storage
| (especially on cloud providers) but you still are paying
| the cost of fsync there even if it's a lot faster
| wtallis wrote:
| tmpfs is backed by swap space, in the sense that it will
| overflow to use swap capacity but will not become
| persistent (since the lack of persistence is a feature).
| jacobwg wrote:
| I'd love to experiment with that and/or flags like `noatime`,
| especially when CI nodes are single-use and ephemeral.
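|
| For anyone following along, that particular experiment is a
| one-liner on a running node, e.g.:
|
|     sudo mount -o remount,noatime /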
| 3np wrote:
| atime is so exotic that you shouldn't need to consider
| disabling it an experiment. I consider it legacy at this point.
| formerly_proven wrote:
| noatime is irrelevant because everyone has been using
| relatime for ages, and updating the atime field with
| relatime means you're writing that block to disk anyway,
| since you're updating the mtime field. So no I/O saved.
| kylegalbraith wrote:
| This is a neat idea that we should try. We've tried the
| `eatmydata` thing to speed up dpkg, but the slow part wasn't
| the fsync portion but rather the dpkg database.
| formerly_proven wrote:
| You can probably use a BPF return override on fsync and
| fdatasync and sync_file_range, considering that the main use
| case of that feature is syscall-level error injection.
|
| edit: Or, even easier, just use the pre-built fail_function
| infrastructure (with retval = 0 instead of an error):
| https://docs.kernel.org/fault-injection/fault-injection.html
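|
| Roughly, the fail_function interface from those docs looks
| like the sketch below. Caveat: it only works for functions the
| kernel lists as error-injectable, so check `injectable` first;
| the function name here is a guess:
|
|     cd /sys/kernel/debug/fail_function
|     grep -i sync injectable        # anything sync-related?
|     echo ksys_sync > inject        # hypothetical target
|     echo 0 > ksys_sync/retval      # "succeed" without the IO
|     echo 100 > probability
|     echo -1 > times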
| chippiewill wrote:
| > Docker is a big culprit
|
| Actually, in my experience pulling very large images to run
| with Docker, it turns out that Docker doesn't really do any
| fsync-ing itself. The sync happens when it creates an
| overlayfs mount while creating a container because the
| overlayfs driver in the kernel does it.
|
| A volatile flag was added to the kernel driver a while back,
| but I don't think Docker uses it yet:
| https://www.redhat.com/en/blog/container-volatile-overlay-
| mo...
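|
| For reference, the knob looks roughly like this on a recent
| kernel (5.10+); paths are illustrative:
|
|     mount -t overlay overlay -o \
|       lowerdir=/lower,upperdir=/upper,workdir=/work,volatile \
|       /merged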
| nijave wrote:
| Well yeah, but indirectly through the usage of Docker, I
| mean.
|
| Unpacking the Docker image tarballs can be a bit expensive,
| especially with things like nodejs where you have tons of
| tiny files.
|
| Tearing down overlayfs is a huge issue, though
| Brian_K_White wrote:
| "`dpkg` calls `fsync()` on all the unpacked files"
|
| Why in the world does it do that ????
|
| Ok I googled (kagi). Same reason anyone ever does: pure voodoo.
|
| If you can't trust the kernel to close() then you can't trust
| it to fsync() or anything else either.
|
| Kernel-level crashes, the only kind of crash that risks half-
| written files, are no more likely during dpkg than any other
| time. A bad update is the same bad update regardless, no
| better, no worse.
| duped wrote:
| "durability" isn't voodoo. Consider if dpkg updates libc.so
| and then you yank the power cord before the page cache is
| flushed to disk, or you're on a laptop and the battery dies.
| Brian_K_White wrote:
| Like I said.
| levkk wrote:
| Pretty sure kernel doesn't have to fsync on close. In
| fact, you don't want it to, otherwise you're limiting the
| performance of your page cache. So fsync on install for
| dpkg makes perfect sense.
| Brian_K_White wrote:
| I didn't say it synced. The file is simply "written" and
| available at that point.
|
| It makes no sense to trust that fsync() does what it
| promises but not that close() does what it promises.
| close() promises that when close() returns, the data is
| stored and some other process may open() and find all of
| it verbatim. And that's all you care about or have any
| business caring about unless you are the kernel yourself.
| hiciu wrote:
| I would like to introduce you to a few case studies on
| bugzilla:
|
| https://wiki.debian.org/Teams/Dpkg/FAQ#Q:_Why_is_dpkg_so_
| slo...
| switch007 wrote:
| Get involved, tell and show them you know better. They
| have a bug tracker, mailing list etc
| timewizard wrote:
| > before the page cache is flushed to disk
|
| And if you yank the cord before the package is fully unpacked?
| Wouldn't that just be the same problem? Solving that
| problem involves simply unpacking to a temporary location
| first, verifying all the files were extracted correctly,
| and then renaming them into existence. Which actually
| solves both problems.
|
| Package management is stuck in a 1990s idea of "efficiency"
| which is entirely unwarranted. I have more than enough hard
| drive space to install the distribution several times over.
| Stop trying to be clever.
| hiciu wrote:
| > Wouldn't that just be the same problem?
|
| Not the same problem, it's half-written file vs half of
| the files in older version.
|
| > Which actually solves both problems.
|
| It does not, and you would have to guarantee that multiple
| rename operations are executed in a transaction. Which
| you can't, unless you have a really fancy filesystem.
|
| > Stop trying to be clever.
|
| It's called being correct and reliable.
| timewizard wrote:
| > have to guarantee that multiple rename operations are
| executed in a transaction. Which you can't. Unless you
| have really fancy filesystem
|
| Not strictly. You have to guarantee that after reboot you
| roll back any partial package operations. This is what a
| filesystem journal does anyway. So it would be one
| fsync() per package and not one per every file in the
| package. The failure mode implies a reboot must occur.
|
| > It's called being correct and reliable.
|
| There are multiple ways to achieve this. There are
| different requirements among different systems which is
| the whole point of this post. And your version of
| "correct and reliable" depends on /when/ I pull the plug.
| So you're paying a huge price to shift the problem from
| one side of the line to the other in what is not clearly
| a useful or pragmatic way.
| inetknght wrote:
| > _Kernel-level crashes, the only kind of crash that risks
| half-written files, are no more likely during dpkg than any
| other time. A bad update is the same bad update regardless,
| no better, no worse._
|
| Imagine this scenario; you're writing a CI pipeline:
|
| 1. You write some script to `apt-get install` blah blah
|
| 2. As soon as the script is done, your CI job finishes.
|
| 3. Your job is finished, so the VM is powered off.
|
| 4. The hypervisor hits the power button but, oops, the VM
| still had dirty disk cache/pending writes.
|
| The hypervisor _may_ immediately pull the power (chaos monkey
| style; developers don't have patience), in which case those
| writes are now corrupted. Or, it _may_ use ACPI shutdown
| which then _should_ also have an ultimate timeout before
| pulling power (otherwise stalled IO might prevent resources
| from ever being cleaned up).
|
| If you rely on the sync happening at step 4, when the kernel
| gracefully exits, how long does the kernel wait before it
| decides a shutdown timeout has occurred? How long does the
| hypervisor wait, and is it longer than the kernel would
| wait? Are you even sure that the VM shutdown command you're
| sending is the graceful one?
|
| How would you fsync at step 3?
|
| For step 2, perhaps you might have an exit script that calls
| `fsync`.
|
| For step 1, perhaps you might call `fsync` after `apt-get
| install` is done.
| Brian_K_White wrote:
| This is like people who think they have designed their own,
| even better encryption algorithm. Voodoo. You are not
| solving a problem better than the kernel and filesystem
| (and hypervisor in this case) already have. If you are
| not writing the kernel or a driver or bootloader itself,
| then fsync() is not your problem or responsibility and you
| aren't filling any holes left by anyone else. You're just
| rubbing the belly of the Buddha statue for good luck to feel
| better.
| inetknght wrote:
| You didn't answer any of the questions posed by the
| outlined scenario.
|
| Until you answer how you've solved the "I want to make
| sure my data is written to disk before the hypervisor
| powers off the virtual machine at the end of the
| successful run" problem, then I claim that this is
| absolutely _not_ voodoo.
|
| I suggest you actually read the documentation of all of
| these things before you start claiming that `fsync()` is
| exclusively the purpose of kernel, driver, or bootloader
| developers.
| hiciu wrote:
| This is actually an interesting topic, and it turns out the
| kernel never made any promises about close() and data being
| on the disk :)
|
| and about kernel-level crashes: yes, but you see, dpkg
| creates a new file on the disk, makes sure it is written
| correctly with fsync() and then calls rename() (or something
| like that) to atomically replace old file with new one.
|
| So there is never a possibility of given file being corrupt
| during update.
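|
| In shell terms the pattern is roughly this (file names are
| illustrative; coreutils sync(1) with a file argument does an
| fsync on just that file):
|
|     cp libfoo.so.new /usr/lib/libfoo.so.dpkg-new
|     sync /usr/lib/libfoo.so.dpkg-new           # data on disk
|     mv /usr/lib/libfoo.so.dpkg-new /usr/lib/libfoo.so  # atomic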
| scottlamb wrote:
| Let's start from the assumption that dpkg shouldn't commit to
| its database that package X is installed/updated until all
| the on-disk files reflect that. Then if the operation fails
| and you try again later (on next boot or whatever) it knows
| to check that package's state and try rolling forward.
|
| > If you can't trust the kernel to close() then you can't
| trust it to fsync() or anything else either.
|
| https://man7.org/linux/man-pages/man2/close.2.html
|   A successful close does not guarantee that the data has
|   been successfully saved to disk, as the kernel uses the
|   buffer cache to defer writes. Typically, filesystems do
|   not flush buffers when a file is closed. If you need to
|   be sure that the data is physically stored on the
|   underlying disk, use fsync(2). (It will depend on the
|   disk hardware at this point.)
|
| So if you want to wait until it's been saved to disk, you
| have to do an fsync first. If you even just want to know if
| it succeeded or failed, you have to do an fsync first.
|
| Of course none of this matters much on an ephemeral Github
| Actions VM. There's no "on next boot or whatever". So this is
| one environment where it makes sense to bypass all this
| careful durability work that I'd normally be totally behind.
| It seems reasonable enough to say it's reached the page
| cache, it should continue being visible in the current boot,
| and tomorrow will never come.
| Brian_K_White wrote:
| Writing to disk is none of your business unless you are the
| kernel itself.
| scottlamb wrote:
| Huh? Are you thinking dpkg is sending NVMe commands
| itself or something? No, that's not what that manpage
| means. dpkg is asking the kernel to write stuff to the
| page cache, and then asking for a guarantee that the data
| will continue to exist on next boot. The second half is
| what fsync does. Without fsync returning success, this is
| not guaranteed at all. And if you don't understand the
| point of the guarantee after reading my previous
| comment...and this is your way of saying so...further
| conversation will be pointless...
| switch007 wrote:
| What's your experience developing a package manager for one
| of the world's most popular Linux distributions?
|
| Maybe they know something you don't ?????
| the8472 wrote:
| No need for eatmydata, dpkg has an unsafe-io option.
|
| Other options are to use an overlay mount with volatile or ext4
| with nobarrier and writeback.
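|
| For reference, the classic trick from minimal container base
| images (persistent config plus the per-invocation form):
|
|     echo force-unsafe-io | \
|       sudo tee /etc/dpkg/dpkg.cfg.d/unsafe-io
|     # or per invocation:
|     sudo dpkg --force-unsafe-io -i some-package.deb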
| ValdikSS wrote:
| unsafe-io eliminates most fsyncs, but not all of them.
| inetknght wrote:
| So... the goal is to make `apt` web scale?
|
| (to be clear: my comment is sarcasm and web scale is a
| reference to a joke about reliability [0])
|
| [0]: https://www.youtube.com/watch?v=b2F-DItXtZs
| suryao wrote:
| TLDR: disk is often the bottleneck in builds. Use 'fio' to
| measure the disk's performance.
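|
| For example, a quick random-read/write check against the
| runner's working directory (parameters are just a starting
| point):
|
|     fio --name=ci-disk --directory=/tmp --rw=randrw --bs=4k \
|       --size=1G --numjobs=4 --iodepth=32 --ioengine=libaio \
|       --direct=1 --runtime=30 --time_based --group_reporting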
|
| If you want to truly speed up builds by optimizing disk
| performance, there are no shortcuts to physically attaching NVMe
| storage with high throughput and high IOPS to your compute
| directly.
|
| That's what we do at WarpBuild[0] and we outperform Depot runners
| handily. This is because we do not use network attached disks
| which come with relatively higher latency. Our runners are also
| coupled with faster processors.
|
| I love the Depot content team though, it does a lot of heavy
| lifting.
|
| [0] https://www.warpbuild.com
| miohtama wrote:
| If you can afford it, upgrade your CI runners on GitHub to the
| paid offering. Highly recommended: less drinking coffee, more
| instant unit test results. Pay as you go.
| striking wrote:
| As a Depot customer, I'd say if you can afford to pay for
| GitHub's runners, you should pay for Depot's instead. They boot
| faster, run faster, are a fraction of the price. And they are
| lovely people who provide amazing support.
| kylegalbraith wrote:
| This is what we focus on with Depot. Faster builds across the
| board without breaking the bank. More time to get things done
| and maybe go outside earlier.
|
| Trading Strategy looks super cool, by the way.
| jacobwg wrote:
| A list of fun things we've done to improve our CI runners:
|
| - Configured a block-level in-memory disk accelerator / cache (fs
| operations at the speed of RAM!)
|
| - Benchmarked EC2 instance types (m7a is the best x86 today, m8g
| is the best arm64)
|
| - "Warming" the root EBS volume by accessing a set of priority
| blocks before the job starts to give the job full disk
| performance [0]
|
| - Launching each runner instance in a public subnet with a public
| IP - the runner gets full throughput from AWS to the public
| internet, and IP-based rate limits rarely apply (Docker Hub)
|
| - Configuring Docker with containerd/estargz support
|
| - Just generally turning off kernel options and unit files that
| aren't needed
|
| [0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-
| initial...
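|
| On the warming point: the brute-force version described in [0]
| is just a full sequential read of the device, along these
| lines (device name illustrative); we only touch a priority
| subset of blocks, but the idea is the same:
|
|     sudo fio --filename=/dev/nvme0n1 --rw=read --bs=1M \
|       --iodepth=32 --ioengine=libaio --direct=1 \
|       --name=volume-initialize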
| 3np wrote:
| > Launching each runner instance in a public subnet with a
| public IP - the runner gets full throughput from AWS to the
| public internet, and IP-based rate limits rarely apply (Docker
| Hub)
|
| Are you not using a caching registry mirror, instead pulling
| the same image from Hub for each runner...? If so that seems
| like it would be an easy win to add, unless you specifically do
| mostly hot/unique pulls.
|
| The more efficient answer to those rate limits is almost always
| to pull fewer times for the same work rather than scaling in a
| way that circumvents them.
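|
| Concretely that's a one-line daemon.json change pointing at a
| pull-through cache (e.g. registry:2 with proxy.remoteurl set
| to Docker Hub); the mirror URL here is hypothetical:
|
|     cat <<'EOF' | sudo tee /etc/docker/daemon.json
|     { "registry-mirrors": ["https://mirror.internal.example"] }
|     EOF
|     sudo systemctl restart docker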
| jacobwg wrote:
| Today we (Depot) are not, though some of our customers
| configure this. For the moment at least, the ephemeral public
| IP architecture makes it generally unnecessary from a rate-
| limit perspective.
|
| From a performance / efficiency perspective, we generally
| recommend using ECR Public images[0], since AWS hosts mirrors
| of all the "Docker official" images, and throughput to ECR
| Public is great from inside AWS.
|
| [0] https://gallery.ecr.aws/
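|
| e.g. the Docker official images live under the docker/library
| namespace there:
|
|     docker pull public.ecr.aws/docker/library/ubuntu:24.04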
| glenjamin wrote:
| If you're running inside AWS us-east-1 then docker hub will
| give you direct S3 URLs for layer downloads (or it used to
| anyway)
|
| Any pulls doing this become zero cost for docker hub
|
| Any sort of cache you put between docker hub and your own
| infra would probably be S3 backed anyway, so adding another
| cache in between could be mostly a waste
| jacobwg wrote:
| Yeah we do some similar tricks with our registry[0]:
| pushes and pulls from inside AWS are served directly from
| AWS for maximum performance and no data transfer cost.
| Then when the client is outside AWS, we redirect all that
| to Tigris[1], also for maximum performance (CDN) and
| minimum data transfer cost (no cost from Tigris, just the
| cost to move content out of AWS once).
|
| [0]: https://depot.dev/blog/introducing-depot-registry
|
| [1]: https://www.tigrisdata.com/blog/depot-registry/
| philsnow wrote:
| > Configured a block-level in-memory disk accelerator / cache
| (fs operations at the speed of RAM!)
|
| I'm slightly old; is that the same thing as a ramdisk?
| https://en.wikipedia.org/wiki/RAM_drive
| jacobwg wrote:
| Exactly, a ramdisk-backed writeback cache for the root volume
| for Linux. For macOS we wrote a custom nbd filter to achieve
| the same thing.
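|
| For anyone wanting to experiment with something in that spirit
| using stock kernel pieces (not necessarily what we ship),
| dm-writecache over a brd ramdisk gets you close; device names
| and sizes are illustrative:
|
|     modprobe brd rd_nr=1 rd_size=2097152    # 2GiB /dev/ram0
|     SECTORS=$(blockdev --getsz /dev/nvme1n1)
|     dmsetup create fastdisk --table \
|       "0 $SECTORS writecache s /dev/nvme1n1 /dev/ram0 4096 0"
|     # then mount /dev/mapper/fastdisk instead of the raw device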
| philsnow wrote:
| Forgive me, I'm not trying to be argumentative, but doesn't
| Linux (and presumably all modern OSes) already have a ram-
| backed writeback cache for filesystems? That sounds exactly
| like the page cache.
| trillic wrote:
| If you clearly understand your access patterns and memory
| requirements, you can often outperform the default OS
| page cache.
|
| Consider a scenario where your VM has 4GB of RAM, but
| your build accesses a total of 6GB worth of files.
| Suppose your code interacts with 16GB of data, yet at any
| moment, its active working set is only around 2GB. If you
| preload all Docker images at the start of your build,
| they'll initially be cached in RAM. However, as your
| build progresses, the kernel will begin evicting these
| cached images to accommodate recently accessed data,
| potentially even files used infrequently or just once.
| And that's the key bit: you want to force caching of files
| you know are accessed more than once.
|
| By implementing your own caching layer, you gain explicit
| control, allowing critical data to remain persistently
| cached in memory. In contrast, the kernel-managed page
| cache treats cached pages as opportunistic, evicting the
| least recently used pages whenever new data must be
| accommodated, even if this new data isn't frequently
| accessed.
| philsnow wrote:
| > If you clearly understand your access patterns and
| memory requirements, you can often outperform the default
| OS page cache.
|
| I believe various RDBMSs bypass the page cache and use
| their own strategies for managing caching if you give
| them access to raw block devices, right?
| jacobwg wrote:
| No worries, entirely valid question. There may be ways to
| tune page cache to be more like this, but my mental model
| for what we've done is effectively make reads and writes
| transparently redirect to the equivalent of a tmpfs, up
| to a certain size. If you reserve 2GB of memory for the
| cache, and the CI job's read and written files are less
| than 2GB, then _everything_ stays in RAM, at RAM
| throughput/IOPS. When you exceed the limit of the cache,
| blocks are moved to the physical disk in the background.
| Feels like we have more direct control here than page
| cache (and the page cache is still helping out in this
| scenario too, so it's more that we're using both).
| philsnow wrote:
| > reads and writes transparently redirect to the
| equivalent of a tmpfs, _up to a certain size_
|
| The last bit (emphasis added) sounds novel to me, I don't
| think I've heard before of anybody doing that. It sounds
| like an almost-"free" way to get a ton of performance
| ("almost" because somebody has to figure out the sizing.
| Though, I bet you could automate that by having your tool
| export a "desired size" metric that's equal to the high
| watermark of tmpfs-like storage used during the CI run)
| nine_k wrote:
| No, it's more like swapping pages to disk when RAM is
| full, or like using RAM when the L2 cache is full.
|
| Linux page cache exists to speed up access to the durable
| store which is the underlying block device (NVMe, SSD,
| HDD, etc).
|
| The RAM-backed block device in question here is more like
| tmpfs, but with an ability to use the disk if, and only
| if, it overflows. There's no intention or need to store
| its whole contents on the durable "disk" device.
|
| Hence you can do things entirely in RAM as long as your
| CI/CD job can fit all the data there, but if it can't
| fit, the job just gets slower instead of failing.
| jiocrag wrote:
| have you tried Buildkite? https://buildkite.com
| larusso wrote:
| So I had to read to the end to realize it's kind of an
| infomercial. Ok, fair enough. Didn't know what Depot was,
| though.
| crmd wrote:
| This is exactly the kind of content marketing I want to see. The
| IO bottleneck data and the fio scripts are useful to all. Then at
| the end a link to their product which I'd never heard of, in case
| you're dealing with the issue at hand.
| kylegalbraith wrote:
| Thank you for the kind words. We're always trying to share our
| knowledge even if Depot isn't a good fit for everyone. I hope
| the scripts get some mileage!
| nodesocket wrote:
| I just migrated multiple ARM64 GitHub Actions Docker builds from
| my self-hosted runner (Raspberry Pi in my homelab) to
| Blacksmith.io and I'm really impressed with the performance so
| far. Only downside is no Docker layer and image cache like I had
| on my self-hosted runner, but I can't complain on the free tier.
| adityamaru wrote:
| Have you checked out https://docs.blacksmith.sh/docker-
| builds/incremental-docker-...? This should help set up a shared,
| persistent Docker layer cache for your runners.
| nodesocket wrote:
| Thanks for sharing. I have a custom bash script which does
| the docker builds currently and swapping to
| useblacksmith/build-push-action would take a bit of
| refactoring which I don't want to spend the time on now. :-)
| kayson wrote:
| Bummer there's no free tier. I've been bashing my head against an
| intermittent CI failure problem on GitHub runners for probably a
| couple of years now. I think it's related to the networking stack
| in their runner image and the fact that I'm using Docker-in-
| Docker to unit test a Docker firewall. While I do appreciate that
| someone at GitHub did actually look at my issue, they totally
| missed the point:
| https://github.com/actions/runner-images/issues/11786
|
| Are there any reasonable alternatives for a really tiny FOSS
| project?
| crohr wrote:
| I'm maintaining a benchmark of various GitHub Actions providers'
| I/O speed [1]. Depot is not present because my account was
| blocked, but I would love to compare! The disk accelerator looks
| like a nice feature.
|
| [1]: https://runs-on.com/benchmarks/github-actions-disk-
| performan...
___________________________________________________________________
(page generated 2025-03-28 23:01 UTC)