[HN Gopher] Disk I/O bottlenecks in GitHub Actions
___________________________________________________________________
Disk I/O bottlenecks in GitHub Actions
Author : jacobwg
Score : 86 points
Date : 2025-03-28 15:22 UTC (7 hours ago)
(HTM) web link (depot.dev)
(TXT) w3m dump (depot.dev)
| ValdikSS wrote:
| `apt` installation could be easily sped-up with `eatmydata`:
| `dpkg` calls `fsync()` on all the unpacked files, which is very
| slow on HDDs, and `eatmydata` hacks it out.
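|
| For example, a minimal sketch (the `eatmydata` wrapper from
| the Debian/Ubuntu package of the same name LD_PRELOADs no-op
| fsync/fdatasync/sync into whatever command it wraps):
|
|     sudo apt-get install -y eatmydata
|     # dpkg's per-file fsync() calls become no-ops inside the
|     # wrapper, so unpack speed is bound by the page cache
|     sudo eatmydata apt-get install -y build-essential cmake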
| nijave wrote:
| Really, it would help if you could just disable fsync at the OS
| level. A bunch of other common package managers and tools call
| it too. Docker is a big culprit.
|
| If you corrupt a CI node, whatever. Just rerun the step
| wtallis wrote:
| CI containers should probably run entirely from tmpfs.
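|
| For instance, a rough sketch of that idea on an ephemeral
| runner (size and path are illustrative; any preloaded images
| get hidden by the new empty mount):
|
|     sudo systemctl stop docker
|     sudo mount -t tmpfs -o size=16g tmpfs /var/lib/docker
|     sudo systemctl start docker
|     # image layers and container filesystems now live in RAM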
| jacobwg wrote:
| We're having some success with doing this at the block
| level (e.g. in-memory writeback cache).
| yjftsjthsd-h wrote:
| Why do it at the block level (instead of tmpfs)? Or do
| you mean that you're doing actual real persistent disks
| that just have a lot of cache sitting in front of them?
| jacobwg wrote:
| The block level has two advantages: (1) you can
| accelerate access to everything on the whole disk (like
| even OS packages) and (2) everything appears as one
| device to the OS, meaning that build tools that want to
| do things like hardlink files in global caches still work
| without any issue.
| candiddevmike wrote:
| We built EtchaOS for this use case--small, immutable, in
| memory variants of Fedora, Debian, Ubuntu, etc bundled with
| Docker. It makes a great CI runner for GitHub Actions, and
| plays nicely with caching:
|
| https://etcha.dev/etchaos/
| nijave wrote:
| Can tmpfs be backed by persistent storage? Most of the
| recent stuff I've worked on is a little too big to fit in
| memory handily. About 20GiB of scratch space for 4-8GiB of
| working memory would be ideal.
|
| I've had good success with machines that have NVMe storage
| (especially on cloud providers) but you still are paying
| the cost of fsync there even if it's a lot faster
| wtallis wrote:
| tmpfs is backed by swap space, in the sense that it will
| overflow to use swap capacity but will not become
| persistent (since the lack of persistence is a feature).
| jacobwg wrote:
| I'd love to experiment with that and/or flags like `noatime`,
| especially when CI nodes are single-use and ephemeral.
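|
| For anyone following along, that particular experiment is a
| one-liner on a running node, e.g.:
|
|     sudo mount -o remount,noatime /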
| 3np wrote:
| atime is so exotic that you shouldn't need to consider
| disabling it an experiment. I consider it legacy at this point.
| formerly_proven wrote:
| noatime is irrelevant because everyone has been using
| relatime for ages, and updating the atime field with
| relatime means you're writing that block to disk anyway,
| since you're updating the mtime field. So no I/O saved.
| kylegalbraith wrote:
| This is a neat idea that we should try. We've tried the
| `eatmydata` thing to speed up dpkg, but the slow part wasn't
| the fsync portion but rather the dpkg database.
| formerly_proven wrote:
| You can probably use a BPF return override on fsync and
| fdatasync and sync_file_range, considering that the main use
| case of that feature is syscall-level error injection.
|
| edit: Or, even easier, just use the pre-built fail_function
| infrastructure (with retval = 0 instead of an error):
| https://docs.kernel.org/fault-injection/fault-injection.html
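|
| Roughly, the fail_function interface from those docs looks
| like the sketch below. Caveat: it only works for functions the
| kernel lists as error-injectable, so check `injectable` first;
| the function name here is a guess:
|
|     cd /sys/kernel/debug/fail_function
|     grep -i sync injectable        # anything sync-related?
|     echo ksys_sync > inject        # hypothetical target
|     echo 0 > ksys_sync/retval      # "succeed" without the IO
|     echo 100 > probability
|     echo -1 > times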
| chippiewill wrote:
| > Docker is a big culprit
|
| Actually, in my experience pulling very large images to run
| with Docker, it turns out that Docker doesn't really do any
| fsync-ing itself. The sync happens when it creates an
| overlayfs mount while creating a container because the
| overlayfs driver in the kernel does it.
|
| A volatile flag was added to the kernel driver a while back,
| but I don't think Docker uses it yet:
| https://www.redhat.com/en/blog/container-volatile-overlay-
| mo...
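|
| For reference, the knob looks roughly like this on a recent
| kernel (5.10+); paths are illustrative:
|
|     mount -t overlay overlay -o \
|       lowerdir=/lower,upperdir=/upper,workdir=/work,volatile \
|       /merged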
| nijave wrote:
| Well yeah, but indirectly through the usage of Docker, I
| mean.
|
| Unpacking the Docker image tarballs can be a bit expensive,
| especially with things like nodejs where you have tons of
| tiny files.
|
| Tearing down overlayfs is a huge issue, though
| Brian_K_White wrote:
| "`dpkg` calls `fsync()` on all the unpacked files"
|
| Why in the world does it do that ????
|
| Ok I googled (kagi). Same reason anyone ever does: pure voodoo.
|
| If you can't trust the kernel to close() then you can't trust
| it to fsync() or anything else either.
|
| Kernel-level crashes, the only kind of crash that risks half-
| written files, are no more likely during dpkg than any other
| time. A bad update is the same bad update regardless, no
| better, no worse.
| duped wrote:
| "durability" isn't voodoo. Consider if dpkg updates libc.so
| and then you yank the power cord before the page cache is
| flushed to disk, or you're on a laptop and the battery dies.
| Brian_K_White wrote:
| Like I said.
| levkk wrote:
| Pretty sure kernel doesn't have to fsync on close. In
| fact, you don't want it to, otherwise you're limiting the
| performance of your page cache. So fsync on install for
| dpkg makes perfect sense.
| Brian_K_White wrote:
| I didn't say it synced. The file is simply "written" and
| available at that point.
|
| It makes no sense to trust that fsync() does what it
| promises but not that close() does what it promises.
| close() promises that when close() returns, the data is
| stored and some other process may open() and find all of
| it verbatim. And that's all you care about or have any
| business caring about unless you are the kernel yourself.
| hiciu wrote:
| I would like to introduce you to a few case studies on
| bugzilla:
|
| https://wiki.debian.org/Teams/Dpkg/FAQ#Q:_Why_is_dpkg_so_
| slo...
| switch007 wrote:
| Get involved, tell and show them you know better. They
| have a bug tracker, mailing list etc
| timewizard wrote:
| > before the page cache is flushed to disk
|
| And if you yank the cord before the package is fully unpacked?
| Wouldn't that just be the same problem? Solving that
| problem involves simply unpacking to a temporary location
| first, verifying all the files were extracted correctly,
| and then renaming them into existence. Which actually
| solves both problems.
|
| Package management is stuck in a 1990s idea of "efficiency"
| which is entirely unwarranted. I have more than enough hard
| drive space to install the distribution several times over.
| Stop trying to be clever.
| hiciu wrote:
| > Wouldn't that just be the same problem?
|
| Not the same problem, it's half-written file vs half of
| the files in older version.
|
| > Which actually solves both problems.
|
| It does not, and you would have to guarantee that multiple
| rename operations are executed in a transaction. Which
| you can't, unless you have a really fancy filesystem.
|
| > Stop trying to be clever.
|
| It's called being correct and reliable.
| timewizard wrote:
| > have to guarantee that multiple rename operations are
| executed in a transaction. Which you can't. Unless you
| have really fancy filesystem
|
| Not strictly. You have to guarantee that after reboot you
| roll back any partial package operations. This is what a
| filesystem journal does anyway. So it would be one
| fsync() per package and not one per every file in the
| package. The failure mode implies a reboot must occur.
|
| > It's called being correct and reliable.
|
| There are multiple ways to achieve this. There are
| different requirements among different systems which is
| the whole point of this post. And your version of
| "correct and reliable" depends on /when/ I pull the plug.
| So you're paying a huge price to shift the problem from
| one side of the line to the other in what is not clearly
| a useful or pragmatic way.
| inetknght wrote:
| > _Kernel-level crashes, the only kind of crash that risks
| half-written files, are no more likely during dpkg than any
| other time. A bad update is the same bad update regardless,
| no better, no worse._
|
| Imagine this scenario; you're writing a CI pipeline:
|
| 1. You write some script to `apt-get install` blah blah
|
| 2. As soon as the script is done, your CI job finishes.
|
| 3. Your job is finished, so the VM is powered off.
|
| 4. The hypervisor hits the power button but, oops, the VM
| still had dirty disk cache/pending writes.
|
| The hypervisor _may_ immediately pull the power (chaos monkey
| style; developers don't have patience), in which case those
| writes are now corrupted. Or, it _may_ use ACPI shutdown
| which then _should_ also have an ultimate timeout before
| pulling power (otherwise stalled IO might prevent resources
| from ever being cleaned up).
|
| If you rely on the sync happening at step 4, when the kernel
| gracefully exits, how long does the kernel wait before it
| decides a shutdown timeout has occurred? How long does the
| hypervisor wait, and is it longer than the kernel would
| wait? Are you even sure that the VM shutdown command you're
| sending is the graceful one?
|
| How would you fsync at step 3?
|
| For step 2, perhaps you might have an exit script that calls
| `fsync`.
|
| For step 1, perhaps you might call `fsync` after `apt-get
| install` is done.
| Brian_K_White wrote:
| This is like people who think they have designed their own,
| even better encryption algorithm. Voodoo. You are not
| solving a problem better than the kernel and filesystem
| (and hypervisor in this case) already have. If you are
| not writing the kernel or a driver or bootloader itself,
| then fsync() is not your problem or responsibility and you
| aren't filling any holes left by anyone else. You're just
| rubbing the belly of the Buddha statue for good luck to feel
| better.
| inetknght wrote:
| You didn't answer any of the questions posed by the
| outlined scenario.
|
| Until you answer how you've solved the "I want to make
| sure my data is written to disk before the hypervisor
| powers off the virtual machine at the end of the
| successful run" problem, then I claim that this is
| absolutely _not_ voodoo.
|
| I suggest you actually read the documentation of all of
| these things before you start claiming that `fsync()` is
| exclusively the purpose of kernel, driver, or bootloader
| developers.
| hiciu wrote:
| This is actually an interesting topic, and it turns out the
| kernel never made any promises about close() and data being
| on the disk :)
|
| and about kernel-level crashes: yes, but you see, dpkg
| creates a new file on the disk, makes sure it is written
| correctly with fsync() and then calls rename() (or something
| like that) to atomically replace old file with new one.
|
| So there is never a possibility of given file being corrupt
| during update.
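|
| In shell terms the pattern is roughly this (file names are
| illustrative; coreutils sync(1) with a file argument does an
| fsync on just that file):
|
|     cp libfoo.so.new /usr/lib/libfoo.so.dpkg-new
|     sync /usr/lib/libfoo.so.dpkg-new           # data on disk
|     mv /usr/lib/libfoo.so.dpkg-new /usr/lib/libfoo.so  # atomic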
| scottlamb wrote:
| Let's start from the assumption that dpkg shouldn't commit to
| its database that package X is installed/updated until all
| the on-disk files reflect that. Then if the operation fails
| and you try again later (on next boot or whatever) it knows
| to check that package's state and try rolling forward.
|
| > If you can't trust the kernel to close() then you can't
| trust it to fsync() or anything else either.
|
| https://man7.org/linux/man-pages/man2/close.2.html
|   A successful close does not guarantee that the data has
|   been successfully saved to disk, as the kernel uses the
|   buffer cache to defer writes. Typically, filesystems do
|   not flush buffers when a file is closed. If you need to
|   be sure that the data is physically stored on the
|   underlying disk, use fsync(2). (It will depend on the
|   disk hardware at this point.)
|
| So if you want to wait until it's been saved to disk, you
| have to do an fsync first. If you even just want to know if
| it succeeded or failed, you have to do an fsync first.
|
| Of course none of this matters much on an ephemeral Github
| Actions VM. There's no "on next boot or whatever". So this is
| one environment where it makes sense to bypass all this
| careful durability work that I'd normally be totally behind.
| It seems reasonable enough to say it's reached the page
| cache, it should continue being visible in the current boot,
| and tomorrow will never come.
| Brian_K_White wrote:
| Writing to disk is none of your business unless you are the
| kernel itself.
| scottlamb wrote:
| Huh? Are you thinking dpkg is sending NVMe commands
| itself or something? No, that's not what that manpage
| means. dpkg is asking the kernel to write stuff to the
| page cache, and then asking for a guarantee that the data
| will continue to exist on next boot. The second half is
| what fsync does. Without fsync returning success, this is
| not guaranteed at all. And if you don't understand the
| point of the guarantee after reading my previous
| comment...and this is your way of saying so...further
| conversation will be pointless...
| switch007 wrote:
| What's your experience developing a package manager for one
| of the world's most popular Linux distributions?
|
| Maybe they know something you don't ?????
| the8472 wrote:
| No need for eatmydata, dpkg has an unsafe-io option.
|
| Other options are to use an overlay mount with volatile or ext4
| with nobarrier and writeback.
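|
| For reference, the classic trick from minimal container base
| images (persistent config plus the per-invocation form):
|
|     echo force-unsafe-io | \
|       sudo tee /etc/dpkg/dpkg.cfg.d/unsafe-io
|     # or per invocation:
|     sudo dpkg --force-unsafe-io -i some-package.deb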
| ValdikSS wrote:
| unsafe-io eliminates most fsyncs, but not all of them.
| inetknght wrote:
| So... the goal is to make `apt` web scale?
|
| (to be clear: my comment is sarcasm and web scale is a
| reference to a joke about reliability [0])
|
| [0]: https://www.youtube.com/watch?v=b2F-DItXtZs
| suryao wrote:
| TLDR: disk is often the bottleneck in builds. Use 'fio' to
| measure the disk's performance.
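|
| For example, a quick random-read/write check against the
| runner's working directory (parameters are just a starting
| point):
|
|     fio --name=ci-disk --directory=/tmp --rw=randrw --bs=4k \
|       --size=1G --numjobs=4 --iodepth=32 --ioengine=libaio \
|       --direct=1 --runtime=30 --time_based --group_reporting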
|
| If you want to truly speed up builds by optimizing disk
| performance, there are no shortcuts to physically attaching NVMe
| storage with high throughput and high IOPS to your compute
| directly.
|
| That's what we do at WarpBuild[0] and we outperform Depot runners
| handily. This is because we do not use network attached disks
| which come with relatively higher latency. Our runners are also
| coupled with faster processors.
|
| I love the Depot content team though, it does a lot of heavy
| lifting.
|
| [0] https://www.warpbuild.com
| miohtama wrote:
| If you can afford it, upgrade your CI runners on GitHub to the
| paid offering. Highly recommended: less drinking coffee, more
| instant unit test results. Pay as you go.
| striking wrote:
| As a Depot customer, I'd say if you can afford to pay for
| GitHub's runners, you should pay for Depot's instead. They boot
| faster, run faster, are a fraction of the price. And they are
| lovely people who provide amazing support.
| kylegalbraith wrote:
| This is what we focus on with Depot. Faster builds across the
| board without breaking the bank. More time to get things done
| and maybe go outside earlier.
|
| Trading Strategy looks super cool, by the way.
| jacobwg wrote:
| A list of fun things we've done to improve our CI runners:
|
| - Configured a block-level in-memory disk accelerator / cache (fs
| operations at the speed of RAM!)
|
| - Benchmarked EC2 instance types (m7a is the best x86 today, m8g
| is the best arm64)
|
| - "Warming" the root EBS volume by accessing a set of priority
| blocks before the job starts to give the job full disk
| performance [0]
|
| - Launching each runner instance in a public subnet with a public
| IP - the runner gets full throughput from AWS to the public
| internet, and IP-based rate limits rarely apply (Docker Hub)
|
| - Configuring Docker with containerd/estargz support
|
| - Just generally turning off kernel options and unit files that
| aren't needed
|
| [0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-
| initial...
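|
| On the warming point: the brute-force version described in [0]
| is just a full sequential read of the device, along these
| lines (device name illustrative); we only touch a priority
| subset of blocks, but the idea is the same:
|
|     sudo fio --filename=/dev/nvme0n1 --rw=read --bs=1M \
|       --iodepth=32 --ioengine=libaio --direct=1 \
|       --name=volume-initialize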
| 3np wrote:
| > Launching each runner instance in a public subnet with a
| public IP - the runner gets full throughput from AWS to the
| public internet, and IP-based rate limits rarely apply (Docker
| Hub)
|
| Are you not using a caching registry mirror, instead pulling
| the same image from Hub for each runner...? If so that seems
| like it would be an easy win to add, unless you specifically do
| mostly hot/unique pulls.
|
| The more efficient answer to those rate limits is almost always
| to pull fewer times for the same work rather than scaling in a
| way that circumvents them.
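|
| Concretely that's a one-line daemon.json change pointing at a
| pull-through cache (e.g. registry:2 with proxy.remoteurl set
| to Docker Hub); the mirror URL here is hypothetical:
|
|     cat <<'EOF' | sudo tee /etc/docker/daemon.json
|     { "registry-mirrors": ["https://mirror.internal.example"] }
|     EOF
|     sudo systemctl restart docker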
| jacobwg wrote:
| Today we (Depot) are not, though some of our customers
| configure this. For the moment at least, the ephemeral public
| IP architecture makes it generally unnecessary from a rate-
| limit perspective.
|
| From a performance / efficiency perspective, we generally
| recommend using ECR Public images[0], since AWS hosts mirrors
| of all the "Docker official" images, and throughput to ECR
| Public is great from inside AWS.
|
| [0] https://gallery.ecr.aws/
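|
| e.g. the Docker official images live under the docker/library
| namespace there:
|
|     docker pull public.ecr.aws/docker/library/ubuntu:24.04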
| glenjamin wrote:
| If you're running inside AWS us-east-1 then docker hub will
| give you direct S3 URLs for layer downloads (or it used to
| anyway)
|
| Any pulls doing this become zero cost for docker hub
|
| Any sort of cache you put between docker hub and your own
| infra would probably be S3 backed anyway, so adding another
| cache in between could be mostly a waste
| jacobwg wrote:
| Yeah we do some similar tricks with our registry[0]:
| pushes and pulls from inside AWS are served directly from
| AWS for maximum performance and no data transfer cost.
| Then when the client is outside AWS, we redirect all that
| to Tigris[1], also for maximum performance (CDN) and
| minimum data transfer cost (no cost from Tigris, just the
| cost to move content out of AWS once).
|
| [0]: https://depot.dev/blog/introducing-depot-registry
|
| [1]: https://www.tigrisdata.com/blog/depot-registry/
| philsnow wrote:
| > Configured a block-level in-memory disk accelerator / cache
| (fs operations at the speed of RAM!)
|
| I'm slightly old; is that the same thing as a ramdisk?
| https://en.wikipedia.org/wiki/RAM_drive
| jacobwg wrote:
| Exactly, a ramdisk-backed writeback cache for the root volume
| for Linux. For macOS we wrote a custom nbd filter to achieve
| the same thing.
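|
| For anyone wanting to experiment with something in that spirit
| using stock kernel pieces (not necessarily what we ship),
| dm-writecache over a brd ramdisk gets you close; device names
| and sizes are illustrative:
|
|     modprobe brd rd_nr=1 rd_size=2097152    # 2GiB /dev/ram0
|     SECTORS=$(blockdev --getsz /dev/nvme1n1)
|     dmsetup create fastdisk --table \
|       "0 $SECTORS writecache s /dev/nvme1n1 /dev/ram0 4096 0"
|     # then mount /dev/mapper/fastdisk instead of the raw device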
| philsnow wrote:
| Forgive me, I'm not trying to be argumentative, but doesn't
| Linux (and presumably all modern OSes) already have a ram-
| backed writeback cache for filesystems? That sounds exactly
| like the page cache.
| trillic wrote:
| If you clearly understand your access patterns and memory
| requirements, you can often outperform the default OS
| page cache.
|
| Consider a scenario where your VM has 4GB of RAM, but
| your build accesses a total of 6GB worth of files.
| Suppose your code interacts with 16GB of data, yet at any
| moment, its active working set is only around 2GB. If you
| preload all Docker images at the start of your build,
| they'll initially be cached in RAM. However, as your
| build progresses, the kernel will begin evicting these
| cached images to accommodate recently accessed data,
| potentially even files used infrequently or just once.
| And that's the key bit: you want to force caching of files
| you know are accessed more than once.
|
| By implementing your own caching layer, you gain explicit
| control, allowing critical data to remain persistently
| cached in memory. In contrast, the kernel-managed page
| cache treats cached pages as opportunistic, evicting the
| least recently used pages whenever new data must be
| accommodated, even if this new data isn't frequently
| accessed.
| philsnow wrote:
| > If you clearly understand your access patterns and
| memory requirements, you can often outperform the default
| OS page cache.
|
| I believe various RDBMSs bypass the page cache and use
| their own strategies for managing caching if you give
| them access to raw block devices, right?
| jacobwg wrote:
| No worries, entirely valid question. There may be ways to
| tune page cache to be more like this, but my mental model
| for what we've done is effectively make reads and writes
| transparently redirect to the equivalent of a tmpfs, up
| to a certain size. If you reserve 2GB of memory for the
| cache, and the CI job's read and written files are less
| than 2GB, then _everything_ stays in RAM, at RAM
| throughput/IOPS. When you exceed the limit of the cache,
| blocks are moved to the physical disk in the background.
| Feels like we have more direct control here than page
| cache (and the page cache is still helping out in this
| scenario too, so it's more that we're using both).
| philsnow wrote:
| > reads and writes transparently redirect to the
| equivalent of a tmpfs, _up to a certain size_
|
| The last bit (emphasis added) sounds novel to me, I don't
| think I've heard before of anybody doing that. It sounds
| like an almost-"free" way to get a ton of performance
| ("almost" because somebody has to figure out the sizing.
| Though, I bet you could automate that by having your tool
| export a "desired size" metric that's equal to the high
| watermark of tmpfs-like storage used during the CI run)
| nine_k wrote:
| No, it's more like swapping pages to disk when RAM is
| full, or like using RAM when the L2 cache is full.
|
| Linux page cache exists to speed up access to the durable
| store which is the underlying block device (NVMe, SSD,
| HDD, etc).
|
| The RAM-backed block device in question here is more like
| tmpfs, but with an ability to use the disk if, and only
| if, it overflows. There's no intention or need to store
| its whole contents on the durable "disk" device.
|
| Hence you can do things entirely in RAM as long as your
| CI/CD job can fit all the data there, but if it can't
| fit, the job just gets slower instead of failing.
| jiocrag wrote:
| have you tried Buildkite? https://buildkite.com
| larusso wrote:
| So I had to read to the end to realize it's kind of an
| infomercial. Ok, fair enough. Didn't know what Depot was,
| though.
| crmd wrote:
| This is exactly the kind of content marketing I want to see. The
| IO bottleneck data and the fio scripts are useful to all. Then at
| the end a link to their product which I'd never heard of, in case
| you're dealing with the issue at hand.
| kylegalbraith wrote:
| Thank you for the kind words. We're always trying to share our
| knowledge even if Depot isn't a good fit for everyone. I hope
| the scripts get some mileage!
| nodesocket wrote:
| I just migrated multiple ARM64 GitHub Actions Docker builds from
| my self-hosted runner (Raspberry Pi in my homelab) to
| Blacksmith.io and I'm really impressed with the performance so
| far. Only downside is no Docker layer and image cache like I had
| on my self-hosted runner, but I can't complain on the free tier.
| adityamaru wrote:
| Have you checked out https://docs.blacksmith.sh/docker-
| builds/incremental-docker-...? This should help set up a shared,
| persistent Docker layer cache for your runners.
| nodesocket wrote:
| Thanks for sharing. I have a custom bash script which does
| the docker builds currently and swapping to
| useblacksmith/build-push-action would take a bit of
| refactoring which I don't want to spend the time on now. :-)
| kayson wrote:
| Bummer there's no free tier. I've been bashing my head against an
| intermittent CI failure problem on GitHub runners for probably a
| couple of years now. I think it's related to the networking stack
| in their runner image and the fact that I'm using Docker-in-
| Docker to unit test a Docker firewall. While I do appreciate that
| someone at GitHub did actually look at my issue, they totally
| missed the point:
| https://github.com/actions/runner-images/issues/11786
|
| Are there any reasonable alternatives for a really tiny FOSS
| project?
| crohr wrote:
| I'm maintaining a benchmark of various GitHub Actions providers'
| I/O speed [1]. Depot is not present because my account was
| blocked, but I would love to compare! The disk accelerator looks
| like a nice feature.
|
| [1]: https://runs-on.com/benchmarks/github-actions-disk-
| performan...
___________________________________________________________________
(page generated 2025-03-28 23:01 UTC)