[HN Gopher] Reverse engineering GitHub Actions cache to make it ...
___________________________________________________________________
Reverse engineering GitHub Actions cache to make it fast
Author : tsaifu
Score : 133 points
Date : 2025-07-23 13:17 UTC (9 hours ago)
(HTM) web link (www.blacksmith.sh)
(TXT) w3m dump (www.blacksmith.sh)
| movedx01 wrote:
| Anything for artifacts perhaps? ;) We use external runners (not
| Blacksmith) and had to work around this manually.
| https://github.com/actions/download-artifact/issues/362#issu...
| aayushshah15 wrote:
| [cofounder of blacksmith here]
|
| This is on our radar! The primitives mentioned in this blog
| post are fairly general and allow us to support various types
| of artifact storage and caching protocols.
| AOE9 wrote:
| Small blacksmith.sh user here. Any plans to reduce billing
| granularity from per minute to something smaller, like per
| second?
| tsaifu wrote:
| Hi! No plans for finer resolution than per-minute billing
| currently.
| AOE9 wrote:
| Shame. I'm not sure how others use Actions, but I like
| really small, granular checks that make it easy to see what's
| wrong at a glance, e.g. formatting checks per language
| per project in my monorepo. Each check takes about 10
| seconds and I have 70+, so per-minute billing is biting
| me ATM.
| tagraves wrote:
| Come check out RWX :). We have per-second billing, but I
| think it won't even matter for your use case because most
| of those checks are going to take 0s on RWX. And our UI
| is optimized for showing you what is wrong at a glance
| without having to look at logs at all.
| AOE9 wrote:
| Sorry, not for me:
|
| * Your per-minute billing is double Blacksmith's.
| * RWX is a proprietary format? vs Blacksmith's one-line
| change.
| * No fallback option: if Blacksmith goes down, I can revert
| back to GitHub temporarily.
| pbardea wrote:
| Also might be worth checking out our sticky disks for your use
| case: https://github.com/useblacksmith/stickydisk. They can be a
| good option for persisting artifacts across jobs, especially
| when the artifacts are large.
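|
| Rough shape of the usage (a sketch; the input names and version
| tag here are assumed from memory, so check the repo's README for
| the actual interface):
|
|     - uses: useblacksmith/stickydisk@v1   # tag assumed
|       with:
|         key: ${{ github.repository }}-artifacts
|         path: ./artifacts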
| kylegalbraith wrote:
| Fun to see folks replicating what we've done with Depot for
| GitHub Actions [0]. Going as far as using a similar title :)
|
| Forking the ecosystem of actions to plug in your cache backend
| isn't a good long-term solution.
|
| [0] https://depot.dev/blog/github-actions-cache
| junon wrote:
| > iptables was already doing heavy lifting for other subsystems
| inside our environment, and with each VM adding or removing its
| own set of rules, things got messy fast, and extremely flakey
|
| We saw the same thing at Vercel. Back when we were still doing
| docker-as-a-service we used k8s for both internal services as
| well as user deployments. The latter led to master deadlocks and
| all sorts of SRE nightmares (literally).
|
| So I was tasked with writing a service scheduler from scratch
| to replace k8s. By the time we got to the manhandling of IP
| address allocations, deep into the rabbit hole, we had already
| written our own redis-backed DHCP implementation and needed to
| insert those IPs into the firewall tables ourselves, since
| Docker couldn't really do much of anything concurrently.
|
| Iptables was VERY fragile. Aside from not even having a stable
| programmatic interface, it was a race condition nightmare: rules
| were strictly ordered, there was no composition or destruction-
| free layering (namespacing, etc.), and it was just all around
| the worst tool for the job.
|
| Unfortunately not much else existed at the time, and given that
| we didn't have time to spend on implementing our own kernel
| modules for this system, and that Docker itself had a slew of
| ridiculous behavior, we ended up scratching the project.
|
| Learned a lot though! We were almost done, until we weren't :)
| tsaifu wrote:
| yeah, our findings were similar. the issues we saw with
| iptables rules, especially at scale with ephemeral workloads,
| were starting to cause us a lot of operational toil. nftables
| ftw
| tekla wrote:
| I've had this problem.
|
| We ended up using Docker Swarm. Painless afterward
| immibis wrote:
| I think iptables compiles BPF filters; you could write your own
| thing to compile BPF filters. In general, the whole Linux
| userspace interface (with few exceptions) is considered stable;
| if you go below any given userspace tool, you're likely to find
| a more stable, but less well documented, kernel interface.
| Since it's all OSS, you can even use iptables itself as a
| starting point to build your own thing.
| formerly_proven wrote:
| Nowadays you would use nftables, which like most new-ish
| kernel infra uses netlink as an API, and supports at least
| atomic updates of multiple rules. That's not to say there's
| documentation for that; there isn't.
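|
| For illustration, the usual atomic-reload idiom is a ruleset
| file loaded with `nft -f ruleset.nft` (a minimal sketch; the
| table name "vmrules" is made up):
|
|     # declare-then-flush so the load succeeds whether or not
|     # the table already exists; nft applies the whole file as
|     # one atomic transaction
|     table inet vmrules
|     flush table inet vmrules
|     table inet vmrules {
|       chain fwd {
|         type filter hook forward priority 0; policy accept;
|         ip saddr 10.0.5.12 accept
|       }
|     }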
| cameronh90 wrote:
| I spent a decade and a bit away from Linux programming and
| have recently come back to it, and I'm absolutely blown away
| by how poor the documentation has become.
|
| Back in the day, one of the best things about Linux was
| actually how _good_ the docs were. Comprehensive man pages,
| stable POSIX standards, projects and APIs that have been
| used since 1970 so every little quirk has been documented
| inside out.
|
| Now it seems like the entire OS has been rewritten by
| freedesktop, and if I'm lucky I might find some two-year-old
| information on the ArchLinux wiki. If I'm even
| luckier, that behaviour won't have been completely broken
| by a commit from @poettering in a minor point release.
|
| I actually think a lot of the new stuff is really fantastic
| once I reverse engineer it enough to understand what it's
| doing. I will defend to the death that systemd is, in
| principle, a lot better than the ad hoc mountain of distro-
| specific shell scripts it replaces. PulseAudio does a lot
| of important things that weren't possible before, etc. But
| honestly it feels like nobody wants to write any docs because
| everything is changing too frequently, and then everything
| just constantly breaks, because it turns out that changing
| complex systems rapidly without any documentation leads to
| weird bugs that nobody understands.
| tagraves wrote:
| It's pretty amazing to see what Blacksmith, Depot, Actuated, etc.
| have been able to build on top of GitHub Actions. At RWX we got a
| bit tired of constantly trying to work around the limitations of
| the platform with self-hosted runners, so we just built an
| entirely new CI platform on a brand new execution model with
| support for things like lightning-fast caching out of the box.
| Plus, there are some fundamental limitations that are impossible
| to work around, like the retry behavior [0]. Still, I have a huge
| appreciation for the patience of the Blacksmith team to actually
| dig in and improve what they can with GHA.
|
| [0] https://www.rwx.com/blog/retry-failures-while-run-in-
| progres...
| sameermanek wrote:
| Is it similar to this article posted a year ago:
|
| https://depot.dev/blog/github-actions-cache
| bob1029 wrote:
| I struggle to justify CI/CD pipelines so complex that this
| kind of additional tooling becomes necessary.
|
| There are ways to refactor your technology so that you don't
| have to suffer so much at integration and deployment time. For
| example, using containers and hosted SQL where neither is
| required can instantly increase the complexity of deploying
| your software by 10x or more.
|
| The last few B2B/SaaS projects I worked on had CI/CD built into
| the actual product. Writing a simple console app that polls SCM
| for commits, runs dotnet build and then performs a filesystem
| operation is approximately all we've ever needed. The only
| additional enhancement was zipping the artifacts to an S3 bucket
| so that we could email the link out to the customer's IT team for
| install in their secure on-prem instances.
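|
| To be concrete, a hand-wavy shell rendering of that loop
| ($REPO and the artifact path are made up):
|
|     #!/usr/bin/env bash
|     # poll SCM, build on new commits, stash the artifacts
|     while sleep 60; do
|       git -C "$REPO" fetch origin main
|       head=$(git -C "$REPO" rev-parse HEAD)
|       upstream=$(git -C "$REPO" rev-parse origin/main)
|       if [ "$head" != "$upstream" ]; then
|         git -C "$REPO" merge --ff-only origin/main
|         dotnet build "$REPO" -c Release &&
|           zip -r "/srv/artifacts/build-$upstream.zip" \
|             "$REPO/bin"
|       fi
|     done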
|
| I would propose a canary: if your proposed CI/CD process is so
| complicated that you couldn't write a script by hand to
| replicate it in an afternoon or two, you should seriously
| question bringing the rest of the team into that coal mine.
| norir wrote:
| Here is my cynical take on ci. Firstly, testing is almost never
| valued by management, which would rather close a deal on half-
| finished promises than actually build a polished, reliable
| product (they can always scapegoat the eng team if things go
| wrong with the customer anyway).
|
| So, to begin with, testing is rarely prioritized. But most
| developer orgs eventually realize that centralized testing is
| necessary or else everyone is stuck in permanent "works on my
| machine!" mode. When deciding to switch to automated ci, eng
| management is left with the build vs buy decision. Buy is very
| attractive for something that is not seriously valued anyway
| and that is often given away for free. There is also industry
| consensus pressure, which has converged on github (even though
| github is objectively bad on almost every metric besides
| popularity -- to be fair the other larger players are also
| generally bad in similar ways). This is when the lock-in
| begins. What starts as a simple build file expands outward.
| Well-intentioned developers will want to do things
| idiomatically for the ci tool and will start putting logic in
| the ci tool's dsl. The more they do this, the more invested
| they become and the more costly switching becomes. The CI
| vendor is rarely incentivized to make things truly better once
| you are captive. Indeed, that would threaten their business
| model where they typically are going to sell you one of two
| things or both: support or cpu time. Given that business model,
| it is clear that they are incentivized to make their system as
| inefficient and difficult to use (particularly at scale) as
| possible while still retaining just enough customers to remain
| profitable.
|
| The industry has convinced many people that it is too
| costly/inefficient to build your own test infrastructure, even
| while burning countless man-hours and cpu-hours on the awful
| solutions presented by industry.
|
| Companies like blacksmith are smart to address the clear
| shortcomings in the market though personally I find life too
| short to spend on github actions in any capacity.
| bob1029 wrote:
| > they typically are going to sell you one of two things or
| both: support or cpu time
|
| At what point does the line between CPU time in GH Actions
| and CPU time in the actual production environment lose all
| meaning? Why even bother moving to production? You could just
| create a new GH action called "Production" that gets invoked
| at the end of the pipeline and runs perpetually.
|
| I think I may have identified a better canary here. If the
| CI/CD process takes so much CPU time that we are consciously
| aware of the resulting bill, there is _definitely_ something
| going wrong.
| AOE9 wrote:
| > I think I may have identified a better canary here. If
| the CI/CD process takes so much CPU time that we are
| consciously aware of the resulting bill, there is
| definitely something going wrong.
|
| CPU time is cheaper than an engineer's time; you should be
| offloading formatting/linting/testing checks to CI on PRs. It
| does add up, though, when multiplied by hundreds or thousands
| of runs, so it isn't a good canary.
| AOE9 wrote:
| > The last few B2B/SaaS projects I worked on had CI/CD built
| into the actual product. Writing a simple console app that
| polls SCM for commits, runs dotnet build and then performs a
| filesystem operation is approximately all we've ever needed.
| The only additional enhancement was zipping the artifacts to an
| S3 bucket so that we could email the link out to the customer's
| IT team for install in their secure on-prem instances.
|
| That sounds like the biggest yikes.
| jchw wrote:
| Oh, this is pretty interesting. Also worth noting: the Azure
| Blob Storage version of GitHub Actions Cache is actually a sort
| of V2, although internally it is just a brand-new service with
| the internal version of V1. The old
| service was a REST-ish service that abstracted the storage
| backend, and it is still used by GitHub Enterprise. The new
| service is a TWIRP-based system where you directly store things
| into Azure using signed URLs from the TWIRP side. I reverse
| engineered this to implement support for the new cache API in
| Determinate Systems' Magic Nix Cache, which abruptly stopped
| working earlier this year when GitHub disabled the old API on
| GitHub.com. One thing that's annoying is GitHub seems to continue
| to tacitly allow people to use the cache internals but stops
| short of providing useful things like the protobuf files used to
| generate the TWIRP clients. I wound up reverse engineering them
| from the actions/cache action's gencode, tweaking the
| reconstructed protobuf files until I was able to get a byte-for-
| byte match.
|
| On the flip side, I did something that might break Blacksmith: I
| used append blobs instead of block blobs. Why? ... Because it was
| simpler. For block blobs you have to construct this silly XML
| payload with the block list or whatever. With append blobs you
| can just keep appending chunks of data and then seal it when
| you're done. I have always wondered whether being responsible
| for some of GitHub Actions Cache using append blobs would ever
| come back to bite me, but as far as I can tell, from the Azure
| PoV it makes very little difference; pricing seems the same at
| least. But either way, they probably need to support append
| blobs now. Sorry :)
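|
| For reference, the append-blob dance over the raw REST API is
| roughly the following (a sketch with curl; $SAS_URL stands in
| for the signed URL handed out by the TWIRP side):
|
|     # create an empty append blob at the signed URL
|     curl -X PUT -H "x-ms-blob-type: AppendBlob" \
|          -H "Content-Length: 0" "$SAS_URL"
|     # append chunks in order; no block-list XML needed
|     curl -X PUT --data-binary @chunk.bin \
|          "$SAS_URL&comp=appendblock"
|     # seal the blob so nothing more can be appended
|     curl -X PUT -H "x-ms-version: 2020-04-08" \
|          -H "Content-Length: 0" "$SAS_URL&comp=seal"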
|
| (If you are wondering why not use the Rust Azure SDK, as far as I
| can tell the official Rust Azure SDK does not support using
| signed URLs for uploading. And frankly, it would've brought a lot
| of dependencies and been rather complex to integrate for other
| Rust reasons.)
|
| (It would also be possible, by setting env variables a certain
| way, to get virtually all workflows to behave as if they're
| running under GitHub Enterprise, and get the old REST API.
| However, Azure SDK with its concurrency features probably yields
| better performance.)
| esafak wrote:
| Caching gets trickier when tasks spin up containers.
| crohr wrote:
| At this point this is considered a baseline feature of every good
| GitHub Actions third-party provider, but nice to see the write-up
| and solution they came up with!
|
| Note that GitHub Actions Cache v2 is actually very good in terms
| of download/upload speed right now, when running from GitHub
| managed runners. The low speed Blacksmith was seeing before is
| just due to their slow (Hetzner?) network.
|
| I benchmarked most providers (I maintain RunsOn) with regard to
| their cache performance here: https://runs-
| on.com/benchmarks/github-actions-cache-performa...
| crohr wrote:
| Also note this open-source project that shows a way to
| implement this: https://github.com/falcondev-oss/github-
| actions-cache-server
| EdJiang wrote:
| Related read: Cirrus Labs also wrote a drop-in replacement for GH
| Actions cache on their platform.
|
| https://cirrus-runners.app/blog/2024/04/23/speeding-up-cachi...
|
| https://github.com/cirruslabs/cache
| crohr wrote:
| It is not transparent though, so it doesn't work with all the
| other actions that use the cache toolkit, and you have to
| reference a specific action.
| zamalek wrote:
| I'm currently migrating some stuff from azdo to GHA, and have
| been putting past lessons to serious use:
|
| * Perf: don't use "install X" (Node, .Net, Ruby, Python, etc.)
| tasks. Create a container image with all your deps and use that
| instead.
|
| * Perf: related to the last, keep multiple utility container
| images around of varying degrees of complexity. For example, in
| our case, I decided on PowerShell because we have some devs with
| Windows and it's the easiest to get working across Linux+Windows
| - so my simplest container has pwsh and some really basic tools
| (git, curl, etc.). I build another container on that which has
| .Net deps, and each .Net repo uses that.
|
| * Perf: don't use the cache action at all. Run a job nightly
| that pulls your code down into a container, restores/installs
| to warm the cache, then deletes the code. `RUN --mount` is a
| good way to avoid creating a layer with your code in it (see
| the sketch after this list).
|
| * Maintainability: don't write big scripts in your workflow file.
| Create scripts as files that can also be executed on your local
| machine. Keep the "glue code" between GHA and your script in the
| workflow file. I lie slightly here: I do source a single
| utility script that reads in GHA envvars and has functions to
| set CI variables and so forth (and that does sensible things
| when run locally).
|
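| The warm-cache trick from the third bullet looks something like
| this as a Dockerfile (a sketch; the base image and solution
| names are made up, and the bind mount keeps the checkout out of
| the image layers):
|
|     # syntax=docker/dockerfile:1
|     FROM my-dotnet-base   # hypothetical utility image
|     # bind-mount the checkout: only the restored NuGet cache
|     # is baked into the image, never the source itself
|     RUN --mount=type=bind,target=/src \
|         dotnet restore /src/MyApp.sln \
|           --packages /root/.nuget/packages
|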
| Our CI builds are _stupid_ fast. Comparatively speaking.
|
| For the OP (I just sent your pricing page to my manager ;) ):
| having a colocated container registry for these types of things
| would be super useful. I would say you don't need to expose it to
| the internet, but sometimes you do need to be able to `podman
| run` into an image for debug purposes.
| nodesocket wrote:
| I'm currently using blacksmith for my arm64 Docker builds.
| Unfortunately my workflow currently requires invoking a custom
| bash script which executes the Docker commands. Does this mean I
| can now utilize Docker image caching without needing to migrate
| to useblacksmith/build-push-action?
| aayushshah15 wrote:
| Yes! This is covered in our docs:
| https://docs.blacksmith.sh/blacksmith-caching/docker-
| builds#...; the TL;DR is that you can use the `build-push-
| action` with `setup-only: true`.
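|
| Something like this, in workflow terms (a sketch; the version
| tag is assumed, and build.sh stands in for your custom script):
|
|     - uses: useblacksmith/build-push-action@v1  # tag assumed
|       with:
|         setup-only: true
|     # later steps can run plain docker commands via your own
|     # script and still hit the Blacksmith layer cache
|     - run: ./build.sh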
___________________________________________________________________
(page generated 2025-07-23 23:01 UTC)