[HN Gopher] Reverse engineering GitHub Actions cache to make it ...
       ___________________________________________________________________
        
       Reverse engineering GitHub Actions cache to make it fast
        
       Author : tsaifu
       Score  : 133 points
       Date   : 2025-07-23 13:17 UTC (9 hours ago)
        
 (HTM) web link (www.blacksmith.sh)
 (TXT) w3m dump (www.blacksmith.sh)
        
       | movedx01 wrote:
        | Anything for artifacts perhaps? ;) We use external runners (not
        | Blacksmith) and had to work around this manually.
       | https://github.com/actions/download-artifact/issues/362#issu...
        
         | aayushshah15 wrote:
         | [cofounder of blacksmith here]
         | 
         | This is on our radar! The primitives mentioned in this blog
         | post are fairly general and allow us to support various types
         | of artifact storage and caching protocols.
        
           | AOE9 wrote:
            | Small blacksmith.sh user here. Do you have any plans to reduce
            | billing from per-minute to something finer-grained, like
            | per-second?
        
             | tsaifu wrote:
              | Hi! No plans for finer resolution than per-minute currently.
        
               | AOE9 wrote:
                | Shame. I am not sure how others use Actions, but I like
                | really small, granular jobs to make it easy to see what's
                | wrong at a glance, e.g. formatting checks per language per
                | project in my monorepo. Each check is like 10 seconds and
                | I have 70+, so the per-minute billing is biting me at the
                | moment.
        
               | tagraves wrote:
                | Come check out RWX :). We have per-second billing, but I
                | think it won't even matter for your use case because most
                | of those checks are going to take 0s on RWX. And our UI
                | is optimized for showing you what is wrong at a glance
                | without having to look at logs at all.
        
               | AOE9 wrote:
                | Sorry, not for me:
                | 
                | * Your per-minute billing is double Blacksmith's.
                | 
                | * RWX is a proprietary format(?), vs Blacksmith's one-line
                | change.
                | 
                | * No fallback option: if Blacksmith goes down, I can
                | revert back to GitHub temporarily.
        
         | pbardea wrote:
          | Also, it might be worth checking out our sticky disks for your
          | use case: https://github.com/useblacksmith/stickydisk. It can be
          | a good option for persisting artifacts across jobs, especially
          | when they're large.
        
       | kylegalbraith wrote:
       | Fun to see folks replicating what we've done with Depot for
       | GitHub Actions [0]. Going as far as using a similar title :)
       | 
        | Forking the ecosystem of actions to plug in your cache backend
        | isn't a good long-term solution.
       | 
       | [0] https://depot.dev/blog/github-actions-cache
        
       | junon wrote:
       | > iptables was already doing heavy lifting for other subsystems
       | inside our environment, and with each VM adding or removing its
       | own set of rules, things got messy fast, and extremely flakey
       | 
        | We saw the same thing at Vercel. Back when we were still doing
        | Docker-as-a-service, we used k8s for both internal services and
        | user deployments. The latter led to master deadlocks and all
        | sorts of SRE nightmares (literally).
       | 
        | So I was tasked with writing a service scheduler from scratch to
        | replace k8s. By the time it got to manhandling IP address
        | allocations, deep into the rabbit hole, we had already written
        | our own Redis-backed DHCP implementation and needed to insert
        | those IPs into the firewall tables ourselves, since Docker
        | couldn't really do much at all concurrently.
       | 
        | iptables was VERY fragile. Aside from the fact that it didn't
        | even have a stable programmatic interface, it was also a race-
        | condition nightmare: rules were strictly ordered, and there was
        | no composition or non-destructive layering (namespacing, etc.).
        | It was just all around the worst tool for the job.
       | 
        | Unfortunately, not much else existed at the time, and given that
        | we didn't have time to spend on implementing our own kernel
        | modules for this system, and that Docker itself had a slew of
        | ridiculous behaviors, we ended up scrapping the project.
       | 
       | Learned a lot though! We were almost done, until we weren't :)
        
         | tsaifu wrote:
          | yeah, our findings were similar. The issues we saw with
          | iptables rules, especially at scale with ephemeral workloads,
          | were starting to cause us a lot of operational toil. nftables
          | ftw
        
         | tekla wrote:
         | I've had this problem.
         | 
         | We ended up using Docker Swarm. Painless afterward
        
         | immibis wrote:
         | I think iptables compiles BPF filters; you could write your own
         | thing to compile BPF filters. In general, the whole Linux
         | userspace interface (with few exceptions) is considered stable;
         | if you go below any given userspace tool, you're likely to find
         | a more stable, but less well documented, kernel interface.
         | Since it's all OSS, you can even use iptables itself as a
         | starting point to build your own thing.
        
           | formerly_proven wrote:
           | Nowadays you would use nftables, which like most new-ish
           | kernel infra uses netlink as an API, and supports at least
           | atomic updates of multiple rules. That's not to say there's
           | documentation for that; there isn't.
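            | 
            | For illustration, a minimal sketch of that atomicity
            | (assuming the nft CLI is installed and this runs as root;
            | the table and rules are made up): the whole ruleset is piped
            | to `nft -f -` and applied as a single transaction, so
            | concurrent readers never observe a half-applied rule set:
            | 
            |     # atomic_nft.py -- illustrative sketch only
            |     import subprocess
            | 
            |     RULESET = """
            |     table inet runner_fw {}
            |     flush table inet runner_fw
            |     table inet runner_fw {
            |       chain input {
            |         type filter hook input priority 0; policy drop;
            |         ct state established,related accept
            |         iifname "lo" accept
            |         tcp dport 22 accept
            |       }
            |     }
            |     """
            | 
            |     # `nft -f -` reads the ruleset from stdin and commits it
            |     # in one transaction: every rule lands, or none do.
            |     subprocess.run(["nft", "-f", "-"], input=RULESET,
            |                    text=True, check=True)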
        
             | cameronh90 wrote:
             | I spent a decade and a bit away from Linux programming and
             | have recently come back to it, and I'm absolutely blown
             | away at how poor the documentation has become.
             | 
             | Back in the day, one of the best things about Linux was
             | actually how _good_ the docs were. Comprehensive man pages,
             | stable POSIX standards, projects and APIs that have been
             | used since 1970 so every little quirk has been documented
             | inside out.
             | 
              | Now it seems like the entire OS has been rewritten by
              | freedesktop, and if I'm lucky I might find some two-year-
              | out-of-date information on the Arch Linux wiki. If I'm even
              | luckier, that behaviour won't have been completely broken
              | by a commit from @poettering in a minor point release.
             | 
              | I actually think a lot of the new stuff is really fantastic
              | once I reverse engineer it enough to understand what it's
              | doing. I will defend to the death that systemd is, in
              | principle, a lot better than the ad-hoc mountain of distro-
              | specific shell scripts it replaces. PulseAudio does a lot
              | of important things that weren't possible before, etc. But
              | honestly, it feels like nobody wants to write any docs
              | because everything is changing too frequently, and then
              | everything just constantly breaks, because it turns out
              | that rapidly changing complex systems without any
              | documentation leads to weird bugs that nobody understands.
        
       | tagraves wrote:
       | It's pretty amazing to see what Blacksmith, Depot, Actuated, etc.
       | have been able to build on top of GitHub Actions. At RWX we got a
       | bit tired of constantly trying to work around the limitations of
       | the platform with self-hosted runners, so we just built an
       | entirely new CI platform on a brand new execution model with
       | support for things like lightning-fast caching out of the box.
       | Plus, there are some fundamental limitations that are impossible
       | to work around, like the retry behavior [0]. Still, I have a huge
       | appreciation for the patience of the Blacksmith team to actually
       | dig in and improve what they can with GHA.
       | 
       | [0] https://www.rwx.com/blog/retry-failures-while-run-in-
       | progres...
        
       | sameermanek wrote:
       | Is it similar to this article posted a year ago:
       | 
       | https://depot.dev/blog/github-actions-cache
        
       | bob1029 wrote:
        | I am struggling to see the justification for CI/CD pipelines so
        | complex that this kind of additional tooling becomes necessary.
        | 
        | There are ways to refactor your technology so that you don't
        | have to suffer so much at integration and deployment time. For
        | example, using containers and hosted SQL where neither is
        | required can instantly increase the complexity of deploying your
        | software by 10x or more.
       | 
       | The last few B2B/SaaS projects I worked on had CI/CD built into
       | the actual product. Writing a simple console app that polls SCM
       | for commits, runs dotnet build and then performs a filesystem
       | operation is approximately all we've ever needed. The only
       | additional enhancement was zipping the artifacts to an S3 bucket
       | so that we could email the link out to the customer's IT team for
       | install in their secure on-prem instances.
       | 
        | I would propose a canary: if your proposed CI/CD process is so
        | complicated that you couldn't write a script by hand to
        | replicate it in an afternoon or two, you should seriously
        | question bringing the rest of the team into that coal mine.
        
         | norir wrote:
          | Here is my cynical take on CI. Firstly, testing is almost
          | never valued by management, which would rather close a deal on
          | half-finished promises than actually build a polished,
          | reliable product (they can always scapegoat the eng team if
          | things go wrong with the customer anyway).
         | 
         | So, to begin with, testing is rarely prioritized. But most
         | developer orgs eventually realize that centralized testing is
         | necessary or else everyone is stuck in permanent "works on my
         | machine!" mode. When deciding to switch to automated ci, eng
         | management is left with the build vs buy decision. Buy is very
         | attractive for something that is not seriously valued anyway
         | and that is often given away for free. There is also industry
          | consensus pressure, which has converged on GitHub (even though
          | GitHub is objectively bad on almost every metric besides
          | popularity -- to be fair, the other large players are also
          | generally bad in similar ways). This is when the lock-in
          | begins. What begins as a simple build file starts expanding
          | outward. Well-intentioned developers will want to do things
          | idiomatically for the CI tool and will start putting logic in
          | the CI tool's DSL. The more they do this, the more invested
         | they become and the more costly switching becomes. The CI
         | vendor is rarely incentivized to make things truly better once
         | you are captive. Indeed, that would threaten their business
         | model where they typically are going to sell you one of two
         | things or both: support or cpu time. Given that business model,
         | it is clear that they are incentivized to make their system as
         | inefficient and difficult to use (particularly at scale) as
         | possible while still retaining just enough customers to remain
         | profitable.
         | 
          | The industry has convinced many people that it is too
          | costly/inefficient to build your own test infrastructure, even
          | while they burn countless man-hours and CPU hours on the awful
          | solutions presented by that same industry.
         | 
         | Companies like blacksmith are smart to address the clear
         | shortcomings in the market though personally I find life too
         | short to spend on github actions in any capacity.
        
           | bob1029 wrote:
           | > they typically are going to sell you one of two things or
           | both: support or cpu time
           | 
           | At what point does the line between CPU time in GH Actions
           | and CPU time in the actual production environment lose all
           | meaning? Why even bother moving to production? You could just
           | create a new GH action called "Production" that gets invoked
           | at the end of the pipeline and runs perpetually.
           | 
           | I think I may have identified a better canary here. If the
           | CI/CD process takes so much CPU time that we are consciously
           | aware of the resulting bill, there is _definitely_ something
           | going wrong.
        
             | AOE9 wrote:
             | > I think I may have identified a better canary here. If
             | the CI/CD process takes so much CPU time that we are
             | consciously aware of the resulting bill, there is
             | definitely something going wrong.
             | 
              | CPU time is cheaper than an engineer's time; you should be
              | offloading formatting/linting/testing checks to CI on PRs.
              | This does add up, though, when multiplied by hundreds or
              | thousands, so it isn't a good canary.
        
         | AOE9 wrote:
         | > The last few B2B/SaaS projects I worked on had CI/CD built
         | into the actual product. Writing a simple console app that
         | polls SCM for commits, runs dotnet build and then performs a
         | filesystem operation is approximately all we've ever needed.
         | The only additional enhancement was zipping the artifacts to an
         | S3 bucket so that we could email the link out to the customer's
         | IT team for install in their secure on-prem instances.
         | 
         | That sounds like the biggest yikes.
        
       | jchw wrote:
       | Oh this is pretty interesting. One thing that's also interesting
       | to note is that the Azure Blob Storage version of GitHub Actions
       | Cache is actually a sort of V2, although internally it is just a
       | brand new service with the internal version of V1. The old
       | service was a REST-ish service that abstracted the storage
       | backend, and it is still used by GitHub Enterprise. The new
       | service is a TWIRP-based system where you directly store things
       | into Azure using signed URLs from the TWIRP side. I reverse
       | engineered this to implement support for the new cache API in
        | Determinate Systems' Magic Nix Cache, which abruptly stopped
       | working earlier this year when GitHub disabled the old API on
       | GitHub.com. One thing that's annoying is GitHub seems to continue
       | to tacitly allow people to use the cache internals but stops
       | short of providing useful things like the protobuf files used to
       | generate the TWIRP clients. I wound up reverse engineering them
       | from the actions/cache action's gencode, tweaking the
       | reconstructed protobuf files until I was able to get a byte-for-
       | byte match.
       | 
       | On the flip side, I did something that might break Blacksmith: I
       | used append blobs instead of block blobs. Why? ... Because it was
       | simpler. For block blobs you have to construct this silly XML
       | payload with the block list or whatever. With append blobs you
       | can just keep appending chunks of data and then seal it when
        | you're done. I have always wondered whether being responsible
        | for the fact that some of GitHub Actions Cache content ends up
        | in append blobs would ever come back to bite me, but as far as I
        | can tell it makes very little difference from the Azure PoV;
        | pricing seems the same, at least. Either way, they probably need
        | to support append blobs now. Sorry :)
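        | 
        | To make that flow concrete, here is a minimal sketch of the
        | append-blob dance (assumptions: Python with the requests
        | library, and SAS_URL standing in for a signed blob URL like the
        | ones the TWIRP service hands out); it is not the actual Magic
        | Nix Cache code:
        | 
        |     # append_blob_sketch.py -- illustrative only; SAS_URL is a
        |     # placeholder for a real signed (SAS) blob URL.
        |     import requests
        | 
        |     SAS_URL = "https://acct.blob.core.windows.net/c/key?sv=..."
        |     VERSION = {"x-ms-version": "2021-08-06"}
        | 
        |     # 1. Create an empty append blob.
        |     requests.put(SAS_URL, headers={
        |         "x-ms-blob-type": "AppendBlob", **VERSION,
        |     }).raise_for_status()
        | 
        |     # 2. Append chunks in order (each Append Block call takes
        |     #    at most 4 MiB).
        |     for chunk in (b"first chunk", b"second chunk"):
        |         requests.put(SAS_URL + "&comp=appendblock",
        |                      headers=VERSION,
        |                      data=chunk).raise_for_status()
        | 
        |     # 3. Seal the blob so nothing further can be appended.
        |     requests.put(SAS_URL + "&comp=seal",
        |                  headers=VERSION).raise_for_status()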
       | 
       | (If you are wondering why not use the Rust Azure SDK, as far as I
       | can tell the official Rust Azure SDK does not support using
       | signed URLs for uploading. And frankly, it would've brought a lot
       | of dependencies and been rather complex to integrate for other
       | Rust reasons.)
       | 
       | (It would also be possible, by setting env variables a certain
       | way, to get virtually all workflows to behave as if they're
       | running under GitHub Enterprise, and get the old REST API.
       | However, Azure SDK with its concurrency features probably yields
       | better performance.)
        
       | esafak wrote:
       | Caching gets trickier when tasks spin up containers.
        
       | crohr wrote:
       | At this point this is considered a baseline feature of every good
       | GitHub Actions third-party provider, but nice to see the write-up
       | and solution they came up with!
       | 
       | Note that GitHub Actions Cache v2 is actually very good in terms
       | of download/upload speed right now, when running from GitHub
       | managed runners. The low speed Blacksmith was seeing before is
       | just due to their slow (Hetzner?) network.
       | 
       | I benchmarked most providers (I maintain RunsOn) with regards to
       | their cache performance here: https://runs-
       | on.com/benchmarks/github-actions-cache-performa...
        
         | crohr wrote:
         | Also note this open-source project that shows a way to
         | implement this: https://github.com/falcondev-oss/github-
         | actions-cache-server
        
       | EdJiang wrote:
       | Related read: Cirrus Labs also wrote a drop-in replacement for GH
       | Actions cache on their platform.
       | 
       | https://cirrus-runners.app/blog/2024/04/23/speeding-up-cachi...
       | 
       | https://github.com/cirruslabs/cache
        
         | crohr wrote:
         | It is not transparent though, so it doesn't work with all the
         | other actions that use the cache toolkit, and you have to
         | reference a specific action.
        
       | zamalek wrote:
       | I'm currently migrating some stuff from azdo to GHA, and have
       | been putting past lessons to serious use:
       | 
       | * Perf: don't use "install X" (Node, .Net, Ruby, Python, etc.)
       | tasks. Create a container image with all your deps and use that
       | instead.
       | 
        | * Perf: related to the last, keep multiple utility container
        | images of varying degrees of complexity around. For example, in
        | our case I decided on PowerShell because we have some devs on
        | Windows and it's the easiest to get working across
        | Linux+Windows, so my simplest container has pwsh and some really
        | basic tools (git, curl, etc.). I build another container on that
        | which has the .NET deps. Then each .NET repo uses that to:
       | 
       | * Perf: don't use the cache action at all. Run a job nightly that
       | pulls down your code into a container, restore/install to warm
       | the cache, then delete the code. `RUN --mount` is a good way to
       | avoid creating a layer with your code in it.
       | 
       | * Maintainability: don't write big scripts in your workflow file.
       | Create scripts as files that can also be executed on your local
       | machine. Keep the "glue code" between GHA and your script in the
        | workflow file. I'm slightly lying here: I do source a single
        | utility script that reads in the GHA env vars and has functions
        | to set CI variables and so forth (and it does sensible things
        | when run locally).
       | 
       | Our CI builds are _stupid_ fast. Comparatively speaking.
       | 
       | For the OP (I just sent your pricing page to my manager ;) ):
       | having a colocated container registry for these types of things
       | would be super useful. I would say you don't need to expose it to
       | the internet, but sometimes you do need to be able to `podman
       | run` into an image for debug purposes.
       | 
       | [1]: https://docs.github.com/en/actions/how-tos/writing-
       | workflows...
        
       | nodesocket wrote:
       | I'm currently using blacksmith for my arm64 Docker builds.
        | Unfortunately, my workflow currently requires invoking a custom
        | bash script which executes the Docker commands. Does this mean I
        | can now utilize Docker image caching without needing to migrate
        | to useblacksmith/build-push-action?
        
         | aayushshah15 wrote:
          | Yes! This is documented in our docs:
          | https://docs.blacksmith.sh/blacksmith-caching/docker-
          | builds#... The TL;DR is that you can use the
          | `build-push-action` with `setup-only: true`.
        
       ___________________________________________________________________
       (page generated 2025-07-23 23:01 UTC)