[HN Gopher] Show HN: Managed GitHub Actions Runners for AWS
___________________________________________________________________
Show HN: Managed GitHub Actions Runners for AWS
Hey HN! I'm Jacob, one of the founders of Depot
(https://depot.dev), a build service for Docker images, and I'm
excited to show what we've been working on for the past few months:
run GitHub Actions jobs in AWS, orchestrated by Depot! Here's a
video demo: https://www.youtube.com/watch?v=VX5Z-k1mGc8, and here's
our blog post: https://depot.dev/blog/depot-github-actions-runners.
While GitHub Actions is one of the most prevalent CI providers,
Actions is slow, for a few reasons: GitHub uses underpowered CPUs,
network throughput for cache and the internet at large is capped at
1 Gbps, and total cache storage is limited to 10 GB per repo. It is
also rather expensive for runners with more than 2 CPUs, and larger
runners frequently take a long time to start running jobs.

Depot-managed runners solve this! Rather than your CI jobs running
on GitHub's slow compute, Depot routes those same jobs to fast EC2
instances. And not only is this faster, it's also 1/2 the cost of
GitHub Actions! We do this by launching a dedicated instance for
each job, registering that instance as a self-hosted Actions runner
in your GitHub organization, then terminating the instance when the
job is finished.
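
In rough terms, the per-job flow looks something like the sketch
below. To be clear, this is an illustrative Python sketch rather
than our actual orchestrator: it only strings together the standard
boto3 EC2 calls and GitHub's self-hosted runner registration
endpoint, and the AMI ID, instance type, and function names are
placeholders.

    # Illustrative sketch: one dedicated EC2 instance per job, registered
    # as an ephemeral self-hosted runner, terminated when the job is done.
    import boto3
    import requests

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def runner_registration_token(org: str, github_token: str) -> str:
        # GitHub's standard endpoint for self-hosted runner registration.
        resp = requests.post(
            f"https://api.github.com/orgs/{org}/actions/runners/registration-token",
            headers={
                "Authorization": f"Bearer {github_token}",
                "Accept": "application/vnd.github+json",
            },
        )
        resp.raise_for_status()
        return resp.json()["token"]

    def launch_runner(org: str, github_token: str, ami_id: str) -> str:
        token = runner_registration_token(org, github_token)
        # --ephemeral makes the runner take exactly one job and then exit.
        user_data = "\n".join([
            "#!/bin/bash",
            "cd /opt/actions-runner",
            f"./config.sh --url https://github.com/{org} "
            f"--token {token} --ephemeral --unattended",
            "./run.sh",
        ])
        instance = ec2.run_instances(
            ImageId=ami_id,                # placeholder runner AMI
            InstanceType="m7a.2xlarge",    # placeholder size
            MinCount=1,
            MaxCount=1,
            UserData=user_data,
        )["Instances"][0]
        return instance["InstanceId"]

    def terminate_runner(instance_id: str) -> None:
        # Called once GitHub reports the job as finished.
        ec2.terminate_instances(InstanceIds=[instance_id])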

Using AWS as the compute provider has a few advantages:

- CPUs are typically 30%+ more performant than alternatives (the
  m7a instance type).

- Each instance has high-throughput networking of up to 12.5 Gbps,
  hosted in us-east-1, so interacting with artifacts, cache,
  container registries, or the internet at large is quick.

- Each instance has a public IPv4 address, so it does not share
  rate limits with anyone else.

We integrated the runners with the distributed cache system (backed
by S3 and Ceph) that we use for Docker build cache, so jobs
automatically save / restore cache from this cache system, with
speeds of up to 1 GB/s, and without the default 10 GB per repo
limit.

Building this was a fun challenge: some matrix workflows start 40+
jobs at once, requiring 40+ EC2 instances to launch at once. We've
gotten very good at starting EC2 instances quickly with a "warm
pool" system: we prepare many EC2 instances to run a job, stop
them, then resize and start them when an actual job request
arrives, keeping job queue times around 5 seconds. We're using a
homegrown orchestration system, as alternatives like autoscaling
groups or Kubernetes weren't fast or secure enough.
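
Conceptually, the warm pool behaves something like the sketch below
(again an illustrative Python sketch rather than our real
orchestrator; the warming instance type, resize target, and pool
size are placeholders):

    # Illustrative warm-pool sketch: boot instances ahead of time so the
    # runner AMI's contents are hydrated onto their EBS volumes, stop
    # them, then resize and start one when a job actually arrives.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def fill_pool(ami_id: str, count: int) -> list[str]:
        result = ec2.run_instances(
            ImageId=ami_id,
            InstanceType="m7a.medium",   # small placeholder size for warming
            MinCount=count,
            MaxCount=count,
        )
        ids = [i["InstanceId"] for i in result["Instances"]]
        # In practice you'd wait for the warm-up work to finish, not just
        # for the instances to reach "running"; this keeps the sketch short.
        ec2.get_waiter("instance_running").wait(InstanceIds=ids)
        ec2.stop_instances(InstanceIds=ids)
        return ids

    def claim(instance_id: str, instance_type: str) -> None:
        # A stopped instance can be resized to whatever the job requested,
        # then started quickly since its EBS volume is already warm.
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": instance_type},
        )
        ec2.start_instances(InstanceIds=[instance_id])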

There are three alternatives to our managed runners currently:

1. GitHub offers larger runners: these have more CPUs, but still
   have slow network and cache. Depot runners are also 1/2 the
   cost per minute of GitHub's runners.

2. You can self-host the Actions runner on your own compute: this
   requires ongoing maintenance, and it can be difficult to ensure
   that the runner image or container matches GitHub's.

3. There are other companies offering hosted GitHub Actions
   runners, though they frequently use cheaper compute hosting
   providers that are bottlenecked on network throughput or
   geography.

Any feedback is very welcome! You can sign up at
https://depot.dev/sign-up for a free trial if you'd like to try it
out on your own workflows. We aren't able to offer a trial without
a signup gate, both because using it requires installing a GitHub
app and because we're offering build compute, so we need some way
to keep out the cryptominers :)
Author : jacobwg
Score : 39 points
Date : 2024-04-04 14:32 UTC (8 hours ago)
| playingalong wrote:
| Can I use my own AWS account?
| jacobwg wrote:
| You can! The default is that we launch the runners on our AWS
| account, but we do also have a bring-your-own-cloud deployment
| option.
|
| We have some docs on this for our container builder product -
| still need to write the docs for Actions runners too, though
| they use the same underlying system:
| https://depot.dev/docs/self-hosted/overview.
| toomuchtodo wrote:
| How will you compete if GitHub talks to the Azure folks (who have
| the benefit of Azure scale) and gets better compute and network
| treatment for runners? Or is the assumption that GH runners remain
| perpetually stunted as described (which is potentially a fair and
| legit assumption to make based on MS silos and enterprise
| inertia)?
|
| To be clear, this is a genuine question, as compute (even when
| efficiently orchestrated and arbitraged) is a commodity. Your
| cache strategy is good (will be interested in testing to tease
| out where is S3 and where is Ceph), but not a moat and somewhat
| straightforward to replicate.
|
| (again, questions from a place of curiosity, nothing more)
| jacobwg wrote:
| Yep, it's a good question! At the moment, my thoughts are
| roughly:
|
| GitHub's incentives and design constraints are different than
| ours. GitHub needs to offer something that serves a very large
| user base and covers the widest possible range of workflows,
| and they've done this by offering basic ephemeral VMs on-
| demand. CI and builds are also not GitHub's primary focus as an
| org.
|
| We're trying to be the absolute fastest place to build
| software, with a deep focus on achieving maximum performance
| and reducing build time as much as possible (even to 0 with
| caching). Software builds today are often wildly inefficient,
| and I personally believe there's an opportunity to do for build
| compute what has been done for application compute over the
| last 10 years.
|
| GitHub Actions workflows are more of an "input" for us, then
| (similar to how container image builds have been), with the
| goal of adding more input types over time and applying the same
| core tech to all of them.
| toomuchtodo wrote:
| Good reply. It seems like you understand the market and where
| your product fits, which is half the battle.
|
| Wishing you much success.
| jacobwg wrote:
| Thank you!
| playingalong wrote:
| Corporate inertia might not be the only reason for excessive
| pricing.
|
| They might simply be charging for the convenience of everything
| working out of the box. Or even for users not being aware there
| are other options.
| crohr wrote:
| I believe the solution is to decentralise, i.e. let the
| customer run the machines in their own AWS account (what I'm
| doing with RunsOn, link in bio if interested).
|
| It is very hard for a single player to get favourable treatment
| from Azure / AWS / GCP to handle many thousands of jobs every
| day / hour.
|
| I wish Depot all the luck, I think they've done good work wrt
| caching.
| pestkranker wrote:
| How does it compare to BuildJet?
| jacobwg wrote:
| We're both offering managed GitHub Actions runners - some of
| the differences include:
|
| - Depot runners are hosted in AWS us-east-1, which has
| implications for network speed, cache speed, access to internet
| services, etc. (BuildJet is hosted in Europe - maybe Hetzner?)
|
| - Also thanks to AWS: each runner has a dedicated public IP
| address, so you're not sharing any third-party rate limits
| (e.g. Docker Hub) with other users
|
| - We have an option to deploy the runners in your own AWS
| account or VPC-peer with your VPC
|
| - We're integrating Actions runners with the acceleration tech
| we've built for container builds, starting with distributed
| caching
| crohr wrote:
| Yes, BuildJet runs from Hetzner - https://runs-
| on.com/reference/benchmarks-gha-providers/
| everfrustrated wrote:
| GitHub has a colo presence in Frankfurt so pulling repos from
| Europe is quick.
| watermelon0 wrote:
| Hey, @jacobwg, this looks great.
|
| I couldn't find it anywhere on the page, but do you support
| Graviton3 (i.e. m7g instances) for GHA Runners? If the answer is
| no, are there any plans to support it in the future?
|
| > start them when an actual job request arrives, to keep job
| queue times around 5 seconds
|
| Did you have to fine-tune Ubuntu kernel/systemd boot to reach
| such fast startup times?
| playingalong wrote:
| Not affiliated, just guessing.
|
| This 5 seconds might be the warm start, not cold. I.e. they
| likely have a pool of autoscaled, multi-tenant workers.
| jacobwg wrote:
| Yeah 5 seconds is from stopped to running, but to get that
| speed we need to pre-initialize the root EBS volumes so that
| they're not streaming their contents from S3 during boot. The
| GitHub Actions runner image is 50GB in size _just_ from
| preinstalled software!
| jacobwg wrote:
| We do support Graviton! I actually _just_ enabled them today,
| which we're calling "beta" for the moment:
| https://depot.dev/docs/github-actions/overview#depot-
| support....
|
| The challenge with Arm is actually just that GitHub doesn't
| have a runner image defined for Arm. For the Intel runners, we
| build our image directly from GitHub's source[0], and we're
| doing the same for the Arm runners by patching those same
| Packer scripts for arm64. It also looks like some popular
| actions, like `actions/setup-*`, don't always have arm support
| either.
|
| So the disclaimers for launching Depot `-arm` instances at the
| moment are basically just (1) we have no idea if our image is
| compatible with your workflows, and (2) those instances take a
| bit longer to start.
|
| On achieving fast startup times, it's a challenge. :) The main
| slowdown that prevents a <5s kernel boot is actually EBS lazy-
| loading the AMI from S3 on launch.
|
| To address that at the moment, we do keep a pool of instances
| that boot once, load their volume contents, then shut down until
| they're needed for a job. That works, at the cost of extra
| complexity and extra money - though we're now experimenting with
| more exotic solutions like netbooting the AMI. That'll be a nice
| blog post someday, I think.
|
| [0] https://github.com/actions/runner-
| images/tree/main/images/ub...
| werewrsdf wrote:
| I recently set up AWS GitHub runners with this Terraform module.
| It works well, and you don't have to pay anything extra beyond
| AWS costs.
|
| https://github.com/philips-labs/terraform-aws-github-runner
| jacobwg wrote:
| Yeah this is a good option if you'd like something to deploy
| yourself! You can also build an AMI from GitHub's upstream
| image definition (https://github.com/actions/runner-
| images/tree/main/images/ub...) if you'd like it to match what's
| available in GitHub-hosted Actions.
|
| With Depot, we're moving towards deeper performance
| optimizations and observability than vanilla GitHub runners -
| for instance, we've integrated the runners with a cache storage
| cluster, and we're working on deeper integration with the
| compute platform we built for distributed container image builds
| - as well as expanding the types of builds we can process beyond
| Actions and Docker.
|
| But different options will be better for different folks, and
| the `philips-labs` project is good at what it does.
| striking wrote:
| I helped set this up at my workplace and can second that it
| works fairly well, but it definitely has scale issues: we tend
| to exhaust our GH org's API rate limit and sometimes end up
| unable to scale up, and we see containers get prematurely
| terminated because the scale-down Lambda doesn't always seem to
| see them in the GH API. It's also definitely lacking a lot of
| tooling around building runner images and caching optimization
| that we ended up building in-house.
|
| Definitely linking OP to my team now.
| alas44 wrote:
| How do you ensure privacy/isolation between users if you have a
| pool of ready VMs that you re-use?
| jacobwg wrote:
| We don't re-use the VMs - a VM's lifecycle is basically:
|
| 1. Launch, prepare basic software, shut down
|
| 2. A GitHub job request arrives at Depot
|
| 3. The job is assigned to the stopped VM, which is then started
|
| 4. The job runs on the VM and completes
|
| 5. The VM is terminated
|
| So the pool exists to speed up the EC2 instance launch time,
| but the VMs themselves are both single-tenant and single-use.
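|
| In webhook terms, steps 2-5 look roughly like the sketch below.
| This is a very simplified illustration rather than our actual
| service: it assumes GitHub's standard workflow_job webhook as
| the trigger, and the pool contents are placeholders.
|
|     # Simplified sketch: react to GitHub workflow_job webhooks by
|     # starting a stopped warm-pool VM, then terminating it afterwards.
|     import json
|     from http.server import BaseHTTPRequestHandler, HTTPServer
|
|     import boto3
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     stopped_pool = ["i-0123456789abcdef0"]   # placeholder pool
|     assigned = {}                            # job id -> instance id
|
|     class JobWebhook(BaseHTTPRequestHandler):
|         def do_POST(self):
|             length = int(self.headers["Content-Length"])
|             event = json.loads(self.rfile.read(length))
|             job = event.get("workflow_job", {})
|             if event.get("action") == "queued" and stopped_pool:
|                 # Steps 2-3: assign the job to a stopped VM, start it.
|                 instance_id = stopped_pool.pop()
|                 assigned[job["id"]] = instance_id
|                 ec2.start_instances(InstanceIds=[instance_id])
|             elif (event.get("action") == "completed"
|                   and job.get("id") in assigned):
|                 # Step 5: terminate the single-use VM.
|                 instance_id = assigned.pop(job["id"])
|                 ec2.terminate_instances(InstanceIds=[instance_id])
|             self.send_response(204)
|             self.end_headers()
|
|     HTTPServer(("", 8080), JobWebhook).serve_forever()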
| timvdalen wrote:
| Congrats on shipping! We built something similar internally.
| Tweaking it for the right cost/availability/speed was
| interesting, but we now have it working to the point where
| workers are generally spun up from zero faster than GitHub's own.
| jacobwg wrote:
| Yeah, GitHub's runners, especially the ones with >2 CPUs, have
| surprisingly long start times!
| madisp wrote:
| > - Each instance has high-throughput networking of up to 12.5
| Gbps, hosted in us-east-1, so interacting with artifacts, cache,
| container registries, or the internet at large is quick.
|
| Do you actually get the promised 12.5 Gbps? I've been doing some
| experiments and it's really hard to get over 2.5 Gbit/s upstream
| from AWS EC2, even when using large 64 vCPU machines. Intra-AWS
| (e.g. VPC) traffic is another thing and that seems to be ok.
| jacobwg wrote:
| We do get the promised throughput, but it depends on the
| destination as you've discovered. AWS actually has some docs on
| this[0]:
|
| - For instances with >= 32 vCPUs, traffic to an internet
| gateway can use 50% of the throughput
|
| - For instances with < 32 vCPUs, traffic to an internet gateway
| can use 5 Gbps
|
| - Traffic inside the VPC can use the full throughput
|
| So for us, that means traffic outbound to the public internet
| can use up to 5 Gbps, but for things like our distributed cache
| or pulling Docker images from our container builders, we can
| get the full 12.5 Gbps.
|
| [0]
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
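|
| As a rough rule of thumb in code (illustrative only, following
| the AWS doc linked above):
|
|     def internet_egress_gbps(aggregate_gbps: float, vcpus: int) -> float:
|         # Traffic to an internet gateway: 50% of aggregate bandwidth
|         # for >= 32 vCPUs, otherwise capped at 5 Gbps. Intra-VPC
|         # traffic can use the full aggregate bandwidth.
|         if vcpus >= 32:
|             return aggregate_gbps * 0.5
|         return min(aggregate_gbps, 5.0)
|
|     # e.g. a 16-vCPU runner with a 12.5 Gbps aggregate limit:
|     #   internet_egress_gbps(12.5, 16) == 5.0 (public internet),
|     #   while intra-VPC traffic (cache, registry) can use 12.5 Gbps.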
| ijustlovemath wrote:
| It's important with these kinds of claims to under-promise
| and over-deliver.
___________________________________________________________________
(page generated 2024-04-04 23:00 UTC)