[HN Gopher] Making EC2 boot time faster
___________________________________________________________________
Making EC2 boot time faster
Author : jacobwg
Score : 160 points
Date : 2024-05-23 14:31 UTC (8 hours ago)
(HTM) web link (depot.dev)
(TXT) w3m dump (depot.dev)
| amluto wrote:
| I don't use EC2 enough to have played with this, but a big part
| here is the population of the AMI into the per-instance EBS
| volume.
|
| ISTM one could do much better with an immutable/atomic setup: set
| up an immutable read-only EBS volume, and have each instance
| share that volume and have a per-instance volume that starts out
| blank.
|
| Actually pulling this off looks like it would be limited by the
| rules of EBS Multi-Attach. One could have fun experimenting with
| an extremely minimal boot AMI that streams a squashfs or similar
| file from S3 and unpacks it.
|
| edit: contemplating a bit, unless you are willing to babysit your
| deployment and operate under serious constraints, EBS multi-
| attach looks like the wrong solution. I think the right approach
| would be to build a very, very small AMI that sets up a rootfs using
| s3fs or a similar technology and optionally puts an overlayfs on
| top. Alternatively, it could set up a block device backed by an
| S3 file and optionally use it as a base layer of a device-mapper
| stack. There's plenty of room to optimize this.
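|
| A minimal sketch of what that early-boot step could look like
| (hypothetical bucket/object names; assumes curl, squashfs and
| overlayfs support are available in the initramfs):
|
|       #!/bin/sh
|       # Stream the immutable root image from S3 and keep it in RAM.
|       mkdir -p /run/rootfs /run/rw /mnt/newroot
|       curl -sf https://example-bucket.s3.amazonaws.com/rootfs.squashfs \
|           -o /run/rootfs.squashfs
|       # Read-only squashfs below, writable tmpfs layered on top.
|       mount -t squashfs -o loop /run/rootfs.squashfs /run/rootfs
|       mount -t tmpfs tmpfs /run/rw
|       mkdir -p /run/rw/upper /run/rw/work
|       mount -t overlay overlay -o \
|           lowerdir=/run/rootfs,upperdir=/run/rw/upper,workdir=/run/rw/work \
|           /mnt/newroot
|       # Hand off to the real init on the assembled root.
|       exec switch_root /mnt/newroot /sbin/init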
| mdaniel wrote:
| I believe they addressed this in their post because one cannot
| (currently?) `aws ec2 run-instances --volume-id vol-cafebabe`,
| rather one can only tell AWS what volume parameters to use when
| they _create_ the root device. Your theory may still be sound
| about using some kind of super bare-bones AMI, but there is no
| way to say "hey, friend, use this existing EBS volume as your
| root volume; don't create a new one."
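|
| For reference, the knob that does exist is overriding the root
| volume's parameters (not its identity) at launch, roughly like
| this (placeholder IDs; the device name must match the AMI's root
| device):
|
|       aws ec2 run-instances \
|           --image-id ami-0123456789abcdef0 \
|           --instance-type m7i.large \
|           --block-device-mappings '[{
|             "DeviceName": "/dev/xvda",
|             "Ebs": {"VolumeType": "gp3", "Iops": 16000,
|                     "Throughput": 1000, "DeleteOnTermination": true}
|           }]'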
| stingraycharles wrote:
| Isn't EBS multi-attach only available for the (very expensive)
| io1 / io2 volume types?
| amluto wrote:
| Hmm, it does look like it, although one could carefully use
| large IO.
|
| But the bigger issue might be durability. Most EBS types have
| rather low quoted durability, and, for a shared volume like
| this, that's a problem. Using S3 instead would be better all
| around except for the smallish engineering effort and
| deployment effort needed.
|
| Getting a tool like mkosi to generate a boot-from-S3 setup
| should be straightforward. Converting most any bootable
| container should also be doable, even automatically.
| Converting an AMI would involve more heuristics and be more
| fragile, but it ought to work reliably with most modern Linux
| distros.
| Szpadel wrote:
| We used s3fs in production. Please don't use it: it's
| unreliable, has unpredictable failure modes, and can bring the
| whole instance down. If you really need something like that,
| use rclone mount instead.
| attentive wrote:
| That's reinventing EBS/AMI/snapshots. They already do exactly
| that, i.e. data moves lazily from S3 to EBS/EC2.
| maccard wrote:
| I don't use GHA as some of our code is stored in Perforce, but
| we've faced the same challenges with EC2 instance startup times
| on our self managed runners on a different provider.
|
| We would happily pay someone like depot for "here's the AMI I
| want to run & autoscale, can you please do it faster than AWS?"
|
| We hit this problem with containers too - we'd _love_ to just run
| all our CI on something like fargate and have it automatically
| scale and respond to our demand, but the response times and rate
| limiting are just _so slow_ that instead we just end up
| starting/stopping instances with a lambda, which feels so 2014.
| CaptainOfCoit wrote:
| > We would happily pay someone like depot for "here's the AMI I
| want to run & autoscale, can you please do it faster than AWS?"
|
| Change that to "here's the ISO/IMG I want to run & autoscale,
| can you please do it faster than AWS?" and you'll have tons of
| options. Most platforms using Firecracker would most likely be
| faster, maybe try to use that as a search vector.
| maccard wrote:
| Can you maybe share some examples? We're fine to use other
| image formats, but a lot of the value of AWS is that the
| services interact, IAM works nicely together, etc.
|
| Fly.io comes up often [0] on HN, but there's an overwhelming
| amount of "it's a nice idea, but it just doesn't work"
| feedback on it.
|
| [0] https://news.ycombinator.com/item?id=39363499
| everfrustrated wrote:
| Out of curiosity what CI system are you using with Perforce?
| maccard wrote:
| We use Buildkite with a customised version of
| https://github.com/improbable-eng/perforce-buildkite-plugin/
|
| Our game code is in P4, but our backend services are on GH.
| Having a single CI system means we get easy interop e.g. game
| updates can trigger backend pipelines and vice versa.
|
| In the past I've used TeamCity, Jenkins, and
| ElectricCommander(!)
| Szpadel wrote:
| I haven't fully investigated Fargate's limitations, but I think
| it would be possible to use any k8s-native CI on EKS + Fargate,
| maybe even use KubeVirt for VM creation? From my exploration of
| Fargate with EKS, AWS provisioned capacity in the region of ~1s.
| maccard wrote:
| > AWS offers something very similar to this approach called
| warm pools for EC2 Auto Scaling. This allows you to define a
| certain number of EC2 instances inside an autoscaling group
| that are booted once, perform initialization, then shut down,
| and the autoscaling group will pull from this pool of compute
| first when scaling up.
|
| > While this sounds like it would serve our needs,
| autoscaling groups are very slow to react to incoming
| requests to scale up. From experimentation, it appears that
| autoscaling groups may have a slow poll loop that checks if
| new instances are needed, so the delay between requesting a
| scale up and the instance starting can exceed 60 seconds. For
| us, this negates the benefit of the warm pool.
|
| I pulled this from the article, but it's the same problem.
| Technically yes, eks + fargate works. In practice the
| response times from "thing added to queue" to "node is
| responding" is minutes with that setup.
| immibis wrote:
| There's something to be said about building a tower of abstractions
| and then trying to tear it back down. We used to just run a
| compiler on a machine. Startup time: 0.001 seconds. Then we'd run
| a Docker container on a machine. Startup time: 0.01 seconds.
| Fine, if you need that abstraction. Now apparently we're booting
| full VMs to run compilers - startup time: 5 seconds. But that's
| not enough, because we're also allocating a bunch of resources in
| a distributed network - startup time: 40 seconds.
|
| Do we actually need all this stuff, or does it suffice to get one
| really powerful server (price less than $40k) and run Docker on
| it?
| cjk2 wrote:
| I'm mostly just running the (Go) compiler on my laptop which is
| considerably faster than in Docker and considerably cheaper
| than the server...
|
| I mean a low-end M3 MacBook has the same compile time as an
| i9-14900k. God knows what an equivalent Xeon/Epyc costs...
| immibis wrote:
| Maybe your container isn't set up right - Docker containers run
| directly on the host, just partitioned off from accessing
| stuff outside of themselves with the equivalent of chroot. Or
| it could be a Mac-specific thing. Docker only works that way
| on Linux, and has to emulate Linux on other platforms.
| yjftsjthsd-h wrote:
| Right, they said they're on a macbook so unless they're
| going out of their way to run Linux bare-metal it has to
| use a VM. And AIUI there are extra footguns in that
| situation, especially that mapping volumes from the host is
| slower because instead of just telling the kernel to make
| the directory visible you have to actually share from the
| host to the VM.
|
| See also: https://reece.tech/posts/osx-docker-performance/
|
| See also: https://docs.docker.com/desktop/settings/mac/
|
| > Shared folders are designed to allow application code to
| be edited on the host while being executed in containers.
| For non-code items such as cache directories or databases,
| the performance will be much better if they are stored in
| the Linux VM, using a data volume (named volume) or data
| container.
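|
| Concretely, the difference the docs are describing is bind mount
| vs named volume, e.g. (illustrative Go build; image tag and paths
| are arbitrary):
|
|       # Bind mount: every file op crosses the host<->VM boundary on
|       # macOS, so dependency-heavy builds crawl.
|       docker run --rm -v "$PWD:/src" -w /src golang:1.22 go build ./...
|
|       # Keep the hot cache in a named volume that lives inside the
|       # Linux VM instead.
|       docker volume create gomodcache
|       docker run --rm -v "$PWD:/src" -v gomodcache:/go/pkg/mod \
|           -w /src golang:1.22 go build ./...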
| cjk2 wrote:
| Why would I use docker? You don't have to use it. I'm just
| generating static binaries.
|
| Does anyone understand how to do stuff without containers
| these days?
| skydhash wrote:
| I'm using VMs these days because of conflicts and
| inconsistencies between tooling. But the VM is dedicated
| to one project and I set it up just like a real machine
| (GUI, browser, and stuff). No file sharing. It's been a
| blast.
| rfoo wrote:
| Because you just said:
|
| > which is considerably faster than on docker
|
| And we are curious why that is, because we not only
| understand how to do stuff without containers, we _also_
| understand how containers work and your claim sounds off.
| cjk2 wrote:
| I don't understand what you are saying.
|
| I'm saying it is slower on docker due to container
| startup, pulling images, overheads, working out what
| incantations to run, filesystem access, network
| weirdness, things talking to other things, configuration
| required, pull limits, API tokens, all sorts.
|
| Versus "go run"
| benwaffle wrote:
| reminds me of https://world.hey.com/dhh/we-re-moving-
| continuous-integratio...
| cjk2 wrote:
| Yep.
|
| And you usually get lumbered with some shitty thing like
| GitHub Actions, which consumes one mortal full-time to keep
| it working, goes down twice a month (wasn't it down just
| yesterday?), takes bloody forever to build anything, and is
| impossible to debug.
|
| Edit: and MORE YAML HELL!
| mike_hearn wrote:
| A really powerful server should not cost you anywhere near $40k
| unless you're renting bare metal in AWS or something like that.
|
| Getting rid of the overhead is possible but hard, unless you're
| willing to sacrifice things people really want.
|
| 1. Docker. Adds a few hundred msec of startup time to
| containers, configuration complexity, daemons, disk caches to
| manage, repositories .... a lot of stuff. In rigorously
| controlled corp environments it's not needed. You can just have
| a base OS distro that's managed centrally and tell people to
| target it. If they're building on e.g. the JVM then Docker
| isn't adding much. I don't use it on my own company's CI
| cluster, for example; it's just raw TeamCity agents on raw
| machines.
|
| 2. VMs. Clouds need them because they don't trust the Linux
| kernel to isolate customers from each other, and they want to
| buy the biggest machines possible and then subdivide them.
| That's how their business model works. You can solve this a few
| ways. One is something like Firecracker where they make a super
| bare bones VM. Another would be to make a super-hardened
| version of Linux, so hardened that people trust it to provide inter-
| tenant isolation. Another way would be a clean room kernel
| designed for security from day one (e.g. written in Rust, Java
| or C#?)
|
| 3. Drives on a distributed network. Honestly not sure why this
| is needed. For CI runners entirely ephemeral VMs running off
| read only root drive images should be fine. They could swap to
| local NVMe storage. I think the big clouds don't always like to
| offer this because they have a lot of machines with no local
| storage whatsoever, as that increases the density and allows
| storage aggregation/binpacking, which lowers their costs.
|
| Basically a big driver of overheads is that people want to be
| in the big clouds because it avoids the need to do long term
| planning or commit capital spend to CI, but the cloud is so
| popular that providers want to pack everyone in as tightly as
| possible which requires strong isolation and the need to avoid
| arbitrary boundaries caused by physical hardware shapes.
| necovek wrote:
| How do you get Docker container startup time of 0.01s with any
| real-life workload (yes, I know they are just processes, so you
| could build a simple "hello world" thing, but I'd be surprised
| if even that runs this fast)?
|
| Do you have an example image and network config that would
| demonstrate that?
|
| (I'd love to understand the performance limits of Docker
| containers, but never played with them deeply enough since they
| are usually in >1s space which is too slow for me to care)
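|
| (A rough way to put a number on it yourself -- image tag is
| arbitrary, and on macOS the VM layer will dominate:)
|
|       docker pull alpine:3.19
|       time docker run --rm alpine:3.19 true   # container overhead
|       time true                               # bare process, for contrast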
| iudqnolq wrote:
| That doesn't solve the same problem.
|
| GitHub Actions in the standard setup needs to run untrusted
| code and so you essentially need a VM.
|
| You can lock it down at the cost of sacrificing features and
| usability, but that's a tradeoff.
| develatio wrote:
| Maybe AWS should actually look into this. I know comparing
| AWS to other (smaller) cloud providers is not totally fair given
| the size of AWS, but for example creating / booting an instance
| in Hetzner takes a few seconds.
| matt-p wrote:
| What's size got to do with boot time? Serious question.
| RationPhantoms wrote:
| More eyes employed on an issue, and the ability to pay
| best-in-class engineers to take a look.
| CaptainOfCoit wrote:
| Smaller companies are faster and more nimble than larger
| corporations.
| develatio wrote:
| By "the size" I meant to say "the size of the
| infrastructure", meaning that AWS has to manage orders of
| magnitude more instances than Hetzner. This may well
| contribute to "things" being slower.
| londons_explore wrote:
| Arguably it can also make things faster. A small provider
| might need to migrate other instances around to make space
| for your new instance, whereas a big provider almost
| certainly can satisfy your request from existing free
| capacity, and it should therefore be a matter of
| milliseconds to identify the physical machine your new VM
| will run on.
| playingalong wrote:
| Likely they mean that, following Conway's law, there are more
| abstraction layers involved at AWS.
| tekla wrote:
| They have, and I know this because I've hammered them on it:
| we demand thousands of instances to autoscale very
| aggressively in 1-3 minutes. Very few people give a shit about
| initialization times. They care more about instance ready times
| which is constrained by the OS that is running.
| everfrustrated wrote:
| Hetzner does not offer network block storage comparable to
| EBS that can be used as a root (bootable) file system. AWS
| locally attached ephemeral disks are also immediately available
| but cannot be seeded with data (same as Hetzner: they are wiped
| clean ahead of boot).
| andersa wrote:
| This is an advantage. EBS is terrible! Literally orders of
| magnitude slower than modern SSDs.
| tekla wrote:
| EBS is great for workloads that don't require SSDs, which
| most don't.
|
| If it does, you can use provisioned IOPS, which will get you a
| lot more, or go with NVMe.
| Nextgrid wrote:
| Even provisioned won't get you the access times of a
| direct-attached SSD. Speed of light and all that - EBS is
| using the network under the hood, it's not a direct
| connection to the host.
| tekla wrote:
| Yes, I know, and? That's why I mentioned NVMe.
| stingraycharles wrote:
| Depends on your definition of slow. Throughput-wise, I
| think it's fairly decent -- we typically set up 4 EBS
| volumes in raid0 and get 4GB/sec for a really decent price.
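|
| (For reference, that kind of setup is just a plain stripe across
| the attached volumes -- device names and mount point below are
| assumptions; on Nitro instances EBS volumes show up as
| /dev/nvmeXn1:)
|
|       sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
|           /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
|       sudo mkfs.ext4 /dev/md0
|       sudo mount /dev/md0 /data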
| Nextgrid wrote:
| Sequential throughput _can_ be fine. Random access is
| always going to be orders of magnitude slower than a
| direct-attach disk.
|
| Remember why we switched from spinning hard drives to
| SSDs? Well EBS is like going back to a spinning drive.
| torginus wrote:
| It also takes a few seconds on AWS. The guy is comparing
| setting up a whole new machine from an image, with network and
| all, to turning on a stopped EC2 instance.
|
| The latter takes a few seconds, the former is presumably
| longer. This is the great revelation of this blog post.
| dylan604 wrote:
| wait, restarting a stopped machine is faster than launching
| an AMI from scratch is a great revelation?
|
| That's like saying waking your MacbookPro is faster than
| booting from powered off state. Of course it is, and that's
| precisely why the option exists.
| mdeeks wrote:
| If you aren't familiar with how EBS works and how volumes
| are warmed, then yes, this is an interesting blog post. Not
| everyone is an expert. They become experts by reading
| things like this and learning.
|
| If you didn't know about this EBS behavior it would be
| logical to assume that booting from scratch is roughly
| equivalent to starting/stopping/starting again.
| jpambrun wrote:
| I think this is unexpected. I expected that once created,
| my boot volume would have the same performance on the first
| boot as on the second. It's really not obvious that the
| volume starts out empty and is lazily loaded from S3. The
| proposed workaround is also a bit silly: read all blocks
| one by one even though maybe 1% of the blocks have anything in
| them on a new VM. This is actually a revelation.
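|
| (The "read all blocks" workaround is literally that -- device
| name below is an assumption:)
|
|       # Sequentially touch every block so EBS pulls it from S3.
|       sudo dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress
|
|       # Or in parallel with fio, as the EBS docs suggest for volumes
|       # restored from snapshots.
|       sudo fio --filename=/dev/nvme0n1 --rw=read --bs=1M --iodepth=32 \
|           --ioengine=libaio --direct=1 --name=volume-initialize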
| attentive wrote:
| It depends on instance type and OS, and can be really short on
| EC2.
| everfrustrated wrote:
| It's too bad that EBS doesn't natively support Copy-On-Write.
|
| Snapshots are persisted into S3 (transparently to the user) but
| it means each new EBS volume spawned doesn't start at full IOPS
| allocation.
|
| I presume this is due to EBS volumes being AZ-specific, so to be
| able to launch an AMI-seeded EBS volume in any AZ it needs to go
| via S3 (which is multi-AZ).
| Twirrim wrote:
| EBS volumes are "expensive" compared to S3, due to the
| limitations of what you can do with live block volumes +
| replicas, vs S3. It takes more disk space to have an image be a
| provisioned volume ready to be used for copy-on-write, vs
| having it as something backed up in S3. So the incentives
| aren't there vs just trying to make the volume creation process
| as smooth and fast as possible.
|
| I'd guess it's likely that EBS is using a tiered caching
| system, where they'll keep live volumes around for Copy-on-
| write cloning for the more popular images/snapshots, with
| slightly less popular images maybe stored in an EBS cache of
| some form, before it goes all the way back to S3. You're just
| not likely to end up getting a live volume level of caching
| until you hit a certain threshold of launches.
| cmckn wrote:
| You can enable fast restore on the EBS snapshot that backs your
| AMI: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-
| sn...
|
| It's not cheap, but it speeds things up.
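|
| It's enabled per snapshot, per AZ (placeholder IDs and AZs):
|
|       aws ec2 enable-fast-snapshot-restores \
|           --availability-zones us-east-1a us-east-1b \
|           --source-snapshot-ids snap-0123456789abcdef0
|
|       # Poll until the state is "enabled" -- only then do new volumes
|       # get full performance immediately.
|       aws ec2 describe-fast-snapshot-restores \
|           --filters Name=snapshot-id,Values=snap-0123456789abcdef0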
| stingraycharles wrote:
| $540/month per EBS volume per AZ. And it's still fairly
| limited: at a maximum of 8 credits, it wouldn't come close to
| covering the use case described in the article (launching 50
| instances quickly).
| bingemaker wrote:
| Curious, how do you measure the time taken for those 4 steps
| listed in "What takes so long?" section?
| waiwai933 wrote:
| I believe this is similar to EC2 Fast Launch which is available
| for Windows AMIs, but I don't know exactly how that works under
| the hood.
|
| https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/win-a...
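|
| If I remember the CLI shape right, it's enabled per Windows AMI,
| something like this (placeholder ID; the exact flags may have
| drifted):
|
|       aws ec2 enable-fast-launch \
|           --image-id ami-0123456789abcdef0 \
|           --resource-type snapshot \
|           --max-parallel-launches 6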
| necovek wrote:
| From a technical perspective, Amazon has actually optimized this
| but turned that into "serverless functions": their ultra-
| optimized image paired with Firecracker achieves ultra-fast boot-
| up of virtual Linux machines. IIRC from when Firecracker was
| being introduced, they boot in sub-second times.
|
| I wonder if Amazon would ever decide to offer booting the same
| image with the same hypervisor in EC2 as they do for lambdas?
| arianvanp wrote:
| And AWS now has a product to spin up Lambdas for GitHub Actions
| CI runners
|
| https://docs.aws.amazon.com/codebuild/latest/userguide/actio...
| cr125rider wrote:
| Fargate is an alternative that runs on Firecracker as well.
| It's hidden behind ECS and EKS, however.
| 20thr wrote:
| 100% -- EC2's general purpose nature is not in my opinion the
| best fit for ephemeral use-cases. You'll be constantly fighting
| the infrastructure as the set of trade-offs and design goals
| are widely different.
|
| This is why CodeSandbox, Namespace, and even fly.io built
| special-purpose architectures to guarantee extremely fast
| start-up times.
|
| In the case of Namespace it's ~2 sec on cold boots, with a set
| of user-supplied containers and storage allocations.
|
| (Disclaimer, I'm with Namespace -- https://namespace.so)
| crohr wrote:
| > while we can boot the Actions runner within 5 seconds of a job
| starting, it can take GitHub 10+ seconds to actually deliver that
| job to the runner
|
| This. I went the same route with regards to boot time
| optimisations for [1] (cleaning up the AMI, cloud-init, etc.),
| and can boot a VM from cold in 15s (I can't rely on prewarming
| pools of machines -- even stopped -- since RunsOn doesn't share
| machines with multiple clients and this would not make sense
| economically).
|
| But the time for the official runner binary to load and then
| get assigned a job by GitHub is always around 8s, which is
| more than half of the VM boot time :( At some point it would be
| great if GitHub could give us a leaner runner binary with less
| legacy stuff, and tailored for ephemeral runners (that, or
| reverse-engineer the protocol).
|
| [1] https://runs-on.com
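|
| (For context, the ephemeral-runner flow being described is roughly
| this -- placeholder URL and token -- and the ~8s lives inside these
| two steps:)
|
|       # Register a one-shot runner that unregisters after a single job.
|       ./config.sh --url https://github.com/my-org/my-repo \
|           --token <registration-token> --ephemeral --unattended
|       # Connect to GitHub, wait for a job assignment, run it, exit.
|       ./run.sh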
| suryao wrote:
| This is a very cool optimization.
|
| I make a similar product offering fast GitHub Actions runners[1]
| and we've been down this rabbit hole of boot time optimization.
|
| Eventually, we realized that the best solution is to actually
| build scale. There are two factors in your favor then: 1) Spikes
| are less pronounced and the workloads are a lot more predictable.
| 2) The predictability means that you have a decent estimate of
| the workload to expect at any given time, within reason for
| maintaining an efficient warm pool.
|
| This enables us to simplify the stack and not have high-
| maintenance optimizations while delivering a great user experience.
|
| We have some pretty heavy use customers that enable us to do
| this.
|
| [1] https://www.warpbuild.com
| Nextgrid wrote:
| I don't get why they're using EBS here to begin with. EBS trades
| off cost and performance for durability. It's slow because it's a
| network-attached volume that's most likely also replicated under
| the hood. You use this for data that you need high durability
| for.
|
| It looks like their use-case fetches all the data it needs from
| the network (in the form of the GH Actions runner getting the job
| from GitHub, and then pulling down Docker containers, etc).
|
| What they need is a minimal Linux install (Arch Linux would be
| good for this) in a squashfs/etc and the only thing in EBS should
| be an HTTP-aware boot loader like iPXE, or a kernel+initrd capable
| of pulling down the squashfs from S3 and running it from memory.
| Local "scratch space" storage for the build jobs can be provided
| by the ephemeral NVMe drives, which are also direct-attached and
| much faster than EBS.
| jedberg wrote:
| By using EBS, they don't have to wait for the disk to fill from
| the network on second and subsequent boots.
| Nextgrid wrote:
| Ah so they are keeping the machines around? Do they need to
| do that - does the GH runner actually persist anything worth
| keeping in between runs?
| jedberg wrote:
| They keep the instances in a "stopped" state, which means
| keeping the EBS volume around (and paying for it) but not
| paying for the instance (which could be a different machine
| when you turn it back on, which is why you can't load it into
| scratch space and then stop it).
|
| What's on the EBS volume is their Docker image, so they don't have
| to load it back up again.
| Nextgrid wrote:
| Makes sense. I still think it would be cheaper to just
| reload it from S3 (straight into memory, not using EBS at
| all) on every boot. The entire OS shouldn't be more than
| a gigabyte which is quite fast to download as a bulk
| transfer straight into RAM.
| jedberg wrote:
| Yes it would be cheaper, but the whole point of this
| article is trading off cost for faster boot times. They
| address your points in the article: it's faster to
| boot off a warm EBS volume than to load from scratch.
| jedberg wrote:
| Boot time is the number one factor in your success with auto-
| scaling. The smaller your boot time, the smaller your prediction
| window needs to be. E.g., if your boot time is five minutes, you
| need to predict what your traffic will be in five minutes, but if
| you can boot in 20 seconds, you only need to predict 20 seconds
| ahead. By definition your predictions will be more accurate the
| smaller the window is.
|
| But! Autoscaling serves two purposes. One is to address load
| spikes. The other is to reduce costs with scaling down. What this
| solution does is trade off some of the cost savings by prewarming
| the EBS volumes and then paying for them.
|
| This feels like a reasonable tradeoff if you can justify the cost
| with better auto-scaling.
|
| And if you're not autoscaling, it's still worth the cost if the
| alternative is having your engineers wait around for instance
| boots.
| sfilmeyer wrote:
| >By definition your predictions will be more accurate the
| smaller the window is.
|
| Small nit, and this doesn't detract from your points. I don't
| think this is universally true by definition, even if it is
| almost always true. You could come up with some rare conditions
| where your traffic at t+5 minutes is actually easier to predict
| than at t+20 seconds. Of course, even in that case you're
| better off (ceteris paribus) being able to spin things up in 20
| seconds.
| jedberg wrote:
| I can come up with a lot of examples where it is easier to
| predict further out[0], but that also means I can predict
| them 20 seconds out. :)
|
| [0] For example I can tell you exactly when spikes will
| happen to Netflix's servers on Saturday morning (because the
| kids all get up at the same time). And I can tell you there
| will be spikes on the hour during prime time as people shift
| from linear TV to streaming (or at least they did a lot more
| 10 years ago!). I can also tell you when spikes to Alexa will
| be because I already know what times people's alarms are set
| for.
| paulddraper wrote:
| > From a billing perspective, AWS does not charge for the EC2
| instance itself when stopped, as there's no physical hardware
| being reserved; a stopped instance is just the configuration that
| will be used when the instance is started next. Note that you do
| pay for the root EBS volume though, as it's still consuming
| storage.
|
| Shut-down standbys are absolutely the way to do it.
|
| Does AWS offer anything for this? It's very tedious to set up
| yourself.
| tekla wrote:
| Warm pools
| paulddraper wrote:
| yep, that's it, thank you kind person
| mnutt wrote:
| They talk about the limitations of the EC2 autoscaler and mention
| calling LaunchInstances themselves, but are there any autoscaler
| service projects for EC2 ASGs out there? The AWS-provided one is
| slow (as they mention), annoyingly opaque, and has all kinds of
| limitations like not being able to use Warm Pools with multiple
| instance types etc.
| fduran wrote:
| So I've created ~300k EC2 instances with SadServers and my
| experience was that starting an EC2 VM from stopped took ~30
| seconds and creating one from AMI took ~50 seconds.
|
| Recently I decided to actually look at boot times since I store
| in the db when the servers are requested and when they become
| ready and it turns out for me it's really bi-modal; some take
| about 15-20s and many take about 80s, see graph
| https://x.com/sadservers_com/status/1782081065672118367
|
| Pretty baffled by this (same region, same pretty much
| everything), any idea why? Definitely going to try the trick
| in the article.
| fletchowns wrote:
| Perhaps in one case you are getting a slice of a machine that
| is already running, versus AWS powering up a machine that was
| offline and getting a slice of that one?
___________________________________________________________________
(page generated 2024-05-23 23:00 UTC)