[HN Gopher] Running e2e tests 10x faster using firecracker VMs
___________________________________________________________________
Running e2e tests 10x faster using firecracker VMs
Author : samanthachai
Score : 114 points
Date : 2022-04-17 17:00 UTC (6 hours ago)
(HTM) web link (webapp.io)
(TXT) w3m dump (webapp.io)
| fideloper wrote:
| What do y'all run firecracker on? The metal servers on aws (the
| only servers you can run firecracker on in aws) are pretty
| expensive!
| neatze wrote:
| What do these e2e tests do in webapp?
|
| I don't understand why you need to rebuild the Docker image on every
| app build; this seems really wasteful.
| n8ta wrote:
| If the app itself is part of the image you need to rebuild the
| image every time a dev wants to test their change.
| mtoddsmith wrote:
| Is that the same as redeploying your app to an existing
| container?
| goodpoint wrote:
| This has been done successfully using VMs for two decades.
| yewenjie wrote:
| Interesting. What other cool things are people doing with
| Firecracker?
| cpach wrote:
| Fly built a whole platform with Firecracker VMs:
| https://fly.io/
| sjosh003 wrote:
| I have been using Weave Ignite [1] recently to run Firecracker
| micro vm(s) instead of containers for a multitude of tasks!
|
| 1. https://github.com/weaveworks/ignite
| kaivalyagandhi wrote:
| interesting, I wonder if you can use this with GitHub self hosted
| runners?
| StreamBright wrote:
| Great article. Firecracker has been an amazing addition to my
| toolkit and it is good to see it succeeding in solving real-world
| problems.
| rossmohax wrote:
| They seem to be comparing CI runner starting from scratch to
| always on VM with firecracker preconfigured.
| nicoburns wrote:
| Firecracker _is_ a CI runner starting a VM for each run in this
| case, just a more optimised one, no?
| greatgib wrote:
| It always amazes me to see the new trend of DevOps people happily
| following such a tutorial, wget-ing and running random code from
| the internet in production...
| jrockway wrote:
| I don't think this is production, this is for running your
| tests. Your code in the "tests haven't run yet" state could
| leak all the secrets it has access to and destroy the
| machine it's running on, so you don't let it have any
| secrets and create a new machine each time. "curl | bash" here
| just injects potential flakiness (as does "npm install" when
| npm dies, etc.)
|
| Obviously a lot of people treat their CI system as their CD
| system, and do things like letting tests have highly privileged
| access to their production k8s cluster. That's a terrible idea
| even if you aren't installing software with "curl | bash".
|
| So overall, I don't think this is worth a HN comment to
| complain about. People are going to install software in non-
| auditable non-reproducible ways.
| CraigJPerry wrote:
| There's an even faster strategy than this and it's easier to
| set up.
|
| You're going to deploy 4 CI pipelines (so make sure you're not
| manually putting together CI pipeline configs; use automation):
|
| Pipeline 1: A conveyor belt of environments. All this pipeline
| does is spin up fresh environments then run a short automated
| smoke test. Hydrate the env with the most recent mask from prod.
| The trigger condition is that there are fewer than <Threshold>
| environments available. I did 8 on a whim and never saw a need to
| change it.
|
| Pipeline 2: Normal garden variety CI pipeline triggered on merges
| to main. Output of this will be two artifacts persisted: a built
| package and your unit test evidence
|
| Pipeline 3: Test your automated deployment by deploying the
| package built in #2 into the first of the queue of free envs
| from #1, then trigger your end-to-end, integration, and contract
| tests. Don't run your security or operability tests here.
|
| Pipeline 4: Async pipeline triggered on a 6hr schedule. Do your
| long-running stuff here, like fuzz testing and security tests;
| keep these outside of the dev cycle.
|
| Release candidates can only be signed after a successful run
| through 2, 3 & 4. That means prod deploys are on a predictable
| cadence, which users and ops usually appreciate more than "we
| drop it in when it's ready".
|
| The DevEx is pretty sweet - you don't see pipeline 1 or 4 in your
| build loop. Only the runtime of 3 would be comparable to the
| article - slightly faster than the article because no firecracker
| bringup overhead, no matter how small that is.
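The "conveyor belt" of pre-built environments described above can be sketched as a small queue that one pipeline keeps topped up and another drains. This is only an illustration of the idea, not the commenter's actual setup: `EnvironmentPool`, `create_env`, and `smoke_test` are hypothetical names, and the default threshold of 8 is simply the number mentioned in the comment.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class EnvironmentPool:
    """Queue of pre-built, smoke-tested environments (hypothetical sketch)."""

    create_env: Callable[[], str]       # assumed: builds and hydrates an env, returns its id
    smoke_test: Callable[[str], bool]   # assumed: short automated health check
    threshold: int = 8                  # keep this many environments ready
    ready: Deque[str] = field(default_factory=deque)

    def refill(self, max_attempts: int = 100) -> int:
        """Pipeline 1: create environments until `threshold` are ready."""
        added = 0
        for _ in range(max_attempts):
            if len(self.ready) >= self.threshold:
                break
            env = self.create_env()
            if self.smoke_test(env):    # environments failing the check are skipped
                self.ready.append(env)
                added += 1
        return added

    def acquire(self) -> str:
        """Pipeline 3: take the first free environment; it is never returned."""
        if not self.ready:
            raise RuntimeError("no pre-warmed environment available")
        return self.ready.popleft()
```

The deploy-and-test pipeline only ever calls `acquire()`, so environment bring-up time stays out of the developer's build loop as long as `refill()` keeps pace with demand.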
| drjasonharrison wrote:
| There are times when some corner of software development speaks
| a specialized language and this is an example.
|
| 1. Conveyor belt(?) of environments. Hydrate(?) the
| env(ironment). Mask(?) from prod(uction)
|
| 2. I think I got this. Typical "merge to main pipeline" with
| built product and test results as outputs.
|
| DevEx(?). And not sure why I wouldn't see pipeline #4 in my
| build loop because I can't deploy unless 2, 3 and 4 pass....
| Maybe you mean I don't wait to see it.
|
| Also not sure how it's faster because environments still need
| to be brought up. Unless you are trying to say that the
| environment is already running when the merge to master
| pipeline succeeds.
| forgotusername6 wrote:
| Used to do something similar with vsphere a while back. The
| servers took ages to get into the right state to test, so it was
| much easier to just revert to a snapshot to get a clean state.
| wyldfire wrote:
| Gee, why not just go straight to step 3 via fork/exec? Bound to
| shave off a few milliseconds beyond that 10x. And no firecracker
| required.
| melony wrote:
| If you are a cloud host, you need a way to sandbox hostile code.
| Firecracker allows you to do that (it is a configuration of the
| traditional KVM virtualization system, except lighter and
| faster: instead of booting a VPS, which can take minutes, you
| can now spawn one in under a second).
| FooBarWidget wrote:
| Not just sandboxing, but just ensuring that each test runs in
| a clean environment, without interference from
| files/processes left behind by a previous or even concurrent
| test.
| wyldfire wrote:
| To clarify my post: I see the reason for Firecracker to exist
| in general, it's great. But does "e2e tests" include
| untrusted code? I think it really shouldn't.
|
| So why use firecracker here? Invoking your tests in a bare VM
| or container is great for making sure that you are
| controlling the environment and enumerating your system
| dependencies. But this post proposes discarding those things
| and instead using some saved state as the entry point into
| your Firecracker. So now you are booting from Your Image
| instead of a { Official Distro Image + Dependency Recipe }.
| It seems like a step backward.
| chrisseaton wrote:
| > But does "e2e tests" include untrusted code?
|
| What other possible way could CI work?
| melony wrote:
| The company that wrote the article is an e2e testing _cloud
| hosting_ company that runs your code in _their cloud_.
| colinchartier wrote:
| Author here!
|
| I think there's always been a push/pull of "fat base
| images" versus "install everything every time" - It's
| obviously subjective, but I think it's more important to
| run the tests on every commit than it is to start the
| environment from scratch.
|
| It's also not necessarily mutually exclusive, you could
| have a "staging branch" where you make something that looks
| a lot like production and then re-run end-to-end tests
| there, while running the per-branch tests with this method
| to avoid slowing down developers.
| legulere wrote:
| Because process isolation under Unix is pretty lax. Processes
| by default have all the rights of the user. And you might
| end up with a system different from the initial state.
| ithkuil wrote:
| Firecracker is great and all, but the core idea described here
| also works with plain Docker; i.e. there is nothing inherently
| Firecracker-specific about the basic technique.
| colinchartier wrote:
| Author here!
|
| The three big differences are:
|
| 1. Docker doesn't deal with running processes (like postgres or
| redis), only the filesystem state
|
| 2. Docker doesn't have enough isolation, so you'd probably need
| to run it within qemu or firecracker for compliance in bigger
| teams
|
| 3. Docker-in-docker is still pretty painful, if you need to do
| anything nonstandard like change the size of /dev/shm, access
| /dev/kvm, or load kernel drivers, it'll take custom
| configuration.
| ignoramous wrote:
| Hi, offtopic but: is webapp.io a pivot from layerci, or just
| a rebranding?
|
| Interesting that you folks now use firecracker. I assume
| it now fills in adequately for the previously homegrown tech
| at layerci [0]?
|
| [0] https://news.ycombinator.com/item?id=25979941
| colinchartier wrote:
| Just a rebranding! (The technology's gotten better as well,
| of course - we didn't used to use firecracker at all)
|
| https://webapp.io/blog/layerci-has-rebranded-to-webapp-io/
| throwaway894345 wrote:
| I'm confused. Why do you need to snapshot live processes? Are
| we concerned about startup time of Postgres or whatever?
| Also, why is isolation needed for e2e tests? Lastly, why is
| docker-in-docker a requirement, and how is that easier than
| qemu in qemu or qemu in docker or whatever?
| colinchartier wrote:
| > Why do you need to snapshot live processes?
|
| Often times there are long-living processes which rarely
| change but take a long time to warm up. The Bazel [1] agent
| for C++ projects, the buildkit [2] state for docker, or the
| running Postgres or Redis server for a cloud native app for
| example.
|
| It's why running "docker build" twice on your laptop is so
| fast, but running "docker build" in CI seems glacially
| slow.
|
| > why is docker-in-docker a requirement, and how is that
| easier than qemu in qemu or qemu in docker or whatever?
|
| The example given was running "docker-compose build", so
| you'd need either docker-in-firecracker (this post),
| docker-in-docker, or docker-in-qemu. You'd almost never run
| docker-compose build on bare metal in practice, because
| you'd immediately need to send the images you built
| somewhere else in order to use them.
|
| [1] https://bazel.build/ [2]
| https://docs.docker.com/develop/develop-
| images/build_enhance...
| cpuguy83 wrote:
| But that's state on disk, not process state. It should
| not affect startup time in buildkit.
|
| I'm not experienced enough with Bazel to comment on that.
| cpuguy83 wrote:
| Docker does handle snapshots of running processes. It's
| called checkpoint/restore, and it uses the CRIU tooling to do
| this.
|
| In terms of doing this in a CI env like actions where you may
| have different types of machines serving you, it may be
| problematic as the machine specs need to pretty closely
| match.
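For reference, the checkpoint/restore feature mentioned above is exposed through the Docker CLI as `docker checkpoint create` and `docker start --checkpoint`. The sketch below only builds the command lines; the helper names and the container and checkpoint names are made up for illustration. The feature is experimental, so it assumes a daemon with experimental features enabled and CRIU installed on the host.

```python
from typing import List


def checkpoint_cmd(container: str, checkpoint: str) -> List[str]:
    """Command to snapshot a running container's process state via CRIU."""
    # By default the container is stopped once the checkpoint is written.
    return ["docker", "checkpoint", "create", container, checkpoint]


def restore_cmd(container: str, checkpoint: str) -> List[str]:
    """Command to start the container again from a saved checkpoint."""
    return ["docker", "start", "--checkpoint", checkpoint, container]


if __name__ == "__main__":
    # Hypothetical names; pass the lists to subprocess.run(cmd, check=True) to execute.
    print(" ".join(checkpoint_cmd("ci-postgres", "warm")))
    print(" ".join(restore_cmd("ci-postgres", "warm")))
```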
| jitl wrote:
| Yeah, I don't like that the article itself treats building the
| DB seed data, etc, into the Firecracker VM image like this is
| impossible to do in Docker. The techniques are good things to
| do -- but it's very tenuous how the techniques are connected to
| Firecracker.
|
| I've done all of the above using multi-layered Docker files and a
| cron CI job to rebuild the base integration test image every 6
| hours. Sure if you need the isolation, Firecracker is the way
| to go. But if you invest primarily in container shenanigans to
| speed up CI with Docker, it's not too much extra work to wrap
| it in a Firecracker VM, plain QEMU, or whatever once you start
| wanting more isolation.
|
| Also, maybe I'm holding it wrong but Docker in Docker has not
| bitten us yet on our GitHub action runners.
| lgierth wrote:
| You don't need a management daemon running though, and get a
| complete virtualized kernel that can be customized if needed.
| bornfreddy wrote:
| Ok, so IIUC, the main difference with firecracker versus
| docker is that processes are better separated from each other
| ("micro VM" instead of namespaces) and that one can run a
| customized kernel. But for e2e tests I've written, neither of
| these advantages mattered.
|
| I do love the idea of taking a snapshot of a prebuilt
| database image and can see where this would really speed up
| the tests.
| tedunangst wrote:
| But why does it require firecracker and not qemu?
| colinchartier wrote:
| QEMU takes much longer to save/restore snapshots, and it's much
| harder to do via the API
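To give a sense of what "via the API" looks like on the Firecracker side, here is a rough sketch of the request sequence for saving and restoring a microVM snapshot. It only builds the method, path, and JSON body for each call; in practice they are sent as HTTP requests over Firecracker's API Unix socket. The endpoint and field names follow my reading of the Firecracker snapshotting documentation and should be checked against the version in use; the function names and file paths are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Request = Tuple[str, str, Dict[str, object]]  # (HTTP method, path, JSON body)


def save_requests(snapshot_path: str, mem_path: str) -> List[Request]:
    """Pause the running microVM, then write a full snapshot to disk."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,  # VM state file
            "mem_file_path": mem_path,       # guest memory file
        }),
    ]


def restore_requests(snapshot_path: str, mem_path: str) -> List[Request]:
    """Load the snapshot into a fresh Firecracker process, then resume it."""
    return [
        ("PUT", "/snapshot/load", {
            "snapshot_path": snapshot_path,
            "mem_file_path": mem_path,
        }),
        ("PATCH", "/vm", {"state": "Resumed"}),
    ]


if __name__ == "__main__":
    for method, path, body in save_requests("/tmp/vm.snap", "/tmp/vm.mem"):
        print(method, path, json.dumps(body))
```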
| [deleted]
| n8ta wrote:
| Sounds like having an actual non-ephemeral computer with extra
| steps...
___________________________________________________________________
(page generated 2022-04-17 23:00 UTC)