[HN Gopher] Execute Docker Containers as QEMU MicroVMs
       ___________________________________________________________________
        
       Execute Docker Containers as QEMU MicroVMs
        
       Author : DarkPlayer
       Score  : 122 points
       Date   : 2021-06-16 16:05 UTC (6 hours ago)
        
 (HTM) web link (mergeboard.com)
 (TXT) w3m dump (mergeboard.com)
        
       | riobard wrote:
       | A few years ago I invested in a small startup called `hyper.sh`.
       | It open sourced a container runtime called `runV` which provided
       | exactly this: security of virtual machines plus convenience of
       | containers.
       | 
       | The project later merged with Intel Clear Container to become
       | what's now called Kata Containers (https://katacontainers.io/)
       | and is now widely used by several Internet giants like Alibaba
       | and Baidu.
       | 
       | The startup was acquired by Ant Finance a couple of years ago.
       | 
       | (I recorded a podcast with one of hyper.sh's engineers, if you
       | understand Mandarin: https://pan.icu/25)
        
         | polskibus wrote:
         | How does it differ from Firecracker?
        
           | riobard wrote:
           | I'm not familiar with later development, but AFAIK
           | Firecracker came much later and now you can actually use
           | Firecracker as Kata Container's hypervisor in addition to
           | QEMU.
        
         | temp_praneshp wrote:
         | Probably off topic: Back in 2014-15 at my first job, when I was
         | working on openstack, they used to show up at the summits. They
         | were super smart and very generous with their time when I had
         | questions. I wondered sometime in 2020 what happened to them,
         | I'm happy they had a decent exit.
        
         | lifty wrote:
         | I worked with their tech, testing it, and I loved the product.
         | It was definitely ahead of its time. Similar in some ways to
         | what Fly is doing these days, without the edge.
        
         | cptnapalm wrote:
         | I was looking at Kata containers a few days ago. I'm pretty new
         | to trying to use VMs/containers for services; purely hobby
         | level. Couldn't figure out how to use them, but that's not
         | necessarily a knock on them as I also can't get OpenBSD
         | wireguard to work either.
        
       | forty wrote:
       | Isn't firecracker an AWS tech?
        
         | cpach wrote:
         | That's correct.
         | 
         | https://github.com/firecracker-microvm/firecracker
        
       | encryptluks2 wrote:
       | Why not run containers in VMs in containers in VMs? :)
       | 
       | Seriously, VMs are hardly as secure as many people want to
       | believe unless you're utilizing enclaves and even that has
       | vulnerabilities. I think a better approach is Seccomp and
       | whatever other filtering makes sense.
        
         | dboreham wrote:
         | Machine Turducken.
        
         | handrous wrote:
         | A while back I did some looking at FreeBSD jails to try to
         | figure out why they don't have more mindshare (especially when
         | paired with the nigh-superpower-granting ZFS).
         | 
         | I came away baffled that they weren't more widely-promoted,
         | compared with Docker and friends. After thinking about it for a
         | while, all I can figure is they're so straightforward to use
         | and well-documented that there's no room to make one's name, or
         | to make a buck, re-packaging them or wrapping them in complex
         | tools, so there's little money or glory (= personal marketing
         | via open-source project leadership/contributions) in promoting
         | them.
         | 
         | [EDIT] that is: what would be a blog post in LXC/Docker land...
         | doesn't exist, because it's covered perfectly well in the docs.
         | What would be a simple open-source tool... becomes a blog post,
         | because it's short, simple, and clear enough not to merit
         | special software, but just a quick guide to existing tools.
         | What would be a business, becomes a simple open-source tool
         | without enough of a difficulty/convenience "moat" to support a
         | business.
        
           | nicolaslem wrote:
           | TrueNAS exposed me to FreeBSD jails but what put me off is
           | that there does not seem to be an equivalent of "docker
           | build".
           | 
           | Jails seem to be treated like OpenVZ containers in the Linux
           | world: a lighter alternative to virtual machines, not a way
           | to build and distribute applications like Docker.
           | 
           | This is just my take after playing a few hours with jails, I
           | would happily be proven wrong.
        
           | tyingq wrote:
           | If technically best in the container space mattered, Illumos
           | would be everywhere...
        
             | tptacek wrote:
             | People say this a lot too, but Illumos also uses shared-
             | kernel isolation. Linux + gVisor is probably
             | (significantly) superior to it as far as security goes.
        
             | cestith wrote:
             | Or z/OS
        
           | tptacek wrote:
           | Jails are still shared-kernel isolation. Docker's reputation
           | is mired in its earlier implementations, when it wasn't
           | really even intended for multitenant isolation. Modern
           | Docker, running with unprivileged containers (which is the
           | norm), is substantially hardened. The real win over Docker is
           | losing the shared kernel, which is what lots of people are
           | doing, so the win to Jails is marginal.
        
           | boardwaalk wrote:
           | I suspect the answer includes it not being Linux, even with
           | the compatibility layer available.
        
             | handrous wrote:
             | I'm sure that's some of it, but the trend seems to be
             | moving away from leveraging OS-level tools _anyway_. As
             | long as your containers (or jails) and the single important
             | binary in each one start up OK and your network tuning on
             | the parent OS isn't completely screwed up, the rest barely
             | matters anymore.
        
               | coder543 wrote:
               | It seems like you're missing a lot of things.
               | 
               | As a developer, how do I run FreeBSD Jails on my MacBook
               | during development? With Docker for Mac, it is trivial
               | for me to do everything on my Mac, and the fact that
               | there is a virtual machine is completely invisible to me.
               | Everything "Just Works". With FreeBSD Jails, I would have
               | to actually interact with a VM constantly, including the
               | pain of shipping files back and forth.
               | 
               | As a developer, are popular databases and applications
               | pre-packaged as FreeBSD Jails so that I can spin one up
               | on my laptop with a single command? Where is the Docker
               | Hub equivalent?
               | 
               | As a developer, how do I orchestrate a collection of
               | FreeBSD Jails for each project? With Docker, I define a
               | single `docker-compose.yml` file for each project. With a
               | single `docker-compose up`, the entire project is running
               | _including_ dependencies such as databases and other
               | related projects in a completely reproducible fashion.
               | This makes it trivial for coworkers to spin up a project
               | on their machine and immediately be productive without
               | spending an hour trying to get all the right versions of
               | everything installed and up and running.
               | 
               | As someone responsible for deploying an application to
               | production, what is the story around FreeBSD Jails for
               | deploying across a cluster? Is there a Kubernetes-
               | equivalent that can manage the allocation of resources,
               | blue-green deployments, and manage the lifecycle of my
               | FreeBSD Jails?
               | 
               | As someone responsible for deploying an application to
               | production, do any of the major clouds support FreeBSD
               | Jails? With Docker images, I can deploy those straight to
               | ECS Fargate, Google Cloud Run, and half a dozen other
               | services. Then I don't even have to think about my own
               | infrastructure unless I need some really specialized
               | hardware for a specific application.
               | 
               | > the rest barely matters anymore.
               | 
               | _Everything else_ matters so much.
               | 
               | As to your earlier point about ZFS, most Linux distros
               | these days seem to trivially support ZFS. Even TrueNAS is
               | working on switching to Linux with their TrueNAS Scale
               | offering.
               | 
               | It's not that I'm opposed to FreeBSD... FreeBSD is just a
               | hard sell. It's hard to pin down exactly what you're
               | gaining by throwing out all the collective Linux
               | knowledge of an organization and switching to FreeBSD.
               | FreeBSD is an N-th tier platform for pretty much every
               | programming language except C, so good luck when you run
               | into random subtle problems. Also, good luck doing
               | hardware accelerated machine learning inference or
               | training on FreeBSD... it's _probably_ possible?
               | 
               | > the single important binary
               | 
               | This is also such a weird thing to throw out there. I
               | like a good Go program myself, but _most_ companies are
               | not only deploying single-binary statically linked
               | applications. Most companies are also deploying some kind
               | of Ruby, Python, or Java application... none of which are
               | likely to be a single file in practice. Most of them will
               | have a variety of shared libraries, and I don't know if
               | I've ever seen a Ruby application shipped in a `FROM
               | scratch` container before. Technically possible, but
               | that's just not common reality as far as I've seen. It
               | sounds like you're proposing that everyone is already
               | running in `FROM scratch` containers, so a FreeBSD Jail
               | is just a drop-in replacement.
               | 
               | Linux containers are far from perfect, but as a
               | developer... I _have_ played with FreeBSD Jails before,
               | and come away frustrated by all the work you have to do
               | yourself.
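A minimal sketch of the per-project Compose definition described in the comment above, assuming a project with one application container and one database. Compose files are normally YAML; JSON is also valid YAML, so this sketch builds the definition as a dict and prints it as JSON using only the standard library. The image name `example/web:latest`, the service names, the port, and the credentials are all hypothetical placeholders, not taken from the thread.

```python
import json


def compose_project(app_image: str, db_password: str) -> dict:
    """Return a Compose-style definition: one web service plus its database."""
    return {
        "services": {
            "web": {
                "image": app_image,  # hypothetical application image
                "ports": ["8080:8080"],
                "environment": {
                    # The hostname "db" resolves to the database service below.
                    "DATABASE_URL": f"postgres://app:{db_password}@db:5432/app",
                },
                "depends_on": ["db"],
            },
            "db": {
                "image": "postgres:16",
                "environment": {
                    "POSTGRES_USER": "app",
                    "POSTGRES_PASSWORD": db_password,
                    "POSTGRES_DB": "app",
                },
                # Named volume so the data survives container re-creation.
                "volumes": ["db-data:/var/lib/postgresql/data"],
            },
        },
        "volumes": {"db-data": {}},
    }


if __name__ == "__main__":
    print(json.dumps(compose_project("example/web:latest", "change-me"), indent=2))
```

Saving the output to a file and passing it to Compose with `-f` should bring up both services together; treat that workflow as an assumption to verify against the Compose version in use.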
        
               | handrous wrote:
               | > > the single important binary
               | 
               | > This is also such a weird thing to throw out there. I
               | like a good Go program myself, but most companies are not
               | only deploying single-binary statically linked
               | applications. Most companies are also deploying some kind
               | of Ruby, Python, or Java application... none of which are
               | likely to be a single file in practice.
               | 
               | Sure, but usual practice with containers is to put each
               | thing in its own, unless they are _very_ tightly coupled.
               | Web-app with a SQL database and a memory cache? Three
               | containers. You _can_ do otherwise, but that's typical.
               | Usually each container ends up with one main, important
               | running process, and not much else.
               | 
               | [EDIT]
               | 
               | > As someone responsible for deploying an application to
               | production, what is the story around FreeBSD Jails for
               | deploying across a cluster? Is there a Kubernetes-
               | equivalent that can manage the allocation of resources,
               | blue-green deployments, and manage the lifecycle of my
               | FreeBSD Jails?
               | 
               | > As someone responsible for deploying an application to
               | production, do any of the major clouds support FreeBSD
               | Jails? With Docker images, I can deploy those straight to
               | ECS Fargate, Google Cloud Run, and half a dozen other
               | services. Then I don't even have to think about my own
               | infrastructure unless I need some really specialized
               | hardware for a specific application.
               | 
               | These are exactly the kinds of things I was thinking of
               | when I noted that the OS itself has been seriously
               | diminished in importance, for modern workflows. I agree
               | that most commercial or high-profile open-source "cloud"
               | tools and platforms are built around LXC/Docker.
        
               | coder543 wrote:
               | > Sure, but usual practice with containers is to put each
               | thing in its own, unless they are very tightly coupled.
               | Web-app with a SQL database and a memory cache? Three
               | containers. You can do otherwise, but that's typical.
               | Usually each container ends up with one main, important
               | running process, and not much else.
               | 
               | I agree, but... getting all the application dependencies
               | in there is more than just getting a single binary in
               | there. If it's just a single-binary Go program, then a
               | Jail works just fine, but it's not that simple for a Ruby
               | application. I'm definitely not talking about databases
               | running in the same container as the application. That's
               | where Kubernetes and docker-compose come in for multi-
               | container orchestration, which are things that FreeBSD
               | Jails don't have as far as I know.
               | 
               | > These are exactly the kinds of things I was thinking of
               | when I noted that the OS itself has been seriously
               | diminished in importance
               | 
               | Yes, but... these are all the things that FreeBSD doesn't
               | offer. These are the real reasons that people don't talk
               | about FreeBSD Jails in the same breath as Docker. The
               | Docker container itself (or the FreeBSD Jail) as a unit
               | of isolation is the least interesting part of the
               | ecosystem. All of the developer tools, orchestration
               | tools, and prebuilt images are what make the Docker
               | universe so interesting, and make FreeBSD Jails... less
               | interesting.
               | 
               | You said you were confused why Jails don't have more
               | mindshare. It has absolutely nothing to do with people
               | being able to invent useless tools and write blog posts
               | about them, and it has absolutely nothing to do with
               | FreeBSD Jails being _too well documented_. You kind of
               | implied those were the best explanations you could come
               | up with. Those are not the problems _at all_, and it
               | seems disingenuous to me to say you think those are the
               | problems unless you _really_ didn't know the things I
               | mentioned in my first reply.
        
               | oarsinsync wrote:
               | FreeBSD introduced Jails in 1999.
               | 
               | I used my first Jail in 2001.
               | 
               | Docker was started over a decade later in 2013.
               | 
               | It's reasonable to be confused why Jails lacks the
               | mindshare. "Because it lacks all these other over-the-top
               | features that we need" might be reasonable in response,
               | except that Docker didn't have any of these things on day
               | 0 either.
               | 
               | Jails had a 14-year head start; Docker reinvented the
               | wheel, and not particularly well at first. Why did it
               | succeed more than Jails did? It wasn't because of the
               | piss-poor native Mac support.
        
               | tptacek wrote:
               | It seems pretty obvious that the big thing here is that
               | most people ship apps on Linux, not on FreeBSD.
        
               | handrous wrote:
               | My personal favorite thing about Docker, and the part I'd
               | most miss if I switched to Jails (which I'm fairly
               | confident could meet my needs with some fairly simple
               | scripts and aliases that wouldn't take me long to arrive
               | at, which is why I think there's so much less of an
               | "ecosystem" there, even a nascent and under-developed
               | one) is the way it forces projects to un-fuck their
               | configuration.
               | 
               | 500-line config, much of which few people ever care
               | about, with all kinds of ill-conceived nesting? Better
               | put the ~20 options that 99% of users ever touch in
               | environment variables, and document them. Weird state
               | garbage that's not captured in your config-on-disk?
               | Better figure it out and get it into env vars, and have
               | your startup script use those to transparently manage
               | whatever bad decisions you made re: state in the past.
               | Shit files all over the system? Better get that sorted
               | out so people can handle persistence with at the _very_
               | most three total mounts--and oh, gee, look, now your
               | simple example docker-compose also serves to document
               | where exactly you store files. And so on.
               | 
               | (my second-favorite thing is that it's a de-facto cross-
               | distro package manager with very up-to-date packages that
               | are trivial to completely and cleanly uninstall)
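A minimal sketch of the configuration style described in the comment above: the handful of options most users touch are read from environment variables with documented defaults, instead of from a large nested config file. The variable names (`APP_PORT`, `APP_DATA_DIR`, `APP_LOG_LEVEL`) and their defaults are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    listen_port: int  # APP_PORT, default 8080
    data_dir: str     # APP_DATA_DIR, default /data (the one mount to persist)
    log_level: str    # APP_LOG_LEVEL, default info


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables.

    Falls back to os.environ when no mapping is given, which keeps the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    return Settings(
        listen_port=int(env.get("APP_PORT", "8080")),
        data_dir=env.get("APP_DATA_DIR", "/data"),
        log_level=env.get("APP_LOG_LEVEL", "info").lower(),
    )


if __name__ == "__main__":
    print(load_settings())
```

With this shape, the list of variables doubles as the documentation of what can be configured, and `data_dir` names the single path that needs a persistent mount.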
        
               | vermaden wrote:
               | > As a developer, are popular databases and applications
               | pre-packaged as FreeBSD Jails so that I can spin one up
               | on my laptop with a single command?
               | 
               | The closest you can get is BastilleBSD (framework for
               | FreeBSD Jails) and their templates - available here:
               | 
               | https://github.com/BastilleBSD/templates
               | https://bastillebsd.org/templates/
        
         | tptacek wrote:
         | I don't know what people generally believe.
         | 
         | But the attack surface of a Linux kernel is very large, is
         | pretty unpredictable, and can't be coherently masked out with
         | rules (my favorite example is Jann Horn's VM reference count bug,
         | which was a simple concurrency flaw in the core virtual memory
         | system). By comparison, a Linux KVM hypervisor is not just a
         | subset of the kernel by definition, but also a much smaller
         | codebase, a tiny fraction of the whole kernel.
         | 
         | Replacing shared-kernel isolation like seccomp-filtered
         | containers with VMs is, architecturally, simply the replacement
         | of a large trusted computing base with a smaller one. If the
         | overhead is acceptable, it's hard to argue with from a security
         | perspective.
        
         | riobard wrote:
         | That's the approach taken by Google's gVisor (at the cost of
         | I/O and network performance).
        
           | fsociety wrote:
           | gVisor, for better or for worse, does a whole lot of other
           | things than just seccomp filtering, and it shows in
           | performance tests.
        
           | encryptluks2 wrote:
           | gVisor does more than filtering, they basically reimplemented
           | the syscalls in an application kernel. At least with seccomp
           | the performance overhead is minimal.
        
           | tptacek wrote:
           | No, that's really not at all what gVisor is. gVisor is best
           | thought of as user-mode Linux --- a complete reimplementation
           | of most of the OS kernel. It's not a system call filter; it's
           | something much closer to a VM than to seccomp.
           | 
           | gVisor is a very cool codebase. As an illustration of the
           | approach: it includes its own TCP/IP stack; we use it in our
           | command-line dev tool to allow people to SSH to their VMs
           | over WireGuard without having to install WireGuard or obtain
           | privileges to manage WireGuard.
        
         | gorkish wrote:
         | OK; https://github.com/harvester/harvester
         | 
         | Security and performance aren't the only driving forces; there
         | are a lot of technical and operational benefits to the
         | abstraction and standard interfaces that you get when running
         | stacks that might otherwise look like someone took an Xzibit
         | meme too far.
         | 
         | Also remember on a modern system, there are often at least 2
         | additional layers at work abstracting interfaces to the "bare
         | metal" OS already.
        
           | encryptluks2 wrote:
           | I'm not disagreeing that abstraction can be useful, but the
           | overhead of a VM is unnecessary if utilizing the full
           | potential of containers. After all, the Linux Kernel is acting
           | as the hypervisor already, so might as well trust it enough
           | to properly sandbox containers too and use the right
           | functionality to do so. I also think that running a
           | virtualization layer adds quite a bit of complexity, so while
           | it is cool that projects and companies have made it work and
           | integrated it with a container solution, eliminating the VM
           | layer altogether seems more ideal IMO.
        
       | ashishbijlani wrote:
       | > Can we somehow combine the advantages of the docker ecosystem
       | with VMs?
       | 
       | Shameless plug: this is exactly what our goal is with
       | https://kwarantine.xyz We are creating a new hypervisor (from
       | scratch) that can run strongly isolated Docker/LXC containers.
        
         | mikepurvis wrote:
         | Is this what gvisor is? https://github.com/google/gvisor
        
           | ashishbijlani wrote:
           | No, gVisor is from Google. They emulate system calls in user-
           | space and use VMs, which increases runtime performance
           | overhead. We use hardware virtualization to directly run
           | containers -- no I/O emulation, no expensive VM exits, scale
           | as needed. Initial comparison with FC/GVisor/Xen here:
           | https://github.com/ashishbijlani/kwarantine
        
             | monocasa wrote:
             | I'm not sure gvisor requires vm exits. Their first backend
             | used ptrace very similarly to how user mode Linux worked.
             | 
             | Minor quip though since ptrace might even be slower than vm
             | exits; your core point stands.
        
               | rkeene2 wrote:
               | User Mode Linux is still around and works well. I use it
               | when I need a "fakeroot" without any special privileges
               | on the host.
               | 
               | https://rkeene.org/viewer/tmp/fakeroot.sh.htm
        
             | tptacek wrote:
             | It sounds like you just said "yes, but what we're building
             | is faster". The userland Linux emulation is a security
             | benefit, not a liability.
        
         | amscanne wrote:
         | The "fork" sounds like you blue pill the OS for each container?
         | I'm assuming the concept is like Cappsule [1] or Bromium [2]?
         | 
         | [1] https://cappsule.github.io/
         | [2] https://en.wikipedia.org/wiki/Bromium#/media/File:Bromium-en...
        
           | ashishbijlani wrote:
           | fork here is COW on the host kernel (i.e., copying EPT
           | entries). We will post detailed technical documentation soon.
        
       | eatonphil wrote:
       | There are a few existing projects out there like this (running
       | Docker images as virtual machines, specifically) if folks are
       | interested. Slim [0] is the one I can remember off the top of my
       | head. I think there are a couple more.
       | 
       | Still, neat to have the walkthrough here in this post.
       | 
       | [0] https://github.com/ottomatica/slim
        
       | thekevjames wrote:
       | I had fun exploring Docker->VM conversion a while back [1],
       | though the larger goal in my case was to be able to make the
       | build path to custom GCP VM Images a bit simpler. Exciting to see
       | other cases where folks are finding this sort of flow useful!
       | 
       | [1] https://thekev.in/blog/2019-08-05-dockerfile-bootable-vm/ind...
        
       | rwmj wrote:
       | https://katacontainers.io/ ?
        
         | bonzini wrote:
         | Yes, indeed. However it's nice to see directly the mechanisms
         | that let Kata do its magic.
        
       | gravypod wrote:
       | Something I'd be very interested in: building a PXE image from
       | something declarative like Dockerfiles.
        
         | justincormack wrote:
         | Try LinuxKit https://github.com/linuxkit/linuxkit
        
         | laurencerowe wrote:
         | Google Container Optimized OS is basically this I think. It's
         | what's used when you start a GCE instance with a docker image.
         | 
         | https://cloud.google.com/container-optimized-os/
        
       | OldGoodNewBad wrote:
       | I think a lot of folks are going out of their way to
       | misunderstand what happened. Yes there are other similar projects
       | and containers. No, none come from a long established _COMMUNITY
       | RUN PROJECT_. This is something akin to the difference between
       | VirtualBox and OpenBSD's vmd. One's a product with a "free" tier,
       | the other is a community project.
        
       | tptacek wrote:
       | As I understand the landscape here, the big enabling win of
       | microvms is faster boot time; there's a cool qemu-lite slide deck
       | that goes into detail about how they cut down boot time:
       | 
       | https://www.linux-kvm.org/images/d/d2/03x05B-Chao_Peng-Light...
       | 
       | The big win was slashing away the BIOS stuff.
       | 
       | We use AWS's Firecracker to turn our customers' Docker containers
       | into Firecracker microvms (Firecracker is Amazon's Rust VMM, the
       | engine for Fargate and Lambda). Anecdotally: in my dev
       | environment, the difference between Firecracker boot times and
       | native Docker container startup is imperceptible; the logging we
       | do swamps the VM boot stuff. It's _very_ fast.
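A rough sketch of what booting a kernel directly on QEMU's `microvm` machine type can look like. That machine type skips most legacy PC firmware and device setup, which is where the boot-time savings mentioned above come from. The flags follow the example in QEMU's own microvm documentation; the file names `vmlinuz` and `rootfs.img` are hypothetical, and this is not necessarily the invocation used in the linked article or by Firecracker. The function only builds the argument list; it does not start a VM.

```python
import shlex


def microvm_command(kernel: str, rootfs: str, memory: str = "512m", cpus: int = 2) -> list:
    """Build a qemu-system-x86_64 argv for a direct-kernel-boot microVM."""
    return [
        "qemu-system-x86_64",
        "-M", "microvm",                    # minimal machine type
        "-enable-kvm", "-cpu", "host",      # hardware virtualization via KVM
        "-m", memory, "-smp", str(cpus),
        "-kernel", kernel,                  # boot the kernel image directly
        "-append", "console=ttyS0 root=/dev/vda",
        "-nodefaults", "-no-user-config", "-nographic",
        "-serial", "stdio",
        # Root filesystem attached as a virtio-mmio block device.
        "-drive", f"id=root,file={rootfs},format=raw,if=none",
        "-device", "virtio-blk-device,drive=root",
    ]


if __name__ == "__main__":
    print(shlex.join(microvm_command("vmlinuz", "rootfs.img")))
```

The root filesystem image would typically be produced from a container's exported filesystem; that step is outside this sketch.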
        
       ___________________________________________________________________
       (page generated 2021-06-16 23:00 UTC)