[HN Gopher] Kata Containers: Virtual Machines that feel and perf...
       ___________________________________________________________________
        
       Kata Containers: Virtual Machines that feel and perform like
       containers
        
       Author : flanked-evergl
       Score  : 112 points
       Date   : 2023-07-17 12:55 UTC (10 hours ago)
        
 (HTM) web link (katacontainers.io)
 (TXT) w3m dump (katacontainers.io)
        
       | danjc wrote:
       | Serious question - at this point why not just use actual VM's?
        
         | rcoveson wrote:
         | That's what this is. "Containers" implemented as actual VMs.
        
         | spicyusername wrote:
         | The API for containers is widely used.
        
       | revskill wrote:
       | Should i run `docker run helloworld` or use katacontainer VM to
       | output Hello world ?
        
         | flanked-evergl wrote:
         | If you install kata with https://github.com/kata-
         | containers/kata-containers/blob/main...
         | 
         | Then you just use it with:
         | 
         | docker run --runtime io.containerd.kata.v2 --rm -it hello-world
        
       | symlinkk wrote:
       | So why would you use this over regular containers? Is this for
       | people who think containers are insecure?
        
       | jitl wrote:
       | Kata uses "Linux kernel Direct Access filesystem (DAX)" to
       | directly share access of the host filesystem to the guest kernel.
       | I thought this was pretty interesting, but it sounds like a
       | possible spot to start a jailbreak. I'm guessing these kinds of
       | optimizations along with using super simple virtualized devices
       | is what gives Kata its almost-cgroups-like performance.
       | 
       | > Mapping files using DAX provides a number of benefits over more
       | traditional VM file and device mapping mechanisms:
       | 
       | > Mapping as a direct access device allows the guest to directly
       | access the host memory pages (such as via Execute In Place
       | (XIP)), bypassing the guest kernel's page cache. This zero copy
       | provides both time and space optimizations.
       | 
       | > Mapping as a direct access device inside the VM allows pages
       | from the host to be demand loaded using page faults, rather than
       | having to make requests via a virtualized device (causing
       | expensive VM exits/hypercalls), thus providing a speed
       | optimization.
       | 
       | > Utilizing mmap(2)'s MAP_SHARED shared memory option on the host
       | allows the host to efficiently share pages.
       | 
       | From https://github.com/kata-containers/kata-
       | containers/tree/main...
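       | 
       | A minimal sketch of the MAP_SHARED point in the last quote, in
       | Python on a Unix host. This is not Kata's code: the temporary
       | file and the two mappings are assumptions that stand in for a
       | host file shared with a guest.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def shared_mapping_demo() -> bytes:
    """Map one file twice with MAP_SHARED; both views share the same pages."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, PAGE)  # back the mappings with one page of file data
        first = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED)
        second = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED)
        try:
            first[0:5] = b"hello"      # write through the first mapping
            return bytes(second[0:5])  # read it back through the second
        finally:
            first.close()
            second.close()
    finally:
        os.close(fd)
        os.unlink(path)


result = shared_mapping_demo()
```

       | The second mapping sees the write without any copy, which is
       | the property the quoted text relies on for sharing pages.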
        
         | insanitybit wrote:
         | Yeah, it's worth understanding the attack surface of DAX - if
         | someone has information I'd be very interested. That said, you
         | could mitigate it in other ways depending on your use case.
         | 
         | Having gone through an evaluation of Firecracker's security my
         | main conclusion was that sandboxing the processes in the guest
         | is the highest 'bang for your buck' way to reduce escapes.
        
         | hinkley wrote:
         | Would it be simple enough to stick a union file system over the
         | top of the host file system?
        
           | jitl wrote:
           | Do you mean inside the guest? I'm sure they do stuff like
           | that; they operate the normal sort of cgroups + fs mappings
           | inside the guest VM to create inner containers. Eg all the
           | containers in a Kubernetes pod run inside the same VM, but
           | different containers within the VM.
        
           | rwmj wrote:
           | virtiofs supports DAX.
        
       | zackmorris wrote:
       | I wish that someone would put up a honeypot website running
       | various containers holding a root user, where the only goal is to
       | break out of the container.
       | 
       | I suspect that nearly all containerization software is insecure.
       | Especially with timing attacks like side-channel attacks (Row
       | Hammer, etc):
       | 
       | https://en.wikipedia.org/wiki/Timing_attack
       | 
       | https://en.wikipedia.org/wiki/Side-channel_attack
       | 
       | In the end, the only way to "prove" container security is to be
       | able to point to the fact that nobody has broken out of it yet.
       | It's ..remarkable that our entire cloud infrastructure runs on
       | containers that have never been audited by brute-force in this
       | manner.
        
         | wmf wrote:
         | That honeypot was called VPS hosting and over ten years it
         | didn't get hacked much if at all.
        
         | bozhark wrote:
         | Quantum hack?
         | 
         | Nah, it's 2023, brute it
        
         | zsims wrote:
         | You mean like http://www.lambdashell.com/ ?
        
         | Quarrel wrote:
         | > I suspect that nearly all containerization software is
         | insecure.
         | 
         | This seems, well, naive?
         | 
         | You think there aren't a lot of people that have tried to break
         | out of cloud containers?
         | 
         | Both EC2 and gcloud have had issues over the years with
         | container breakout and leaks.
         | 
         | Complex (ie literally layers of operating systems) software has
         | bugs. Yes. But so does the non-containerised base-case.
         | 
         | We have the best style of honeypots you could ever ask for
         | already running- payment infrastructure on the internet. Go get
         | 'em.
        
       | dsp_person wrote:
       | Vagrant also kind-of fits the description, but I've been looking
       | for an alternative due to its libvirt support being iffy (the
       | plugin breaks often or has issues compared to the virtualbox
       | provider, also not many official image providers).
       | 
       | Here's one option I came across recently that's pretty nice and
       | simple, using virsh and cloud images, and making it easy to spin
       | up a VM and ssh into it. [1]
       | 
       | One use case I'm looking for is to provision system images
       | including ZFS pools/dataset, and not sharing the host kernel
       | module.
       | 
       | [1] https://earlruby.org/2023/02/quickly-create-guest-vms-
       | using-...
        
         | alexeldeib wrote:
         | I love the vagrant UX, I'd be thrilled to see a refresh for
         | 2023. Box management is odd. Weaveworks ignite/footloose are
         | also good but don't quite scratch my itch. I think vagrant UX
         | with ignite style kernel/rootfs as oci image is a nice flow.
        
       | solarkraft wrote:
       | I'm very interested in the concept, but found Kata to be hard to
       | get going with.
       | 
       | Last time I looked (a few months ago), the documentation was
       | pretty sparse or outdated. A lot of documentation I found stated
       | something like "we broke that in the new version, you can't
       | actually do that right now", like using it with Docker, which I
       | would much prefer over setting up Kubernetes.
       | 
       | I still think it's an absolutely great thing, but the
       | onboarding could have been a lot less rocky.
       | 
       | FWIW, I eventually kind of DIYed it with QEMU MicroVM and
       | virtiofs - never did anything with it though.
        
         | flanked-evergl wrote:
         | > Last time I looked (a few months ago), the documentation was
         | pretty sparse or outdated.
         | 
         | It still is, though it works somewhat seamlessly when
         | installing with https://github.com/kata-containers/kata-
         | containers/blob/main...
         | 
         | Though only one of the hypervisors works well.
        
         | solarkraft wrote:
         | On the onboarding point: Almost all projects would do
         | themselves a big service by putting work into the onboarding
         | experience because it's a key factor in how many people will
         | use and help develop the project.
        
       | grahamgooch wrote:
       | From Baidu
       | https://katacontainers.io/collateral/ApplicationOfKataContai...
        
       | blastonico wrote:
       | [flagged]
        
       | bionsystem wrote:
       | I don't understand how they can be as fast as regular containers
       | if they run an entire kernel on top of a hypervisor?
        
         | mugsie wrote:
         | there is pretty low overhead if you are opinionated - this is
         | very similar to firecracker (AWS) tooling, so cut down
         | hypervisor with ~ 0 devices, and a cut down guest OS means
         | pretty quick boot times
        
         | kobalsky wrote:
         | there's probably a big asterisk there. the correct term is
         | probably "fast enough"
         | 
         | virtualization adds very little overhead, a Windows VM running
         | with a dedicated GPU can get 95% of the host's score on 3dmark.
         | 
         | the biggest issue on these cases is IO which can be handled in
         | a few ways.
        
         | rwmj wrote:
         | I don't have numbers either, but it's a combination of extreme
         | focus on the boot path and virtio drivers, and traditional
         | containers now being quite heavyweight to start (especially
         | when run via Kubernetes).
         | 
         | The big problem with Katacontainers is not whether or not they
         | are slightly faster or slower than containers, but the fixed
         | memory allocation which means you must first know and then
         | allocate the maximum amount of memory they might ever need up
         | front. This can practically limit the number of Katacontainers
         | you can run to something much smaller than is possible with
         | ordinary containers, since RAM is the constrained resource on
         | most servers.
         | 
         | Nevertheless, with confidential computing coming along, it's
         | likely that at some point in the future many containers will
         | really be VMs, since current CPUs implement confidential
         | computing on top of existing VM primitives (and that's
         | basically necessary due to the way the guest RAM is encrypted).
         | It's likely that any workload that touches PII, finance,
         | health, etc will be required to use confidential computing.
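         | 
         | A back-of-the-envelope sketch of the density argument, in
         | Python. All three numbers are hypothetical, and the model
         | ignores guest kernel overhead and any overcommit.

```python
def max_instances(host_ram_gib: float, per_instance_gib: float) -> int:
    """How many instances fit if each one reserves per_instance_gib."""
    return int(host_ram_gib // per_instance_gib)


# Hypothetical figures, chosen only to illustrate the ratio.
HOST_RAM_GIB = 256   # RAM available on the node
PEAK_GIB = 4         # worst-case memory a workload might ever need
TYPICAL_GIB = 1      # what the workload uses most of the time

# A VM with a fixed allocation has to be sized for the peak.
vm_count = max_instances(HOST_RAM_GIB, PEAK_GIB)

# Ordinary containers only consume what they are using right now.
container_count = max_instances(HOST_RAM_GIB, TYPICAL_GIB)

print(f"VMs sized for peak: {vm_count}, containers at typical use: {container_count}")
```

         | With these made-up numbers the node holds a quarter as many
         | fixed-size VMs as containers; the gap tracks the ratio of peak
         | to typical usage.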
        
           | mochomocha wrote:
           | Slightly off topic, but regarding the larger memory footprint
           | of Kata containers, what is your opinion on KSM effectiveness
           | in general for VMs?
        
             | rwmj wrote:
             | We had a lot of reports of ksm/ksmtuned consuming a lot of
             | CPU and not making a lot of difference. I think it works
             | well for certain workloads, and can be quite pessimal for
             | others. There are also security concerns because you can
             | leak information about (eg) what glibc is being used by
             | another tenant using timing attacks. So you'd probably want
             | to turn it off if multiple tenants can be using a single
             | node.
        
           | nimish wrote:
           | Kata containers support memory ballooning like most modern
           | VMs: https://en.wikipedia.org/wiki/Memory_ballooning so a
           | fixed allocation isn't needed, reducing over provisioning
           | 
           | https://github.com/kata-containers/kata-
           | containers/blob/d50f... uses virtio-mem
        
             | rwmj wrote:
             | This isn't a substitute (nor is virtio-mem, the modern
             | equivalent). The problem is the application running in
             | userspace inside the guest cannot request more memory when,
             | for example, it does a mmap or sbrk.
        
               | [deleted]
        
               | nimish wrote:
               | Interesting, that's a very annoying constraint then
        
               | linlinjin wrote:
               | i couldn't understand the comment that "the application
               | running in user space cannot request more memory" - can
               | someone explain what's the point of memory ballooning
               | anywhere if an application cannot signal when the system
               | should actually provision physical memory from the
               | 'balloon'
        
               | spockz wrote:
               | Which application languages and frameworks support this
               | kind of dynamic memory allocation? For predictability in
               | performance and throughput reasons we benchmark our Java
               | applications on specific cpu and memory constraints and
               | specific heap and memory settings. How would an app in a
               | container suddenly give back ram? A garbage collected
               | application may be able to do that by collecting garbage.
               | Possibly. But others?
        
               | rwmj wrote:
               | No idea about Java, but any C program will request memory
               | using mmap, and may give it back using munmap. This
               | doesn't work when the program is running inside a VM, but
               | does work for containers (which are basically just
               | regular processes).
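               | 
               | The same mmap/munmap cycle sketched in Python rather
               | than C (the 64 MiB size is arbitrary). In a container
               | the kernel that gets the pages back is the host's; in
               | a VM it is the guest's, so as I understand it the host
               | sees no change without some extra mechanism.

```python
import mmap

SIZE = 64 * 1024 * 1024  # 64 MiB, an arbitrary size for illustration


def touch_and_release(size: int = SIZE) -> int:
    """Map anonymous memory, touch every page, then unmap it.

    Returns the number of pages that were touched.
    """
    region = mmap.mmap(-1, size)  # anonymous mapping, i.e. an mmap(2) call
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1        # first write faults the page into RAM
    touched = size // mmap.PAGESIZE
    region.close()                # munmap(2): pages return to this kernel
    return touched


pages = touch_and_release()
```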
        
               | fragmede wrote:
               | With VM ballooning, VMs are also able to claim and release
               | memory to the hypervisor/host OS.
        
           | cogman10 wrote:
           | > but the fixed memory allocation which means you must first
           | know and then allocate the maximum amount of memory they
           | might ever need up front.
           | 
           | Yup, that's always been the big reason to use containers for
           | me. Startup time and runtime performance are nice benefits,
           | but the memory usage is the giant win. Freeing memory in
           | response to the apps need and also not needing extra memory
           | for running the various OS parts and pieces.
           | 
           | The down side is, of course, security. But that was always
           | the case with containers.
        
             | tracker1 wrote:
             | I think in the longer run, WASM might displace a lot of
             | both in practical terms.
        
               | ehutch79 wrote:
               | Why would wasm replace containers? If you're going to run
               | a binary why not just compile it for the local system?
               | 
               | We've always had 'compile once, run anywhere' but there's
               | always been caveats and gotchas.
        
               | skybrian wrote:
               | Something still compiles a WASM binary for the local
               | system. Possibly, being able to optimize the WASM without
               | recompiling it from source might be a win? Not needing
               | separate binaries for ARM and x86 is nice, so it should
               | run on a Mac more easily. Also, it runs on an edge server
               | or in a browser, even on a phone, if you care about that.
               | 
               | I don't think it will replace Docker files since they let
               | you package up such a wide variety of existing server
               | software and WASM is more limited. But if your software
               | does compile to WASM then maybe you don't care about
               | that.
               | 
               | I think of WASM more like a plugin format, but I expect
               | there will be a lot of engineering effort put into
               | optimizing it, like happened with V8 for JavaScript. Not
               | all web standards win, but betting against one that's
               | well-established and has a lot of support seems like a
               | mistake.
        
               | bombela wrote:
               | wasm targets the wasm runtime Virtual Machine (ie: a
               | JavaScript VM). Offering fine grained isolation compared
               | to virtualizing the whole operating system.
               | 
               | edit: don't shoot the messenger. I was merely
               | highlighting the main difference between native and
               | webasm in the context of the discussion.
        
               | meepmorp wrote:
               | You could say the same thing for anything that targets
               | JVM or CLR, and they're far more mature than any JS
               | runtime.
        
               | outworlder wrote:
               | Or even Lua, which is trivial to sandbox.
        
               | wongarsu wrote:
               | For serverless (as in AWS-lambda-like) I agree. In that
               | usecase WASM provides a better security barrier than
               | containers, with faster cold-start time (which is really
               | important for the scaling promise of these services).
               | 
               | For the stuff people run on their Kubernetes clusters I
               | have more mixed expectations. Containers are more
               | universal, but I can totally see a microservice
               | architecture running as a lot of WASM runtimes with a
               | handful of containers.
        
           | hodgesrm wrote:
           | > The big problem with Katacontainers is not whether or not
           | they are slightly faster or slower than containers, but the
           | fixed memory allocation which means you must first know and
           | then allocate the maximum amount of memory they might ever
           | need up front.
           | 
           | Conversely the problem with containers is that memory
           | allocation _including the OS page cache_ is not guaranteed.
           | That's bad for a lot of applications, especially databases.
           | It seems Docker has some support for shared page cache but
           | it's not in the Kubernetes pod spec as far as I can see. [0]
           | You would probably need some kind of annotations and a
           | specialized controller to make this work.
           | 
           | [0] https://github.com/kubernetes/kubernetes/issues/43916
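           | 
           | To see how much of a container's charged memory is page
           | cache, a small Python sketch that parses cgroup v2
           | memory.stat. The path in the usage comment assumes cgroup
           | v2 mounted at /sys/fs/cgroup inside the container; the
           | helper names are made up.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def page_cache_share(stats: dict) -> float:
    """Fraction of anon + file memory that is file-backed (page cache)."""
    total = stats["anon"] + stats["file"]
    return stats["file"] / total if total else 0.0


# Usage inside a container (path is an assumption, see above):
#   from pathlib import Path
#   stats = parse_memory_stat(Path("/sys/fs/cgroup/memory.stat").read_text())
#   print(f"{page_cache_share(stats):.0%} of charged memory is page cache")
```

           | The file-backed part is what the kernel can reclaim when
           | the cgroup nears its limit, which is why a database cannot
           | count on it.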
        
         | insanitybit wrote:
         | The short version is that kernels support guest/host
         | relationships natively so that guests can pass operations
         | directly to the host without having to go through an additional
         | system call. Everywhere you do this is attack surface where an
         | attacker in the guest can communicate with privileged
         | facilities, so you want to minimize this where you can.
         | 
         | There's usually overhead in the places where the communication
         | requires an additional hop. If you want your host filesystem
         | isolated you're going to need a translation layer and it will
         | be slower. If you're willing to open up your host OS's
         | filesystem, you can basically get ~0 overhead.
        
         | Alifatisk wrote:
         | Yeah I'd like to see some numbers on that, like startup time.
        
           | theaiquestion wrote:
           | Depending on your use case there's potentially negligible
           | startup time. On the scale of single digit seconds to less
           | than half a second depending on how much work you put into
           | optimizing it. For some applications this will be too slow
           | (mainly the type where you boot a container per request,
           | although flyio seems to make it work), I think for a _lot_ of
           | applications this wouldn't be noticed.
           | 
           | Kata gives you a few different options for what/how you'd
           | like to boot including firecracker.
           | 
           | This isn't exclusive to firecracker but if you stay
           | lightweight you can have vm's booting under a half second if
           | you're using slim images.
           | 
           | https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-
           | in-l...
           | 
           | I honestly think for a lot of people, vm's with the
           | convenience/orchestration tools of containers make more sense
           | for a lot of general use cases simply because of the security
           | benefits. The convenience still needs some work though.
        
             | insanitybit wrote:
             | Unless you're dealing with a multi-tenant situation I'm not
             | super convinced that a VM is worth the effort. It's not the
             | perf, it's the need to make your kernel, root file system,
             | and other infra needed to make it all work.
             | 
             | Compare that to a docker container where there's basically
             | 0 additional work that has to be done to be up and running.
             | 
             | For most cases I'd be really tempted to work on hardening
             | the docker container than on setting up a VM. Things like
             | Apparmor and seccomp in particular would likely go a very
             | long way.
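             | 
             | A sketch of what that hardening can look like, in
             | Python: it writes an allowlist seccomp profile in
             | Docker's JSON format and builds a locked-down `docker
             | run` line. The syscall list is hypothetical and far too
             | short for a real workload.

```python
import json


def seccomp_allowlist(syscalls) -> dict:
    """Profile that denies every syscall except the ones listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def hardened_docker_argv(image: str, profile_path: str) -> list:
    """`docker run` with capabilities dropped and a custom seccomp profile."""
    return [
        "docker", "run", "--rm",
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={profile_path}",
        "--read-only",
        image,
    ]


# Hypothetical allowlist, for illustration only.
profile = seccomp_allowlist({"read", "write", "exit_group"})
profile_json = json.dumps(profile, indent=2)
```

             | An AppArmor profile would be attached the same way with
             | `--security-opt apparmor=<profile name>`.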
        
               | [deleted]
        
       | airocker wrote:
       | Is it possible to add kata containers in eks or gke?
        
         | otterley wrote:
         | Not in EKS. EC2 doesn't support nested virtualization today.
        
           | rwmj wrote:
           | Peer pods are meant to solve this (one day) by running the
           | "containers" (ie VMs) to the side as regular AWS instances
           | and peering the communications:
           | 
           | https://www.redhat.com/en/blog/red-hat-openshift-
           | sandboxed-c...
        
       | xyst wrote:
       | > Runs in a dedicated kernel, providing isolation of network, I/O
       | and memory and can utilize hardware-enforced isolation with
       | virtualization VT extensions.
       | 
       | So if it's a dedicated kernel, can this fool game anti cheat
       | systems into thinking it's not in a VM? Or still the same
       | problem?
        
         | jitl wrote:
         | If a game anti-cheat system can detect that a regular VM is a
         | VM, then it will also detect Kata VM is a VM. In both cases
         | you're running a game in a VM with a dedicated kernel.
         | 
         | Kata VMs are especially VM-y because they use a lot of VM-only
         | features that wouldn't work with real hardware to enhance
         | performance by sharing work between the guest and the host.
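         | 
         | One of the simplest checks, sketched in Python: on x86 Linux
         | guests the CPU "flags" line in /proc/cpuinfo carries a
         | `hypervisor` flag. I don't know what any particular
         | anti-cheat looks at, so treat this as an illustration only.

```python
def has_hypervisor_flag(cpuinfo_text: str) -> bool:
    """True if any 'flags' line in /proc/cpuinfo text lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


# Usage on a Linux machine:
#   with open("/proc/cpuinfo") as f:
#       print("looks like a VM:", has_hypervisor_flag(f.read()))
```

         | A Kata guest is an ordinary guest in this respect, so a check
         | like this would not tell it apart from any other VM.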
        
           | xyst wrote:
           | Well it's not that it "detects" it but most game AC (Vanguard
           | or BattlEye) install at the ring0 layer (kernel). So on VMs,
           | that doesn't exist (from what I understand). But with this it
           | should be enough to fool it into thinking it's installed in
           | the right space.
        
       ___________________________________________________________________
       (page generated 2023-07-17 23:01 UTC)