[HN Gopher] Kata Containers: Virtual Machines that feel and perf...
___________________________________________________________________
Kata Containers: Virtual Machines that feel and perform like
containers
Author : flanked-evergl
Score : 112 points
Date : 2023-07-17 12:55 UTC (10 hours ago)
(HTM) web link (katacontainers.io)
(TXT) w3m dump (katacontainers.io)
| danjc wrote:
| Serious question - at this point why not just use actual VMs?
| rcoveson wrote:
| That's what this is. "Containers" implemented as actual VMs.
| spicyusername wrote:
| The API for containers is widely used.
| revskill wrote:
| Should I run `docker run helloworld` or use a katacontainer VM to
| output Hello world?
| flanked-evergl wrote:
| If you install kata with https://github.com/kata-
| containers/kata-containers/blob/main...
|
| Then you just use it with:
|
| docker run --runtime io.containerd.kata.v2 --rm -it hello-world
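| A minimal sketch of the same invocation assembled from a script,
| assuming Docker and the Kata runtime are already installed. The
| runtime name and flags are the ones quoted above; the helper name
| `kata_run_argv` is hypothetical.

```python
import shlex

# Runtime name taken from the command quoted in the comment above.
KATA_RUNTIME = "io.containerd.kata.v2"


def kata_run_argv(image, runtime=KATA_RUNTIME, interactive=False):
    """Build the argv list for `docker run` with an explicit runtime.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    argv = ["docker", "run", "--runtime", runtime, "--rm"]
    if interactive:
        argv.append("-it")
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Print the command so it can be inspected before being executed.
    print(shlex.join(kata_run_argv("hello-world", interactive=True)))
```

| Passing the list to subprocess.run() would execute it; building the
| list first avoids shell quoting problems.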
| symlinkk wrote:
| So why would you use this over regular containers? Is this for
| people who think containers are insecure?
| jitl wrote:
| Kata used "Linux kernel Direct Access filesystem (DAX)" to
| directly share access of the host filesystem to the guest kernel.
| I thought this was pretty interesting, but it sounds like a
| possible spot to start a jailbreak. I'm guessing these kinds of
| optimizations along with using super simple virtualized devices
| is what gives Kata its almost-cgroups-like performance.
|
| > Mapping files using DAX provides a number of benefits over more
| traditional VM file and device mapping mechanisms:
|
| > Mapping as a direct access device allows the guest to directly
| access the host memory pages (such as via Execute In Place
| (XIP)), bypassing the guest kernel's page cache. This zero copy
| provides both time and space optimizations.
|
| > Mapping as a direct access device inside the VM allows pages
| from the host to be demand loaded using page faults, rather than
| having to make requests via a virtualized device (causing
| expensive VM exits/hypercalls), thus providing a speed
| optimization.
|
| > Utilizing mmap(2)'s MAP_SHARED shared memory option on the host
| allows the host to efficiently share pages.
|
| From https://github.com/kata-containers/kata-
| containers/tree/main...
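| The MAP_SHARED point in the last quote can be illustrated on a
| single machine. This is only a sketch of the mmap(2) semantics on
| a Unix host, not of Kata's DAX or virtio path: two mappings of one
| file share the same pages, so a write through one is visible
| through the other without a copy. The function name is
| hypothetical.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def shared_mapping_demo():
    """Map one file twice with MAP_SHARED and read back a write made via the other mapping."""
    fd, path = tempfile.mkstemp()
    try:
        # The file must be at least as large as the mapping.
        os.ftruncate(fd, PAGE)
        writer = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ | mmap.PROT_WRITE)
        reader = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ)
        # Write through the first mapping, read through the second.
        writer[:5] = b"hello"
        seen = bytes(reader[:5])
        reader.close()
        writer.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_mapping_demo())
```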
| insanitybit wrote:
| Yeah, it's worth understanding the attack surface of DAX - if
| someone has information I'd be very interested. That said, you
| could mitigate it in other ways depending on your use case.
|
| Having gone through an evaluation of Firecracker's security my
| main conclusion was that sandboxing the processes in the guest
| is the highest 'bang for your buck' way to reduce escapes.
| hinkley wrote:
| Would it be simple enough to stick a union file system over the
| top of the host file system?
| jitl wrote:
| Do you mean inside the guest? I'm sure they do stuff like
| that; they operate the normal sort of cgroups + fs mappings
| inside the guest VM to create inner containers. Eg all the
| containers in a Kubernetes pod run inside the same VM, but as
| different containers within the VM.
| rwmj wrote:
| virtiofs supports DAX.
| zackmorris wrote:
| I wish that someone would put up a honeypot website running
| various containers holding a root user, where the only goal is to
| break out of the container.
|
| I suspect that nearly all containerization software is insecure.
| Especially given side-channel attacks such as timing attacks, and
| hardware fault attacks (Rowhammer, etc.):
|
| https://en.wikipedia.org/wiki/Timing_attack
|
| https://en.wikipedia.org/wiki/Side-channel_attack
|
| In the end, the only way to "prove" container security is to be
| able to point to the fact that nobody has broken out of it yet.
| It's... remarkable that our entire cloud infrastructure runs on
| containers that have never been audited by brute-force in this
| manner.
| wmf wrote:
| That honeypot was called VPS hosting and over ten years it
| didn't get hacked much if at all.
| bozhark wrote:
| Quantum hack?
|
| Nah, it's 2023, brute it
| zsims wrote:
| You mean like http://www.lambdashell.com/ ?
| Quarrel wrote:
| > I suspect that nearly all containerization software is
| insecure.
|
| This seems, well, naive?
|
| You think there aren't a lot of people that have tried to break
| out of cloud containers?
|
| Both EC2 and gcloud have had issues over the years with
| container breakout and leaks.
|
| Complex (ie literally layers of operating systems) software has
| bugs. Yes. But so does the non-containerised base-case.
|
| We have the best style of honeypots you could ever ask for
| already running- payment infrastructure on the internet. Go get
| 'em.
| dsp_person wrote:
| Vagrant also kind-of fits the description, but I've been looking
| for an alternative due to its libvirt support being iffy (the
| plugin breaks often or has issues compared to the virtualbox
| provider, also not many official image providers).
|
| Here's one option I came across recently that's pretty nice and
| simple, using virsh and cloud images, and making it easy to spin
| up a VM and ssh into it. [1]
|
| One use case I'm looking for is to provision system images
| including ZFS pools/dataset, and not sharing the host kernel
| module.
|
| [1] https://earlruby.org/2023/02/quickly-create-guest-vms-
| using-...
| alexeldeib wrote:
| I love the vagrant UX, I'd be thrilled to see a refresh for
| 2023. Box management is odd. Weaveworks ignite/footloose are
| also good but don't quite scratch my itch. I think vagrant UX
| with ignite style kernel/rootfs as oci image is a nice flow.
| solarkraft wrote:
| I'm very interested in the concept, but found Kata to be hard to
| get going with.
|
| Last time I looked (a few months ago), the documentation was
| pretty sparse or outdated. A lot of documentation I found stated
| something like "we broke that in the new version, you can't
| actually do that right now", like using it with Docker, which I
| would much prefer over setting up Kubernetes.
|
| I still think it's an absolutely great thing, but the
| onboarding could have been a lot less rocky.
|
| FWIW, I eventually kind of DIYed it with QEMU MicroVM and
| virtiofs - never did anything with it though.
| flanked-evergl wrote:
| > Last time I looked (a few months ago), the documentation was
| pretty sparse or outdated.
|
| It still is, though it works somewhat seamlessly when
| installing with https://github.com/kata-containers/kata-
| containers/blob/main...
|
| Though only one of the hypervisors works well.
| solarkraft wrote:
| On the onboarding point: Almost all projects would do
| themselves a big service by putting work into the onboarding
| experience because it's a key factor in how many people will
| use and help develop the project.
| grahamgooch wrote:
| From Baidu
| https://katacontainers.io/collateral/ApplicationOfKataContai...
| blastonico wrote:
| [flagged]
| bionsystem wrote:
| I don't understand how they can be as fast as regular containers
| if they run an entire kernel on top of a hypervisor?
| mugsie wrote:
| there is pretty low overhead if you are opinionated - this is
| very similar to firecracker (AWS) tooling, so cut down
| hypervisor with ~ 0 devices, and a cut down guest OS means
| pretty quick boot times
| kobalsky wrote:
| there's probably a big asterisk there. the correct term is
| probably "fast enough"
|
| virtualization adds very little overhead, a Windows VM running with a
| dedicated GPU can get 95% of the host's score on 3dmark.
|
| the biggest issue on these cases is IO which can be handled in
| a few ways.
| rwmj wrote:
| I don't have numbers either, but it's a combination of extreme
| focus on the boot path and virtio drivers, and traditional
| containers now being quite heavyweight to start (especially
| when run via Kubernetes).
|
| The big problem with Katacontainers is not whether or not they
| are slightly faster or slower than containers, but the fixed
| memory allocation which means you must first know and then
| allocate the maximum amount of memory they might ever need up
| front. This can practically limit the number of Katacontainers
| you can run to something much smaller than is possible with
| ordinary containers, since RAM is the constrained resource on
| most servers.
|
| Nevertheless, with confidential computing coming along, it's
| likely that at some point in the future many containers will
| really be VMs, since current CPUs implement confidential
| computing on top of existing VM primitives (and that's
| basically necessary due to the way the guest RAM is encrypted).
| It's likely that any workload that touches PII, finance,
| health, etc will be required to use confidential computing.
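| A back-of-the-envelope sketch of the density argument above. All
| numbers are hypothetical: a host with 256 GiB of RAM, 16 GiB
| reserved for the host itself, VMs sized for an 8 GiB peak, and
| containers that typically use 2 GiB.

```python
def max_instances(host_ram_gib, reserved_gib, per_instance_gib):
    """Return how many instances fit when each one is given per_instance_gib of RAM."""
    usable = host_ram_gib - reserved_gib
    if per_instance_gib <= 0 or usable <= 0:
        return 0
    return int(usable // per_instance_gib)


if __name__ == "__main__":
    # VMs with a fixed allocation must be sized for their peak usage.
    vms = max_instances(256, 16, 8)
    # Containers can be packed by typical usage (at the risk of overcommit).
    containers = max_instances(256, 16, 2)
    print(vms, containers)  # prints: 30 120
```

| The gap shrinks as typical usage approaches peak usage, so the
| penalty depends heavily on how bursty the workload is.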
| mochomocha wrote:
| Slightly off topic, but regarding the larger memory footprint
| of Kata containers, what is your opinion on KSM effectiveness
| in general for VMs?
| rwmj wrote:
| We had a lot of reports of ksm/ksmtuned consuming a lot of
| CPU and not making a lot of difference. I think it works
| well for certain workloads, and can be quite pessimal for
| others. There are also security concerns because you can
| leak information about (eg) what glibc is being used by
| another tenant using timing attacks. So you'd probably want
| to turn it off if multiple tenants can be using a single
| node.
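| For anyone wanting to check what KSM is doing on a Linux host,
| the kernel exposes counters under /sys/kernel/mm/ksm. A minimal
| sketch that reads a few of them; the helper name is hypothetical,
| and the directory is a parameter so the function can be pointed at
| a test fixture.

```python
from pathlib import Path

# Standard sysfs location for KSM controls and counters on Linux.
KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_ksm_stats(ksm_dir=KSM_DIR,
                   names=("run", "pages_shared", "pages_sharing")):
    """Read integer KSM counters from sysfs, skipping files that are absent."""
    stats = {}
    for name in names:
        entry = Path(ksm_dir) / name
        if entry.is_file():
            stats[name] = int(entry.read_text().strip())
    return stats


if __name__ == "__main__":
    # An empty dict means KSM is not available on this system.
    print(read_ksm_stats())
```

| A pages_sharing value that stays low while ksmd burns CPU would
| match the reports described above.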
| nimish wrote:
| Kata containers support memory ballooning like most modern
| VMs: https://en.wikipedia.org/wiki/Memory_ballooning so a
| fixed allocation isn't needed, reducing over provisioning
|
| https://github.com/kata-containers/kata-
| containers/blob/d50f... uses virtio-mem
| rwmj wrote:
| This isn't a substitute (nor is virtio-mem, the modern
| equivalent). The problem is the application running in
| userspace inside the guest cannot request more memory when,
| for example, it does a mmap or sbrk.
| nimish wrote:
| Interesting, that's a very annoying constraint then
| linlinjin wrote:
| I couldn't understand the comment that "the application
| running in user space cannot request more memory" - can
| someone explain what's the point of memory ballooning
| anywhere if an application cannot signal when the system
| should actually provision physical memory from the
| 'balloon'?
| spockz wrote:
| Which application languages and frameworks support this
| kind of dynamic memory allocation? For predictability in
| performance and throughput reasons we benchmark our Java
| applications on specific cpu and memory constraints and
| specific heap and memory settings. How would an app in a
| container suddenly give back ram? A garbage collected
| application may be able to do that by collecting garbage.
| Possibly. But others?
| rwmj wrote:
| No idea about Java, but any C program will request memory
| using mmap, and may give it back using munmap. This
| doesn't work when the program is running inside a VM, but
| does work for containers (which are basically just
| regular processes).
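| The mmap/munmap cycle described above can be sketched without
| writing C, since Python's mmap module wraps the same calls on
| Unix. This only shows a process requesting an anonymous region
| and handing it back; it says nothing about how a hypervisor
| reacts. The function name and default size are hypothetical.

```python
import mmap


def allocate_and_release(size=16 * mmap.PAGESIZE):
    """Request an anonymous memory region, touch it, then return it to the OS."""
    # fileno -1 asks for an anonymous mapping (no backing file).
    region = mmap.mmap(-1, size)
    # Touching the pages makes the kernel actually back them with memory.
    region[:4] = b"data"
    first = bytes(region[:4])
    # close() unmaps the region, giving the memory back to the kernel.
    region.close()
    return first, region.closed


if __name__ == "__main__":
    print(allocate_and_release())
```

| In a container the host kernel sees this release directly; inside
| a VM only the guest kernel does.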
| fragmede wrote:
| With VM ballooning, VMs are also able to claim and release
| memory to the hypervisor/host OS.
| cogman10 wrote:
| > but the fixed memory allocation which means you must first
| know and then allocate the maximum amount of memory they
| might ever need up front.
|
| Yup, that's always been the big reason to use containers for
| me. Startup time and runtime performance are nice benefits,
| but the memory usage is the giant win. Freeing memory in
| response to the apps need and also not needing extra memory
| for running the various OS parts and pieces.
|
| The down side is, of course, security. But that was always
| the case with containers.
| tracker1 wrote:
| I think in the longer run, WASM might displace a lot of
| both in practical terms.
| ehutch79 wrote:
| Why would wasm replace containers? If you're going to run
| a binary why not just compile it for the local system?
|
| We've always had 'compile once, run anywhere' but there's
| always been caveats and gotchas.
| skybrian wrote:
| Something still compiles a WASM binary for the local
| system. Possibly, being able to optimize the WASM without
| recompiling it from source might be a win? Not needing
| separate binaries for ARM and x86 is nice, so it should
| run on a Mac more easily. Also, it runs on an edge server
| or in a browser, even on a phone, if you care about that.
|
| I don't think it will replace Docker files since they let
| you package up such a wide variety of existing server
| software and WASM is more limited. But if your software
| does compile to WASM then maybe you don't care about
| that.
|
| I think of WASM more like a plugin format, but I expect
| there will be a lot of engineering effort put into
| optimizing it, like happened with V8 for JavaScript. Not
| all web standards win, but betting against one that's
| well-established and has a lot of support seems like a
| mistake.
| bombela wrote:
| wasm targets the wasm runtime Virtual Machine (ie: a
| JavaScript VM). Offering fine grained isolation compared
| to virtualizing the whole operating system.
|
| edit: don't shoot the messenger. I was merely
| highlighting the main difference between native and
| webasm in the context of the discussion.
| meepmorp wrote:
| You could say the same thing for anything that targets
| JVM or CLR, and they're far more mature than any JS
| runtime.
| outworlder wrote:
| Or even Lua, which is trivial to sandbox.
| wongarsu wrote:
| For serverless (as in AWS-lambda-like) I agree. In that
| usecase WASM provides a better security barrier than
| containers, with faster cold-start time (which is really
| important for the scaling promise of these services).
|
| For the stuff people run on their Kubernetes clusters I
| have more mixed expectations. Containers are more
| universal, but I can totally see a microservice
| architecture running as a lot of WASM runtimes with a
| handful of containers.
| hodgesrm wrote:
| > The big problem with Katacontainers is not whether or not
| they are slightly faster or slower than containers, but the
| fixed memory allocation which means you must first know and
| then allocate the maximum amount of memory they might ever
| need up front.
|
| Conversely the problem with containers is that memory
| allocation _including the OS page cache_ is not guaranteed.
| That's bad for a lot of applications, especially databases.
| It seems Docker has some support for shared page cache but
| it's not in the Kubernetes pod spec as far as I can see. [0]
| You would probably need some kind of annotations and a
| specialized controller to make this work.
|
| [0] https://github.com/kubernetes/kubernetes/issues/43916
| insanitybit wrote:
| The short version is that kernels support guest/host
| relationships natively so that guests can pass operations
| directly to the host without having to go through an additional
| system call. Everywhere you do this is attack surface where an
| attacker in the guest can communicate with privileged
| facilities, so you want to minimize this where you can.
|
| There's usually overhead in the places where the communication
| requires an additional hop. If you want your host filesystem
| isolated you're going to need a translation layer and it will
| be slower. If you're willing to open up your host OS's
| filesystem, you can basically get ~0 overhead.
| Alifatisk wrote:
| Yeah I'd like to see some numbers on that, like startup time.
| theaiquestion wrote:
| Depending on your use case there's potentially negligible
| startup time. On the scale of single digit seconds to less
| than half a second depending on how much work you put into
| optimizing it. For some applications this will be too slow
| (mainly the type where you boot a container per request,
| although flyio seems to make it work), I think for a _lot_ of
| applications this wouldn't be noticed.
|
| Kata gives you a few different options for what/how you'd
| like to boot including firecracker.
|
| This isn't exclusive to firecracker but if you stay
| lightweight you can have vm's booting under a half second if
| you're using slim images.
|
| https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-
| in-l...
|
| I honestly think for a lot of people, vm's with the
| convenience/orchestration tools of containers make more sense
| for a lot of general use cases simply because of the security
| benefits. The convenience still needs some work though.
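| For anyone who wants their own startup numbers, a minimal timing
| sketch. It measures wall-clock time for a whole command to run
| and exit, which for a `docker run ... hello-world` style command
| is roughly boot plus teardown. The helper name is hypothetical
| and the example command is a stand-in.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv several times and return the fastest wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    # The minimum is the least affected by unrelated noise on the host.
    return min(samples)


if __name__ == "__main__":
    # Stand-in command; replace with a container or VM launch to compare runtimes.
    print(f"{time_command([sys.executable, '-c', 'pass']):.3f}s")
```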
| insanitybit wrote:
| Unless you're dealing with a multi-tenant situation I'm not
| super convinced that a VM is worth the effort. It's not the
| perf, it's the need to make your kernel, root file system,
| and other infra needed to make it all work.
|
| Compare that to a docker container where there's basically
| 0 additional work that has to be done to be up and running.
|
| For most cases I'd be really tempted to work on hardening
| the docker container than on setting up a VM. Things like
| Apparmor and seccomp in particular would likely go a very
| long way.
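| A sketch of what that hardening might look like on the command
| line, assuming a seccomp profile file and an AppArmor profile
| already exist. The `--cap-drop` and `--security-opt` options are
| standard `docker run` flags; the helper name, profile path and
| profile name are hypothetical.

```python
import shlex


def hardened_run_argv(image, seccomp_profile=None, apparmor_profile=None):
    """Build a `docker run` argv with common hardening flags (does not run it)."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop", "ALL",                      # drop all Linux capabilities
        "--security-opt", "no-new-privileges",    # block privilege escalation via setuid
    ]
    if seccomp_profile:
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    if apparmor_profile:
        argv += ["--security-opt", f"apparmor={apparmor_profile}"]
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Hypothetical profile path and name, for illustration only.
    print(shlex.join(hardened_run_argv("hello-world",
                                       seccomp_profile="./profile.json",
                                       apparmor_profile="my-profile")))
```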
| airocker wrote:
| Is it possible to add kata containers in EKS or GKE?
| otterley wrote:
| Not in EKS. EC2 doesn't support nested virtualization today.
| rwmj wrote:
| Peer pods are meant to solve this (one day) by running the
| "containers" (ie VMs) to the side as regular AWS instances
| and peering the communications:
|
| https://www.redhat.com/en/blog/red-hat-openshift-
| sandboxed-c...
| xyst wrote:
| > Runs in a dedicated kernel, providing isolation of network, I/O
| and memory and can utilize hardware-enforced isolation with
| virtualization VT extensions.
|
| So if it's a dedicated kernel, can this fool game anti cheat
| systems into thinking it's not in a VM? Or still the same
| problem?
| jitl wrote:
| If a game anti-cheat system can detect that a regular VM is a
| VM, then it will also detect Kata VM is a VM. In both cases
| you're running a game in a VM with a dedicated kernel.
|
| Kata VMs are especially VM-y because they use a lot of VM-only
| features that wouldn't work with real hardware to enhance
| performance by sharing work between the guest and the host.
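| One of the simplest VM tells on x86 Linux is the "hypervisor" CPU
| flag that guests normally see in /proc/cpuinfo. A minimal sketch
| of that check; real anti-cheat systems use their own, undisclosed
| methods, so this only shows how little it takes to notice a VM.
| The function name is hypothetical.

```python
from pathlib import Path


def has_hypervisor_flag(cpuinfo_text):
    """Return True if any `flags` line in /proc/cpuinfo text lists `hypervisor`."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.is_file():
        print(has_hypervisor_flag(cpuinfo.read_text()))
    else:
        print("no /proc/cpuinfo on this system")
```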
| xyst wrote:
| Well it's not that it "detects" it but most game AC (Vanguard
| or BattlEye) install at the ring0 layer (kernel). So on VMs,
| that doesn't exist (from what I understand). But with this it
| should be enough to fool it into thinking it's installed in
| the right space.
___________________________________________________________________
(page generated 2023-07-17 23:01 UTC)