[HN Gopher] We replaced Firecracker with QEMU
___________________________________________________________________
We replaced Firecracker with QEMU
Author : hugodutka
Score : 308 points
Date : 2023-07-10 14:15 UTC (8 hours ago)
(HTM) web link (hocus.dev)
(TXT) w3m dump (hocus.dev)
| amarshall wrote:
| No mention of Cloud Hypervisor [1]...perhaps they don't know
| about it? It's based in part on Firecracker and supports free
| page reporting, virtio-blk-pci, PCI passthrough, and (I believe)
| discard in virtio-blk.
|
| [1]: https://www.cloudhypervisor.org/
| hugodutka wrote:
| We do, and we'd love to use it in the future. We've found that
| it's not ready for prime time yet and it's missing some
| features. The biggest problem was that it does not support
| discard operations yet. Here's a short writeup we did about
| VMMs that we considered: https://github.com/hocus-
| dev/hocus/blob/main/rfd/0002-worksp...
| amarshall wrote:
| Thanks for the link to the elaboration! FYI footnotes 3 and 4
| seem to be swapped.
| CompuIves wrote:
| At CodeSandbox we use Firecracker for hosting development
| environments, and I agree with the points. Though I don't think
| that means you should not use Firecracker for running long-lived
| workloads.
|
| We reclaim memory with a memory balloon device; for disk
| trimming we discard (& compress) the disk; and for I/O speed
| we use io_uring (which we only use for scratch disks - the
| project disks are network disks).
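|
| As a sketch of the io_uring part (not our production code -
| the socket and image paths are placeholders, and io_engine
| needs a recent Firecracker plus a 5.10.51+ host kernel):
|
|     import http.client, json, socket
|
|     class UnixHTTPConnection(http.client.HTTPConnection):
|         def __init__(self, socket_path):
|             super().__init__("localhost")
|             self.socket_path = socket_path
|         def connect(self):
|             self.sock = socket.socket(socket.AF_UNIX,
|                                       socket.SOCK_STREAM)
|             self.sock.connect(self.socket_path)
|
|     conn = UnixHTTPConnection("/tmp/firecracker.sock")
|     # pre-boot: attach a scratch drive backed by io_uring
|     conn.request("PUT", "/drives/scratch", body=json.dumps({
|         "drive_id": "scratch",
|         "path_on_host": "/tmp/scratch.ext4",
|         "is_root_device": False,
|         "is_read_only": False,
|         "io_engine": "Async",  # io_uring; "Sync" is the default
|     }), headers={"Content-Type": "application/json"})
|     assert conn.getresponse().status == 204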
|
| It's a tradeoff. It's more work and does require custom
| implementations. For us that made sense, because in return we get
| a lightweight VMM that we can more easily extend with
| functionality like memory snapshotting and live VM cloning
| [1][2].
|
| [1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-
| in-2-s...
|
| [2]: https://codesandbox.io/blog/cloning-microvms-using-
| userfault...
| wiradikusuma wrote:
| Fly.io uses Firecracker, and they host long-running
| processes. I wonder what their opinion of it is.
| tptacek wrote:
| We like Firecracker. People should use whatever makes sense for
| them.
| gorbak25 wrote:
| I think their use case makes a lot of sense, as their
| workloads consume a predefined amount of RAM. As a customer
| you rent a VM with a specified amount of memory, so fly.io
| does not care about reclaiming it from a running VM.
| no_wizard wrote:
| Depends on whether they're using smart memory allocation to
| keep costs lower, i.e., if they can tell from usage
| patterns that certain workloads only need N amount of
| memory at time Y, they can effectively borrow memory from
| one VM for use in another that has the opposite statistical
| likelihood of needing that memory.
|
| This is why paying for dedicated memory is often more
| expensive than its counterpart: that dedicated memory is
| not considered part of the pool.
| adql wrote:
| I think it would be easier to just enable KSM (KVM can use
| it to deduplicate memory and save some RAM on duplicate
| blocks like people loading same libraries into memory)
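|
| Enabling it is just a sysfs knob. A sketch (needs root; KSM
| only scans memory an app marked MADV_MERGEABLE, which QEMU
| does for guest RAM unless you set -machine mem-merge=off):
|
|     from pathlib import Path
|
|     ksm = Path("/sys/kernel/mm/ksm")
|     # 1 = start scanning, 0 = stop, 2 = unmerge everything
|     (ksm / "run").write_text("1")
|     # counters to watch while the VMs run:
|     print("pages_shared: ",
|           (ksm / "pages_shared").read_text().strip())
|     print("pages_sharing:",
|           (ksm / "pages_sharing").read_text().strip())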
| nerpderp82 wrote:
| Shit like this is how you burn customers: when they find
| out you aren't changing the sheets, they change hotels.
|
| Just because you can doesn't mean you should.
| klabb3 wrote:
| Almost everything is oversubscribed in our world.
|
| Malloc will happily "return" the 15 TiB you asked for.
|
| If 10000 people called 911 at the same time, only a tiny
| fraction would get through (and even fewer would get
| help).
|
| Evacuating a large city by road would result in giant
| traffic jams.
|
| There are 5-8x as many parking spots as there are cars
| (and we still can't find a goddamn spot).
|
| And of course... the great toilet paper shortage of 2020.
| no_wizard wrote:
| Super common though. Everyone from Hetzner to AWS does
| this, unless you specifically pay for full allocation,
| which is typically ~2x the price.
|
| Chances are you are using systems that do this and you
| haven't even noticed.
| yokaze wrote:
| Werner Vogels seems to disagree:
| https://twitter.com/Werner/status/25137574680
| rob-olmos wrote:
| Tweeted Sep 21, 2010, might still be true for EC2 VMs
| though.
| mulmen wrote:
| I don't see the problem. Is this not the value
| proposition of the cloud? At scale things like block
| level de-duplication and over-provisioning of compute are
| more efficient to say nothing of power and cooling. This
| efficiency drives down cost. As long as your data isn't
| leaking and you get the hardware you pay for when you
| need it what's the problem?
| geodel wrote:
| >.. they change hotels
|
| And then they learn all hotels are doing exactly the same
| thing. One hotel doing it is a risk; all hotels doing it is
| industry standard.
|
| Airlines, hotels, restaurants, doctors and so on
| oversubscribe all the time. Whoever complains is free to
| move on to their next disappointment.
| imtringued wrote:
| Memory (DDR4) is like 1.50 EUR per GB nowadays. There is no
| need to skimp on it. The most expensive component is still
| the CPU, if you actually want to do work and not just idle
| your webserver 99% of the time.
| no_wizard wrote:
| This ignores just how competitive these services have
| become and how much competition there is in this
| marketplace. Anything that allows a company to balance user
| experience with hardware cost will be evaluated. The fact
| is this is super common because even though RAM is super
| cheap, at scale these things add up. If you as a provider
| don't have to spend 120K on RAM and can put that toward
| more CPUs or something else, it can mean having a
| competitive edge over the provider that _does_ have to
| spend that 120K.
| ushakov wrote:
| It's a common technique though. I believe it's called
| oversubscription: you rent the same hardware to multiple
| tenants, hoping they won't all use it at once.
|
| Fly.io themselves admitted they're oversubscribed, and AWS
| has been doing the same for _years_ now
|
| Source: https://fly.io/blog/the-serverless-server/
| heavyset_go wrote:
| Someone posted this and then immediately deleted their comment:
| https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht...
|
| I didn't know it existed until they posted, but QEMU has a
| Firecracker-inspired target:
|
| > _microvm is a machine type inspired by Firecracker and
| constructed after its machine model._
|
| > _It's a minimalist machine type without PCI nor ACPI support,
| designed for short-lived guests. microvm also establishes a
| baseline for benchmarking and optimizing both QEMU and guest
| operating systems, since it is optimized for both boot time and
| footprint._
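|
| The docs sketch an invocation along these lines; a rough
| Python launcher (the kernel and rootfs paths are
| placeholders, not a tested recipe):
|
|     import subprocess
|
|     subprocess.run([
|         "qemu-system-x86_64",
|         "-M", "microvm",        # the Firecracker-like machine type
|         "-enable-kvm", "-cpu", "host",
|         "-m", "512m", "-smp", "2",
|         "-kernel", "vmlinux",   # uncompressed kernel image
|         "-append", "console=ttyS0 root=/dev/vda",
|         "-nodefaults", "-no-user-config", "-nographic",
|         "-serial", "stdio",
|         # no PCI, so devices are virtio-mmio ("virtio-*-device"):
|         "-drive", "id=root,file=rootfs.img,format=raw,if=none",
|         "-device", "virtio-blk-device,drive=root",
|     ], check=True)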
| lessaligned wrote:
| "the fork was very very bad for eating soup - this is a story
| about how we migrated to a spoon"
|
| ...firecracker does fine what it was designed to - short running
| fast start workloads.
|
| (oh, and the article starts by slightly misusing a bunch of
| technical terms, firecracker's not technically a hypervisor per
| se)
| sheepscreek wrote:
| ...so is it more meant for directly deploying functions to
| the cloud? Like what AWS Lambda and CloudFront Functions
| might be built on?
| yjftsjthsd-h wrote:
| I'm pretty sure firecracker was literally created to underlie
| AWS Lambda.
|
| EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-
| web-service... says my "pretty sure" memory is in fact
| correct.
| bkettle wrote:
| As does the paper [1] with details in section 4.1.
|
| [1]: https://www.usenix.org/system/files/nsdi20-paper-
| agache.pdf
| zokier wrote:
| That being said, firecracker also runs long-running tasks
| on AWS in the form of Fargate
| dathinab wrote:
| yes, it was originally created for AWS Lambda
|
| mainly it's optimized to run code only briefly (init time
| max 10s, max usage 15min, and default max request time 130s
| AFAIK)
|
| it's also focused on thin serverless functions, e.g.
| deserialize some request, run some thin, simple business
| logic, and then delegate to other lambdas based on it.
| These kinds of functions often have similar memory usage
| per call, and if a call is an outlier the VM instance can
| just be discarded soon after (i.e. at most after starting
| up a new instance, i.e. at most 10s later)
| dathinab wrote:
| it's not that simple: many other companies running longer
| jobs, including their competition, use Firecracker
|
| so while Firecracker was designed for things running just a
| few seconds, there are many places running it with jobs
| that run way longer than that
|
| the problem is that if you want to make it work with long-
| running general purpose images you don't control, you have
| to put a ton of work into making it work nicely at all
| levels of your infrastructure and code ... which is costly
| ... which a startup competing on an online dev environment
| (as opposed to, e.g., a VM hosting service) probably
| shouldn't waste time on
|
| So AFAIK the decision in the article makes sense, but the
| reasons listed for it are oversimplified to the point that
| you could say they aren't quite right. Idk why; it could be
| anything from the engineer genuinely believing them, to
| avoiding issues with some shareholder/project lead who is
| obsessed with "we need to use Firecracker because the
| competition does so too".
| [deleted]
| mgaunard wrote:
| I know a good way to make a process make the most of the hardware
| and play cooperatively with other processes: don't use
| virtualization.
|
| I will never understand the whole virtual machine and cloud
| craze. Your operating system is better than any hypervisor at
| sharing resources efficiently.
| kdmytro wrote:
| I want to segregate, not to share.
| akira2501 wrote:
| > cloud craze.
|
| Automatic scaling is great. Cloud parallelization (a.k.a fork)
| is absolutely wild once you get it rolling. Code deployments
| are incredibly simple. Never having to worry about physical
| machines or variable traffic loads is worth the small overhead
| they charge me for the wrapper. The generic system wide
| permissions model is an absolute joy once you get over the
| learning curve.
| gwd wrote:
| Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs
| right on the hardware. KVM is a hypervisor. Firecracker is a
| process that controls KVM. If you want to call firecracker (and
| QEMU, when used in conjunction with KVM) a VMM ("virtual machine
| monitor") I won't complain. But please please please, we need a
| word for what KVM and Xen are, and "hypervisor" is the best fit.
| Stop using that word for a user-level process like Firecracker.
| jstarks wrote:
| Keep fighting the good fight, friend.
|
| Although I'll note that the line between a VMM and a
| hypervisor isn't always clear. E.g., KVM includes some
| things that other hypervisors delegate to the VMM (such as
| instruction completion). And macOS's hypervisor.framework
| is almost a pass-through to the CPU's raw capabilities.
| foundry27 wrote:
| Nitpick: it's not accurate to say that a hypervisor, by
| definition, runs right on the hardware. Xen (as a type-1
| hypervisor) has this property; KVM (as a type-2 hypervisor)
| does not. It's important to remember that the single core
| responsibility of a hypervisor is to divide hardware resources
| and time between VMs, and this decision-making doesn't require
| bare-metal.
|
| For those unfamiliar, the informal distinction between type-1
| and type-2 is that type-1 hypervisors are in direct control of
| the allocation of all resources of the physical computer, while
| type-2 hypervisors operate as some combination of being "part
| of" / "running on" a host operating system, which owns and
| allocates the resources. KVM (for example) gives privileged
| directions to the Linux kernel and its virtualization kernel
| module for how to manage VMs, and the kernel then schedules and
| allocates the appropriate system resources. Yes, the type-2
| hypervisor needs kernel-mode primitives for managing VMs, and
| the kernel runs right on the hardware, but those primitives
| aren't making management decisions for the division of hardware
| resources and time between VMs. The type-2 hypervisor is making
| those decisions, and the hypervisor is scheduled by the OS like
| any other user-mode process.
| veber-alex wrote:
| KVM is a type-1 hypervisor [1]
|
| [1]: https://www.redhat.com/en/topics/virtualization/what-is-
| KVM
| compscigirl wrote:
| Not really, calling KVM a type-1 is a misunderstanding of
| what the "bare-metal" distinction is referring to. The real
| difference between the two types is whether the hypervisor
| owns the hardware or not. In the case of a type-1, the
| hypervisor runs below the kernel and controls access to the
| hardware, even for the kernel. In type-2, the hypervisor
| runs on the kernel, which owns the hardware, and must go
| through the kernel to use hardware resources.
| cthalupa wrote:
| But that's not how that works. KVM is as "bare-metal" in
| access to the system as ESXi is. The hypervisor code runs
| in ring 0 in both cases.
| monocasa wrote:
| There's arguments in both directions for something like
| kvm. Wiki states it pretty well:
|
| > The distinction between these two types is not always
| clear. For instance, KVM and bhyve are kernel modules[6]
| that effectively convert the host operating system to a
| type-1 hypervisor.[7] At the same time, since Linux
| distributions and FreeBSD are still general-purpose
| operating systems, with applications competing with each
| other for VM resources, KVM and bhyve can also be
| categorized as type-2 hypervisors.[8]
|
| https://en.wikipedia.org/wiki/Hypervisor#Classification
| cthalupa wrote:
| Type-1 and type-2 hypervisor is terminology that should at
| this point be relegated to the past.
|
| It was never popularly used in a way accurate to the origin
| of the classification - the original paper by Popek and
| Goldberg talked about formal proofs for the two types, and
| those have very little to do with how the terms began being
| used in the 90s and 00s. Things have changed a lot with
| computers since the 70s when the paper was written and the
| terminology was coined.
|
| So, language evolves, and Type-1 and Type-2 came to mean
| something else in common usage. And this might have made
| sense to differentiate something like esx from vmware
| workstation in their capabilities, but it's lost that utility
| in trying to differentiate Xen from KVM for the overwhelming
| majority of use cases.
|
| Why would I say it's useless in trying to differentiate, say,
| Xen and KVM? Couple of reasons:
|
| 1) There's no performance benefit to type-1 - a lot of
| performance sits on the device emulation side, and both are
| going to default to qemu there. Other parts are based heavily
| on CPU extensions, and Xen and KVM have equal access there.
| Both can pass through hardware, support sr-iov, etc., as
| well.
|
| 2) There's no overhead benefit in Xen - you still need a dom0
| VM, which is going to arguably be even more overhead than a
| stripped down KVM setup. There's been work on dom0less Xen,
| but it's frankly in a rough state and the related drawbacks
| make it challenging to use in a production environment.
|
| Neither term provides any real advantage or benefit in
| reasoning between modern hypervisors.
| gwd wrote:
| > Type-1 and type-2 hypervisor is terminology that should
| at this point be relegated to the past.
|
| Maybe it's because of the time I grew up in, but in my mind
| the prototypical Type-I hypervisor is VMWare ESX Server;
| and the prototypical Type-II hypervisor is VMWare
| Workstation.
|
| It should be noted that VMWare Workstation always required
| a kernel module (either on Windows or Linux) to run; so the
| core "hypervisor-y" bit runs in kernel mode either way. So
| what's the difference?
|
| The key difference between those two, to me is: Is the
| thing at the bottom designed exclusively to run VMs, such
| that every other factor gives way? Or does the thing at the
| bottom have to "play nice" with random other processes?
|
| The scheduler for ESX Server is written explicitly to
| schedule VMs. The scheduler for Workstation is the Windows
| scheduler. Under ESX, your VMs are the star of the show;
| under Workstation, your VMs are competing with the random
| updater from the printer driver.
|
| Xen is like ESX Server: VMs are the star of the show. KVM
| is like Workstation: VMs are "just" processes, and are
| competing with whatever random bash script was created at
| startup.
|
| KVM gets loads of benefits from being in Linux; like, it
| had hypervisor swap from day one, and as soon as anyone
| implements something new (like say, NUMA balancing) for
| Linux, KVM gets it "for free". But it's not _really_ for
| free, because the cost is that KVM has to make
| accommodations to all the other use cases out there.
|
| > There's no performance benefit to type-1 - a lot of
| performance sits on the device emulation side, and both are
| going to default to qemu there.
|
| Er, both KVM and Xen try to switch to paravirtualized
| interfaces as fast as possible, to minimize the emulation
| that QEMU has to do.
| cthalupa wrote:
| >Maybe it's because of the time I grew up in, but in my
| mind the prototypical Type-I hypervisor is VMWare ESX
| Server; and the prototypical Type-II hypervisor is VMWare
| Workstation.
|
| My point is that these are largely appropriated terms -
| neither would fit the definitions of type 1 or type 2
| from the early days when Popek and Goldberg were writing
| about them.
|
| > Or does the thing at the bottom have to "play nice"
| with random other processes?
|
| From this perspective, Xen doesn't count. You can have
| all sorts of issues from the dom0 side and competing with
| resources - you mention PV drivers later, and you can
| 100% run into issues with VMs because of how dom0
| schedules blkback and netback when competing with other
| processes.
|
| ESXi can also run plenty of unmodified linux binaries -
| go back in time 15 years and it's basically a fully
| featured OS. There's a lot running on it, too. Meanwhile,
| you can build a linux kernel with plenty of things
| switched off and a root filesystem with just the bare
| essentials for managing kvm and qemu that is even less
| useful for general purpose computing than esxi.
|
| >Er, both KVM and Xen try to switch to paravirtualized
| interfaces as fast as possible, to minimize the emulation
| that QEMU has to do.
|
| There are more things being emulated than there are PV
| drivers for, but this is a bit outside of my point.
|
| For KVM, the vast majority of implementations are using
| qemu for managing their VirtIO devices as well -
| https://developer.ibm.com/articles/l-virtio/ - you'll
| notice that IBM even discusses these paravirtual drivers
| directly in context of "emulating" the device. Perhaps a
| better way to get the intent across here would be saying
| qemu handles the device model.
|
| From a performance perspective, ideally you'd want to
| avoid PV here too and go with sr-iov devices or
| passthrough.
| bonzini wrote:
| According to the actual paper that introduced the
| distinction, and adjusting for change of terminology in the
| last 50 years, a type-1 hypervisor runs in kernel space and a
| type-2 hypervisor runs in user space. x86 is not
| virtualizable by a type-2 hypervisor, except by software
| emulation of the processor.
|
| What actually can change is the amount of work that the
| kernel-mode hypervisor leaves to a less privileged (user
| space) component.
|
| For more detail see
| https://www.spinics.net/lists/kvm/msg150882.html
| rdtsc wrote:
| I think people just pick the coolest-sounding term. Imagine
| someone sharing what they are working on: which sounds
| cooler, "I am working on a virtual machine monitor" or "I
| am working on a hypervisor"? Hypervisor just sounds
| futuristic and awesome.
|
| It's like with "isomorphic" code. That just sounds much cooler
| than "js that runs on the client and the server".
| arun-mani-j wrote:
| I think you could help me answer the question that has been in
| my mind for a month :)
|
| Is there any article that tells the difference and relationship
| between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc.
| with their typical use cases?
| jacobr1 wrote:
| [flagged]
| doliveira wrote:
| Honestly, when we're discussing with humans we want the
| humans' response. Any one of us can ask ChatGPT; I don't
| know why people keep pasting its answers
| solarpunk wrote:
| people keep doing it because it lets them farm upvotes.
| theossuary wrote:
| Damn, outdone by ChatGPT
| bonzini wrote:
| Pretty good I must say. AWS however is using KVM only these
| days!
| orangepurple wrote:
| What is AWS Nitro then?
| bonzini wrote:
| Nitro is the custom hardware that runs in AWS data
| centers (or in outposts). You can use it bare metal, or
| if you use VM instances it will run Linux (plus KVM) and
| a custom user space component that replaces QEMU.
| theossuary wrote:
| KVM is the Linux kernel's implementation of the CPU
| virtualization extensions, which accelerate VMs to near
| bare-metal speeds.
|
| QEMU is a user-space system emulator. It can emulate
| different architectures like ARM, x86, etc. in software. It
| can also emulate devices, networking, disks, etc. It's
| invoked via the command line.
|
| The reason you'll see QEMU/KVM a lot is because QEMU is the
| emulator, the thing actually running the VM. And it
| utilizes KVM (on Linux; macOS has HVF, for example) to
| accelerate the VM when the host architecture matches the
| VM's.
|
| Libvirt is an XML based API on top of Qemu (and others). It
| allows you to define networks, VMs (it calls them domains),
| and much more with a unified XML schema through libvirtd.
|
| Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI
| to do the same.
|
| Proxmox is Debian under the hood with Qemu/KVM running VMs.
| It provides a robust web UI and easy clustering capabilities.
| Along with nice to haves like easy management of disks, ceph,
| etc. You can also manage Ceph through an API with Terraform.
|
| Xen is an alternative hypervisor (like ESXi). Instead of
| running on top of Linux, Xen has its own microkernel. This
| means less flexibility (there's no Linux body running
| things), but also simpler management and less attack
| surface. I haven't played much with Xen though; KVM is kind
| of the de facto standard, but IIRC AWS used to use a
| modified Xen before KVM came along and ate Xen's lunch.
| cthalupa wrote:
| > Xen is an alternative hypervisor (like ESXi). Instead of
| running on top of Linux, Xen has its own microkernel. This
| means less flexibility (there's no Linux body running
| things), but also simpler management and less attack
| surface.
|
| You're going to need dom0 (a "control domain") on any Xen
| host. Gotta have something running xl and the rest of the
| toolstack for managing it. dom0less technically exists but
| the drawbacks mean it's not really usable by most people in
| a production situation.
| adql wrote:
| >Xen is an alternative hypervisor (like ESXi). Instead of
| running on top of Linux, Xen has its own microkernel. This
| means less flexibility (there's no Linux body running
| things), but also simpler management and less attack
| surface. I haven't played much with Xen though; KVM is kind
| of the de facto standard, but iirc AWS uses a modified Xen.
|
| If you actually played with Xen you'd know it's not
| actually easier to manage. And the increased security
| claims are dubious at best, as the very thing that would be
| attacked (dom0 managing the whole thing and running Linux)
| has direct, unfettered access to the Xen microkernel. There
| is a reason many sites migrated away from Xen to KVM. Also,
| many Xen drivers de facto run partly in the Linux dom0
| instance, so you don't even get that isolation.
|
| We ran Xen for a few years, as KVM at first was still not
| as refined and Xen was first to mature, and it was just a
| million little annoying things.
|
| KVM offers far simpler and more straightforward management.
| A VM is just a process. You can look at its CPU usage via
| normal tools. No magic. No driver problems.
| monocasa wrote:
| > I haven't played much with Xen though; KVM is kind of the
| de facto standard, but iirc AWS uses a modified Xen.
|
| AWS switched to KVM, and even a lot of AWS systems that
| report themselves as Xen are running as KVM with a compat
| shim to say it's Xen.
| gabereiser wrote:
| KVM is kernel-based virtual machine, with libvirt being its
| API abstraction over all of it. QEMU is a virtual machine
| host that leverages kvm or software virtualization to spin up
| machines on the host. virt-manager does the same. Xen is
| another virtual machine host, like KVM. Proxmox is a virtual
| machine manager (like QEMU, virt-manager) but is web based.
| Libvirt will provide abstraction for KVM, QEMU, and Xen.
|
| Use cases: Proxmox web interface exposed on your local
| network on a KVM Linux box that uses QEMU to manage VMs.
| Proxmox will allow you to do that from the web. QEMU is
| great for a single or small fleet of machines but should be
| automated for any heavy lifting. Proxmox will do that.
| bonzini wrote:
| This is almost entirely wrong especially as far as QEMU,
| Libvirt and virt-manager are concerned.
|
| QEMU is a low level process that represents the virtual
| machine. It has no equivalent in Xen. Using QEMU directly
| is not a good idea unless your needs for VM configurations
| change all the time and you hardly reuse VMs.
|
| Libvirt is at a higher level than QEMU. It manages the QEMU
| processes and gives them access to system resources (image
| files, network interfaces, pass-through PCI devices). It
| also makes it easy to manage the configuration of your
| virtual machines and the resources they use.
|
| Higher still is virt-manager, which is a GUI interface for
| libvirt. Proxmox sits at roughly the same level as virt-
| manager.
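|
| To make the layering concrete, a quick sketch with the
| libvirt Python bindings (the XML file is a placeholder for
| a domain definition you'd write yourself):
|
|     import libvirt
|
|     conn = libvirt.open("qemu:///system")  # libvirtd, not QEMU
|     with open("mydomain.xml") as f:
|         dom = conn.defineXML(f.read())  # register the VM config
|     dom.create()  # libvirt launches and supervises a QEMU process
|     print(dom.name(), dom.info())  # state, memory, vcpus, cpu time
|     conn.close()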
| gabereiser wrote:
| How? KVM and Xen are kernel level. QEMU uses KVM but also
| has a software virtualization capability. Libvirt is an
| API abstraction over it all. virt-manager is a gui app to
| manage libvirt machines. Proxmox as well. Proxmox VE
| talks to VMHost via libvirt.
| bonzini wrote:
| Libvirt does not use KVM. Libvirt uses either QEMU (which
| in turn might or might not use KVM) or Xen or other
| hypervisors. So it's incorrect to say that Libvirt
| abstracts over KVM.
|
| And virt-manager indeed manages Libvirt machines so it's
| not at the level of QEMU as you wrote in the parent
| comment:
|
| > Proxmox is a virtual machine manager (like QEMU, virt-
| manager)
| gabereiser wrote:
| Semantics, libvirt abstracts over KVM via QEMU because
| QEMU/KVM/HVT is all one driver.
| bonzini wrote:
| KVM is not enough to create a virtual machine. KVM only
| virtualizes the processor, not a whole machine.
| gabereiser wrote:
| >Using KVM, one can run multiple virtual machines running
| unmodified Linux or Windows images. Each virtual machine
| has private virtualized hardware: a network card, disk,
| graphics adapter, etc.
|
| Straight from their site. QEMU is the user space
| interface, KVM the kernel space driver. It's enough to
| run whatever OS. That's the point.
|
| For libvirt: https://libvirt.org/drivers.html
|
| They support a bunch as well.
| cthalupa wrote:
| I don't want to necessarily make this an argument to/from
| authority, but for some context here - you are discussing
| this with Paolo Bonzini, maintainer of KVM, contributor
| to QEMU. In the list of people that best understand the
| difference and demarcation points between KVM and QEMU,
| he's pretty far up there.
| bonzini wrote:
| Exactly, it's QEMU that abstracts over the processor
| virtualization APIs of KVM, Microsoft WHPX, Apple
| Hypervisor.framework etc. Not Libvirt.
| Izmaki wrote:
| I don't know if _one_ such article exists, but here is a
| piece of tech doc from oVirt (yet another tool) that shows
| how - or that - VDSM is used by oVirt to communicate with
| QEMU through libvirt: https://www.ovirt.org/develop/architect
| ure/architecture.html...
|
| In really simple terms, so simple that I'm not 100% sure they
| are correct:
|
| * KVM is a hypervisor, or rather it lets you turn Linux into
| a hypervisor [1], which will let you run VMs on your machine.
| I've heard KVM is rather hard to work with (steep learning
| curve). (Xen is also a hypervisor.)
|
| * QEMU is a wrapper-of-a-sorts (a "machine emulator and
| virtualizer" [2]) which can be used on top of KVM (or Xen).
| "When used as a virtualizer, QEMU achieves near native
| performance by executing the guest code directly on the host
| CPU. QEMU supports virtualization when executing under the
| Xen hypervisor or using the KVM kernel module in Linux." [2]
|
| * libvirt "is a toolkit to manage virtualization platforms"
| [3] and is used, e.g., by VDSM to communicate with QEMU.
|
| * virt-manager is "a desktop user interface for managing
| virtual machines through libvirt" [4]. The screenshots on the
| project page should give an idea of what its typical use-case
| is - think VirtualBox and similar solutions.
|
| * Proxmox is the above toolstack (-ish) but as one product.
|
| ---
|
| [1] https://www.redhat.com/en/topics/virtualization/what-is-
| KVM
|
| [2] https://wiki.qemu.org/Main_Page
|
| [3] https://libvirt.org/
|
| [4] https://virt-manager.org/
| Izmaki wrote:
| PS: typically if you want to run VMs you are faced with
| only two paths: 1) you want one or more VMs on your
| personal computer, or 2) you want one or more VMs in an
| office environment.
|
| On the first path you are likely going to be just fine with
| VirtualBox, VMWare Workstation or Hyper-V (Windows only) /
| Parallels (Mac intended). Which one you should pick depends
| on your desired use of the machines.
|
| On the second path you would go with a solution that deals
| with the nitty-gritty details, such as Proxmox, oVirt,
| Hyper-V, ESXi, or any of the other many available options -
| granted you are not going full cloud-based, which opens up
| a whole lot of different options too.
|
| You would generally never need to worry about which
| components are needed where and why. I've had to worry
| about it once or twice before, because I've had to debug
| why an oVirt solution was not behaving like I wanted it to
| behave. Knowing the inner workings helps in that case.
| crabbone wrote:
| > On the first path you are likely going to be just fine
| with VirtualBox, VMWare Workstation or Hyper-V (Windows
| only) / Parallels (Mac intended).
|
| As a Linux user, why would you want to use VirtualBox or
| VMWare Workstation? They are not so well integrated with
| the system, and, frankly, VirtualBox is more of a toy VM
| player... just go for virt-manager. It gives a
| conceptually similar interface to VirtualBox, but better
| integration with the rest of the system, especially when it
| comes to stuff like sending different key combinations.
|
| I honestly cannot think of a single benefit to using
| VirtualBox (and I'm less familiar with VMWare Player)
| compared to virt-manager. My guess is that it's more often
| used because it's also a common choice on MS Windows, so
| you get more hits if you search the Web for questions
| associated with VMs / you'd get tutorials for how to set up
| a VM that use VirtualBox. But if you apply yourself to
| learning how either one of these works, you'd see no reason
| to choose it.
| mkesper wrote:
| The only reason for using it would be that the GUI is a
| little bit more capable. But feature crippling without
| the PUEL addons is horrible.
| [deleted]
| MuffinFlavored wrote:
| > virtual machine monitor
|
| Is it good to think of libvirt as a virtual machine
| monitor, or is that more "virtual machine management"?
| zbrozek wrote:
| I'd love to get a clear explanation of what libvirt actually
| does. As far as I can tell it's a qemu argument assembler and
| launcher. For my own use-case, I just launch qemu from
| systemd unit files:
|
| https://wiki.archlinux.org/title/QEMU#With_systemd_service
| bonzini wrote:
| The main important point is that Libvirt takes care of
| privilege separation.
|
| It makes sure that if your VM and/or QEMU are broken out
| of, there are extra layers to prevent getting access to the
| whole physical machine. For example it runs QEMU as a very
| limited user and, if you're using SELinux, the QEMU process
| can hardly read any file other than the vm image file.
|
| By contrast the method in the arch wiki runs QEMU as root.
| QEMU is exposed to all sort of untrusted input, so you
| really don't want it to run as root.
|
| Libvirt also handles cross machine operations such as live
| migration, and makes it easier to query a bunch of things
| from QEMU.
|
| For more info see https://www.redhat.com/en/blog/all-you-
| need-know-about-kvm-u...
| zbrozek wrote:
| Thank you, this is the best explanation of the utility
| that I've seen. I appreciate it.
| MuffinFlavored wrote:
| > As far as I can tell it's a qemu argument assembler
|
| Raises a few questions to me:
|
| Can you use KVM/do KVM stuff without QEMU?
|
| Can you do libvirt stuff without QEMU?
|
| Hoping the answers to both aren't useless/"technically, but
| why would you want to?"
| zokier wrote:
| There is a cute article from LWN demoing how to use KVM
| directly without anything else:
| https://lwn.net/Articles/658511/
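|
| The first few steps even fit in a handful of lines of
| Python; a sketch (ioctl numbers are _IO(0xAE, nr) from
| <linux/kvm.h>):
|
|     import fcntl, os
|
|     KVM_GET_API_VERSION    = 0xAE00
|     KVM_CREATE_VM          = 0xAE01
|     KVM_GET_VCPU_MMAP_SIZE = 0xAE04
|     KVM_CREATE_VCPU        = 0xAE41
|
|     kvm = os.open("/dev/kvm", os.O_RDWR | os.O_CLOEXEC)
|     assert fcntl.ioctl(kvm, KVM_GET_API_VERSION) == 12
|     vm = fcntl.ioctl(kvm, KVM_CREATE_VM, 0)     # fd for the VM
|     vcpu = fcntl.ioctl(vm, KVM_CREATE_VCPU, 0)  # fd for vcpu 0
|     print("vcpu mmap size:",
|           fcntl.ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE))
|     # the article goes on to map guest memory, set registers,
|     # and loop on KVM_RUN - see the link for the full thing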
| mrweasel wrote:
| That is really neat, it never occurred to me that you
| could just use KVM as a sort of sandbox.
| simcop2387 wrote:
| > Can you use KVM/do KVM stuff without QEMU?
|
| Yes there's a few things out there like Firecracker that
| use KVM without using QEMU. I'm not completely aware of
| all of them but they do exist
|
| > Can you do libvirt stuff without QEMU?
|
| Yes, it can also manage LXC containers and a few other
| types like Xen, Bhyve, and Virtuozzo - and QEMU without
| KVM. The without-KVM part is important for letting you run
| VMs that emulate architectures other than the native one.
|
| For a good bit of this, it is "why would you want to" but
| there are definitely real cases where you'd want to be
| able to do this. Like the LXC or Virtuozzo support means
| that you can run lighter weight containers (same
| underlying tech as Docker essentially) through the same
| orchestration/management that you use for virtual
| machines. And the Bhyve support lets you do the same
| thing for running things on top of FreeBSD (though I've
| never used it this way) so that a heterogeneous mix of
| hosts is managed through the same interfaces.
| jononor wrote:
| libvirt supports other hypervisors than QEMU, such as
| VirtualBox, Xen and LXC.
| MuffinFlavored wrote:
| I know ChatGPT responses get downvoted here but I fed it
| this to help me better understand this:
|
| > is KVM a hypervisor? is it type 1 or type 2? is QEMU a
| hypervisor, is it type 1 or type 2? if QEMU is using KVM,
| is QEMU then not a hypervisor in that use case?
|
| Yes, KVM (Kernel-Based Virtual Machine) is indeed a
| hypervisor. It's a type 1 hypervisor, also known as a
| "bare metal" hypervisor. This is because KVM directly
| runs on the host's hardware to control the hardware and
| to manage guest operating systems. The fact that it's a
| Linux kernel module that allows the Linux kernel to
| function as a hypervisor makes it very efficient.
|
| QEMU (Quick Emulator) is a bit more complex. By itself,
| it is technically a type 2 or "hosted" hypervisor,
| meaning it runs within a conventional operating system
| environment. QEMU is a generic, open-source machine
| emulator and virtualizer that can emulate a variety of
| hardware types and host a range of guest operating
| systems.
|
| However, when QEMU is used with KVM, the picture changes
| somewhat. In this case, KVM provides the hardware
| virtualization where it allows the host machine to call
| CPU instructions of full virtualization. Then, QEMU
| emulates the hardware resources and provides the user
| interface for the VM, thus allowing for better
| performance and usability. It's this combination of KVM's
| hardware acceleration capabilities and QEMU's emulation
| capabilities that makes them often used together.
|
| In this case, QEMU is not acting purely as a hypervisor;
| it's providing hardware emulation and user interface for
| the VMs, while KVM is the part providing the hypervisor
| functionality. However, we often refer to the combination
| of "QEMU/KVM" as a unit when talking about this mode of
| operation.
| adql wrote:
| > Can you use KVM/do KVM stuff without QEMU?
|
| There are projects doing that, although qemu is the de
| facto standard and your best bet if you don't need to boot
| your machines in 20ms
|
| > Can you do libvirt stuff without QEMU?
|
| Libvirt has _many_ backends, so yes. IIRC it can even
| manage VirtualBox, although I'm not sure why anyone would
| want to
|
| > Hoping the answers to both aren't useless/"technically,
| but why would you want to?"
|
| ...why? Is there a problem kvm+qemu+libvirt doesn't solve
| for you?
| adql wrote:
| It's a lot of glue to present a consistent interface but it
| also does the management part.
|
| "API to virtualization system" would probably be closest
| approximation but it also does some more advanced stuff like
| coordinating cross-host VM migration
| datadeft wrote:
| Tl;dr: We tried to misuse technology and we failed. If
| Firecracker was developed for a single binary executed for
| a short period of time, why do you try to use it for
| multiple executables running for a long time? Does it make
| any sense to even try?
| zokier wrote:
| AWS uses Firecracker to execute long-running Fargate tasks;
| it's hardly misuse
| mike_hock wrote:
| Where in the "sales" pitch on the fancy-CSS website as well as
| the README does it say only to use it for single-shot
| workloads?
|
| I think the complaints are perfectly valid.
| favflam wrote:
| I know that Firecracker does not let you bind mount volumes, but
| QEMU does. So, we changed to QEMU from Firecracker. If you run
| the workloads in Kubernetes, you just have to change a single
| value in a yaml file to change the runtime.
|
| I would be scared to let unknown persons use QEMU with
| bind-mounted volumes, as that is a huge security risk.
| Firecracker, I think, was designed from the start to run
| unsanitized workloads; hence, no bind mounting.
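|
| For reference, the kind of QEMU directory sharing I have in
| mind is virtfs/9p; a sketch with placeholder paths (the
| guest mounts it with "mount -t 9p -o trans=virtio hostshare
| /mnt"):
|
|     import subprocess
|
|     subprocess.run([
|         "qemu-system-x86_64", "-enable-kvm", "-m", "2048",
|         "-drive", "file=guest.img,if=virtio",
|         "-virtfs", "local,id=hostshare,path=/srv/project,"
|                    "mount_tag=hostshare,security_model=mapped-xattr",
|     ], check=True)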
| brunoborges wrote:
| > you just have to change a single value in a yaml file
|
| Most dangerous 12-word sentence.
| Havoc wrote:
| I toyed with it a bit and was delighted to get it running -
| only to discover that getting even basic networking going
| is another mission in itself.
|
| Lightweight is cool, but for many tasks that level of
| spartan is overkill.
|
| If I'm investing time in lightweight, it might as well be
| wasm tech
| k__ wrote:
| _" Firecracker...'s excellent for running short-lived
| workloads...A little-known fact about Firecracker is its lack of
| support... for long-lived workloads."_
|
| Okay.
| rgbrenner wrote:
| _" Firecracker's RAM footprint starts low, but once a workload
| inside allocates RAM, Firecracker will never return it to the
| host system."_
|
| Firecracker has a balloon device you can inflate (ie: acquire as
| much memory inside the VM as possible) and then deflate...
| returning the memory to the host. You can do this while the VM is
| running.
|
| https://github.com/firecracker-microvm/firecracker/blob/main...
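|
| Resizing it at runtime is a single call on the API socket;
| a sketch (assumes the balloon device was added with PUT
| /balloon before boot, and a placeholder socket path):
|
|     import http.client, json, socket
|
|     class UnixHTTPConnection(http.client.HTTPConnection):
|         def __init__(self, socket_path):
|             super().__init__("localhost")
|             self.socket_path = socket_path
|         def connect(self):
|             self.sock = socket.socket(socket.AF_UNIX,
|                                       socket.SOCK_STREAM)
|             self.sock.connect(self.socket_path)
|
|     conn = UnixHTTPConnection("/tmp/firecracker.sock")
|     # inflate to 512 MiB (the guest gives that much up); PATCH
|     # again with a smaller value to deflate and hand it back
|     conn.request("PATCH", "/balloon",
|                  body=json.dumps({"amount_mib": 512}),
|                  headers={"Content-Type": "application/json"})
|     assert conn.getresponse().status == 204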
| 0xbadcafebee wrote:
| That and the fact that this was after "several weeks of
| testing" tells me this team doesn't have much virtualization
| experience. Firecracker is designed to quickly virtualize 1
| headless stateless app (like a container), not run hundreds of
| different programs in a developer environment.
| CompuIves wrote:
| Yes, we use this at CodeSandbox for reclaiming memory to the
| host (and to reduce snapshot size when we hibernate the VM).
| paulv wrote:
| The first footnote says _If you squint hard enough, you'll
| find that Firecracker does support dynamic memory
| management with a technique called ballooning. However, in
| practice, it's not usable. To reclaim memory, you need to
| make sure that the guest OS isn't using it, which, for a
| general-purpose workload, is nearly impossible_
| adql wrote:
| Yeah, it's a pretty hard problem, as you'd need to
| defragment physical memory (while fixing all the virtual-
| to-physical mappings) to make a contiguous block to free
| [deleted]
| dspillett wrote:
| _> which, for a general-purpose workload, is nearly
| impossible_
|
| That depends on the workload and the maximum memory allocated
| to the guest OS.
|
| A lot of workloads rely on the OS cache/buffers to manage IO
| so unless RAM is quite restricted you can call in to release
| that pretty easily prior to having the balloon driver do its
| thing. In fact I'd not be surprised to be told the balloon
| process does this automatically itself.
|
| If the workload does its own IO management and memory
| allocation (something like SQL Server, which will eat what
| RAM it can and does its own IO caching), or the VM's memory
| allocation is too small for OS caching to be a significant
| use after the rest of the workload (you might pare memory
| down to the bare minimum like this for a "fairly static
| content" server that doesn't see much variation in memory
| needs and can be allowed to swap a little if things grow
| temporarily), then I'd believe it is more difficult. That
| is hardly the use case for firecracker though, so if that
| is the sort of workload being run, perhaps reassessing the
| tool used for the job was the right call.
|
| Having said that my use of VMs is generally such that I can
| give them a good static amount of RAM for their needs and
| don't need to worry about dynamic allocation, so I'm far from
| a subject expert here.
|
| And, isn't firecracker more geared towards short-lived VMs:
| quick to spin up, do a job, spin down immediately (or after
| only a short idle timeout if the VM might answer another
| request that comes in immediately or is already queued)? So
| you are better off cycling VMs, which is probably happening
| anyway, than messing around with memory balloons. Again,
| I'm not talking from a position of personal experience
| here, so corrections/details welcome!
| dathinab wrote:
| > is nearly impossible
|
| for many mostly "general purpose" use cases it's quite
| viable, or else ~fly.io~ AWS Fargate wouldn't be able to
| use it
|
| this doesn't mean it's easy to implement the necessary
| automated tooling etc.
|
| so depending on your dev resources and priorities it might
| be a bad choice
|
| still I feel the article was quite a bit subtly judgmental
| while moving some quite relevant parts for the content of
| the article into a footnote, and also omitting that this
| "supposedly unusable tool" is used successfully by various
| other companies...
|
| like it was written by an engineer being overly defensive
| about their decision due to having to defend it for the
| 100th time, because shareholders, customers, and higher
| level management just wouldn't shut up about "but that uses
| Firecracker"
| tedunangst wrote:
| I'm struggling to understand how qemu with free page
| reporting isn't exactly the same as a firecracker balloon.
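|
| For reference, the QEMU side I mean is just a property on
| the balloon device; a sketch (assumes QEMU 5.1+ and a guest
| kernel with free page reporting, i.e. Linux 5.7+):
|
|     import subprocess
|
|     subprocess.run([
|         "qemu-system-x86_64", "-enable-kvm", "-m", "4096",
|         "-drive", "file=guest.img,if=virtio",  # placeholder
|         # guest reports free pages; the host drops the backing RAM
|         "-device", "virtio-balloon-pci,free-page-reporting=on",
|     ], check=True)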
| londons_explore wrote:
| I really want VMs to integrate 'smarter' with the host.
|
| For example, if I'm running 5 VMs, there is a good chance
| that many of the pages are identical. Not only do I want
| those pages to be deduplicated, but I want them to be zero-
| copy (ie. not deduplicated after-the-fact by some daemon).
|
| To do that, the guest block cache needs to be integrated with the
| host block-cache, so that whenever some guest application tries
| to map data from disk, the host notices that another virtual
| machine has already caused this data to be loaded, so we can just
| map the same page of already loaded data into the VM that is
| asking.
| jarym wrote:
| Well that's all nice, but that would also need to be compute-
| efficient for it to be worthwhile and near-real-time dedupe of
| memory pages would be a REALLY tough challenge.
| londons_explore wrote:
| Pretty straightforward for disk blocks. Many VM disks are
| already de-duped, either through snapshotting or through
| copy-on-write host filesystems.
|
| The host block cache will end up deduplicating it
| automatically because all the 'copies' lead back to the
| same block on disk.
| gorbak25 wrote:
| I've tried to use virtio-pmem + DAX for the page cache to not
| be duplicated between the guest and the host. In practice the
| RAM overhead of virtio-pmem is unacceptable and it doesn't
| support discard operations at all. So yes a better solution
| would be needed.
| adql wrote:
| https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...
|
| zero-copy is harder as one system upgrade on one of them will
| trash it, but KSM is overall pretty effective at saving some
| memory on similar VMs
| re-thc wrote:
| KVM has had KSM (kernel samepage merging), which de-
| duplicates pages, for a long time.
| londons_explore wrote:
| But that makes a copy first, and only later notices that the
| pages are the same and merges them again.
|
| Better to not make copies in the first place.
| adql wrote:
| it's not really possible in hypervisor as it doesn't know
| what guest will be putting in its memory beforehand
| Liquid_Fire wrote:
| How are you going to know in advance that the pages are
| going to be the same?
|
| e.g. your guest kernel is loading an application into
| memory, by reading some parts of an ELF file from disk.
| Presumably each VM has its own unique disk, so the
| hypervisor can't know that this is "the same" page of data
| as another VM has without actually reading it into memory
| first and calculating a hash or something.
|
| If the VMs share a disk image (e.g. the image is copy-on-
| write), then I could see it being feasible - e.g. with KVM,
| even if your VMs are instantiated by distinct userspace
| processes, they would probably share the pages as they mmap
| the same disk image. You would still need your virtualised
| disk device to support copy-on-write, which may or may not
| be possible depending on your use case.
|
| But your copy-on-write disk images will probably quickly
| diverge in a way that makes most pages not shareable,
| unless you use some sort of filesystem optimised for that.
|
| Lastly, since you mentioned Chromium or Slack in another
| comment - I'm sure you'll find nearly all of the loading
| time there is not spent loading the executable from disk,
| but actually executing it (and all its
| startup/initialisation code). So this probably won't be the
| speedup you're imagining. It would just save memory.
| londons_explore wrote:
| Chromium load time with an empty profile on my system is
| 4.5 seconds with a cleared disk block cache, and 2.0
| seconds with a warm disk cache.
|
| So startup time could be better than halved. Seems worth
| it.
| londons_explore wrote:
| > pages not shareable, unless you use some sort of
| filesystem optimised for that.
|
| btrfs on the host would have support for deduplication of
| identical pages in the disk images. It's true that a CPU-
| costly scan would be needed to identify new shared pages
| if, for example, two VMs are both updated to the latest
| distro release.
| ec109685 wrote:
| KSM enables side-channel attacks, so be careful when
| turning it on:
| https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
| scarface_74 wrote:
| This seems like a security issue waiting to happen when you're
| running code from different users.
| hamandcheese wrote:
| If you already know so much about your application(s), are you
| sure you need virtualization?
| drbawb wrote:
| The second I read "shared block cache" my brain went to
| containers.
|
| If you want data colocated on the same filesystem, then put
| it on the same filesystem. VMs suck, nobody spins up a whole
| fabricated IBM-compatible PC and gaslights their executable
| _because they want to._ [1] They do it because their OS (a)
| doesn't have containers, (b) doesn't provide strong enough
| isolation between containers, or (c) the host kernel can't
| run their workload. (Different ISA, different syscalls,
| different executable format, etc.)
|
| Anyone who has ever tried to run heavyweight VMs atop a
| snapshotting volume already knows the idea of "shared blocks"
| is a fantasy; as soon as you do one large update inside the
| guest the delta between your volume clones and the base
| snapshot grows immensely. That's why Docker et al. has a
| concept of layers and you describe your desired state as a
| series of idempotent instructions applied to those layers.
| That's possible _because_ Docker operates semantically on a
| filesystem; much harder to do at the level of a block device.
|
| Is a block containing b"hello, world" part of a program's
| text section, or part of a user's document? You don't know,
| because the guest is asking you for an LBA, not a path, not
| modes, not an ACL, etc. - If you don't know that, the host
| kernel has no idea how the page should be mapped into
| memory.
| Furthermore storing the information to dedup common blocks is
| non-trivial: go look at the manpage for ZFS' deduplication
| and it is littered w/ warnings about the performance, memory,
| and storage implications of dealing with the dedup table.
|
| [1]: https://www.youtube.com/watch?v=coFIEH3vXPw
| tjoff wrote:
| Doubt it is worth the hassle. How many do you really expect to
| be identical?
|
| An OS isn't large. Your spotify/slack/browser instance is of
| comparable size. Says more about browser based apps but still.
| londons_explore wrote:
| Basically all code pages should be the same if some other
| VM has the same version of Ubuntu and is running the same
| version of Spotify/Slack.
|
| And remember that as well as RAM savings, you also get
| 'instant loading' because there is no need to do slow SSD
| accesses to load hundreds of megabytes of a chromium binary
| to get Slack running...
| drbawb wrote:
| > An OS isn't large. Your spotify/slack/browser instance is
| of comparable size.
|
| A fairly recent Windows 11 Pro image is ~26GB unpacked and
| 141k dirents. After finishing OOBE it's already running like
| >100 processes, >1000 threads, and >100k handles. My Chrome
| install is ~600MB and 115 dirents. (Not including UserData.)
| It runs ~1 process per tab. Comparable in scope and
| complexity? That's debatable, but I tend to agree that modern
| browsers are pretty similar in scope to what an OS should be.
| (The other day my "web browser" flashed the firmware on the
| microcontroller for my keyboard.)
|
| They're not even close to "being comparable in size,"
| although I guess that says more about Windows.
| cthalupa wrote:
| My reading was that the "comparable in size" was more about
| memory footprint and less about usage of storage
| shrubble wrote:
| OpenVZ does this. If you have 5 VMs each loading the same
| library then memory is conserved, as I understand it.
| anthk wrote:
| kvm does the same with KSM.
| shrubble wrote:
| Not precisely, in that KSM does it after the fact while
| OpenVZ has it occur as a consequence of its design, on the
| loading of the program.
|
| See (OpenVZ) "Containers share dynamic libraries, which
| greatly saves memory." It's just 1 Linux kernel when you
| are running OpenVZ containers.
|
| https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_
| c...
|
| See (KVM/KSM): "KSM enables the kernel to examine two or
| more already running programs and compare their memory. If
| any memory regions or pages are identical, KSM reduces
| multiple identical memory pages to a single page. This page
| is then marked copy on write."
|
| https://access.redhat.com/documentation/en-
| us/red_hat_enterp...
|
| In KVM's defense, it supports a much wider range of OSes;
| OpenVZ only really does different versions of Linux, while
| KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in
| addition to Linux.
| andrewflnr wrote:
| Are you sure you're not thinking "copy on write" rather than
| "zero copy"? The latter implies you can predict in advance
| which pages will be the same forever...
| londons_explore wrote:
| The pages would be copy-on-write, but since this would mostly
| be for code pages, they would never be written, and therefore
| never copied.
|
| By 'zero copy', I mean that when a guest tries to read a
| page, if another guest has that page in RAM, then no copy
| operation is done to get it into the memory space of the 2nd
| guest.
| ComputerGuru wrote:
| The article did an ok job of explaining the firecracker
| limitations they ran into but it was extremely skimpy when it
| came to qemu and just rushed to the conclusion "we did a lot of
| work so try our product."
| naikrovek wrote:
| Yeah, I was reading so I could find out what they did.
|
| I understand that they need to sell their product, but
| jeez, don't leave us hanging like that
| hugodutka wrote:
| I didn't want to go into all the technical details, but we
| have another write-up that goes into details about RAM
| management: https://github.com/hocus-
| dev/hocus/blob/main/rfd/0003-worksp...
|
| Other than making sure we release unused memory to the host,
| we didn't customize QEMU that much. Although we do have a
| cool layered storage solution - basically a faster
| alternative to QCOW2 that's also VMM-independent. It's
| called overlaybd, and was created and implemented at
| Alibaba. That
| will probably be another blog post.
| https://github.com/containerd/overlaybd
| ComputerGuru wrote:
| > I didn't want to go into all the technical details
|
| HN is here for the technical details ;)
| gunapologist99 wrote:
| that should be the HN motto!
| xen2xen1 wrote:
| Thirded.. Ed.
| blinkingled wrote:
| > The main issue we've had with QEMU is that it has too many
| options you need to configure. For instance, enabling your VM to
| return unused RAM to the host requires at least three challenging
| tasks
|
| This just works on Hyper-V Linux guests btw. For all the crap MS
| gets they do some things very right.
| freeone3000 wrote:
| It _kind of_ just works. It's actually broken under Debian 13
| for some reason; memory usage infinitely balloons if the
| feature is enabled.
___________________________________________________________________
(page generated 2023-07-10 23:00 UTC)