[HN Gopher] CRIU, a project to implement checkpoint/restore func...
___________________________________________________________________
CRIU, a project to implement checkpoint/restore functionality for
Linux
Author : JeremyNT
Score : 103 points
Date : 2024-06-21 16:59 UTC (6 hours ago)
(HTM) web link (criu.org)
(TXT) w3m dump (criu.org)
| londons_explore wrote:
| I pulled apart the innards of CRIU because I needed to be able to
| checkpoint and restore a process within a few microseconds.
|
| The project ended up being a dead end because it turned out that
| running my program in a QEMU whole-system VM and then fork()ing
| QEMU was faster.
| kn100 wrote:
| could you tell me a bit more about what you're doing?
| londons_explore wrote:
| The goal was to have a web browser (Chromium) able to 'guess'
| stuff about what response it will get from the network (i.e.
| will the server return the same JavaScript blob as last
| time). We start executing the JavaScript as if the guess is
| correct. If the guess is wrong, we revert to a snapshot.
|
| It lets you make good use of CPU time whilst waiting for the
| network.
|
| It turns out simple heuristics can get 99% accuracy on the
| question of 'will the server return the same result as last
| time for this non-cacheable response'.
|
| However, since my machine has many CPU cores it made sense to
| have many 'speculative' copies of the browser going at once.
|
| A regular fork() call would have worked, if not for the fact
| that Chromium is multi-threaded and multi-process, and it's
| next to impossible to fork multiple processes as a group.
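A minimal sketch of this speculation pattern for the single-process case, using a plain fork() (hypothetical names; as the comment notes, this simple version is exactly what Chromium's multi-process architecture defeats):

```python
import os
import signal

def speculate(work, guess_was_right):
    """Run work() in a forked child as if the network guess is correct;
    commit the child if the guess held, otherwise discard it."""
    pid = os.fork()
    if pid == 0:
        work()                      # execute speculatively
        os._exit(0)
    if guess_was_right():
        os.waitpid(pid, 0)          # keep the speculative result
        return "committed"
    os.kill(pid, signal.SIGKILL)    # revert: throw the snapshot away
    os.waitpid(pid, 0)
    return "reverted"

print(speculate(lambda: None, lambda: True))   # committed
print(speculate(lambda: None, lambda: False))  # reverted
```

The "snapshot" here is just the child process itself: reverting is discarding it, which is why the scheme is cheap as long as everything fits in one forkable process.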
| salamander014 wrote:
| Sorry if I'm being thick, but why not just cache the
| response?
|
| If you are guessing at the data anyway, what's the
| difference?
|
| Why set up an entire speculative execution engine / runtime
| snapshot rollback framework when it sounds like adding
| heuristic decision caching would solve this problem?
| InvisGhost wrote:
| Sounds like they were caching it since they could execute
| it before getting the response. The difference is that
| they wanted to avoid the situation where they execute
| stale code that the server never would've served. So they
| can execute the stale code while waiting for the response
| then either toss the result or continue on with it once
| they determine if the server response changed.
| vient wrote:
| There is a QEMU fork used by the Nyx fuzzer that may be
| interesting to you: https://github.com/nyx-fuzz/QEMU-Nyx
|
| Basically, for fuzzing purposes speed is paramount, so they
| made some changes to speed up snapshot restoring. Don't know
| the limitations, but since it is used to fuzz full operating
| systems, there should not be many.
|
| I believe it should be faster than forking because why even
| patch QEMU otherwise.
| PhilipRoman wrote:
| Would love a low-tech version of this which simply suspends the
| process and puts all mapped pages in swap (no persistence across
| reboot ofc). I think it could be used for scheduling large
| memory-bound jobs whose resource usage is not known in advance.
| dividuum wrote:
| Not sure that's needed. Sending SIGSTOP (or using cgroup
| freeze) and letting the Linux memory management do its job
| should do most of that already.
| Agingcoder wrote:
| Yes I confirm it works - I've been using this very behavior
| for years
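A minimal sketch of the SIGSTOP variant of this (Linux/POSIX; the cgroup freezer does the same at the group level). Once the job is stopped, its pages are ordinary reclaim candidates for the kernel:

```python
import os
import signal
import time

pid = os.fork()
if pid == 0:
    # child: stand-in for a large memory-bound job
    while True:
        time.sleep(1)

os.kill(pid, signal.SIGSTOP)               # freeze the job
_, status = os.waitpid(pid, os.WUNTRACED)  # wait for the stop to land
assert os.WIFSTOPPED(status)               # now stopped; pages swappable
os.kill(pid, signal.SIGCONT)               # thaw it when scheduled again
os.kill(pid, signal.SIGTERM)               # cleanup for this demo
os.waitpid(pid, 0)
```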
| zeotroph wrote:
| I discovered CRIU in the video below (1h), "Container Migration
| and CRIU Details with Adrian Reber (Red Hat)"; it has a live demo
| and the details about how much "user space" it really is. Here
| with the RH podman fork of docker.
|
| Since everyone is treating containers as cattle CRIU doesn't seem
| to get much attention, which might be why a video and not a blog
| post was my first introduction.
|
| https://www.youtube.com/watch?v=-7DgNxyuz_o
| JeremyNT wrote:
| > Since everyone is treating containers as cattle CRIU doesn't
| seem to get much attention
|
| Yeah, I guess that's probably the reason. If you're engineering
| your workloads with the idea that the world might "poof" out
| from under you at any moment you'd never wonder about / reach
| for something like CRIU.
|
| It's a trick that I'd never much thought about, but now that
| I've learned it exists (so many years late) I find myself
| wondering about the path not taken here. It _feels_ like it
| should be incredibly useful... but I can't figure out exactly
| what I'd want to do with it myself.
| yencabulator wrote:
| > Since everyone is treating containers as cattle CRIU doesn't
| seem to get much attention
|
| Nah, it's more like "I don't trust that thing to not cause
| weird behavior in production".
|
| VM-level snapshots are standard practice[1] because the
| abstraction there is right-sized for being able to do that
| reliably. CRIU isn't, because it's trying to solve a much
| harder problem.
|
| [1]: And even there, beware cloning running memory state, you
| can get weird interactions from two identical parties trying to
| talk to the same 3rd service, separated by time. Cloning disk
| snapshots is _much_ safer, and even there you can screw up
| because of duplicate machine IDs, crypto keys, nonces, etc.
| znpy wrote:
| > Here with the RH podman fork of docker
|
| small nit: podman is not a docker fork, it's a completely
| different codebase written from scratch
| cpuguy83 wrote:
| It is absolutely not from scratch. It is definitely reusing
| docker code (or at least it used to).
| usixk wrote:
| seriously cool project, used it at a prev workplace to checkpoint
| http servers for absolutely dirt nasty start speeds
| bityard wrote:
| nginx gone wild?
| ahlCVA wrote:
| I once used CRIU to implement the hacky equivalent of save-lisp-
| and-die to speed up the startup process of a low-powered embedded
| system where the main application was misguidedly implemented in
| Erlang and loading all the code took minutes each time the device
| started. It worked better than it should have (though in the end
| it wasn't shipped because nobody (except the customer) cared
| enough about the startup behavior and eventually the product got
| canned (for different reasons)).
| phildenhoff wrote:
| What was misguided about using Erlang? That it was so expensive
| CPU-wise to start up?
| jlokier wrote:
| CRIU is used by LXD to save the state of an LXD container, very
| similar to suspending or snapshotting a virtual machine.
|
| Unfortunately, I was disappointed to find `lxc stop --stateful`
| couldn't save _any_ of my LXD containers. There was always some
| error or other. This is how I learned about CRIU, as it was due
| to limitations of CRIU when used with the sorts of things running
| in LXD.
|
|     # lxc stop --stateful test
|     (00.121636) Error (criu/namespaces.c:423): Can't dump nested uts namespace for 2685261
|     (00.121645) Error (criu/namespaces.c:682): Can't make utsns id
|     (00.150794) Error (criu/util.c:631): exited, status=1
|     (00.190680) Error (criu/util.c:631): exited, status=1
|     (00.191997) Error (criu/cr-dump.c:1768): Dumping FAILED.
|     Error: snapshot dump failed
|
| LXD is generally used with "distro-like" containers, like running
| a small Debian or Ubuntu distro, rather than single-application
| containers as are used with Docker.
|
| It turns out CRIU can't save the state of those types of
| containers, so in practice `lxc stop --stateful` never worked for
| me.
|
| I'd have to switch to VMs if I wanted their state saved across
| host reboots, but those lack the host-guest filesystem sharing
| behaviours that I needed.
|
| In practice this meant I had to live with never rebooting the
| host. Thankfully Linux just keeps on working for years without a
| reboot :-)
| virtuous_sloth wrote:
| Stephane Graber (key Incus, née LXD, contributor) just did a
| video about developing placement scriptlets in the Starlark
| language. The interesting thing is, if I'm interpreting what
| I saw correctly, his cluster was 6 beefy servers plus 3 decent-
| sized VMs, and the idea was, I think, that containers could get
| placed on the nested VMs, neatly solving the migration issue
| with containers. The interesting part was that the 3 VMs may
| themselves have been running on the cluster's own servers.
|
| I could be wrong, though. Interesting approach if true
| ranger_danger wrote:
| > Linux just keeps on working for years without a reboot
|
| Except I would strongly suggest not doing that as there have
| been some very nasty security issues fixed as of late.
| overspeed wrote:
| Great project!
|
| For long-running containerised simulations, this saves a lot of
| time on failures (as long as you have a safe place to write the
| snapshots to) by not restarting from zero every time.
| albertzeyer wrote:
| We considered using something like this to cache some Python
| program state and speed up startup, as the startup time was quite
| long for some of our scripts (due to slow NFS, but also importing
| lots of libs, like PyTorch or TensorFlow). We wanted to store the
| program state right after importing the modules and loading some
| static stuff, before executing the actual script or doing other
| dynamic stuff. So updating the script is still possible while
| keeping the same state.
|
| Back then, CRIU turned out to not be an option for us. E.g. one
| of the problems was that it was not possible to be used as non-
| root (https://github.com/checkpoint-restore/criu/pull/1930). I
| see that this PR was merged now, so maybe this works now? Not
| sure if there are other issues.
|
| We also considered DMTCP (https://github.com/dmtcp/dmtcp/) as
| another alternative to CRIU, but that had other issues (I don't
| remember).
|
| The solution I ended up with was to implement a fork server. Some
| server proc starts initially and only preloads the modules and
| maybe other things, and then waits. Once I want to execute some
| script, I can fork from the server and use this forked process
| right away. I used similar logic as in reptyr
| (https://github.com/nelhage/reptyr) to redirect the PTY. This
| worked quite well.
|
| https://github.com/albertz/python-preloaded
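A toy sketch of that fork-server idea (hypothetical code, not what python-preloaded actually does): pay the import cost once in the server, then fork a fresh worker per script so the modules are inherited copy-on-write while the script itself can still change. The real version additionally redirects the PTY, as described above.

```python
import os
import sys

import json  # stand-in for slow imports like torch/tensorflow

def run_script(source):
    """Fork a worker that runs `source` with the modules above already
    imported; return the worker's exit status."""
    pid = os.fork()
    if pid == 0:
        # child: startup cost is just the fork, not the imports
        exec(compile(source, "<script>", "exec"), {"json": json})
        sys.stdout.flush()   # os._exit skips normal buffer flushing
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

run_script("print(json.dumps({'started': 'fast'}))")
```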
| Retr0id wrote:
| I've interacted with some of these features as a means of code
| injection into running processes. (checkpoint, patch the
| checkpoint data, restore)
|
| It's useful because, by design, it's difficult for the process to
| even notice it's been stopped. And while it's stopped, you can
| apply arbitrary patches completely atomically.
| eikenberry wrote:
| I'm keeping an eye on this project as a way to give containers
| used with immutable distro installs (eg. silverblue) a kind of
| user-space hibernation feature. So I could hibernate different
| container workspaces at will. I would find this very useful for
| development projects where I often have a lot of state that I
| lose whenever I need to reboot or whatever. Last time I looked
| there were still too many limitations on what it could checkpoint,
| but maybe one day.
| yencabulator wrote:
| CRIU is 11 years old; don't expect it to be any more usable in
| the near future.
| Manouchehri wrote:
| rr took about 9 years to get first class aarch64 support.
|
| https://github.com/rr-debugger/rr/issues/1373
| nravic wrote:
| We do this using CRIU right now!
| https://github.com/cedana/cedana
|
| In fact one of our customer's use cases is exactly what you
| describe, allowing users to "hibernate" container workspaces.
| jasonvorhe wrote:
| Interesting. I built a very primitive prototype for a hosting
| company a while back where I wanted to figure out if we could
| offer something close to a live migration of one Linux account on
| host x to host y without causing a lot of downtime. The product
| didn't support containers and isolation was just based on Linux
| user accounts so we couldn't just use Docker.
|
| Just a few months ago I was talking to a startup founder at
| KubeCon who built a product based on CRIU. Unfortunately I forgot
| the company's name. (And I can't find that git repo with the
| prototype anywhere, even in my backups. Sad.)
| nravic wrote:
| I'm probably the cofounder of the guy you spoke with! Here's
| our repo: https://github.com/cedana/cedana
| arjvik wrote:
| For my OS class's final project last quarter, I built a way to
| live-migrate a process (running on a custom OS we built from
| scratch) from one Raspberry Pi to another, essentially using
| checkpoint/restore!
|
| Getting the code cleaned up enough to post it has been on my to-
| do list for quite some time, and this has inspired me to do it
| soon!
| arjvik wrote:
| Should mention, the coolest part is that I never sent over
| "all" the memory used by the process, because it was difficult
| to tell what is needed and what isn't. Instead, I was clever
| with virtual memory, and when a page of memory was needed that
| wasn't loaded by the recipient Pi, it would request and lazy-
| load just that page from the provider Pi, and with some careful
| bookkeeping mark that the page was owned by the recipient Pi.
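A toy model of that lazy-loading scheme (illustrative only, not the project's code): the recipient starts with an empty page table and pulls each page from the provider on first touch, taking ownership as it goes.

```python
PAGE_SIZE = 4096

# stand-in for the provider Pi's memory; the real system answers
# these requests over the network
provider_pages = {0: b"a" * PAGE_SIZE, 1: b"b" * PAGE_SIZE}

class LazyMemory:
    def __init__(self, fetch):
        self.fetch = fetch   # request a page from the provider
        self.owned = {}      # pages transferred and now owned locally

    def read(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        if page not in self.owned:               # the "page fault"
            self.owned[page] = self.fetch(page)  # lazy-load, take ownership
        return self.owned[page][offset]

mem = LazyMemory(lambda page: provider_pages[page])
assert mem.read(PAGE_SIZE) == ord("b")  # faults in page 1 only
assert 0 not in mem.owned               # untouched pages never transferred
```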
| zozbot234 wrote:
| > Instead, I was clever with virtual memory, and when a page
| of memory was needed that wasn't loaded by the recipient Pi,
| it would request and lazy-load just that page from the
| provider Pi, and with some careful bookkeeping mark that the
| page was owned by the recipient Pi.
|
| I wonder if that "trick" can be extended to a full
| implementation of distributed shared memory, i.e. multiple
| nodes running separate tasks in a single address space and
| implementing cache coherence over the network. Probably needs
| quite a bit of extra compiler/runtime support so it wouldn't
| really apply to standard binaries, but it might still be
| useful nonetheless.
| nradclif wrote:
| Partitioned Global Address Space (PGAS) compilers/runtimes
| do something similar to that. Unified Parallel C
| (UPC,https://upc.lbl.gov/) and Coarray Fortran/Coarray C++
| (https://docs.nersc.gov/development/programming-
| models/coarra...) are good examples commonly used in HPC.
| Fabric Attached Memory (OpenFAM,
| https://openfam.github.io/) is another example.
| monus wrote:
| I built crik[1] to orchestrate CRIU operations inside a container
| running in Kubernetes so that you can migrate containers when
| a spot node gets a shutdown signal. Presented it at KubeCon Paris
| 2024 [2] with a deep dive for those interested in the technical
| details.
|
| [1]: https://github.com/qawolf/crik
|
| [2]: The Party Must Go On - Resume Pods After Spot Instance
| Shutdown, https://kccnceu2024.sched.com/event/1YeP3
| EatFlamingDeath wrote:
| Can this be used for something like the Steam Deck? It would be
| nice for when you're running a game and need to stop, but will
| resume gameplay later.
| whartung wrote:
| How do things like this handle sockets? Is there some kind of
| first class event that the app can detect, or does it just
| "close" them all and assume the app can cleanly reconnect to
| reestablish them (once they detect that the socket has rudely
| closed on them)?
| loeg wrote:
| https://github.com/checkpoint-restore/criu?tab=readme-ov-fil...
|
| > One of the CRIU features is the ability to save and restore
| state of a TCP socket without breaking the connection. This
| functionality is considered to be useful by itself, and we have
| it available as the libsoccr library.
| touisteur wrote:
| There are many ways to go about it. The standard approach with
| libsoccr (the CRIU library for TCP socket checkpoint/restore)
| is to install a firewall rule to filter packets during the
| checkpoint and let TCP resend whatever it needs to resync
| whenever the socket is restored.
|
| If you want your original process to continue living after the
| checkpoint and not lose packets during checkpoint, you can go a
| pretty long way with the 'plug' tc qdisc and IFBs. And if you're
| adventurous, lots of support for getsockopt/setsockopt and
| ioctls has been or is being merged into io_uring, so
| checkpointing a big-buffered TCP socket can cost under 100us,
| even less IIRC.
___________________________________________________________________
(page generated 2024-06-21 23:00 UTC)