[HN Gopher] CRIU, a project to implement checkpoint/restore func...
       ___________________________________________________________________
        
       CRIU, a project to implement checkpoint/restore functionality for
       Linux
        
       Author : JeremyNT
       Score  : 103 points
       Date   : 2024-06-21 16:59 UTC (6 hours ago)
        
 (HTM) web link (criu.org)
 (TXT) w3m dump (criu.org)
        
       | londons_explore wrote:
       | I pulled apart the innards of CRIU because I needed to be able to
       | checkpoint and restore a process within a few microseconds.
       | 
        | The project ended up being a dead end because it turned out
        | that running my program in a QEMU whole-system VM and then
        | fork()ing QEMU was faster.
        
         | kn100 wrote:
         | could you tell me a bit more about what you're doing?
        
           | londons_explore wrote:
           | The goal was to have a web browser (chromium) able to 'guess'
            | stuff about what response it will get from the network (i.e.
            | will the server return the same JavaScript blob as last
            | time?). We start executing the JavaScript as if the guess is
            | correct. If the guess is wrong, we revert to a snapshot.
           | 
           | It lets you make good use of CPU time whilst waiting for the
           | network.
           | 
           | It turns out simple heuristics can get 99% accuracy on the
           | question of 'will the server return the same result as last
            | time for this non-cacheable response'.
           | 
           | However, since my machine has many CPU cores it made sense to
           | have many 'speculative' copies of the browser going at once.
           | 
            | A regular fork() call would have worked, if not for the fact
            | that chromium is multi-threaded and multi-process, and it's
            | next to impossible to fork multiple processes as a group.
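The fork-as-snapshot idea can be sketched for a single process in Python; `speculate`, `work`, and `guess_ok` are illustrative names, not part of the actual Chromium experiment described above:

```python
import os

def speculate(work, guess_ok):
    """Run `work` in a forked child, using fork() as a cheap
    copy-on-write snapshot; keep the child's output only if the
    guess later proves correct, else discard it."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                 # child: the speculative copy
        os.close(r)
        os.write(w, work())
        os._exit(0)
    os.close(w)                  # parent
    out = os.read(r, 4096)
    os.close(r)
    os.waitpid(pid, 0)
    # "Reverting to the snapshot" is simply discarding the result.
    return out if guess_ok() else None
```

With multiple cores, a parent could keep several such speculative children running at once, which is the part that breaks down for a multi-process browser.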
        
             | salamander014 wrote:
             | Sorry if I'm being thick, but why not just cache the
             | response?
             | 
             | If you are guessing at the data anyway, what's the
             | difference?
             | 
             | Why set up an entire speculative execution engine / runtime
             | snapshot rollback framework when it sounds like adding
             | heuristic decision caching would solve this problem?
        
               | InvisGhost wrote:
               | Sounds like they were caching it since they could execute
               | it before getting the response. The difference is that
               | they wanted to avoid the situation where they execute
               | stale code that the server never would've served. So they
               | can execute the stale code while waiting for the response
               | then either toss the result or continue on with it once
                | they determine whether the server response changed.
        
         | vient wrote:
         | There is a QEMU fork used by Nyx fuzzer, may be interesting to
         | you https://github.com/nyx-fuzz/QEMU-Nyx
         | 
          | Basically, for fuzzing purposes speed is paramount, so they
          | made some changes to speed up snapshot restoring. I don't know
          | the limitations, but since it is used to fuzz full operating
          | systems, there should not be many.
          | 
          | I believe it should be faster than forking; why even patch
          | QEMU otherwise?
        
       | PhilipRoman wrote:
       | Would love a low-tech version of this which simply suspends the
       | process and puts all mapped pages in swap (no persistence across
       | reboot ofc). I think it could be used for scheduling large
       | memory-bound jobs whose resource usage is not known in advance.
        
         | dividuum wrote:
          | Not sure that's needed. Sending SIGSTOP (or using cgroup
          | freeze) and letting the Linux memory management do its job
          | should do most of that already.
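A minimal sketch of this suggestion, assuming a Linux host where /proc is mounted; the helper names are invented:

```python
import os
import signal

def freeze(pid):
    """Suspend the process; under memory pressure the kernel is
    then free to page its (now idle) memory out to swap."""
    os.kill(pid, signal.SIGSTOP)

def thaw(pid):
    """Resume it; touched pages fault back in from swap on demand."""
    os.kill(pid, signal.SIGCONT)

def state(pid):
    """Process state letter from /proc/<pid>/stat ('T' = stopped)."""
    with open(f"/proc/{pid}/stat") as f:
        # Field 3 follows the parenthesised comm, which may itself
        # contain spaces, so split on the closing paren first.
        return f.read().split(") ")[-1].split()[0]
```

Unlike CRIU this gives no persistence across reboot, but for "park this big job for a while" scheduling it is often enough.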
        
           | Agingcoder wrote:
           | Yes I confirm it works - I've been using this very behavior
           | for years
        
       | zeotroph wrote:
       | I discovered CRIU in this video below (1h) "Container Migration
       | and CRIU Details with Adrian Reber (Red Hat)", it has a live demo
       | and the details about how much "user space" it really is. Here
       | with the RH podman fork of docker.
       | 
       | Since everyone is treating containers as cattle CRIU doesn't seem
        | to get much attention, which might be why a video and not a blog
        | post was my first introduction to it.
       | 
       | https://www.youtube.com/watch?v=-7DgNxyuz_o
        
         | JeremyNT wrote:
         | > Since everyone is treating containers as cattle CRIU doesn't
         | seem to get much attention
         | 
         | Yeah, I guess that's probably the reason. If you're engineering
         | your workloads with the idea that the world might "poof" out
         | from under you at any moment you'd never wonder about / reach
         | for something like CRIU.
         | 
         | It's a trick that I'd never much thought about, but now that
         | I've learned it exists (so many years late) I find myself
         | wondering about the path not taken here. It _feels_ like it
          | should be incredibly useful... but I can't figure out exactly
         | what I'd want to do with it myself.
        
         | yencabulator wrote:
         | > Since everyone is treating containers as cattle CRIU doesn't
         | seem to get much attention
         | 
         | Nah, it's more like "I don't trust that thing to not cause
         | weird behavior in production".
         | 
         | VM-level snapshots are standard practice[1] because the
         | abstraction there is right-sized for being able to do that
         | reliably. CRIU isn't, because it's trying to solve a much
         | harder problem.
         | 
         | [1]: And even there, beware cloning running memory state, you
         | can get weird interactions from two identical parties trying to
         | talk to the same 3rd service, separated by time. Cloning disk
         | snapshots is _much_ safer, and even there you can screw up
         | because of duplicate machine IDs, crypto keys, nonces, etc.
        
         | znpy wrote:
         | > Here with the RH podman fork of docker
         | 
         | small nit: podman is not a docker fork, it's a completely
         | different codebase written from scratch
        
           | cpuguy83 wrote:
           | It is absolutely not from scratch. It is definitely reusing
           | docker code (or at least it used to).
        
       | usixk wrote:
       | seriously cool project, used it at a prev workplace to checkpoint
       | http servers for absolutely dirt nasty start speeds
        
         | bityard wrote:
         | nginx gone wild?
        
       | ahlCVA wrote:
       | I once used CRIU to implement the hacky equivalent of save-lisp-
       | and-die to speed up the startup process of a low-powered embedded
       | system where the main application was misguidedly implemented in
       | Erlang and loading all the code took minutes each time the device
       | started. It worked better than it should have (though in the end
       | it wasn't shipped because nobody (except the customer) cared
       | enough about the startup behavior and eventually the product got
       | canned (for different reasons)).
        
         | phildenhoff wrote:
         | What was misguided about using Erlang? That it was so expensive
         | CPU-wise to start up?
        
       | jlokier wrote:
       | CRIU is used by LXD to save the state of an LXD container, very
       | similar to suspending or snapshotting a virtual machine.
       | 
        | Unfortunately, I was disappointed to find `lxc stop --stateful`
        | couldn't save _any_ of my LXD containers. There was always some
        | error or other. This is how I learned about CRIU, as it was due
        | to limitations of CRIU when used with the sorts of things running
        | in LXD.
        | 
        |     # lxc stop --stateful test
        |     (00.121636) Error (criu/namespaces.c:423): Can't dump nested uts namespace for 2685261
        |     (00.121645) Error (criu/namespaces.c:682): Can't make utsns id
        |     (00.150794) Error (criu/util.c:631): exited, status=1
        |     (00.190680) Error (criu/util.c:631): exited, status=1
        |     (00.191997) Error (criu/cr-dump.c:1768): Dumping FAILED.
        |     Error: snapshot dump failed
       | 
       | LXD is generally used with "distro-like" containers, like running
       | a small Debian or Ubuntu distro, rather than single-application
       | containers as are used with Docker.
       | 
       | It turns out CRIU can't save the state of those types of
        | containers, so in practice `lxc stop --stateful` never worked for
       | me.
       | 
        | I'd have to switch to VMs if I wanted their state saved across
        | host reboots, but VMs lack the host-guest filesystem sharing
        | behaviour that I needed.
       | 
       | In practice this meant I had to live with never rebooting the
       | host. Thankfully Linux just keeps on working for years without a
       | reboot :-)
        
         | virtuous_sloth wrote:
          | Stephane Graber (key Incus, née LXD, contributor) just did a
          | video about developing placement scriptlets in the Starlark
          | language, but the interesting thing is, if I'm interpreting
          | what I saw correctly, his cluster was 6 beefy servers plus 3
          | decent-sized VMs, and the idea was, I think, that containers
          | could get placed on the nested VMs, neatly solving the
          | migration issue with containers. The interesting part was that
          | the 3 VMs may themselves have been running on the cluster's
          | own servers.
          | 
          | I could be wrong, though. Interesting approach if true.
        
         | ranger_danger wrote:
         | > Linux just keeps on working for years without a reboot
         | 
         | Except I would strongly suggest not doing that as there have
         | been some very nasty security issues fixed as of late.
        
       | overspeed wrote:
       | Great project!
       | 
       | For long running containerised simulations, this saves a lot of
       | time on failures ( as long as you have a safe place to write the
       | snapshots to ) by not restarting from 0 every time.
        
       | albertzeyer wrote:
        | We considered using something like this to cache some Python
        | program state to speed up startup, as the startup time was quite
       | long for some of our scripts (due to slow NFS, but also importing
       | lots of libs, like PyTorch or TensorFlow). We wanted to store the
       | program state right after importing the modules and loading some
       | static stuff, before executing the actual script or doing other
       | dynamic stuff. So updating the script is still possible while
       | keeping the same state.
       | 
        | Back then, CRIU turned out not to be an option for us. E.g. one
        | of the problems was that it could not be used as non-root
        | (https://github.com/checkpoint-restore/criu/pull/1930). I see
        | that this PR has been merged now, so maybe this works now? Not
        | sure if there are other issues.
       | 
       | We also considered DMTCP (https://github.com/dmtcp/dmtcp/) as
        | another alternative to CRIU, but that had other issues (I don't
        | remember which).
       | 
        | The solution I ended up with was a fork server. A server process
        | starts initially, preloads the modules and maybe other things,
        | and then waits. Once I want to execute some script, I can fork
        | from the server and use the forked process right away. I used
        | logic similar to that of reptyr
       | (https://github.com/nelhage/reptyr) to redirect the PTY. This
       | worked quite well.
       | 
       | https://github.com/albertz/python-preloaded
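A minimal sketch of the fork-server idea (not the actual python-preloaded code; `serve` and its parameters are illustrative):

```python
import importlib
import os

def serve(preload, run_script):
    """Pay the heavy import cost once in a long-lived parent, then
    fork a child per request: the child inherits the warmed-up
    interpreter state via copy-on-write and starts instantly."""
    for mod in preload:
        importlib.import_module(mod)   # e.g. torch, tensorflow
    pid = os.fork()
    if pid == 0:
        run_script()                   # the up-to-date script body
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Because the script body runs only in the fork, the script file itself can change between requests while the expensive imported state stays warm, which is the property described above.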
        
       | Retr0id wrote:
       | I've interacted with some of these features as a means of code
       | injection into running processes. (checkpoint, patch the
       | checkpoint data, restore)
       | 
       | It's useful because, by design, it's difficult for the process to
       | even notice it's been stopped. And while it's stopped, you can
       | apply arbitrary patches completely atomically.
        
       | eikenberry wrote:
       | I'm keeping an eye on this project as a way to give containers
       | used with immutable distro installs (eg. silverblue) a kind of
       | user-space hibernation feature. So I could hibernate different
       | container workspaces at will. I would find this very useful for
       | development projects where I often have a lot of state that I
       | lose whenever I need to reboot or whatever. Last time I looked
        | there were still too many limitations on what it could
       | but maybe one day.
        
         | yencabulator wrote:
         | CRIU is 11 years old, don't expect it to be any more usable in
         | the near future.
        
           | Manouchehri wrote:
           | rr took about 9 years to get first class aarch64 support.
           | 
           | https://github.com/rr-debugger/rr/issues/1373
        
         | nravic wrote:
         | We do this using CRIU right now!
         | https://github.com/cedana/cedana
         | 
         | In fact one of our customer's use cases is exactly what you
         | describe, allowing users to "hibernate" container workspaces.
        
       | jasonvorhe wrote:
        | Interesting. I built a very primitive prototype for a hosting
        | company a while back where I wanted to figure out if we could
       | offer something close to a live migration of one Linux account on
       | host x to host y without causing a lot of downtime. The product
       | didn't support containers and isolation was just based on Linux
       | user accounts so we couldn't just use Docker.
       | 
       | Just a few months ago I was talking to a startup founder at
       | KubeCon who built a product based on CRIU. Unfortunately I forgot
       | the company's name. (And I can't find that git repo with the
       | prototype anywhere, even in my backups. Sad.)
        
         | nravic wrote:
         | I'm probably the cofounder of the guy you spoke with! Here's
         | our repo: https://github.com/cedana/cedana
        
       | arjvik wrote:
       | For my OS class's final project last quarter, I built a way to
       | live-migrate a process (running on a custom OS we built from
       | scratch) from one Raspberry Pi to another, essentially using
       | checkpoint/restore!
       | 
       | Getting the code cleaned up enough to post it has been on my to-
       | do list for quite some time, and this has inspired me to do it
       | soon!
        
         | arjvik wrote:
         | Should mention, the coolest part is that I never sent over
          | "all" the memory used by the process, because it was difficult
          | to tell what was needed and what wasn't. Instead, I was clever
         | with virtual memory, and when a page of memory was needed that
         | wasn't loaded by the recipient Pi, it would request and lazy-
         | load just that page from the provider Pi, and with some careful
         | bookkeeping mark that the page was owned by the recipient Pi.
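The bookkeeping described above can be illustrated with a toy model; `PageMigrator` and its fields are invented names, and a real implementation would of course catch page faults in the kernel rather than via a dict lookup:

```python
PAGE_SIZE = 4096

class PageMigrator:
    """Toy model of lazy page migration: the recipient starts with
    no pages and pulls each one from the provider on first access,
    recording the transfer of ownership."""

    def __init__(self, provider_pages):
        self.provider = provider_pages   # page number -> bytes ("over the wire")
        self.local = {}                  # pages copied so far
        self.owned = set()               # bookkeeping: pages now ours

    def read(self, addr):
        page, off = divmod(addr, PAGE_SIZE)
        if page not in self.local:       # the "page fault"
            self.local[page] = self.provider[page]  # fetch from provider Pi
            self.owned.add(page)
        return self.local[page][off]
```

On Linux a userspace analogue of this trick exists via userfaultfd, which CRIU itself can use for lazy restore.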
        
           | zozbot234 wrote:
           | > Instead, I was clever with virtual memory, and when a page
           | of memory was needed that wasn't loaded by the recipient Pi,
           | it would request and lazy-load just that page from the
           | provider Pi, and with some careful bookkeeping mark that the
           | page was owned by the recipient Pi.
           | 
           | I wonder if that "trick" can be extended to a full
           | implementation of distributed shared memory, i.e. multiple
           | nodes running separate tasks in a single address space and
           | implementing cache coherence over the network. Probably needs
           | quite a bit of extra compiler/runtime support so it wouldn't
           | really apply to standard binaries, but it might still be
           | useful nonetheless.
        
             | nradclif wrote:
             | Partitioned Global Address Space (PGAS) compilers/runtimes
             | do something similar to that. Unified Parallel C
              | (UPC, https://upc.lbl.gov/) and Coarray Fortran/Coarray C++
             | (https://docs.nersc.gov/development/programming-
             | models/coarra...) are good examples commonly used in HPC.
             | Fabric Attached Memory (OpenFAM,
             | https://openfam.github.io/) is another example.
        
       | monus wrote:
       | I built crik[1] to orchestrate CRIU operations inside a container
       | running in Kubernetes so that you can migrate containers when
        | a spot node gets a shutdown signal. I presented it at KubeCon
        | Paris
       | 2024 [2] with a deep dive for those interested in the technical
       | details.
       | 
       | [1]: https://github.com/qawolf/crik
       | 
       | [2]: The Party Must Go On - Resume Pods After Spot Instance
       | Shutdown, https://kccnceu2024.sched.com/event/1YeP3
        
       | EatFlamingDeath wrote:
        | Can this be used for something like the Steam Deck? It would be
        | nice for when you are running a game and need to stop but want
        | to resume gameplay later.
        
       | whartung wrote:
       | How do things like this handle sockets? Is there some kind of
       | first class event that the app can detect, or does it just
       | "close" them all and assume the app can cleanly reconnect to
       | reestablish them (once they detect that the socket has rudely
       | closed on them)?
        
         | loeg wrote:
         | https://github.com/checkpoint-restore/criu?tab=readme-ov-fil...
         | 
         | > One of the CRIU features is the ability to save and restore
         | state of a TCP socket without breaking the connection. This
         | functionality is considered to be useful by itself, and we have
         | it available as the libsoccr library.
        
         | touisteur wrote:
         | There are many ways to go about it. Standard way recommended
         | with libsoccr (the criu library to handle tcp socket
         | checkpoint/restore) is to install a firewall rule to filter
         | packets during checkpoint and let tcp resend whatever it needs
         | to resync whenever the socket is restored.
         | 
         | If you want your original process to continue living after the
         | checkpoint and not lose packets during checkpoint, you can go a
          | pretty long way with the 'plug' tc qdisc and IFBs. And if
          | you're adventurous, lots of support for getsockopt/setsockopt
          | and ioctls has been or is being merged into io_uring, so
          | checkpointing a big-buffered TCP socket can cost under 100us,
          | even less IIRC.
        
       ___________________________________________________________________
       (page generated 2024-06-21 23:00 UTC)