https://iximiuz.com/en/posts/containers-vs-pods/

Containers vs. Pods - Taking a Deeper Look

October 28, 2021
Containers, Kubernetes, Linux / Unix

Containers could have become a lightweight VM replacement. However, the most widely used form of containers, standardized by Docker/OCI, encourages you to have just one service per container. Such an approach has a bunch of pros - increased isolation, simplified horizontal scaling, higher reusability, etc. However, there is a big con - in the wild, virtual (or physical) machines rarely run just one service.

While Docker tries to offer some workarounds to create multi-service containers, Kubernetes makes a bolder step and chooses a group of cohesive containers, called a Pod, as the smallest deployable unit.

When I stumbled upon Kubernetes a few years ago, my prior VM and bare-metal experience allowed me to get the idea of Pods pretty quickly. Or so thought I...

When you start working with Kubernetes, one of the first things you learn is that every pod gets a unique IP and hostname and that within a pod, containers can talk to each other via localhost. So, it's kinda obvious - a pod is like a tiny little server.

After a while, though, you realize that every container in a pod gets an isolated filesystem and that from inside one container, you don't see processes running in other containers of the same pod. Ok, fine! Maybe a pod is not a tiny little server but just a group of containers with a shared network stack.

But then you learn that containers in one pod can communicate via shared memory! So, probably the network namespace is not the only shared thing...

This last finding was the final straw for me.
So, I decided to have a deep dive and see with my own eyes:

* How Pods are implemented under the hood
* What is the actual difference between a Pod and a Container
* How one can create Pods using Docker.

And on the way, I hope it'll help me to solidify my Linux, Docker, and Kubernetes skills.

Examining a container

The OCI Runtime Spec doesn't limit container implementations to only Linux containers, i.e., the ones implemented with namespaces and cgroups. However, unless stated otherwise, the word container in this article refers to this rather traditional form.

Setting up a playground

Before taking a look at the namespaces and cgroups constituting a container, let's set up a playground real quick:

    $ cat > Vagrantfile </cgroup:

    $ PID=$(docker inspect --format '{{.State.Pid}}' foo)

    # Check cgroupfs node for the container main process (4727).
    $ cat /proc/${PID}/cgroup
    11:freezer:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    10:blkio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    9:rdma:/
    8:pids:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    7:devices:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    6:cpuset:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    3:net_cls,net_prio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    2:perf_event:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    1:name=systemd:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
    0::/system.slice/containerd.service

Hm... Seems like Docker uses the /docker/<container-id> pattern. Ok, anyways:

    $ ID=$(docker inspect --format '{{.Id}}' foo)

    # Check the memory limit.
    $ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
    536870912  # Yay!
    # It's the 512MB we requested!

    # See the CPU limits.
    $ ls /sys/fs/cgroup/cpu/docker/${ID}

Interestingly, starting a container without explicitly setting any resource limits configures a cgroup anyway. I haven't really checked, but my guess is that while CPU and RAM consumption is unrestricted by default, cgroups might be used to limit some device access from inside a container.

Here is how I picture containers in my head after this investigation:

Docker container.

Examining a pod

Now, let's take a look at Kubernetes Pods. Much like with containers, the implementation of pods can vary between different CRI runtimes. For instance, when Kata Containers is used as one of the supported runtime classes, some of the pods can be true virtual machines! And expectedly, the VM-based pods differ in implementation and capabilities from the pods implemented with traditional Linux containers.

To keep the comparison between containers and pods fair, the Pod examination will be done on a Kubernetes cluster that uses the containerd/runc runtime, which is exactly what Docker uses under the hood to run containers.

Setting up a playground

This time the playground is set up using minikube with the VirtualBox driver and containerd runtime. To quickly install minikube and kubectl, you can use the handy arkade tool written by Alex Ellis:

    # Install arkade
    $ curl -sLS https://get.arkade.dev | sh

    $ arkade get kubectl minikube

    $ minikube start --driver virtualbox --container-runtime containerd

For the guinea-pig pod, the following would do:

    $ kubectl --context=minikube apply -f - < 'ipc:[4026532717]'
    lrwxrwxrwx 1 root root 0 Oct 24 14:05 mnt -> 'mnt:[4026532719]'
    lrwxrwxrwx 1 root root 0 Oct 24 14:05 net -> 'net:[4026532614]'
    lrwxrwxrwx 1 root root 0 Oct 24 14:05 pid -> 'pid:[4026532720]'
    lrwxrwxrwx 1 root root 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

    # sleep container
    $ sudo ls -l /proc/5035/ns
    ...
    lrwxrwxrwx 1 100 101 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
    lrwxrwxrwx 1 100 101 0 Oct 24 14:05 mnt -> 'mnt:[4026532721]'
    lrwxrwxrwx 1 100 101 0 Oct 24 14:05 net -> 'net:[4026532614]'
    lrwxrwxrwx 1 100 101 0 Oct 24 14:05 pid -> 'pid:[4026532722]'
    lrwxrwxrwx 1 100 101 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

While it might be tricky to notice, the httpbin and sleep containers actually reuse the net, uts, and ipc namespaces of the pause container! Again, this can be cross-checked with crictl:

    # Inspect httpbin container.
    $ sudo crictl inspect dfb1cd29ab750
    {
      ...
      "namespaces": [
        { "type": "pid" },
        { "type": "ipc", "path": "/proc/4966/ns/ipc" },
        { "type": "uts", "path": "/proc/4966/ns/uts" },
        { "type": "mount" },
        { "type": "network", "path": "/proc/4966/ns/net" }
      ],
      ...
    }

    # Inspect sleep container.
    $ sudo crictl inspect 097d4fe8a7002
    ...

I think the above findings perfectly explain the ability of containers in the same pod:

* to talk to each other
  + via localhost and/or
  + using IPC means (shared memory, message queues, etc.)
* to have a shared domain and hostname.

However, after seeing how all these namespaces are freely reused between containers, I started to suspect that the default boundaries can be shaken. And indeed, a more thorough read of the Pod API spec showed that with the shareProcessNamespace flag set to true, a pod's containers will have four common namespaces instead of the default three. But there was a more shocking finding - the hostIPC, hostNetwork, and hostPID flags can make the containers use the corresponding host's namespaces.

Interestingly, the CRI API spec seems to be even more flexible. At least syntactically, it allows scoping the net, pid, and ipc namespaces to either CONTAINER, POD, or NODE. So, hypothetically, a pod where containers cannot talk to each other via localhost can be constructed.

Inspecting pod's cgroups

Ok, what's up with pod's cgroups?
systemd-cgls can nicely visualize the cgroups hierarchy:

    $ sudo systemd-cgls
    Control group /:
    -.slice
    +-kubepods
    | +-burstable
    | | +-pod4a8d5c3e-3821-4727-9d20-965febbccfbb
    | | | +-f0e87a93304666766ab139d52f10ff2b8d4a1e6060fc18f74f28e2cb000da8b2
    | | | | +-4966 /pause
    | | | +-dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8
    | | | | +-5001 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    | | | | +-5016 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    | | | +-097d4fe8a7002d69d6c78899dcf6731d313ce8067ae3f736f252f387582e55ad
    | | |   +-5035 /bin/sleep 3650d
    ...

So, the pod itself gets a parent node, and every container can be tweaked separately as well. This matches my expectations based on the fact that in the Pod manifest, resource limits can be set for every container in the pod individually.

At this moment, a Pod in my head looks something like this:

Kubernetes pod.

Implementing Pods with Docker

If a Pod under the hood is implemented as a bunch of semi-fused containers with a common cgroup parent, will it be possible to reproduce a Pod-like construct using Docker?

I recently tried something similar to make multiple containers listen on the same socket, and I know that Docker allows creating a container that reuses an existing network namespace with the docker run --network container:<name|id> syntax. But I also know that the OCI Runtime Spec defines only create and start commands. So, when you execute a command inside an existing container with docker exec, you actually run, i.e., create then start, a completely fresh container that just happens to reuse all the namespaces of the target container (proofs 1 & 2). It makes me pretty confident that Pods can be reproduced using the standard Docker commands.

As a playground, the original vagrant box with Docker would do.
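The namespace-reuse observations in this article, both for the pod's containers and for docker exec, come from comparing /proc/<pid>/ns links by eye. That comparison can be sketched as a small shell helper. This is a minimal sketch and not part of the original setup: the function name shared_namespaces and the PROC_ROOT variable (an override of /proc, handy for dry runs) are hypothetical, and reading the links of processes owned by other users generally requires a root shell.

```shell
# shared_namespaces PID_A PID_B
# Prints, one per line, the namespace types that two processes share.
# Two processes share a namespace when their /proc/<pid>/ns/<type> links
# point to the same target, e.g. 'net:[4026532614]'.
# PROC_ROOT is a hypothetical override of /proc, used only for dry runs.
shared_namespaces() {
  local pid_a=$1 pid_b=$2 ns a b
  local root=${PROC_ROOT:-/proc}
  for ns in ipc mnt net pid uts; do
    # Skip a namespace type if either link cannot be read
    # (missing process, insufficient permissions).
    a=$(readlink "${root}/${pid_a}/ns/${ns}" 2>/dev/null) || continue
    b=$(readlink "${root}/${pid_b}/ns/${ns}" 2>/dev/null) || continue
    if [ "$a" = "$b" ]; then
      echo "$ns"
    fi
  done
}
```

Given the crictl output shown earlier, calling shared_namespaces 4966 5001 from a root shell on the minikube node would be expected to list ipc, net, and uts for the pause and httpbin processes. The same helper can be pointed at the PIDs of the Docker containers started in the rest of this section.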
I'll also use an extra package to simplify dealing with cgroups:

    $ sudo apt-get install cgroup-tools

Firstly, a parent cgroup entry needs to be configured. For the sake of brevity, I'll use only cpu and memory controllers:

    $ sudo cgcreate -g cpu,memory:/pod-foo

    # Check if the corresponding folders were created:
    $ ls -l /sys/fs/cgroup/cpu/pod-foo/
    $ ls -l /sys/fs/cgroup/memory/pod-foo/

Secondly, a sandbox container should be created:

    $ docker run -d --rm \
      --name foo_sandbox \
      --cgroup-parent /pod-foo \
      --ipc 'shareable' \
      alpine sleep infinity

Lastly, the actual containers are started, reusing the namespaces of the sandbox container:

    # app (httpbin)
    $ docker run -d --rm \
      --name app \
      --cgroup-parent /pod-foo \
      --network container:foo_sandbox \
      --ipc container:foo_sandbox \
      kennethreitz/httpbin

    # sidecar (sleep)
    $ docker run -d --rm \
      --name sidecar \
      --cgroup-parent /pod-foo \
      --network container:foo_sandbox \
      --ipc container:foo_sandbox \
      curlimages/curl sleep 365d

Have you noticed which namespace I omitted? Right, I couldn't share the uts namespace between containers. Seems like this possibility is not exposed currently in the docker run command. Well, that's a pity, of course. But apart from the uts namespace, it's a success!

The cgroups look much like the ones created by Kubernetes itself:

    $ sudo systemd-cgls memory
    Controller memory; Control group /:
    +-pod-foo
    | +-488d76cade5422b57ab59116f422d8483d435a8449ceda0c9a1888ea774acac7
    | | +-27865 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    | | +-27880 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    | +-9166a87f9a96a954b10ec012104366da9f1f6680387ef423ee197c61d37f39d7
    | | +-27977 sleep 365d
    | +-c7b0ec46b16b52c5e1c447b77d67d44d16d78f9a3f93eaeb3a86aa95e08e28b6
    |   +-27743 sleep infinity

The global list of namespaces also looks familiar:

    $ sudo lsns
            NS TYPE NPROCS   PID USER COMMAND
    ...
    4026532157 mnt       1 27743 root sleep infinity
    4026532158 uts       1 27743 root sleep infinity
    4026532159 ipc       4 27743 root sleep infinity
    4026532160 pid       1 27743 root sleep infinity
    4026532162 net       4 27743 root sleep infinity
    4026532218 mnt       2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    4026532219 uts       2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    4026532220 pid       2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
    4026532221 mnt       1 27977 _apt sleep 365d
    4026532222 uts       1 27977 _apt sleep 365d
    4026532223 pid       1 27977 _apt sleep 365d

And the httpbin and sidecar containers seem to share the ipc and net namespaces:

    # app container
    $ sudo ls -l /proc/27865/ns
    lrwxrwxrwx 1 root root 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
    lrwxrwxrwx 1 root root 0 Oct 28 07:56 mnt -> 'mnt:[4026532218]'
    lrwxrwxrwx 1 root root 0 Oct 28 07:56 net -> 'net:[4026532162]'
    lrwxrwxrwx 1 root root 0 Oct 28 07:56 pid -> 'pid:[4026532220]'
    lrwxrwxrwx 1 root root 0 Oct 28 07:56 uts -> 'uts:[4026532219]'

    # sidecar container
    $ sudo ls -l /proc/27977/ns
    lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
    lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 mnt -> 'mnt:[4026532221]'
    lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 net -> 'net:[4026532162]'
    lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 pid -> 'pid:[4026532223]'
    lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 uts -> 'uts:[4026532222]'

Yay!

Summarizing

Containers and Pods are alike. Under the hood, they heavily rely on Linux namespaces and cgroups. However, Pods aren't just groups of containers. A Pod is a self-sufficient higher-level construct. All of a pod's containers run on the same machine (cluster node), their lifecycle is synchronized, and mutual isolation is weakened to simplify the inter-container communication.
This makes Pods much closer to traditional VMs, bringing back the familiar deployment patterns like sidecar or reverse proxy.

Resources

* Kubernetes multi-container pods and container communication - a good technical read.
* Pods with Docker - an interactive in-browser playground.

More Containers articles from this blog

* Containers aren't Linux processes
* A journey from containerization to orchestration and beyond
* Not every container has an operating system inside
* You don't need an image to run a container
* You need containers to build images
* Implementing Container Runtime Shim: runc

Written by Ivan Velichko

Copyright Ivan Velichko (c) 2021