This is an extensive tutorial on how to set up a Kubernetes cluster that supports pod migration.
Statelessness is the basic foundation for microservices running inside Kubernetes. Outside its main application domain, the platform also appeals to the High Performance Computing (HPC) community, because infrastructure management can be delegated to cloud providers and because it scales on demand. The challenge is that HPC jobs are usually long-running and stateful. Jobs such as simulations or optimization problems usually keep their state in memory, and checkpointing that state to disk is not always available.
This is undesirable because failures are expected to occur.
Matters become even worse for jobs with unpredictable resource requirements. Unexpected spikes in memory usage can exhaust a node’s memory, which results in pods being killed. The catastrophic consequence is the complete loss of job progress from many hours or even days of compute time.
To avoid this, a migration of stateful pods to another node would be desirable.
Currently, Kubernetes does not support pod migration.
However, a proof of concept (PoC) of pod migration in prior work by Jakob Schrettenbrunner showed that it is feasible. A proposal to support very basic checkpointing (forensic checkpointing without restore) has recently been accepted by the Kubernetes community as well and is expected to be available in future releases.
Building on Jakob Schrettenbrunner’s PoC, I want to show you step by step how to set up a Kubernetes cluster with pod migration functionality. Bootstrapping a Kubernetes cluster from scratch is not a trivial task, but kubeadm will help us. Jakob also provided some documentation on his setup; while very helpful, it is far from complete and does not mention all potential gotchas. You might suspect already that this won’t be a quick and easy process, but I hope to make it a lot easier for you through this extensive tutorial.
To see what to expect, here is a quick demo of the steps to migrate a pod:
1. Cluster setup
The cluster consists of 1 master node and 2 worker nodes. The VMs are provisioned in Microsoft Azure. For migrating the pod across worker nodes, Azure’s SMB file share server is used. You might also use an NFS server (which might even make things easier, as mentioned later), but this was not possible in my case for company policy reasons.
Kubernetes is bootstrapped using kubeadm. It’s tested with version v1.19.0-beta.0.1015+b521fb5114995f-dirty (binaries are available here, but I recommend building from source).
To set up the cluster network, I followed this tutorial. You can use the web shell on Azure for this:
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
I recommend building from source, but you may also use the binaries inside bin.
Clone my fork and check out the checkpoint branch. If you want to use the version that only uploads a zip to the file server (please read [6. Set up file server](#6. Set up file server)), use the checkpoint-zip branch instead.
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --container-runtime-endpoint=/run/containerd/containerd.sock --v=9 --read-only-port=0 --anonymous-auth=true --authorization-mode=AlwaysAllow --container-runtime=remote
Now it’s time to test the migration with a simple memory-allocating pod (here 50 MB):
kubectl run counter1 --restart=Never --image "ghcr.io/schrej/podmigration-testapp:latest" -- -m 50
It’s important to set restartPolicy: Never to prevent the original container from restarting during migration (relevant for large migrations)!
Using kubectl get po -o wide, you can get the pod’s IP and then increment its stateful counter.
Be sure to do this on the worker node:
Repeat the counter increment a few times, to validate the successful migration later.
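The lookup-and-increment step can be sketched as a pair of shell helpers. This is a minimal sketch, not the original commands: the function names pod_ip and increment_counter are made up for illustration, the request path / is an assumption, and the port is passed in as an argument because the test app's listening port is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not part of kubectl or the PoC).

# Print the cluster-internal IP of a pod.
pod_ip() {
  kubectl get pod "$1" -o jsonpath='{.status.podIP}'
}

# Send one HTTP request to the pod and print the response.
# $1 = pod name, $2 = port the test app listens on (assumption: supply your own).
# The path "/" is also an assumption.
increment_counter() {
  local ip
  ip="$(pod_ip "$1")" || return 1
  curl -s "http://${ip}:${2}/"
}

# Usage (on a worker node, where pod IPs are reachable):
#   increment_counter counter1 <port>
```

Calling increment_counter several times and noting the last returned value gives you a number to compare against after the migration.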
To migrate, create a second pod. Its spec is identical to the original, except for an additional field, spec.clonePod:
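As a rough illustration, such a manifest could be generated as follows. This is a sketch under assumptions: that spec.clonePod takes the name of the pod to clone, that the clone is called counter2, and that the container is named after the original pod (which is what kubectl run does). The helper name write_clone_manifest is made up. The clonePod field is only available in the patched build from this tutorial, not in stock Kubernetes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a pod manifest that clones an existing pod.
# $1 = source pod name, $2 = name for the clone, $3 = output file.
write_clone_manifest() {
  cat > "$3" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $2
spec:
  clonePod: $1        # assumption: the value is the source pod's name
  restartPolicy: Never
  containers:
  - name: $1
    image: ghcr.io/schrej/podmigration-testapp:latest
    args: ["-m", "50"]
EOF
}

# Usage:
#   write_clone_manifest counter1 counter2 clone.yaml
#   kubectl apply -f clone.yaml
```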
The migration should be very fast.
Currently, the old pod breaks during the migration, but the cloned pod should be running.
Requesting its endpoint with curl should return a number bigger than 1. Voilà - you have successfully cloned a stateful pod in Kubernetes!
6. Set up file server
I had consistency problems for bigger file uploads with SMB. The container restore command is issued 1 second after the disk checkpoint has been saved completely.
However, at this time not all files of the checkpoint directory were uploaded successfully.
I circumvented this problem by storing the checkpoint on local disk and only storing a zipped archive on the server.
The temporary local-disk location is /var/lib/kubelet/check. Since the OS disk is usually only 30 GB, you will need to create a symbolic link to a bigger disk. In my case, a temporary disk with 500 GB was mounted at /mnt.
To solve this, do:
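One way to create such a link, sketched as a small helper. The target directory /mnt/check in the usage line is an assumption (any directory on the larger disk works), and link_checkpoint_dir is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: back the checkpoint directory with a larger disk.
# $1 = directory on the big disk, $2 = path that should point to it.
link_checkpoint_dir() {
  mkdir -p "$1" || return 1
  # -s: symbolic link, -f: replace an existing link,
  # -n: treat an existing link at $2 as a file instead of following it
  ln -sfn "$1" "$2"
}

# Usage (as root), assuming /mnt/check as the target directory:
#   link_checkpoint_dir /mnt/check /var/lib/kubelet/check
```

If /var/lib/kubelet/check already exists as a real directory, move or remove it first; otherwise ln places the link inside that directory instead of replacing it.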
Interestingly, the compression immensely reduced the checkpoint size for the simple example app. For 50GB of allocated memory, the compressed zip was only around 20MB!
This modification was done inside containerd in the branch checkpoint-zip.
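The idea behind the workaround can be sketched in shell: pack the checkpoint into a single archive on local disk, copy that one file to the share, and give it its final name only after the copy has finished. This is not the code in the checkpoint-zip branch. It uses tar with gzip instead of zip, and the function name and the temporary .part name are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the containerd implementation.
# $1 = local checkpoint directory, $2 = directory on the mounted file share.
upload_checkpoint_archive() {
  local name archive
  name="$(basename "$1")"
  archive="$(mktemp)" || return 1
  # Pack the whole checkpoint into one compressed file on local disk.
  tar -czf "$archive" -C "$(dirname "$1")" "$name" || return 1
  # Copy under a temporary name, then rename once the copy is complete.
  cp "$archive" "$2/$name.tar.gz.part" || return 1
  mv "$2/$name.tar.gz.part" "$2/$name.tar.gz" || return 1
  rm -f "$archive"
}

# Usage:
#   upload_checkpoint_archive /var/lib/kubelet/check/<checkpoint> /var/lib/kubelet/migration
```

Uploading one file instead of a directory tree means the restoring side only has to wait for a single name to appear.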
The procedure is specific to Azure and is well documented here. The server should be mounted at /var/lib/kubelet/migration. I used the static mount and my /etc/fstab entry looks like this:
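A sketch of the general shape of such an entry, with placeholders rather than real values. The storage account, share name, credentials file path and mount options shown are assumptions based on the usual CIFS fstab format; take the exact options from the Azure documentation for your setup. The helper name azure_fstab_entry is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an fstab line for an Azure Files SMB share.
# $1 = storage account, $2 = share name, $3 = credentials file (all placeholders).
azure_fstab_entry() {
  printf '//%s.file.core.windows.net/%s /var/lib/kubelet/migration cifs nofail,credentials=%s,serverino 0 0\n' \
    "$1" "$2" "$3"
}

# Usage (review the output before appending it to /etc/fstab):
#   azure_fstab_entry <storage-account> <share> /etc/smbcredentials/<storage-account>.cred
```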