How to perform a rolling change to nodes
This document (000020084) is provided subject to the disclaimer at the end of this document.
Situation
Task
In a Kubernetes cluster, nodes can be treated as ephemeral building blocks that provide the resources necessary for all workloads. Managing nodes in an immutable way is particularly common in a cloud environment.
In an on-premise environment, however, nodes are often recycled and updated, and it is typical for nodes to have a longer lifecycle.
Nodes may undergo significant changes over time, for example: new IP addresses, storage/filesystem changes, migration to other hypervisors or data centers, large OS updates, or even migration between clusters.
This article provides example steps to apply large changes like these safely, in a rolling fashion.
Pre-requisites
- A custom or imported cluster managed by Rancher, or an RKE/k3s cluster
- Access to the nodes in the cluster with sudo/root
- Permission to perform drain and delete actions on the nodes
If there are any single-replica workloads, whenever possible ensure at least 2 replicas are configured for availability during the rolling change. These replicas are best scheduled on separate nodes; a preferred pod anti-affinity rule can help with this.
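As an illustration, below is a minimal sketch of a preferred pod anti-affinity rule applied to a hypothetical two-replica Deployment named `web`; the name, labels, and image are placeholders, not part of this article's environment:

```
# Hypothetical Deployment whose replicas prefer to be scheduled on different nodes.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.21
EOF
```

Because the anti-affinity is preferred rather than required, the pods can still share a node if no other node has capacity.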
Steps
While performing a rolling change to nodes you will need to determine a batch size, effectively how many nodes you wish to take out of service at a time. Initially, it is recommended to perform the change on one node as a canary, and to test that the change has the desired outcome before changing more nodes at once.
1. If you wish to maintain the number of nodes in the cluster while performing the rolling change, add new nodes at this point. This ensures that while nodes are out of service, the cluster maintains at least the original number of available nodes.
2. Drain the node. This can be done with `kubectl drain <node>`, or in the Rancher UI. This step is particularly important to avoid disruption to services: by draining first, service endpoints are updated to remove the pods, the pods are stopped and started on another node in the cluster, and then added back to the service safely. If there are pods using local storage (commonly `emptyDir` volumes) that should be drained, the `--delete-local-data=true` flag will be needed; beware that this data will be lost.
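   For example, a minimal sketch of draining a single canary node, assuming a node named `worker-1`:

   ```
   # Cordon the node and evict its pods; DaemonSet pods cannot be evicted, so skip them.
   kubectl drain worker-1 --ignore-daemonsets

   # If pods with emptyDir volumes block the drain and their data can be discarded,
   # add the flag below. Beware: the emptyDir data on this node will be lost.
   kubectl drain worker-1 --ignore-daemonsets --delete-local-data=true
   ```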
3. Optional: Delete the node(s) from the cluster. This can be done with `kubectl delete node <node>`. This is needed for changes that cannot be performed on existing nodes, such as changing IP addresses or hostnames, moving nodes to another cluster, and large configuration updates. Any pods and Kubernetes components running on the node will be removed. Note: if this is an `etcd` node, ensure that the cluster has quorum and at least two remaining `etcd` nodes to maintain HA before performing this step.
   - For an imported cluster there is no automated cleanup, so at this point also remove the node from the cluster configuration:
     - RKE: remove the node from the cluster.yaml file, followed by an `rke up`
     - k3s: stop the k3s service and uninstall k3s using the uninstall script
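   For example, a minimal sketch assuming a node named `worker-1` in an RKE custom cluster, which labels etcd nodes with `node-role.kubernetes.io/etcd`:

   ```
   # Before deleting an etcd node, confirm enough etcd members remain to keep quorum.
   kubectl get nodes -l node-role.kubernetes.io/etcd=true

   # Remove the node object from the cluster.
   kubectl delete node worker-1
   ```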
4. Optional: If the node was deleted in step 3, clean the node to ensure that all previous state from the cluster, CNI devices, volumes, and containers is removed. This is especially important if the node is to be re-used in another cluster.
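   The exact cleanup steps depend on the Kubernetes distribution and container runtime. The commands below are a minimal sketch for a Docker-based RKE node, following commonly documented Rancher node cleanup locations; adapt the directory list to your environment:

   ```
   # Stop and remove any remaining containers and their volumes (Docker runtime assumed).
   docker rm -f $(docker ps -qa)
   docker volume rm $(docker volume ls -q)

   # Remove cluster state, certificates, and CNI configuration left on the node.
   rm -rf /etc/cni /etc/kubernetes /opt/cni /opt/rke \
          /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher \
          /var/log/containers /var/log/pods

   # Reboot to clear leftover network interfaces and iptables rules.
   reboot
   ```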
5. Perform the changes to the node. This could be automated with configuration management, scripted, or performed manually.
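   As a simple illustration, assuming the change is a large OS update on a SLES node:

   ```
   # Example only: refresh repositories, apply updates non-interactively, then reboot.
   zypper --non-interactive refresh
   zypper --non-interactive update
   reboot
   ```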
6. Once step 5 is complete, add the node back to the desired cluster.
   - In a custom cluster this can be done with the `docker run` command supplied in the Rancher UI.
   - For an imported cluster the steps are different:
     - RKE: add the node back by configuring it in the cluster.yaml file, followed by an `rke up`
     - k3s: re-install k3s using the correct flags/variables
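   For example, a minimal sketch for the imported cluster cases; the server URL and token are placeholders for your own values:

   ```
   # RKE: after adding the node back to cluster.yaml, reconcile the cluster.
   rke up --config cluster.yaml

   # k3s agent: re-install and join the existing server (placeholders shown).
   curl -sfL https://get.k3s.io | K3S_URL=https://<server>:6443 K3S_TOKEN=<token> sh -
   ```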
7. Test the nodes with running workloads, and monitor before proceeding with the next node or a larger batch of nodes.
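   A few basic checks, as a minimal sketch assuming a node named `worker-1`; if the node was only drained (not deleted), it also needs to be uncordoned before it will accept workloads again:

   ```
   # Allow scheduling again if the node was drained but not deleted.
   kubectl uncordon worker-1

   # Confirm the node is Ready and that pods are being scheduled onto it.
   kubectl get node worker-1
   kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-1
   ```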
8. If additional nodes were added in step 1, these can be removed from the cluster at this point by following steps 2, 3, and 4.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020084
- Creation Date: 06-May-2021
- Modified Date: 09-Jul-2021
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com