[Rancher] Operational Advisory, 20220405: Rancher Kubernetes Distributions and Etcd 3.5 Updates
This document (000020632) is provided subject to the disclaimer at the end of this document.
Environment
Situation
- Users running Rancher 2.6.4+ who have deployed Rancher on a single node as a Docker installation. Reminder: this installation method is not recommended for any production environment and is intended only for development/sandbox testing. If you are running Rancher on a managed Kubernetes cluster, refer to your Kubernetes service provider to determine whether you are affected by this advisory.
- Users running Kubernetes 1.22 or 1.23 on any of the Rancher Kubernetes distributions (RKE, RKE2, K3s) and using etcd as the datastore. The default datastore for RKE and RKE2 is etcd. This applies to standalone Kubernetes clusters as well as any downstream clusters provisioned by Rancher.
Resolution
- Stop deploying into production any new Kubernetes clusters using Rancher Kubernetes distribution versions 1.22/1.23 until a proper fix is provided by the etcd maintainers and included in the affected distributions.
- Update your etcd configuration to enable the experimental-initial-corrupt-check option. This flag will be turned on by default in etcd v3.6, but it does not by itself fix the problem; it can only detect the issue if it does occur.
Note: each distribution has its own recommendation on how to enable this option; see below for more details.
- Avoid terminating etcd unexpectedly (using kill -9, etc.)
- For RKE1 clusters, avoid stopping/killing the etcd containers ad hoc without properly cordoning/draining nodes and taking backups (see the example following this list)
- For RKE2 clusters,
- Avoid sending SIGKILL to the etcd or rke2 process.
- Avoid using the killall script (rke2-killall.sh) to stop RKE2 on servers hosting production workloads. The killall script is meant to clean up hosts prior to uninstallation or reconfiguration and should not be used as a substitute for properly cordoning/draining a node and stopping services.
- For k3s clusters,
- Avoid sending SIGKILL to the k3s process.
- Avoid using the killall script (k3s-killall.sh) to stop K3s on servers hosting production workloads. The killall script is meant to clean up hosts prior to uninstallation or reconfiguration and should not be used as a substitute for properly cordoning/draining a node and stopping services.
- Ensure nodes are not under significant memory pressure that may cause the Linux kernel to terminate the etcd process.
- Ensure that nodes are not terminated unexpectedly. Avoid force-terminating VMs, unexpected power loss, etc.
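Where a node running etcd does have to be taken out of service, a graceful sequence of cordon, drain, and service stop (after taking an etcd snapshot/backup with your distribution's tooling) avoids killing etcd mid-write. The commands below are a minimal sketch; the node name node-1 is a placeholder, and the final command depends on the distribution (rke2-server on RKE2 servers, k3s on K3s servers).
# Prevent new pods from being scheduled onto the node
kubectl cordon node-1
# Move existing workloads off the node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# Stop the service cleanly instead of sending SIGKILL
systemctl stop rke2-server   # or: systemctl stop k3s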
How do I enable the recommended flag in etcd?
For Users provisioning RKE/k3s/RKE2 clusters through Rancher
Provisioned RKE Clusters
If you are running 1.22 or 1.23, upgrade to the following respective versions to enable the recommended experimental-initial-corrupt-check flag in etcd.
- RKE 1.22 - v1.22.7-rancher1-2
- RKE 1.23 (Experimental) - v1.23.4-rancher1-2
Provisioned k3s/RKE2 clusters are still in tech preview, so we do not recommend running production workloads on these clusters. If you have provisioned clusters, you can enable the recommended experimental-initial-corrupt-check flag by editing the cluster as YAML. If you have an imported k3s/RKE2 cluster, review the standalone Kubernetes distribution section.
- From the “Cluster Management” page, click the vertical three-dots on the right-hand side for the cluster you want to edit.
- From the menu, select “Edit YAML”.
- Edit the spec.rkeConfig.machineGlobalConfig.etcd-arg section of the YAML to add the etcd argument. Note: your YAML may differ slightly from the example below.
spec:
  cloudCredentialSecretName: cattle-global-data:cc-xxxxx
  kubernetesVersion: v1.22.7+rke2r2
  localClusterAuthEndpoint: {}
  rkeConfig:
    chartValues:
      rke2-calico: {}
    etcd:
      snapshotRetention: 5
      snapshotScheduleCron: 0 */5 * * *
    machineGlobalConfig:
      cni: calico
      etcd-arg: ["experimental-initial-corrupt-check=true"]
- Click “Save” at the bottom. Rancher will update the configuration and restart the necessary services.
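The same change can also be made from the command line against the Rancher management (local) cluster, where the provisioning configuration is stored as a Cluster object. The command below is a sketch that assumes a downstream cluster named my-cluster in the default fleet-default namespace; adjust both names to your environment.
# Edit the provisioning Cluster object and add the etcd-arg entry under
# spec.rkeConfig.machineGlobalConfig, as in the YAML above
kubectl edit clusters.provisioning.cattle.io my-cluster -n fleet-default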
RKE Clusters
As of RKE v1.3.8, the default Kubernetes version was set to 1.22.x. To avoid using the default Kubernetes version for any new deployments through RKE, set kubernetes_version to another available version in your cluster.yml file.
- If you already have an existing RKE1 cluster on an affected version, you can set experimental-initial-corrupt-check: true in extra_args for the etcd service, as shown in the sketch below.
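A minimal cluster.yml excerpt for an existing RKE1 cluster, assuming the rest of your cluster.yml is left unchanged; after editing, reapply the configuration with rke up:
services:
  etcd:
    extra_args:
      # passed to the etcd container as --experimental-initial-corrupt-check=true
      experimental-initial-corrupt-check: "true"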
RKE2/k3s Clusters
The flag only needs to be added if you are using HA with embedded etcd. Single-server clusters using SQLite, or HA clusters using an external SQL datastore, are not affected. The flag only needs to be enabled on servers, as agents do not run etcd.
Customization of etcd was introduced with 1.22.4 and 1.23.0, so if you are running a lower version of 1.22.x, then you will need to upgrade to at least 1.22.4 in order to customize the etcd configuration.
RKE2 Clusters
- Create or edit the config file at /etc/rancher/rke2/config.yaml.
- Add the following line to the end of the file:
etcd-arg: "experimental-initial-corrupt-check=true"
- Save the config file and then run systemctl restart rke2-server to apply the change.
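To confirm the flag was applied after the restart, it should appear in the generated etcd static pod manifest; a quick check, assuming the default RKE2 data directory:
# The argument should be present in the etcd static pod definition
grep experimental-initial-corrupt-check /var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml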
k3s Clusters
- Create or edit the config file at /etc/rancher/k3s/config.yaml.
- Add the following line to the end of the file:
etcd-arg: "experimental-initial-corrupt-check=true"
- Save the config file and then run systemctl restart k3s to apply the change.
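The etcd-arg flag is repeatable, so if you need to pass more than one argument to etcd you can also write it in YAML list form in the same config file; a sketch of the equivalent list form:
etcd-arg:
  - "experimental-initial-corrupt-check=true"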
Additional Information
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020632
- Creation Date: 05-Apr-2022
- Modified Date: 11-Apr-2022
- SUSE Rancher Harvester
- SUSE Rancher
- SUSE Rancher Longhorn
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com