How to recover an RKE cluster when all control plane nodes have failed
This document (000020695) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Rancher managing a downstream RKE cluster.
Situation
All nodes with the control plane and etcd roles in the downstream cluster have failed and cannot be repaired, and the cluster needs to be recovered from an etcd snapshot.
Resolution
Pre-requisites
- Replacement nodes with adequate resources to add to the cluster with the control plane and etcd roles
- An offline copy of a snapshot to be used as the recovery point, often stored in S3 or copied off node filesystems to a backup location
Note: This article assumes that all control plane and etcd nodes are no longer functional and/or cannot be repaired via any other means, like a VM snapshot restore.
Steps
To recover the downstream cluster, any existing nodes with the control plane and/or etcd roles must be removed. Worker nodes can remain in the cluster, and these may continue to operate with running workloads.
Please use the following steps as a guideline to recover the cluster. From this point on, the cluster that has experienced the disaster will be referred to as the downstream cluster.
1. As a precaution, it is recommended to take a snapshot of the Rancher local cluster. Please see the documentation (RKE, RKE2, K3s) for the appropriate way to take a snapshot for the Rancher installation. Alternatively, the rancher-backup operator can be used to back up all of the related objects for restoration.
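   For example, if the Rancher local cluster was installed with the RKE CLI, a one-time snapshot can be taken from the host that holds the cluster.yml used to build it (the snapshot name below is only illustrative):
   # Take a one-time etcd snapshot of the RKE local cluster; run where the RKE binary and cluster.yml live
   rke etcd snapshot-save --config cluster.yml --name rancher-local-pre-recovery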
2. Delete all nodes with the control plane and/or etcd roles from the downstream cluster in the Rancher UI. The delete action can fail when the downstream cluster is in this condition; if a node does not get removed, follow the steps below to remove it from the cluster:
   - Click on the node, select View in API, and click the delete button for the object.
   - If this does not succeed, use kubectl or the Cluster Explorer for the Rancher local cluster to edit the corresponding nodes.management.cattle.io object in the namespace that matches the downstream cluster ID and remove the finalizers field, as in the sketch below.
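   A minimal sketch of the kubectl approach, assuming a kubeconfig for the Rancher local cluster; the cluster ID (c-xxxxx) and node object name (m-xxxxx) are placeholders that must be looked up first:
   # List the node objects in the namespace matching the downstream cluster ID
   kubectl get nodes.management.cattle.io -n c-xxxxx
   # Clear the finalizers field on the stuck node object so the delete can complete
   kubectl patch nodes.management.cattle.io m-xxxxx -n c-xxxxx --type=merge -p '{"metadata":{"finalizers":null}}'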
3. Add a clean node back to the cluster with all roles (control plane, etcd, worker). The IP address does not have to match any of the previous nodes. If the node has previously been used in a cluster, use the extended cleanup script steps to remove any previous configuration; an abbreviated sketch is shown below. The newly added node will fail to register with the downstream cluster and will not proceed past "Waiting to register with Kubernetes"; this is normal.
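   An abbreviated, hedged sketch of cleaning a previously used node; the extended cleanup steps in the Rancher documentation remain the authoritative reference, and the paths below assume a Docker-based RKE node:
   # Remove leftover containers and volumes from the previous cluster
   docker rm -f $(docker ps -qa)
   docker volume rm $(docker volume ls -q)
   # Remove common RKE/Kubernetes state directories (unmount anything under /var/lib/kubelet first)
   sudo rm -rf /etc/kubernetes /etc/cni /opt/cni /opt/rke /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher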
4. Copy the snapshot into place on the new node, under the /opt/rke/etcd-snapshots directory structure. The filename must match a snapshot name in the list of snapshots shown in the Rancher UI for the downstream cluster. Any snapshot should be usable; if the name is different, rename the file to match one of the known snapshots in the list.
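   For example, the offline copy can be transferred and renamed as below; the file names, user, and host are placeholders, and the target name must match a snapshot listed for the cluster in the Rancher UI:
   # Copy the offline snapshot to the new node and move it into the expected directory under the matching name
   scp ./recovery-snapshot.zip user@new-node:/tmp/recovery-snapshot.zip
   ssh user@new-node 'sudo mkdir -p /opt/rke/etcd-snapshots && sudo mv /tmp/recovery-snapshot.zip /opt/rke/etcd-snapshots/<known-snapshot-name>'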
5. Initiate a snapshot restore from the Rancher UI using the same snapshot name used in the previous step.
6. Monitor the Rancher pod logs for progress. To follow all pod logs at once, a kubeconfig for the Rancher local cluster can be used with this kubectl command:
   kubectl logs -n cattle-system -l app=rancher -f -c rancher
7. Once the new node reaches the Active state, check the cluster and add additional nodes by repeating step 3 when ready. The additional nodes can be added with only the control plane and etcd roles if desired.
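   A quick health check, assuming a kubeconfig for the recovered downstream cluster (for example, downloaded from the Rancher UI):
   # Confirm the nodes are Ready and the core components are running after the restore
   kubectl get nodes -o wide
   kubectl get pods -n kube-system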
As a follow-up, once all desired nodes are added and the cluster is healthy, the control plane and etcd node roles can be configured as needed. For example, if the all role is not needed, update the nodes by removing and re-adding them in a rolling fashion.
Additional Information
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020695
- Creation Date: 13-Jul-2022
- Modified Date: 07-Aug-2024
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com