Failed etcd snapshot with "StorageError: invalid object" message

This document (000021078) is provided subject to the disclaimer at the end of this document.

Environment

Rancher 2.6.x
Rancher 2.7.x
Downstream RKE2 cluster
Kubernetes 1.22 and above

Situation

During a manual or recurring snapshot, the Rancher logs show the following error messages:

2023/05/04 14:45:01 [INFO] [snapshotbackpopulate] rkecluster fleet-default/xxxx-yyyy-wwww: processing configmap kube-system/rke2-etcd-snapshots
2023/05/04 14:45:02 [ERROR] error syncing 'kube-system/rke2-etcd-snapshots': handler snapshotbackpopulate: rkecluster fleet-default/xxxx-yyyy-wwww: error while setting status missing=true on etcd snapshot /: Operation cannot be fulfilled on etcdsnapshots.rke.cattle.io "xxxx-yyyy-wwww-etcd-snapshot-clstr-k8s-xxxx-yyyy-wwww-0102b": StorageError: invalid object, Code: 4, Key: /registry/rke.cattle.io/etcdsnapshots/fleet-default/xxxx-yyyy-wwww-etcd-snapshot-clstr-k8s-xxxx-yyyy-wwww-0102b, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: fzzzzzz-1111-2222-3333-000000000000, UID in object meta: , requeuing

2023/05/04 14:47:01 [INFO] [snapshotbackpopulate] rkecluster fleet-default/xxxx-yyyy-wwww: processing configmap kube-system/rke2-etcd-snapshots
2023/05/04 14:47:10 [ERROR] rkecluster fleet-default/xxxx-yyyy-wwww: error while creating s3 etcd snapshot fleet-default/clstr-k8s-xxxx-yyyy-wwww-.-s3: ETCDSnapshot.rke.cattle.io "clstr-k8s-xxxx-yyyy-wwww-.-s3" is invalid: metadata.name: Invalid value: "clstr-k8s-xxxx-yyyy-wwww-.-s3": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

These messages indicate that the snapshot process is running but failing, possibly because of:

- an issue with the S3-compatible storage solution
- a snapshot name that violates the Kubernetes object naming rules (RFC 1123 subdomain), for example by exceeding the length limit or by starting or ending a name segment with a non-alphanumeric character
- a corrupted snapshot with an invalid or malformed name in the S3 bucket, which the snapshot controller keeps adding to the ConfigMap
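
To find these messages in a busy log, it can help to filter the Rancher pod logs for the handler name and the error text. The sketch below assumes Rancher was installed with the Helm chart defaults (namespace 'cattle-system', pod label 'app=rancher'). 'filter_snapshot_errors' is a hypothetical helper name used for illustration, not part of Rancher.

```shell
# Keep only ERROR lines that mention the snapshot handler or an etcd snapshot.
# Reads log text on stdin and writes the matching lines to stdout.
filter_snapshot_errors() {
  grep -E '\[ERROR\].*(snapshotbackpopulate|etcd snapshot)'
}

# Usage (assumed namespace and label, adjust to your installation):
#   kubectl -n cattle-system logs -l app=rancher --tail=-1 | filter_snapshot_errors
```

The INFO lines from the same handler are dropped on purpose, so only the failing snapshot objects and their names remain.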

Resolution

If the error concerns the snapshot name (its length or its format), review the naming convention used in your environment and change it so that names meet the requirement. Refer to the Kubernetes rules for RFC 1123 subdomain names.
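
A candidate name can be checked before it reaches Kubernetes. The following is a minimal sketch that applies the validation regex quoted in the error message, plus the 253-character limit that Kubernetes places on DNS subdomain names. 'is_rfc1123_subdomain' is a hypothetical helper name.

```shell
# Return 0 if the argument is a valid lowercase RFC 1123 subdomain, 1 otherwise.
is_rfc1123_subdomain() {
  name=$1
  # One dot-separated segment: alphanumeric at both ends, '-' allowed in between.
  label='[a-z0-9]([-a-z0-9]*[a-z0-9])?'
  # Kubernetes limits DNS subdomain names to 253 characters.
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 253 ] || return 1
  printf '%s\n' "$name" | grep -Eq "^${label}(\\.${label})*\$"
}

# Usage:
#   is_rfc1123_subdomain "clstr-k8s-xxxx-yyyy-wwww-.-s3" || echo "invalid name"
```

The name from the log above fails this check because the segment before the dot ends with '-' and the segment after it starts with '-'.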

If the error concerns the snapshot itself, check whether the configured folder in the S3 bucket still contains snapshots with an invalid file name.
The snapshot controller on the RKE2 side keeps re-adding those snapshots to the ConfigMap as long as the files exist in the bucket or on disk on the nodes.
In this case, remove the corrupted or malformed snapshots from the bucket, the nodes, or both.

Rancher can sometimes get stuck and fail to finish the process even when the other parameters (S3 configuration, snapshot names, and so on) are correct.
In that case, the following small and safe edit can work around the problem and let Rancher proceed:

In the Rancher UI, go to Cluster Management
Select the affected downstream cluster
Edit the cluster configuration and go to the etcd section
Change the Folder Name in the S3 bucket configuration
Save and exit, then wait for the change to finish applying
Edit the cluster again the same way
Change the folder back to its original name
Save and exit

This should prompt Rancher to use the correct folder name again.
It can also help to restart the Rancher deployment for better responsiveness.
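
One way to restart the Rancher deployment is a rolling restart with kubectl. This sketch assumes a Helm-based installation where the deployment is named 'rancher' in the 'cattle-system' namespace. 'restart_rancher' and the 'KUBECTL' variable are hypothetical helpers added for illustration.

```shell
# KUBECTL can be overridden, for example with "echo kubectl", to preview the commands.
KUBECTL=${KUBECTL:-kubectl}

restart_rancher() {
  # Trigger a rolling restart of the Rancher pods, then wait for it to complete.
  $KUBECTL rollout restart deployment rancher -n cattle-system &&
    $KUBECTL rollout status deployment rancher -n cattle-system
}

# Usage, against the cluster where Rancher itself runs:
#   restart_rancher
```

A rolling restart replaces the pods one at a time, so the Rancher UI should stay reachable when more than one replica is running.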

If the above has no effect, try the following steps:

Save copies of the etcd snapshots in another folder as a precaution.
Reduce the etcd snapshot retention on the downstream cluster to 10 snapshots (instead of 40) and temporarily disable S3 backups.
Edit the 'rke2-etcd-snapshots' ConfigMap in the 'kube-system' namespace on the downstream cluster and remove its data, keeping only the manifest metadata:

> kubectl edit configmap -n kube-system rke2-etcd-snapshots
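
As a non-interactive alternative to the edit above, the ConfigMap can be saved to a file first and then emptied with a JSON patch. This is a sketch only: the function names, the 'KUBECTL' variable and the backup file name are hypothetical, and the commands must run against the downstream cluster.

```shell
# KUBECTL can be overridden, for example with "echo kubectl", to preview the commands.
KUBECTL=${KUBECTL:-kubectl}

# Save the current ConfigMap to the file given as $1 so it can be inspected later.
backup_snapshot_configmap() {
  $KUBECTL get configmap rke2-etcd-snapshots -n kube-system -o yaml > "$1"
}

# Remove the whole "data" section with a JSON patch, leaving the metadata in place.
# The patch fails if the ConfigMap has no "data" section.
empty_snapshot_configmap() {
  $KUBECTL patch configmap rke2-etcd-snapshots -n kube-system \
    --type=json -p '[{"op":"remove","path":"/data"}]'
}

# Usage:
#   backup_snapshot_configmap rke2-etcd-snapshots.backup.yaml && empty_snapshot_configmap
```

Chaining the two calls with '&&' ensures the ConfigMap is only emptied once the backup command has succeeded.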

After you save the edits above, Fleet may trigger all of the snapshots it missed, likely because the old snapshot jobs were still queued.
Change the snapshot schedule to every 5 minutes so that the retention settings are applied and the old snapshots are cleaned up. This should take effect once the 5-minute period has elapsed.
After that, clean up the on-demand snapshots, because the retention settings do not remove them automatically.

Because the Delete option is not available in the UI, delete them from the local filesystem of each node. After around 15 to 20 minutes (possibly less, depending on the environment), Rancher should reconcile the changes and remove the old on-demand snapshots from the UI.
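
The on-demand files can be listed for review before anything is deleted. The sketch below assumes the RKE2 defaults: snapshots stored under '/var/lib/rancher/rke2/server/db/snapshots' and on-demand snapshots named with the 'on-demand-' prefix. Adjust both if your cluster uses a custom snapshot directory or name. 'list_on_demand_snapshots' and 'SNAPSHOT_DIR' are hypothetical names.

```shell
# Default RKE2 snapshot directory (assumption, override if configured differently).
SNAPSHOT_DIR=${SNAPSHOT_DIR:-/var/lib/rancher/rke2/server/db/snapshots}

# Print on-demand snapshot files in the given directory, one per line.
# Prints nothing if no such files exist.
list_on_demand_snapshots() {
  find "${1:-$SNAPSHOT_DIR}" -maxdepth 1 -type f -name 'on-demand-*'
}

# Usage, on each node. Review the list first, then delete, for example:
#   list_on_demand_snapshots
#   list_on_demand_snapshots | while IFS= read -r f; do rm -- "$f"; done
```

Listing and deleting are kept as separate steps so that scheduled snapshots are not removed by mistake.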

Then re-enable S3 snapshots and check that new snapshots are being stored there.
Finally, set the snapshot schedule back to its previous value (every 5 hours, every day, and so on).
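
To confirm that Rancher and RKE2 agree on the snapshots again, both sides can be inspected with kubectl. The resource name 'etcdsnapshots.rke.cattle.io' and the 'fleet-default' namespace are taken from the log messages above; your cluster may live in a different namespace. The function names and the 'KUBECTL' variable are hypothetical.

```shell
# KUBECTL can be overridden, for example with "echo kubectl", to preview the commands.
KUBECTL=${KUBECTL:-kubectl}

# List the snapshot objects that Rancher tracks for provisioned clusters.
# Run against the cluster where Rancher itself runs.
list_rancher_snapshots() {
  $KUBECTL get etcdsnapshots.rke.cattle.io -n fleet-default
}

# Show the snapshot ConfigMap maintained on the RKE2 side.
# Run against the downstream cluster.
show_snapshot_configmap() {
  $KUBECTL get configmap rke2-etcd-snapshots -n kube-system -o yaml
}
```

New entries appearing in both outputs after the next scheduled run suggest that the snapshot process has recovered.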

Note: the steps above have been tested and validated on various clusters.

Cause

This can happen for several reasons, such as:

- a failed etcd restore from backup
- a failure during the upgrade process
- corrupted old snapshots

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.