Troubleshooting RKE2 etcd Nodes
This document (000021653) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Rancher Prime 2.9.2+
RKE2 v1.27.16+
etcd v3.5.13+
Situation
etcd is automatically compacted by the apiserver every 5 minutes and is defragmented by RKE2 on startup. Although RKE2's embedded etcd maintenance is performed automatically at appropriate times, there are cases where manual troubleshooting can be helpful.
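If manual maintenance is needed outside of the automatic schedule, a defragmentation can also be triggered with etcdctl using the same certificates referenced later in this document. The following is a sketch, assuming a single local etcd member on the default endpoint; run it during a quiet period, as defragmentation briefly blocks the member:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag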
Resolution
Most of the steps below are adapted from https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf#etcd . This document uses the crictl command; the same results can be achieved with kubectl, as described in the link above.
Troubleshooting etcd Nodes
This section contains commands and tips for troubleshooting nodes with the etcd role.
Checking if the etcd Container is Running
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd
Example output:
CONTAINER       IMAGE           CREATED         STATE     NAME   ATTEMPT   POD ID          POD
d05aadf64ac22   c6b7a4f2f79b2   7 minutes ago   Running   etcd   0         08c3af3ff46f1   etcd-ranch26
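If no running container is listed, checking for exited containers as well can reveal a crashed etcd (the -a flag includes stopped containers):
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps -a --name etcd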
etcd Container Logging
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl logs <etcd-container-id>
Replace <etcd-container-id> with the CONTAINER value from the previous step (for example, d05aadf64ac22).
Check the log for:
"msg":"prober found high clock drift"
This points to NTP clock drift, which can cause etcd to behave erratically.
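If clock drift messages appear, verify time synchronization on every etcd node. For example, on systemd-based hosts:
timedatectl status
chronyc tracking
The second command only applies if chrony is the NTP client in use on the node.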
etcd Cluster and Connectivity Checks
The address where etcd listens depends on the address configuration of the host it runs on. If an internal address is configured for that host, the endpoint for etcdctl needs to be specified explicitly. If any of the commands respond with Error: context deadline exceeded, the etcd instance is unhealthy (either quorum is lost or the instance is not correctly joined to the cluster).
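For example, to query a single member directly, the endpoint can be passed explicitly (the address below is illustrative and should be replaced with the node's configured etcd address):
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --endpoints=https://192.168.1.102:2379 --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --write-out=table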
Check etcd Members on all Nodes
The output should contain all nodes with the etcd role and should be identical on all nodes.
Command:
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list; done
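Example output (illustrative values; one line per member in the form ID, status, name, peer URLs, client URLs, is learner):
e16c23384aeb3678, started, ranch26, https://192.168.1.102:2380, https://192.168.1.102:2379, false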
Check Endpoint Status
The values for RAFT TERM should be equal, and the values for RAFT INDEX should not be too far apart from each other.
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table
Example output:
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.102:2379 | e16c23384aeb3678 |   3.5.9 |   58 MB |      true |      false |        61 |    6449917 |            6449916 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Check Endpoint Health
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint health --cluster --write-out=table
Example output:
+----------------------------+--------+------------+-------+
|          ENDPOINT          | HEALTH |    TOOK    | ERROR |
+----------------------------+--------+------------+-------+
| https://192.168.1.102:2379 |   true | 4.328099ms |       |
+----------------------------+--------+------------+-------+
etcd Alarms
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm list
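An empty result means no alarms are active. If an alarm such as NOSPACE is listed (raised when the etcd database grows beyond its quota), it can be cleared after compaction and defragmentation. The following is a sketch, reusing the variables set above:
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm disarm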
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021653
- Creation Date: 19-Dec-2024
- Modified Date: 02-Jan-2025
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com