Troubleshooting RKE2 etcd Nodes
This document (000021653) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Rancher Prime 2.9.2+
RKE2 v1.27.16+
etcd v3.5.13+
Situation
etcd is automatically compacted by the apiserver every 5 minutes and is defragmented by RKE2 on startup. Although RKE2's embedded etcd maintenance is performed automatically at appropriate times, there are cases where manual troubleshooting can be helpful.
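If manual maintenance is needed outside of the automatic schedule, a defragmentation can also be triggered with etcdctl using the same certificates referenced later in this document. The following is a sketch, assuming a single local etcd member on the default endpoint; run it during a quiet period, as defragmentation briefly blocks the member:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag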
Resolution
Most of the steps below are adapted from https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf#etcd . This document uses the crictl command; the same results can be achieved with kubectl, as described in the link above.
Troubleshooting etcd Nodes
This section contains commands and tips for troubleshooting nodes with the etcd role.
Checking if the etcd Container is Running
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd
Example output:
CONTAINER       IMAGE           CREATED         STATE     NAME   ATTEMPT   POD ID          POD
d05aadf64ac22   c6b7a4f2f79b2   7 minutes ago   Running   etcd   0         08c3af3ff46f1   etcd-ranch26
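If no running container is listed, checking for exited containers as well can reveal a crashed etcd (the -a flag includes stopped containers):
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps -a --name etcd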
etcd Container Logging
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl logs <etcd-container-id>
Replace <etcd-container-id> with the CONTAINER value from the previous step (for example, d05aadf64ac22).
Check the log for:
"msg":"prober found high clock drift"
This points to NTP clock drift, which can cause etcd to behave erratically.
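If clock drift messages appear, verify time synchronization on every etcd node. For example, on systemd-based hosts:
timedatectl status
chronyc tracking
The second command only applies if chrony is the NTP client in use on the node.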
etcd Cluster and Connectivity Checks
The address where etcd listens depends on the address configuration of the host it runs on. If an internal address is configured for that host, the endpoint for etcdctl needs to be specified explicitly. If any of the commands respond with Error: context deadline exceeded, the etcd instance is unhealthy (either quorum is lost or the instance is not correctly joined to the cluster).
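For example, to query a single member directly, the endpoint can be passed explicitly (the address below is illustrative and should be replaced with the node's configured etcd address):
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --endpoints=https://192.168.1.102:2379 --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --write-out=table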
Check etcd Members on all Nodes
The output should contain all nodes with the etcd role and should be identical on all nodes.
Command:
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list; done
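Example output (illustrative values; one line per member in the form ID, status, name, peer URLs, client URLs, is learner):
e16c23384aeb3678, started, ranch26, https://192.168.1.102:2380, https://192.168.1.102:2379, false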
Check Endpoint Status
The values for RAFT TERM should be equal, and the values for RAFT INDEX should not be too far apart from each other.
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table
Example output:
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.102:2379 | e16c23384aeb3678 |   3.5.9 |   58 MB |      true |      false |        61 |    6449917 |            6449916 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Check Endpoint Health
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint health --cluster --write-out=table
Example output:
+----------------------------+--------+------------+-------+
|          ENDPOINT          | HEALTH |    TOOK    | ERROR |
+----------------------------+--------+------------+-------+
| https://192.168.1.102:2379 |   true | 4.328099ms |       |
+----------------------------+--------+------------+-------+
etcd Alarms
Command:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm list
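An empty result means no alarms are active. If an alarm such as NOSPACE is listed (raised when the etcd database grows beyond its quota), it can be cleared after compaction and defragmentation. The following is a sketch, reusing the variables set above:
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm disarm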
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021653
- Creation Date: 19-Dec-2024
- Modified Date: 02-Jan-2025
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com