stonith:external/sbd resource agent fails with OCF_TIMEOUT

This document (000020383) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 12

Situation

After a single SBD iSCSI node fails in a multi-node SBD STONITH configuration, the following failure appears when checking Pacemaker's status with either the "crm_mon" or the "crm status" command:

stonith-sbd_monitor_15000 on node1 'OCF_TIMEOUT' (198): call=25, status='Timed Out', exitreason='', last-rc-change='<timestamp>', queued=0ms, exec=15004ms
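
The check for this failure can be scripted. The following is a minimal sketch, assuming the fencing resource is named stonith-sbd and that the failure line has the format shown above (other Pacemaker versions may print failed actions differently). The function name is hypothetical.

```shell
# Sketch: report whether a Pacemaker status dump contains a timed-out
# monitor operation for the SBD fencing resource.
# Assumption: the resource is named "stonith-sbd", as in the line above.
has_sbd_monitor_timeout() {
    # Reads status text on stdin; exit status 0 means the failure is present.
    grep -q "stonith-sbd_monitor_[0-9]*.*OCF_TIMEOUT"
}

# Example usage on a cluster node ("crm_mon -1" prints the status once and exits):
#   crm_mon -1 | has_sbd_monitor_timeout && echo "stonith-sbd monitor timed out"
```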

This situation is particularly relevant in cloud-based Pacemaker deployments.

Resolution

Ensure the affected SBD node is back online and run a resource cleanup on the stonith-sbd resource:

# crm resource cleanup stonith-sbd
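
The two steps above (confirm the device is back, then clean up) could be combined roughly as follows. This is a sketch under stated assumptions, not a supported tool: the function name is hypothetical, the device paths are passed in by the caller (on SUSE systems they are normally listed in SBD_DEVICE in /etc/sysconfig/sbd), and the resource is assumed to be named stonith-sbd.

```shell
# Sketch: clean up the fencing resource only when every SBD device responds.
# Assumption: device paths are given as arguments; resource ID is "stonith-sbd".
cleanup_sbd_if_devices_ok() {
    # Refuse to run without at least one device to check.
    [ "$#" -gt 0 ] || { echo "usage: cleanup_sbd_if_devices_ok DEVICE..." >&2; return 2; }
    for dev in "$@"; do
        # "sbd -d <device> list" reads the slot table and fails if the
        # device cannot be read.
        if ! sbd -d "$dev" list >/dev/null 2>&1; then
            echo "SBD device not accessible: $dev" >&2
            return 1
        fi
    done
    crm resource cleanup stonith-sbd
}
```

The function would be called as root with the cluster's device paths as arguments. If any device is still unreachable it stops before the cleanup, because the agent would only fail again.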

Cause

The monitor operation of the stonith:external/sbd resource agent ran while a single SBD device was unavailable. This happens either because the fence agent's monitor interval is set too low in the CIB, or because the monitor coincidentally ran during a temporary outage of a single SBD device.

See the Additional Information section for full details on why this is an issue.

Status

Reported to Engineering

Additional Information

To fully understand this issue, one should be aware of the difference between two components. The stonith:external/sbd agent is a fencing agent that Pacemaker executes on a single cluster node. The /usr/sbin/sbd daemon is a binary that is started by the sbd.service systemd unit, as a dependency of pacemaker.service, on all nodes in an SBD-based cluster. The STONITH action is carried out by the daemon, not the agent.

To avoid confusion, only the terms "agent" and "daemon" are used in the paragraphs below to refer to the client-side software components of the HA cluster. The devices themselves are referred to as "SBD device" or "SBD node", with SBD in upper case.

The daemon constantly monitors the SBD devices and is resilient against single SBD device failures; any failures are logged by the daemon, and fencing remains functional as long as a majority of SBD devices are still accessible. The agent monitor, on the other hand, fails if any one of the SBD devices is inaccessible. An OCF_TIMEOUT error is then logged and the agent is marked with a FAILED status. When Pacemaker attempts to recover the agent by starting it on a different cluster node, the start operation fails if the problem SBD node is still not present, and the agent remains in a Stopped state until the administrator rectifies any outstanding problems with the affected SBD node and cleans up the agent. If a problem with a single SBD device resolves itself before the agent runs its monitor, no manual intervention is required. This means that the agent monitor is not only redundant; from an operations standpoint, it is also undesirable to run it at a relatively short interval, hence the agent's default monitor interval of 3600 seconds.
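
The difference between the two checks can be modeled in a few lines of shell. This is purely an illustration of the rules as described in this document, not code taken from sbd or the agent. Both function names are hypothetical, and the majority rule is a simplification.

```shell
# Illustrative model only; this is not the sbd source code.
# Arguments for both functions: TOTAL devices, REACHABLE devices.

# Daemon view: fencing stays functional while a strict majority of the
# configured SBD devices is reachable.
daemon_can_fence() {
    [ "$2" -gt $(( $1 / 2 )) ]
}

# Agent monitor view: the check passes only if every device is reachable.
agent_monitor_ok() {
    [ "$2" -eq "$1" ]
}

# Example: three devices, one of them down.
#   daemon_can_fence 3 2   -> succeeds (2 of 3 is a majority)
#   agent_monitor_ok 3 2   -> fails (one device is missing)
```

With three devices and one outage, the model shows the situation from this document: fencing still works, yet the agent is reported as failed.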

The agent failure is only cosmetic: Pacemaker can still fence an unhealthy node through the daemon after the agent ends up in a FAILED or Stopped status, as long as the fencing mechanism is still registered. This can be checked with the "stonith_admin -L" command on any cluster node. If there is a real problem with Pacemaker's ability to fence a node, this command returns a "No Such Device" error.
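
The registration check described above could be wrapped in a small helper. This is a minimal sketch, assuming the resource ID is stonith-sbd and that "stonith_admin -L" prints the ID of each registered device in its output. The function name is hypothetical and the exact output format may differ between Pacemaker versions.

```shell
# Sketch: confirm that the SBD fencing device is still registered.
# Assumption: resource ID "stonith-sbd" appears in "stonith_admin -L" output
# when the device is registered.
sbd_fencing_registered() {
    stonith_admin -L 2>/dev/null | grep -q "stonith-sbd"
}

# Example usage on any cluster node:
#   sbd_fencing_registered && echo "fencing device registered" \
#       || echo "fencing device NOT registered"
```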

See also:
https://www.linux-ha.org/wiki/SBD_Fencing

Excerpt:
"The sbd agent does not need to and should not be cloned. If all of your nodes run SBD, as is most likely, not even a monitor action provides a real benefit, since the daemon would suicide the node if there was a problem."

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

Document ID: 000020383
Creation Date: 08-Sep-2021
Modified Date: 10-Sep-2021
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com