stonith:external/sbd resource agent fails with OCF_TIMEOUT
This document (000020383) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 12
Situation
stonith-sbd_monitor_15000 on node1 'OCF_TIMEOUT' (198): call=25, status='Timed Out', exitreason='', last-rc-change='<timestamp>', queued=0ms, exec=15004ms
This situation is particularly relevant in cloud-based Pacemaker deployments.
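The name of the failed operation in the message above encodes the monitor interval. The following is a minimal sketch, assuming the usual Pacemaker operation key layout of resource, action and interval in milliseconds separated by underscores. The function name op_interval_seconds is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch: extract the monitor interval from a Pacemaker operation key.
# Assumption: the key has the form <resource>_<action>_<interval in ms>.
op_interval_seconds() {
    key=$1
    interval_ms=${key##*_}          # keep only the text after the last underscore
    echo $((interval_ms / 1000))    # convert milliseconds to whole seconds
}

# Operation key taken from the error message in this document.
op_interval_seconds stonith-sbd_monitor_15000
```

For the message shown in this document the result is 15, meaning the agent monitor was configured to run every 15 seconds.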
Resolution
# crm resource cleanup stonith-sbd
Cause
See the Additional Information section for full details on why this occurs.
Status
Additional Information
To avoid confusion, the paragraphs below use only the terms "agent" and "daemon" to refer to the client-side software components of the HA cluster. The devices themselves are referred to as "SBD device" or "SBD node", with SBD in upper case.
The daemon constantly monitors the SBD devices and is resilient against single SBD device failures: it logs any failure, and fencing remains functional as long as a majority of SBD devices are still accessible. The agent monitor, on the other hand, fails if any one of the SBD devices is inaccessible; an OCF_TIMEOUT error is then logged and the agent is marked with a FAILED status. When Pacemaker attempts to recover the agent by starting it on a different cluster node, the start operation fails if the problem SBD node is still not present, and the agent remains in a Stopped state until the administrator rectifies any outstanding problems with the affected SBD node and cleans up the agent. If a problem with a single SBD device resolves itself before the agent runs its monitor, no manual intervention is required. The agent monitor is therefore not only redundant; running it at a relatively short interval is also undesirable from an operations standpoint, hence the agent's default monitor interval of 3600 seconds.
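The difference between the two checks can be illustrated with a simplified model. This is only a sketch of the rules as described in the paragraph above, not the actual sbd implementation; the function names daemon_can_fence and agent_monitor_passes are hypothetical.

```shell
#!/bin/sh
# Simplified model of the two checks described above (not the actual sbd code).
# $1 = total number of SBD devices, $2 = number currently accessible.

# Daemon rule: fencing stays functional while a majority is accessible.
daemon_can_fence() {
    [ $(( $2 * 2 )) -gt "$1" ]
}

# Agent monitor rule: fails as soon as any one device is inaccessible.
agent_monitor_passes() {
    [ "$2" -eq "$1" ]
}

# Example: three SBD devices, one of them inaccessible.
if daemon_can_fence 3 2; then echo "daemon: fencing functional"; fi
if ! agent_monitor_passes 3 2; then echo "agent: monitor fails"; fi
```

With three devices and one inaccessible, the model reports that fencing is still functional while the agent monitor fails, which matches the situation described in this document.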
The agent failure is only cosmetic: Pacemaker can still fence an unhealthy node through the daemon after the agent ends up in a FAILED or Stopped status, as long as the fencing mechanism is still registered. To check this, run "stonith_admin -L" on any cluster node. If there is a real problem with Pacemaker's ability to fence a node, this command returns a "No Such Device" error.
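The registration check can also be scripted. The following is a minimal sketch, assuming that "stonith_admin -L" prints one registered device ID per line (possibly indented) followed by a summary line; the exact output format may differ between Pacemaker versions. The function name device_is_registered is hypothetical, and stonith-sbd is the resource name used in this document.

```shell
#!/bin/sh
# Sketch: check whether a fencing device ID appears in a device listing.
# Assumption: the listing on stdin has one device ID per line, as the first
# whitespace-separated field, followed by a summary line.
device_is_registered() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Intended use on a cluster node (commented out so the sketch runs anywhere):
# stonith_admin -L | device_is_registered stonith-sbd && echo "fencing registered"
```

The function returns success only when a line starts with exactly the given device ID, so a similarly named device does not produce a false match.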
See also:
https://www.linux-ha.org/wiki/SBD_Fencing
Excerpt:
"The sbd agent does not need to and should not be cloned. If all of your nodes run SBD, as is most likely, not even a monitor action provides a real benefit, since the daemon would suicide the node if there was a problem."
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020383
- Creation Date: 08-Sep-2021
- Modified Date: 10-Sep-2021
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com