SBD setup - debug and verify (OPENAIS)
This document (7009485) is provided subject to the disclaimer at the end of this document.
Environment
Situation
The setup is described at http://linux-ha.org/wiki/SBD_Fencing
Resolution
Oct 5 15:38:36 fozzie sbd: [5263]: info: fozzie owns slot 2
Oct 5 15:38:36 fozzie sbd: [5263]: info: Monitoring slot 2 on disk /dev/disk/by-id/scsi-2001b4d281000fee0-part1
Oct 5 15:38:37 fozzie sbd: [5262]: notice: Using watchdog device: /dev/watchdog
Oct 5 15:38:37 fozzie sbd: [5262]: info: Set watchdog timeout to 10 seconds.
Oct 5 15:38:37 fozzie corosync[5267]: [MAIN ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service.
On a running cluster node, the sbd processes can be identified with ps:
fozzie:~ # ps aux | grep sbd
root 5262 0.0 0.1 38244 5540 pts/0 SL 15:38 0:00 sbd: inquisitor
root 5263 0.0 0.1 38252 5528 pts/0 SL 15:38 0:00 sbd: watcher: /dev/disk/by-id/scsi-2001b4d281000fee0-part1 - slot: 2
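The check above can also be scripted. The following is a minimal sketch, assuming that a healthy node shows one "sbd: inquisitor" process and at least one "sbd: watcher" process as in the listing above. The function name sbd_roles_present is hypothetical and not part of sbd.

```shell
# Hypothetical helper: read `ps aux`-style text on stdin and succeed only if
# both an sbd inquisitor and at least one sbd watcher process are listed.
sbd_roles_present() {
    awk '
        /sbd: inquisitor/ { inquisitor = 1 }
        /sbd: watcher:/   { watcher = 1 }
        END { exit !(inquisitor && watcher) }
    '
}

# Usage on a cluster node:
#   ps aux | sbd_roles_present && echo "sbd inquisitor and watcher are running"
```

Because the function matches on the "sbd: " process titles, the "grep sbd" line that appears in a plain ps aux | grep sbd listing is not counted.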
To verify that SBD STONITH is working properly, kill all sbd processes on that node:
fozzie:~ # killall -9 sbd
fozzie:~ # ps aux | grep sbd
root 5551 0.0 0.0 4348 772 pts/0 S+ 15:42 0:00 grep sbd
If the watchdog device is working, this node will reboot once the watchdog timeout is reached; the default is usually 60 seconds. If the node does not reboot after this time, the watchdog is not working and sbd STONITH is therefore not reliable. This issue should be addressed before any further action is taken on the cluster, as data integrity cannot be ensured.
One possible problem is that the watchdog module loaded does not work. In this case the preferred approach would be to contact the hardware vendor and ask for a module recommendation. Information and hints in this direction can normally be found as ERROR and CRIT messages in the logs or in dmesg output.
If this is not viable, the softdog module can be used as a workaround.
To ensure that either the module recommended by the hardware vendor or the module selected by the administrator is loaded, it can be added to INITRD_MODULES in /etc/sysconfig/kernel. An example would look like:
INITRD_MODULES="softdog processor thermal cciss qla4xxx pata_amd ata_generic amd74xx ide_pci_generic fan jbd ext3 edd"
which forces the load of the softdog module during boot. Any other means of ensuring that the selected module gets loaded is fine.
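Whether the selected module is actually loaded after boot can be checked from the lsmod output. The following is a minimal sketch; the function name module_loaded is hypothetical, and softdog is used only because it is the module from the example above.

```shell
# Hypothetical helper: read `lsmod`-style text on stdin and succeed if the
# module named in $1 is listed. The first line (column header) is skipped.
module_loaded() {
    awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Usage on a cluster node:
#   lsmod | module_loaded softdog && echo "softdog is loaded"
```

If the module is missing, it can be loaded for the current boot with modprobe softdog before the test is repeated.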
Repeat the test by starting openais and check whether the node is now fenced when the sbd processes are killed.
Another possible problem in the SBD STONITH setup is a timeout mismatch between the sbd device and the cluster.
The cluster has a default "stonith-timeout" parameter of 60 seconds.
The corresponding value in the sbd partition is the "Timeout (msgwait)", which defaults to 10 seconds. The "Timeout (msgwait)" must be twice the "Timeout (watchdog)".
The default settings will not cause any issues. Sometimes, however, depending on the setup of the cluster (e.g. with sbd on multipath), the timeout values of the sbd device are changed to allow for the latency with which the underlying systems report to the sbd device.
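When the timeouts on the sbd device have been changed, the relation between the two values can be checked from the header dump (sbd -d <device> dump). The following is a minimal sketch, assuming the dump prints "Timeout (watchdog)" and "Timeout (msgwait)" lines in the "name : value" form; the function name msgwait_is_double is hypothetical.

```shell
# Hypothetical helper: read `sbd -d <device> dump` text on stdin and succeed
# only if Timeout (msgwait) is exactly twice Timeout (watchdog).
msgwait_is_double() {
    awk -F: '
        /Timeout \(watchdog\)/ { watchdog = $2 + 0 }
        /Timeout \(msgwait\)/  { msgwait  = $2 + 0 }
        END { exit !(watchdog > 0 && msgwait == 2 * watchdog) }
    '
}

# Usage on a cluster node (the device path is whatever the cluster uses):
#   sbd -d <device> dump | msgwait_is_double || echo "check the sbd timeouts"
```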
The value of stonith-timeout should always be greater than or equal to 120% of "Timeout (msgwait)", that is:
Timeout (msgwait) / 100 * 120
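The rule above is simple integer arithmetic and can be sketched as a small shell function. The name min_stonith_timeout is hypothetical, and rounding up to the next full second is an assumption made here so that the result never falls below 120% of msgwait.

```shell
# Hypothetical helper: print the smallest stonith-timeout (in seconds) that
# satisfies stonith-timeout >= msgwait / 100 * 120, rounded up to a full second.
min_stonith_timeout() {
    msgwait=$1
    echo $(( (msgwait * 120 + 99) / 100 ))
}

# Usage:
#   min_stonith_timeout 10    # default msgwait of 10 seconds
```

With the default msgwait of 10 seconds the function prints 12, which is well below the default stonith-timeout of 60 seconds.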
A failure to follow this rule will show in the logs during failover. One node fails (for some reason) and should be fenced. The fencing node starts the sbd STONITH operation to fence it, but the operation returns with a timeout in the logs:
Sep 29 13:16:24 whiskey stonith-ng: [19622]: ERROR: remote_op_timeout: Action reboot (27837044-1b8c-409a-8e93-64450e03affe) for vodka timed out
This is, for example, the result of using
vodka:/var/log # sbd -d /dev/mapper/clusterSBD dump
==Dumping header on disk /dev/mapper/clusterSBD
Header version : 2
Number of slots : 255
Sector size : 512
Timeout (watchdog) : 90
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 180
and setting (which is not correct!)
stonith-timeout="150s"
In the above example the correct value of stonith-timeout would be
180/100*120 = 216
so stonith-timeout="216s" resolves the issue.
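The comparison from this example can be automated against the header dump. The following is a minimal sketch, assuming the dump format shown above; the function name stonith_timeout_ok is hypothetical, and the configured stonith-timeout is passed in by hand as a number of seconds rather than read from the cluster configuration.

```shell
# Hypothetical helper: read `sbd -d <device> dump` text on stdin and compare
# the stonith-timeout given in $1 (seconds) with 120% of Timeout (msgwait).
# Returns 0 if the configured value is large enough, 1 if it is too small,
# and 2 if no msgwait line was found in the input.
stonith_timeout_ok() {
    configured=$1
    msgwait=$(awk -F: '/Timeout \(msgwait\)/ { print $2 + 0 }')
    if [ -z "$msgwait" ]; then
        echo "no Timeout (msgwait) line found in input" >&2
        return 2
    fi
    required=$(( (msgwait * 120 + 99) / 100 ))
    echo "msgwait=${msgwait}s required=${required}s configured=${configured}s"
    [ "$configured" -ge "$required" ]
}

# Usage with the values from the example above:
#   sbd -d /dev/mapper/clusterSBD dump | stonith_timeout_ok 150   # fails
#   sbd -d /dev/mapper/clusterSBD dump | stonith_timeout_ok 216   # succeeds
```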
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7009485
- Creation Date: 05-Oct-2011
- Modified Date: 03-Mar-2020
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com