How to safely change sbd timeout settings in a running pacemaker cluster

This document (7023689) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 12

Situation

This TID covers generic HA clusters. For clusters running SAP HANA workloads, see TID https://www.suse.com/support/kb/doc/?id=000021362
The timeout settings of the configured sbd device(s) may need to be adjusted for various reasons. In the following example, the watchdog (90) and msgwait (180) timeouts are the values to be changed:

sles12cluster2:~ # sbd -d /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe dump
==Dumping header on disk /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe
Header version     : 2.1
UUID               : 62caa488-cbee-4449-84c3-5fd0659dcc09
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 90
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 180
==Header on disk /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe is dumped

Note that this document is not intended to show what values should be used, only *how* to change them. For recommendations about values, see TID https://www.suse.com/support/kb/doc/?id=000017952
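
The two values of interest can also be read from the dump output in a script, for example to record them before a change. The following is a minimal sketch, assuming the dump format shown above; the function name sbd_timeout_from_dump is hypothetical and not part of sbd.

```shell
#!/bin/bash
# Print the value of one "Timeout (<name>)" field from `sbd -d <device> dump`
# output read on stdin. Hypothetical helper, not part of the sbd package.
# $1: field name as shown in the dump, e.g. watchdog or msgwait
sbd_timeout_from_dump() {
    awk -F':' -v name="$1" '
        # Match only the requested "Timeout (<name>)" line.
        $1 ~ "^Timeout \\(" name "\\)" {
            gsub(/[[:space:]]/, "", $2)   # strip padding around the value
            print $2
        }
    '
}

# Usage (as root, on a cluster node):
#   sbd -d <device> dump | sbd_timeout_from_dump watchdog
#   sbd -d <device> dump | sbd_timeout_from_dump msgwait
```

Fed with the dump shown above, the two calls would print 90 and 180.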

Resolution

Notes:

- The following commands need to be executed as the root user or as a user with equivalent permissions.

- Make sure none of the cluster resources is in a stopped state before putting the cluster into maintenance mode.

- Make sure the cluster is stopped and restarted as described below; otherwise the new settings will not be activated on the cluster nodes.

- Verify that the sbd service was stopped successfully by checking the output of: systemctl status sbd

- If existing sbd devices are exchanged for new ones, remember to update /etc/sysconfig/sbd accordingly.
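
To see which devices a node is actually configured to use, the SBD_DEVICE entry in /etc/sysconfig/sbd can be listed. The sketch below assumes that SBD_DEVICE holds one or more device paths separated by semicolons; the function name sbd_configured_devices is hypothetical.

```shell
#!/bin/bash
# Print each device configured in SBD_DEVICE, one per line.
# Hypothetical helper, not part of the sbd package.
# $1: path to the sysconfig file (defaults to /etc/sysconfig/sbd)
sbd_configured_devices() {
    local conf="${1:-/etc/sysconfig/sbd}"
    # Source the file in a subshell so the caller's environment stays untouched.
    (
        . "$conf" || exit 1
        # Assumption: multiple devices are separated by semicolons.
        printf '%s\n' "$SBD_DEVICE" | tr ';' '\n' | sed '/^[[:space:]]*$/d'
    )
}

# Usage (as root): dump the header of every configured device.
#   for dev in $(sbd_configured_devices); do sbd -d "$dev" dump; done
```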

1. Run the following command to display the current settings of the sbd device:

# sbd -d <device> dump

2. Put the cluster into maintenance mode:

# crm configure property maintenance-mode=true

3. Verify that all cluster resources are in "unmanaged" state:

# crm status

4. Stop the cluster services on all nodes (run the command on each node):

# crm cluster stop

5. Recreate the metadata on the sbd device(s), where -4 sets the msgwait timeout and -1 sets the watchdog timeout:

# sbd -d <device> -4 <msgwait> -1 <watchdog> create

Full example (using three sbd disks):

# sbd -d /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe -d /dev/disk/by-id/scsi-36000c29d7b18a8c4a6e980da7fd74fab -d /dev/disk/by-id/scsi-36000c2912306cd2a42adc9c0c95f450c -4 20 -1 10 create

6. Start the cluster services on all nodes (run the command on each node):

# crm cluster start

7. Check the sbd partition information:

# sbd -d <device> dump

and make sure the cluster nodes have been assigned a slot:

# sbd -d <device> list

8. Put the cluster back to normal mode:

# crm configure property maintenance-mode=false
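
With several sbd disks, the create command in step 5 becomes long and the two numeric options are easy to swap. The sketch below only prints the command for review and does not run it. It assumes that -1 is the watchdog timeout and -4 the msgwait timeout, which matches the full example and the dump output in this document; the function name sbd_create_cmd is hypothetical.

```shell
#!/bin/bash
# Print (do not run) the sbd create command for the given timeouts and devices.
# Hypothetical helper, not part of the sbd package.
# $1: watchdog timeout (-1), $2: msgwait timeout (-4), remaining args: devices
sbd_create_cmd() {
    local watchdog="$1" msgwait="$2"
    shift 2
    local cmd="sbd" dev
    # One -d option per device, in the order given.
    for dev in "$@"; do
        cmd="$cmd -d $dev"
    done
    printf '%s -4 %s -1 %s create\n' "$cmd" "$msgwait" "$watchdog"
}

# Usage: review the printed command, then run it as root while the cluster
# services are stopped on all nodes.
#   sbd_create_cmd 10 20 <device1> <device2> <device3>
```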

Additional Information

Output of sbd -d <device> dump after the change:

sles12cluster2:~ # sbd -d /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe dump
==Dumping header on disk /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe
Header version     : 2.1
UUID               : f2faed5e-c0a5-46a8-8fb8-45d7bab44182
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe is dumped
sles12cluster2:~ # sbd -d /dev/disk/by-id/scsi-36000c29c0348eb3640b99be0f96e80fe list
0       sles12cluster2  clear
1       sles12cluster1  clear
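
The slot check can be scripted against the list output. This is a minimal sketch, assuming the three-column format shown above (slot number, node name, message); the function name sbd_node_has_slot is hypothetical.

```shell
#!/bin/bash
# Succeed if the given node name owns a slot in `sbd -d <device> list` output
# read on stdin. Hypothetical helper, not part of the sbd package.
# $1: node name to look for
sbd_node_has_slot() {
    awk -v node="$1" '
        $2 == node { found = 1 }        # second column is the node name
        END { exit (found ? 0 : 1) }
    '
}

# Usage (as root, once per cluster node name):
#   sbd -d <device> list | sbd_node_has_slot sles12cluster1 || echo "no slot"
```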

Output of systool -vc watchdog after the change (12 SP4 and later):

systool -vc watchdog
Class = "watchdog"

  Class Device = "watchdog0"
  Class Device path = "/sys/devices/virtual/watchdog/watchdog0"
    bootstatus          = "0"
    dev                 = "249:0"
    identity            = "Software Watchdog"
    nowayout            = "0"
    pretimeout          = "0"
    pretimeout_available_governors= "noop"
    pretimeout_governor = "noop"
    state               = "active"
    status              = "0x8000"
    timeout             = "10"
    uevent              = "MAJOR=249
MINOR=0
DEVNAME=watchdog0"
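
The same timeout attribute can be read directly from sysfs and compared with the watchdog timeout written to the sbd header. The sketch below assumes the device is watchdog0 and that the kernel exposes the attribute under /sys/class/watchdog, as the systool output above suggests; both function names are hypothetical.

```shell
#!/bin/bash
# Print the timeout of a watchdog device as reported by the kernel.
# Hypothetical helper. $1: sysfs directory of the device
# (defaults to /sys/class/watchdog/watchdog0, an assumption).
watchdog_sysfs_timeout() {
    local dir="${1:-/sys/class/watchdog/watchdog0}"
    cat "$dir/timeout"
}

# Succeed if the kernel timeout equals the expected value.
# $1: expected timeout in seconds, $2: optional sysfs directory
watchdog_timeout_matches() {
    [ "$(watchdog_sysfs_timeout "$2")" = "$1" ]
}

# Usage, after the cluster has been restarted with the new settings:
#   watchdog_timeout_matches 10 || echo "watchdog timeout differs from sbd header"
```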

For more information please refer to:

SUSE Linux Enterprise High Availability Extension - Chapter 11, Storage Protection and SBD

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.