How to safely change SBD timeout settings in a Pacemaker cluster running SAP HANA
This document (000021362) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 12
Split Brain Detection (SBD)
Situation
If the Pacemaker cluster is not managing SAP HANA, follow the procedure described in TID#000019396 instead.
Note that this document is not intended to show what values should be used, only *how* to change them. For recommendations about values, see TID#000017952.
The following procedure is based on the one described in the man page SAPHanaSR_maintenance_examples(7). It describes how to change the value of "Timeout (watchdog)" from 5 to 10 and the value of "Timeout (msgwait)" from 10 to 20 on the SBD device(s):
Current values:
Timeout (watchdog) : 5
Timeout (msgwait) : 10
New values:
Timeout (watchdog) : 10
Timeout (msgwait) : 20
The example below uses a two-node Pacemaker cluster managing SAP HANA resources with the following configuration, as per the "Scale-Up Performance Optimized Scenario" guide:
node 1: mythicnode09
node 2: mythicnode10
SID: HA1
Instance Number: 10
Site Name for Primary: PRIMSITE
Site Name for Secondary: SECSITE
Name of the multistate cluster resource: msl_SAPHana_HA1_HDB10
SBD device: /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149
Resolution
- The following commands need to be executed as the root user or a user with equivalent permissions.
- Make sure none of the cluster resources are in a stopped state before putting the cluster in maintenance mode.
- Make sure the cluster is stopped and restarted as described below; otherwise the new settings will not be activated on the cluster nodes.
- Verify that the SBD service was successfully stopped.
- In case existing SBD devices are exchanged for new ones, keep in mind to update /etc/sysconfig/sbd accordingly (see the sketch below).
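For reference, the SBD devices used by the sbd service are defined by the SBD_DEVICE variable in /etc/sysconfig/sbd. The following is only an illustrative sketch using the device from this example; multiple devices are separated by semicolons, and the remaining settings in that file depend on the local setup:

# grep -E '^SBD_DEVICE' /etc/sysconfig/sbd
SBD_DEVICE="/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149"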
1- Check the status of the HANA replication on the "Primary" node, run:
# su - ha1adm -c 'HDBSettings.sh systemReplicationStatus.py'
Example:
|Database |Host         |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary    |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |             |      |             |          |        |          |Host         |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------------ |----- |------------ |--------- |------- |--------- |------------ |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |mythicnode09 |31001 |nameserver   |        1 |      1 |PRIMSITE  |mythicnode10 |    31001 |        2 |SECSITE   |YES           |SYNC        |ACTIVE      |               |        True |
|HA1      |mythicnode09 |31007 |xsengine     |        3 |      1 |PRIMSITE  |mythicnode10 |    31007 |        2 |SECSITE   |YES           |SYNC        |ACTIVE      |               |        True |
|HA1      |mythicnode09 |31003 |indexserver  |        2 |      1 |PRIMSITE  |mythicnode10 |    31003 |        2 |SECSITE   |YES           |SYNC        |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: PRIMSITE
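In addition to reading the table above, the exit code of systemReplicationStatus.py can be checked. On recent SAP HANA 2.0 revisions an exit code of 15 is commonly documented as "overall replication status ACTIVE"; treat this mapping as an assumption and verify it against the SAP documentation for the installed revision. A minimal sketch:

# su - ha1adm -c 'HDBSettings.sh systemReplicationStatus.py' > /dev/null; echo "exit code: $?"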
2- Check the status of the HANA replication on the "Secondary" node:
# su - ha1adm -c 'HDBSettings.sh systemReplicationStatus.py'
Example:
this system is either not running or not primary system replication site

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: SYNC
site id: 2
site name: SECSITE
active primary site: 1
primary masters: mythicnode09
3- Check the status of the cluster and the resources, run on any node:
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node mythicnode09: online:
    * Resources:
      * stonith-sbd (stonith:external/sbd): Started
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Master
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_clone_state : PROMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : PRIM
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708412663
    * master-rsc_SAPHana_HA1_HDB10 : 150
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : SOK
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : 100
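As a convenience, and in line with the note above about not proceeding while resources are stopped, the output can also be filtered for anything stopped or failed. This is only a quick sketch and does not replace reviewing the full crm_mon output:

# crm_mon -r1 | grep -iE 'stopped|failed' || echo "no stopped or failed resources reported"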
4- Get the current settings of the SBD device:
# sbd -d <device> dump
Example:
# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 dump
==Dumping header on disk /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149
Header version     : 2.1
UUID               : 0de4f60f-4cbb-4083-85fc-5619cb76829f
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 is dumped
5- Verify that the SBD device is "clean" for all the nodes and make sure that there are no pending "reset" operations:
# sbd -d <device> list
Example:
# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 list
0   mythicnode09   clear
1   mythicnode10   clear
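If the list output shows a pending message (for example "reset") instead of "clear", investigate the cause first; once understood, the slot can be cleared with the sbd "message" sub-command. A hedged sketch using the device and node names from this example:

# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 message mythicnode10 clear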
6- Set the HANA multistate resource into "maintenance":
# crm resource maintenance msl_SAPHana_HA1_HDB10 on
7- Check the current status; the SAPHana resource should show as "unmanaged" on all the nodes:
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node mythicnode09: online:
    * Resources:
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Master (unmanaged)
      * stonith-sbd (stonith:external/sbd): Started
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_clone_state : PROMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : PRIM
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708413925
    * master-rsc_SAPHana_HA1_HDB10 : 150
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : SOK
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : 100
8- Set the cluster into maintenance mode, run on any node:
# crm configure property maintenance-mode=true
9- Check the current status; it should show all the resources as "unmanaged":
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

*** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Node List:
  * Node mythicnode09: online:
    * Resources:
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started (unmanaged)
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Master (unmanaged)
      * stonith-sbd (stonith:external/sbd): Started (unmanaged)
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started (unmanaged)
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started (unmanaged)
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_clone_state : PROMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : PRIM
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708413925
    * master-rsc_SAPHana_HA1_HDB10 : 150
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_sra : -
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : SOK
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : 100
10- Stop the Linux cluster. Make sure to do this on all the nodes:
# crm cluster stop
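When working from a single terminal, the stop command can be issued to both example nodes in one loop. This sketch assumes passwordless root ssh between the nodes; otherwise simply run the command locally on each node:

# for node in mythicnode09 mythicnode10; do ssh root@"$node" 'crm cluster stop'; done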
11- Double-check and make sure that all cluster-related services are stopped on all the nodes:
# systemctl status pacemaker.service
# systemctl status corosync.service
# systemctl status sbd.service
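As a quick combined check, all three services can also be queried in one call; every reported state should be "inactive" before continuing:

# systemctl is-active pacemaker.service corosync.service sbd.service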
12- Set the value of "Timeout (watchdog)" to 10 and the value of "Timeout (msgwait)" to 20 on the SBD device:
# sbd -d <device> -1 <N> -4 <N> create
Example:
# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 -1 10 -4 20 create
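The "create" action writes a new header (including a new UUID) with the given timeouts, where -1 sets "Timeout (watchdog)" and -4 sets "Timeout (msgwait)". If more than one SBD device is configured, run the same command against each of them; a hedged sketch that iterates over the devices listed in SBD_DEVICE in /etc/sysconfig/sbd (semicolon-separated):

# . /etc/sysconfig/sbd; for dev in $(echo "$SBD_DEVICE" | tr ';' ' '); do sbd -d "$dev" -1 10 -4 20 create; done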
13- Get the current settings of the SBD device, and verify that the new values were applied:
# sbd -d <device> dump
# sbd -d <device> list
Example:
# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 dump
==Dumping header on disk /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149
Header version     : 2.1
UUID               : 2cfa5a55-29d8-474f-b39d-bab906802e8c
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 is dumped
# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 list
0   mythicnode09   clear
1   mythicnode10   clear
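To confirm just the two changed values at a glance, the dump output can also be filtered; a small convenience sketch:

# sbd -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_WMDP0A966149 dump | grep -E 'watchdog|msgwait'
Timeout (watchdog) : 10
Timeout (msgwait)  : 20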
14- Start the Linux cluster. Make sure to do this on all the nodes:
# crm cluster start
15- Check the current status. It should show all nodes as "online" and the resources as "unmanaged", but the HANA replication scores will no longer show as 150 and 100; this is expected:
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

*** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Node List:
  * Node mythicnode09: online:
    * Resources:
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started (unmanaged)
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started (unmanaged)
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started (unmanaged)

Inactive Resources:
  * stonith-sbd (stonith:external/sbd): Stopped (unmanaged)

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_site : PRIMSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708413925
    * master-rsc_SAPHana_HA1_HDB10 : -1
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_site : SECSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : -1
16- Disable "maintenance mode" at the cluster level. The resources should show as "Started", except for the multistate resource, which should still show as "unmanaged":
# crm configure property maintenance-mode=false
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * Last updated: Tue Feb 20 08:58:59 2024
  * Last change: Tue Feb 20 08:58:53 2024 by root via cibadmin on mythicnode09
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node mythicnode09: online:
    * Resources:
      * stonith-sbd (stonith:external/sbd): Started
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708413925
    * master-rsc_SAPHana_HA1_HDB10 : -1
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : -1
17- Let the cluster detect the status of the HANA multistate resource; run on any node:
# crm resource refresh msl_...
Example:
# crm resource refresh msl_SAPHana_HA1_HDB10
Cleaned up rsc_SAPHana_HA1_HDB10:0 on mythicnode10
Cleaned up rsc_SAPHana_HA1_HDB10:1 on mythicnode09
Waiting for 2 replies from the controller
... got reply
... got reply (done)
18- Wait for the refresh to complete all the operations and then check the current status. The resources should show as "Started", except for the multistate resource, which should still show as "unmanaged", but now the HANA replication scores should show as 150 on the "Primary" node and 100 on the "Secondary" node:
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node mythicnode09: online:
    * Resources:
      * stonith-sbd (stonith:external/sbd): Started
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave (unmanaged)

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708413925
    * master-rsc_SAPHana_HA1_HDB10 : 150
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : 100
19- Once the HANA roles and score values show as in the example above, disable "maintenance" on the HANA multistate resource; run on any node:
# crm resource maintenance msl_... off
Example:
# crm resource maintenance msl_SAPHana_HA1_HDB10 off
20- Check the current status; it should show all the resources and HANA with the same status as in step 3, before this procedure was started:
# crm_mon -Ar1
Example:
Cluster Summary:
  * Stack: corosync
  * Current DC: mythicnode09 - partition with quorum
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node mythicnode09: online:
    * Resources:
      * stonith-sbd (stonith:external/sbd): Started
      * rsc_ip_HA1_HDB10 (ocf::heartbeat:IPaddr2): Started
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Master
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started
  * Node mythicnode10: online:
    * Resources:
      * rsc_SAPHana_HA1_HDB10 (ocf::suse:SAPHana): Slave
      * rsc_SAPHanaTopology_HA1_HDB10 (ocf::suse:SAPHanaTopology): Started

Inactive Resources:
  * No inactive resources

Node Attributes:
  * Node: mythicnode09:
    * hana_ha1_clone_state : PROMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode10
    * hana_ha1_roles : 4:P:master1:master:worker:master
    * hana_ha1_site : PRIMSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : PRIM
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode09
    * lpa_ha1_lpt : 1708416619
    * master-rsc_SAPHana_HA1_HDB10 : 150
  * Node: mythicnode10:
    * hana_ha1_clone_state : DEMOTED
    * hana_ha1_op_mode : logreplay
    * hana_ha1_remoteHost : mythicnode09
    * hana_ha1_roles : 4:S:master1:master:worker:master
    * hana_ha1_site : SECSITE
    * hana_ha1_srah : -
    * hana_ha1_srmode : sync
    * hana_ha1_sync_state : SOK
    * hana_ha1_version : 2.00.060.00
    * hana_ha1_vhost : mythicnode10
    * lpa_ha1_lpt : 30
    * master-rsc_SAPHana_HA1_HDB10 : 100
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021362
- Creation Date: 20-Feb-2024
- Modified Date: 20-Feb-2024
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com