Protect HANA against manually caused dual-primary situation in SUSE HA cluster
This document (000021044) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 12
Situation
Such a dual-primary situation arises when an administrator manually triggers an sr_takeover on the secondary site while the Linux cluster is still managing the original primary. On a running cluster it looks like this:
pizbuin02:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Mon Apr 17 17:57:43 2023

Sites srHook
-------------
JWD   SFAIL
WDF   PRIM

Hosts     clone_state lpa_ha1_lpt node_state op_mode   remoteHost roles                            score site srah srmode sync_state version     vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
pizbuin01 PROMOTED    1681747066  online     logreplay pizbuin02  4:P:master1:master:worker:master 150   WDF  -    sync   PRIM       2.00.070.00 pizbuin01
pizbuin02 UNDEFINED   10          online     logreplay pizbuin01  1:P:master1::worker:             -9000 JWD  -    sync   SFAIL      2.00.070.00 pizbuin02

pizbuin02:~ # crm_mon -1r
Cluster Summary:
  * Stack: corosync
  * Current DC: pizbuin02 (version 2.1.2+20211124.ada5c3b36-150400.4.9.2-2.1.2+20211124.ada5c3b36) - partition with quorum
  * Last updated: Mon Apr 17 17:59:18 2023
  * Last change: Mon Apr 17 17:59:10 2023 by root via crm_attribute on pizbuin01
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ pizbuin01 pizbuin02 ]

Full List of Resources:
  * rsc_stonith_sbd (stonith:external/sbd): Started pizbuin01
  * rsc_ip_HA1 (ocf::heartbeat:IPaddr2): Started pizbuin01
  * Clone Set: msl_SAPHana_HA1_HDB00 [rsc_SAPHana_HA1_HDB00] (promotable):
    * Masters: [ pizbuin01 ]
    * Stopped: [ pizbuin02 ]
  * Clone Set: cln_SAPHanaTopology_HA1_HDB00 [rsc_SAPHanaTopology_HA1_HDB00]:
    * Started: [ pizbuin01 pizbuin02 ]

Failed Resource Actions:
  * rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02 'not running' (7): call=23, status='complete', last-rc-change='Mon Apr 17 17:57:06 2023', queued=0ms, exec=0ms
  * rsc_SAPHana_HA1_HDB00_start_0 on pizbuin02 'not running' (7): call=26, status='complete', last-rc-change='Mon Apr 17 17:57:19 2023', queued=0ms, exec=2266ms
In the system logs it looks like this:
pizbuin02:~ # grep "2023-04-17T17:5.*SAPHana.*rsc_SAPHana_" /var/log/messages | less ... 2023-04-17T17:56:58.933998+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: RA ==== begin action monitor_clone (0.162.1) ==== 2023-04-17T17:56:59.002878+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: hana_ha1_site_srHook_JWD=SFAIL 2023-04-17T17:56:59.008553+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Finally get_SRHOOK()=SFAIL 2023-04-17T17:57:01.856625+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: ACT: systemd service SAPHA1_00.service is active 2023-04-17T17:57:04.971203+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: hana_ha1_site_srHook_JWD=SFAIL 2023-04-17T17:57:05.021509+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Finally get_SRHOOK()=SFAIL 2023-04-17T17:57:05.721836+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Dual primary detected, other instance is PROMOTED and lpa stalemate ==> local restart 2023-04-17T17:57:06.124219+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: saphana_monitor_primary: scoring_crm_master(4:S:master1:master:worker:master,SFAIL) 2023-04-17T17:57:06.321256+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: roles(4:S:master1:master:worker:master) are matching pattern ([0-9]*:S:[^:]*:master) 2023-04-17T17:57:06.364610+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: sync(SFAIL) is matching syncPattern (SFAIL) 2023-04-17T17:57:06.412470+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: set score -INFINITY 2023-04-17T17:57:06.579896+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: RA ==== end action monitor_clone with rc=7 (0.162.1) (8s)==== ... 2023-04-17T17:57:07.373671+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: RA ==== begin action stop_clone (0.162.1) ==== 2023-04-17T17:57:09.341320+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: systemd service SAPHA1_00.service is active 2023-04-17T17:57:09.363726+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: Stopping SAP Instance HA1-HDB00: #01217.04.2023 17:57:09#012Stop#012OK 2023-04-17T17:57:19.398137+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: SAP Instance HA1-HDB00 stopped: #01217.04.2023 17:57:19#012WaitforStopped#012OK 2023-04-17T17:57:19.418711+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: RA ==== end action stop_clone with rc=0 (0.162.1) (13s)==== ... pizbuin02:~ # grep "2023-04-17T17:57:06.*pacemaker-" /var/log/messages | less ... 
2023-04-17T17:57:06.617968+02:00 pizbuin02 pacemaker-controld[32198]: notice: Result of monitor operation for rsc_SAPHana_HA1_HDB00 on pizbuin02: not running 2023-04-17T17:57:06.618070+02:00 pizbuin02 pacemaker-controld[32198]: notice: pizbuin02-rsc_SAPHana_HA1_HDB00_monitor_20000:23 [ 4:S:master1:master:worker:master\n4:S:master1:master:worker:master\n4:S:master1:master:worker:master\n4:S:master1:master:worker:master\nSFAIL\n ] 2023-04-17T17:57:06.621896+02:00 pizbuin02 pacemaker-controld[32198]: notice: Transition 3 action 14 (rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02): expected 'ok' but got 'not running' 2023-04-17T17:57:06.623678+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting fail-count-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: (unset) -> 1 2023-04-17T17:57:06.625162+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting last-failure-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: (unset) -> 1681747026 2023-04-17T17:57:06.625943+02:00 pizbuin02 pacemaker-controld[32198]: notice: Transition 3 action 14 (rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02): expected 'ok' but got 'not running' 2023-04-17T17:57:06.628382+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting fail-count-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: 1 -> 2 2023-04-17T17:57:06.675872+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.678790+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.678846+02:00 pizbuin02 pacemaker-schedulerd[32197]: notice: Actions: Recover rsc_SAPHana_HA1_HDB00:1 ( Slave pizbuin02 ) 2023-04-17T17:57:06.692252+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.692335+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.693665+02:00 pizbuin02 pacemaker-schedulerd[32197]: notice: Actions: Recover rsc_SAPHana_HA1_HDB00:1 ( Slave pizbuin02 ) 2023-04-17T17:57:06.694973+02:00 pizbuin02 pacemaker-controld[32198]: notice: Initiating stop operation rsc_SAPHana_HA1_HDB00_stop_0 locally on pizbuin02 2023-04-17T17:57:06.695901+02:00 pizbuin02 pacemaker-controld[32198]: notice: Requesting local execution of stop operation for rsc_SAPHana_HA1_HDB00 on pizbuin02 2023-04-17T17:57:06.697211+02:00 pizbuin02 pacemaker-execd[32195]: notice: executing - rsc:rsc_SAPHana_HA1_HDB00 action:stop call_id:25 2023-04-17T17:57:06.862389+02:00 pizbuin02 SAPHanaTopology(rsc_SAPHanaTopology_HA1_HDB00)[9111]: INFO: DEC: site=JWD, mode=primary, hanaRemoteHost=pizbuin01 - found by remote site (WDF) ...
Resolution
To guard against this, the SAP HANA nameserver provides a Python-based API ("HA/DR providers"), which is called at important points of the host auto-failover and system replication takeover process. The method preTakeover() is called before any sr_takeover action.
The HA/DR provider hook script susTkOver.py permits a manual takeover of the HANA primary only if the SAP HANA multi-state resource (managed by SAPHana or SAPHanaController) is set to maintenance or the Linux cluster is stopped.
Otherwise the manual takeover is blocked. In that case an error message on the Linux console and in HANA Cockpit reminds the admin to use an appropriate cluster maintenance procedure.
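For illustration, the following outline sketches a permitted takeover under cluster maintenance, assuming the resource name msl_SAPHana_HA1_HDB00 from the example above. This is a sketch only; the complete supported procedure, including registering the former primary as new secondary, is described in SAPHanaSR_maintenance_examples(7):
# crm resource maintenance msl_SAPHana_HA1_HDB00 on
# su - ha1adm -c "hdbnsutil -sr_takeover"
# (register the former primary as new secondary, see the man page)
# crm resource refresh msl_SAPHana_HA1_HDB00
# crm resource maintenance msl_SAPHana_HA1_HDB00 off
On older crmsh versions, "crm resource cleanup" may be needed instead of "crm resource refresh".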
This hook script needs to be configured and activated on all HANA nodes.
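Before configuring anything, you might verify that the installed SAPHanaSR package provides the hook script at the path used below (the path is taken from the global.ini example in step 2):
# ls -l /usr/share/SAPHanaSR/susTkOver.py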
To activate the hook script susTkOver.py for SAP HANA and to integrate the script with the SUSE cluster, two configuration changes are necessary on all cluster nodes:
- The auxiliary tool SAPHanaSR-hookHelper needs access permission for the Linux cluster information base (CIB) via the Linux sudoers rules.
- The hook script susTkOver.py needs to be configured in the HANA global.ini and to be loaded.
Step 1: Granting permission to SAPHanaSR-hookHelper
Example SID is HA1, <sid>adm is ha1adm. A simple rule in the file /etc/sudoers.d/SAPHanaSR looks like:
# simple permission needed by SAPHanaSR-hookHelper for susTkOver.py
ha1adm ALL=(ALL) NOPASSWD: /usr/sbin/SAPHanaSR-hookHelper --sid=HA1 --case=*
Please consult the manual pages sudoers(5) and SAPHanaSR-hookHelper(8) for details and more elaborate rules.
You might check the resulting permission by calling:
# sudo -U ha1adm -l
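If the rule is in place, the output should list the hook helper command, along the lines of (output shortened, hostname will differ):
User ha1adm may run the following commands on pizbuin01:
    (ALL) NOPASSWD: /usr/sbin/SAPHanaSR-hookHelper --sid=HA1 --case=*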
Step 2: Activating susTkOver.py
Example of an entry in the SAP HANA scale-up global configuration, i.e. in /hana/shared/<SID>/global/hdb/custom/config/global.ini. This config change is needed at both sites:

[ha_dr_provider_sustkover]
provider = susTkOver
path = /usr/share/SAPHanaSR
sustkover_timeout = 30
execution_order = 2
See manual page susTkOver.py(7) for additional details. Please consult manual page SAPHanaSR-manageProvider(8) and the SAP HANA documentation on how to change the configuration while HANA is up and running.
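As a sketch of such an online change, assuming the section shown above has been saved to a snippet file (the file name /tmp/global.ini_susTkOver is arbitrary; verify the exact options against SAPHanaSR-manageProvider(8)):
# su - ha1adm
~> SAPHanaSR-manageProvider --sid=HA1 --reconfigure --add /tmp/global.ini_susTkOver
~> hdbnsutil -reloadHADRProviders
The last command asks the running HANA nameserver to reload its HA/DR providers, so no database restart is needed.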
You might check in the HANA trace files whether the hook script has been loaded:
# su - ha1adm
~> cdtrace
~> grep HADR.*load.*susTkOver nameserver_*.trc
~> grep susTkOver.init nameserver_*.trc
Additional Information
- Manual pages susTkOver.py(7), SAPHanaSR-hookHelper(8), SAPHanaSR-manageProvider(8), SAPHanaSR_maintenance_examples(7), sudoers(5)
- SAP HANA System Replication Scale-Up Performance-Optimized Scenario Setup Guide https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-PerfOpt-15/#id-implementing-sustkover-hook-for-pretakeover
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021044
- Creation Date: 18-Apr-2023
- Modified Date: 18-Apr-2023
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com