Protect HANA against manually caused dual-primary situation in SUSE HA cluster
This document (000021044) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 12
Situation
Such a dual-primary situation arises when an administrator manually triggers an sr_takeover on the secondary site while the Linux cluster is still managing the original primary. On a running cluster it looks like this:
pizbuin02:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Mon Apr 17 17:57:43 2023

Sites srHook
-------------
JWD   SFAIL
WDF   PRIM

Hosts     clone_state lpa_ha1_lpt node_state op_mode   remoteHost roles                            score site srah srmode sync_state version     vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
pizbuin01 PROMOTED    1681747066  online     logreplay pizbuin02  4:P:master1:master:worker:master 150   WDF  -    sync   PRIM       2.00.070.00 pizbuin01
pizbuin02 UNDEFINED   10          online     logreplay pizbuin01  1:P:master1::worker:             -9000 JWD  -    sync   SFAIL      2.00.070.00 pizbuin02

pizbuin02:~ # crm_mon -1r
Cluster Summary:
  * Stack: corosync
  * Current DC: pizbuin02 (version 2.1.2+20211124.ada5c3b36-150400.4.9.2-2.1.2+20211124.ada5c3b36) - partition with quorum
  * Last updated: Mon Apr 17 17:59:18 2023
  * Last change: Mon Apr 17 17:59:10 2023 by root via crm_attribute on pizbuin01
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ pizbuin01 pizbuin02 ]

Full List of Resources:
  * rsc_stonith_sbd (stonith:external/sbd): Started pizbuin01
  * rsc_ip_HA1 (ocf::heartbeat:IPaddr2): Started pizbuin01
  * Clone Set: msl_SAPHana_HA1_HDB00 [rsc_SAPHana_HA1_HDB00] (promotable):
    * Masters: [ pizbuin01 ]
    * Stopped: [ pizbuin02 ]
  * Clone Set: cln_SAPHanaTopology_HA1_HDB00 [rsc_SAPHanaTopology_HA1_HDB00]:
    * Started: [ pizbuin01 pizbuin02 ]

Failed Resource Actions:
  * rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02 'not running' (7): call=23, status='complete', last-rc-change='Mon Apr 17 17:57:06 2023', queued=0ms, exec=0ms
  * rsc_SAPHana_HA1_HDB00_start_0 on pizbuin02 'not running' (7): call=26, status='complete', last-rc-change='Mon Apr 17 17:57:19 2023', queued=0ms, exec=2266ms
In the system logs it looks like this:
pizbuin02:~ # grep "2023-04-17T17:5.*SAPHana.*rsc_SAPHana_" /var/log/messages | less ... 2023-04-17T17:56:58.933998+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: RA ==== begin action monitor_clone (0.162.1) ==== 2023-04-17T17:56:59.002878+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: hana_ha1_site_srHook_JWD=SFAIL 2023-04-17T17:56:59.008553+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Finally get_SRHOOK()=SFAIL 2023-04-17T17:57:01.856625+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: ACT: systemd service SAPHA1_00.service is active 2023-04-17T17:57:04.971203+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: hana_ha1_site_srHook_JWD=SFAIL 2023-04-17T17:57:05.021509+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Finally get_SRHOOK()=SFAIL 2023-04-17T17:57:05.721836+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: Dual primary detected, other instance is PROMOTED and lpa stalemate ==> local restart 2023-04-17T17:57:06.124219+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: saphana_monitor_primary: scoring_crm_master(4:S:master1:master:worker:master,SFAIL) 2023-04-17T17:57:06.321256+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: roles(4:S:master1:master:worker:master) are matching pattern ([0-9]*:S:[^:]*:master) 2023-04-17T17:57:06.364610+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: sync(SFAIL) is matching syncPattern (SFAIL) 2023-04-17T17:57:06.412470+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: DEC: scoring_crm_master: set score -INFINITY 2023-04-17T17:57:06.579896+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[8940]: INFO: RA ==== end action monitor_clone with rc=7 (0.162.1) (8s)==== ... 2023-04-17T17:57:07.373671+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: RA ==== begin action stop_clone (0.162.1) ==== 2023-04-17T17:57:09.341320+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: systemd service SAPHA1_00.service is active 2023-04-17T17:57:09.363726+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: Stopping SAP Instance HA1-HDB00: #01217.04.2023 17:57:09#012Stop#012OK 2023-04-17T17:57:19.398137+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: ACT: SAP Instance HA1-HDB00 stopped: #01217.04.2023 17:57:19#012WaitforStopped#012OK 2023-04-17T17:57:19.418711+02:00 pizbuin02 SAPHana(rsc_SAPHana_HA1_HDB00)[9923]: INFO: RA ==== end action stop_clone with rc=0 (0.162.1) (13s)==== ... pizbuin02:~ # grep "2023-04-17T17:57:06.*pacemaker-" /var/log/messages | less ... 
2023-04-17T17:57:06.617968+02:00 pizbuin02 pacemaker-controld[32198]: notice: Result of monitor operation for rsc_SAPHana_HA1_HDB00 on pizbuin02: not running 2023-04-17T17:57:06.618070+02:00 pizbuin02 pacemaker-controld[32198]: notice: pizbuin02-rsc_SAPHana_HA1_HDB00_monitor_20000:23 [ 4:S:master1:master:worker:master\n4:S:master1:master:worker:master\n4:S:master1:master:worker:master\n4:S:master1:master:worker:master\nSFAIL\n ] 2023-04-17T17:57:06.621896+02:00 pizbuin02 pacemaker-controld[32198]: notice: Transition 3 action 14 (rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02): expected 'ok' but got 'not running' 2023-04-17T17:57:06.623678+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting fail-count-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: (unset) -> 1 2023-04-17T17:57:06.625162+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting last-failure-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: (unset) -> 1681747026 2023-04-17T17:57:06.625943+02:00 pizbuin02 pacemaker-controld[32198]: notice: Transition 3 action 14 (rsc_SAPHana_HA1_HDB00_monitor_20000 on pizbuin02): expected 'ok' but got 'not running' 2023-04-17T17:57:06.628382+02:00 pizbuin02 pacemaker-attrd[32196]: notice: Setting fail-count-rsc_SAPHana_HA1_HDB00#monitor_20000[pizbuin02]: 1 -> 2 2023-04-17T17:57:06.675872+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.678790+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.678846+02:00 pizbuin02 pacemaker-schedulerd[32197]: notice: Actions: Recover rsc_SAPHana_HA1_HDB00:1 ( Slave pizbuin02 ) 2023-04-17T17:57:06.692252+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.692335+02:00 pizbuin02 pacemaker-schedulerd[32197]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHana_HA1_HDB00:1 on pizbuin02 at Apr 17 17:57:06 2023 2023-04-17T17:57:06.693665+02:00 pizbuin02 pacemaker-schedulerd[32197]: notice: Actions: Recover rsc_SAPHana_HA1_HDB00:1 ( Slave pizbuin02 ) 2023-04-17T17:57:06.694973+02:00 pizbuin02 pacemaker-controld[32198]: notice: Initiating stop operation rsc_SAPHana_HA1_HDB00_stop_0 locally on pizbuin02 2023-04-17T17:57:06.695901+02:00 pizbuin02 pacemaker-controld[32198]: notice: Requesting local execution of stop operation for rsc_SAPHana_HA1_HDB00 on pizbuin02 2023-04-17T17:57:06.697211+02:00 pizbuin02 pacemaker-execd[32195]: notice: executing - rsc:rsc_SAPHana_HA1_HDB00 action:stop call_id:25 2023-04-17T17:57:06.862389+02:00 pizbuin02 SAPHanaTopology(rsc_SAPHanaTopology_HA1_HDB00)[9111]: INFO: DEC: site=JWD, mode=primary, hanaRemoteHost=pizbuin01 - found by remote site (WDF) ...
Resolution
To guard against this, the SAP HANA nameserver provides a Python-based API ("HA/DR providers"), which is called at important points of the host auto-failover and system replication takeover process. The method preTakeover() is called before any sr_takeover action.
The HA/DR provider hook script susTkOver.py permits a manual takeover of the HANA primary only if the SAP HANA multi-state resource (managed by SAPHana or SAPHanaController) is set to maintenance or the Linux cluster is stopped.
Otherwise the manual takeover is blocked. In that case an error message on the Linux console and in HANA Cockpit reminds the admin to use an appropriate cluster maintenance procedure.
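For illustration, the following outline sketches a permitted takeover under cluster maintenance, assuming the resource name msl_SAPHana_HA1_HDB00 from the example above. This is a sketch only; the complete supported procedure, including registering the former primary as new secondary, is described in SAPHanaSR_maintenance_examples(7):
# crm resource maintenance msl_SAPHana_HA1_HDB00 on
# su - ha1adm -c "hdbnsutil -sr_takeover"
# (register the former primary as new secondary, see the man page)
# crm resource refresh msl_SAPHana_HA1_HDB00
# crm resource maintenance msl_SAPHana_HA1_HDB00 off
On older crmsh versions, "crm resource cleanup" may be needed instead of "crm resource refresh".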
This hook script needs to be configured and activated on all HANA nodes.
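Before configuring anything, you might verify that the installed SAPHanaSR package provides the hook script at the path used below (the path is taken from the global.ini example in step 2):
# ls -l /usr/share/SAPHanaSR/susTkOver.py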
To activate the hook script susTkOver.py for SAP HANA and to integrate the script with the SUSE cluster, two configuration changes are necessary on all cluster nodes:
- The auxiliary tool SAPHanaSR-hookHelper needs access permission for the Linux cluster information base (CIB) via the Linux sudoers rules.
- The hook script susTkOver.py needs to be configured in the HANA global.ini and to be loaded.
Step 1: Granting permission to SAPHanaSR-hookHelper
Example SID is HA1, <sid>adm is ha1adm. A simple rule in the file /etc/sudoers.d/SAPHanaSR looks like:
# simple permission needed by SAPHanaSR-hookHelper for susTkOver.py
ha1adm ALL=(ALL) NOPASSWD: /usr/sbin/SAPHanaSR-hookHelper --sid=HA1 --case=*
Please consult the manual pages sudoers(5) and SAPHanaSR-hookHelper(8) for details and more elaborate rules.
You might check the resulting permission by calling:
# sudo -U ha1adm -l
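If the rule is in place, the output should list the hook helper command, along the lines of (output shortened, hostname will differ):
User ha1adm may run the following commands on pizbuin01:
    (ALL) NOPASSWD: /usr/sbin/SAPHanaSR-hookHelper --sid=HA1 --case=*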
Step 2: Activating susTkOver.py
Example of an entry in the SAP HANA scale-up global configuration, i.e. in /hana/shared/<SID>/global/hdb/custom/config/global.ini. This config change is needed at both sites:

[ha_dr_provider_sustkover]
provider = susTkOver
path = /usr/share/SAPHanaSR
sustkover_timeout = 30
execution_order = 2
See manual page susTkOver.py(7) for additional details. Please consult manual page SAPHanaSR-manageProvider(8) and the SAP HANA documentation on how to change the configuration while HANA is up and running.
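As a sketch of such an online change, assuming the section shown above has been saved to a snippet file (the file name /tmp/global.ini_susTkOver is arbitrary; verify the exact options against SAPHanaSR-manageProvider(8)):
# su - ha1adm
~> SAPHanaSR-manageProvider --sid=HA1 --reconfigure --add /tmp/global.ini_susTkOver
~> hdbnsutil -reloadHADRProviders
The last command asks the running HANA nameserver to reload its HA/DR providers, so no database restart is needed.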
You might check in the HANA trace files whether the hook script has been loaded:
# su - ha1adm
~> cdtrace
~> grep HADR.*load.*susTkOver nameserver_*.trc
~> grep susTkOver.init nameserver_*.trc
Additional Information
- Manual pages susTkOver.py(7), SAPHanaSR-hookHelper(8), SAPHanaSR-manageProvider(8), SAPHanaSR_maintenance_examples(7), sudoers(5)
- SAP HANA System Replication Scale-Up Performance-Optimized Scenario Setup Guide https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-PerfOpt-15/#id-implementing-sustkover-hook-for-pretakeover
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021044
- Creation Date: 18-Apr-2023
- Modified Date: 18-Apr-2023
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com