SLES for SAP OS patching procedure for Scale-Up Perf-Opt HANA cluster
This blog post covers a specific maintenance scenario: the steps of OS patching on a scale-up performance-optimized HANA cluster.
This is a supplementary blog to more generic blogs on SLES for SAP maintenance available at Maintenance blog – Part 1 and Maintenance blog – Part-2.
The generic prerequisites for the maintenance are already described in section 5 of blog part 2.
Here we are going to cover 3 OS patching scenarios:
- When HANA can be shut down on both nodes.
- When HANA needs to be running on one of the nodes and both nodes are patched one after the other.
- When HANA needs to be running on one of the nodes and both nodes are patched one after the other but with a time gap.
IMPORTANT:
- While the cluster is running, make sure before every step of any maintenance procedure that the cluster has stabilized: run the command “cs_clusterstate -i” and look for “S_IDLE” in the output.
- If SBD is used, make sure that the SBD configuration parameter SBD_DELAY_START is set to “no”. This helps to avoid startup fencing.
- Before starting the maintenance procedure, perform the checks described in blog part 1.
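The S_IDLE check described above can be wrapped in a small polling helper. The following is a minimal sketch for a POSIX shell, not a SUSE-provided tool: the function name wait_for_idle, the variable CLUSTERSTATE_CMD and the retry defaults are assumptions made for illustration. See also the man page cs_wait_for_idle(8) listed at the end of this article.

```shell
#!/bin/sh
# Hypothetical helper: poll the cluster state until it reports S_IDLE.
# CLUSTERSTATE_CMD defaults to the command from this article and can be
# overridden, for example for a dry run.
CLUSTERSTATE_CMD="${CLUSTERSTATE_CMD:-cs_clusterstate -i}"

wait_for_idle() {
    max_tries="${1:-30}"   # illustrative default: 30 attempts
    delay="${2:-10}"       # illustrative default: 10 seconds between attempts
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        # Intentionally unquoted so that the command and its options are split.
        if $CLUSTERSTATE_CMD 2>/dev/null | grep -q 'S_IDLE'; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    echo "cluster did not reach S_IDLE after $max_tries attempts" >&2
    return 1
}
```

A possible use is to gate each step on it, for example `wait_for_idle && crm resource maintenance msl_SAPHana_TST_HDB00`.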
There are other patching scenarios documented in blogs and manual pages. For example, you can patch the nodes one by one in combination with an SAP HANA takeover. For details, see the blog article https://www.suse.com/c/sap-hana-maintenance-suse-clusters/ and the manual page SAPHanaSR_maintenance_examples(7).
Scenario 1: When HANA can be shut down on both nodes.
This is the ideal scenario for OS patching: the workloads on both nodes are down, so the admin only has to worry about patching the OS. OS patching is generally done during maintenance windows in which many other hardware and software maintenance tasks are performed. The fewer variables an admin has to track, the easier it is to find out where a problem lies. If too many things are running on the system during maintenance, it can be difficult to pinpoint the cause of a problem. Therefore, having HANA down on both nodes is the ideal scenario for patching.
This scenario is already discussed in the section “OS reboot” of blog part 2. Here I repeat only the steps, without showing the command outputs.
- Disabling pacemaker on SAP HANA primary
- Disabling pacemaker on SAP HANA secondary
- Stopping cluster on SAP HANA secondary
- Stopping cluster on SAP HANA primary
- Patching the OS
- Enabling pacemaker on SAP HANA primary
- Enabling pacemaker on SAP HANA secondary
- Starting cluster on SAP HANA primary
- Starting cluster on SAP HANA secondary
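The order of these steps matters, so it can help to print them as a checklist for the two hosts before the maintenance window. The sketch below only prints text and executes nothing. The function name scenario1_plan is hypothetical, and the patch step is left as a placeholder because this article does not prescribe a specific patch command.

```shell
#!/bin/sh
# Hypothetical helper: print the scenario 1 steps in order as "host: command".
# Nothing is executed; the output is meant as a checklist only.
scenario1_plan() {
    primary="$1"
    secondary="$2"
    cat <<EOF
$primary: systemctl disable pacemaker
$secondary: systemctl disable pacemaker
$secondary: crm cluster stop
$primary: crm cluster stop
$primary, $secondary: (patch the OS)
$primary: systemctl enable pacemaker
$secondary: systemctl enable pacemaker
$primary: crm cluster start
$secondary: crm cluster start
EOF
}
```

Calling `scenario1_plan llhana1 llhana2` prints nine lines, one per step, with the host names of this article's example cluster filled in.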
Scenario 2: When HANA needs to be running on one of the nodes and both nodes are patched one after the other.
This is a more practical scenario, where one of the nodes in the cluster is always serving SAP HANA to the applications.
Note: If the primary HANA runs without connection to the registered secondary for a while, the local replication logs might fill up the filesystem. If unsure, use scenario 3.
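While the secondary is disconnected, it can be useful to keep an eye on the filesystem that holds the logs (this article later names /hana/log as the one at risk). The following is a minimal sketch built on `df -P`. The function names and the threshold of 80 percent are assumptions chosen for illustration, not a SUSE or SAP recommendation.

```shell
#!/bin/sh
# Hypothetical helper: print the usage of a filesystem as a plain number.
fs_use_percent() {
    # df -P prints one header line and one data line in POSIX format;
    # column 5 is the capacity, for example "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the usage of a path reaches a limit (illustrative default: 80).
check_fs_usage() {
    path="$1"
    limit="${2:-80}"
    used="$(fs_use_percent "$path")"
    case "$used" in
        ''|*[!0-9]*)
            echo "cannot read usage of $path" >&2
            return 2
            ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

On the primary this could be run as `check_fs_usage /hana/log` at intervals during the maintenance.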
- Step 1. Put the multi state resource and the virtual IP resource into maintenance.
llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00
llhana2:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 21:26:56 2022
* Last change: Mon Sep 19 21:26:54 2022 by root via cibadmin on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Master llhana1 (unmanaged)
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana2 (unmanaged)
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
llhana2:~ # crm resource maintenance rsc_ip_TST_HDB00
llhana2:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 21:28:05 2022
* Last change: Mon Sep 19 21:28:03 2022 by root via cibadmin on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1 (unmanaged)
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Master llhana1 (unmanaged)
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana2 (unmanaged)
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
DISCUSSIONS: Putting the multi state resource into maintenance first is the best practice for starting maintenance on a HANA cluster. We no longer need to put the whole cluster into maintenance mode. Putting the virtual IP resource into maintenance is also important: we do not want the cluster to migrate this resource, and we want it to keep running on its current node. During the maintenance period we want to manage both of these resources manually.
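Before moving on, it is worth confirming that both resources really show up as unmanaged. The sketch below reads cluster status text from standard input and checks one resource. It assumes the status text comes from a one-shot `crm_mon -1` call and has the layout shown in the outputs above; the function name is_unmanaged is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named resource is flagged "(unmanaged)"
# in cluster status text read from standard input.
is_unmanaged() {
    # First keep only lines naming the resource, then look for the flag.
    grep -F -- "$1" | grep -q -F '(unmanaged)'
}
```

For example, `crm_mon -1 | is_unmanaged rsc_ip_TST_HDB00 && echo "virtual IP is unmanaged"`. The match is a plain substring match, so the resource name passed in should be unambiguous.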
- Step 2. Stop the cluster on both nodes
llhana2:~ # crm cluster stop
INFO: Cluster services stopped
llhana2:~ #
llhana1:~ # crm cluster stop
INFO: Cluster services stopped
llhana1:~ #
DISCUSSIONS: We stop the cluster before patching because the OS patching procedure also updates the cluster stack. We stop the cluster on the primary as well to avoid any self-fencing of the node. With the cluster stopped, an admin can be sure that whatever goes wrong is the result of an admin action and not of a cluster action, which narrows down where to look for the cause of the problem.
- Step 3. Stop SAP HANA on the secondary node
llhana2:~ # su - tstadm
tstadm@llhana2:/usr/sap/TST/HDB00> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function Stop 400
19.09.2022 21:31:15
Stop
OK
Waiting for stopped instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function WaitforStopped 600 2
19.09.2022 21:31:31
WaitforStopped
OK
hdbdaemon is stopped.
tstadm@llhana2:/usr/sap/TST/HDB00>
llhana1:~ # cat /hana/shared/TST/HDB00/.crm_attribute.TWO
hana_tst_site_srHook_TWO = SFAIL
llhana1:~ #
DISCUSSIONS: Most OS patching procedures result in an OS reboot. As per SAP best practices, it is recommended to stop the HANA database manually when the cluster is not running. Otherwise the HANA database will be stopped during the reboot, and if anything is wrong with the database, it is harder to troubleshoot during a reboot than after a manual stop.
- Step 4. Patch and upgrade the OS
- Step 5. If a reboot is required, disable pacemaker on the SAP HANA secondary
llhana2:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana2:~ #
DISCUSSIONS: Disabling pacemaker ensures that there is no unwanted cluster start or fencing after the reboot. We start pacemaker again only when we are sure about it.
- Step 6. Reboot the patched secondary node
- Step 7. Enable pacemaker on the rebooted secondary node.
llhana2:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana2:~ #
- Step 8. Start SAP HANA on the rebooted secondary node.
tstadm@llhana2:/usr/sap/TST/HDB00> HDB start
StartService
Impromptu CCC initialization by 'rscpCInit'.
See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2
19.09.2022 21:38:20
Start
OK
19.09.2022 21:38:48
StartWait
OK
tstadm@llhana2:/usr/sap/TST/HDB00>
llhana1:~ # cat /hana/shared/TST/HDB00/.crm_attribute.TWO
hana_tst_site_srHook_TWO = SOK
llhana1:~ #
DISCUSSIONS: When the HANA database on the secondary is stopped manually, or the secondary is rebooted, while the cluster is down, the srHook script creates a cache file named .crm_attribute.<SITENAME> in /hana/shared/<SID>/HDB<nr>. This file is only created and updated while the cluster is down, and it records the change in system replication status caused by stopping the HANA database on the secondary. If we do not start the HANA database at this stage, before the cluster start, the srHook attribute will later have a wrong value.
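The cache file check shown above can be scripted so that the cluster is only started once the file reports SOK. This is a minimal sketch that assumes the one-line "name = value" layout seen in the `cat` output above; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the srHook value stored in a cache file such as
# /hana/shared/TST/HDB00/.crm_attribute.TWO ("hana_tst_site_srHook_TWO = SOK").
srhook_cache_value() {
    # Split on "=" and strip blanks from the value part.
    awk -F= '/srHook/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Succeed only if the cached value is SOK.
srhook_cache_is_sok() {
    [ "$(srhook_cache_value "$1")" = "SOK" ]
}
```

On the primary, `srhook_cache_is_sok /hana/shared/TST/HDB00/.crm_attribute.TWO || echo "replication not in sync yet"` would flag the SFAIL state seen right after stopping the secondary.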
- Step 9. Start the cluster on the SAP HANA primary
llhana1:~ # crm cluster start
INFO: Cluster services started
llhana1:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition WITHOUT quorum
* Last updated: Mon Sep 19 21:55:29 2022
* Last change: Mon Sep 19 21:48:09 2022 by root via cibadmin on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Node llhana2: UNCLEAN (offline)
* Online: [ llhana1 ]
Active Resources:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1 (unmanaged)
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana1 (unmanaged)
DISCUSSIONS: At this stage it is important to start the cluster on the primary HANA node first. Due to patching, the software versions of the cluster stack on the two HANA nodes may differ, and the cluster may hit the situation described in the following TID: https://www.suse.com/de-de/support/kb/doc/?id=000019119
- Step 10. Start the cluster on the SAP HANA secondary
llhana2:~ # crm cluster start
INFO: Cluster services started
llhana2:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 21:56:57 2022
* Last change: Mon Sep 19 21:48:09 2022 by root via cibadmin on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1 (unmanaged)
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana1 (unmanaged)
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana2 (unmanaged)
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
llhana2:~ # SAPHanaSR-showAttr
Global cib-time maintenance
--------------------------------------------
global Mon Sep 19 21:48:09 2022 false
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 true
rsc_ip_TST_HDB00 true
Sites srHook
-------------
ONE PRIM
TWO SOK
Hosts clone_state lpa_tst_lpt node_state op_mode remoteHost roles score site srah srmode version vhost
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 1663616828 online logreplay llhana2 4:P:master1:master:worker:master -1 ONE - sync 2.00.052.00.1599235305 llhana1
llhana2 DEMOTED 30 online logreplay llhana1 4:S:master1:master:worker:master -1 TWO - sync 2.00.052.00.1599235305 llhana2
llhana2:~ #
- Step 11. Refresh the virtual IP resource and the cloned SAPHanaTopology resource
llhana2:~ # crm resource refresh rsc_ip_TST_HDB00
Cleaned up rsc_ip_TST_HDB00 on llhana2
Cleaned up rsc_ip_TST_HDB00 on llhana1
Waiting for 2 replies from the controller.. OK
llhana2:~ #
llhana2:~ # crm resource refresh cln_SAPHanaTopology_TST_HDB00
Cleaned up rsc_SAPHanaTopology_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHanaTopology_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ #
llhana2:~ # SAPHanaSR-showAttr
Global cib-time maintenance
--------------------------------------------
global Mon Sep 19 22:01:38 2022 false
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 true
rsc_ip_TST_HDB00 true
Sites srHook
-------------
ONE PRIM
TWO SOK
Hosts clone_state lpa_tst_lpt node_state op_mode remoteHost roles score site srah srmode version vhost
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 1663616828 online logreplay llhana2 4:P:master1:master:worker:master -1 ONE - sync 2.00.052.00.1599235305 llhana1
llhana2 DEMOTED 30 online logreplay llhana1 4:S:master1:master:worker:master -1 TWO - sync 2.00.052.00.1599235305 llhana2
llhana2:~ #
- Step 12. Refresh the multi state SAPHana resource
llhana2:~ # crm resource refresh msl_SAPHana_TST_HDB00
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ #
llhana2:~ # SAPHanaSR-showAttr
Global cib-time maintenance
--------------------------------------------
global Mon Sep 19 22:02:52 2022 false
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 true
rsc_ip_TST_HDB00 true
Sites srHook
-------------
ONE PRIM
TWO SOK
Hosts clone_state lpa_tst_lpt node_state op_mode remoteHost roles score site srah srmode version vhost
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 1663616828 online logreplay llhana2 4:P:master1:master:worker:master 150 ONE - sync 2.00.052.00.1599235305 llhana1
llhana2 DEMOTED 30 online logreplay llhana1 4:S:master1:master:worker:master 100 TWO - sync 2.00.052.00.1599235305 llhana2
llhana2:~ #
DISCUSSIONS: Refreshing the multi state resource updates many node attributes. In the output above we can see that the score attribute has been updated.
- Step 13. Set maintenance off on the virtual IP and the multi state resource.
llhana2:~ # crm resource maintenance rsc_ip_TST_HDB00 off
llhana2:~ #
llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana2:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 22:06:27 2022
* Last change: Mon Sep 19 22:06:26 2022 by root via crm_attribute on llhana1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* Masters: [ llhana1 ]
* Slaves: [ llhana2 ]
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
llhana2:~ # SAPHanaSR-showAttr
Global cib-time maintenance
--------------------------------------------
global Mon Sep 19 22:06:26 2022 false
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 false
rsc_ip_TST_HDB00 false
Sites srHook
-------------
ONE PRIM
TWO SOK
Hosts clone_state lpa_tst_lpt node_state op_mode remoteHost roles score site srah srmode sync_state version vhost
------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 PROMOTED 1663617986 online logreplay llhana2 4:P:master1:master:worker:master 150 ONE - sync PRIM 2.00.052.00.1599235305 llhana1
llhana2 DEMOTED 30 online logreplay llhana1 4:S:master1:master:worker:master 100 TWO - sync SOK 2.00.052.00.1599235305 llhana2
llhana2:~ #
- Step 14. Switch HANA roles between the two nodes:
- Forcefully migrate the multi state resource away from its current node.
llhana1:~ # crm resource move msl_SAPHana_TST_HDB00 force
INFO: Move constraint created for msl_SAPHana_TST_HDB00
INFO: Use `crm resource clear msl_SAPHana_TST_HDB00` to remove this constraint
llhana1:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 12:44:46 2022
* Last change: Mon Sep 19 12:44:37 2022 by root via crm_attribute on llhana1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana2
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Promoting llhana2
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 12:45:20 2022
* Last change: Mon Sep 19 12:45:20 2022 by root via crm_attribute on llhana1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana2
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* Masters: [ llhana2 ]
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
- Check the role attributes of both nodes
llhana1:~ # SAPHanaSR-showAttr --format=script |SAPHanaSR-filter --search='roles'
Mon Sep 19 12:45:20 2022; Hosts/llhana1/roles=1:P:master1::worker:
Mon Sep 19 12:45:20 2022; Hosts/llhana2/roles=4:P:master1:master:worker:master
llhana1:~ #
- Register the old SAP HANA primary as the new SAP HANA secondary
llhana1:~ # su - tstadm
tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -sr_register --remoteHost=llhana2 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay --name=ONE
adding site ...
nameserver llhana1:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@llhana1:/usr/sap/TST/HDB00>
- Clear the migration constraint created by the forced migration of the multi state resource
llhana1:~ # crm resource clear msl_SAPHana_TST_HDB00
INFO: Removed migration constraints for msl_SAPHana_TST_HDB00
llhana1:~ #
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 12:48:09 2022
* Last change: Mon Sep 19 12:47:56 2022 by root via crm_attribute on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana2
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Starting llhana1
* Masters: [ llhana2 ]
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
Cluster Summary:
* Stack: corosync
* Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Mon Sep 19 12:48:09 2022
* Last change: Mon Sep 19 12:47:56 2022 by root via crm_attribute on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana2
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* Masters: [ llhana2 ]
* Slaves: [ llhana1 ]
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
- Check the roles attribute once again.
llhana1:~ # SAPHanaSR-showAttr --format=script |SAPHanaSR-filter --search='roles'
Mon Sep 19 12:49:56 2022; Hosts/llhana1/roles=4:S:master1:master:worker:master
Mon Sep 19 12:49:56 2022; Hosts/llhana2/roles=4:P:master1:master:worker:master
llhana1:~ #
- Step 15. Repeat steps 1 to 13 on the new secondary node.
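The roles checks performed while switching the HANA roles can be scripted so that the next command only runs once a node reports the expected role. The sketch below extracts the second field of the roles attribute (P for primary, S for secondary) from the script-format output shown above. The function name hana_role_of is hypothetical, and the parsing assumes exactly the line layout printed in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the replication role letter (P or S) of a host,
# reading "SAPHanaSR-showAttr --format=script" lines from standard input.
hana_role_of() {
    host="$1"
    # Example input line:
    # Mon Sep 19 12:49:56 2022; Hosts/llhana1/roles=4:S:master1:master:worker:master
    sed -n "s|.*Hosts/${host}/roles=[^:]*:\([^:]*\):.*|\1|p"
}
```

For example, `SAPHanaSR-showAttr --format=script | hana_role_of llhana1` can be compared against `S` before continuing with the patching of the new secondary.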
Scenario 3: When HANA needs to be running on one of the nodes and both nodes are patched one after the other but with a time gap.
Sometimes a time gap between patching the OS of the two nodes in a cluster is necessary. Ideally both nodes should be patched together. If a time gap is required, first understand that it should not be more than a few days. There is no specific recommendation on how long this gap may be, but as a general rule it should be as short as possible and not longer than a week. In such a scenario, if only one node is serving HANA and the database is not synced to the secondary for a long time, the /hana/log filesystem may fill up. To avoid this, it is recommended to change step 3 and step 8 of scenario 2 as shown below. All other steps remain the same as in scenario 2.
Step 3. Un-register SAP HANA secondary and Stop SAP HANA on the secondary node
tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_unregister
unregistering site ...
done.
tstadm@llhana2:/usr/sap/TST/HDB00>
tstadm@llhana2:/usr/sap/TST/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
tstadm 23348 23338 0.0 12452 3224 -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm 23368 23348 0.0 12452 1580 \_ -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm 23369 23368 0.0 12452 2192 \_ [sh]
tstadm 23370 23368 0.0 12452 364 \_ -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm 18316 18315 0.0 17492 6056 -sh
tstadm 23321 18316 10.0 15580 3960 \_ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm 23367 23321 0.0 39248 3992 \_ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm 31112 1 0.0 716656 51344 hdbrsutil --start --port 30003 --volume 3 --volumesuffix mnt00001/hdb00003.00003 --identifier 1663618494
tstadm 30319 1 0.0 716324 50960 hdbrsutil --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1663618487
tstadm 29515 1 0.0 23632 3116 sapstart pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2
tstadm 29522 29515 0.0 465248 74636 \_ /usr/sap/TST/HDB00/llhana2/trace/hdb.sapTST_HDB00 -d -nw -f /usr/sap/TST/HDB00/llhana2/daemon.ini pf=/usr/sap/TS
tstadm 29540 29522 17.2 5618440 3063240 \_ hdbnameserver
tstadm 29832 29522 0.7 452096 126408 \_ hdbcompileserver
tstadm 29835 29522 0.8 719928 154848 \_ hdbpreprocessor
tstadm 29877 29522 18.6 5846104 3339092 \_ hdbindexserver -port 30003
tstadm 29880 29522 2.6 3779524 1230612 \_ hdbxsengine -port 30007
tstadm 30536 29522 1.3 2415948 416800 \_ hdbwebdispatcher
tstadm 2637 1 0.0 502872 30596 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2 -D -u tstadm
tstadm 2558 1 0.0 89060 11588 /usr/lib/systemd/systemd --user
tstadm 2559 2558 0.0 136924 3728 \_ (sd-pam)
tstadm@llhana2:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetProcessList
19.09.2022 22:24:06
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GREEN, Running, 2022 09 19 22:14:38, 0:09:28, 29522
hdbcompileserver, HDB Compileserver, GREEN, Running, 2022 09 19 22:14:42, 0:09:24, 29832
hdbindexserver, HDB Indexserver-TST, GREEN, Running, 2022 09 19 22:14:43, 0:09:23, 29877
hdbnameserver, HDB Nameserver, GREEN, Running, 2022 09 19 22:14:38, 0:09:28, 29540
hdbpreprocessor, HDB Preprocessor, GREEN, Running, 2022 09 19 22:14:42, 0:09:24, 29835
hdbwebdispatcher, HDB Web Dispatcher, GREEN, Running, 2022 09 19 22:14:49, 0:09:17, 30536
hdbxsengine, HDB XSEngine-TST, GREEN, Running, 2022 09 19 22:14:43, 0:09:23, 29880
tstadm@llhana2:/usr/sap/TST/HDB00>
tstadm@llhana2:/usr/sap/TST/HDB00> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function Stop 400
19.09.2022 22:24:42
Stop
OK
Waiting for stopped instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function WaitforStopped 600 2
19.09.2022 22:25:10
WaitforStopped
OK
hdbdaemon is stopped.
tstadm@llhana2:/usr/sap/TST/HDB00>
Step 8. Re-register the SAP HANA secondary and Start SAP HANA on the rebooted secondary node.
llhana2:~ # su - tstadm
tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_register --name=TWO --remoteHost=llhana1 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay
adding site ...
nameserver llhana2:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@llhana2:/usr/sap/TST/HDB00> HDB start
StartService
Impromptu CCC initialization by 'rscpCInit'.
See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2
19.09.2022 22:31:18
Start
OK
19.09.2022 22:31:46
StartWait
OK
tstadm@llhana2:/usr/sap/TST/HDB00>
tstadm@llhana2:/usr/sap/TST/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
tstadm 4372 4371 0.0 17492 6036 -sh
tstadm 6518 4372 0.0 15580 3996 \_ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm 6549 6518 0.0 39248 3980 \_ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm 5734 1 0.0 716656 51284 hdbrsutil --start --port 30003 --volume 3 --volumesuffix mnt00001/hdb00003.00003 --identifier 1663619495
tstadm 5350 1 0.0 716328 51008 hdbrsutil --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1663619489
tstadm 4821 1 0.0 23632 3008 sapstart pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2
tstadm 4828 4821 0.7 465292 74828 \_ /usr/sap/TST/HDB00/llhana2/trace/hdb.sapTST_HDB00 -d -nw -f /usr/sap/TST/HDB00/llhana2/daemon.ini pf=/usr/sap/TS
tstadm 4846 4828 20.8 9792200 1601584 \_ hdbnameserver
tstadm 5130 4828 1.1 447236 122056 \_ hdbcompileserver
tstadm 5133 4828 1.4 718900 153020 \_ hdbpreprocessor
tstadm 5176 4828 20.3 9797780 1671000 \_ hdbindexserver -port 30003
tstadm 5179 4828 12.6 5081520 1070824 \_ hdbxsengine -port 30007
tstadm 5453 4828 6.6 2413644 414576 \_ hdbwebdispatcher
tstadm 2633 1 0.4 502876 30744 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2 -D -u tstadm
tstadm 2558 1 0.0 89056 11928 /usr/lib/systemd/systemd --user
tstadm 2559 2558 0.0 136908 3692 \_ (sd-pam)
tstadm@llhana2:/usr/sap/TST/HDB00>
tstadm@llhana2:/usr/sap/TST/HDB00>
tstadm@llhana2:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetProcessList
19.09.2022 22:32:24
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GREEN, Running, 2022 09 19 22:31:21, 0:01:03, 4828
hdbcompileserver, HDB Compileserver, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5130
hdbindexserver, HDB Indexserver-TST, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5176
hdbnameserver, HDB Nameserver, GREEN, Running, 2022 09 19 22:31:21, 0:01:03, 4846
hdbpreprocessor, HDB Preprocessor, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5133
hdbwebdispatcher, HDB Web Dispatcher, GREEN, Running, 2022 09 19 22:31:31, 0:00:53, 5453
hdbxsengine, HDB XSEngine-TST, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5179
tstadm@llhana2:/usr/sap/TST/HDB00>
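The re-registration in step 8 takes several options that are easy to mistype. As a minimal sketch, the command line can be assembled by a small function and reviewed before it is run as the <sid>adm user. The function name sr_register_cmd is hypothetical; the replication mode and operation mode are fixed to the values used in this article and would need adjusting for other setups.

```shell
#!/bin/sh
# Hypothetical helper: print the hdbnsutil registration command used in step 8.
# Replication mode and operation mode are fixed to the values from this article.
sr_register_cmd() {
    site="$1"          # local site name, for example TWO
    remote_host="$2"   # host of the current primary, for example llhana1
    instance="$3"      # instance number of the primary, for example 00
    printf 'hdbnsutil -sr_register --name=%s --remoteHost=%s --remoteInstance=%s --replicationMode=sync --operationMode=logreplay\n' \
        "$site" "$remote_host" "$instance"
}
```

Calling `sr_register_cmd TWO llhana1 00` prints the same command line that is shown in the step 8 transcript above.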
Please also read our other blogs about #TowardsZeroDowntime.
Where can I find further information?
- SUSECON 2020 BP-1351 Tipps, Tricks and Troubleshooting
- Manual pages
- SAPHanaSR-ScaleOut(7)
- ocf_suse_SAPHanaController(7)
- ocf_suse_SAPHanaTopology(7)
- SAPHanaSR.py(7)
- SAPHanaSrMultiTarget.py(7)
- SAPHanaSR-ScaleOut_basic_cluster(7)
- SAPHanaSR-showAttr(8)
- SAPHanaSR_maintenance_examples(7)
- sbd(8)
- cs_man2pdf(8)
- cs_show_hana_info(8)
- cs_wait_for_idle(8)
- cs_clusterstate(8)
- cs_show_sbd_devices(8)
- cs_make_sbd_devices(8)
- supportconfig_plugins(5)
- crm(8)
- crmadmin(8)
- crm_mon(8)
- ha_related_suse_tids(7)
- ha_related_sap_notes(7)
- SUSE support TIDs
- Troubleshooting the SAPHanaSR python hook (000019865)
- Indepth HANA Cluster Debug Data Collection (PACEMAKER, SAP) (7022702)
- HANA SystemReplication doesn’t provide SiteName … (000019754)
- SAPHanaController running in timeout when starting SAP Hana (000019899)
- SAP HANA monitors timed out after 5 seconds (000020626)
- Related blog articles: https://www.suse.com/c/tag/towardszerodowntime/
- Blog Part 1 on SAP HANA Maintenance Procedure: https://www.suse.com/c/sles-for-sap-hana-maintenance-procedures-part-1-pre-maintenance-checks/
- Product documentation: https://documentation.suse.com/
- Pacemaker Upstream documentation on cluster property options: https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
May 03rd, 2023
Comments
We have a Suse Pacemaker cluster for Hana DB in AWS and are documenting our OS patching procedures. We’re encountering an issue that needs clarification.
Following the blog’s steps:
1) Pacemaker service starts/stops with the cluster, so we do not manage it manually.
2) After setting Vip and MSl hana resources to maintenance and stopping the cluster on both nodes >> HDB stop on standby, the node gets fenced and the AWS instance shuts down. We needed to put SAPHanaTopology in maintenance mode, which wasn’t documented. Once done, all subsequent steps were fine, but SAPHanaTopology must be brought out of maintenance mode afterward. Is this correct?
3) We have SAP ASCS and ERS in another 2-node cluster and couldn’t find guidance on minimizing downtime during OS patching. Is there a specific blog for this?
Thank you for your comment. Here is my response.
1) I guess the command you are referring to is the one that disables pacemaker, not the one that stops or starts it. Stopping the cluster service with “crm cluster stop” does not disable pacemaker, it only stops it. That is why I mention this command separately.
2) It would be interesting to know why the node gets fenced. In my opinion there should be no fencing, so if you can share more information, it will help me to improve this blog article.
To answer your other question, whether it is OK to set maintenance on SAPHanaTopology and whether it should be brought out of maintenance afterwards: to my understanding, setting maintenance on SAPHanaTopology should not be necessary. I have come across some working maintenance/patching procedures where, after setting maintenance on the “msl” resource, there is an intermediate step of setting maintenance on the whole cluster. The maintenance on the whole cluster is unset before unsetting maintenance on the “msl” resource and refreshing both the “SAPHanaTopology” and “msl” resources. Refreshing the SAPHanaTopology resource before unsetting maintenance on “msl” is very important, as only then does the cluster know the current state of HANA and set the correct attributes. And to know the current state of HANA, the SAPHanaTopology resource agent must be running.
3) You can use our man page SAPStartSrv_maintenance_procedures(7) for ASCS/ERS. You can also refer to our best practice guide: https://documentation.suse.com/sbp/sap-15/html/SAP-S4HA10-setupguide-simplemount-sle15/index.html#id-maintenance-procedure-for-a-linux-cluster-or-operating-system-with-ascs-and-ers-instances-remain-running