SLES for SAP OS patching procedure for Scale-Up Perf-Opt HANA cluster

This blog post covers a specific maintenance scenario: the steps for OS patching on a scale-up performance-optimized HANA cluster.

This is a supplement to the more generic blogs on SLES for SAP maintenance, available at Maintenance blog – Part 1 and Maintenance blog – Part 2.

The generic prerequisites for the maintenance are already described in section 5 of blog part 2.

Here we are going to cover three OS patching scenarios:

  1. When HANA can be shut down on both nodes.
  2. When HANA needs to be running on one of the nodes and both nodes are patched one after the other.
  3. When HANA needs to be running on one of the nodes and both nodes are patched one after the other but with a time gap.

IMPORTANT:

  1. Please note that while the cluster is running, in every maintenance procedure and before every step, we first have to ensure that the cluster has stabilized by running the command “cs_clusterstate -i” and looking for the output “S_IDLE” (see the sketch after this list).
  2. If SBD is used, please make sure that the SBD configuration parameter SBD_DELAY_START is set to “no”. This helps to avoid startup fencing.
  3. Before starting the maintenance procedure, perform the checks mentioned in blog part 1.
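
A minimal sketch of these two pre-checks, assuming the cs_clusterstate tool (shipped with the ClusterTools2 package on SLES for SAP) is installed and SBD is configured in /etc/sysconfig/sbd; adjust to your setup:

llhana1:~ # cs_clusterstate -i                        # proceed only when the output shows S_IDLE
llhana1:~ # grep SBD_DELAY_START /etc/sysconfig/sbd   # expected value: no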

There are other patching scenarios documented in blogs and manual pages. For example, you can patch the nodes one by one in combination with an SAP HANA takeover. For details, please see the blog article https://www.suse.com/c/sap-hana-maintenance-suse-clusters/ and the manual page SAPHanaSR_maintenance_examples(7).

Scenario 1: When HANA can be shut down on both nodes.

This is the ideal scenario for OS patching, as the workloads on both nodes are down and the admin only has to worry about patching the OS. OS patching is generally done during maintenance windows in which many other hardware and software maintenance tasks are performed; the fewer variables an admin has to focus on, the easier it is to figure out where a problem lies. If too many things are running on the system during maintenance, it can be difficult to pinpoint the cause of a problem. Therefore, having HANA down on both nodes is the ideal scenario for patching.

This scenario is already discussed in the OS reboot section of blog part 2. Here I am listing just the steps without illustrating the command outputs; a short command sketch follows the list.

  1. Disabling pacemaker on SAP HANA primary
  2. Disabling pacemaker on SAP HANA secondary
  3. Stopping cluster on SAP HANA secondary
  4. Stopping cluster on SAP HANA primary
  5. Patching the OS
  6. Enabling pacemaker on SAP HANA primary
  7. Enabling pacemaker on SAP HANA secondary
  8. Starting cluster on SAP HANA primary
  9. Starting cluster on SAP HANA secondary
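
For reference, a minimal command sketch of these steps, using the node names llhana1 (primary) and llhana2 (secondary) from this blog and zypper as an example patch tool; adjust the patching commands to your environment and patch management process:

llhana1:~ # systemctl disable pacemaker
llhana2:~ # systemctl disable pacemaker
llhana2:~ # crm cluster stop
llhana1:~ # crm cluster stop
llhana1:~ # zypper patch      # patch both nodes, reboot them if required
llhana2:~ # zypper patch
llhana1:~ # systemctl enable pacemaker
llhana2:~ # systemctl enable pacemaker
llhana1:~ # crm cluster start
llhana2:~ # crm cluster start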

Scenario 2: When HANA needs to be running on one of the nodes and both nodes are patched one after the other.

This is a more practical scenario, where one of the nodes in the cluster is always serving SAP HANA to the applications.

Note: If the primary HANA runs without connection to the registered secondary for a while, the local replication logs might fill up the filesystem. If unsure, use scenario 3.

    1. Put the multi state resource and the virtual IP resource into maintenance.

llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 
llhana2:~ # 


Cluster Summary:
  * Stack: corosync
  * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Mon Sep 19 21:26:56 2022
  * Last change:  Mon Sep 19 21:26:54 2022 by root via cibadmin on llhana2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ llhana1 llhana2 ]

Active Resources:
  * stonith-sbd (stonith:external/sbd):  Started llhana1
  * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1
  * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Master llhana1 (unmanaged)
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana2 (unmanaged)
  * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
    * Started: [ llhana1 llhana2 ]

llhana2:~ # crm resource maintenance rsc_ip_TST_HDB00 
llhana2:~ #

Cluster Summary:
  * Stack: corosync
  * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Mon Sep 19 21:28:05 2022
  * Last change:  Mon Sep 19 21:28:03 2022 by root via cibadmin on llhana2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ llhana1 llhana2 ]

Active Resources:
  * stonith-sbd (stonith:external/sbd):  Started llhana1
  * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1 (unmanaged)
  * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Master llhana1 (unmanaged)
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana2 (unmanaged)
  * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
    * Started: [ llhana1 llhana2 ]

DISCUSSIONS: Putting the multi state resource into maintenance first is the best-practice way to start maintenance on a HANA cluster; we no longer need to put the whole cluster into maintenance mode. Putting the virtual IP resource into maintenance is also important, as we want the cluster to avoid migrating this resource so that it keeps running on its current node. During the maintenance period we want to manage both of these resources manually.
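
One way to double-check that both resources now carry the maintenance meta attribute is to look at their configuration; a minimal sketch using the resource names from this blog:

llhana2:~ # crm configure show msl_SAPHana_TST_HDB00 | grep maintenance
llhana2:~ # crm configure show rsc_ip_TST_HDB00 | grep maintenance

The SAPHanaSR-showAttr output shown later in this blog reports the same maintenance flags in its Resource section.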

    2. Stop cluster on both the nodes

llhana2:~ # crm cluster stop
INFO: Cluster services stopped
llhana2:~ #

llhana1:~ # crm cluster stop
INFO: Cluster services stopped
llhana1:~ #

DISCUSSIONS: We stop the cluster before patching because the OS patching procedure will also update the cluster stack. We stop the cluster on the primary as well to avoid any self-node fencing. With the cluster stopped, if something goes wrong the admin can be sure it was the result of an admin action and not a cluster action, which narrows down where to look for the cause of the problem.
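
A quick way to verify that the cluster services are really down on both nodes before patching; a minimal sketch:

llhana1:~ # systemctl is-active pacemaker corosync
llhana2:~ # systemctl is-active pacemaker corosync

Both commands should report the services as inactive.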

    3. Stop SAP HANA on the secondary node

llhana2:~ # su - tstadm 
tstadm@llhana2:/usr/sap/TST/HDB00> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function Stop 400

19.09.2022 21:31:15
Stop
OK
Waiting for stopped instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function WaitforStopped 600 2


19.09.2022 21:31:31
WaitforStopped
OK
hdbdaemon is stopped.
tstadm@llhana2:/usr/sap/TST/HDB00>


llhana1:~ # cat /hana/shared/TST/HDB00/.crm_attribute.TWO 
hana_tst_site_srHook_TWO = SFAIL
llhana1:~ #

DISCUSSIONS: Most OS patching procedures result in an OS reboot, and as per SAP best practices, when the cluster is not running it is recommended to stop the HANA database manually. Otherwise the HANA database will be stopped during the reboot, and if anything is wrong with the database it is more difficult to troubleshoot during a reboot than after a manual stop.

    4. Patch and upgrade the OS (see the sketch after this list).
    5. If a reboot is required, disable pacemaker on the SAP HANA secondary.
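
The patching itself in step 4 depends on your update channels and tooling; a minimal sketch using zypper (adjust to your patch management process; zypper needs-rebooting is available on recent zypper versions):

llhana2:~ # zypper patch                # also pulls in updates for the cluster stack packages
llhana2:~ # zypper ps -s                # list processes still using deleted files
llhana2:~ # zypper needs-rebooting      # tells whether a reboot is required

After patching, continue with step 5 as shown below.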

llhana2:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana2:~ #

DISCUSSIONS: Disabling pacemaker ensures that there is no unwanted cluster start or fencing after the reboot. We start pacemaker again only when we are sure about it.

    6. Reboot the patched secondary node
    7. Enable pacemaker on the rebooted secondary node.

llhana2:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana2:~ #
    8. Start SAP HANA on the rebooted secondary node.

tstadm@llhana2:/usr/sap/TST/HDB00> HDB start


StartService
Impromptu CCC initialization by 'rscpCInit'.
  See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2


19.09.2022 21:38:20
Start
OK

19.09.2022 21:38:48
StartWait
OK
tstadm@llhana2:/usr/sap/TST/HDB00>


llhana1:~ # cat /hana/shared/TST/HDB00/.crm_attribute.TWO 
hana_tst_site_srHook_TWO = SOK
llhana1:~ #

DISCUSSIONS: During the manual stop of the HANA database, or during the reboot of the secondary HANA while the cluster is down, the srHook script creates a cache file named .crm_attribute.<SITENAME> at the location /hana/shared/<SID>/HDB<nr>. It is only created and updated while the cluster is down, to record the change in system replication caused by stopping the HANA database on the secondary. If we do not start the HANA database at this stage, before starting the cluster, the srHook attribute will later end up with a wrong value.

    9. Start cluster on SAP HANA primary

llhana1:~ # crm cluster start
INFO: Cluster services started
llhana1:~ #



Cluster Summary:
  * Stack: corosync
  * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition WITHOUT quorum
  * Last updated: Mon Sep 19 21:55:29 2022
  * Last change:  Mon Sep 19 21:48:09 2022 by root via cibadmin on llhana2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Node llhana2: UNCLEAN (offline)
  * Online: [ llhana1 ]

Active Resources:
  * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1 (unmanaged)
  * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana1 (unmanaged)

DISCUSSIONS: At this stage it is important that we start the cluster on the HANA primary first: after patching, the software versions of the cluster stack on the two HANA nodes may differ, which may lead to the situation described in the following TID: https://www.suse.com/de-de/support/kb/doc/?id=000019119. A quick version comparison is sketched below.
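
Before starting the cluster on the primary, it can help to compare the cluster stack package versions on both nodes; a minimal sketch (package names as commonly found on SLES for SAP, adjust to your installation):

llhana1:~ # rpm -q corosync pacemaker crmsh resource-agents SAPHanaSR
llhana2:~ # rpm -q corosync pacemaker crmsh resource-agents SAPHanaSR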

    10. Start cluster on SAP HANA secondary

llhana2:~ # crm cluster start
INFO: Cluster services started
llhana2:~ #




Cluster Summary:
  * Stack: corosync
  * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Mon Sep 19 21:56:57 2022
  * Last change:  Mon Sep 19 21:48:09 2022 by root via cibadmin on llhana2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ llhana1 llhana2 ]

Active Resources:
  * stonith-sbd (stonith:external/sbd):  Started llhana1
  * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1 (unmanaged)
  * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana1 (unmanaged)
    * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana2 (unmanaged)
  * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
    * Started: [ llhana1 llhana2 ]
    
    
    

    
llhana2:~ # SAPHanaSR-showAttr 
Global cib-time                 maintenance 
--------------------------------------------
global Mon Sep 19 21:48:09 2022 false       

Resource              maintenance 
----------------------------------
msl_SAPHana_TST_HDB00 true        
rsc_ip_TST_HDB00      true        

Sites srHook 
-------------
ONE   PRIM   
TWO   SOK    

Hosts   clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site srah srmode version                vhost   
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1             1663616828  online     logreplay llhana2    4:P:master1:master:worker:master -1    ONE  -    sync   2.00.052.00.1599235305 llhana1 
llhana2 DEMOTED     30          online     logreplay llhana1    4:S:master1:master:worker:master -1    TWO  -    sync   2.00.052.00.1599235305 llhana2 

llhana2:~ #
    11. Refresh the virtual IP resource and the cloned SAPHanaTopology resource

llhana2:~ # crm resource refresh rsc_ip_TST_HDB00 
Cleaned up rsc_ip_TST_HDB00 on llhana2
Cleaned up rsc_ip_TST_HDB00 on llhana1
Waiting for 2 replies from the controller.. OK
llhana2:~ #



llhana2:~ # crm resource refresh cln_SAPHanaTopology_TST_HDB00 
Cleaned up rsc_SAPHanaTopology_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHanaTopology_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ # 





llhana2:~ # SAPHanaSR-showAttr 
Global cib-time                 maintenance 
--------------------------------------------
global Mon Sep 19 22:01:38 2022 false       

Resource              maintenance 
----------------------------------
msl_SAPHana_TST_HDB00 true        
rsc_ip_TST_HDB00      true        

Sites srHook 
-------------
ONE   PRIM   
TWO   SOK    

Hosts   clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site srah srmode version                vhost   
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1             1663616828  online     logreplay llhana2    4:P:master1:master:worker:master -1    ONE  -    sync   2.00.052.00.1599235305 llhana1 
llhana2 DEMOTED     30          online     logreplay llhana1    4:S:master1:master:worker:master -1    TWO  -    sync   2.00.052.00.1599235305 llhana2 

llhana2:~ # 
    12. Refresh the multi state SAPHana resource

llhana2:~ # crm resource refresh msl_SAPHana_TST_HDB00 
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ #



llhana2:~ # SAPHanaSR-showAttr 
Global cib-time                 maintenance 
--------------------------------------------
global Mon Sep 19 22:02:52 2022 false       

Resource              maintenance 
----------------------------------
msl_SAPHana_TST_HDB00 true        
rsc_ip_TST_HDB00      true        

Sites srHook 
-------------
ONE   PRIM   
TWO   SOK    

Hosts   clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site srah srmode version                vhost   
-------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1             1663616828  online     logreplay llhana2    4:P:master1:master:worker:master 150   ONE  -    sync   2.00.052.00.1599235305 llhana1 
llhana2 DEMOTED     30          online     logreplay llhana1    4:S:master1:master:worker:master 100   TWO  -    sync   2.00.052.00.1599235305 llhana2 

llhana2:~ #

DISCUSSIONS: Refreshing the multi state resource updates many node attributes; as we can see in the output above, the score attribute has been updated.

    13. Set maintenance off on the virtual IP and the multi state resource.

llhana2:~ # crm resource maintenance rsc_ip_TST_HDB00 off
llhana2:~ #

llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana2:~ #



Cluster Summary:
  * Stack: corosync
  * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Mon Sep 19 22:06:27 2022
  * Last change:  Mon Sep 19 22:06:26 2022 by root via crm_attribute on llhana1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ llhana1 llhana2 ]

Active Resources:
  * stonith-sbd (stonith:external/sbd):  Started llhana1
  * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1
  * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
    * Masters: [ llhana1 ]
    * Slaves: [ llhana2 ]
  * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
    * Started: [ llhana1 llhana2 ]
    
    
    
    
llhana2:~ # SAPHanaSR-showAttr 
Global cib-time                 maintenance 
--------------------------------------------
global Mon Sep 19 22:06:26 2022 false       

Resource              maintenance 
----------------------------------
msl_SAPHana_TST_HDB00 false       
rsc_ip_TST_HDB00      false       

Sites srHook 
-------------
ONE   PRIM   
TWO   SOK    

Hosts   clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site srah srmode sync_state version                vhost   
------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 PROMOTED    1663617986  online     logreplay llhana2    4:P:master1:master:worker:master 150   ONE  -    sync   PRIM       2.00.052.00.1599235305 llhana1 
llhana2 DEMOTED     30          online     logreplay llhana1    4:S:master1:master:worker:master 100   TWO  -    sync   SOK        2.00.052.00.1599235305 llhana2 

llhana2:~ #
  14. Switch HANA roles between the two nodes:
    a. Forcefully migrate the multi state resource away from its current node.
    llhana1:~ # crm resource move msl_SAPHana_TST_HDB00 force
    INFO: Move constraint created for msl_SAPHana_TST_HDB00
    INFO: Use `crm resource clear msl_SAPHana_TST_HDB00` to remove this constraint
    llhana1:~ #
    
    
    
    
    Cluster Summary:
      * Stack: corosync
      * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
      * Last updated: Mon Sep 19 12:44:46 2022
      * Last change:  Mon Sep 19 12:44:37 2022 by root via crm_attribute on llhana1
      * 2 nodes configured
      * 6 resource instances configured
    
    Node List:
      * Online: [ llhana1 llhana2 ]
    
    Active Resources:
      * stonith-sbd (stonith:external/sbd):  Started llhana1
      * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana2
      * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
        * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Promoting llhana2
      * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
        * Started: [ llhana1 llhana2 ]
        
        
    
        
        
    Cluster Summary:
      * Stack: corosync
      * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
      * Last updated: Mon Sep 19 12:45:20 2022
      * Last change:  Mon Sep 19 12:45:20 2022 by root via crm_attribute on llhana1
      * 2 nodes configured
      * 6 resource instances configured
    
    Node List:
      * Online: [ llhana1 llhana2 ]
    
    Active Resources:
      * stonith-sbd (stonith:external/sbd):  Started llhana1
      * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana2
      * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
        * Masters: [ llhana2 ]
      * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
        * Started: [ llhana1 llhana2 ]
    
    b. Check the role attributes of both nodes
    llhana1:~ # SAPHanaSR-showAttr --format=script |SAPHanaSR-filter --search='roles'
    Mon Sep 19 12:45:20 2022; Hosts/llhana1/roles=1:P:master1::worker:
    Mon Sep 19 12:45:20 2022; Hosts/llhana2/roles=4:P:master1:master:worker:master
    llhana1:~ #
    
    
    c. Register the old SAP HANA primary as the new SAP HANA secondary
    llhana1:~ # su - tstadm 
    tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -sr_register --remoteHost=llhana2 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay --name=ONE
    adding site ...
    nameserver llhana1:30001 not responding.
    collecting information ...
    updating local ini files ...
    done.
    tstadm@llhana1:/usr/sap/TST/HDB00>
    
    d. Clear the migration constraint created by the forced migration of the multi state resource
    llhana1:~ # crm resource clear msl_SAPHana_TST_HDB00 
    INFO: Removed migration constraints for msl_SAPHana_TST_HDB00
    llhana1:~ #
    
    
    
    
    
    Cluster Summary:
      * Stack: corosync
      * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
      * Last updated: Mon Sep 19 12:48:09 2022
      * Last change:  Mon Sep 19 12:47:56 2022 by root via crm_attribute on llhana2
      * 2 nodes configured
      * 6 resource instances configured
    
    Node List:
      * Online: [ llhana1 llhana2 ]
    
    Active Resources:
      * stonith-sbd (stonith:external/sbd):  Started llhana1
      * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana2
      * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
        * rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Starting llhana1
        * Masters: [ llhana2 ]
      * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
        * Started: [ llhana1 llhana2 ]
        
        
        
        
        
    Cluster Summary:
      * Stack: corosync
      * Current DC: llhana1 (version 2.0.5+20201202.ba59be712-150300.4.21.1-2.0.5+20201202.ba59be712) - partition with quorum
      * Last updated: Mon Sep 19 12:48:09 2022
      * Last change:  Mon Sep 19 12:47:56 2022 by root via crm_attribute on llhana2
      * 2 nodes configured
      * 6 resource instances configured
    
    Node List:
      * Online: [ llhana1 llhana2 ]
    
    Active Resources:
      * stonith-sbd (stonith:external/sbd):  Started llhana1
      * rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana2
      * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
        * Masters: [ llhana2 ]
        * Slaves: [ llhana1 ]
      * Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
        * Started: [ llhana1 llhana2 ]
    
    e. Check the roles attribute once again.
    llhana1:~ # SAPHanaSR-showAttr --format=script |SAPHanaSR-filter --search='roles'
    Mon Sep 19 12:49:56 2022; Hosts/llhana1/roles=4:S:master1:master:worker:master
    Mon Sep 19 12:49:56 2022; Hosts/llhana2/roles=4:P:master1:master:worker:master
    llhana1:~ #
    
  15. Repeat steps 1 to 13 on the new secondary node.

 

Scenario 3: When HANA needs to be running on one of the nodes and both nodes are patched one after the other, but with a time gap.

Sometimes it becomes necessary to have a time gap between patching the OS on the two nodes of a cluster. Ideally both nodes should be patched together, but if a time gap between patching the two nodes is really required, then first of all note that the gap should not be more than a few days. There is no specific recommendation on how long this gap may be, but the generic recommendation is to keep it as short as possible and not longer than a week. In such a scenario, if only one node is serving HANA and the database is not synced to a secondary for a long time, there is a possibility that the /hana/log filesystem will fill up (a simple monitoring sketch follows this paragraph). To avoid this situation it is recommended that, in scenario 2, we change step 3 and step 8 as shown below. All other steps remain the same as in scenario 2.
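
While the two nodes run at different patch levels and the secondary is unregistered, keep an eye on the filesystem usage and the replication state on the primary; a minimal sketch, assuming the usual /hana/log mount point and the SID/instance used in this blog:

llhana1:~ # df -h /hana/log
llhana1:~ # su - tstadm -c "hdbnsutil -sr_state"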

Step 3. Un-register the SAP HANA secondary and stop SAP HANA on the secondary node


tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_unregister
unregistering site ...
done.
tstadm@llhana2:/usr/sap/TST/HDB00>



tstadm@llhana2:/usr/sap/TST/HDB00> HDB info
USER          PID     PPID  %CPU        VSZ        RSS COMMAND
tstadm      23348    23338   0.0      12452       3224 -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm      23368    23348   0.0      12452       1580  \_ -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm      23369    23368   0.0      12452       2192      \_ [sh]
tstadm      23370    23368   0.0      12452        364      \_ -sh -c python /usr/sap/TST/HDB00/exe/python_support/systemReplicationStatus.py --sapcontrol=1
tstadm      18316    18315   0.0      17492       6056 -sh
tstadm      23321    18316  10.0      15580       3960  \_ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm      23367    23321   0.0      39248       3992      \_ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm      31112        1   0.0     716656      51344 hdbrsutil  --start --port 30003 --volume 3 --volumesuffix mnt00001/hdb00003.00003 --identifier 1663618494
tstadm      30319        1   0.0     716324      50960 hdbrsutil  --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1663618487
tstadm      29515        1   0.0      23632       3116 sapstart pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2
tstadm      29522    29515   0.0     465248      74636  \_ /usr/sap/TST/HDB00/llhana2/trace/hdb.sapTST_HDB00 -d -nw -f /usr/sap/TST/HDB00/llhana2/daemon.ini pf=/usr/sap/TS
tstadm      29540    29522  17.2    5618440    3063240      \_ hdbnameserver
tstadm      29832    29522   0.7     452096     126408      \_ hdbcompileserver
tstadm      29835    29522   0.8     719928     154848      \_ hdbpreprocessor
tstadm      29877    29522  18.6    5846104    3339092      \_ hdbindexserver -port 30003
tstadm      29880    29522   2.6    3779524    1230612      \_ hdbxsengine -port 30007
tstadm      30536    29522   1.3    2415948     416800      \_ hdbwebdispatcher
tstadm       2637        1   0.0     502872      30596 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2 -D -u tstadm
tstadm       2558        1   0.0      89060      11588 /usr/lib/systemd/systemd --user
tstadm       2559     2558   0.0     136924       3728  \_ (sd-pam)
tstadm@llhana2:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetProcessList

19.09.2022 22:24:06
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GREEN, Running, 2022 09 19 22:14:38, 0:09:28, 29522
hdbcompileserver, HDB Compileserver, GREEN, Running, 2022 09 19 22:14:42, 0:09:24, 29832
hdbindexserver, HDB Indexserver-TST, GREEN, Running, 2022 09 19 22:14:43, 0:09:23, 29877
hdbnameserver, HDB Nameserver, GREEN, Running, 2022 09 19 22:14:38, 0:09:28, 29540
hdbpreprocessor, HDB Preprocessor, GREEN, Running, 2022 09 19 22:14:42, 0:09:24, 29835
hdbwebdispatcher, HDB Web Dispatcher, GREEN, Running, 2022 09 19 22:14:49, 0:09:17, 30536
hdbxsengine, HDB XSEngine-TST, GREEN, Running, 2022 09 19 22:14:43, 0:09:23, 29880
tstadm@llhana2:/usr/sap/TST/HDB00>



tstadm@llhana2:/usr/sap/TST/HDB00> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function Stop 400

19.09.2022 22:24:42
Stop
OK
Waiting for stopped instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function WaitforStopped 600 2


19.09.2022 22:25:10
WaitforStopped
OK
hdbdaemon is stopped.
tstadm@llhana2:/usr/sap/TST/HDB00>

Step 8. Re-register the SAP HANA secondary and start SAP HANA on the rebooted secondary node.



llhana2:~ # su - tstadm 
tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_register --name=TWO --remoteHost=llhana1 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay
adding site ...
nameserver llhana2:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@llhana2:/usr/sap/TST/HDB00> HDB start


StartService
Impromptu CCC initialization by 'rscpCInit'.
  See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2


19.09.2022 22:31:18
Start
OK

19.09.2022 22:31:46
StartWait
OK
tstadm@llhana2:/usr/sap/TST/HDB00>



tstadm@llhana2:/usr/sap/TST/HDB00> HDB info
USER          PID     PPID  %CPU        VSZ        RSS COMMAND
tstadm       4372     4371   0.0      17492       6036 -sh
tstadm       6518     4372   0.0      15580       3996  \_ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm       6549     6518   0.0      39248       3980      \_ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm       5734        1   0.0     716656      51284 hdbrsutil  --start --port 30003 --volume 3 --volumesuffix mnt00001/hdb00003.00003 --identifier 1663619495
tstadm       5350        1   0.0     716328      51008 hdbrsutil  --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1663619489
tstadm       4821        1   0.0      23632       3008 sapstart pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2
tstadm       4828     4821   0.7     465292      74828  \_ /usr/sap/TST/HDB00/llhana2/trace/hdb.sapTST_HDB00 -d -nw -f /usr/sap/TST/HDB00/llhana2/daemon.ini pf=/usr/sap/TS
tstadm       4846     4828  20.8    9792200    1601584      \_ hdbnameserver
tstadm       5130     4828   1.1     447236     122056      \_ hdbcompileserver
tstadm       5133     4828   1.4     718900     153020      \_ hdbpreprocessor
tstadm       5176     4828  20.3    9797780    1671000      \_ hdbindexserver -port 30003
tstadm       5179     4828  12.6    5081520    1070824      \_ hdbxsengine -port 30007
tstadm       5453     4828   6.6    2413644     414576      \_ hdbwebdispatcher
tstadm       2633        1   0.4     502876      30744 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_llhana2 -D -u tstadm
tstadm       2558        1   0.0      89056      11928 /usr/lib/systemd/systemd --user
tstadm       2559     2558   0.0     136908       3692  \_ (sd-pam)
tstadm@llhana2:/usr/sap/TST/HDB00> 
tstadm@llhana2:/usr/sap/TST/HDB00> 
tstadm@llhana2:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetProcessList

19.09.2022 22:32:24
GetProcessList
OK
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
hdbdaemon, HDB Daemon, GREEN, Running, 2022 09 19 22:31:21, 0:01:03, 4828
hdbcompileserver, HDB Compileserver, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5130
hdbindexserver, HDB Indexserver-TST, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5176
hdbnameserver, HDB Nameserver, GREEN, Running, 2022 09 19 22:31:21, 0:01:03, 4846
hdbpreprocessor, HDB Preprocessor, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5133
hdbwebdispatcher, HDB Web Dispatcher, GREEN, Running, 2022 09 19 22:31:31, 0:00:53, 5453
hdbxsengine, HDB XSEngine-TST, GREEN, Running, 2022 09 19 22:31:26, 0:00:58, 5179
tstadm@llhana2:/usr/sap/TST/HDB00>

 

 

Please also read our other blogs about #TowardsZeroDowntime.

 

Where can I find further information?


Comments

  • Sachin says:

    We have a Suse Pacemaker cluster for Hana DB in AWS and are documenting our OS patching procedures. We’re encountering an issue that needs clarification.

    Following the blog’s steps:
    1) Pacemaker service starts/stops with the cluster, so we do not manage it manually.

    2) After setting Vip and MSl hana resources to maintenance and stopping the cluster on both nodes >> HDB stop on standby, the node gets fenced and the AWS instance shuts down. We needed to put SAPHanaTopology in maintenance mode, which wasn’t documented. Once done, all subsequent steps were fine, but SAPHanaTopology must be brought out of maintenance mode afterward. Is this correct?

    3) We have SAP ASCS and ERS in another 2-node cluster and couldn’t find guidance on minimizing downtime during OS patching. Is there a specific blog for this?

  • Sanjeet Kumar Jha says:

    Thank you for your comment. Here is my response.

    1) I guess the command you are referring to is the one to disable pacemaker, not the one to stop/start it. Stopping the cluster service with “crm cluster stop” will not disable pacemaker, it will only stop it, and that is why I have mentioned this command separately.

    2) It will be interesting to know why the node gets fenced. In my opinion there should not be any fencing, so it would be interesting to know what is happening in your case. If you can share more information, it will help me to improve this blog article.
    To answer your other question about whether it is OK to set maintenance on SAPHanaTopology and whether it should be brought out of maintenance afterwards: to my understanding, setting maintenance on SAPHanaTopology should not be necessary. I have come across some maintenance/patching procedures that work where, after setting maintenance on the “msl” resource, there is an intermediate step of setting maintenance on the whole cluster; the maintenance on the whole cluster is then unset before unsetting maintenance on the “msl” resource and refreshing both the “SAPHanaTopology” and “msl” resources. Refreshing the SAPHanaTopology resource before unsetting maintenance on “msl” is very important, because only then does the cluster know the current state of HANA and set the correct attributes. And to know the current state of HANA, the SAPHanaTopology resource agent must be running.

    3) You can use our man page SAPStartSrv_maintenance_procedures(7) for ASCS/ERS. You can also refer to our best practice guide: https://documentation.suse.com/sbp/sap-15/html/SAP-S4HA10-setupguide-simplemount-sle15/index.html#id-maintenance-procedure-for-a-linux-cluster-or-operating-system-with-ascs-and-ers-instances-remain-running


Sanjeet Kumar Jha: I am an SAP Solution Architect for High Availability at SUSE. I have over a decade of experience with SUSE high availability technologies for SAP applications.