SLES for SAP HANA Maintenance Procedures – Part 2 (Manual Administrative Tasks, OS Reboots, and Updating the OS and HANA)
This is the second part of the blog series on maintenance procedures for SLES for SAP running SAP HANA workloads. The first part covers the pre-maintenance checks. In this part I discuss the actual maintenance procedures, the steps involved according to best practices, and the cleanup procedures.
We’ll cover the following maintenance procedures.
- Manual take-over
- Manual start of primary when only one node is available
- OS Reboots
- SAP HANA update
- Patching of Cluster Software Stack
- Cleanup after manual administrative activities
Manual take-over
This section details the manual take-over of the SAP HANA database. First, the status of the SAP HANA databases, the system replication and the Linux cluster has to be checked. Then the SAP HANA resources are set into maintenance, an sr_takeover is performed, and the old primary is registered as the new secondary. For this registration the correct secondary site name has to be used. Finally, the SAP HANA resources are given back to the Linux cluster.
1. First perform the checks as mentioned in the first part of the blog. If everything looks fine, proceed to put the msl resource into maintenance mode.
llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00
llhana1:~ # crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: llhana2 (version 2.0.5+20201202.ba59be712-150300.4.16.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Tue Apr 19 23:21:54 2022
* Last change: Tue Apr 19 23:21:39 2022 by root via cibadmin on llhana1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Full List of Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Master llhana1 (unmanaged)
* rsc_SAPHana_TST_HDB00 (ocf::suse:SAPHana): Slave llhana2 (unmanaged)
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana1
llhana1:~ #
2. Stop the SAP HANA primary site
llhana1:~ # su - tstadm
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function StopSystem HDB
19.04.2022 23:23:33
StopSystem
OK
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList
19.04.2022 23:23:56
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
llhana1, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList
19.04.2022 23:24:23
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
llhana1, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GRAY
tstadm@llhana1:/usr/sap/TST/HDB00>
Proceed only after making sure that the SAP HANA primary is down, that is, when “dispstatus” shows “GRAY” instead of “GREEN”.
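The GRAY check above can also be scripted. The following is a minimal sketch, assuming the comma-separated output format of GetSystemInstanceList shown in the transcript; the function name all_instances_gray is made up for this illustration and is not part of any SAP or SUSE tool.

```shell
# all_instances_gray: hypothetical helper, not part of sapcontrol.
# Reads the output of "sapcontrol -nr <nr> -function GetSystemInstanceList"
# on stdin. Succeeds only if at least one instance row is present and every
# instance row reports dispstatus GRAY.
all_instances_gray() {
    awk -F', *' '
        # Instance rows have 7 comma-separated fields; skip the header row.
        NF == 7 && $1 != "hostname" {
            rows++
            if ($7 !~ /^GRAY/) not_gray++
        }
        END { exit !(rows > 0 && not_gray == 0) }
    '
}

# Possible usage as tstadm on the primary (polls every 5 seconds):
#   until sapcontrol -nr 00 -function GetSystemInstanceList | all_instances_gray
#   do sleep 5; done
```

The helper only parses text, so it can be tried against saved output before using it on a live system.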
3. Initiate the takeover on the SAP HANA secondary site
llhana2:~ # su - tstadm
tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_takeover
done.
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
there are no secondary sites attached
Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mode: PRIMARY
site id: 2
site name: TWO
RC:10
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
| Host | Host | Host | Failover | Remove | Storage | Storage | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host | Host | Worker | Worker |
| | Active | Status | Status | Status | Config | Actual | Config | Actual | Config | Actual | Config | Actual | Config | Actual | Config | Actual |
| | | | | | Partition | Partition | Group | Group | Role | Role | Role | Role | Roles | Roles | Groups | Groups |
| ------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------ | ------ | ------- | ------- |
| llhana2 | yes | ok | | | 1 | 1 | default | default | master 1 | master | worker | master | worker | worker | default | default |
overall host status: ok
RC:4
tstadm@llhana2:/usr/sap/TST/HDB00>
If everything looks fine, then proceed to the next step.
4. Register the former HANA primary site as the new secondary site with the new primary site, then start it
tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -sr_register --remoteHost=llhana2 --remoteInstance=00 --replicationMode=sync --name=ONE --operationMode=logreplay
adding site ...
nameserver llhana1:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@llhana1:/usr/sap/TST/HDB00>sapcontrol -nr 00 -function StartSystem HDB
19.04.2022 23:38:29
StartSystem
OK
tstadm@llhana1:/usr/sap/TST/HDB00>exit
logout
llhana1:~ #
5. Check the system replication status on the new HANA primary site
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
| Database | Host | Port | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary | Replication | Replication | Replication |
| | | | | | | | Host | Port | Site ID | Site Name | Active Status | Mode | Status | Status Details |
| -------- | ------- | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| SYSTEMDB | llhana2 | 30001 | nameserver | 1 | 2 | TWO | llhana1 | 30001 | 1 | ONE | YES | SYNC | ACTIVE | |
| TST | llhana2 | 30007 | xsengine | 2 | 2 | TWO | llhana1 | 30007 | 1 | ONE | YES | SYNC | ACTIVE | |
| TST | llhana2 | 30003 | indexserver | 3 | 2 | TWO | llhana1 | 30003 | 1 | ONE | YES | SYNC | ACTIVE | |
status system replication site "1": ACTIVE
overall system replication status: ACTIVE
Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mode: PRIMARY
site id: 2
site name: TWO
RC:15
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
| Host | Host | Host | Failover | Remove | Storage | Storage | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host | Host | Worker | Worker |
| | Active | Status | Status | Status | Config | Actual | Config | Actual | Config | Actual | Config | Actual | Config | Actual | Config | Actual |
| | | | | | Partition | Partition | Group | Group | Role | Role | Role | Role | Roles | Roles | Groups | Groups |
| ------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------ | ------ | ------- | ------- |
| llhana2 | yes | ok | | | 1 | 1 | default | default | master 1 | master | worker | master | worker | worker | default | default |
overall host status: ok
RC:4
tstadm@llhana2:/usr/sap/TST/HDB00> exit
logout
llhana2:~ #
If everything looks fine, then perform the next set of steps.
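The “RC:” values printed in the transcripts above are the exit codes of the two Python scripts. As a small sketch, the functions below translate them into labels. The mapping reflects the return codes as they are commonly documented for these scripts (please verify against the SAP documentation for your HANA revision), and the function names are only illustrative.

```shell
# Hypothetical helpers: translate the exit codes echoed as "RC:" above.

# Exit code of systemReplicationStatus.py -> label
sr_status_text() {
    case "$1" in
        10) echo "NoHSR" ;;          # no system replication configured
        11) echo "Error" ;;
        12) echo "Unknown" ;;
        13) echo "Initializing" ;;
        14) echo "Syncing" ;;
        15) echo "Active" ;;
        *)  echo "unexpected" ;;
    esac
}

# Exit code of landscapeHostConfiguration.py -> label
landscape_status_text() {
    case "$1" in
        0) echo "Fatal" ;;
        1) echo "Error" ;;
        2) echo "Warning" ;;
        3) echo "Info" ;;
        4) echo "OK" ;;
        *) echo "unexpected" ;;
    esac
}

# Possible usage as tstadm:
#   HDBSettings.sh systemReplicationStatus.py >/dev/null; sr_status_text $?
```

In the takeover above, RC:10 directly after the takeover (no secondary attached yet) and RC:15 after the registration are therefore the expected values, together with RC:4 for the landscape.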
6. On either of the nodes, check the cluster state, refresh the msl resource, set the maintenance attribute to “off” on the msl resource and check the system replication attributes.
llhana2:~ # cs_clusterstate -i
### llhana2.lab.sk - 2022-04-19 23:42:04 ###
Cluster state: S_IDLE
llhana2:~ # crm resource refresh msl_SAPHana_TST_HDB00
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana2:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Tue Apr 19 23:42:50 2022
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 false
Sites srHook
-------------
ONE SOK
TWO PRIM
Hosts clone_state lpa_tst_lpt maintenance node_state op_mode remoteHost roles score site srmode standby sync_state version vhost
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 DEMOTED 30 off online logreplay llhana2 4:S:master1:master:worker:master 100 ONE sync off SOK 2.00.052.00.1599235305 llhana1
llhana2 PROMOTED 1650404570 online logreplay llhana1 4:P:master1:master:worker:master 150 TWO sync off PRIM 2.00.052.00.1599235305 llhana2
llhana2:~ #
llhana2:~ # crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: llhana2 (version 2.0.5+20201202.ba59be712-150300.4.16.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Tue Apr 19 23:43:50 2022
* Last change: Tue Apr 19 23:42:50 2022 by root via crm_attribute on llhana2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ llhana1 llhana2 ]
Full List of Resources:
* stonith-sbd (stonith:external/sbd): Started llhana1
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* Masters: [ llhana2 ]
* Slaves: [ llhana1 ]
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started llhana2
llhana2:~ # cs_clusterstate -i
### llhana2.lab.sk - 2022-04-19 23:43:54 ###
Cluster state: S_IDLE
llhana2:~ #
Manual start of primary when only one node is available
This might be necessary if the cluster cannot detect the status of both sites. This is an advanced task. I am not sharing the command outputs here; instead, I indicate which commands need to be run for each step. The hostnames and cluster node names are assumed to be “llhana1” and “llhana2”, the SID “TST” and the instance number “00”.
Before doing this, make sure SAP HANA is not primary on the other site!
1. Start the cluster on remaining nodes.
systemctl start pacemaker
2. Wait and check that the cluster is running and in status idle.
watch cs_clusterstate -i
3. Become sidadm, and start HANA manually.
# su - tstadm
~>HDB start
4. Wait and check that HANA is running. If the cluster does not start SAP HANA, refresh the msl resource.
#crm resource refresh msl_SAPHana_TST_HDB00 llhana1
5. In case the cluster does not promote the SAP HANA to primary, instruct the cluster to migrate the IP address to that node.
#crm resource move rsc_ip_TST_HDB00 llhana1
6. Wait and check that HANA has been promoted to primary by the cluster.
7. Remove the migration rule from the IP address.
#crm resource clear rsc_ip_TST_HDB00
8. Check that the cluster is in status idle.
watch cs_clusterstate -i
9. You are done, for now.
10. Please bring back the other node and register that SAP HANA as soon as possible. If the SAP HANA primary stays alone for too long, the log area will fill up.
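Several of the steps above say “wait and check” until the cluster is idle. Instead of watching the output manually, the wait can be sketched as a small polling loop. This is only an illustration: the function name wait_for_idle is made up, and it relies solely on the “Cluster state: S_IDLE” line that cs_clusterstate -i prints in the transcripts of this blog. The cs_wait_for_idle(8) tool listed in the references may already cover this need.

```shell
# wait_for_idle MAX_TRIES SLEEP_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly until its output contains
# "Cluster state: S_IDLE". Returns 0 on success and 1 if the cluster is
# still not idle after MAX_TRIES attempts.
wait_for_idle() {
    wfi_tries=$1
    wfi_sleep=$2
    shift 2
    while [ "$wfi_tries" -gt 0 ]; do
        if "$@" 2>/dev/null | grep -q 'Cluster state: S_IDLE'; then
            return 0
        fi
        wfi_tries=$((wfi_tries - 1))
        if [ "$wfi_tries" -gt 0 ]; then
            sleep "$wfi_sleep"
        fi
    done
    return 1
}

# Possible usage as root (up to 60 checks, 5 seconds apart):
#   wait_for_idle 60 5 cs_clusterstate -i || echo "cluster still not idle"
```

Passing the status command as an argument keeps the loop independent of the tool, so it can be dry-run with a stub before it is used on a cluster node.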
OS Reboots
Before performing the steps below, run the cluster pre-checks explained in the first part of this blog, and make sure the cluster is idle between the intermediate steps:
1. Disabling pacemaker on SAP HANA primary
llhana1:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana1:~ #
2. Disabling pacemaker on SAP HANA secondary
llhana2:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana2:~ #
3. Stopping cluster on SAP HANA secondary
llhana2:~ # crm cluster stop
INFO: Cluster services stopped
llhana2:~ #
– SAP HANA secondary will be stopped, secondary shows OFFLINE in crm_mon
– system replication goes SFAIL
llhana1:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Wed Apr 20 10:31:18 2022
Resource maintenance
----------------------------------
msl_SAPHana_TST_HDB00 false
Sites srHook
-------------
ONE PRIM
TWO SFAIL
Hosts clone_state lpa_tst_lpt maintenance node_state op_mode remoteHost roles score site srmode standby sync_state version vhost
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 PROMOTED 1650443478 off online logreplay llhana2 4:P:master1:master:worker:master 150 ONE sync off PRIM 2.00.052.00.1599235305 llhana1
llhana2 10 offline logreplay llhana1 TWO sync off llhana2
llhana1:~ #
4. Stopping cluster on SAP HANA primary
llhana1:~ # crm cluster stop
INFO: Cluster services stopped
llhana1:~ #
– SAP HANA primary will be stopped
5. Performing the planned OS or hardware maintenance
6. Enabling pacemaker on SAP HANA primary
llhana1:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana1:~ #
7. Enabling pacemaker on SAP HANA secondary
llhana2:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana2:~ #
8. Starting cluster on SAP HANA primary
llhana1:~ # crm cluster start
INFO: Cluster services started
llhana1:~ #
Because the default corosync configuration for a two-node cluster enables wait_for_all, pacemaker does not start any resource when it is started on the primary HANA node while pacemaker on the other node is still stopped. It waits for the other node to become available.
9. Starting cluster on SAP HANA secondary
llhana2:~ # crm cluster start
INFO: Cluster services started
llhana2:~ #
As soon as pacemaker is started on the secondary SAP HANA node, the existing primary node in the cluster sees the secondary node online. In a two-node cluster it is impossible to ascertain that the other node was not running any resource while the existing node was down or offline. Therefore, to ensure data integrity and to be on the safe side, the cluster fences the secondary node. Once the secondary node has rebooted and is online again, it synchronizes with the primary SAP HANA and the system replication status changes to SOK.
Note also that if the secondary node was rebooted during the maintenance in step 5 less than 5 minutes before the fencing was triggered, the cluster reports that the secondary node was rebooted due to fencing, but no actual reboot takes place.
If fencing is not desired in this situation, you can either:
1. Temporarily set the cluster property “startup-fencing” to “false”. Note that this is not a recommended setting and should be used by advanced users only.
Or,
2. Set the SBD configuration parameter SBD_DELAY_START to “no”.
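To confirm that system replication has gone from SFAIL back to SOK after the reboot procedure, the “Sites srHook” block of SAPHanaSR-showAttr can be parsed. The sketch below assumes the output layout shown in the transcripts of this blog (a “Sites” header followed later by a “Hosts” header); other versions may print a different layout, and the function name srhook_state is made up for this example.

```shell
# srhook_state SITE
# Hypothetical helper: reads SAPHanaSR-showAttr output on stdin and prints
# the srHook value (for example PRIM, SOK or SFAIL) of the given site.
# Returns 1 if the site is not found in the "Sites" block.
srhook_state() {
    awk -v site="$1" '
        $1 == "Sites" { in_sites = 1; next }   # start of the Sites block
        $1 == "Hosts" { in_sites = 0 }         # the Hosts block follows
        in_sites && $1 == site { print $2; found = 1; exit }
        END { exit !found }
    '
}

# Possible usage as root:
#   [ "$(SAPHanaSR-showAttr | srhook_state TWO)" = "SOK" ] && echo "in sync"
```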
SAP HANA update
To update the SAP HANA software, we have to ask the cluster to stop managing the multi-state (msl) resource, which in turn disables the management of the SAP HANA resource agents. The cluster will then no longer start, stop or monitor the SAP HANA database. Admins can manually start and stop the SAP HANA database and perform a system replication takeover. Because the virtual IP resource is still running and managed by the cluster, the IP automatically moves to the new primary node in case of a takeover.
1. Pre Update Task
For the multi-state-resource set the maintenance mode:
llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00
llhana1:~ #
2. Update
Follow the SAP HANA update procedures described in the SAP documentation.
3. Post Update Task
If the roles of the nodes were changed during the maintenance activity, a resource refresh helps the cluster learn the current roles.
llhana1:~ # crm resource refresh msl_SAPHana_TST_HDB00
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana1:~ #
4. At the end of the maintenance, enable the cluster control on the msl resource again.
llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana1:~ #
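The pre-update and post-update tasks above can be wrapped in two small functions so that the order is not forgotten. This is a sketch only: the function names and the DRY_RUN switch are assumptions made for this example, and the resource name is the one used throughout this blog. By default the commands are only printed; set DRY_RUN=0 to actually execute them.

```shell
MSL_RESOURCE="msl_SAPHana_TST_HDB00"   # resource name used in this blog

# run CMD [ARGS...]: print the command unless DRY_RUN=0, in which case it
# is executed. Printing is the default to avoid accidental changes.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Step 1: hand the msl resource over to the administrator.
pre_update() {
    run crm resource maintenance "$MSL_RESOURCE"
}

# Steps 3 and 4: let the cluster re-detect the roles, then give control back.
post_update() {
    run crm resource refresh "$MSL_RESOURCE" || return 1
    run crm resource maintenance "$MSL_RESOURCE" off
}

# Possible usage as root:
#   pre_update                 # shows what would be run
#   DRY_RUN=0 pre_update       # really sets maintenance
#   ... perform the SAP HANA update following the SAP documentation ...
#   DRY_RUN=0 post_update
```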
Patching of Cluster Software Stack
Regular patching of cluster nodes is important for security, bug fixes and feature enhancements. Here are some recommendations on how to plan and execute the patching maintenance activity.
- If the root filesystem (/) is “btrfs” and snapshots are enabled for it, it is recommended to take a pre-snapshot and a post-snapshot during the patching activity.
- If the nodes of the cluster are virtual machines, it is recommended to take a snapshot of the VM before the start of the patching. If the VM has to be stopped before the snapshot, follow the steps in section “OS Reboots” to shut down the VM and take the snapshot.
- If the nodes of the cluster are physical machines, or a filesystem or VM snapshot is not possible, a backup of the OS partition should be taken with a backup tool where available.
- The patching procedures should be first tested on a test machine before attempting it on production machine. It is highly recommended that the test environment is as similar to the production environment as possible. It has been observed a number of times that when the test environment is not similar to the production environment, the patching of the production shows a very different behaviour than the test.
- Finally, to patch the cluster, follow exactly the same steps as for “OS Reboots” and perform the patching in step 5.
In case you need to update the cluster stack from a lower version/service-pack to higher version/service-pack, please follow the SLE – HAE documentation section titled “Upgrading your cluster to the latest product version” at https://documentation.suse.com/sle-ha/15-SP3/html/SLE-HA-all/cha-ha-migration.html
Cleanup after manual administrative activities
Once the maintenance activity is complete, it is recommended to run the pre-maintenance checks again to ensure that the status of the cluster is as expected. In particular, I want to emphasize the following checks after the maintenance:
1. Checking status of SUSE HA cluster and SAP HANA system replication
llhana1:~ # cs_clusterstate -i
### llhana1.lab.sk - 2022-04-12 18:44:12 ###
Cluster state: S_IDLE
llhana1:~ #
2. Check for any migration constraint
llhana1:~ # crm configure show | grep cli-
llhana1:~ #
Note that location constraints starting with “cli-prefer” or “cli-ban” are created when resources are moved or migrated. When moving resources manually, an expiry can be assigned to the migration constraint. The CLI syntax is “crm resource move <resource name> <node name> <expiry-for-constraint>”. The expiry is given as an ISO 8601 duration; for example, “PT5M” means the constraint expires 5 minutes after its creation.
If you find a migration constraint, you can remove it using the command below (replace <resource-name> with the name of the resource whose migration constraint needs to be cleared).
llhana1:~ # crm resource clear <resource-name>
INFO: Removed migration constraints for <resource-name>
llhana1:~ #
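When several resources were moved during the maintenance, it helps to list which resource each leftover constraint belongs to. The following sketch filters the output of “crm configure show”. It assumes that such constraints appear as lines of the form “location cli-prefer-<resource> <resource> ...”, and the function name list_cli_constraints is made up for this example.

```shell
# list_cli_constraints
# Hypothetical helper: reads "crm configure show" output on stdin and prints
# "<constraint-id> <resource>" for every location constraint whose ID starts
# with "cli-". Prints nothing if no such constraint exists.
list_cli_constraints() {
    awk '$1 == "location" && $2 ~ /^cli-/ { print $2, $3 }'
}

# Possible usage as root:
#   crm configure show | list_cli_constraints
# Each resource printed in the second column can then be cleared with
# "crm resource clear <resource-name>" as shown above.
```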
3. Check for any “maintenance” meta attribute on resource.
llhana1:~ # crm configure show | grep -B2 maintenance
params pcmk_delay_max=15s
ms msl_SAPHana_TST_HDB00 rsc_SAPHana_TST_HDB00 \
meta clone-max=2 clone-node-max=1 interleave=true maintenance=false
--
stonith-timeout=150s \
last-lrm-refresh=1651438829 \
maintenance-mode=false
llhana1:~ #
In the above example the “grep” command returns two results. For now we focus on the first one, which is for the resource, and ignore the second one, which is for the cluster. If you find a “maintenance” attribute like in the above example, use the command below to remove it from the CIB. This tidies up the cluster configuration and is good practice after maintenance.
llhana1:~ # crm resource meta msl_SAPHana_TST_HDB00 delete maintenance
Deleted 'msl_SAPHana_TST_HDB00' option: id=msl_SAPHana_TST_HDB00-meta_attributes-maintenance name=maintenance
llhana1:~ #
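Because the plain “grep” also matches the cluster-wide “maintenance-mode” property, a slightly stricter filter can help. The sketch below prints only resources that carry a “maintenance” meta attribute. It assumes the usual “crm configure show” layout, where each object starts in the first column and its continuation lines are indented; the function name list_maintenance_meta is made up for this example.

```shell
# list_maintenance_meta
# Hypothetical helper: reads "crm configure show" output on stdin and prints
# "<resource-id> maintenance=<value>" for every object that carries a
# "maintenance" attribute. The cluster property "maintenance-mode" is not
# matched, because only fields starting with "maintenance=" are considered.
list_maintenance_meta() {
    awk '
        # A non-empty line that does not start with a blank begins a new
        # configuration object; remember its ID (second field).
        $0 != "" && substr($0, 1, 1) != " " && substr($0, 1, 1) != "\t" {
            id = $2
        }
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^maintenance=/)
                    print id, $i
        }
    '
}

# Possible usage as root:
#   crm configure show | list_maintenance_meta
```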
Please also read our other blogs about #TowardsZeroDowntime.
Where can I find further information?
- SUSECON 2020 BP-1351 Tipps, Tricks and Troubleshooting
- Manual pages
- SAPHanaSR-ScaleOut(7)
- ocf_suse_SAPHanaController(7)
- ocf_suse_SAPHanaTopology(7)
- SAPHanaSR.py(7)
- SAPHanaSrMultiTarget.py(7)
- SAPHanaSR-ScaleOut_basic_cluster(7)
- SAPHanaSR-showAttr(8)
- SAPHanaSR_maintenance_examples(7)
- sbd(8)
- cs_man2pdf(8)
- cs_show_hana_info(8)
- cs_wait_for_idle(8)
- cs_clusterstate(8)
- cs_show_sbd_devices(8)
- cs_make_sbd_devices(8)
- supportconfig_plugins(5)
- crm(8)
- crmadmin(8)
- crm_mon(8)
- ha_related_suse_tids(7)
- ha_related_sap_notes(7)
- SUSE support TIDs
- Troubleshooting the SAPHanaSR python hook (000019865)
- Indepth HANA Cluster Debug Data Collection (PACEMAKER, SAP) (7022702)
- HANA SystemReplication doesn’t provide SiteName … (000019754)
- SAPHanaController running in timeout when starting SAP Hana (000019899)
- SAP HANA monitors timed out after 5 seconds (000020626)
- Related blog articles: https://www.suse.com/c/tag/towardszerodowntime/
- Blog Part 1 on SAP HANA Maintenance Procedure: https://www.suse.com/c/sles-for-sap-hana-maintenance-procedures-part-1-pre-maintenance-checks/
- Product documentation: https://documentation.suse.com/
- Pacemaker Upstream documentation on cluster property options: https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
Comments
Thanks for the detailed information.
I have a question about OS activities: do we need to stop pacemaker and its services, or is simply putting the cluster in maintenance mode sufficient?
crm configure property maintenance-mode="true"
For OS activities like patching and updating, I have written a dedicated blog which can be accessed using this link: https://www.suse.com/c/sles-for-sap-os-patching-procedure-for-scale-up-perf-opt-hana-cluster/
To briefly answer your question:
1) Stopping pacemaker separately is not required. To stop the cluster, always use the “crm cluster stop” command, which stops all required services including pacemaker.
2) For OS maintenance we highly recommend putting the multi-state resource into maintenance. This is further explained in Fabian’s blog: https://www.suse.com/c/sap-hana-maintenance-suse-clusters/