SLES for SAP HANA Maintenance Procedures – Part 1 (Pre-Maintenance Checks)


This two-part blog is targeted at OS and HA administrators who have to support HANA workloads on SLES for SAP. This first part details the pre-maintenance checks.


As an admin who manages the SLE HA cluster infrastructure for SAP HANA workloads, you may often have come across various methods to perform the same task and wondered which one to use as a best practice. The answer is never simple, as it depends on several factors, mostly on how the infrastructure is architected and designed. A maintenance procedure for a performance-optimized setup may not work for a cost-optimized one, and a procedure for an on-premises setup may need some tweaking in the cloud. However, there are some common and generic checks and procedures that are valid across all these different architectures and setups. The focus here is on HANA scale-up scenarios. In this blog we discuss such checks and procedures that may be useful for a SUSE HA admin who has to support a HANA workload.

I have written this blog in two parts. In this first part I will discuss the following health-check procedures that can be performed on a HANA cluster.

  1. Checking status of SUSE HA cluster and HANA system replication
  2. Checking system log for srHook setting HANA SR status in the CIB
  3. Checking the HANA tracefiles for srConnectionChanged() events
  4. Showing monitor runtimes for HANA resource agent from system log

Before performing any administrative activity on a HANA cluster, and after completing the planned activity, it is recommended to do some checks to ensure that the cluster is in a healthy state.

1) Checking status of SUSE HA cluster and HANA system replication

The first check is to run the command "cs_clusterstate -i". It is important to note that all commands starting with cs_* are part of the RPM package "ClusterTools2".


llhana1:~ # cs_clusterstate -i
### llhana1.lab.sk - 2022-04-12 18:44:12 ###
Cluster state: S_IDLE
llhana1:~ #
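
If the command is not found, verify that the package is installed; the package version shown below is just illustrative:

llhana1:~ # rpm -q ClusterTools2
ClusterTools2-3.1.21-1.1.noarch
llhana1:~ #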

It gives the transition state of the cluster as its output. It is important to note that after each step of the administrative activity (for example, a manual takeover) one should check whether the state of the cluster is S_IDLE. Only when the cluster state is S_IDLE should you proceed to the next step. If the state is not S_IDLE, the cluster is in some state of transition or failure and is not ready for another action.
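
Instead of polling by hand, you can block until the cluster settles with cs_wait_for_idle(8) from the same ClusterTools2 package; it returns once the cluster reports S_IDLE. A minimal example, with the 5-second poll interval chosen here just for illustration:

llhana1:~ # cs_wait_for_idle --sleep 5
llhana1:~ #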

We should also always check that the CIB is clean and that no leftover migration constraints remain to be cleared. Any such migration constraint must be cleared before and after the planned administrative activity. On a clean cluster the grep below returns nothing; a way to clear a leftover constraint is shown after it.



llhana1:~ # crm configure show | grep cli-
llhana1:~ #
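
If the grep does return a cli- prefixed constraint, it can be removed with crm. The multi-state resource name msl_SAPHana_TST_HDB00 below is only an example from a typical scale-up setup; use the resource name from your own cluster (on older crmsh versions the sub-command is unmigrate instead of clear):

llhana1:~ # crm resource clear msl_SAPHana_TST_HDB00
llhana1:~ #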

2) Checking system log for srHook setting HANA SR status in the CIB

srHook is an additional and more reliable check that ensures the system replication status on the secondary HANA is monitored and reported. Unlike the resource agent for HANA, which probes from the cluster into the database to check whether system replication is OK, srHook works the other way around: on certain events the database stops its actions and calls an external script, telling it the current system replication status. The script then updates the respective cluster attributes. Once the script has returned successfully, HANA continues. It is therefore important to ensure that srHook is properly configured and is performing its required job of changing the attribute. srHook can be configured for SAP HANA 2.0 SPS 04 and onwards.
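
A quick way to see the attribute that srHook maintains is to query it directly from the CIB. The SID tst and site name TWO below come from the lab setup used throughout this blog; adapt them to your system:

llhana1:~ # crm_attribute -G -n hana_tst_site_srHook_TWO -t crm_config
scope=crm_config  name=hana_tst_site_srHook_TWO value=SOK
llhana1:~ #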

The first check for srHook is to ensure that the script is loaded. For this, we grep the trace files for the event of loading the hook script. If the srHook script is not loaded, we should implement all the steps mentioned in https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-PerfOpt-15/#id-set-up-sap-hana-hadr-providers


llhana1:~ # su - tstadm -c "cdtrace;grep HADR.*load.*SAPHanaS nameserver_*.trc"
nameserver_llhana1.30001.000.trc:[6812]{-1}[-1/-1] 2022-04-11 07:11:45.228171 i ha_dr_provider   HADRProviderManager.cpp(00075) : loading HA/DR Provider 'SAPHanaSR' from /usr/share/SAPHanaSR
llhana1:~ #

If your configuration is correct but you still do not see the srHook script being loaded in the trace file, you may run the following command on the master nameserver of both sites to reload it:


tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -reloadHADRProviders
done.
tstadm@llhana1:/usr/sap/TST/HDB00>
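
After the reload, re-running the grep from above should show a fresh load event with a current timestamp:

llhana1:~ # su - tstadm -c "cdtrace;grep HADR.*load.*SAPHanaS nameserver_*.trc"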

Once you have ensured that the srHook script is loaded, it is time to check the logs for the events where this script has changed or updated the site attribute hana_<sid>_site_srHook_<site> in the CIB.


llhana1:~ # grep "sudo.*crm_attribute.*srHook" /var/log/messages
2022-04-11T07:20:36.231237+02:00 llhana1 sudo:   tstadm : PWD=/hana/shared/TST/HDB00/llhana1 ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_tst_site_srHook_TWO -v SFAIL -t crm_config -s SAPHanaSR
2022-04-11T07:20:36.332162+02:00 llhana1 sudo:   tstadm : PWD=/hana/shared/TST/HDB00/llhana1 ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_tst_site_srHook_TWO -v SFAIL -t crm_config -s SAPHanaSR
2022-04-11T07:20:46.605157+02:00 llhana1 sudo:   tstadm : PWD=/hana/shared/TST/HDB00/llhana1 ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_tst_site_srHook_TWO -v SFAIL -t crm_config -s SAPHanaSR
2022-04-11T07:22:03.420684+02:00 llhana1 sudo:   tstadm : PWD=/hana/shared/TST/HDB00/llhana1 ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_tst_site_srHook_TWO -v SOK -t crm_config -s SAPHanaSR
llhana1:~ #
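
The resulting attribute values are also visible in the output of SAPHanaSR-showAttr(8). The output below is abridged and illustrative for this lab, where site ONE is the primary and site TWO is the system replication target:

llhana1:~ # SAPHanaSR-showAttr
...
Sites srHook
------------
ONE   PRIM
TWO   SOK
...
llhana1:~ #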

3) Checking the HANA tracefiles for srConnectionChanged() events

It can also be important to check the HANA tracefiles for srConnectionChanged() events. The srConnectionChanged() method is called through the HA/DR provider API and reports the current system replication status.


llhana1:~ # su - tstadm -c "cdtrace;grep SAPHanaSR.srConnectionChanged.*called nameserver_*.trc"
nameserver_llhana1.30001.000.trc:[6812]{-1}[-1/-1] 2022-04-11 07:22:03.403528 i ha_dr_SAPHanaSR  SAPHanaSR.py(00086) : SAPHanaSR (0.162.0) SAPHanaSR.srConnectionChanged method called with Dict={'status': 15, 'is_in_sync': True, 'timestamp': '2022-04-11T07:22:03.403350+02:00', 'database': 'TST', 'siteName': 'TWO', 'service_name': 'indexserver', 'hostname': 'llhana1', 'volume': 3, 'system_status': 15, 'reason': '', 'database_status': 15, 'port': '30003'}
nameserver_llhana1.30001.000.trc:[6812]{-1}[-1/-1] 2022-04-11 07:22:03.442792 i ha_dr_SAPHanaSR  SAPHanaSR.py(00115) : SAPHanaSR SAPHanaSR.srConnectionChanged method called with Dict={'status': 15, 'is_in_sync': True, 'timestamp': '2022-04-11T07:22:03.403350+02:00', 'database': 'TST', 'siteName': 'TWO', 'service_name': 'indexserver', 'hostname': 'llhana1', 'volume': 3, 'system_status': 15, 'reason': '', 'database_status': 15, 'port': '30003'} ###
llhana1:~ #

These checks ensure that srHook is functioning as expected and giving us the current status of system replication. Note that the events at 07:22:03 in the trace file match the SOK attribute update seen in the system log above.

4) Showing monitor runtimes for HANA resource agent from system log

The runtime of the monitor operation on the SAP HANA resource agent is usually a few seconds. If we check a sample of the runtimes and find them unusually high, we should first investigate, find the root cause, and fix it. Any unusually high value may be an indication of an underlying issue such as high I/O load, network latency, or something else.



llhana1:~ # grep "SAPHana.*end.action.monitor_clone.*rc=" /var/log/messages | awk  '{print $1,$11,$13}' | colrm 20 32 | tr -d "=()rsc" | tr "T" " "
.
.
2022-04-12 22:30:10 0 3
2022-04-12 22:30:23 0 3
2022-04-12 22:30:36 0 3
2022-04-12 22:30:48 0 2
2022-04-12 22:30:55 8 2
2022-04-12 22:31:01 0 3
llhana1:~ #

In the above output, the first column is the date, the second is the time, the third is the return code of the monitor operation, and the fourth is the time in seconds the monitor operation took to finish. The example is from a scale-up system; for scale-out we need to change the name of the resource agent from SAPHana to SAPHanaController, as shown below.
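
The scale-out variant of the pipeline would look like the following sketch; the awk field positions and colrm columns assume the same syslog layout as above and may need adjusting on your system:

llhana1:~ # grep "SAPHanaController.*end.action.monitor_clone.*rc=" /var/log/messages | awk '{print $1,$11,$13}' | colrm 20 32 | tr -d "=()rsc" | tr "T" " "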

Please also read our other blogs about #TowardsZeroDowntime.


Where can I find further information?

  • SUSECON 2020 BP-1351 Tips, Tricks and Troubleshooting
  • Manual pages
    • SAPHanaSR-ScaleOut(7)
    • ocf_suse_SAPHanaController(7)
    • ocf_suse_SAPHanaTopology(7)
    • SAPHanaSR.py(7)
    • SAPHanaSrMultiTarget.py(7)
    • SAPHanaSR-ScaleOut_basic_cluster(7)
    • SAPHanaSR-showAttr(8)
    • SAPHanaSR_maintenance_examples(7)
    • sbd(8)
    • cs_man2pdf(8)
    • cs_show_hana_info(8)
    • cs_wait_for_idle(8)
    • cs_clusterstate(8)
    • cs_show_sbd_devices(8)
    • cs_make_sbd_devices(8)
    • supportconfig_plugins(5)
    • crm(8)
    • crmadmin(8)
    • crm_mon(8)
    • ha_related_suse_tids(7)
    • ha_related_sap_notes(7)
  • SUSE support TIDs
    • Troubleshooting the SAPHanaSR python hook (000019865)
    • Indepth HANA Cluster Debug Data Collection (PACEMAKER, SAP) (7022702)
    • HANA SystemReplication doesn’t provide SiteName … (000019754)
    • SAPHanaController running in timeout when starting SAP Hana (000019899)
    • SAP HANA monitors timed out after 5 seconds (000020626)
  • Related blog articles: https://www.suse.com/c/tag/towardszerodowntime/
  • Product documentation: https://documentation.suse.com/
Sanjeet Kumar Jha is a SAP Solution Architect for High Availability at SUSE, with over a decade of experience with SUSE high availability technologies for SAP applications.