Basic health check for two-node SAP HANA performance based model
This document (7022984) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 11 Service Pack 3
SUSE Linux Enterprise Server for SAP Applications 11 Service Pack 4
SUSE Linux Enterprise Server for SAP Applications 12 Service Pack 1
SUSE Linux Enterprise Server for SAP Applications 12 Service Pack 2
Situation
Resolution
For the purposes of this document, 'master' can be equated to 'primary' (mode: PRIMARY) and 'slave' can be equated to 'secondary' (mode: SYNC).
1. Put the cluster into maintenance mode (see TID#7023135) or stop pacemaker on each node. While the cluster is in maintenance mode, no cluster or resource actions will be initiated until the cluster is taken out of maintenance mode.
If pacemaker is manually stopped on each node, the cluster will attempt to shut down the SAP database and related processes on the node where pacemaker is being stopped. To avoid the possibility of triggering a 'take-over', stop pacemaker on the 'slave' node first and allow enough time for the unload process to complete.
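As an illustration only (the exact commands can vary with the SLES release and cluster configuration, so verify against TID#7023135 and your own setup): the whole cluster can typically be placed in maintenance mode by running, as root on one node,
crm configure property maintenance-mode=true
and pacemaker can instead be stopped per node ('slave' node first), as root, with 'systemctl stop pacemaker' on SLES 12 or 'rcopenais stop' on SLES 11.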
2. Check the current status of the SAP node synchronization:
Log in to the server that is designated as the SAP database 'primary' node (the node designated to host the 'master', i.e. non-slave, database) using the SAP administrator account (e.g. a00adm, where 'a00' is the SAP System ID). The SAP administrator account is created when the SAP product is installed.
NOTE: If the SAP administrator account password is unknown/lost, that password can be safely changed without causing issues. This account password is per-server and not synchronized across nodes, so changing the password to the same known password on both nodes is prudent.
In the following examples, the 'SAP HANA System ID' is 'A00' and the 'SAP Instance Number' is '00'.
NOTE: depending on which access method is used (direct console login, ssh etc.), the shell prompt may show as 'user@hostname/<path>' or may display as something like 'sh-4.2$'.
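For example, if already logged in to the node as root, you can switch to the SAP administrator account (assuming the SAP System ID is 'A00', so the account name is 'a00adm') with:
su - a00adm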
Execute 'HDB info' to show which SAP-related processes are running on that node.
An example showing that only the processes related to running the 'HDB info' command itself and the standard SAP instance service daemon (sapstartsrv) are active:
a00adm@sapn1:/usr/sap/A00/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
a00adm 5183 5178 0.0 87684 1804 sshd: a00adm@pts/0
a00adm 5184 5183 0.1 14808 3620 \_ -sh
a00adm 5269 5184 0.0 13200 1824 \_ /bin/sh /usr/sap/A00/HDB00/HDB info
a00adm 5294 5269 0.0 26668 1356 \_ ps fx -U a00adm -o user,pid,ppid,pcpu,vsz,rss,args
a00adm 2104 1 0.0 362484 27184 /usr/sap/A00/HDB00/exe/sapstartsrv pf=/usr/sap/A00/SYS/profile/A00_HDB00_sapn1 -D -u a00adm
a00adm 2004 1 0.0 31844 2352 /usr/lib/systemd/systemd --user
a00adm 2008 2004 0.0 63796 2620 \_ (sd-pam)
a00adm@sapn1:/usr/sap/A00/HDB00>
An example showing that the node is currently running a SAP database and related SAP processes:
a00adm@sapn1:/usr/sap/A00/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
a00adm 5183 5178 0.0 87684 1804 sshd: a00adm@pts/0
a00adm 5184 5183 0.0 14808 3624 \_ -sh
a00adm 5994 5184 0.0 13200 1824 \_ /bin/sh /usr/sap/A00/HDB00/HDB info
a00adm 6019 5994 0.0 26668 1356 \_ ps fx -U a00adm -o user,pid,ppid,pcpu,vsz,rss,args
a00adm 5369 1 0.0 20932 1644 sapstart pf=/usr/sap/A00/SYS/profile/A00_HDB00_sapn1
a00adm 5377 5369 1.8 582944 292720 \_ /usr/sap/A00/HDB00/sapn1/trace/hdb.sapA00_HDB00 -d -nw -f /usr/sap/A00/HDB00/sapn1/daemon.ini pf=/usr/sap/A00/SYS/profile/A00_HDB00_sapn1
a00adm 5394 5377 9.3 3930388 1146444 \_ hdbnameserver
a00adm 5548 5377 21.3 2943472 529672 \_ hdbcompileserver
a00adm 5550 5377 4.4 2838792 465664 \_ hdbpreprocessor
a00adm 5571 5377 91.6 7151116 4019640 \_ hdbindexserver
a00adm 5573 5377 21.8 4323488 1203128 \_ hdbxsengine
a00adm 5905 5377 18.9 3182120 710680 \_ hdbwebdispatcher
a00adm 2104 1 0.0 428748 27760 /usr/sap/A00/HDB00/exe/sapstartsrv pf=/usr/sap/A00/SYS/profile/A00_HDB00_sapn1 -D -u a00adm
a00adm 2004 1 0.0 31844 2352 /usr/lib/systemd/systemd --user
a00adm 2008 2004 0.0 63796 2620 \_ (sd-pam)
a00adm@sapn1:/usr/sap/A00/HDB00>
To check whether the nodes in the cluster can synchronize properly, both nodes must be correctly running the expected SAP database and processes. Remember that, even though one node is designated as the slave and one as the master, both nodes actually run a database; the master database is continuously synchronized to the slave database.
If the SAP database and services are not active on a node, run the appropriate command to start the processes:
e.g. 'HDB start'
or 'sapcontrol -nr 00 -function Start', where '00' is the SAP instance number.
NOTE: To stop the SAP database and processes on a node, you can use 'HDB stop' or 'sapcontrol -nr 00 -function Stop' , where '00' is the number of the SAP instance.
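As an optional additional check (assuming the instance number is '00'), 'sapcontrol' can also list the state of the instance processes:
sapcontrol -nr 00 -function GetProcessList
Once the database is fully started, all listed processes should be reported with a GREEN dispstatus.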
Once both nodes are showing that SAP is active (check using 'HDB info'), the synchronization state of the databases can be checked.
If the SAP installation is functioning correctly, you should see something similar to the following by executing the python script 'systemReplicationStatus.py' on each node:
On Master/Primary node:
sh-4.2$ pwd
/hana/shared/A00/HDB00/exe/python_support
sh-4.2$ python systemReplicationStatus.py
| Host | Port | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary | Replication | Replication | Replication |
| | | | | | | Host | Port | Site ID | Site Name | Active Status | Mode | Status | Status Details |
| ----- | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| sapn1 | 30007 | xsengine | 2 | 1 | node1 | sapn2 | 30007 | 2 | node2 | YES | SYNC | ACTIVE | |
| sapn1 | 30001 | nameserver | 1 | 1 | node1 | sapn2 | 30001 | 2 | node2 | YES | SYNC | ACTIVE | |
| sapn1 | 30003 | indexserver | 3 | 1 | node1 | sapn2 | 30003 | 2 | node2 | YES | SYNC | ACTIVE | |
status system replication site "2": ACTIVE
overall system replication status: ACTIVE
Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mode: PRIMARY
site id: 1
site name: node1
sh-4.2$
-------------------------------------------------------------------------------------------
On the Slave/Secondary node:
sh-4.2$ pwd
/hana/shared/A00/HDB00/exe/python_support
sh-4.2$ python systemReplicationStatus.py
this system is either not running or not primary system replication site
Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mode: SYNC
site id: 2
site name: node2
active primary site: 1
primary masters: sapn1
sh-4.2$
If the 'replication status' of any of the SAP processes is not showing as 'ACTIVE', the databases may simply need more time to 'catch up' to the point where they are fully in sync. How long this takes depends on how long ago the SAP processes were started on each node, how long the master database was running while the slave database was down or otherwise unavailable for syncing, and how much data has been written to the master database since the last complete sync. Depending on these factors, the time required to sync can range from a few minutes to several hours.
NOTE: The python script 'systemReplicationStatus.py' is located in the '/hana/shared/<system_id>/HDB<instance_number>/exe/python_support' directory.
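As a supplementary check of the replication state, the 'hdbnsutil' tool can be run as the SAP administrator account on each node; its output (mode, site id, site name) should agree with what 'systemReplicationStatus.py' reports:
hdbnsutil -sr_state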
If it becomes clear that the SAP database synchronization is not working, it may be necessary to reconfigure/re-enable replication between the master and slave nodes*, or it may be necessary to contact the SAP support organization for assistance.
* See TID#7023127 - 'How to re-enable replication in a two-node SAP performance based model'.
Don't forget to take the cluster out of maintenance mode when appropriate (the cluster will remain in maintenance mode even after the nodes are rebooted, unless it is manually taken out of maintenance mode). If the nodes have not been rebooted, take care to return all of the cluster resources to the same state they were in when the cluster was put into maintenance mode before bringing the cluster out of maintenance mode; otherwise the cluster may not reflect the true state of each resource and, on a failure, may not behave as expected.
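As with entering maintenance mode, the exact command may vary with the environment; typically the cluster is taken out of maintenance mode by running, as root on one node,
crm configure property maintenance-mode=false
after which the resource states can be verified with 'crm status' or 'crm_mon -r'.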
Cause
Additional Information
If the SAP nodes are in-sync but problems have developed with the SUSE operating system or SUSE High Availability clustering extension, then opening a support request with SUSE is likely the right course of action.
Please note that SUSE is not responsible for the configuration of a cluster. SUSE consulting services can be employed to configure or re-configure the product.
If a business-critical situation exists, where the SAP nodes are 'in sync' but a problem exists with the SUSE High Availability extension, downtime can be avoided by running the SAP nodes with the cluster in maintenance mode, or with pacemaker stopped on each node, until downtime for remediation is acceptable.
Useful SAP Notes:
2434562 - System Replication Hanging in Status "SYNCING" or "ERROR" With Status Detail "Missing Log" or "Invalid backup size"
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7022984
- Creation Date: 18-May-2018
- Modified Date: 12-Oct-2022
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com