SAP NW cluster failover due to frequent sapstartsrv restarts
This document (000020517) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 12
Situation
node01 pacemaker-execd[5806]: warning: rsc_sapinst_HA1_SCS01_monitor_11000 process (PID 23112) timed out
The logs report many sapstartsrv restarts triggered by the cluster:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[1767]: WARNING: sapstartsrv is running for instance ASCS01, that service will be killed
node01 SAPHA1_01[1990]: SAP Service SAPHA1_01 successfully started.
It appears that sapstartsrv was restarted during an ongoing monitor operation, or that the restart took longer than usual, which could have led to the resource monitor timeout:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[26602]: WARNING: sapstartsrv is running for instance ASCS01, that service will be killed
node01 pacemaker-execd[5806]: warning: rsc_sapinst_HA1_SCS01_monitor_11000 process (PID 23112) timed out
While sapstartsrv was not fully running, the cluster tried to recover the ASCS resource, essentially a stop followed by a start (using sapcontrol), but sapcontrol did not respond:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[29105]: WARNING: sapstartsrv is not running for instance HA1-SCS01 (no UDS), it will be started now
node01 pacemaker-execd[5806]: warning: rsc_sapinst_HA1_SCS01_stop_0 process (PID 28238) timed out
node01 pacemaker-execd[5806]: warning:rsc_sapinst_HA1_SCS01_stop_0[29105] timed out after 600000ms
A failed stop operation is critical for the cluster because it can no longer guarantee data integrity, so it issues a fence operation:
node01 pacemaker-controld[5809]: notice: Requesting fencing (reboot) of node node01
node01 pacemaker-fenced[5805]: notice: Requesting that node02 perform 'reboot' action targeting node01
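To gauge how often the resource agent restarted sapstartsrv, the warning lines quoted above can be counted in a log file. The following is only a minimal sketch: the function name count_sapstartsrv_kills is hypothetical, and the log file location in the usage note is an assumption that depends on the local syslog setup.

```shell
#!/bin/bash
# Hypothetical helper, not part of any SUSE or SAP tooling.
# Reads log text from stdin and prints the number of SAPInstance
# warnings that announce a sapstartsrv kill/restart.
count_sapstartsrv_kills() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c 'SAPInstance(.*WARNING: sapstartsrv is running for instance .*, that service will be killed' || true
}
```

Usage, assuming the cluster messages are written to /var/log/messages: count_sapstartsrv_kills < /var/log/messages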
Resolution
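Make sure that the instance name inside the InstanceName parameter of the SAPInstance resource (its second "_"-separated field) matches the value returned by the following command:
sapcontrol -nr 01 -function ParameterValue INSTANCE_NAME -format script | grep '^0 : ' | cut -d' ' -f3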
An example of a correct ASCS resource configuration would be:
primitive rsc_sap_HA1_ASCS00 SAPInstance \
    operations $id=rsc_sap_HA1_ASCS00-operations \
    op monitor interval=11 timeout=60 on-fail=restart \
    params InstanceName=HA1_ASCS00_sapha1as \
        START_PROFILE="/sapmnt/HA1/profile/HA1_ASCS00_sapha1as" \
        AUTOMATIC_RECOVER=false \
    meta resource-stickiness=5000 failure-timeout=60 migration-threshold=1 \
        priority=10
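The InstanceName value in the example has three "_"-separated fields: SID, instance name, and host name. As a rough illustration, the value to configure could be assembled from the name that sapstartsrv itself reports. The function name build_instance_name and the sample values in the usage note are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: assemble the InstanceName resource parameter
# from its three parts.
#   $1 = SAP system ID (for example HA1)
#   $2 = instance name as reported by sapstartsrv (for example ASCS01)
#   $3 = host name used in the instance profile name (for example host01)
build_instance_name() {
    local sid="$1" instance="$2" host="$3"
    printf '%s_%s_%s\n' "$sid" "$instance" "$host"
}
```

Usage: build_instance_name HA1 ASCS01 host01 prints HA1_ASCS01_host01, whose second field matches the running instance name.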
For more details, refer to the SUSE best practice guides:
https://documentation.suse.com/sbp/all/html/SAP_NW740_SLE15_SetupGuide/index.html
https://documentation.suse.com/sbp/all/html/SAP_S4HA10_SetupGuide-SLE15/index.html
Cause
WARNING: sapstartsrv is running for instance ASCS01, that service will be killed
This warning and the restart it announces come from the following code in the SAPInstance resource agent:
#/usr/lib/ocf/resource.d/heartbeat/SAPInstance
---------------------------------------------------------------
check_sapstartsrv() {
  local restart=0
  ...
  if [ ! -S /tmp/.sapstream5${InstanceNr}13 ]; then
    ocf_log warn "sapstartsrv is not running for instance $SID-$InstanceName (no UDS), it will be started now"
    restart=1
  else
    output=`$SAPCONTROL -nr $InstanceNr -function ParameterValue INSTANCE_NAME -format script`
    if [ $? -eq 0 ]
    then
      runninginst=`echo "$output" | grep '^0 : ' | cut -d' ' -f3`
      if [ "$runninginst" != "$InstanceName" ]
      then
        ocf_log warn "sapstartsrv is running for instance $runninginst, that service will be killed"
        restart=1
  ...
More specifically, the following command:
sapcontrol -nr 01 -function ParameterValue INSTANCE_NAME -format script | grep '^0 : ' | cut -d' ' -f3
returns ASCS01, which the agent stores in runninginst, while $InstanceName holds SCS01. The agent derives $InstanceName from the resource parameter with echo "$OCF_RESKEY_InstanceName" | cut -d_ -f2, so for HA1_SCS01_host01 the result is SCS01.
Because ASCS01 != SCS01, the agent logs the warning above and restarts sapstartsrv on every monitor operation.
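The comparison can be reproduced outside the cluster to check a resource definition before it is applied. The following is a minimal sketch that reuses the same cut and grep pipelines as the resource agent. The function names are hypothetical, and the sapcontrol output is passed in as text so that the sketch does not depend on a running instance.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the checks in check_sapstartsrv().

# Second "_"-separated field of the InstanceName resource parameter,
# for example HA1_SCS01_host01 -> SCS01.
configured_instance() {
    echo "$1" | cut -d_ -f2
}

# Third space-separated field of the line starting with "0 : " in the
# output of "sapcontrol ... ParameterValue INSTANCE_NAME -format script".
running_instance() {
    echo "$1" | grep '^0 : ' | cut -d' ' -f3
}

# $1 = InstanceName resource parameter, $2 = sapcontrol output text.
# Returns 0 if both names match, 1 if the agent would restart sapstartsrv.
instance_names_match() {
    local configured running
    configured=$(configured_instance "$1")
    running=$(running_instance "$2")
    [ "$configured" = "$running" ]
}
```

With the values from this case, instance_names_match "HA1_SCS01_host01" "0 : ASCS01" returns a non-zero status, whereas instance_names_match "HA1_ASCS01_host01" "0 : ASCS01" returns 0.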
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020517
- Creation Date: 25-Nov-2021
- Modified Date: 03-Dec-2021
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com