Long Client hang to Cluster after failover of ERS Instance

This document (7023324) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise High Availability Extension 11

Situation

An ASCS/ERS cluster is set up, fencing is in place and all other testing done. Everything works according to plan from an HA cluster perspective.

After a shutdown or destroy on one of the two nodes (in most cases the node holding the ERS instance), the failover happens according to the configuration. From a cluster perspective everything seems to be 100% correct and working. However, client connections to the SAP application that hold a lock hang for about 15 minutes.

This issue is not caused by the HA setup but by the SAP monitoring. However, it is most likely to become visible in an HA setup.

The attempts of the new ERS instance to connect to the ASCS can be seen with:
tail -f /usr/sap/<SID>/ASCS<Instancenumber>/work/dev_enqrepl
on the server carrying the ASCS instance.

To our knowledge this issue only occurs on ENSA1 (ENSA = Standalone Enqueue Server) setups.

Resolution

The core issue is in SAP's handling of the situation, but as a workaround, the TCP level connection timeout can be decreased by lowering the value of:

/proc/sys/net/ipv4/tcp_retries2

which can also be persistently stored in /etc/sysctl.conf as:

net.ipv4.tcp_retries2=n

Where "n" should be replaced with a value lower than the default of 15. It is suspected that a value of 8 or 9 would be sufficient to work around this issue. It is not recommended to make this value any lower than is absolutely necessary to avoid the problem.

After editing /etc/sysctl.conf, the new value can be activated with:

sysctl -p
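
To compare the running value with the persistent setting before and after the change, a small helper along the following lines could be used. This is only a sketch and not part of the original procedure: the function name show_tcp_retries2 and its two optional path arguments are hypothetical, and it inspects only /etc/sysctl.conf, not any files under /etc/sysctl.d/.

```shell
#!/bin/sh
# Hypothetical helper: show the running tcp_retries2 value and the
# persistent setting, if one exists. Both paths can be overridden,
# which is only meant to make the function easy to try on test files.
show_tcp_retries2() {
    proc_file=${1:-/proc/sys/net/ipv4/tcp_retries2}
    conf_file=${2:-/etc/sysctl.conf}

    # Value currently used by the kernel
    echo "running: $(cat "$proc_file")"

    # Last matching assignment in the configuration file, if any
    persistent=$(sed -n 's/^[[:space:]]*net\.ipv4\.tcp_retries2[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf_file" 2>/dev/null | tail -n 1)
    echo "persistent: ${persistent:-not set}"
}
```

Called without arguments on a cluster node, the function reads the live kernel value and /etc/sysctl.conf. If the two values differ after editing the file, the new setting has not been activated yet.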

Possible side effects:

1. This is a global value, so it can affect the timeout of all TCP connections. Modifications do carry some risks. RFC 1122 recommends at least 100 seconds for certain timeouts, which corresponds to a tcp_retries2 value of at least 8. A lower value might be tried but would require careful testing and monitoring for unintended consequences which may not be noticed until much later.

2. On certain public cloud environments, infrastructure maintenance procedures rely on the VM instance keeping TCP connections for 30 seconds. In those environments tcp_retries2 should not be set lower than 8.

3. Lowering this value may cause NFS connections to time out earlier. This can cause NFS clients to try to reestablish the connection with the same source and destination ports, a practice often referred to as "connection reuse". Many security-conscious devices (such as smart routers, firewalls, frontends, etc.) may treat connection reuse with suspicion and may block such activity, leading to NFS client failures. See https://www.suse.com/support/kb/doc/?id=000019722 for more details.
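
The relation between tcp_retries2 and the resulting timeout can be estimated with a short calculation. The sketch below assumes the usual Linux constants of a 200 ms minimum retransmission timeout that doubles after every attempt and is capped at 120 seconds. The real timeout depends on the measured round-trip time of the connection, so the result is a hypothetical lower bound rather than an exact figure. The function name tcp_retries2_timeout_ms is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate the TCP timeout in milliseconds for a
# given tcp_retries2 value.
# Assumptions: initial RTO 200 ms, RTO cap 120000 ms (Linux defaults).
tcp_retries2_timeout_ms() {
    retries=$1
    rto=200          # assumed initial retransmission timeout in ms
    rto_max=120000   # assumed upper limit for the timeout in ms
    total=0
    i=0
    # The original transmission plus each retransmission waits one RTO
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt "$rto_max" ]; then
            rto=$rto_max
        fi
        i=$((i + 1))
    done
    echo "$total"
}
```

Under these assumptions, the default of 15 gives 924600 ms (roughly 15 minutes, which matches the observed hang), 9 gives 204600 ms, and 8 gives 102200 ms, which is just above the 100 seconds mentioned for RFC 1122.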

Cause

This seems to be hardware and setup related as it only happens in some environments. The issue is that the "en" replication instance on the ASCS node can only have one active replication partner but does not use a keepalive. As a result, the Enqueue process does not notice that the ERS instance has been started again. This can be checked on the ASCS node by searching for the connection of the "en" process:

brora:~ # ss -pt | grep HA1
ESTAB 0 0 10.162.192.139:50016 10.162.192.213:20500 users:(("en.sapHA1_ASCS0",4526,41))

and then checking for connections with a keepalive timer:

brora:~ # ss -o | grep keep
tcp ESTAB 0 0 10.162.192.216:ssh 10.162.192.213:36686 timer:(keepalive,113min,0)
tcp ESTAB 0 0 10.162.192.216:rpasswd 10.162.192.191:nfs timer:(keepalive,9.776ms,0)

The connection of the "en" process does not appear in this list, so no keepalive is active for it.
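
The two checks can also be combined into one step. The following sketch filters ss output for the connections of a given process and prints only those without a keepalive timer. It assumes that ss prints the process and timer information on the same line as the connection, which may differ between iproute2 versions. The function name no_keepalive is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -tpo` output on stdin and print the
# connections of the given process that carry no keepalive timer.
# Assumption: process and timer details are on the connection's line.
no_keepalive() {
    proc_name=$1
    # Keep only lines of the process, then drop those with a keepalive
    grep -F -- "$proc_name" | grep -v -F 'timer:(keepalive'
}

# Example call on the ASCS node (process name taken from the output above):
#   ss -tpo | no_keepalive 'en.sapHA1'
```

Any line printed by the example call is a connection of the "en" process that relies solely on TCP retransmission timeouts to detect a dead peer.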

To make the Enqueue replication process time out faster and accept the replication connection from the new ERS instance, lowering tcp_retries2 as described in the Resolution section is the workaround.

Additional Information

sysctl.conf(5)
SAPStartSrv_basic_cluster(7)
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.