Long Client hang to Cluster after failover of ERS Instance
This document (7023324) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise High Availability Extension 11
Situation
After a shutdown or destroy on one of the two nodes (in most cases the node holding the ERS instance), the failover happens according to the configuration. From a cluster perspective everything seems to be 100% correct and working. But the client connections to the SAP application that have a lock hang for about 15 minutes.
This issue is not caused by the HA setup but by the SAP monitoring. However, it is most likely to become visible in an HA setup.
The attempts of the new ERS instance to connect to the ASCS can be seen with:
tail -f /usr/sap/<SID>/ASCS<Instancenumber>/work/dev_enqrepl
on the server carrying the ASCS instance.
To our knowledge this issue only occurs on ENSA1 (ENSA = Standalone Enqueue Server) setups.
Resolution
As a workaround, lower the value of the kernel parameter:
/proc/sys/net/ipv4/tcp_retries2
which can also be persistently stored in /etc/sysctl.conf as:
net.ipv4.tcp_retries2=n
Where "n" should be replaced with a value lower than the default of 15. It is suspected that a value of 8 or 9 would be sufficient to work around this issue. It is not recommended to make this value any lower than is absolutely necessary to avoid the problem.
After altering that file, it can be activated with:
sysctl -p
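To confirm which value is actually in effect, the parameter can be read back from /proc. The following is a minimal sketch only: the function name check_tcp_retries2 is hypothetical and not part of any SUSE or SAP tooling, and the thresholds simply restate the numbers used in this document (default 15, not lower than 8).

```shell
#!/bin/sh
# Hypothetical helper: classify a tcp_retries2 value.
# With no argument it reads the live value from /proc.
check_tcp_retries2() {
    value=${1:-$(cat /proc/sys/net/ipv4/tcp_retries2)}
    case $value in
        ''|*[!0-9]*) echo "not a number: $value" >&2; return 1 ;;
    esac
    if [ "$value" -lt 8 ]; then
        echo "tcp_retries2=$value: below 8, check the side effects first"
    elif [ "$value" -lt 15 ]; then
        echo "tcp_retries2=$value: lowered from the default of 15"
    else
        echo "tcp_retries2=$value: default or higher, workaround not active"
    fi
}

# Demonstration with explicit values instead of the live system value.
check_tcp_retries2 15
check_tcp_retries2 8
```

Calling check_tcp_retries2 without an argument on a cluster node reports on the running kernel value.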
Possible side effects:
1. This is a global value, so it can affect timeout of all TCP connections. Modifications do carry some risks. RFC 1122 recommends at least 100 seconds for certain timeouts, which corresponds to a tcp_retries2 value of at least 8. A lower value might be tried but would require careful testing and monitoring for unintended consequences which may not be noticed until much later.
2. On certain public cloud environments, infrastructure maintenance procedures rely on VM instances keeping TCP connections for 30 seconds. In those environments tcp_retries2 should not be set lower than 8.
3. Lowering this value may cause NFS connections to time out earlier. This can cause NFS clients to try to reestablish the connection with the same source and destination ports, a practice often referred to as "connection reuse". Many security-conscious devices (such as smart routers, firewalls, frontends, etc.) may treat connection reuse with suspicion and may block such activity, leading to NFS client failures. See https://www.suse.com/support/kb/doc/?id=000019722 for more details.
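As a rough illustration of the figures in side effect 1, the lower bound of the timeout for a given tcp_retries2 value can be estimated by summing the exponentially growing retransmission intervals. The sketch below assumes the usual Linux limits of 200 ms minimum and 120 s maximum retransmission timeout (these constants are not stated in this document) and ignores the measured round-trip time, so real timeouts can be longer. The function name retries2_timeout_ms is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimated lower bound, in milliseconds, of the
# time until a connection is dropped for a given tcp_retries2 value.
# Assumption: RTO starts at 200 ms, doubles each time, capped at 120000 ms.
retries2_timeout_ms() {
    n=$1
    rto=200
    total=0
    i=0
    while [ "$i" -le "$n" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt 120000 ]; then
            rto=120000
        fi
        i=$((i + 1))
    done
    echo "$total"
}

for n in 7 8 9 15; do
    echo "tcp_retries2=$n -> $(retries2_timeout_ms "$n") ms"
done
```

Under these assumptions the default of 15 gives 924600 ms, which is in line with the hang of about 15 minutes described under Situation, while 8 gives 102200 ms and 7 gives 51000 ms.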
Cause
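The TCP connection of the enqueue server does not use a keepalive timer. This can be verified by first looking up the connection with:
brora:~ # ss -pt | grep HA1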
ESTAB 0 0 10.162.192.139:50016 10.162.192.213:20500 users:(("en.sapHA1_ASCS0",4526,41))
and then checking for a keepalive timer:
brora:~ # ss -o | grep keep
tcp ESTAB 0 0 10.162.192.216:ssh 10.162.192.213:36686 timer:(keepalive,113min,0)
tcp ESTAB 0 0 10.162.192.216:rpasswd 10.162.192.191:nfs timer:(keepalive,9.776ms,0)
which is not to be found.
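The two-step check above can be sketched as a single filter over ss output. This is an illustration only: the function name has_keepalive is hypothetical, and the sample lines are the ones shown in this document. On a live system the input would come from something like "ss -tno" (TCP sockets, numeric ports, timer information).

```shell
#!/bin/sh
# Hypothetical helper: read ss-style output on stdin, select the lines
# containing the given address or port fragment, and report whether any
# of them carries a keepalive timer.
has_keepalive() {
    if grep -F -- "$1" | grep -q 'timer:(keepalive'; then
        echo "keepalive timer present"
    else
        echo "no keepalive timer"
    fi
}

# Demonstration with the sample output from this document.
printf '%s\n' \
    'ESTAB 0 0 10.162.192.139:50016 10.162.192.213:20500 users:(("en.sapHA1_ASCS0",4526,41))' \
    | has_keepalive ':20500'
printf '%s\n' \
    'tcp ESTAB 0 0 10.162.192.216:ssh 10.162.192.213:36686 timer:(keepalive,113min,0)' \
    | has_keepalive ':36686'
```

On a cluster node, "ss -tno | has_keepalive ':20500'" would perform the same check against the live connection, with the port adapted to the local setup.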
To ensure that the enqueue replication process times out faster and accepts the replication for the new ERS instance, lowering tcp_retries2 as described above is the workaround.
Additional Information
SAPStartSrv_basic_cluster(7)
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7023324
- Creation Date: 03-Sep-2018
- Modified Date: 08-Mar-2021
- SUSE Linux Enterprise High Availability Extension
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com