pacemaker-controld: notice: High CPU load detected: 180.869995
This document (000020521) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise High Availability Extension 15
Situation
22-11-08T02:53:41 <host01> pacemaker-controld: notice: High CPU load detected: 180.869995
Resolution
High load is problematic because services or tasks that are waiting for IO to complete may not respond in time to the cluster's monitor operations. For that reason, Pacemaker monitors the system load continuously and adapts to high load by throttling the number of actions it forks at the same time, so that it does not add further load to the system. It does this by lowering the "node-action-limit" parameter, which specifies the maximum number of actions that can be scheduled concurrently per node and defaults to twice the number of CPU cores. The system is considered to be under high load if the load average satisfies the following condition:
${load-average} > 2 * ${number-of-cores} * ${load-threshold}
In a high load situation, pacemaker-controld allows only one action at a time. The cluster itself can adapt and operate correctly even under very high load. However, the services managed by the cluster (especially those that depend heavily on IO) might not respond in time, which forces the cluster to recover the resource by restarting it on the same node or moving it to another node.
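The condition and the throttling behaviour described above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this document, not Pacemaker's implementation, whose throttling may be more graduated. The function names are hypothetical, and the default of 0.8 for the load threshold is an assumption based on the usual 80% default of the "load-threshold" cluster option; check the value configured in your cluster.

```python
import os

# Assumption: "load-threshold" is left at its usual default of 80%.
DEFAULT_LOAD_THRESHOLD = 0.8


def high_load_limit(cores, load_threshold=DEFAULT_LOAD_THRESHOLD):
    """Load average above which the formula in this document reports high load."""
    return 2 * cores * load_threshold


def is_high_load(load_average, cores, load_threshold=DEFAULT_LOAD_THRESHOLD):
    """Evaluate: load-average > 2 * number-of-cores * load-threshold."""
    return load_average > high_load_limit(cores, load_threshold)


def allowed_actions(load_average, cores, load_threshold=DEFAULT_LOAD_THRESHOLD):
    """Simplified model: one action under high load, otherwise 2 x CPU cores."""
    if is_high_load(load_average, cores, load_threshold):
        return 1
    return 2 * cores


def sample_local_node():
    """Apply the model to the 1-minute load average of the local machine."""
    cores = os.cpu_count() or 1
    load_1min = os.getloadavg()[0]
    return load_1min, cores, allowed_actions(load_1min, cores)
```

With these assumptions, a 16-core node crosses the limit at a load average of 25.6, so the value of 180.869995 from the log message above is far beyond it and `allowed_actions(180.869995, 16)` yields a single concurrent action.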
While the cause of an IO bottleneck (NFS, network, disk IO, etc.) is being investigated, the "on-fail=ignore" option can be applied temporarily to the monitor operation of the frequently failing resources in order to avoid unwanted failovers or service restarts. For example:
Setting "on-fail=ignore" on a HANA resource:
primitive rsc_SAPHana_RDI_HDB14 ocf:suse:SAPHana \
op start interval=0 timeout=3600 \
op stop interval=0 timeout=3600 \
op promote interval=0 timeout=3600 \
op monitor interval=60 role=Master timeout=700 on-fail=ignore \
op monitor interval=61 role=Slave timeout=700 on-fail=ignore \
...
Setting "on-fail=ignore" on a Netweaver resource:
primitive rsc_sap_RPI_ASCS00 SAPInstance \
operations $id=rsc_sap_RPI_ASCS00-operations \
op monitor interval=31 on-fail=ignore timeout=180 \
...
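Because "on-fail=ignore" is meant to be temporary, it helps to be able to list the resources that still carry it once the investigation is finished. The following Python sketch scans configuration text in the crm shell syntax used in the examples above and reports the monitor operations that ignore failures. It is a minimal text parser written for this illustration, not part of crmsh. The function names and the third primitive (rsc_ip) in the sample are hypothetical, and the parser assumes that each primitive is written with backslash line continuations as shown above.

```python
import re

# Sample text in crm shell syntax. The first two primitives are abbreviated
# from the examples above; rsc_ip is a hypothetical resource for contrast.
SAMPLE = r"""
primitive rsc_SAPHana_RDI_HDB14 ocf:suse:SAPHana \
  op start interval=0 timeout=3600 \
  op monitor interval=60 role=Master timeout=700 on-fail=ignore \
  op monitor interval=61 role=Slave timeout=700 on-fail=ignore
primitive rsc_sap_RPI_ASCS00 SAPInstance \
  operations $id=rsc_sap_RPI_ASCS00-operations \
  op monitor interval=31 on-fail=ignore timeout=180
primitive rsc_ip IPaddr2 \
  op monitor interval=10 timeout=20
"""


def join_continuations(text):
    """Merge lines ending in a backslash into single statements."""
    statements, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("\\"):
            current.append(line[:-1].strip())
        else:
            current.append(line)
            statements.append(" ".join(current))
            current = []
    if current:
        statements.append(" ".join(current))
    return statements


def monitors_ignoring_failures(config_text):
    """Map each primitive name to its monitor ops that set on-fail=ignore."""
    findings = {}
    for statement in join_continuations(config_text):
        tokens = statement.split()
        if len(tokens) < 2 or tokens[0] != "primitive":
            continue
        # Split on the standalone keyword "op"; "operations" does not match.
        operations = re.split(r"\sop\s+", statement)[1:]
        hits = []
        for operation in operations:
            parts = operation.split()
            if parts and parts[0] == "monitor" and "on-fail=ignore" in parts:
                hits.append(operation)
        if hits:
            findings[tokens[1]] = hits
    return findings


RESULT = monitors_ignoring_failures(SAMPLE)
```

For the sample text, RESULT contains the two SAP resources and omits rsc_ip, whose monitor operation keeps the default failure handling. The same function can be applied to a saved copy of the cluster configuration.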
Cause
Additional Information
https://www.suse.com/de-de/support/kb/doc/?id=000019553
https://www.suse.com/de-de/support/kb/doc/?id=000019509
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020521
- Creation Date: 30-Nov-2021
- Modified Date: 12-May-2022
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com