Cluster reboots frequently without cause in logs (PACEMAKER)

This document (7018594) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12 SP2

Situation

Nodes in a cluster are rebooted frequently. Among the last log entries on the affected node are:

2017-01-04T14:29:38.217687-06:00 saturn kernel: [160693.230387] cgroup: fork rejected by pids controller in /system.slice/pacemaker.service
2017-01-04T14:36:49.590171-06:00 saturn kernel: [ 42.724175] cgroup: fork rejected by pids controller in /system.slice/pacemaker.service
2017-01-04T16:42:34.676537-06:00 saturn kernel: [ 41.473142] cgroup: fork rejected by pids controller in /system.slice/pacemaker.service

This only happens on SLES 12 SP2. Clusters running SLES 11, SLES 12, or SLES 12 SP1 are not affected.
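To confirm that a node matches this situation, the log can be searched for the kernel message quoted above. The following is a minimal sketch; the helper name count_fork_rejections is hypothetical, and the log location on a real node (for example /var/log/messages) is an assumption that depends on the logging setup.

```shell
#!/bin/sh
# Hypothetical helper: count pids-controller fork rejections
# for pacemaker.service in the given log file.
count_fork_rejections() {
    grep -cF 'fork rejected by pids controller in /system.slice/pacemaker.service' "$1"
}

# Demonstration on a scratch file holding two of the entries quoted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2017-01-04T14:29:38.217687-06:00 saturn kernel: [160693.230387] cgroup: fork rejected by pids controller in /system.slice/pacemaker.service
2017-01-04T14:36:49.590171-06:00 saturn kernel: [ 42.724175] cgroup: fork rejected by pids controller in /system.slice/pacemaker.service
EOF
count_fork_rejections "$sample"

# On a real node (log path is an assumption):
# count_fork_rejections /var/log/messages
```

A non-zero count shortly before the reboots suggests the task limit described in this document is being reached.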

Resolution

This is the result of a feature in SLES 12 SP2 that sets DefaultTasksMax=512, which limits the number of tasks (processes and threads) each systemd unit may run.

https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/#fate-320358

Because the pacemaker cluster may manage many resources and spawn many lrmd processes, this limit can be reached in some environments. As the release notes state, this is a possible limitation of an otherwise good default setting. It should be taken into account and planned for before configuring the cluster.
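To see how close pacemaker is to the limit, the pids controller counters for its cgroup can be read directly. This is only a sketch: the helper name pids_usage is hypothetical, the mount point /sys/fs/cgroup/pids is an assumption about the cgroup layout on the node, and the numbers in the demonstration are made-up placeholder values, not measurements.

```shell
#!/bin/sh
# Hypothetical helper: print current and maximum task counts
# from a pids-controller cgroup directory.
pids_usage() {
    cg="$1"
    printf '%s of %s tasks\n' "$(cat "$cg/pids.current")" "$(cat "$cg/pids.max")"
}

# Demonstration on a scratch directory with placeholder values.
demo=$(mktemp -d)
echo 498 > "$demo/pids.current"
echo 512 > "$demo/pids.max"
pids_usage "$demo"

# On a real node (mount point is an assumption):
# pids_usage /sys/fs/cgroup/pids/system.slice/pacemaker.service
```

If the first number stays close to the second while the cluster is under load, the limit is likely to be hit.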

To alleviate the issue, increase the limit. In

/etc/systemd/system.conf

change the entry

#DefaultTasksMax=512

to

DefaultTasksMax=8192

and activate the setting with

systemctl daemon-reload
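The edit described above can also be scripted. The following is a minimal sketch, assuming GNU sed and that the file contains a DefaultTasksMax line, either commented out or active; the helper name set_default_tasks_max is hypothetical. It is demonstrated on a scratch copy so the result can be reviewed before touching the real file.

```shell
#!/bin/sh
# Hypothetical helper: uncomment or replace the DefaultTasksMax entry
# in the given systemd configuration file (requires GNU sed for -i).
set_default_tasks_max() {
    conf="$1"
    value="$2"
    sed -i -E "s/^#?DefaultTasksMax=.*/DefaultTasksMax=${value}/" "$conf"
}

# Dry run against a scratch file that mimics the default entry.
scratch=$(mktemp)
printf '[Manager]\n#DefaultTasksMax=512\n' > "$scratch"
set_default_tasks_max "$scratch" 8192
grep '^DefaultTasksMax=' "$scratch"

# On a real node, as root, after backing up the file:
# set_default_tasks_max /etc/systemd/system.conf 8192
# systemctl daemon-reload
```

The dry run should leave a single active DefaultTasksMax=8192 line in the scratch file.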

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.