SUSE Support

Hanging processes due to CPU throttling

This document (000021525) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 15 SP2
SUSE Linux Enterprise Server 15 SP3


Situation

A system is hanging: there are unresponsive processes, or the hung task detector triggers warnings because some processes appear to be blocked waiting for a resource such as a mutex while the expected owner of that resource is not running. CPU cgroup throttling is configured on the machine, but not necessarily on the cgroups that contain the hung processes.
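To confirm the symptom, the hung task warnings can be picked out of the kernel log (for example the output of dmesg or journalctl -k). The following is a minimal Python sketch. It assumes the usual "INFO: task <comm>:<pid> blocked for more than <N> seconds." wording of the warning, so check it against the messages on your own system before relying on it. The function name find_hung_tasks and the sample lines are illustrative only.

```python
import re
from typing import Iterable, List, Tuple

# Assumed wording of the warning printed by the hung task detector.
# The task name (comm) may itself contain ':' (e.g. "kworker/0:1"), so the
# greedy group backtracks to the last ':<pid>' that precedes " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (comm, pid, seconds) for every hung task warning in log_lines."""
    hits = []
    for line in log_lines:
        m = HUNG_TASK_RE.search(line)  # search() tolerates a "[ timestamp]" prefix
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits


if __name__ == "__main__":
    # Made-up sample lines; non-matching lines are simply skipped.
    sample = [
        "[ 4920.112233] INFO: task kworker/0:1:123 blocked for more than 120 seconds.",
        "[ 4920.112240] some unrelated message",
        "[ 4920.112250] INFO: task java:4242 blocked for more than 245 seconds.",
    ]
    for comm, pid, secs in find_hung_tasks(sample):
        print(f"{comm} (pid {pid}) blocked for more than {secs} s")
```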

This can occur on systems that configure CPU cgroup throttling, e.g. via systemd's CPUQuota= directive or Kubernetes CPU limits.
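Both mechanisms end up as a CFS quota of CPU time per period. A minimal sketch of the conversion follows, assuming the default period of 100 ms used by both systemd and the kubelet unless configured otherwise. The helper names are hypothetical.

```python
from fractions import Fraction

# Assumed CFS period: 100 ms, the default for systemd and the kubelet.
DEFAULT_PERIOD_US = 100_000


def systemd_cpuquota_to_cfs(cpuquota: str, period_us: int = DEFAULT_PERIOD_US) -> int:
    """Convert a systemd CPUQuota= value such as '20%' or '150%' to a quota in us per period."""
    if not cpuquota.endswith("%"):
        raise ValueError("CPUQuota= is expected to be a percentage, e.g. '20%'")
    percent = Fraction(cpuquota[:-1])  # Fraction avoids float rounding surprises
    return int(percent * period_us / 100)


def k8s_cpu_limit_to_cfs(limit: str, period_us: int = DEFAULT_PERIOD_US) -> int:
    """Convert a Kubernetes CPU limit such as '500m', '2' or '0.5' to a quota in us per period."""
    if limit.endswith("m"):  # millicores
        cores = Fraction(limit[:-1]) / 1000
    else:  # whole or fractional cores
        cores = Fraction(limit)
    return int(cores * period_us)


if __name__ == "__main__":
    print(systemd_cpuquota_to_cfs("20%"), k8s_cpu_limit_to_cfs("200m"))
```

With these assumptions, CPUQuota=20% and a Kubernetes limit of 200m both correspond to 20000 us of CPU time per 100000 us period.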

Further analysis can be done with a debugger by inspecting the state of the runqueues and the throttled lists, as is done during crash dump analysis, or on live systems as described in this commit.
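Before attaching a debugger, it can help to confirm that throttling actually happens by looking at the cpu.stat counters of the throttled cgroup. This is a lighter check suggested here, not part of the procedure referenced above. The sketch assumes the nr_periods and nr_throttled keys plus either throttled_time (nanoseconds, cgroup v1) or throttled_usec (microseconds, cgroup v2). The function names are hypothetical.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttling_summary(stats: Dict[str, int]) -> str:
    """Summarize how often and for how long the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    # cgroup v2 reports throttled_usec (us); cgroup v1 reports throttled_time (ns).
    if "throttled_usec" in stats:
        throttled_ms = stats["throttled_usec"] / 1000
    else:
        throttled_ms = stats.get("throttled_time", 0) / 1_000_000
    ratio = throttled / periods if periods else 0.0
    return f"throttled in {throttled}/{periods} periods ({ratio:.1%}), {throttled_ms:.1f} ms total"


if __name__ == "__main__":
    # Made-up cgroup v1 sample; on a live system read the cgroup's cpu.stat file instead.
    print(throttling_summary(parse_cpu_stat("nr_periods 200\nnr_throttled 50\nthrottled_time 3000000000\n")))
```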

Resolution

  1. Upgrade to SUSE Linux Enterprise Server 15 SP4 or newer.
  2. Reconsider the CPU throttling configuration: a quota <= 5% * nr_cpus is susceptible to starvation on affected kernels (nr_cpus is the number of CPUs in the system, or in the cpuset to which the throttled cgroup is restricted).
  3. Use cpusets to restrict CPU consumption. In particular, restricting the cgroup to a single CPU eliminates the throttling unfairness in principle.
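Item 2 can be checked mechanically. The sketch below reads the quota as a fraction of one CPU (quota_us / period_us, e.g. cpu.cfs_quota_us / cpu.cfs_period_us on cgroup v1) and compares it against 5% * nr_cpus. That interpretation, the helper names and the treatment of -1 as "no limit" are assumptions to verify against your setup. count_cpus parses the list format used by cpuset.cpus.

```python
def count_cpus(cpuset_list: str) -> int:
    """Count the CPUs in a cpuset list such as '0-3,8,10-11'."""
    count = 0
    for part in cpuset_list.strip().split(","):
        if not part:
            continue
        if "-" in part:  # a range like '10-11' is inclusive
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count


def is_susceptible(quota_us: int, period_us: int, nr_cpus: int) -> bool:
    """Rule of thumb from item 2: quota/period <= 5% * nr_cpus (quota as a fraction of one CPU)."""
    if quota_us < 0:  # assumption: -1, as in cpu.cfs_quota_us, means no limit
        return False
    # Integer form of quota_us / period_us <= 0.05 * nr_cpus.
    return quota_us * 100 <= 5 * nr_cpus * period_us


if __name__ == "__main__":
    print(is_susceptible(20_000, 100_000, count_cpus("0-7")))
```

For example, CPUQuota=20% (20000 us per 100000 us) is flagged on an 8-CPU cpuset (20% <= 40%) but not on a 2-CPU cpuset (20% > 10%).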

Cause

In general, CPU throttling can affect processes that are executing kernel code while exclusively holding a system resource, such as a mutex. Another process that is not throttled itself may then have to wait for that resource, so its execution is indirectly affected by CPU throttling.
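The indirect effect can be illustrated with a user-space analogy in Python: one thread takes a lock and then stalls (time.sleep stands in for the owner being throttled while holding the resource), and a second thread, which is never slowed down itself, still has to wait for the whole stall. This is only an analogy; it involves neither cgroups nor the kernel scheduler, and the function name demo is hypothetical.

```python
import threading
import time


def demo(stall_s: float = 0.3) -> float:
    """Return how long an unthrottled thread waited for a lock held by a stalled thread."""
    resource = threading.Lock()  # stands in for a kernel mutex
    owner_has_lock = threading.Event()
    result = {}

    def throttled_owner() -> None:
        with resource:
            owner_has_lock.set()
            # The owner holds the lock but makes no progress, as if its cgroup
            # had exhausted its quota; sleep() models that stall.
            time.sleep(stall_s)

    def unthrottled_waiter() -> None:
        owner_has_lock.wait()  # make sure the owner took the lock first
        start = time.monotonic()
        with resource:  # blocks until the owner "runs" again and releases the lock
            result["waited_s"] = time.monotonic() - start

    owner = threading.Thread(target=throttled_owner)
    waiter = threading.Thread(target=unthrottled_waiter)
    owner.start()
    waiter.start()
    owner.join()
    waiter.join()
    return result["waited_s"]


if __name__ == "__main__":
    print(f"waiter blocked for {demo():.2f} s although it was never stalled itself")
```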

In particular, the implementation of CPU throttling has historically not been entirely fair when a workload runs on multiple CPUs: the quota is distributed among the CPUs in slices (which are smaller than the quota itself), and under certain conditions some CPUs may not receive any portion of the quota, effectively preempting anything that is supposed to run on such a CPU within a CPU-restricted cgroup. The scheduler behavior was eventually fixed in newer SUSE Linux kernels (SUSE Linux Enterprise Server 15 SP4 and newer), but the changes are too intrusive to be backported to older kernels.
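The slice mechanism can be sketched with a toy model. Its assumptions are all simplifications of the real scheduler: every CPU with runnable tasks in the cgroup asks the cgroup-wide pool for one slice per period, in a fixed order; the slice is 5 ms (the usual default of /proc/sys/kernel/sched_cfs_bandwidth_slice_us); the period is 100 ms. Under these assumptions a quota of 5% * nr_cpus is exactly one slice per CPU, which is one plausible reading of the threshold in the Resolution section. Below it, some CPUs necessarily get nothing in a given period. The function name distribute_quota is hypothetical.

```python
from typing import List

SLICE_US = 5_000  # assumed default of /proc/sys/kernel/sched_cfs_bandwidth_slice_us


def distribute_quota(quota_us: int, nr_cpus: int, slice_us: int = SLICE_US) -> List[int]:
    """Hand out runtime from a cgroup's quota pool, one slice per CPU request.

    Toy model: every CPU has runnable tasks and asks exactly once per period,
    in CPU order. Returns the runtime (us) granted to each CPU.
    """
    remaining = quota_us
    granted = []
    for _ in range(nr_cpus):
        grant = min(slice_us, remaining)  # a CPU asking an empty pool gets 0
        granted.append(grant)
        remaining -= grant
    return granted


if __name__ == "__main__":
    # 20% quota (20 ms per 100 ms period) on 8 CPUs: only 4 CPUs get a slice.
    print(distribute_quota(20_000, 8))  # [5000, 5000, 5000, 5000, 0, 0, 0, 0]
```

In the real scheduler, CPUs that received a slice can also return unused runtime and request more, so the actual distribution varies; the model only shows why a small quota cannot cover every CPU at once.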

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000021525
  • Creation Date: 09-Aug-2024
  • Modified Date: 28-Aug-2024
    • SUSE Linux Enterprise Server

For questions or concerns about the SUSE Knowledgebase, please contact: tidfeedback[at]suse.com
