CPU Isolation – Nohz_full – by SUSE Labs (part 3)


This blog post is the third in a technical series by the SUSE Labs team exploring Kernel CPU Isolation along with one of its core components: Full Dynticks (or Nohz Full). Here is the list of the articles in the series so far:

  1. CPU Isolation – Introduction
  2. CPU Isolation – Full dynticks internals
  3. CPU Isolation – Nohz_full
  4. CPU Isolation – Housekeeping and tradeoffs
  5. CPU Isolation – A practical example
  6. CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource

 

Undisturbed

Now that we have drowned ourselves in theory and full dynticks internals, it’s time to dive into the feature in practice.

NOHZ_FULL

The “nohz_full=” kernel boot parameter is the current main interface to configure full dynticks along with CPU Isolation.

A cpu-list argument is passed to define the set of CPUs to isolate. Assuming you have 8 CPUs for example and you want to isolate CPUs 4, 5, 6, 7:

nohz_full=4-7

Some more details on how to format a cpu-list can be found here.
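To make the cpu-list notation concrete, here is a minimal Python sketch that expands a simple list such as "4-7" into CPU numbers. The function names are made up for illustration, and only plain numbers and ranges are handled (not the stride syntax). The second helper assumes your kernel exposes the configured set in /sys/devices/system/cpu/nohz_full, which may not be the case on older kernels.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a simple cpu-list ("4-7", "0,2-3") into a set of CPU numbers.

    Only single numbers and plain ranges are supported in this sketch.
    """
    cpus = set()
    for chunk in text.strip().split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def read_nohz_full(path="/sys/devices/system/cpu/nohz_full"):
    """Return the nohz_full CPUs reported by sysfs, or an empty set.

    Assumption: the file exists and contains a cpu-list; anything else
    (missing file, unparsable content) is treated as "nothing isolated".
    """
    p = Path(path)
    if not p.exists():
        return set()
    try:
        return parse_cpu_list(p.read_text())
    except ValueError:
        return set()


print(sorted(parse_cpu_list("4-7")))  # [4, 5, 6, 7]
```

After booting with nohz_full=4-7, read_nohz_full() should then report the same four CPUs if the sysfs file is available.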

What does nohz_full do exactly

When a CPU is included in the cpu-list of the nohz_full boot parameter, the kernel tries to move as much kernel noise as it can away from that CPU. In the previous article we explained what can and needs to be done, in theory, in order to shut down the timer tick. Here is what is actually performed:

The timer tick

The timer tick is stopped whenever possible, assuming the conditions described in the previous article are met.

A residual 1 Hz tick (an interrupt every second) remains in order to maintain scheduler internal statistics. It used to execute on the isolated CPUs but nowadays this event is offloaded to the CPUs outside the nohz_full range using an unbound workqueue. This means that a clean setup can afford to run 100% tick-free on a CPU.
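A rough way to check whether a CPU really runs tick-free is to sample its local timer interrupt counter twice and compare. The sketch below is only an illustration: it assumes an x86-style /proc/interrupts with a "LOC:" row (other architectures name the timer row differently), and the function names are hypothetical.

```python
import time


def local_timer_counts(interrupts_text):
    """Map CPU number -> local timer interrupt count from /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    # The header row lists the online CPUs: "CPU0 CPU1 ..."
    cpu_ids = [int(name[3:]) for name in lines[0].split()]
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = [int(v) for v in fields[1:1 + len(cpu_ids)]]
            return dict(zip(cpu_ids, counts))
    raise ValueError("no LOC: row found (assumption: x86-style /proc/interrupts)")


def ticks_between(before_text, after_text, cpu):
    """Number of local timer interrupts seen on `cpu` between two snapshots."""
    return local_timer_counts(after_text)[cpu] - local_timer_counts(before_text)[cpu]


def measure(cpu, seconds=10.0, path="/proc/interrupts"):
    """Sample the file twice, `seconds` apart, and return the tick delta."""
    with open(path) as f:
        before = f.read()
    time.sleep(seconds)
    with open(path) as f:
        after = f.read()
    return ticks_between(before, after, cpu)
```

On a housekeeping CPU, measure() should grow with the tick frequency. On a properly isolated CPU running a single busy user task, the delta should stay close to zero.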

Timer callbacks

The execution of unbound timer callbacks is moved to any CPU outside the nohz_full range, so they won’t trigger timer ticks in the wrong place just to serve them. Pinned timer callbacks, however, can’t be moved elsewhere. We’ll see later how to cope with them.

Workqueues and other kernel threads

In a similar fashion to the timer callbacks, unbound kernel workqueues and kthreads are moved to any CPU outside the nohz_full range. But pinned workqueues and kthreads can’t be moved elsewhere. Again we’ll see later how to cope with them.

RCU

Most of the RCU processing is offloaded to the CPUs outside the isolated range. The CPUs set as nohz_full run in NOCB mode, which means the RCU callbacks queued on these CPUs are executed from unbound kthreads running on non-isolated CPUs. There is no need to pass the “rcu_nocbs=” kernel parameter, as this is automatically taken care of when passing the “nohz_full=” parameter.

The CPU also doesn’t need to actively report quiescent states through the tick because it enters an RCU extended quiescent state upon return to userspace (see the previous article at “3.2 RCU quiescent states reporting”).

Cputime accounting

The CPU switches to full dynticks cputime accounting (see the previous article at “3.1 Cputime accounting”) so that it doesn’t rely on a periodic event anymore.

 

Other isolation settings

Even though nohz_full is a significant part of the whole isolation setup, you’ll need to take care of other details separately. Two of them are especially important:

User tasks affinity

If you wish to run a task undisturbed, you may not want other threads or processes to share the CPU with it. Besides, full dynticks only works when a single task runs on the CPU. It is therefore necessary to:

  • Affine each of your isolated tasks to one CPU within the range of nohz_full. There must be only one isolated task per CPU.
  • Affine all other tasks outside the nohz_full range.

There are several ways to affine your tasks to a set of CPUs, from the low level sched_setaffinity() API to tools like taskset. Powerful interfaces such as cpusets are also recommended.
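As an illustration of the low level route, Python exposes sched_setaffinity() through its os module on Linux. The sketch below uses hypothetical helper names and assumes the 8 CPU layout from the example above (CPUs 4-7 isolated). It is not a complete setup: moving every other task out of the isolated range is usually easier with cpusets.

```python
import os


def housekeeping_cpus(all_cpus, isolated):
    """CPUs left for everything that is not an isolated task."""
    return set(all_cpus) - set(isolated)


def pin_task(cpu, pid=0):
    """Restrict a task to a single CPU and return its resulting affinity.

    pid=0 targets the caller. Note that the underlying system call works
    per thread, so other threads of the process must be pinned separately.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


# Assumed layout: 8 CPUs, nohz_full=4-7
print(sorted(housekeeping_cpus(range(8), {4, 5, 6, 7})))  # [0, 1, 2, 3]
```

An isolated task would then call pin_task(4), for example, at startup, with no other task being affined to CPU 4.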

 

IRQs affinity

Hardware IRQs (other than the timer and some other specific interrupts) may run on any CPU and disturb your isolated set. The resulting noise is not just about interrupts stealing CPU time and thrashing the CPU cache: IRQs may also launch further asynchronous work on the CPU, such as softirqs, timers or workqueues. So it is usually a good idea to affine the IRQs to the CPUs outside the range of nohz_full. This affinity can be overridden through the file:

/proc/irq/$IRQ/smp_affinity

with $IRQ being the IRQ number. More details can be found in the kernel documentation.
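The smp_affinity file expects a hexadecimal CPU bitmask, which is easy to get wrong by hand. Here is a small sketch that builds the mask from a set of CPU numbers, with comma-separated 32-bit words for larger systems. The helper names are hypothetical, writing the file requires root, and the kernel may refuse the write for IRQs that can't be moved.

```python
def cpus_to_affinity_mask(cpus):
    """Build the hex bitmask string for a set of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    # Split into 32-bit words, least significant first
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    words.reverse()
    # Most significant word unpadded, the following ones zero-padded to 8 digits
    return ",".join([format(words[0], "x")] + [format(w, "08x") for w in words[1:]])


def set_irq_affinity(irq, cpus):
    """Write the mask to /proc/irq/<irq>/smp_affinity (needs root, may fail)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpus_to_affinity_mask(cpus))


# Housekeeping CPUs 0-3 on the 8 CPU example with nohz_full=4-7
print(cpus_to_affinity_mask({0, 1, 2, 3}))  # f
```

Looping over the entries of /proc/irq/ and calling set_irq_affinity() with the housekeeping CPUs would cover the IRQs that accept a new affinity.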

 

All these CPU isolation settings, however, involve tradeoffs and pitfalls that we’ll explore in the next article.


Comments

  • skyryu says:

    Hi Frederic,
    I still get one tick per second after setting isolcpus=1,2 nohz_full=1,2 in GRUB_CMDLINE_LINUX and mapping the IRQs to CPUs 0 and 3 in smp_affinity_list. I expected it to be offloaded to the CPUs outside the nohz_full range, as you mentioned. Does it only work on the SUSE Enterprise Server version?

    My version is openSUSE Leap 15.3, Linux is 5.3.18-59.27-default.
    Could you please help check what might be wrong?
    Thanks so much, your article is really inspiring!

    Thanks,
    skyryu

    Frederic Weisbecker Linux Kernel Engineer at SUSE Labs.