CPU Isolation – Nohz_full – by SUSE Labs (part 3)
This blog post is the third in a technical series by the SUSE Labs team exploring Kernel CPU Isolation along with one of its core components: Full Dynticks (or Nohz Full). Here is the list of the articles in the series so far:
- CPU Isolation – Introduction
- CPU Isolation – Full dynticks internals
- CPU Isolation – Nohz_full
- CPU Isolation – Housekeeping and tradeoffs
- CPU Isolation – A practical example
- CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource
Now that we have drowned ourselves in theory and full dynticks internals, it’s time to dive into the feature in practice.
NOHZ_FULL
The “nohz_full=” kernel boot parameter is currently the main interface to configure full dynticks along with CPU Isolation.
A cpu-list argument is passed to define the set of CPUs to isolate. Assuming you have 8 CPUs for example and you want to isolate CPUs 4, 5, 6, 7:
nohz_full=4-7
Some more details on how to format a cpu-list can be found here.
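As a rough illustration of the cpu-list format, here is a minimal Python sketch that expands a list such as "4-7" into individual CPU numbers. It only handles comma-separated numbers and simple ranges, not the full syntax the kernel accepts. The helper names are made up for this example, and the sketch assumes the running kernel exposes its nohz_full set in the sysfs file /sys/devices/system/cpu/nohz_full.

```python
def parse_cpu_list(text):
    """Expand a simple cpu-list such as "0,4-7" into a sorted list of CPU numbers.

    Only comma-separated numbers and "a-b" ranges are handled; any other
    syntax accepted by the kernel is deliberately left out of this sketch.
    """
    cpus = set()
    for chunk in text.strip().split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            first, last = (int(n) for n in chunk.split("-", 1))
            if first > last:
                raise ValueError(f"bad range: {chunk!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(chunk))
    return sorted(cpus)


def read_nohz_full(path="/sys/devices/system/cpu/nohz_full"):
    """Return the CPUs reported as nohz_full, or [] if the file is missing or unparsable."""
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except (OSError, ValueError):
        return []
```

After booting with nohz_full=4-7, calling read_nohz_full() would be expected to return the CPUs 4 to 7, which is a quick way to confirm that the boot parameter was taken into account.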
What does nohz_full do exactly
When a CPU is included in the cpu-list of the nohz_full boot parameter, the kernel tries to move as much kernel noise as it can away from that CPU. The previous article explained what can and needs to be done in theory in order to shut down the timer tick; here is what is eventually performed:
The timer tick
The timer tick is stopped whenever possible, assuming some conditions are met:
- The task that runs on the CPU can’t be preempted by another. This means you can’t have more than one task with the following policies: SCHED_OTHER, SCHED_BATCH, SCHED_IDLE. The same applies to SCHED_RR if the highest priority is shared by two or more tasks. The least error-prone setting is to run a single task on an isolated CPU.
- The task doesn’t use posix-cpu-timers.
- The task doesn’t use perf events.
- If you run on x86, your machine must have a reliable timestamp counter (TSC). We’ll describe that later.
A residual 1 Hz tick (an interrupt every second) remains in order to maintain scheduler internal statistics. It used to execute on the isolated CPUs but nowadays this event is offloaded to the CPUs outside the nohz_full range using an unbound workqueue. This means that a clean setup can afford to run 100% tick-free on a CPU.
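One rough way to see whether the tick is really stopped on a CPU is to watch its local timer interrupt count. The Python sketch below assumes an x86 machine, where /proc/interrupts has a "LOC:" line for local timer interrupts; the function names are invented for this example. That counter includes every local timer interrupt, not only the periodic tick, so treat the result as an indication rather than an exact measurement.

```python
import time


def parse_local_timer_counts(interrupts_text):
    """Map CPU number -> local timer interrupt count from /proc/interrupts content."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    # The header line names the online CPUs: "CPU0 CPU1 ..."
    cpus = [int(name[3:]) for name in lines[0].split() if name.startswith("CPU")]
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = fields[1:1 + len(cpus)]
            return {cpu: int(n) for cpu, n in zip(cpus, counts)}
    return {}


def timer_interrupts_per_second(cpu, interval=5.0, path="/proc/interrupts"):
    """Sample the LOC counter of one CPU twice and return the rate over the interval."""
    def snapshot():
        with open(path) as f:
            return parse_local_timer_counts(f.read())

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return (after[cpu] - before[cpu]) / interval
```

Run the measurement from a housekeeping CPU while the isolated task is busy in userspace. A rate close to zero on the isolated CPU suggests the tick is stopped, while a rate close to the kernel's HZ value suggests that one of the conditions above is not met.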
Timer callbacks
The execution of unbound timer callbacks is moved to any CPU outside the nohz_full range, so they won’t trigger timer ticks in the wrong place to serve them. Meanwhile pinned timer callbacks can’t be moved elsewhere. We’ll see later how to cope with them.
Workqueues and other kernel threads
In a similar fashion to the timer callbacks, unbound kernel workqueues and kthreads are moved to any CPU outside the nohz_full range. But pinned workqueues and kthreads can’t be moved elsewhere. Again we’ll see later how to cope with them.
RCU
Most of RCU processing is offloaded to the CPUs outside the isolated range. The CPUs set as nohz_full run in NOCB mode, which means the RCU callbacks queued on these CPUs are executed from unbound kthreads running on non-isolated CPUs. There is no need to pass the “rcu_nocbs=” kernel parameter, as that is taken care of automatically when the “nohz_full=” parameter is passed.
The CPU also doesn’t need to actively report quiescent states through the tick because it enters an RCU extended quiescent state upon return to userspace (see previous article at “3.2 RCU quiescent states reporting”).
Cputime accounting
The CPU switches to full dynticks cputime accounting (see previous article at “3.1 Cputime accounting”) so that it doesn’t rely on a periodic event anymore.
Other isolation settings
Even though nohz_full is a significant part of the whole isolation setup, you’ll need to take care of other details separately, two of which are particularly significant:
User tasks affinity
If you wish to run a task undisturbed, you may not want other threads or processes to share the CPU with it. And full dynticks only works on single tasks in the end. It is therefore necessary to:
- Affine each of your isolated tasks to one CPU within the range of nohz_full. There must be only one isolated task per CPU.
- Affine all other tasks outside the nohz_full range.
There are several ways to affine your tasks to a set of CPUs, from the low level sched_setaffinity() API to tools like taskset. Powerful interfaces such as cpusets are also recommended.
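As a small illustration of the low-level route, the following Python sketch uses os.sched_setaffinity(), the standard library wrapper around sched_setaffinity() available on Linux. The helper names and the example CPU numbers are assumptions made for this sketch, not part of any existing tool.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict a task (pid 0 means the caller) to a single CPU and return its new mask."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


def housekeeping_cpus(isolated, total_cpus):
    """Return the CPUs left outside the isolated set, where all other tasks should run."""
    return set(range(total_cpus)) - set(isolated)
```

With nohz_full=4-7 on an 8 CPU machine, the isolated workload could call pin_to_cpu(4) before entering its main loop, while housekeeping_cpus({4, 5, 6, 7}, 8) gives the set {0, 1, 2, 3} to which the remaining tasks can be affined, for example with os.sched_setaffinity(pid, {0, 1, 2, 3}). Changing the affinity of tasks owned by other users requires the appropriate privileges.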
IRQs affinity
Hardware IRQs (other than the timer and some other specific interrupts) may run on any CPU and disturb your isolated set. The resulting noise is not only about interrupts stealing CPU time and trashing the CPU cache: IRQs may also launch further asynchronous work on the CPU, such as softirqs, timers or workqueues. So it is usually a good idea to affine the IRQs to the CPUs outside the range of nohz_full. This affinity can be overridden through the file:
/proc/irq/$IRQ/smp_affinity
with $IRQ being the IRQ number. More details can be found in the kernel documentation.
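The smp_affinity file expects a hexadecimal CPU bitmask, split into comma-separated 32-bit groups on large machines. The Python sketch below shows one way to build that mask from a set of housekeeping CPUs and write it. The function names and the proc_root parameter are hypothetical and only exist for this example; the write needs root privileges and may be rejected for interrupts that the kernel does not allow to be moved.

```python
import os


def cpus_to_smp_affinity(cpus):
    """Build the hex bitmask string expected by /proc/irq/$IRQ/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    hexstr = f"{mask:x}"
    # Split into groups of 8 hex digits (32 bits), most significant group first.
    groups = []
    while hexstr:
        groups.append(hexstr[-8:])
        hexstr = hexstr[:-8]
    return ",".join(reversed(groups))


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask for `cpus` to the smp_affinity file of one IRQ (requires root)."""
    path = os.path.join(proc_root, "irq", str(irq), "smp_affinity")
    with open(path, "w") as f:
        f.write(cpus_to_smp_affinity(cpus))
    return path
```

For a machine booted with nohz_full=4-7, cpus_to_smp_affinity({0, 1, 2, 3}) gives the mask "f", which keeps an interrupt on CPUs 0 to 3.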
All these CPU isolation settings, though, involve tradeoffs and pitfalls that we’ll explore in the next article.
Comments
Hi Frederic,
I still get one tick per second after setting isolcpus=1,2 nohz_full=1,2 in GRUB_CMDLINE_LINUX and mapping IRQs to CPUs 0 and 3 in smp_affinity_list. I expected it to be offloaded to the CPUs outside the nohz_full range, as you mentioned. Does it only work on the SUSE enterprise server version?
My version is openSUSE Leap 15.3, and the kernel is 5.3.18-59.27-default.
Could you please help check what might be wrong?
Thanks so much, your article is really inspiring!
Thanks,
skyryu