TSC Clocksource Switching to HPET During High I/O Load

This document (000021035) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 12 SP5
SUSE Linux Enterprise Server for SAP Applications 15 SP3
SUSE Linux Enterprise Server for SAP Applications 15 SP4
SUSE Linux Enterprise Server for SAP Applications 15 SP5
SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 15 SP3
SUSE Linux Enterprise Server 15 SP4
SUSE Linux Enterprise Server 15 SP5

Situation

There have been several reports of the TSC clocksource switching to HPET under high load. The use of HPET as the primary clocksource has been observed to negatively impact the performance of numerous workloads that rely on accurate Time Stamp Counter (TSC) timing information, such as SAP HANA systems, as referenced on SAP Note 2753418 . Based on available information, the switch to HPET may be caused by certain CPUs becoming stuck in the sbitmap_queue_clear() and __sbq_wake_up() functions. The problem has been detected on machines that have a significant number of CPUs and multiple NUMA (Non-Uniform Memory Access) nodes. For example, it has been observed on a 448 CPU system that has 8 NUMA nodes, as well as on a 192 CPU machine with 4 NUMA nodes, which did not meet the criteria for disabling the clocksource watchdog (applicable only to systems with 2 NUMA nodes or fewer). The kernel log pattern may appear as follows:

A workqueue lockup appeared first, which has been attributed to a worker becoming unresponsive in the blk_mq_run_work_fn() function:

BUG: workqueue lockup - pool cpus=436 node=7 flags=0x0 nice=0 stuck for 42s!
Showing busy workqueues and worker pools:
workqueue kblockd: flags=0x18
  pwq 873: cpus=436 node=7 flags=0x0 nice=-20 active=2/256 refcnt=3
    in-flight: 3743:blk_mq_run_work_fn
    pending: blk_mq_run_work_fn

Followed by a RCU stall, with a warning message indicating that CPUs 6 and 266 had become stuck in sbitmap_queue_clear() and __sbq_wake_up() functions while handling a block layer softirq:

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   199-....: (1 GPs behind) idle=01a/1/0x4000000000000004 softirq=173152/173153 fqs=4912
      (t=15000 jiffies g=4183933 q=26243)
 Sending NMI from CPU 199 to CPUs 6:
 NMI backtrace for cpu 6
 CPU: 6 PID: 0 Comm: swapper/6 Tainted: G X 5.3.18-150300.59.101-default #1 SLE15-SP3
 RIP: 0010:__sbq_wake_up+0x64/0xd0
 Call Trace:
  <IRQ>
  sbitmap_queue_clear+0x3b/0x70
  __blk_mq_free_request+0x7f/0xb0
  scsi_end_request+0x9d/0x120 [scsi_mod]
  scsi_io_completion+0x85/0x4f0 [scsi_mod]
  blk_done_softirq+0x8f/0xc0
  __do_softirq+0xe3/0x2da
  irq_exit+0xd5/0xe0
  call_function_single_interrupt+0xf/0x20
 ... ... ...
 NMI backtrace for cpu 266
 CPU: 266 PID: 0 Comm: swapper/266 Tainted: G X 5.3.18-150300.59.101-default #1 SLE15-SP3
 RIP: 0010:__sbq_wake_up+0x27/0xd0
 Call Trace:
  <IRQ>
  sbitmap_queue_clear+0x3b/0x70
  __blk_mq_free_request+0x7f/0xb0
  scsi_end_request+0x9d/0x120 [scsi_mod]
  scsi_io_completion+0x85/0x4f0 [scsi_mod]
  blk_done_softirq+0x8f/0xc0
  __do_softirq+0xe3/0x2da
  irq_exit+0xd5/0xe0
  call_function_single_interrupt+0xf/0x20
  </IRQ>

The worker remained blocked in blk_mq_run_work_fn():

 workqueue kblockd: flags=0x18
   pwq 873: cpus=436 node=7 flags=0x0 nice=-20 active=2/256 refcnt=3
     in-flight: 3743:blk_mq_run_work_fn
     pending: blk_mq_run_work_fn
 pool 873: cpus=436 node=7 flags=0x0 nice=-20 hung=80s workers=2 idle: 3006

Eventually, the clocksource became unstable due to the skew being too large:

 clocksource: timekeeping watchdog on CPU301: Marking clocksource 'tsc' as unstable because the skew is too large:
 clocksource:     'hpet' wd_now: a5dab66a wd_last: ea6da5ee mask: ffffffff
 clocksource:     'tsc' cs_now: 2a61230e401d7e cs_last: 2a60d0e23da03e mask: ffffffffffffffff
 tsc: Marking TSC unstable due to clocksource watchdog 
 TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 sched_clock: Marking unstable (42172076095957, -113650410)<-(42172793736635, -831659344)
 clocksource: Switched to clocksource hpet

The above warning message is generated when the skew between the periods measured by HPET and TSC exceeds 0.0625 seconds. The counter values for HPET show that HPET overflowed and the period elapsed since the last check was 131.020 seconds:

 0xffffffff + 1 + 0xa5dab66a - 0xea6da5ee = 3144487036 
 3144487036 / (24 * 10^6) = 131.020293166

The counter values for TSC show that the period elapsed since the last check was 131.020 seconds, (cs_now - cs_last) / (tsc_mhz * 10^6):

 0x2a61230e401d7e - 0x2a60d0e23da03e = 352925678912
 352925678912 / (2693.671 * 10^6) = 131.020335784

A NIC transmit queue timeout was also observed:

NETDEV WATCHDOG: eth4 (mlx5_core): transmit queue 35 timed out

Resolution

To address this defect, the solution is to verify that the wait_cnt value is positive before updating the queue index of the bitmap. If the wait_cnt value is not positive, the index update should not be executed. The TSC clocksource switches triggered by CPUs getting stuck in sbitmap_queue_clear() and __sbq_wake_up() affect the following kernels and Service Packs:

SUSE version	Affected Kernels	Not Affected Kernels
SLES12 SP5	4.12.14-122.144.1 4.12.14-122.147.1	4.12.14-122.139.1 and older 4.12.14-122.150.1 and newer
SLES15 SP3	5.3.18-150300.59.101.1 5.3.18-150300.59.106.1	5.3.18-150300.59.98.1 and older 5.3.18-150300.59.109.1 and newer
SLES15 SP4	5.14.21-150400.24.33.2 5.14.21-150400.24.38.1	5.14.21-150400.24.28.1 and older 5.14.21-150400.24.41.1 and newer
SLES15 SP5	5.14.21-150500.55.39.1 and older	5.14.21-150500.55.49.1 and newer

In regards to SLES 15 SP2, it should be noted that the commits which trigger TSC clocksource switches to HPET due to CPUs becoming stuck in sbitmap_queue_clear() and __sbq_wake_up() have not been backported to this version. As such, SLES 15 SP2 remains unaffected by the TSC switches to HPET resulting from CPUs becoming stuck in sbitmap_queue_clear() and __sbq_wake_up(). An unstable TSC clocksource switching to HPET can result from various events and condition. The currently active clocksource being marked as unstable is not an issue with a single root cause and sometimes it is what should happen because the clocksource is indeed unstable. The root causes outlined in this TID illustrate the importance of the specific details and circumstances surrounding the issue.

If a similar issue occurs on SLES15 SP2 or other Service Packs not listed in this TID, please generate a supportconfig and contact SUSE Support for further investigation. In SLES 15 SP2, a temporary workaround involves disabling the clocksource watchdog, which may generate latency spikes. To disable the clocksource watchdog, add the kernel parameter "tsc=nowatchdog" during boot time. Set the parameter as a GRUB option by appending it to the GRUB_CMDLINE_LINUX_DEFAULT line in the /etc/default/grub file, afterwards regenerate GRUB and reboot the system:

# vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... tsc=nowatchdog"

# grub2-mkconfig -o /boot/grub2/grub.cfg

Cause

The root cause of the issue involves the block layer in the Linux kernel. CPUs may get stuck in sbitmap_queue_clear() and __sbq_wake_up(). A race can happen whenever one I/O completion interrupts another I/O completion currently being processed, leading to an overflow of wait_cnt and setting the counter to a negative value. This block layer defect causes soft lockups and may also cause workqueue lockups, stalls in RCU processing, hard lockups, timeouts of the transmit queues of NICs and clocksource watchdog failures. A clocksource watchdog failure will result in the current clocksource being marked as unstable, forcing a switch to an alternative clocksource. On x86, a clocksource watchdog failure usually triggers a switch to the HPET clocksource after the TSC clocksource is marked unstable. In case of soft lockups, the block layer issue may only be transient and the system recovers from it. However, the switch to HPET has a permanent negative performance impact, as long as HPET is used as the main clocksource.

Status

Reported to Engineering

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

Document ID:000021035
Creation Date: 04-Apr-2023
Modified Date:31-Jul-2024
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications

< Back to Support Search

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com