Kernel hard LOCKUP and rcu_sched stalls caused by a slow responding serial console interface
This document (000020516) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12
Situation
[ 5653.173525] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
[ 5653.173539] RIP: 0010:native_queued_spin_lock_slowpath+0x65/0x1e0
[ 5653.173542] Call Trace:
[ 5653.173542] _raw_spin_lock_irqsave+0x30/0x40
[ 5653.173542] force_qs_rnp+0xa8/0x160
[ 5653.173542] rcu_gp_kthread+0x7b2/0xfa0
[ 5653.173542] ? __schedule+0x305/0x750
[ 5653.173542] ? rcu_barrier+0x590/0x590
[ 5653.173543] kthread+0x10d/0x130
[ 5653.173543] ? kthread_park+0xa0/0xa0
[ 5653.173543] ret_from_fork+0x35/0x40
The current stack of CPU#10, running rcu_sched with PID 12:
PID: 12 TASK: ffff967981712680 CPU: 10 COMMAND: "rcu_sched"
#0 [fffffe00002079c8] machine_kexec at ffffffffa3e6d861
#1 [fffffe0000207a20] __crash_kexec at ffffffffa3f53ca5
#2 [fffffe0000207ae8] panic at ffffffffa3e9b1e7
#3 [fffffe0000207b70] nmi_panic at ffffffffa3e9ae45
#4 [fffffe0000207b78] watchdog_overflow_callback at ffffffffa3f89fb4
#5 [fffffe0000207b88] __perf_event_overflow at ffffffffa40041c2
#6 [fffffe0000207bb8] handle_pmi_common at ffffffffa3e0efa6
#7 [fffffe0000207e08] intel_pmu_handle_irq at ffffffffa3e0f22f
#8 [fffffe0000207e48] perf_event_nmi_handler at ffffffffa3e0746d
#9 [fffffe0000207e60] nmi_handle at ffffffffa3e36043
#10 [fffffe0000207eb8] default_do_nmi at ffffffffa3e364be
#11 [fffffe0000207ed8] do_nmi at ffffffffa3e366d3
#12 [fffffe0000207ef0] end_repeat_nmi at ffffffffa480164b
[exception RIP: native_queued_spin_lock_slowpath+101]
RIP: ffffffffa3ef8345 RSP: ffffa4e94c51be40 RFLAGS: 00000002
RAX: 0000000000000101 RBX: 0000000000000246 RCX: ffffffffa52c21a0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa5279600
RBP: ffffffffa5279600 R8: 0000000000000001 R9: 0000000000000002
R10: ffffa4e94c57bdc8 R11: 0000000000000006 R12: 0000000000000000
R13: ffffffffa3f1c560 R14: 000000000002d7c0 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffffa4e94c51be40] native_queued_spin_lock_slowpath at ffffffffa3ef8345
#14 [ffffa4e94c51be40] _raw_spin_lock_irqsave at ffffffffa46a5ea0
#15 [ffffa4e94c51be50] force_qs_rnp at ffffffffa3f1ccc8
#16 [ffffa4e94c51be90] rcu_gp_kthread at ffffffffa3f21322
#17 [ffffa4e94c51bf10] kthread at ffffffffa3ec06ed
#18 [ffffa4e94c51bf50] ret_from_fork at ffffffffa4800215
All CPU are running idle, except for CPU 56,52,10, which also have expired timers:
crash> timer -r -C 10,56,52
CPU: 10 HRTIMER_CPU_BASE: ffff96987f49ff80
CLOCK: 0 HRTIMER_CLOCK_BASE: ffff96987f49ffc0 [ktime_get]
CURRENT
5650408000000
SOFTEXPIRES EXPIRES TTE HRTIMER FUNCTION
5626674278491 5626674278491 -23733721509 ffff96987f4a04e0 ffffffffa3f42bf0 <tick_sched_timer>
5628834278491 5628834278491 -21573721509 ffff96987f4a0740 ffffffffa3f89540 <watchdog_timer_fn>
CPU: 52 HRTIMER_CPU_BASE: ffff96987fa9ff80
CLOCK: 0 HRTIMER_CLOCK_BASE: ffff96987fa9ffc0 [ktime_get]
CURRENT
5650408000000
SOFTEXPIRES EXPIRES TTE HRTIMER FUNCTION
5649262278491 5649262278491 -1145721509 ffff96987faa04e0 ffffffffa3f42bf0 <tick_sched_timer>
CPU: 56 HRTIMER_CPU_BASE: ffff96b8ff91ff80
CLOCK: 0 HRTIMER_CLOCK_BASE: ffff96b8ff91ffc0 [ktime_get]
CURRENT
5650408000000
SOFTEXPIRES EXPIRES TTE HRTIMER FUNCTION
5629282278491 5629282278491 -21125721509 ffff96b8ff920740 ffffffffa3f89540 <watchdog_timer_fn>
The referenced spinlock seems to be only used by PID#12:
crash> qspinlock ffffffffa5279600
struct qspinlock {
{
val = {
counter = 0x101
},
{
locked = 0x1,
pending = 0x1
},
{
locked_pending = 0x101,
tail = 0x0
}
}
}
While CPU#52 is writing the printk() debugging messages to the serial console:
PID: 28052 TASK: ffff96a1e5fa0000 CPU: 52 COMMAND: "sshd"
#0 [fffffe0000a65e58] crash_nmi_callback at ffffffffa3e61113
#1 [fffffe0000a65e60] nmi_handle at ffffffffa3e36043
#2 [fffffe0000a65eb8] default_do_nmi at ffffffffa3e364be
#3 [fffffe0000a65ed8] do_nmi at ffffffffa3e366d3
#4 [fffffe0000a65ef0] end_repeat_nmi at ffffffffa480164b
[exception RIP: io_serial_in+20]
RIP: ffffffffa43d8424 RSP: ffffa4e94d204c20 RFLAGS: 00000002
RAX: ffffffffa43d8400 RBX: ffffffffa6458540 RCX: 0000000000000000
RDX: 00000000000002fd RSI: 0000000000000005 RDI: ffffffffa6458540
RBP: 00000000000026fc R8: 000b4bf3fa09edba R9: 0000000000000034
R10: 0000000000000001 R11: ffffa4e94d204cb0 R12: 0000000000000020
R13: ffffffffa5a89605 R14: ffffffffa5a895c0 R15: 0000000000000045
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffa4e94d204c20] io_serial_in at ffffffffa43d8424
#6 [ffffa4e94d204c20] wait_for_xmitr at ffffffffa43d8e20
#7 [ffffa4e94d204c40] serial8250_console_putchar at ffffffffa43d8ea6
#8 [ffffa4e94d204c58] uart_console_write at ffffffffa43d16f5
#9 [ffffa4e94d204c80] serial8250_console_write at ffffffffa43dcd18
#10 [ffffa4e94d204d00] console_unlock at ffffffffa3f06ae2
#11 [ffffa4e94d204d48] vprintk_emit at ffffffffa3f086f9
#12 [ffffa4e94d204d98] printk at ffffffffa3f08fb0
#13 [ffffa4e94d204e00] show_trace_log_lvl at ffffffffa3e3591d
#14 [ffffa4e94d204ef0] watchdog_timer_fn at ffffffffa3f89758
#15 [ffffa4e94d204f20] __hrtimer_run_queues at ffffffffa3f31c08
#16 [ffffa4e94d204f80] hrtimer_interrupt at ffffffffa3f323a5
#17 [ffffa4e94d204fd8] smp_apic_timer_interrupt at ffffffffa48026ba
#18 [ffffa4e94d204ff0] apic_timer_interrupt at ffffffffa4801bef
--- <IRQ stack> ---
The system serial interface is active:
crash> p saved_command_line
saved_command_line = $1 = 0xffff96b8fffb7a80 "BOOT_IMAGE=/boot/vmlinuz-5.3.18-24.37-default root=/dev/mapper/vg_system-lv_root ipv6.disable=1 showopts ignore_loglevel textmode=1 splash=verbose transparent_hugepage=never crashkernel=768M,high crashkernel=72M,low console=tty0 console=ttyS1,115200 vga=0x"...
Resolution
Cause
Status
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID:000020516
- Creation Date: 25-Nov-2021
- Modified Date:03-Dec-2021
-
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com