SLES 15 SP2 based systems with Optane memory may encounter a kernel crash after running supportconfig

This document (000019848) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15 Service Pack 2 with kernel versions up to 5.3.18-24.37-default

Optane memory


Situation

When executing the command supportconfig to gather information for a support case, the system may crash with a kernel trace similar to:

[ 501.580554] BUG: unable to handle page fault for address: 00000000000028b8
[ 501.588732] #PF: supervisor read access in kernel mode
[ 501.594902] #PF: error_code(0x0000) - not-present page
[ 501.600991] PGD 0 P4D 0
[ 501.604181] Oops: 0000 [#1] SMP NOPTI
[ 501.608592] CPU: 29 PID: 11091 Comm: systool Kdump: loaded Not tainted 5.3.18-24.37-default #1 SLE15-SP2
[ 501.619537] Hardware name: Cisco Systems Inc UCSB-B480-M5/UCSB-B480-M5, BIOS B480M5.4.0.4i.0.0831191124 08/31/2019
[ 501.631463] RIP: 0010:is_mem_section_removable+0x41/0x140
[ 501.637832] Code: 49 89 fc 48 8b 04 10 4c 01 e6 48 89 c1 48 89 c7 48 c1 e9 33 48 c1 ef 36 83 e1 07 48 69 c9 c0 05 00 00 48 03 0c fd a0 82 80 bb <48> 8b 69 78 48 03 69 68 48 39 f5 48 0f 47 ee 49 39 ec 0f 83 da 00
[ 501.659579] RSP: 0018:ffffb8065dc4bdf8 EFLAGS: 00010202
[ 501.665780] RAX: fffffffe00000001 RBX: 000000000000000c RCX: 0000000000002840
[ 501.674118] RDX: fffff8c600000000 RSI: 0000000006068000 RDI: 00000000000003ff
[ 501.682454] RBP: ffff91a820c3e000 R08: 0000000000000001 R09: ffff908b6f6446c0
[ 501.690789] R10: 0000000000000000 R11: 000000000000362d R12: 0000000006060000
[ 501.699153] R13: 0000000000000000 R14: 0000000000000001 R15: ffff91a83bc10800
[ 501.707514] FS: 00007f42950a7740(0000) GS:ffff91a840640000(0000) knlGS:0000000000000000
[ 501.716923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 501.723742] CR2: 00000000000028b8 CR3: 0000017b7ecfe002 CR4: 00000000007606e0
[ 501.732118] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 501.740492] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 501.748834] PKRU: 55555554
[ 501.752266] Call Trace:
[ 501.755407] removable_show+0x8e/0xb0
[ 501.759901] dev_attr_show+0x18/0x50
[ 501.764315] sysfs_kf_seq_show+0xb3/0x110
[ 501.769171] seq_read+0xd8/0x3e0
[ 501.773153] vfs_read+0x89/0x140
[ 501.777119] ksys_read+0xa1/0xe0
[ 501.781081] do_syscall_64+0x5b/0x1e0
[ 501.785525] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 501.791559] RIP: 0033:0x7f42949b81a1


Resolution

Solution a)

To temporarily disable supportconfig sysfs checks, change SYSFS=1 to SYSFS=0 in /etc/supportconfig.conf. If the configuration file does not exist, run

supportconfig -C

to create it and use

sed -i 's/SYSFS=1/SYSFS=0/g' /etc/supportconfig.conf

to change the SYSFS option.
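
The steps of Solution a) can be wrapped in a small helper. This is only a sketch: the function name disable_sysfs_check is hypothetical and not part of supportconfig, and the backup copy is an addition that the article does not require. It uses the same sed substitution as shown above.

```shell
#!/bin/sh
# Sketch: set SYSFS=0 in a supportconfig configuration file.
# disable_sysfs_check is a hypothetical helper name, not part of supportconfig.
disable_sysfs_check() {
    conf=${1:-/etc/supportconfig.conf}

    # If the file is missing it has to be created with "supportconfig -C"
    # first; this helper only reports that and leaves the step to the admin.
    if [ ! -f "$conf" ]; then
        echo "$conf not found; run 'supportconfig -C' first" >&2
        return 1
    fi

    # Keep a backup so the check can be re-enabled later (an assumption of
    # this sketch, not a step from the article).
    cp -p "$conf" "$conf.bak" || return 1

    # Same substitution as in the article (GNU sed in-place edit).
    sed -i 's/SYSFS=1/SYSFS=0/g' "$conf" || return 1

    # Succeed only if the option now carries the expected value.
    grep -q 'SYSFS=0' "$conf"
}
```

Called without an argument, the helper operates on /etc/supportconfig.conf; passing a path to a copy of the file allows a dry run first.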

Solution b)

Apply the patch SUSE-SLE-Module-Basesystem-15-SP2-2021-354 to address this issue. A reboot is required after installing the kernel update.
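
To judge whether a system still runs a kernel in the affected range, the release string can be compared with 5.3.18-24.37, the last affected version named in the Environment section. The following is a minimal sketch: kernel_is_affected is a hypothetical helper, it assumes GNU sort with the -V option, and it is only meaningful for SLES 15 SP2 kernels.

```shell
#!/bin/sh
# Sketch: report whether a SLES 15 SP2 kernel release is at or below the
# last affected version named in the article.
# kernel_is_affected is a hypothetical helper; it relies on GNU "sort -V".
LAST_AFFECTED="5.3.18-24.37"

kernel_is_affected() {
    # Strip a trailing flavor suffix such as "-default".
    ver=${1%-[a-z]*}

    # Version-sort both strings and take the higher one.
    highest=$(printf '%s\n%s\n' "$ver" "$LAST_AFFECTED" | sort -V | tail -n 1)

    # Affected if nothing sorts above the last affected version.
    [ "$highest" = "$LAST_AFFECTED" ]
}
```

For example, kernel_is_affected "$(uname -r)" returns success on a kernel that still needs the update and failure once a later kernel is running.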

Cause

In certain platform configurations, a kernel crash can occur when persistent memory is adjacent to RAM and is not aligned to 128MB, and the system has enough memory to use large memory blocks (usually 2GB). The crash happens while the kernel checks for pmem-backed memory during a read of the /sys/devices/system/memory/memory*/removable files. Running systool -vb memory is one way to trigger the problem; this command is executed as part of supportconfig, SUSE's support tool.

How to check:
First, check the memory block size in the kernel log:
[    7.798992] x86/mm: Memory block size: 2048MB

The size scales with the amount of memory. The smallest size is 128MB, which does not expose the problem; only larger sizes can.
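
The block size check can be scripted. The sketch below parses the log line format shown above; block_size_mb is a hypothetical helper name, and the sample input is the line quoted in this article rather than output from a live system.

```shell
#!/bin/sh
# Sketch: extract the memory block size (in MB) from kernel log text on stdin.
# block_size_mb is a hypothetical helper name.
block_size_mb() {
    sed -n 's/.*x86\/mm: Memory block size: \([0-9][0-9]*\)MB.*/\1/p' | head -n 1
}

# Sample line taken from the article; on a live system the input would come
# from the kernel log, for example "dmesg" or "journalctl -k".
size=$(echo '[    7.798992] x86/mm: Memory block size: 2048MB' | block_size_mb)

if [ "$size" -gt 128 ]; then
    echo "Block size ${size}MB is larger than 128MB: check the SRAT ranges"
else
    echo "Block size ${size}MB: this configuration does not expose the problem"
fi
```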

If the size is larger, consult the SRAT tables in the kernel log:
[    0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
[    0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile


Here we can see that the non-volatile memory is close to RAM and falls into the same 2GB physical region, which is the source of the problem.
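
The comparison can also be done with integer arithmetic: two addresses lie in the same memory block when dividing each by the block size gives the same result. The following is a sketch under that assumption; same_memory_block is a hypothetical helper, and the addresses and the 2048MB block size are the values quoted in this article.

```shell
#!/bin/sh
# Sketch: test whether the end of RAM and the start of a non-volatile range
# fall into the same memory block. same_memory_block is a hypothetical helper.
# Arguments: last RAM address, first pmem address, block size in MB.
same_memory_block() {
    block_bytes=$(( $3 * 1024 * 1024 ))
    ram_block=$(( $1 / block_bytes ))
    pmem_block=$(( $2 / block_bytes ))
    [ "$ram_block" -eq "$pmem_block" ]
}

# Values from the SRAT lines and the block size quoted in the article.
if same_memory_block 0x603fffffff 0x6060000000 2048; then
    echo "RAM and non-volatile memory share a memory block"
else
    echo "RAM and non-volatile memory are in separate memory blocks"
fi
```

With the same addresses and a 128MB block size the two ranges land in different blocks, which matches the statement that the smallest block size does not expose the problem.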

Why does that happen?
The underlying reason is that pmem doesn't initialize all the memory descriptors for its range and the implementation of /sys/devices/system/memory/memory*/removable doesn't expect that.

Additional Information

How does the fix work?
The updated kernel takes a workaround and simply always considers memory removable. This looks dubious at first sight, but whatever this file reports is imprecise at best: removability can change at any point in time, so nobody can rely on the returned value, because a hot-remove can fail right after the value has been read. Therefore upstream development decided to simply deprecate this interface. For backward compatibility it reports each memory block as removable, so that unaware userspace does not preemptively give up on a hotplug operation. See 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory blocks as removable") for more information.

This means that the problem is prevented by removing the code which would stumble over uninitialized data structures.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019848
  • Creation Date: 26-Jan-2021
  • Modified Date: 11-Feb-2021
    • SUSE Linux Enterprise Server
