mce EDAC memory scrubbing error
This document (000020932) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12
Situation
Cisco Hardware examples from supportconfig's basic-environment.txt:
Manufacturer: Cisco Systems Inc
Hardware: UCSC-C460-M4

Manufacturer: Cisco Systems Inc
Hardware: UCSB-EX-M4-3
Memory error examples from /var/log/messages or /var/log/warn:
kernel: [780347.201907] mce: [Hardware Error]: Machine check events logged
kernel: [780347.201913] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
kernel: [780347.201915] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 13: 8c00004e000800c0
kernel: [780347.201916] EDAC sbridge MC3: TSC 320fbc8c89c9a4
kernel: [780347.201918] EDAC sbridge MC3: ADDR 52baf54000
kernel: [780347.201918] EDAC sbridge MC3: MISC 900020002001c8c
kernel: [780347.201920] EDAC sbridge MC3: PROCESSOR 0:406f1 TIME 1672341137 SOCKET 0 APIC 0
kernel: [780347.201936] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x52baf54 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:0)

kernel: [5336480.342062] EDAC sbridge MC2: HANDLING MCE MEMORY ERROR
kernel: [5336480.342067] EDAC sbridge MC2: CPU 130: Machine Check Event: 0 Bank 8: cc00038000010091
kernel: [5336480.342070] EDAC sbridge MC2: TSC 0
kernel: [5336480.342071] EDAC sbridge MC2: ADDR b686270ec0
kernel: [5336480.342072] EDAC sbridge MC2: MISC 15646d086
kernel: [5336480.342074] EDAC sbridge MC2: PROCESSOR 0:406f1 TIME 1667973782 SOCKET 1 APIC 59
kernel: [5336480.342090] mce: [Hardware Error]: Machine check events logged
kernel: [5336480.342106] EDAC MC6: 14 CE memory read error on CPU_SrcID#1_Ha#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xb686270 offset:0xec0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:1 channel_mask:2 rank:5)
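To check whether a system logs messages like the ones above, the corrected-error (CE) lines can be summarized from the log. The following is a minimal sketch, not a SUSE or Cisco tool: the function name edac_ce_summary is hypothetical, and the pattern assumes the "EDAC MCn: <count> CE ... on <label> (" layout shown in the examples. Per the Cisco note quoted under Additional Information, the label reported by EDAC may not identify the DIMM that is actually at fault, so treat the output only as an indication of how often the messages occur.

```shell
#!/bin/sh
# Hypothetical helper: sum the corrected-error (CE) counts per EDAC label
# found in a log file (default: /var/log/messages).
edac_ce_summary() {
    log="${1:-/var/log/messages}"
    # Match lines like:
    #   EDAC MC3: 1 CE memory scrubbing error on <label> (channel:0 ...
    # and print "<label> <count>" for each of them.
    sed -n 's/.*EDAC MC[0-9]*: \([0-9][0-9]*\) CE .* on \([^ ]*\) (.*/\2 \1/p' "$log" |
        awk '{ total[$1] += $2 } END { for (d in total) print d, total[d] }' |
        sort
}

# Usage: edac_ce_summary /var/log/messages
#        edac_ce_summary /var/log/warn
```

Each output line contains the label as reported by EDAC followed by the summed CE count for that label.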
Resolution
Cisco recommends disabling the sb_edac kernel module
by adding the line blacklist sb_edac to /etc/modprobe.d/50-blacklist.conf
To do that, run the following command as root:
echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
Then reboot for the setting to take effect.
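Note that the echo command appends a new line every time it is run, so running it twice leaves a duplicate entry. As a minimal sketch, assuming a hypothetical helper named blacklist_module (it is not part of SLES), the entry can be added only when it is missing:

```shell
#!/bin/sh
# Hypothetical helper: append "blacklist <module>" to a modprobe
# configuration file only if that exact line is not already present.
blacklist_module() {
    module="$1"
    conf="${2:-/etc/modprobe.d/50-blacklist.conf}"
    line="blacklist ${module}"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        echo "$line" >> "$conf"
    fi
}

# Usage (as root):
#   blacklist_module sb_edac
# After the reboot, this prints nothing if the module is not loaded:
#   lsmod | grep sb_edac
```

The second argument exists only so the helper can be tried against a scratch file before touching /etc/modprobe.d.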
Cause
Additional Information
"If you have EDAC modules enabled in your Linux OS, then you really want to disable/black list those. They are notorious for not correctly identifying the actual DIMM that's triggering ECCs, and you really want to let the hardware do that (which it won't if you have EDAC active in os). Then you should be able to see in your SEL log, the DIMM slot this is triggering ECCs."
https://quickview.cloudapps.cisco.com/quickview/bug/CSCvf14908
"Symptom: When this issue occurs, the following two error/stack lines are often observed:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [] sbridge_mce_output_error+0x36a/0xdf0 [sb_edac]
Conditions: Cisco UCS B or C series Servers running SLES12 SP1. The EDAC module and UCS error detection conflicts with each other and can cause system crashes. EDAC module should be blacklisted."
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020932
- Creation Date: 17-Jan-2023
- Modified Date: 18-Jan-2023
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com