Considerations for dealing with correctable memory error messages
This document (7022118) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12 (all versions)
x86_64
mcelog
Situation
Generally, the kernel detects and reports correctable memory (ECC) errors, but systems with sophisticated firmware can sometimes do a better job because of their intimate knowledge of the platform and their ability to perform the proper recovery actions.
The kernel, however, is not the only source: the system service controller may or may not detect the same kind of issue. Especially if the service controller does not show the events seen in the OS, the administrator starts to wonder whether there is an issue at all.
Resolution
The operating system (in this case the kernel) is as verbose as possible and logs those events by default, which may lead to false positive alerts if no errors are reported by the hardware management board.
The kernel-source.rpm contains the file
/usr/src/linux/Documentation/x86/x86_64/boot-options.txt
which documents a number of kernel options that influence the logging behavior of the kernel. The main question is: should the administrator worry about corrected ECC errors at all?
From a technical point of view, a corrected memory message should be considered informational only, because the error has been corrected by the built-in hardware error correction mechanisms and has not had any effect on system execution. However, today's hardware management boards may define thresholds for how many errors may occur before a warning or action is triggered.
Uncorrected errors, on the other hand, are the ones to worry about. In case of such an event, the kernel panics automatically to prevent data corruption (see option mce=tolerancelevel# in /usr/src/linux/Documentation/x86/x86_64/boot-options.txt).
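The distinction between corrected and uncorrected errors can be illustrated with a minimal Python sketch. It assumes the EDAC driver is loaded and exposes its per-controller counters ce_count and ue_count under /sys/devices/system/edac/mc; the function names and the threshold of 100 corrected errors are hypothetical and chosen only for illustration. The hardware vendor's own threshold should be used instead.

```python
from pathlib import Path

# Location of the EDAC memory controller counters (requires a loaded EDAC driver).
EDAC_MC_DIR = Path("/sys/devices/system/edac/mc")


def read_edac_counts(base=EDAC_MC_DIR):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller without readable counters; skip it
        counts[mc.name] = (ce, ue)
    return counts


def assess(ce, ue, ce_threshold=100):
    """Classify counters; ce_threshold is a hypothetical, illustrative value."""
    if ue > 0:
        return "critical"  # uncorrected errors are the ones to worry about
    if ce > ce_threshold:
        return "warning"   # many corrected errors: consult the hardware vendor
    return "info"          # corrected errors are informational only


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(name, "corrected:", ce, "uncorrected:", ue, "->", assess(ce, ue))
```

On a system without EDAC counters the sketch prints nothing, which does not mean that no errors occurred.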
A kernel option that may influence the behaviour of ECC RAM error logging is (taken from /usr/src/linux/Documentation/x86/x86_64/boot-options.txt):
mce=ignore_ce
This option instructs the kernel to ignore correctable errors in the presence of a hardware management board which takes care of monitoring such events instead.
Disable features for corrected errors, e.g. polling timer
and CMCI. All events reported as corrected are not cleared
by OS and remained in its error banks.
[...]
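To verify whether a running system was booted with this option, the kernel command line in /proc/cmdline can be inspected. The following Python sketch is one possible way to do that; the helper names mce_options and ignores_corrected_errors are hypothetical and not part of any SUSE tool.

```python
def mce_options(cmdline):
    """Return the values of all mce=... parameters on a kernel command line."""
    prefix = "mce="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


def ignores_corrected_errors(cmdline):
    """True if mce=ignore_ce is present on the given kernel command line."""
    return "ignore_ce" in mce_options(cmdline)


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            running = f.read()
    except OSError:
        running = ""  # not on Linux, or /proc is not available
    print("mce options:", mce_options(running))
    print("mce=ignore_ce active:", ignores_corrected_errors(running))
```

The sketch only reports what was passed at boot; it does not change the logging behaviour of the kernel.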
Additional Information
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-a00016026en_us&sp4ts.oid=3884323
Please note that disabling EDAC as discussed in this article will not affect the kernel's ability to react to uncorrectable memory error events. In this case, a machine check exception will be raised and the system will crash to prevent data corruption.
In case of any questions, please open a support request with the respective hardware vendor to discuss recommended settings for the hardware platform in use.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7022118
- Creation Date: 17-Oct-2017
- Modified Date: 03-Mar-2020
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com