Linux system hangs or is unstable
This document (3301593) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11
SUSE Linux Enterprise Server 10
Situation
System hangs
System is unstable
System oops or panic
Resolution
- Problem characterization
- Hardware layer
- BIOS / firmware layer
- Storage layer
- Software layer
Additional Information
Introduction
Due to the large number of different potential causes, system hangs are among the most difficult problems to troubleshoot and a systematic approach is required for troubleshooting to be effective. This document describes such an approach, in general terms.
Problem characterization
First of all, establish a detailed characterization of the problem that answers, at a minimum, the following questions:
- What is meant by a hang or instability? Is the system no longer providing a particular service (reliably), has the system as a whole become completely inaccessible (both via network and via console), or is it still responsive to some forms of connection (e.g. SSH, VNC or ping) or commands?
- For a hang, is it a single occurrence or has the hang occurred multiple times?
- For a recurring hang, is there a pattern to the hangs? E.g. can the hang be triggered by a particular sequence of operations, or does it always occur around a particular time of day, after a particular period of system uptime, or when particular cron jobs are executed?
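When looking for a time-of-day pattern in recurring hangs, it can help to tally the recorded hang times. The following is a minimal Python 3 sketch of that idea, not part of the original procedure; the function name tally_hangs, the timestamp format and the sample times are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def tally_hangs(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count hang occurrences per hour of day and per weekday."""
    by_hour = Counter()
    by_weekday = Counter()
    for text in timestamps:
        moment = datetime.strptime(text, fmt)
        by_hour[moment.hour] += 1
        # weekday() returns 0 for Monday, independent of the system locale
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
    return by_hour, by_weekday

# Hypothetical hang times noted by an administrator.
observed = ["2021-05-03 02:14", "2021-05-10 02:09",
            "2021-05-17 02:21", "2021-05-19 15:40"]
by_hour, by_weekday = tally_hangs(observed)
print("Most frequent hour:", by_hour.most_common(1))
print("Most frequent weekday:", by_weekday.most_common(1))
```

A cluster around one hour or weekday suggests comparing the hang times against cron schedules, backup windows or other periodic activity.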
Hardware layer
System hangs or instabilities can be caused by hardware that is defective or improperly configured. Unfortunately, this happens more than most people realize, for two main reasons:
- A ground rule with hardware is "Cheap, reliable, fast. Pick any two". Hardware that is cheap and reliable is not fast; hardware that is fast and cheap is not reliable; hardware that is reliable and fast is not cheap.
- Proper hardware configuration is difficult. Most hardware has many settings which can be tweaked, but knowing when and what to tweak can be something of a black art.
Fortunately, reputable hardware vendors offer diagnostics software that can and should be used to detect hardware problems. If hardware problems are incorrectly disregarded as a problem source, much time will be wasted on analysing the software level.
Aside from vendor hardware diagnostics software, for x86 and x86_64 systems there are very thorough diagnostic tools for the memory subsystem: Memtest86 and Memtest86+. These tools are often better at identifying memory subsystem issues than vendor hardware diagnostics software. A version of them is included on the boot CD of SUSE Linux products and these tools can also be obtained from the www.memtest86.org and www.memtest86.com web sites.
Consult vendor configuration guides
As for hardware configuration, some vendors (e.g. IBM) provide detailed configuration guides for SUSE Linux products on specific hardware models on their support sites. When available, this type of guide should be followed, preferably from the initial installation onwards. Even when such a guide has not been followed during initial installation, it should be consulted later on to check the system configuration and bring it in line with the hardware vendor's recommendations.
Consult certification documentation
Additionally, for SUSE YES CERTIFIED configurations, consult the YES CERTIFIED Bulletin Search. Where applicable, the certification bulletins contain configuration details such as Linux kernel parameters.
Address power supply issues
In some regions or at some locations, power from the regular electrical grid may be too variable in voltage, frequency or current for hardware to operate reliably. In such locations, appropriate electrical hardware like surge protectors, voltage regulators, uninterruptible power supplies and/or generators should be used to provide reliable power for computer systems operation.
Isolate components
In some cases, stability issues and hangs are caused by specific extension cards. Remove all non-essential extension cards and test the system. Then put the cards back one by one, testing the system after every added card.
Best practice: "burn in" testing
In light of these considerations, hardware that is to be used for production services should undergo thorough "burn in" testing, covering diagnostics as well as stress and load testing, before it is put into production use.
BIOS / firmware layer
On PC-based systems, the BIOS (Basic Input/Output System) is responsible for the initial setup of the system and devices up to the point where a boot loader can be started to boot the system. On other architectures, the term "BIOS" is not used, but equivalent embedded software exists, e.g. "Open Firmware" or "Extensible Firmware Interface". The BIOS and its equivalents on non-PC architectures may also be involved in power management, hardware monitoring and hotplugging of extension cards.
A BIOS, like any other software, may contain general programming defects (bugs) and may not fully follow or support relevant standards such as ACPI. Vendors regularly release updated BIOS versions to correct such defects. Given the central role of the BIOS, it is important to track such version updates and to ensure that the most recent non-development version of the BIOS is installed.
Most reputable vendors provide a search interface on their support sites that makes it easy to find the current BIOS revision for a particular hardware model as well as update instructions.
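To compare against the vendor's current revision, the installed BIOS version has to be known. The following Python 3 sketch shows one possible way to read it, assuming the running kernel exposes DMI data under /sys/class/dmi/id (this may not be the case on older kernels or on non-PC architectures). The function name read_bios_info is hypothetical.

```python
import os

def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Return BIOS vendor, version and date as exposed by the kernel, where available."""
    info = {}
    for field in ("bios_vendor", "bios_version", "bios_date"):
        path = os.path.join(dmi_dir, field)
        try:
            with open(path) as handle:
                info[field] = handle.read().strip()
        except OSError:
            # Assumption: a missing or unreadable file means the value is not exposed here
            info[field] = None
    return info

for field, value in read_bios_info().items():
    print(field, "=", value)
```

Where these files are not available, the BIOS setup screen or the vendor's own tools remain the authoritative source for the installed revision.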
Other Firmware
With modern hardware, many components, for instance NICs, HBAs and storage controllers, include embedded software or firmware of their own. This firmware should be brought up to date as well.
Storage layer
Ensure that your storage is consistent by performing filesystem checks (and recovery) on all storage areas, including the root filesystem. To check the root filesystem, use the rescue environment from the service pack or installation CDs or DVDs.
Software layer
Check for corrupted data
Even when the filesystems check out cleanly, data contained in them may be corrupted, including code and data vital to proper operation of the operating system. The package management system stores checksums of data under its control. Run a package verification, for example rpm -Va, to compare the installed files against those checksums.
Check the output of this command for signs of changes in files that are not configuration files, like binaries and libraries.
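Separating configuration files from other changed files can be automated. The sketch below is a hedged Python 3 illustration, assuming the package verification is performed with rpm -Va on an RPM-based system and that its output follows the usual layout of a flags column, an optional one-letter attribute marker (c for configuration files) and a file name. The function name and the sample paths are hypothetical.

```python
ATTRIBUTE_MARKERS = {"c", "d", "g", "l", "r"}  # config, doc, ghost, license, readme

def changed_non_config_files(verify_output):
    """Return (flags, path) pairs from rpm -Va style output, skipping config files."""
    findings = []
    for line in verify_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # blank or unrecognized line
        if len(parts) == 3 and parts[1] in ATTRIBUTE_MARKERS:
            marker, path = parts[1], parts[2]
        else:
            # No attribute marker: everything after the flags is the path
            marker, path = "", line.split(None, 1)[1]
        if marker != "c":
            findings.append((parts[0], path.strip()))
    return findings

# Hypothetical sample output; real output is obtained by running rpm -Va as root.
sample = """\
S.5....T.  c /etc/hosts
S.5....T.    /usr/bin/example-tool
missing     /usr/lib64/libexample.so.1
.......T.  d /usr/share/doc/example/README
"""
for flags, path in changed_non_config_files(sample):
    print(flags, path)
```

A changed digest (the 5 flag) or a missing entry for a binary or library is the kind of finding that deserves closer inspection. Other message lines that the verification may print are not handled by this sketch.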
Keep the software installation up to date
SUSE actively maintains released products for long periods of time. This maintenance includes fixes for software defects in particular as well as the addition of drivers for newer hardware models. Use the tools supplied by SUSE, in particular the SPident tool, the SUSE Customer Center and the online update facilities of your product to check whether your software installation is up to date and to bring it up to date if it isn't.
Check recent updates
Unfortunately, updated packages can occasionally introduce new defects. You can use the package management system of your SUSE Linux Enterprise product to determine what updates have been installed recently, e.g. through rpm -qa --last, which lists the installed packages ordered by installation time, most recent first.
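As a minimal sketch, recently installed packages could also be filtered by age. The Python 3 example below assumes the package list was produced with rpm -qa --queryformat '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n', which prints the install time as seconds since the epoch followed by the package name. The function name recent_packages and the sample package versions are hypothetical.

```python
import time

def recent_packages(query_output, days, now=None):
    """Return (install_time, package) pairs installed within the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    recent = []
    for line in query_output.splitlines():
        stamp, _, package = line.partition(" ")
        if stamp.isdigit() and int(stamp) >= cutoff:
            recent.append((int(stamp), package))
    return sorted(recent, reverse=True)

# Hypothetical sample of the query output described above.
sample = (
    "1622100000 kernel-default-4.12.14-1.1\n"
    "1500000000 bash-4.3-1.1\n"
)
for stamp, package in recent_packages(sample, days=14, now=1622200000):
    print(time.strftime("%Y-%m-%d %H:%M", time.localtime(stamp)), package)
```

Comparing this list with the date on which the hangs first appeared can help decide whether a particular update is a plausible suspect.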
Support from SUSE Technical Services
Basic information
When opening a service request with SUSE Support for a server hang or instability issue, the following information may be vital to an efficient resolution process:
- A detailed characterization of the problem (as discussed above)
- A description of changes made to the system and its configuration during troubleshooting prior to the opening of a service request.
- If the system has core dumped, capture a kernel core analysis per TID000017889 (7010484) - Generating a Kernel Core Dump Analysis File
- Run supportconfig to gather system data and, if present, the kernel core analysis file. See TID000019214 - Supportconfig Self Service via SCC/FTP
During the handling of your service request, you may be asked to provide a system crash dump for analysis, which may require substantial setup (e.g. of a serial console and/or second server to receive dumps). You can prepare for this by consulting the relevant TID or SLES Documentation for details.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 3301593
- Creation Date: 29-Oct-2007
- Modified Date: 27-May-2021
- SUSE Linux Enterprise Desktop
- SUSE Linux Enterprise Server
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com