
High-Level Linux Troubleshooting

Whether you’re working on a cluster, a server or a workstation, most installations of Linux are similar. When something goes wrong, you need to determine the exact issue before you can get it resolved. This article provides a top-level overview of Linux troubleshooting.

Linux Kernel Messages

The Linux kernel is often aware of issues as they occur. If you suspect you’re facing a hardware issue or serious software issue (crashes/segfaults), the kernel can probably provide more information.

To see the most recent messages, run:
dmesg | tail -n50

To find older messages, read through the log file /var/log/messages (on some systems /var/log/kern.log). The Linux kernel prints many messages during normal operation (especially during the boot process), so don’t assume everything you see is a serious error.
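Because most kernel output is routine, it can help to filter a log down to the lines that look like problems. The sketch below is one way to do that; the function name `kernel_problems` and the keyword list are illustrative assumptions, not a complete list of what the kernel may report.

```shell
# Hypothetical helper (name and keyword list are assumptions for illustration).
# Prints lines that mention common problem keywords, ignoring case.
# Reads the file given as $1, or standard input when no file is given.
kernel_problems() {
    grep -iE 'error|fail|segfault|panic' "${1:--}"
}
```

For example, `dmesg | kernel_problems` filters the current kernel buffer and `kernel_problems /var/log/messages` filters the older log file. On systemd-based distributions, `journalctl -k | kernel_problems` should work as well. A match is only a hint: some routine messages contain these words too.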

Memory Errors

If your dmesg output contains messages similar to the examples below, your system is encountering errors when accessing memory. Because modern system components are closely integrated, such an error may be caused by several different types of hardware failure. Please send the dmesg output to our support team.

sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010091
TSC 0 ADDR 10877e640 MISC 21420c8c86 PROCESSOR 0:206d6 TIME 1369016551 SOCKET 0 APIC 0
EDAC MC0: CE row 0, channel 0, label "CPU_SrcID#0_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0091 (ch=1), addr = 0x108778e40 => socket=0, Channel=0(mask=1), rank=0
kernel:[Hardware Error]: CPU:56 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c55c00080080a13
kernel:[Hardware Error]:     MC4_ADDR: 0x000000720157c6f0
kernel:[Hardware Error]: Northbridge Error (node 7): DRAM ECC error detected on the NB.
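To collect only the memory-related lines before contacting support, a filter along the following lines may be useful. The function name `memory_error_lines` is hypothetical, and the patterns are taken from the example messages above, so other platforms may word their errors differently.

```shell
# Hypothetical helper; the patterns are based only on the sample messages above.
# Reads kernel messages on standard input and keeps lines that mention
# machine-check, EDAC or ECC events.
memory_error_lines() {
    grep -E 'Machine Check Exception|MCE|EDAC|\[Hardware Error\]|ECC error'
}
```

A typical use would be `dmesg | memory_error_lines > memory-errors.txt`. The full dmesg output is still the best thing to send, since nearby lines often carry useful context.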

NVIDIA GPU Errors

Kernel messages that contain the terms NVRM or Xid indicate that an event occurred on an NVIDIA GPU. Not every such message is fatal, so please contact Microway support for additional review. Consult NVIDIA documentation for the full list of Xid errors. Some examples of higher-priority issues are shown below.

NVRM: GPU at 0000:83:00: GPU-722f9c93-9a7f-08e3-6cc2-a5d8e3331e7f
NVRM: Xid (PCI:0000:83:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU
NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:83:00.0 has fallen off the bus.
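When many Xid messages are logged, a short summary of which GPU reported which Xid number makes it easier to look the codes up. The following is a minimal sketch that assumes the messages follow the `NVRM: Xid (PCI:...): N, ...` layout shown above; the function name `xid_codes` is hypothetical.

```shell
# Hypothetical helper; assumes the "NVRM: Xid (<PCI address>): <number>, ..."
# layout seen in the examples above.
# Reads kernel messages on standard input and prints one line per unique
# "<PCI address> <Xid number>" pair.
xid_codes() {
    grep -E 'NVRM: Xid' \
        | sed -E 's/.*NVRM: Xid \(([^)]*)\): ([0-9]+),.*/\1 \2/' \
        | sort -u
}
```

Running `dmesg | xid_codes` on the messages above would list Xid 48 and Xid 79 for the GPU at PCI:0000:83:00.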

Software RAID Errors

The dmesg output below shows an example message for a system with a degraded software RAID. A RAID becomes degraded when one of its drives fails, and the failed drive will need to be replaced. Please send a copy of the file /proc/mdstat to our support team.

[2010086.462608] md/raid1:md1: Disk failure on sdb1, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[2010086.474910] RAID1 conf printout:
[2010086.474914]  --- wd:1 rd:2
[2010086.474917]  disk 0, wo:1, o:0, dev:sdb1
[2010086.474919]  disk 1, wo:0, o:1, dev:sda1
[2010086.480441] RAID1 conf printout:
[2010086.480444]  --- wd:1 rd:2
[2010086.480447]  disk 1, wo:0, o:1, dev:sda1
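The same condition can usually be spotted in /proc/mdstat, where each array has a status field such as `[UU]` and an underscore (for example `[_U]`) marks a missing or failed member. The sketch below lists arrays whose status contains an underscore. The function name `degraded_arrays` is hypothetical, and the parsing assumes the usual mdstat layout.

```shell
# Hypothetical helper; assumes the usual /proc/mdstat layout, in which each
# array starts on a line beginning with "md" and a later line carries a
# status field such as [UU] or [_U].
# Prints the name of every array whose status field contains "_".
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}
```

Called with no arguments it reads /proc/mdstat. No output suggests that no array is degraded, but the file itself remains the authoritative record to send to support.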

Application Errors

If your scientific code is not working properly but you can find no system errors or messages, Linux and the hardware are most likely working fine. The more probable cause is a bug in your code, your compiler or one of the scientific/math libraries. In some cases it is simply a compatibility issue, and recompiling with a different compiler or library may fix it (e.g., OpenMPI instead of MVAPICH2, or the Intel compiler instead of the GNU compiler).

System Hangs/Crashes

Many different conditions can be described as a “system hang”, and such behavior has a variety of possible causes. Please see our article on what to do when your system hangs.

No Linux Kernel Messages; System Reboots/Powers Off

If your system is rebooting or powering off with no warning, Linux will not be able to log the cause. You should verify that both your power and cooling are sufficient. The room should be roughly 74°F – systems that overheat will automatically power themselves off.

If power and cooling are reliable, then the most likely explanation is a hardware issue. Our support team can help you track down the issue.
