
What to do when your system hangs

If one of your Linux systems has crashed or appears to have hung, it can be difficult to know what to do next. Your first instinct may be to reboot it, but a reboot should not be your first resort, particularly if you want to keep the problem from recurring. The steps below help you determine what happened and how to prevent it in the future.

Before Rebooting a Hung System

Do you know what the server was doing when it hung? Running a heavy workload is probably the most common case, but our technicians have also witnessed systems that only hang when idle.

If a server “hangs” and you cannot log in, there are a few things to do before you reboot it. The goal is to determine if the entire system is 100% frozen or if certain portions are still operating (which often helps classify the type of problem). Some things to try:

  • Can you ping it? If it replies, the kernel's network stack is still running, so at least part of the system is still operating.
  • Does the system ask you for a password when you try to log in? Some hangs let the system accept your username and then stall while you are typing the password (e.g., in an SSH login session).
  • Does there appear to be hard drive activity (for example, a blinking drive activity LED)?
  • Even though you cannot log in, does it respond to typing if you try to access it via IPMI? What if you go to the machine in the server room and try to log in there? If it does not respond to the keyboard at all when you are at the console, that suggests the system is frozen entirely rather than partially.
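The first two checks can be run from another machine without logging in. The following is a minimal Python sketch of that idea, not a definitive diagnostic tool. It assumes a Linux client with the iputils `ping` command and an SSH daemon listening on port 22. The function names and the three category labels are illustrative assumptions and do not come from any standard tool.

```python
import socket
import subprocess
from typing import Optional


def can_ping(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo request is answered.

    Assumes the Linux iputils ping flags: -c (count) and -W (reply timeout, seconds).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def ssh_banner(host: str, port: int = 22, timeout_s: float = 5.0) -> Optional[str]:
    """Return the SSH identification string, or None if none arrives in time.

    A connection that opens but never sends a banner is itself a useful
    symptom: the kernel accepted the connection but sshd did not respond.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            data = sock.recv(256)
    except OSError:
        return None
    text = data.decode("ascii", errors="replace").strip()
    return text if text.startswith("SSH-") else None


def classify(ping_ok: bool, banner_ok: bool) -> str:
    """Map the two probe results onto a rough, illustrative category."""
    if banner_ok:
        return "partial hang: sshd still accepts connections"
    if ping_ok:
        return "partial hang: kernel answers ping but sshd is silent"
    return "no network response: check the console or IPMI"
```

For a host named `node01` (a hypothetical name), `classify(can_ping("node01"), ssh_banner("node01") is not None)` returns one of the three labels. A firewall that blocks ICMP can make `can_ping` fail on a healthy host, which is why the SSH result is checked first.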

After Rebooting

At this point, you want to see what is in the system logs. Not all types of issues show up in the system logs, but you may see something valuable there. The files to check vary by Linux distribution, but here are a few to look at:

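/var/log/dmesg (kernel messages captured at boot)
/var/log/messages (general system log on Red Hat-family distributions)
/var/log/mcelog (machine check exceptions, i.e. hardware errors reported by the CPU)
/var/log/syslog (general system log on Debian and Ubuntu)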
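A first pass over these files can be automated. The Python sketch below scans whichever of the four files exist and reports lines matching a handful of well-known kernel trouble markers. The pattern list is an illustrative assumption, not a complete or authoritative catalogue, and a match is only a starting point for reading the surrounding log.

```python
import re
from pathlib import Path

# The four paths named in this article; not every distribution has all of them.
LOG_FILES = [
    "/var/log/dmesg",
    "/var/log/messages",
    "/var/log/mcelog",
    "/var/log/syslog",
]

# Illustrative markers only (assumption): extend or trim for your systems.
PATTERNS = re.compile(
    r"kernel panic|\boops\b|out of memory|soft lockup|hard lockup"
    r"|blocked for more than|hardware error|machine check|i/o error",
    re.IGNORECASE,
)


def flag_lines(lines):
    """Return (line_number, text) pairs for lines that match a marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERNS.search(line)
    ]


def scan_logs(paths=LOG_FILES):
    """Scan each readable log file and return {path: [(line_number, text), ...]}."""
    findings = {}
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue  # this distribution does not use this file
        try:
            with path.open(errors="replace") as handle:
                hits = flag_lines(handle)
        except PermissionError:
            continue  # reading system logs usually requires root
        if hits:
            findings[name] = hits
    return findings
```

Calling `scan_logs()` as root returns a dictionary keyed by file path. An empty result does not mean the system is healthy, since not all problems are logged.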

Some of the common error messages are listed in High-Level Linux Troubleshooting. You may need to have a Linux systems expert review the logs. Be sure to provide them with the entire log contents rather than excerpts. Messages from system daemons and from the Linux kernel can be misleading: users often mistake routine kernel messages for critical errors, or overlook the errors and warnings that indicate real trouble spots.
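Since a reviewer needs the complete logs, it can help to bundle them into a single archive before sending. The following Python sketch does that with the standard `tarfile` module. The function name `bundle_logs` and the output file name in the usage note are hypothetical, and the sketch assumes you have permission to read the files.

```python
import tarfile
from pathlib import Path


def bundle_logs(paths, archive_path):
    """Write every existing file in `paths` into a gzip-compressed tar archive.

    Returns the list of paths that were actually added, so missing files
    (which vary by distribution) are easy to spot.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in paths:
            if Path(name).is_file():
                archive.add(name)  # stores the whole file, not an excerpt
                added.append(name)
    return added
```

For example, `bundle_logs(["/var/log/messages", "/var/log/syslog"], "hang-logs.tar.gz")` archives whichever of the two files is present.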
