If one of your Linux systems has crashed or appears to have hung, it can be difficult to know what to do next. Your first instinct may be to reboot it, but a reboot should not be your first resort (particularly if you want to ensure the problem doesn’t recur soon). The steps below can help you determine what happened and how to prevent it in the future.
Before Rebooting a Hung System
Do you know what the server was doing when it hung? Running a heavy workload is probably the most common case, but our technicians have also witnessed systems that only hang when idle.
If a server “hangs” and you cannot log in, there are a few things to do before you reboot it. The goal is to determine if the entire system is 100% frozen or if certain portions are still operating (which often helps classify the type of problem). Some things to try:
- Can you ping it? If it replies, the kernel’s network stack is still running, so at least some portions of the system are still operating.
- Does the system ask you for a password when you try to log in? Some hangs let the system accept your username and then stall as you type the password (e.g., in an SSH login session).
- Does there appear to be hard drive activity (for example, a blinking drive activity light)?
- Even though you cannot log in, does it respond to typing if you try to access it via IPMI? What if you go to the machine in the server room and try to log in there? If it is not responding to the keyboard at all when you are in the server room, then that is a particular type of hang.
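The remote checks above can be scripted so they run the same way each time. The following is a minimal sketch, not a definitive diagnostic tool. It assumes a Linux `ping` that accepts `-c` and `-W`, and an SSH daemon on port 22. The function names and the wording of the classifications are illustrative assumptions.

```python
import socket
import subprocess
import sys


def can_ping(host, timeout_s=2):
    """Return True if a single ICMP echo request gets a reply."""
    try:
        # -c 1: send one packet; -W: seconds to wait (assumes Linux iputils ping).
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # no ping binary on this machine
    return result.returncode == 0


def ssh_banner(host, port=22, timeout_s=5):
    """Return (tcp_connected, banner_received) for the SSH port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            try:
                data = sock.recv(64)  # sshd normally sends "SSH-2.0-..." first
            except OSError:
                return True, False  # connected, but nothing was sent in time
            return True, data.startswith(b"SSH-")
    except OSError:
        return False, False  # refused, unreachable, or timed out


def classify(ping_ok, tcp_ok, banner_ok):
    """Turn the probe results into a rough, hedged description of the hang."""
    if banner_ok:
        return "sshd answers: any hang is later in the login or in user sessions"
    if tcp_ok:
        return "kernel accepts TCP but sshd is silent: user space may be stalled"
    if ping_ok:
        return "kernel answers ping but the SSH port does not connect"
    return "no network response: check the console or IPMI"


if __name__ == "__main__" and len(sys.argv) > 1:
    host = sys.argv[1]  # hostname or IP address of the hung server
    tcp_ok, banner_ok = ssh_banner(host)
    print(classify(can_ping(host), tcp_ok, banner_ok))
```

Run it with the server's hostname or IP address as the only argument. A result of "no network response" does not prove the system is frozen, since a network fault looks the same from the outside. The console and IPMI checks still apply.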
After Rebooting
At this point, you want to see what is in the system logs. Not all types of issues show up in the system logs, but you may see something valuable there. The files to check vary by Linux distribution, but here are a few to look at:
- /var/log/dmesg
- /var/log/messages
- /var/log/mcelog
- /var/log/syslog
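As a first pass over these files, a short script can report which ones exist on a given distribution and pull out lines that may deserve a closer look. This is a sketch under stated assumptions. The search patterns are an illustrative, non-exhaustive list chosen for this example, not an authoritative catalogue of kernel errors. A match is only a pointer to a line worth reading in context.

```python
import os
import re

# Paths from the text above; which ones exist depends on the distribution.
LOG_FILES = [
    "/var/log/dmesg",
    "/var/log/messages",
    "/var/log/mcelog",
    "/var/log/syslog",
]

# Illustrative patterns only (an assumption, not a complete list).
PATTERNS = re.compile(
    r"kernel panic|\boops\b|out of memory|soft lockup|hard lockup|"
    r"blocked for more than|machine check|hardware error|call trace",
    re.IGNORECASE,
)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching PATTERNS."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERNS.search(line)
    ]


def scan_logs(paths=LOG_FILES):
    """Scan each existing log file and return {path: [(line_number, text), ...]}."""
    results = {}
    for path in paths:
        if not os.path.isfile(path):
            continue  # not present on this distribution
        try:
            with open(path, errors="replace") as handle:
                results[path] = find_suspect_lines(handle)
        except PermissionError:
            results[path] = [(0, "permission denied; try running as root")]
    return results


if __name__ == "__main__":
    for path, hits in scan_logs().items():
        print(f"{path}: {len(hits)} line(s) to review")
        for number, text in hits:
            print(f"  {number}: {text}")
```

A file with zero matches is not proof of a healthy system, since not all types of issues show up in the logs.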
Some of the common error messages are listed in High-Level Linux Troubleshooting. You may need to have a Linux systems expert review the logs. Be sure to provide them with the entire log contents, not just excerpts, because messages from system daemons and from the Linux kernel can be misleading. It’s easy to mistake routine kernel messages for critical errors, or to overlook the errors and warnings that indicate real trouble spots.
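One way to hand over the complete logs is to bundle them, unmodified, into a single archive. The following is a minimal sketch using Python's standard `tarfile` module. The archive name and the `bundle_logs` helper are hypothetical choices for this example, and reading the files will usually require root privileges.

```python
import os
import tarfile
import tempfile
import time

# Same paths as listed earlier; missing files are skipped.
LOG_FILES = [
    "/var/log/dmesg",
    "/var/log/messages",
    "/var/log/mcelog",
    "/var/log/syslog",
]


def bundle_logs(paths, archive_path):
    """Write every readable file in `paths` to a gzip-compressed tar archive.

    Returns the list of paths that were added.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            if os.path.isfile(path) and os.access(path, os.R_OK):
                archive.add(path)  # stored whole; the leading "/" is stripped
                added.append(path)
    return added


if __name__ == "__main__":
    # Hypothetical archive name; adjust the location to suit your site.
    name = "hang-logs-" + time.strftime("%Y%m%d-%H%M%S") + ".tar.gz"
    target = os.path.join(tempfile.gettempdir(), name)
    included = bundle_logs(LOG_FILES, target)
    print(f"wrote {target} with {len(included)} file(s): {included}")
```

Rotated copies of the logs are not collected by this sketch. If the hang happened before the most recent rotation, add those files to the list as well.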