In the previous post we learned a bit about servers and sysadmins, and how the sysadmin’s job is usually fairly unexciting.
But then occasionally something will go wrong. A server application that was working perfectly well five minutes ago is now giving the users nothing but error messages. This is when the job becomes terrifying, because some users feel compelled to explain to the sysadmin how she has personally failed them. The phone starts ringing off the hook, the email inbox fills up in a hurry, and people stop by the sysadmin’s desk to point out the painfully obvious fact that the system is down. Boredom is far preferable to this.
When things go off the rails, there are a number of things the sysadmin can do to diagnose the problem:
- She might look at a list of active processes running on the server to see if something important is missing. Sometimes a service will stop for some reason, and simply restarting the service will get things moving again. After everyone settles down a bit, the sysadmin can try to figure out why the damn thing stopped in the first place.
- A program called a packet sniffer can help analyze the server’s network connections. It could be that something about the network (completely external to the server itself) has changed, and that this is causing connectivity problems. This is the sysadmin’s favorite explanation, because everything immediately becomes someone else’s problem, and it gives the sysadmin an excuse to go yell at the network pukes.
- Log files may be the most common diagnostic tool. If the server application experiences some kind of problem, it will hopefully write a useful message to a log file. Often (not always, but often) looking at the log file will reveal the problem, and with luck there will be a straightforward solution that the sysadmin can apply promptly. Getting things back to normal in a hurry is certainly a priority in these situations, but it’s not always that easy. Sometimes it takes a while to diagnose a problem, and the solution may require unscheduled downtime.
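To make the log-file step a little more concrete, here is a minimal sketch in Python of the kind of quick scan a sysadmin might run. The keywords and the function names are my own assumptions for illustration; real applications each have their own log formats, so treat this as a starting point rather than a recipe.

```python
# Minimal sketch: pull out the log lines that look like trouble.
# The keyword list is an assumption; adjust it to whatever your
# application actually writes when something goes wrong.

def find_problem_lines(lines, keywords=("ERROR", "FATAL", "CRITICAL")):
    """Return (line_number, text) pairs for lines mentioning any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        upper = line.upper()  # match case-insensitively
        if any(word in upper for word in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan one log file on disk; 'path' is whatever file you care about."""
    # errors="replace" keeps a stray bad byte from aborting the scan.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problem_lines(handle)
```

Calling `scan_log_file` on an application's log gives back the suspicious lines along with their line numbers, which is usually enough to know where to start reading.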
The adage about an ounce of prevention governs the work of an experienced sysadmin, who will expend no small amount of effort putting a lot of canaries in the coal mines. A big part of this job is avoiding common or recurring problems. Examples of this might include some of the following:
- Setting up a process that emails the sysadmin when a hard drive starts to run out of space.
- Reviewing log files every day. (This is deadly dull, but sometimes it identifies problems before they break things.)
- Keeping a detailed list of upgrades and configuration changes so you can put stuff back the way it was days or weeks later.
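The disk-space canary is simple enough to sketch. The Python below is only an illustration under some assumptions of mine: the 10% threshold, the example.com addresses, and the function names are all made up, and the message is built but not sent, since how mail gets delivered depends entirely on your setup.

```python
# Minimal sketch: build a warning email when a drive is running low.
# Threshold and addresses are placeholder assumptions, not recommendations.
import shutil
from email.message import EmailMessage


def free_fraction(path):
    """Fraction of the filesystem holding 'path' that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def build_alert(path, fraction_free, threshold=0.10,
                to_addr="sysadmin@example.com"):
    """Return an EmailMessage if space is below the threshold, else None."""
    if fraction_free >= threshold:
        return None  # plenty of room, nothing to report
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = "monitor@example.com"  # hypothetical sender address
    msg["Subject"] = f"Low disk space on {path}: {fraction_free:.0%} free"
    msg.set_content(
        f"Only {fraction_free:.1%} of the disk holding {path} is free."
    )
    return msg
```

A scheduled job could call `build_alert("/", free_fraction("/"))` every so often and, when the result is not `None`, hand the message to whatever mail mechanism the site uses (Python's `smtplib` is one option).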
Anyone who has been doing this kind of work for a while has horror stories, and I have a few of my own. I’ll write those up as short posts from time to time.