The Basics of Troubleshooting

by Joe Grant (@dba_jedi), Principal Architect

Troubleshooting

No matter how resilient you try to make an application system, something will always go wrong. The question then becomes what to do when you get the call that something is broken and it is your responsibility to fix it. That call can range from a simple “something is slow” to a report that the system is down. In this series of articles, I will walk through the process that I use when trying to find the cause of an issue.

In this first article we put first things first. These are the things you need to know before you get “the call.”

  • System design
  • Performance baselines
  • OS tooling

 

Know your system design

Before you can troubleshoot, you have to know how it is all supposed to work in the first place.

  • What operating systems, application servers, and database server technologies are involved?
  • How does the user connect to the application?
  • Is there a load balancer?
  • How does the application connect to the database?
  • How is the database configured?
  • How is storage attached?
  • What are all of the layers?
  • Are any of the layers virtualized?

 

Regardless of how simple or complicated the architecture is, you have to know how everything fits together. Even if you feel you don’t have the time to write it all down, it is important that you do. You may not be the only one who has to figure out what is wrong, and this documentation helps. Also, when you write it all down or explain it to someone else, your own understanding of the architecture improves.
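One way to make that documentation easy to keep current is to store it as data. The following is only a minimal sketch: the `Layer` class, the `request_path` helper, and the four-layer stack are all hypothetical examples, not a description of any real system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Layer:
    name: str                          # e.g. "load balancer"
    technology: str                    # OS or product, as you documented it
    virtualized: bool                  # does this layer run on a hypervisor?
    connects_to: Optional[str] = None  # name of the next layer down, if any


def request_path(layers: Dict[str, Layer], start: str) -> List[str]:
    """Follow connects_to links from `start` and return the layer names in order."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in documented layers at {current!r}")
        if current not in layers:
            raise KeyError(f"layer {current!r} is referenced but not documented")
        seen.add(current)
        path.append(current)
        current = layers[current].connects_to
    return path


# A made-up stack, purely for illustration.
LAYERS = {
    layer.name: layer
    for layer in [
        Layer("load balancer", "example appliance", False, "app server"),
        Layer("app server", "Linux VM", True, "database"),
        Layer("database", "Linux physical host", False, "storage"),
        Layer("storage", "SAN", False, None),
    ]
}

print(" -> ".join(request_path(LAYERS, "load balancer")))
```

The helper raises an error when a layer points at something that was never documented, which is a cheap way to notice gaps in your notes before you are on a call.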

Performance baselines

As much as I hate collecting baselines, they are one of the keys to successfully identifying and resolving any issue. Without trusted baselines, it is nearly impossible to examine the performance of a system and understand what has changed over time. It is therefore critical to know what “normal” looks like, because a baseline lets you determine more easily what has changed. With a trusted baseline, you can quickly determine whether the application is doing more I/O, using more CPU, or handling an increased number of client connections.
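To make the idea concrete, here is a rough sketch of comparing a current reading against a stored baseline. The metric names, the numbers, and the 25% tolerance are all assumptions chosen for illustration; real thresholds depend on your system.

```python
def compare_to_baseline(baseline, current, tolerance=0.25):
    """Return {metric: fractional change} for metrics that moved more than `tolerance`.

    A change of 0.5 means the current value is 50% above the baseline.
    """
    deviations = {}
    for name, normal in baseline.items():
        if name not in current:
            continue  # nothing to compare for this metric
        observed = current[name]
        if normal == 0:
            # A relative change is undefined here, so flag any non-zero reading.
            if observed != 0:
                deviations[name] = float("inf")
            continue
        change = (observed - normal) / normal
        if abs(change) > tolerance:
            deviations[name] = change
    return deviations


# Made-up numbers: what "normal" looked like versus what we see right now.
baseline = {"cpu_pct": 40.0, "io_reads_per_s": 1200.0, "client_connections": 150}
current = {"cpu_pct": 42.0, "io_reads_per_s": 3000.0, "client_connections": 155}

print(compare_to_baseline(baseline, current))  # only io_reads_per_s is flagged
```

In this made-up case CPU and connection counts are close to normal, so the comparison points you toward I/O first.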

OS tools

There are lots of commercial applications available that can make troubleshooting easier. This article is not a review of them. Instead, we will get back to basics: what can you learn from just the OS, the application, and a little knowledge? Some things to be familiar with:

System metrics tools

  • NMON – Short for Nigel’s Monitor. It is a system statistics collector for AIX and Linux. On Linux, it is a third-party tool that needs to be downloaded and installed. You can download it at http://nmon.sourceforge.net/pmwiki.php.
  • SAR – System Activity Report, available on most Unix and Linux systems. On Linux it is part of the sysstat package.
  • Sysstat and procps packages

– vmstat
– iostat
– mpstat

  • Perfmon – for Windows

– Task Manager
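The tools above are the right way to collect system metrics. Purely as an illustration of the kind of data they record, here is a small sketch that takes one timestamped reading using only the Python standard library. The function names, field names, and output file are my own assumptions, and this is in no way a replacement for nmon or sar.

```python
import json
import os
import shutil
import time


def take_snapshot(path="/"):
    """Collect a few basic readings for a single point in time."""
    try:
        # 1, 5, and 15 minute load averages (Unix-like systems only).
        load1, load5, load15 = os.getloadavg()
    except (AttributeError, OSError):
        load1 = load5 = load15 = None
    disk = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_total_bytes": disk.total,
        "disk_used_bytes": disk.used,
        "disk_free_bytes": disk.free,
    }


def append_snapshot(snapshot, out_file):
    """Append one snapshot as a JSON line so readings accumulate over time."""
    with open(out_file, "a") as handle:
        handle.write(json.dumps(snapshot) + "\n")
```

Calling `append_snapshot(take_snapshot(), "baseline.jsonl")` on a schedule (the file name is hypothetical) would build up a simple history, one reading per line.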

 

Unix/Linux OS commands
These are mostly Unix/Linux based, though some do translate to Windows environments. Sorry, Windows folks, I am a Unix/Linux geek.

  • uptime – Simple command to show how long the system has been running and the 1, 5, and 15 minute load averages
  • free – Shows memory usage
  • du, df – Disk usage and display free disk space
  • ps – List running processes (there are lots of switches to play with)
  • ping, traceroute, ifconfig – Network utilities
  • dd – Data Duplicator, can be a very useful tool for testing I/O
  • vi, view, cat, more, less – All are utilities that can be used to view log files

– Speaking of log files, make note of where the log files that you are most interested in are kept.

 

There are lots of other commands to be familiar with. This is just a start and is not meant to be a comprehensive list.
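Since log files came up in the list above, here is one possible sketch of pulling the most recent suspicious lines out of a log, roughly what you might otherwise do by paging through it with less. The function name and the keyword list are assumptions; use whatever strings your application actually logs.

```python
from collections import deque


def recent_matches(path, keywords=("error", "fail", "timeout"), limit=20):
    """Return the last `limit` lines of `path` that contain any keyword.

    Matching is case-insensitive. The file is read line by line, so large
    logs are not loaded into memory all at once.
    """
    lowered = [keyword.lower() for keyword in keywords]
    matches = deque(maxlen=limit)  # keeps only the newest `limit` matches
    with open(path, "r", errors="replace") as log:
        for line in log:
            text = line.rstrip("\n")
            if any(keyword in text.lower() for keyword in lowered):
                matches.append(text)
    return list(matches)
```

For example, `recent_matches("/var/log/myapp/app.log", limit=5)` would return the five newest matching lines, where the path is a hypothetical stand-in for the log location you noted down earlier.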

Summary

The important part of all of this is that you have to be familiar with these things before you actually need them. A conference call to resolve an issue is not the right time to figure out the architecture and how it is all supposed to work. You will also want to know what normal performance looks like, which means having real numbers, because you cannot rely on an end user saying “it just seems slow.” Lastly, you will want to be familiar with all of the OS tools and commands, as well as the information they provide, before you actually need them.
