David Woodard | Originally appeared in March 2006 Bricks newsletter. Content has been updated where appropriate.
In some companies, the Oracle DBA and the System Administrator are one and the same, and in others there are teams of both. Regardless of your work environment, knowing how to troubleshoot from the Unix operating system can help you discover problems that you could not find just by looking in Oracle. In this article I’m going to discuss some of my favorite utilities and show examples of how to use them. There are a great deal of useful tools and scripting languages, enough to fill several volumes. In this brief article, I will talk about those I use nearly every day to work with Oracle systems.
ps & pgrep
Of course the most used utility by anyone is the “ps” command. Some typical uses are to look at all processes for a user (e.g. ps -fu oracle), to look for a particular process by process ID (ps -fp PID), and to look for a process across the whole system (ps -ef|grep pmon). The “-e” option lists all processes running on the system, and “-f” give a “full” listing, which lists more columns of information, such at process start time, and the process ID of the parent process.
One of the problems in using “ps -ef|grep pmon” to look for the pmon process, for example, is that you also see the grep command itself:
$ ps -ef|grep pmon oraclexe 6179 1 0 20:00 ? 00:00:00 ora _ xe _ pmon _ XE oraclexe 6231 6149 0 20:01 pts/3 00:00:00 grep pmon
This is because the “pmon” argument to grep shows up on the command line. So, one thing you commonly see in a shell script to check for a process is “ps -ef|grep pmon|grep -v grep”. This excludes the grep command itself from the search. However, this starts another process, and takes time and resources. A better way is to use a regular expression with grep to keep it from matching itself. The simplest way is to put the first character of the search pattern in brackets: ps -ef|grep [p]mon
$ ps -ef|grep [p]mon oraclexe 6179 1 0 20:00 ? 00:00:00 ora _ xe _ pmon _XE
Grep sees the characters in brackets as a set and matches any of the characters you provide. We only put one character in the set, so “[p]” is the equivalent of searching for just “p”. The best part is that while the grep command line still has the brackets, grep itself strips them away and treats it as a single character, so we don’t match our own command line.
Now, typically what we are trying to do is check for the existence of a process, or get its process ID. Both ps tasks are better suited to the “pgrep” utility. Pgrep takes a string as an argument and returns the process ID’s where the program name matches that string. Now, one thing to keep in mind is that all Oracle background processes (pmon, smon, dbwr, lgwr) are all the same executable: $ORACLE_HOME/bin/oracle. By default “pgrep” searches for the executable name, not the process name. It’s possible for a program on Unix to change the process name that is displayed to something other than the executable name, and that is exactly what the Oracle background processes do. So to search for the process name we use the “-f” flag on pgrep (under Linux) to find our “pmon” process:
$ pgrep -f pmon oraclexe:[/usr/lib/oracle/xe]$ pgrep -f pmon 6179
Note that all we get is the PID, which is usually all we are after. See the man page for pgrep on your flavor of Unix to see how it works. By using pgrep, we only start one command, and since it’s not getting all the data that “ps -ef” does, it uses a lot less resources. If your OS does not come with pgrep, do some Googling, it may be available as open source.
Many Windows users are used to right clicking on a folder and seeing how much space that folder takes. I hear many Unix users complain that such functionality is not available in Unix. Well, it is, but since this is Unix, it’s available from the command line. While the “df” command shows me how full a file system is, the “du” utility lets me track down disk space usage by directory. If I type “du” from a directory, I will get a listing of all directories under it, and how much space they take in disk blocks. This is not always useful, since disk blocks are typically 512 bytes, and I am also interested in the files in this directory, as well as the directories below it.
Using “du” with the “-ks” flags solves the problem. “-k” gives me sizes in kilobytes, and -s summarizes the space for each item in the current directory, whether it is a file or directory. To track down the most space used in the /oradata directory, do this:
$ cd /oradata $ du -ks * |sort -n 8 oratmp 1708640 oratest1 3189856 oraprod1
Using “sort -n” sorts the output numerically, putting the largest disk space consumer at the bottom. If “oraprod1” was a file, we’d be done, having found the biggest space hog. Since it’s a directory, you would “cd” into it and do “du -ks * |sort -n” again, and continue looking for where your space it used. It may not be as easy as Windows, but it gets the job done.
One of my favorite utilities for Unix is “lsof”, which stands for “list open files”. I call it the Swiss Army knife of Unix tools. It comes with most every Linux distribution, but for other flavors of Unix, you must install it. It is open source and can be found at http://people.freebsd. org/~abe/. You should be able to find a compiled version for almost any version of Unix. To run it, you must either be root, or have the executable SETUID to root (preferred), since it has to look at kernel memory. Just running “lsof” with no arguments gives lists of every open file on the system. Keep in mind that nearly everything in Unix is a file, so you will see regular files, directories, pipes, and network connections.
One problem that you sometimes encounter is that the Oracle listener won’t start, or some Oracle Application Server component won’t start, and the error indicates that the port is already in use. The “-i” argument to “lsof” lets us find the processes that are using a particular port:
$ lsof -i:1521 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME ---------------------------------------------------------------------------- tnslsnr 6058 oraclexe 8u IPv4 24885 TCP *:1521 (LISTEN) oracle 6179 oraclexe 16u IPv4 27549 TCP winhost1:32903>hoblt8.hob.private:1521 (ESTABLISHED)
Note that we see process 6058, the Oracle listener, listening on port 1521, and we see one remote system, winhost1, connected to the listener. So, “lsof” can also be used to trace back what remote systems are connected to a port on your system. If you want to see all network connections and listening processes on your system, just issue “lsof -i” with no port number. The “netstat” utility on Linux has many of these features as well, but on other Unix platforms it can be impossible to trace a port to a process without “lsof”.
“lsof -p PID” will list all open files for a process, and “lsof |grep /path/to/file” will show you all processes that are using a given file. We’ll see a good example later of tracking down a specific file descriptor, or open file handle, for a process.
On most versions of Unix, there is a “fuser” utility to show what processes are using a file. It’s dangerous to SETUID this program so all users can use it, since that also gives them the ability to kill those processes. Using “lsof” is a safer, and much more flexible alternative.
Check back next week to read part two for insights on more great Unix utilities.