System Triage Part II – Host Performance Analysis Using Grid Control and Host Commands
Introduction
If you haven't read my previous blog titled Using
Deductive Reasoning and Communication Skills to Identify and Solve Performance
Problems, I highly suggest that you do so before continuing. That blog
provides recommendations you can use to determine which architectural
component is causing the problem. This blog assumes that you have determined
that the performance issue is somewhere in the database ecosystem (database,
operating system, hardware).
Determining the Scope of the Problem
We'll begin with the premise that the entire application is running slow. If
the problem is localized to a specific transaction, you won't need to perform
all of the steps I am providing. The next blog will be more pertinent to
transaction-specific performance problems. You will still be able to use a subset
of the investigative activities in this blog, because the problem determination
activities are very similar whether you are dealing with a specific transaction
or an application-wide performance problem. The remainder of this blog will focus
on Host Performance Analysis. We'll continue the discussion on System Triage in
subsequent blogs when we use the 10G Grid Control R2 toolset to identify the
specific transactions and SQL statements that are causing the performance problem.
Host Performance Analysis
We need to determine the health of the server that the database is running on.
We are able to use 10G Grid Control's host performance analysis capabilities
in conjunction with O/S commands to determine the current system load. In this
blog we'll be using UNIX as an example. Although the commands may be a little
different in LINUX, the output we will be evaluating will be very close. I'll
cover Windows system performance in an upcoming blog.
Let's begin by activating 10G Grid Control R2 and navigating to the Host's Home page. We accomplish that by selecting the 'Targets' tab at the top of 10G Grid Control R2's Home Page. 10G Grid Control R2 displays the Hosts Home Page.
10G Grid Control R2 will provide you with a listing of all hosts that it is monitoring. If you would like to learn how new database ecosystems are added into Grid Control, please refer to one of my previous blogs. Although the blog doesn't provide detailed installation guidelines, it does provide a high level overview of the Grid Control Management System.
Your next step will be to select the host that runs the database having the performance problem. Once the host's home page is displayed, you will want to click on the performance tab at the top of the screen to display the Host Performance Home Page.
You will want to review the utilization of the three primary resources: CPU, Memory and Disk. Take a look at the CPU chart. You'll notice that our server was experiencing a high level of CPU usage a short time ago. Here's another performance home page showing some definite CPU utilization problems.
Each of the primary resources displayed on the page provides drill down capabilities that allow you to further investigate the problem. The drilldown panels allow you to review past performance of CPU, memory and disk.
For more information on Using 10G Grid Control to evaluate Host Performance, please refer to the following blogs: Using 10G OEM Grid Control's Host Performance Monitoring and Tuning Features and Host Performance Monitoring Using 10G Enterprise Manager Grid Control. The information they provide will be pertinent to both 10G Grid Control R1 and R2 users.
Operating System Monitoring Tools
There is a host of UNIX performance monitoring tools that display performance
information. Two of my favorites are NMON and TOP. Here's an NMON
display verifying that we are indeed experiencing CPU problems.
NMON
Take a look at the CPU utilization display at the top
of the NMON page. The letter 'U' designates the CPU being consumed by User
processes, while the letter 'S' designates the CPU consumed by the system.
If you see a high number of S characters on the display and very few U characters,
it is time to contact your friendly system administrators and ask them why their
system is consuming such a high level of CPU. If you see a lot of U characters
and Oracle processes are the top resource consumers, your database is probably
the culprit.
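The U versus S rule of thumb above can be written down as a small helper. This is only a sketch: the function name, the 80 percent "busy" threshold and the wording of the hints are my own assumptions for illustration, not values taken from NMON or Grid Control.

```python
def classify_cpu(user_pct, system_pct, busy_threshold=80.0):
    """Return a rough triage hint from user and system CPU percentages.

    busy_threshold is an assumed cut-off for "the CPU is busy";
    tune it for your own environment.
    """
    total = user_pct + system_pct
    if total < busy_threshold:
        return "cpu not saturated"
    if system_pct > user_pct:
        # Mostly 'S' on the NMON display: the O/S itself is consuming the CPU.
        return "system-dominated: talk to the system administrators"
    # Mostly 'U' on the NMON display: user processes are consuming the CPU.
    return "user-dominated: check whether Oracle processes are the top consumers"


print(classify_cpu(85.0, 10.0))
print(classify_cpu(10.0, 85.0))
```

You would feed it the user and system percentages you read off NMON, Top or VMSTAT. The second half of the user-dominated hint still requires you to look at the top process list.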
For more information on NMON (including how to show the top processes), please refer to the IBM NMON home page. IBM's NMON is a free tool that is available for download for most UNIX and LINUX systems. I personally prefer it over the Top command.
Top
Let's take a look at a Top
screenshot. The important indicators are CPU States, Memory Utilization
and Top processes. Like NMON, Top also provides settings to provide you with
disk performance information. On both NMON and Top, you can use the Process
IDs to confirm the information displayed on the 10G
Grid Control R2 Host Performance Home Page.
Jonathan Lewis provides some helpful hints on the Top command in Appendix B of his book titled Practical Oracle 8i - Building Efficient Databases. Here's an addendum to the appendix that describes the output of the Top command that you may find useful. Although the book is on Oracle8i, the Top information is still pertinent.
Here's some additional information on the Top command that will help you understand Top. It is a little hard to read but the information it provides will be very useful. Lastly, if your system administrators have configured the MAN command, you can use the operating system manuals to retrieve information about most of the system commands we are discussing. If they don't have MAN configured, the first question you need to ask them is "why not?".
VMSTAT
The VMSTAT command is also available on many flavors of UNIX and Linux. VMSTAT
provides you much of the same information that TOP and NMON do, but it also
provides you with disk performance and system queueing information. Instead
of me regurgitating information on VMSTAT, here is an excellent
description of the VMSTAT utility. The author also shows you how to interpret
VMSTAT output. Although the author's discussion pertains to LINUX, like other
commands, the majority of information he provides will also pertain to the various
UNIX flavors.
Two of the display columns I review frequently when I am performing host performance analysis are the numeric values listed under the "r" and "b" columns in the output. The numbers under the "r" column in VMSTAT designate how many processes in the system are queued up and waiting to run. The higher the number, the bigger the problem. The numbers under the "b" column designate how many processes are blocked, typically waiting on a resource such as I/O, and can't do anything.
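Here is a minimal sketch of pulling the run queue and blocked process columns out of captured VMSTAT output. The sample text is made-up data laid out like typical LINUX VMSTAT output (column layout differs on other platforms), and comparing the run queue to the number of CPUs is a common rule of thumb that I am assuming here, not something VMSTAT reports itself.

```python
# Illustrative sample only; the numbers are invented for this example.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  2      0 123456   7890 456789    0    0    12    34  500  900 85 10  3  2  0
 1  0      0 123400   7890 456800    0    0     0     8  210  350 12  3 85  0  0
"""


def parse_vmstat(text):
    """Return one dict per sample line, keyed by the column names in the header."""
    header = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column header is the line that names both the r and b columns.
        if "r" in fields and "b" in fields:
            header = fields
            continue
        # Data lines are all numeric and have one value per header column.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            samples.append(dict(zip(header, map(int, fields))))
    return samples


def overloaded_samples(samples, cpu_count):
    """Return samples whose run queue exceeds the CPU count (assumed rule of thumb)."""
    return [s for s in samples if s["r"] > cpu_count]


samples = parse_vmstat(SAMPLE_VMSTAT)
for s in overloaded_samples(samples, cpu_count=4):
    print("run queue", s["r"], "blocked", s["b"])
```

In practice you would capture the text from something like `vmstat 5 5` and pass it to `parse_vmstat` instead of using the built-in sample.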
IOSTAT
LINUX and many of the UNIX variations provide the IOSTAT command to allow
users to analyze disk performance. IOSTAT displays kernel I/O statistics on
terminal, disk and CPU operations. By default, IOSTAT displays one line of statistics
averaged over the machine's run time. The use of -c presents successive lines
averaged over the wait period (flag meanings vary between IOSTAT implementations,
so check the MAN page on your platform). Here's an excellent
article that shows you how to use IOSTAT and interpret its output.
SAR
SAR is another command that can be used to evaluate host performance. One of
the benefits that SAR provides is that it allows you to run it via CRON on a
regular basis and spool the performance data to a SAR output file. Users are
then able to use the SAR command to retrieve historical performance statistics.
Once again, here's an article on how to use SAR to create historical performance reports. The article
also provides information on how to interpret its output.
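Once SAR history has been spooled, finding the worst interval is a matter of scanning the CPU report for the lowest idle percentage. The sketch below assumes a simple report layout with a time column followed by percentage columns ending in %idle. The sample rows are invented, and the column names on your platform may differ, so treat both as assumptions.

```python
# Illustrative sample only; real SAR output varies by platform and version.
SAMPLE_SAR = """\
00:00:01    %usr    %sys    %wio   %idle
01:00:01      12       3       1      84
02:00:01      88       9       2       1
03:00:01      20       4       1      75
Average       40       5       1      54
"""


def parse_sar_cpu(text):
    """Return one dict per interval with the time and each percentage column."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The header is the line that contains the %idle column name.
        if "%idle" in fields:
            columns = fields[1:]
            continue
        # Skip anything before the header and the trailing Average line.
        if columns is None or fields[0].startswith("Average"):
            continue
        if len(fields) != len(columns) + 1:
            continue
        try:
            values = [float(f) for f in fields[1:]]
        except ValueError:
            continue
        row = dict(zip(columns, values))
        row["time"] = fields[0]
        rows.append(row)
    return rows


def busiest_interval(rows):
    """Return the interval with the lowest idle percentage."""
    return min(rows, key=lambda r: r["%idle"])


rows = parse_sar_cpu(SAMPLE_SAR)
worst = busiest_interval(rows)
print(worst["time"], "idle:", worst["%idle"])
```

This makes it easy to line up the busiest interval with the time users reported the slowdown.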
Using 10G Grid Control in Conjunction with UNIX Monitoring Commands
10G Grid Control's Host Performance Home Page provides you with a quick snapshot
of the host's key performance indicators (CPU, Memory and Disk). Oracle's intent
was to provide you with this information so that you would not be forced to
log on to the operating system to evaluate CPU, memory and disk performance
indicators. Ninety percent of the time, I will use the key performance indicators
provided by Grid Control. They provide me with just the right amount of
diagnostic information I need to continue my investigation.
There are times when the system is so locked up that 10G Grid Control will "act up" when you attempt to access the host system's performance panels. That's a good indication that you are definitely experiencing some server resource problems.
In that case, your only choice is to fall back on the tried-and-true O/S commands that provide server performance information. I will also use O/S commands when I want to retrieve more detailed information than 10G Grid Control is able to provide.
Not all of the performance monitoring commands I have provided will be available on all flavors of UNIX and LINUX. Even if they are available, some of the commands must be specifically installed. You need to determine which tools are available for your platform, have your O/S admins install them, and use the information in the links I have provided to understand them.
In my next blog, we'll continue our performance analysis using 10G Grid Control's database performance analysis features.
Thanks for Reading,
Chris Foot
Oracle Ace