Troubleshooting 10G OEM Grid Control's Management Server Components
Prerequisites
For those of you who haven't read my previous blog, it may be a good idea to do so before continuing. I begin that blog by presenting a quick overview of the 10G Enterprise Manager environment. As I stated in my last blog, the management service receives monitoring data from the various agents and loads it into the management repository. The management console retrieves data from the management repository, organizes it and then displays it as information to the administrator via the HTML console interface.
Management Server Problem Determination and Analysis using 10G Enterprise Manager
Before we begin analyzing management server trace and log information, it is worth checking 10G Grid Control's Management Repository and Services panels, which often provide useful diagnostic information. We'll begin our analysis by reviewing the contents of the Management Services and Repository Overview panel. The overview panel provides general information about management services configuration.
But the panel also provides important 10G Grid Control management server processing information. Look at the first graph at the top of the screen. During normal processing, the Loader Backlog chart (the upper right-hand chart) will show a series of spikes. Notice that the blue line on our Loader Backlog chart shows a single spike. The spike means that a number of files were uploaded by the agent(s) and processed by the management server. If your blue line ever looks like the red line I have drawn as an example, your management server is not processing the uploaded files and you need to perform 10G Grid Control management server problem determination. Later in this blog, I'll show you where to find the uploaded files that are waiting to be processed.
When I click on the second tab, 10G Grid Control displays the Repository Operations panel. If you see a series of errors like the single error I have highlighted, you'll need to debug the problems on the management server. This screen print shows that I had one problem in the past. If we click on the Repository Metrics error shown in the red block, 10G Grid Control displays detailed information about the error. Although I have removed the name of the agent we were having problems with for security reasons, it was my personal Oracle lab environment, which kept showing "Status Pending" on 10G EM's Agent Administration panel.
You won't see a status pending message on the above screen print because I have since fixed the problem. I copied the text "sysman.metrics_severity_duplicates" and "out of time sequence" from the panel and pasted both into Metalink. I searched through the documents until I found one that described the problem we were encountering. After further analysis, I solved the problem by running the cleansing script provided at the bottom of my previous blog.
If I click on the navigation tab titled "Management Services", 10G Grid Control will display the Management Services and Repository panel. This panel contains a wealth of diagnostic information. Let's review the information from left to right. The first block of information is the name of our management service (name removed for security reasons), the service's current status (Up, Down, Pending) and the last error that was generated. The management service name is a link to a drilldown panel that provides more detailed diagnostic and configuration information.
The block of information to the right shows the number of files waiting to be loaded and the directory that contains them. If the management service isn't processing files being uploaded by the agents, you'll see a high number in the "Files Pending Load" column. You can log on to the host running the management server and navigate to the directory displayed in the "Load Directory" column to verify that there are files waiting to be processed. Don't get excited if you see hundreds of files waiting to be processed, or even if you have used up all of the available free space in the load directory. The agents send a LOT of information to the management server during the course of normal operations. At one time, we had over 2000 files waiting to be processed on our management server.
Fix the problem on the management server, refresh the Management Services and Repository Overview panel, and you should see the line on the Loader Backlog chart drop quickly. The column on the far right of the panel displays the timestamp of the oldest file that needs to be loaded. The older the date displayed, the longer the problem has been occurring. All of the information on this panel shows that my management server is processing normally.
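If you would rather confirm the pending file count and the oldest file from the command line than by eyeballing a directory listing, a minimal Python sketch along these lines may help. It assumes you hand it whatever path your "Load Directory" column displays. The function name summarize_backlog is my own invention for illustration and is not part of any Oracle tooling.

```python
from datetime import datetime
from pathlib import Path


def summarize_backlog(load_dir):
    """Return (file_count, oldest_mtime) for the regular files directly in load_dir.

    Subdirectories are ignored. oldest_mtime is a datetime, or None when the
    directory contains no files at all.
    """
    files = [entry for entry in Path(load_dir).iterdir() if entry.is_file()]
    if not files:
        return 0, None
    # The smallest modification time belongs to the file that has waited longest.
    oldest = min(entry.stat().st_mtime for entry in files)
    return len(files), datetime.fromtimestamp(oldest)
```

Calling summarize_backlog with the load directory path gives you a count and a timestamp that you can compare with what the panel reports. A count that keeps growing between runs suggests the loader is still not keeping up.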
Continuing Oracle Management Server Analysis
Let's continue our investigation by logging on to the host running 10G EM's management server. I have set my Oracle home to the management server installation's home directory. Before reviewing log and trace information, you can run the following two commands:
- opmnctl status - The Oracle Process Manager and Notification Server (OPMN) control utility displays the status of all components that comprise the 10G Grid Control management server installation.
- emctl status oms - The Enterprise Manager command line utility can start or stop the Management Service or, as here, display a message indicating whether or not it is running.
Both of the above commands have numerous arguments that are used to manage and analyze the various 10G EM components. The Oracle Enterprise Manager Advanced Configuration Guide manual available on technet.oracle.com provides a wealth of information on both the opmnctl and emctl utilities. Technet is Oracle's premier website for information dissemination and provides blogs, articles, white papers, software downloads and documentation. Registration is both free and easy.
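If you find yourself running both status commands every time you log on, you could wrap them in a small script. The following is only a sketch: it assumes opmnctl and emctl are on your PATH for the management server home, and the helper names run_status and run_all_status_checks are hypothetical. It prints whatever the utilities return and makes no attempt to interpret their output or exit codes.

```python
import subprocess

# The two status commands described above.
STATUS_COMMANDS = [
    ["opmnctl", "status"],
    ["emctl", "status", "oms"],
]


def run_status(command):
    """Run one command; return (exit_code, output), or None if the executable is missing."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold error text into the same output stream
            text=True,
        )
    except FileNotFoundError:
        return None
    return result.returncode, result.stdout


def run_all_status_checks():
    """Run each status command in turn and print its raw output."""
    for command in STATUS_COMMANDS:
        label = " ".join(command)
        outcome = run_status(command)
        if outcome is None:
            print(f"{label}: executable not found on PATH")
            continue
        exit_code, output = outcome
        print(f"{label} (exit code {exit_code})")
        print(output)
```

Call run_all_status_checks() from a session where your environment already points at the management server home.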
The directory structure that we will be spending all of our time in is the $ORACLE_HOME/sysman directory, where $ORACLE_HOME is the home directory for your 10G Grid Control management server installation. The SYSMAN/RECV directory is where you will find all of the agent upload files that are waiting to be processed. There is a SYSMAN/RECV/ERRORS subdirectory that contains files that were uploaded by the agents but could not be processed by the management server. You should start the problem determination process by reviewing the contents of both the SYSMAN/RECV and SYSMAN/RECV/ERRORS directories.
The two directories that contain important diagnostic information are SYSMAN/CONFIG and SYSMAN/LOG. This screen print displays the contents of the CONFIG directory. Although this directory contains numerous configuration files, the EMOMS.PROPERTIES file is THE main configuration file used to customize 10G Grid Control management server installations. Virtually any modification you need to make during the management server's lifecycle will be accomplished by editing this file.
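Before changing anything in the properties file, it can be handy to dump the current settings so you know what you started with. Here is a minimal sketch that assumes the file consists of plain key=value lines, with comment lines starting with # or !. It does not handle continuation lines or other separators, and read_properties is a name I made up for this example.

```python
def read_properties(path):
    """Parse simple key=value lines into a dict, skipping blank and comment lines."""
    settings = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith(("#", "!")):
                continue
            # Split on the first "=" only, so values may themselves contain "=".
            key, separator, value = line.partition("=")
            if separator:
                settings[key.strip()] = value.strip()
    return settings
```

Printing the sorted items of the returned dict before and after an edit gives you a quick record of exactly which settings changed.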
The last screen print shows the contents of the SYSMAN/LOG directory. This is the directory where you will be spending the bulk of your problem determination time. The two most important files used for debugging are EMOMS.LOG and EMOMS.TRC. We have solved 100% of our problems caused by the management server by reviewing the information contained in these two files. Review the most current error messages in these logs and paste them into Metalink. I fully understand that not all of your debugging activities will be as simple as ours have been. But as I stated, reviewing the information in EMOMS.LOG and EMOMS.TRC has helped us solve our management server problems.
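Both files can get large, so it may be easier to pull out just the most recent error lines before you start pasting into Metalink. The sketch below assumes that the entries you care about contain the literal text ERROR, which may not match every message in your logs, so adjust the keyword to suit. The function name last_matching_lines is hypothetical.

```python
from collections import deque


def last_matching_lines(path, keyword="ERROR", limit=20):
    """Return up to `limit` of the most recent lines in `path` that contain `keyword`."""
    # A bounded deque silently discards older matches as newer ones arrive.
    recent = deque(maxlen=limit)
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if keyword in line:
                recent.append(line.rstrip("\n"))
    return list(recent)
```

Run it once against the log file and once against the trace file, passing each file's full path under your own log directory, then search Metalink for the text of the newest messages first.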
Next Up
If the error messages in EMOMS.LOG and EMOMS.TRC don't help you solve the problem,
you'll need to activate more in-depth traces on the management server and agents.
In my next blog, I'll show you how.