Administering and Troubleshooting 10G OEM Grid Control Management Agents

As we continue down our path of 10G Grid Control enlightenment, I thought it might be advantageous to deviate from our discussions on the advisors for yet another blog. Keeping the communications flowing between the agents and the management console can be somewhat tricky at times. I will admit that most of our issues were "self-inflicted". If you are like us, you'll have to learn how to troubleshoot a problem or two before you gain that experience.

We have been installing and administering the 10G agents for some time now, and we think we have crested the top of the learning curve. Once you gain experience, the environment pretty much works as advertised. I thought I would give you a few helpful facts on how to administer the agents and also provide you with some information on how we solved some of the problems we have encountered. In an upcoming blog, I'll describe how to troubleshoot the management server.

Architecture Overview
Like previous versions of Enterprise Manager, 10G Grid Control uses a multi-tier architecture consisting of the HTML console, a management service with an integrated information repository, and management agents running on all monitored servers.

The management service receives monitoring data from the various agents and loads it into the management repository. The management console retrieves data from the management repository, organizes it and then displays it as information to the administrator via the HTML console interface.

The agents are programs that run continuously on all servers that are controlled by the Enterprise Manager architecture. Examples of the more popular targets found on the servers are databases, application servers and listeners.
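A rough sketch of how the tiers described above fit together (simplified; ports and protocols vary by configuration):

    HTML Console (administrator's browser)
          |
          v
    Management Service (OMS) <----> Management Repository (database)
          ^
          | monitoring data uploaded as XML files
          |
    Management Agent (one per monitored server)
          |
    Targets: databases, application servers, listeners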

Agent Installation and Configuration Differences
Let's discuss a few differences between the 10G agents and agents from previous releases. In releases prior to 10G, the agent software was installed during the installation of the target's software (database, application server). Administrators then started the agent on the target server and notified Enterprise Manager to begin administering the new target by running a discovery wizard from the central management console.

In 10G, the agent software is installed separately from the database and application server. During the agent software installation, the installer prompts the user for the node name of the server that is running the 10G management service. The target agent then contacts the management server and uploads its configuration information. This is a complete reversal of the way the process was executed in previous releases.

Agent Troubleshooting
But what if we lose communications between the target agents and the management service? Although we are by no means experts, we have done some troubleshooting from time to time.

If alerts and monitoring information aren't being received by the management service, there are only a few components that can be causing the communication breakdown. The problem could be on the management server itself; we'll take a look at troubleshooting that environment in an upcoming blog. The problem may also be in the network connectivity, or lack thereof. If you don't have network connectivity, you probably have more problems than just being unable to administer and monitor the target with 10G EM. More than likely, you will also be getting calls from irate users letting you know that they can't access the databases on that server. In that case, do what DBAs have been doing for years - blame the network administration boys.

The remainder of this blog will give you a few helpful hints on troubleshooting the target agent software. Let's take a look at the Agent Administration panel in 10G Grid Control. The agent administration screen lists all of the agents currently active in the 10G Grid Control environment. Each line displays the agent software version, status (up, down, problem), and the number of targets the agent is and isn't monitoring. Each agent's name is a link that drills down to the Agent Drill Down panel, which provides detailed information on that agent's configuration, status, resource utilization, targets monitored and upload information.

The most important piece of information on the agent administration panel is the column titled 'Last Successful Load'. Although I had to remove some of the information from this screen for security reasons, most of the agents show a Last Successful Load date of Sept. 12, while one shows August 30. That's a good indication that we are having problems with that agent.

The next section of the agent administration panel provides information on metric collection errors. Each line contains information pertaining to the agent and when the collection error occurred. If Oracle is able to provide additional information, it will display the error message as an HTML navigation link that allows the user to drill down into more specific information. I can either hover my mouse over the navigation link or click on it to display the more detailed error information.

Let's pick one of the errors and go through the error determination process. We'll start with the problem at the top of our metric collection errors report: 10G Grid Control was unable to run the Agent Process Statistics process on Sept. 3 at 6:46:27 AM.

The first step that needs to be executed when performing agent error determination is to log on to the server where the agent is running, navigate to the agent's home directory and run a status command on the agent. The following command displays pertinent information about the status and configuration of the agent running on the target:

> emctl status agent

If you look at the output, you will see that we have loaded 606.39 MEGs of XML data to the management server and we have 0 files and 0 MEGs of data pending upload. These numbers are telling us that we have a functioning agent on this platform. If your count of files pending upload increases continuously, you have an agent-to-management-server problem. If you want to see all of the variations of the EMCTL command, type "emctl" at the prompt with no arguments. Oracle will display a listing of all the variations of the EMCTL command on the screen.
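If you haven't seen this output before, it looks something like the following. The versions, paths and host name here are illustrative, not from our environment (only the upload figures match the numbers discussed above):

Oracle Enterprise Manager 10g Release 10.1.0.4.0.
Copyright (c) 1996, 2005 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 10.1.0.4.0
OMS Version       : 10.1.0.4.0
Protocol Version  : 10.1.0.2.0
Agent Home        : /u01/app/oracle/agent10g
Agent Process ID  : 12345
Agent URL         : https://dbhost.example.com:3872/emd/main
Started at        : 2005-09-12 08:15:02
Last successful upload                       : 2005-09-12 10:30:12
Total Megabytes of XML files uploaded so far :   606.39
Number of XML files pending upload           :        0
Size of XML files pending upload(MB)         :     0.00
---------------------------------------------------------------
Agent is Running and Ready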

This next screenshot displays the agent's parent directory structure. Notice that I am navigating to the SYSMAN subdirectory. SYSMAN is the parent directory for the subdirectories that contain the diagnostic information we will need to solve our problem. We'll be spending the bulk of our time in this directory structure. Although there are numerous files and subdirectories, I will be covering just the files that we have been using to debug our environment. If I don't, I'll have another one of the world's largest blogs.

The CONFIG subdirectory contains files that are used to tailor the agent to its host platform's configuration. The emd.properties file is the main configuration file, and it is the one we have had to edit from time to time to fix a few of our issues. The management service that the agent communicates with is identified by the REPOSITORY_URL parameter. If you want to change the host name or port of the management server that this agent communicates with, you will have to edit the REPOSITORY_URL parameter to reflect the new information. Later in this blog, I'll show you how to cleanse the old information from the directories to allow the agent to successfully connect to the new management server.
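For illustration, the line in emd.properties looks something like this (the host name here is made up; 4889 is the default upload port, but yours may differ):

REPOSITORY_URL=http://omshost.example.com:4889/em/upload/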

There is one other parameter in emd.properties that may cause a few problems. If your agent NEVER successfully uploads data after installation, check the TIMEZONE parameter in the emd.properties file. It is usually the last parameter in emd.properties. When our agent's time zones didn't match the time zone in the management server's configuration file, we weren't able to establish a successful connection between the agents and the management service. So if you have agentTZRegion=America/New_York in the agent's emd.properties file, you'll prevent a lot of headaches if you use the same time zone in the management server's emd.properties configuration file. We don't have any servers in different time zones, but there is a wealth of information contained in the 10G Grid Control documentation to help those that do. What I do know is that when the time zones didn't match, we couldn't make a connection.
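A quick way to check the agent's setting (emd.properties lives under sysman/config in the agent home):

> grep agentTZRegion $ORACLE_HOME/sysman/config/emd.properties
agentTZRegion=America/New_York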

The EMD subdirectory contains a few files and subdirectories that are important to us. The UPLOAD subdirectory is a holding area for files that will be uploaded to the management server. Lastupld.xml is pretty self-explanatory; it contains information on agent uploads to the management server. The file that you may be required to edit from time to time is the targets.xml file. This file contains information about all of the targets (databases, listeners, etc.) controlled by the agent on this platform.
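For reference, a listing of the directory shows the same files and subdirectories that appear in the cleansing commands later in this blog (your listing may include a few more entries):

> ls $ORACLE_HOME/sysman/emd
agntstmp.txt  blackouts.xml  collection  lastupld.xml  state  targets.xml  upload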

The DBSNMP user is the account that 10G Grid Control uses to log on to the database to perform activities on behalf of the 10G Enterprise Manager toolset. If you change the password for DBSNMP, you will have to edit targets.xml to reflect the new password. To do this, you change the value of the ENCRYPTED attribute to FALSE and enter the new password as the VALUE attribute on the password line for that database. I have also had to change the release identifier after upgrading a database being monitored by 10G Grid Control. If some targets are showing up on a platform while others are not, you may want to review the contents of targets.xml to see if it has entries for the missing targets. Once again, you can use the cleanse script I'll provide you with later in this blog to update targets.xml and upload that information to the management server.
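For illustration, a database entry in targets.xml looks something like the following (all names and values here are made up). After a DBSNMP password change, you would edit the password line as shown:

<Target TYPE="oracle_database" NAME="ORCL">
    <Property NAME="MachineName" VALUE="dbhost.example.com"/>
    <Property NAME="Port" VALUE="1521"/>
    <Property NAME="SID" VALUE="ORCL"/>
    <Property NAME="OracleHome" VALUE="/u01/app/oracle/product/10.1.0/db_1"/>
    <Property NAME="UserName" VALUE="dbsnmp"/>
    <Property NAME="password" VALUE="new_password" ENCRYPTED="FALSE"/>
</Target>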

The LOG subdirectory will probably be the area where you spend most of your time when you are evaluating a problem with the agent. This subdirectory contains several files that will assist you in the problem determination process. The most important debugging files are:

  • emagent.nohup - Agent watchdog log file
  • emagent.log - Main agent log file
  • emagent.trc - Main agent trace file
  • emagentfetchlet.log - Log file for Java Fetchlets
  • emagentfetchlet.trc - Trace file for Java Fetchlets

There is an excellent bulletin on Metalink that will provide you with detailed information on the contents of these logs. The bulletin number is 229624.1. Let's take a look at the contents of emagent.trc to see if we can find the problem. I have scrolled to the times that match the error that was displayed at the top of our error metrics report. The file contains dozens of lines before and after the time of the error stating that we are having an out-of-memory condition. It looks like we have to perform some further error-determination activities to solve the memory problem before we can fix the agent. This out-of-memory condition is most certainly affecting our database connections, so I'll need to fix it soon. Luckily for me, it is our DBA "playpen" that we use for our own testing (or I would have a bunch of irate users and/or developers clamoring for my head).
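When you know roughly when the problem occurred, a quick search of the trace file beats scrolling. Assuming your trace timestamps look like ours (for example, 2005-09-03 06:46:27), something like this narrows the search to the error window:

> cd $ORACLE_HOME/sysman/log
> grep -n "2005-09-03 06:46" emagent.trc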

The Cleansing Script
If you go to Metalink and do a search on the value 'sysman/emd/state/*' or document ID 303105.1, you'll see a series of commands that can be used to "cleanse" the environment, as we like to describe it here at Giant Eagle. Oracle recommends that you perform the following steps when you get a continuous 'status pending' message for a monitored target, change the name of the management server, reinstall the agent software, remove the agent from the management server and re-add it, or perform a host of other activities:

1. Stop the agent on the target node

emctl stop agent

2. Delete any pending upload files from the agent home

rm -r $ORACLE_HOME/sysman/emd/state/*
rm -r $ORACLE_HOME/sysman/emd/collection/*
rm -r $ORACLE_HOME/sysman/emd/upload/*
rm $ORACLE_HOME/sysman/emd/lastupld.xml
rm $ORACLE_HOME/sysman/emd/agntstmp.txt
rm $ORACLE_HOME/sysman/emd/blackouts.xml

3. Issue an agent clearstate from the agent home

emctl clearstate

4. Start the agent

emctl start agent

5. Force an upload to the OMS

emctl upload

If you change the name or port of the management server, you will need to run these commands on all platforms that are running the 10G agents. Because we have been moving our 10G EM Management Service from one server to another during our testing and implementation, we have created a script that automates the above commands. We have also used the above series of commands as a last resort when all other debugging avenues have failed. The script just seems to fix a lot of agent-to-server communication problems. Set your ORACLE_HOME to the Oracle agent's home directory and run the script. If you execute it by typing clean_up_oms.sh with no arguments, it will display the current ORACLE_HOME, which must be set to the agent's home directory for the script to run successfully. To execute the commands in the script, supply the letter 'Y' as the single argument.

clean_up_oms.sh Y
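
Our exact script isn't reproduced here, but a minimal sketch that automates the Metalink steps above might look like this. The log file location and the display-only logic are illustrative; tailor them to your shop:

#!/bin/ksh
# clean_up_oms.sh - cleanse the 10G agent state and force a fresh upload.
# ORACLE_HOME must point to the agent's home directory before running.
# Pass 'Y' as the only argument to actually execute the commands.

LOG=/tmp/clean_up_oms.log   # illustrative; change to your shop's log area

echo "Current ORACLE_HOME is: $ORACLE_HOME"

if [ "$1" != "Y" ]; then
        echo "Display-only mode. Verify ORACLE_HOME above is the agent home,"
        echo "then rerun as: clean_up_oms.sh Y"
        exit 0
fi

{
        # Step 1 - stop the agent on the target node
        emctl stop agent

        # Step 2 - delete any pending upload files from the agent home
        rm -r $ORACLE_HOME/sysman/emd/state/*
        rm -r $ORACLE_HOME/sysman/emd/collection/*
        rm -r $ORACLE_HOME/sysman/emd/upload/*
        rm $ORACLE_HOME/sysman/emd/lastupld.xml
        rm $ORACLE_HOME/sysman/emd/agntstmp.txt
        rm $ORACLE_HOME/sysman/emd/blackouts.xml

        # Step 3 - issue an agent clearstate
        emctl clearstate

        # Step 4 - start the agent
        emctl start agent

        # Step 5 - force an upload to the OMS
        emctl upload
} 2>&1 | tee $LOG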

Feel free to use this script and tailor it to your specific shop's requirements. You'll have to change the log output file and possibly some other file and directory names. The usual cautions apply. I can safely tell you that the script works in our AIX 5.2 and 5.3 environments and has saved us lots of time. I'd like to thank Jeff Kondas for writing the script and installing it on our 10G servers.

Thanks for reading!


Monday, September 19, 2005