Completing the Puzzle - Analyzing Agent to Management Server Communications
I’ll also provide a few hints that will help you determine where to start the debugging process (management server or agent). I’ll complete this series by showing you how to activate detailed tracing on the agent and management server components. If you can’t identify the problem by analyzing error messages normally generated by the agents and management server, you’ll have to activate more detailed traces to gather additional diagnostic data.
Target Agent or Management Server?
In my previous two blogs, I discussed troubleshooting the 10G Enterprise Manager agents and management server. My intent was to give you a head start up the problem determination and analysis learning curve. Now that we understand the agent and management server environments, we need to determine which component should be analyzed first. Should we start our investigation on the management server or the target agents?
The information below should help you determine the scope of the problem and where to start your analysis:
- If the problem is happening on all of the monitored hosts, start your problem determination on the management server and repository.
- If the problem is happening on a single host, check the status of the agent and then continue your problem determination on the individual targets before reviewing diagnostic information on the management server.
- If the problem is occurring on an individual target (e.g. unable to communicate with a database or listener) and not the entire agent, the problem could be a permission issue with the agent. Start your problem determination on the targets.xml file to determine the accounts and passwords being used.
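The three rules above can be sketched as a small decision function. This is only an illustration in Python: the names `StartPoint` and `where_to_start` are made up for this example, and the handling of a partial outage (several hosts, but not all of them) is an assumption, since the rules above only cover the all-hosts and single-host cases.

```python
from enum import Enum


class StartPoint(Enum):
    MANAGEMENT_SERVER = "management server and repository"
    AGENT_HOST = "agent on the affected host"
    TARGETS_XML = "targets.xml on the affected agent"


def where_to_start(affected_hosts: int, monitored_hosts: int,
                   single_target_only: bool = False) -> StartPoint:
    """Map the scope of a problem to the component to examine first."""
    if not 0 < affected_hosts <= monitored_hosts:
        raise ValueError("host counts must satisfy 0 < affected <= monitored")
    if single_target_only:
        # One target (database, listener) fails but the agent itself is fine.
        return StartPoint.TARGETS_XML
    if affected_hosts == monitored_hosts and monitored_hosts > 1:
        # Every monitored host is affected: suspect the management server.
        return StartPoint.MANAGEMENT_SERVER
    # Single host, or (assumption) a partial outage handled host by host.
    return StartPoint.AGENT_HOST


print(where_to_start(10, 10).value)  # management server and repository
print(where_to_start(1, 10).value)   # agent on the affected host
```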
10G Enterprise Manager Information
Another way to determine the problem's scope is to review the information displayed in 10G Grid Control's agent administration and management repository services panels:
- Management Services and Repository Overview panel. During normal processing, the Loader Backlog chart (upper right hand chart) will show a series of spikes. Notice that the blue line on our Loader Backlog chart shows a single spike. The spike means that a number of files were uploaded by the agent(s) and processed by the management server. If your blue line ever looks like the red line I have drawn as an example, your management server is not processing the uploaded files and you need to perform 10G Grid Control management server problem determination.
- Management Services and Repository panel. The first block of information is the name of our management service (name removed for security reasons), the service's current status (Up, Down, Pending) and the last error that was generated. The block of information to the right shows the number of files waiting to be loaded and the directory that contains them. If the management service isn't processing files being uploaded by the agents, you'll see a high number in the "Files Pending Load" column.
- Agent Administration panel. The agent administration screen lists all of the agents currently active in the 10G Grid Control environment. Each line displays the agent software version, status (up, down, problem), number of targets that are using the agent and the number of targets that aren't using the agent. Although I had to remove some of the information from this screen for security reasons, most of the dates show a Last Successful Load date of Sept. 12 while one shows a Last Successful Load date of August 30. That's a good indication that we are having problems with that agent. Each agent's name is a link that allows the user to view more detailed configuration information about that agent.
- Agent Drill Down Panel. This panel provides detailed information on the agent's configuration, status, resource utilization, targets monitored and upload information. The most important piece of information on the agent administration panel is the column titled 'Last Successful Load'.
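The Last Successful Load comparison described above is easy to automate once the dates are in hand. Here is a minimal Python sketch, assuming you have already copied the agent names and load dates out of the panel by hand. The function name `stale_agents`, the agent names, the year and the one-day threshold are all hypothetical choices for this example, and nothing here queries Grid Control itself.

```python
from datetime import date, timedelta


def stale_agents(last_loads: dict[str, date],
                 max_lag: timedelta = timedelta(days=1)) -> list[str]:
    """Return agents whose last successful load trails the newest one by more than max_lag."""
    if not last_loads:
        return []
    newest = max(last_loads.values())
    return sorted(name for name, loaded in last_loads.items()
                  if newest - loaded > max_lag)


# Hypothetical data mirroring the screen described above (the year is assumed).
loads = {
    "agent-a": date(2005, 9, 12),
    "agent-b": date(2005, 9, 12),
    "agent-c": date(2005, 8, 30),
}
print(stale_agents(loads))  # ['agent-c']
```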
Management Server and Agent Logs
We continue our analysis by logging on to the hardware servers that are hosting the management server and agent processes. I don't want to rehash the information I provided in the agent and management server troubleshooting blogs. The instructions contained in these blogs should help you to identify the problems that are preventing agent to management server communications from occurring.
The files and directories listed below will be used during the analysis process:
Agent
- Agent $OH/sysman/config/emd.properties - Agent configuration file.
- Agent $OH/sysman/log/emagent.log - Agent log information.
- Agent $OH/sysman/log/emagent.trc - Agent trace information.
Management Server
- OMS $OH/sysman/config/emoms.properties - Management server configuration file.
- OMS $OH/sysman/config/emomslogging.properties - Management server trace activation and configuration file.
- OMS $OH/sysman/log/emoms.log - Management server log information.
- OMS $OH/sysman/log/emoms.trc - Management server trace information.
- OMS $OH/sysman/recv/errors/*.* - Directory containing error messages pertaining to agent files that could not be processed.
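Before digging into the logs, it can save time to confirm that the files listed above are actually where you expect them. The Python sketch below does only that: it takes an Oracle home directory and reports which of the listed files exist, plus the names of any files sitting in the OMS recv/errors directory. The function names are made up for this example, and the relative paths are simply the ones from the lists above.

```python
from pathlib import Path

# Paths relative to the agent or OMS Oracle home ($OH), as listed above.
AGENT_FILES = [
    "sysman/config/emd.properties",
    "sysman/log/emagent.log",
    "sysman/log/emagent.trc",
]
OMS_FILES = [
    "sysman/config/emoms.properties",
    "sysman/config/emomslogging.properties",
    "sysman/log/emoms.log",
    "sysman/log/emoms.trc",
]
OMS_ERROR_DIR = "sysman/recv/errors"


def check_files(oracle_home, relative_paths):
    """Map each expected file to True or False depending on whether it exists."""
    home = Path(oracle_home)
    return {rel: (home / rel).is_file() for rel in relative_paths}


def pending_error_files(oms_home):
    """Return the names of files in the OMS recv/errors directory, sorted."""
    errors = Path(oms_home) / OMS_ERROR_DIR
    if not errors.is_dir():
        return []
    return sorted(p.name for p in errors.iterdir() if p.is_file())
```

Calling `check_files(agent_home, AGENT_FILES)` on the agent host and `check_files(oms_home, OMS_FILES)` on the management server host gives a quick picture of what is available for analysis.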
Activating Detailed Tracing
If the information in the agent and management server log and trace files doesn't give you enough to identify the problem, you may need to activate more detailed traces to gather additional diagnostic data. The 10G EM agent and management server components provide configuration files that allow administrators to activate more detailed tracing.
The possible logging levels for both the agents and management server components are:
- ERROR - Reports only critical errors.
- WARN - Reports critical errors and warnings.
- INFO - Includes informational messages.
- DEBUG - Full debug trace.
Agent Tracing
The agent's $OH/sysman/config/emd.properties file provides parameters that control tracing and logging file sizes and rotation limits. 10G Grid Control, by default, allows trace and log files to reach a maximum size of 4096 KB before renaming them and creating a new current file. The LogFileMaxRolls, LogFileMaxSize, TrcFileMaxRolls and TrcFileMaxSize parameters are used to tailor the file size and number of backups for tracing and logging files.
Logging is performed in a hierarchical manner with the tracelevel.main being the highest level. All other components inherit the logging level from the components above them in the hierarchy. The default logging level for tracelevel.main is WARN meaning that all agent modules use this setting as their default. I have provided a subset of the emd.properties file containing the modules and their default trace settings.
In the sample I provided, tracelevel.fetchlets would be a parent in the hierarchy and tracelevel.fetchlets.os, tracelevel.fetchlets.osline, tracelevel.fetchlets.oslinetok, etc. would be children of that parent component. If we change tracelevel.fetchlets' trace setting to DEBUG, all children components would inherit that level of tracing.
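To make the inheritance rule concrete, here is a minimal Python sketch of how an effective trace level could be resolved from a set of tracelevel parameters. It is a model of the behavior described above, not the agent's actual code: the function name `effective_level` is made up, `tracelevel.example` is a hypothetical component, and treating tracelevel.main as the final fallback is an assumption based on the description of it as the top of the hierarchy.

```python
def effective_level(component: str, settings: dict[str, str],
                    default: str = "WARN") -> str:
    """Resolve a component's trace level by walking up its dotted name."""
    name = component
    while "." in name:
        if name in settings:
            # The nearest explicitly configured ancestor (or the component itself) wins.
            return settings[name]
        # Drop the last segment: tracelevel.fetchlets.os -> tracelevel.fetchlets
        name = name.rsplit(".", 1)[0]
    # Nothing configured along the way: fall back to tracelevel.main (assumption).
    return settings.get("tracelevel.main", default)


settings = {"tracelevel.main": "WARN", "tracelevel.fetchlets": "DEBUG"}
print(effective_level("tracelevel.fetchlets.os", settings))  # DEBUG
print(effective_level("tracelevel.example", settings))       # WARN
```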
To activate more detailed tracing, change the component's associated trace parameter in the emd.properties file, and recycle the agent using the "emctl stop agent" and "emctl start agent" or "emctl reload agent" commands. Please note that the value supplied in the tracing parameter must be entered in uppercase letters.
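If you would rather script the change than edit the file by hand, something along these lines would work. This is a sketch only: `set_trace_level` is a hypothetical helper that operates on the text of a properties file, it assumes simple `name=value` lines, and it rejects anything that isn't one of the four uppercase levels listed earlier. It does not touch the agent; you still recycle or reload the agent afterwards with the emctl commands mentioned above.

```python
VALID_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG")


def set_trace_level(properties_text: str, parameter: str, level: str) -> str:
    """Return a copy of properties_text with parameter set to level."""
    if level not in VALID_LEVELS:
        raise ValueError(
            f"level must be one of {VALID_LEVELS} in uppercase, got {level!r}")
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        key, sep, _ = line.partition("=")
        # Commented-out lines keep their leading '#', so they never match.
        if sep and key.strip() == parameter:
            lines[i] = f"{parameter}={level}"
            replaced = True
    if not replaced:
        # Parameter not present yet: append it at the end of the file.
        lines.append(f"{parameter}={level}")
    return "\n".join(lines) + "\n"


text = "tracelevel.main=WARN\ntracelevel.fetchlets=WARN\n"
print(set_trace_level(text, "tracelevel.fetchlets", "DEBUG"))
```

Make a backup copy of emd.properties before writing the result back.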
Management Server Tracing
The steps to activate management server tracing are similar to the steps required to activate tracing for the agents. 10G Grid Control provides a tracing configuration file, $OH/sysman/config/emomslogging.properties, that allows administrators to activate and configure tracing for the management server and repository services.
To activate more detailed tracing on the management server, change the level in the "log4j.rootCategory=WARN, emlogAppender, emtrcAppender" parameter in the emomslogging.properties file and recycle the management server using the "emctl stop oms" and "emctl start oms" commands. Please note that, once again, the value supplied in the tracing parameter must be entered in uppercase letters.
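The rootCategory line carries both the level and a list of appender names, so only the first comma-separated value should change. A minimal Python sketch of that edit follows. The helper name `set_root_level` is made up for this example, and the appender names in the sample line are placeholders rather than the real ones; the function keeps whatever appender names your file contains.

```python
VALID_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG")


def set_root_level(line: str, level: str) -> str:
    """Replace the level in a log4j.rootCategory line, keeping the appender list."""
    key, sep, value = line.partition("=")
    if not sep or key.strip() != "log4j.rootCategory":
        raise ValueError("not a log4j.rootCategory line")
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS} in uppercase")
    parts = [part.strip() for part in value.split(",")]
    parts[0] = level  # the first value is the level; the rest are appender names
    return "log4j.rootCategory=" + ", ".join(parts)


# Placeholder appender names, for illustration only.
sample = "log4j.rootCategory=WARN, appenderOne, appenderTwo"
print(set_root_level(sample, "DEBUG"))
# log4j.rootCategory=DEBUG, appenderOne, appenderTwo
```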
Wrapup
I hope you enjoyed this mini-series on debugging agent to management server communication failures. The intention of this series was to create a foundation of knowledge that would assist you in the analysis process. Oracle's Metalink website provides a wealth of information on the 10G Enterprise Manager environment. We currently have a 100% success rate using Metalink documents to solve our agent to management server communication problems. I highly recommend that you leverage the information in Metalink early and often during problem analysis.