Let’s Get Technical! – Using Deductive Reasoning and Communication Skills to Identify and Solve Performance Problems
Remember the old days? When the applications our customers used were simple and problem determination was easy? The days of true client/server architectures that consisted of a fat client application running on the PC, a network and a database. High-level performance problem determination was a breeze. It was either the PC, the network or the database.
Well, we all know those days are looong gone. We can also agree that application complexity will continue to grow and continue to complicate the problem determination process. From Service Oriented Architectures and BPEL to advanced disk drive hardware configurations (and everything in-between), our environments are growing more challenging by the moment.
Performance Problem Analysis - System Triage
One of the most common mistakes made during a tuning exercise is not gathering enough high-level information during the initial stages of the problem determination process. I often have the same issue: I get an inkling of what I think the problem is and jump right in. As I stated in my introduction, the key to success is to begin the analysis at a high level and then continue by narrowing the scope of information collection until the problem is identified.
Once the problem is identified, we tune and tweak, record the same statistics we used to identify the problem and compare the before and after information. This iteration of recording/changing/recording/comparing is done again (and again) until the problem is corrected.
Asking the Right Questions to Help Identify the Problem
OK, so you have been called about a performance problem. What do you do? Suggestion - blame the network! OK, I'm kidding. What you need to do is contact the parties affected by the problem and perform a little detective work.
Before we continue, let's take a quick look at the application architecture diagram for the sample application that we'll be using in this discussion. It is a simple multi-tier application that uses a lightweight application on the PC, a workload balancer (CoyotePoint, HydraWeb, etc.), one or more web/application servers and finally, an Oracle database back end. Denver and Atlanta are remote sites, while the application hardware, additional business users and you are located in Dallas. Like most applications, each site runs a common set of programs along with a few programs that are site-specific. We can also assume that most load balancers are configured to transfer connections to the least utilized web/application server.
It's 2:30 PM on Friday. You are counting down to the weekend. You're thinking to yourself "this week wasn't so bad" when the phone rings. It's Mike the applications manager telling you that users are complaining about poor database performance and you need to fix it. The first question that comes to your mind is "Why does everyone always assume that it is a database problem?"
The DBA Detective
You decide to begin your detective work by going straight to the end-user (or users) experiencing the problem.
You ask Mike the manager for the end-user's contact information. He gives you the number for the business team lead in Denver. You make the call and begin your interrogation. Your first question is as follows:
How many users are affected by the problem?
If it is all users at a specific site:
You continue your investigation by calling the other remote site to determine if they are also affected by the problem.
If the other remote site is not experiencing the problem, continue your own investigation, but you should also place an emergency call to your friendly network specialists and ask them for help. Explain to them that you have one remote site that is running fine, while the other site is affected by poor application performance.
If the other remote site is also experiencing the problem, call one of your favorite users in Dallas. Since the hardware for the application is also in Dallas, this is an easy way to determine if the problem is localized to just the remote sites. Ask the user in Dallas if they can run the application. If the Dallas users are running fine, the performance of the remote network is now a suspect. Once again, continue your own investigation but also ask for assistance from network support.
If all sites are affected:
The list of possible suspects now grows. It could be any piece of the puzzle - network, load balancing, application server, database, hardware server. One of the tricks I have used to assist in my analysis of performance-sensitive applications is to talk with the application developers and ask them to send me a half dozen to a dozen performance-sensitive SQL statements from their application that don't perform data updates.
I will try to mimic the application as closely as possible. If the application is using stored procedures, it's easy: I'll create my own stored procedure. If the application uses bind variables, I'll code my program to use bind variables. Here's an example (courtesy of ace DBA Ron Berner) of how to use inline input to SQL bind variables that will be executed in SQL*PLUS. Please note that executing a SQL statement in SQL*PLUS requires that you define both input and output variables.
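A minimal sketch along those lines might look like the script below. The table and column names are hypothetical placeholders, but VARIABLE, EXEC and PRINT are standard SQL*PLUS commands:

-- A sketch of a SQL*PLUS bind variable test harness.
-- The app_orders table and its columns are hypothetical placeholders.
SET TIMING ON

-- Define the input and output bind variables
VARIABLE in_customer_id NUMBER
VARIABLE out_total      NUMBER

-- Provide the input value inline
EXEC :in_customer_id := 1001;

-- Execute the statement, capturing the result in the output variable
BEGIN
  SELECT SUM(order_total)
    INTO :out_total
    FROM app_orders
   WHERE customer_id = :in_customer_id;
END;
/

-- Display the result
PRINT out_total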
I'll run the statements when performance is good with timing on in SQL*PLUS to get a baseline. When I'm troubleshooting a performance problem, I'll execute the SQL statements and compare their performance against my baseline. I like to run the statements twice: once from my PC and once on the database server itself. If they run fine on the database server yet run poorly from my PC, the network is once again one of my prime suspects.
I'll continue my own quick database ecosystem (database, operating system, hardware server) triage before contacting other units. I'll be looking at CPU, memory and disk utilization and identifying processes that are consuming large amounts of resources. I'll log on to the database itself and look at top SQL, session waits, locking and system event statistics for more information. If the database is 10g, I'll use an entirely different set of tools than I would with 9i. I'll cover the different toolsets in upcoming blogs. I want to continue to focus on high-level investigation activities in this blog.
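As a rough illustration, a few of those database-side checks can be run straight from SQL*PLUS against the classic V$ views. The queries below are only a sketch; the exact columns and thresholds will vary by version and workload:

-- Top SQL by buffer gets (a rough indicator of the heaviest statements)
SELECT *
  FROM (SELECT sql_text, executions, buffer_gets, disk_reads
          FROM v$sqlarea
         ORDER BY buffer_gets DESC)
 WHERE ROWNUM <= 10;

-- What sessions are currently waiting on
SELECT sid, event, wait_time, seconds_in_wait
  FROM v$session_wait
 WHERE event NOT LIKE 'SQL*Net%'
 ORDER BY seconds_in_wait DESC;

-- Blocking locks: which session is holding up which
SELECT l1.sid blocker, l2.sid waiter, l1.type, l1.id1, l1.id2
  FROM v$lock l1, v$lock l2
 WHERE l1.block = 1
   AND l2.request > 0
   AND l1.id1 = l2.id1
   AND l1.id2 = l2.id2;

-- System-wide wait events since instance startup
SELECT event, total_waits, time_waited
  FROM v$system_event
 ORDER BY time_waited DESC;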
If the statements run close to the baseline from both my PC and the database server, and my initial review of the database ecosystem looks like it is working as it should, I don't immediately rule it out as the cause. But it will affect the way I continue my analysis. Before I begin my own more detailed database ecosystem performance analysis, I will contact the technicians responsible for the application and network tiers. All groups involved can work in unison to identify the problem.
I'll ask the folks responsible for the application tier to log on to their web/application servers and review the same CPU, memory and disk utilization statistics that I would on a database server. The application tier folks can also review connection input/output queues. I will also ask them to compare performance statistics amongst all hardware components of the web/application tier. If one web/application server is at 100% utilization while the others are at 20%, the workload balancer now becomes one of the suspects.
I will also ask the network and server specialists to review network performance of all hops from the database server to the remote and local sites. As I stated previously, I don't arbitrarily rule out the database ecosystem if the SQL statement runs fine on the server, but I know that other components of the application also need to be investigated.
If it is a select group of users having a problem:
If Mike the Manager states that only a subset of users is experiencing a problem, you will need to identify the specific transactions they are executing. It's time to talk one-on-one with a business user that is having the problem. If only one user is experiencing the problem, ask them if they can log on to the application using a fellow user's computer and execute the same transaction. If performance is acceptable, ask the user to run other transactions. If those continue to run with acceptable performance, ask them to log back on to their original PC and run the same workload again. If performance is once again poor, their PC and/or network connection could be the culprit.
If it is a group of users, or the user continues to experience poor performance on all PCs they try, it's time to get the developers involved and ask a few more questions:
- Is it the entire application running slow or just specific screens/transactions?
- How much slower is it?
- Is the problem intermittent or continuous?
- Has this problem occurred before?
- If so, is it getting worse?
- Does the problem occur at the same time every day?
After the line of questioning is complete, you will need to determine exactly what transaction is causing the problem. It's a relatively simple process if all users come in with unique accounts and/or program identifiers. Log on to Grid Control or 9i OEM, ask the user to log on, and look at the sessions or top sessions screens. Both toolsets provide components that allow you to drill down and identify what SQL statements a specific transaction is executing. In addition, both toolsets provide screens that show database locking information.
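If you prefer SQL*PLUS to the OEM screens, a query along these lines ties a user's session to the SQL it is currently running. It's only a sketch: the username is a hypothetical placeholder, and in 10g you could join on SQL_ID instead of the address/hash value pair used here:

-- Find the SQL a specific user's session is currently executing
-- ('JSMITH' is a hypothetical username)
SELECT s.sid,
       s.serial#,
       s.username,
       s.program,
       q.sql_text
  FROM v$session s,
       v$sql     q
 WHERE s.sql_address    = q.address
   AND s.sql_hash_value = q.hash_value
   AND s.username       = 'JSMITH';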
An application that uses a common ID and program complicates the problem identification process. Having an application developer on hand who knows the SQL being executed by the various transactions can be invaluable. Oracle 9i and Oracle 10g both provide components that allow you to monitor the current SQL being executed. More on that later...
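As a rough sketch of what that monitoring can look like when everyone shares one login, the query below breaks the common account's active sessions down by PROGRAM, MODULE and CLIENT_INFO. The 'APPUSER' account name is a hypothetical placeholder, and the MODULE and CLIENT_INFO columns are only helpful if the application populates them:

-- Break a shared application account down by program/module
-- ('APPUSER' is a hypothetical common login)
SELECT s.sid,
       s.program,
       s.module,
       s.client_info,
       q.sql_text
  FROM v$session s,
       v$sql     q
 WHERE s.sql_address    = q.address
   AND s.sql_hash_value = q.hash_value
   AND s.username       = 'APPUSER'
   AND s.status         = 'ACTIVE';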
As I stated previously, I don't want to get too in-depth in this blog. In upcoming blogs, we'll take a look at what tools you can use to identify poor performance when you find out that it is your database causing the problem.
The intent of this blog was not to persuade you to follow the steps it provides verbatim; it was to provide you with a few high-level recommendations to help you quickly determine what component is causing the problem. Look for more detailed information in upcoming blogs.
Thanks for Reading,
Chris Foot
Oracle Ace