Oracle Wait Analysis Techniques
Oracle provides special views called v$session_wait and v$system_event that give detailed information about the wait state for ongoing Oracle transactions. This article explores how Oracle waits can be analyzed to look for system-wide bottlenecks and individual objects that experience wait issues.
Wait event analysis for Oracle cane be broken down into three areas:
- Time-based event analysis - The Oracle STATSPACK utility can be used to show trends in wait event over long time periods, and fluctuations in waits can often provide useful information.
- System-wide event analysis - We will start at a high level and show scripts that can be used to track system wide events, show events for background processes.
- Session wait events -- These are real-time waits at the exact moment that you run the script. We will show techniques for locating specific database objects that may be causing the wait issue.
Let's begin with a quick review of Oracle wait events so we clearly understand how they can help us tune our database.
When running the system-wide scripts, you will see areas where the database is spending most of its time waiting. However, high waits do not always indicate a bottleneck or a problem. High waits may indicate a bottleneck, but some waits are a normal part of database operations. In general, the system-wide wait events tell us where the database spends most of its time.
For example, high waits on 'db file sequential read' events may indicate a disk bottleneck, but you must check your average disk queue length for each disk spindle to be sure that these waits are abnormal. In one case, an Oracle shop ran the script for system-wide wait events and discovered that their RAID-5 configuration was causing a huge amount of disk enqueues. The shop reorganized their disk to RAID 0+1 and experienced a 3x performance improvement for the whole database.
Prerequisites for Monitoring Oracle System Waits
Collecting Oracle wait events requires that the Oracle parameter is timed_statistics=true. In Oracle9i this is the default, but Oracle9i and prior require you to use set timed_statistics.
We also have to filter out wait events that are not helpful to our tuning effort. The list below shows all system "idle" wait events that have no meaningful information:
- dispatcher timer
- lock element cleanup
- Null event
- parallel query dequeue wait
- parallel query idle wait - Slaves
- pipe get
- PL/SQL lock timer
- pmon timer
- rdbms ipc message
- slave wait
- smon timer
- SQL*Net break/reset to client
- SQL*Net message from client
- SQL*Net message to client
- SQL*Net more data to client
- virtual circuit status
Once we have proper scripts ready, we can start exploring Oracle wait events. Let's start with time-series wait event analysis using the Oracle STATSPACK utility.
Event Wait Analysis with STATSPACK
In STATSPACK, you can get a snapshot of Oracle wait events every hour and plot changes in wait behavior over time. You can also set thresholds, and report only on wait events that exceed your pre-defined threshold. Here is the script that is commonly used for exception reporting of wait events.
-- prompt -- prompt -- prompt *********************************************************** -- prompt Excessive waits on background events -- prompt *********************************************************** -- prompt ttitle 'High waits on background events|Rollup by hour' column mydate heading 'Yr. Mo Dy Hr' format a13; column event format a30; column total_waits heading 'tot waits' format 999,999; column time_waited heading 'time wait' format 999,999; column total_timeouts heading 'timeouts' format 9,999; break on to_char(snap_time,'yyyy-mm-dd') skip 1; select to_char(snap_time,'yyyy-mm-dd HH24') mydate, e.event, e.total_waits - nvl(b.total_waits,0) total_waits, e.time_waited - nvl(b.time_waited,0) time_waited, e.total_timeouts - nvl(b.total_timeouts,0) total_timeouts from stats$bg_event_summary b, stats$bg_event_summary e, stats$snapshot sn where snap_time > sysdate-&1 and e.event not like '%timer' and e.event not like '%message%' and e.event not like '%slave wait%' and e.snap_id = sn.snap_id and b.snap_id = e.snap_id-1 and b.event = e.event and e.total_timeouts > 100 and ( e.total_waits - b.total_waits > 100 or e.time_waited - b.time_waited > 100 ) ;
Here is the output from this script. As we can see, we get a time-series result, showing us those days and hours when our thresholds are exceeded. If we look at this listing, we see that every evening between 10:00 p.m. and 11:00 p.m., we experience high waits on the redo logs.
Wed Aug 21 page 1 High waits on background events Rollup by hour Yr. Mo Dy Hr EVENT tot waits time wait timeouts ------------- ------------------------------ --------- --------- -------- 2002-08-18 22 LGWR wait for redo copy 9,326 1,109 286 2002-08-18 23 LGWR wait for redo copy 8,506 316 8 2002-08-18 23 buffer busy waits 214 21,388 206 2002-08-19 00 LGWR wait for redo copy 498 5 0 2002-08-19 01 LGWR wait for redo copy 497 15 0 2002-08-19 22 LGWR wait for redo copy 9,207 1,433 427 2002-08-19 22 buffer busy waits 529 53,412 515 2002-08-19 23 LGWR wait for redo copy 9,066 367 9 2002-08-19 23 buffer busy waits 250 24,479 236 2002-08-20 00 LGWR wait for redo copy 771 16 0 2002-08-20 22 LGWR wait for redo copy 8,030 2,013 634 2002-08-20 22 buffer busy waits 356 35,583 343 2002-08-20 23 LGWR wait for redo copy 8,021 579 86 2002-08-20 23 buffer busy waits 441 44,677 432 2002-08-21 00 LGWR wait for redo copy 1,013 26 1 2002-08-21 00 rdbms ipc reply 160 30,986 149 2002-08-21 01 LGWR wait for redo copy 541 17 0
Using Queries Against v$event_wait
Here is the most commonly used script for displaying system-wide events, called current_waits.sql.
set pages 999 set lines 90 column c1 heading 'Event|Name' format a30 column c2 heading 'Total|Waits' format 999,999,999 column c3 heading 'Seconds|Waiting' format 999,999 column c4 heading 'Total|Timeouts' format 999,999,999 column c5 heading 'Average|Wait|(in secs)' format 99.999 ttitle 'System-wide Wait Analysis|for current wait events' select event c1, total_waits c2, time_waited / 100 c3, total_timeouts c4, average_wait /100 c5 from sys.v_$system_event where event not in ( 'dispatcher timer', 'lock element cleanup', 'Null event', 'parallel query dequeue wait', 'parallel query idle wait - Slaves', 'pipe get', 'PL/SQL lock timer', 'pmon timer', 'rdbms ipc message', 'slave wait', 'smon timer', 'SQL*Net break/reset to client', 'SQL*Net message from client', 'SQL*Net message to client', 'SQL*Net more data to client', 'virtual circuit status', 'WMON goes to sleep' ) AND event not like 'DFS%' and event not like '%done%' and event not like '%Idle%' AND event not like 'KXFX%' order by c2 desc ;
Below is the output from this script. As we see, the report shows all cumulative waits since instance startup time.
Tue Aug 20 page 1 System-wide Wait Analysis for current wait events Average Event Total Seconds Total Wait Name Waits Waiting Timeouts (in secs) ------------------------------ ------------ -------- ------------ --------- db file sequential read 2,902,850 3,829 0 .001 latch free 2,248,864 496 1,524,235 .000 PX Deq: Table Q Normal 2,080,463 4,469 248 .002 PX Deq Credit: send blkd 1,321,019 52,251 23,032 .040 direct path read 986,951 6,931 0 .007 PX Deq Credit: need buffer 800,970 1,091 84 .001 log file parallel write 415,175 5,078 2 .012 direct path write 321,096 9,342 0 .029 PX Deq: Execution Msg 198,768 56,384 26,020 .284 log file sequential read 118,480 164 0 .001 PX Deq: Execute Reply 92,487 5,628 2,407 .061 log file sync 87,295 1,327 7 .015 db file scattered read 70,112 162 0 .002 enqueue 44,335 1,468 354 .033 PX Deq: Join ACK 42,503 77 0 .002 file open 28,545 8 0 .000 PX Deq: Signal ACK 26,753 253 6,705 .009 log file switch completion 54 18 5 .341 control file parallel write 23,785 340 0 .014 SQL*Net more data from client 19,847 878 0 .044 PX Deq: Parse Reply 17,885 26 0 .001 db file parallel write 17,745 1,293 4 .073 PX Deq: Msg Fragment 15,612 18 8 .001 rdbms ipc reply 11,013 167 66 .015 PX Deq: Table Q qref 7,778 2 0 .000 LGWR wait for redo copy 7,608 2 43 .000 control file sequential read 4,098 1 0 .000 buffer busy waits 2,970 9 0 .003 PX Deq: Table Q Sample 2,053 25 27 .012 library cache pin 1,353 6 0 .004 PX Deq: Table Q Get Keys 530 1 0 .001 local write wait 317 177 163 .560 file identify 315 0 0 .000 refresh controlfile command 214 0 0 .002 PX qref latch 135 105 102 .780 log file single write 80 2 0 .021 imm op 67 0 0 .000 process startup 61 2 0 .036 write complete waits 33 34 33 1.032 library cache load lock 25 6 2 .256 log buffer space 20 6 0 .286 db file single write 6 0 0 .007 row cache lock 5 0 0 .006 db file parallel read 2 0 0 .010 instance state change 2 0 0 .000 reliable message 1 0 0 .010 library cache lock 1 0 0 .020
This should give a list of "current" wait events, but some events are more important than others. These are the ones to watch out for:
- async disk I/O
- control file parallel write
- control file sequential read
- db file parallel write
- db file scattered read
- db file sequential read
- direct path read
- direct path write
- log file parallel write
- log file sync
For details on singular wait events, consult Appendix A of the Oracle9i Reference Manual or check MetaLink.
Once we have a system-wide picture of wait events, we can drill down into wait events that are directly associated with background processes.
set pages 999 set lines 90 column c1 heading 'System|ID' format 999 column c2 heading 'Prcs' format a10 column c3 heading 'Event|Name' format a30 column c4 heading 'Total|Waits' format 999,999 column c5 heading 'Time|Waited|(in secs)' format 999,999 column c6 heading 'Avg|Wait|secs' format 999 column c7 heading 'Avg|Wait|secs' format 999 column c8 heading 'Max|Wait|(in secs)' format 999 ttitle 'System-wide Wait Analysis|Detail Analysis|for current wait events' select b.sid c1, decode(b.username,NULL,c.name,b.username) c2, event c3, a.total_waits c4, round((a.time_waited / 100),2) c5, a.total_timeouts c6, round((average_wait / 100),2) c7, round((a.max_wait / 100),2) c8 from sys.v_$session_event a, sys.v_$session b, sys.v_$bgprocess c where event NOT LIKE 'DFS%' and event NOT LIKE 'KXFX%' and a.sid = b.sid and b.paddr = c.paddr (+) and event NOT IN ( 'lock element cleanup', 'pmon timer', 'rdbms ipc message', 'smon timer', 'SQL*Net message from client', 'SQL*Net break/reset to client', 'SQL*Net message to client', 'SQL*Net more data to client', 'dispatcher timer', 'Null event', 'io done', 'parallel query dequeue wait', 'parallel query idle wait - Slaves', 'pipe get', 'PL/SQL lock timer', 'slave wait', 'virtual circuit status', 'WMON goes to sleep' ) order by 4 desc ;
Here is the output from this script. Here we see each background process and the time each has spent waiting within Oracle.
Tue Aug 20 page 1 System-wide Wait Analysis Detail Analysis for current wait events Time Avg Avg Max System Event Total Waited Wait Wait Wait ID Prcs Name Waits (in secs) secs secs (in secs) ------ ---------- ------------------------------ -------- --------- ---- ---- --------- 3 LGWR log file parallel write 415,082 5,076 2 0 1 11 ARC0 log file sequential read 108,834 139 0 0 0 4 CKPT control file parallel write 23,035 327 0 0 1 4 CKPT file open 20,842 4 0 0 0 2 DBW0 db file parallel write 17,684 1,292 4 0 1 4 CKPT direct path read 9,745 106 0 0 6 25 ARC1 log file sequential read 9,603 25 0 0 0 3 LGWR LGWR wait for redo copy 7,608 2 43 0 0 4 CKPT direct path write 3,743 2 0 0 0 5 SMON db file scattered read 1,824 7 0 0 0 4 CKPT control file sequential read 1,355 0 0 0 0 5 SMON db file sequential read 794 2 0 0 0 156 CPA1_OWNER db file sequential read 677 3 0 0 0 3 LGWR control file sequential read 468 0 0 0 0 11 ARC0 control file sequential read 433 0 0 0 0 3 LGWR control file parallel write 271 6 0 0 0 2 DBW0 control file sequential read 253 0 0 0 0 3 LGWR file open 190 0 0 0 0 160 CPA1_USER log file sync 152 2 0 0 0 25 ARC1 control file sequential read 143 0 0 0 0 2 DBW0 file open 112 0 0 0 0 5 SMON file open 111 0 0 0 0 2 DBW0 direct path read 108 0 0 0 0 3 LGWR direct path read 108 2 0 0 0 2 DBW0 file identify 108 0 0 0 0 3 LGWR direct path write 107 0 0 0 0 11 ARC0 control file parallel write 103 2 0 0 0 11 ARC0 file identify 102 0 0 0 0 12 file open 91 0 0 0 0 16 file open 90 1 0 0 0 20 file open 89 1 0 0 0 3 LGWR log file single write 80 2 0 0 0 3 LGWR file identify 80 0 0 0 0 183 CPA1_USER log file sync 73 1 0 0 0 11 ARC0 file open 70 0 0 0 0 3 LGWR imm op 67 0 0 0 0 3 LGWR log file sequential read 43 0 0 0 0 15 file open 38 0 0 0 0 6 RECO db file sequential read 36 0 0 0 0 156 CPA1_OWNER latch free 34 0 21 0 0 160 CPA1_USER latch free 30 0 20 0 0 2 DBW0 latch free 29 1 29 0 0 5 SMON latch free 17 0 11 0 0 125 SYS latch free 14 0 7 0 0 183 CPA1_USER latch free 14 0 9 0 0 3 LGWR latch free 11 0 11 0 0 156 CPA1_OWNER file open 11 0 0 0 0 25 ARC1 control file parallel write 9 0 0 0 0 183 CPA1_USER db file sequential read 9 0 0 0 0 25 ARC1 file identify 9 0 0 0 0 115 CPA1_USER log file sync 8 0 0 0 0 25 ARC1 file open 8 0 0 0 0 4 CKPT file identify 4 0 0 0 0 15 latch free 3 0 3 0 0 2 DBW0 process startup 3 0 0 0 0 160 CPA1_USER db file sequential read 3 0 0 0 0 11 ARC0 enqueue 2 4 0 2 3 132 CPA1_USER log file sync 2 0 0 0 0 125 SYS db file sequential read 2 0 0 0 0 4 CKPT latch free 1 0 1 0 0 12 latch free 1 0 1 0 0 11 ARC0 process startup 1 0 0 0 0 3 LGWR enqueue 1 3 0 3 3 6 RECO file open 1 0 0 0 0 183 CPA1_USER file open 1 0 0 0 0 125 SYS file open 1 0 0 0 0 160 CPA1_USER file open 1 0 0 0 0 166 ANAND file open 1 0 0 0 0 166 ANAND db file sequential read 1 0 0 0 0 125 SYS log file sync 1 0 0 0 0 5 SMON buffer busy waits 1 0 0 0 0 25 ARC1 enqueue 1 0 0 0 0 3 LGWR process startup 1 0 0 0 0 16 latch free 1 0 1 0 0 6 RECO latch free 1 0 1 0 0
From all of the above reports, we see the following areas for improvement:
- Event waits for parallel query dequeues. We need to check the default degree of parallelism for database objects and turn-off parallelism at the system level.
- Event waits for "db file sequential reads." This is most likely due to segment header contention on the indexes, but it could also be due to disk contention. We will start by increasing the number of freelists on the indexes. If the waits persist, the offending index should be striped across multiple disk spindles.
These are very general wait conditions, but they can sometimes be fixed by changing parameters or object characteristics. Some possible solutions include:
- Improve the optimizer's propensity to use indexes by adjusting optimizer_index_cost_adj to a value less than 10, thereby reducing unnecessary full-table scans
- Check and adjust the value of parallel_threads_per_cpu to reduce parallel query
- Check for segment header contention/waits on index headers
- Create multiple segment header blocks for stressed indexes (using alter index xxx storage (freelists 1))
- Distribute heavy impact tables and indexes onto a faster disk, or stripe the object across multiple disks.
Session Detail Event Waits
Once we have covered the system and background waits, we can find tasks that are currently waiting. This script is tricky because it must be running at the exact moment of a wait. Some Oracle professionals run this script every 60 seconds, sending an e-mail when an important wait occurs:
wait_detail_process.sql script.
clear columns set pages 999 set lines 100 col c1 heading 'SID' format 999 col c2 heading 'User|Name' format a16 col c3 heading 'Event|Name' format a10 col c4 heading 'Secs|Wait' format 9.99 col c5 heading 'Wait|Time' format 9.99 col c6 heading 'state' format a10 col c7 heading 'P1 text' format a10 col c8 heading 'P1|Val' format 999,999,999 col c9 heading 'P1|Raw' format a10 col c10 heading 'P2|Text' format a10 col c11 heading 'P2|Val' format 999,999,999 col c12 heading 'P2|Raw' format a10 col c13 heading 'P3|Text' format a10 col c14 heading 'P3|Val' format 999,999,999 col c15 heading 'P3|Raw' format a10 ttitle 'Individual process wait times' SELECT a.sid c1, decode( b.username, NULL,c.name, b.username) c2, a.event c3, a.seconds_in_wait c4, a.wait_time c5, a.state c6, a.p1text c7, a.p1 c8, -- a.p1raw c9, a.p2text c10, a.p2 c11, -- a.p2raw c12, a.p3text c13, a.p3 c14 -- a.p3raw c15 FROM sys.v_$session_wait a, sys.v_$session b, sys.v_$bgprocess c where event NOT LIKE 'DFS%' and event NOT LIKE 'KXFX%' and a.sid = b.sid and b.paddr = c.paddr (+) and event NOT IN ('lock element cleanup', 'pmon timer', 'rdbms ipc message', 'smon timer', 'SQL*Net message from client', 'SQL*Net break/reset to client', 'SQL*Net message to client', 'SQL*Net more data to client', 'dispatcher timer', 'Null event', 'parallel query dequeue wait', 'parallel query idle wait - Slaves', 'pipe get', 'PL/SQL lock timer', 'slave wait', 'virtual circuit status', 'WMON goes to sleep' ) order by 4 desc ;
Here is the output from this script. Here we see two types of wait conditions:
- User CPA1 is waiting to access file 116, block 51253 - We can use the get_object_by_block_nbr.sql script to see the exact block.
- IUD_READ_ONLY users are waiting on a parallel query dequeue.
Individual process wait times SID Name Name Wait Time state P1 text Val Text ---- ---------------- ---------- ----- ----- ---------- ---------- ------------ ---------- P2 P3 P3 Val Text Val ------------ ---------- ------------ 75 CPA1_USER db file se .00 .00 WAITING file# 116 block# quential r ead 51,253 blocks 1 52 IUD1_READ_ONLY PX Deq Cre .00 .00 WAITING sleeptime/ 268,566,527 passes dit: send senderid blkd 284 qref 0 253 IUD1_READ_ONLY PX Deq Cre .00 .00 WAITING sleeptime/ 268,566,527 passes dit: send senderid blkd 284 qref 0 151 IUD1_READ_ONLY PX Deq Cre .00 .00 WAITING sleeptime/ 268,566,527 passes dit: send senderid blkd 284 qref 0 147 IUD1_READ_ONLY PX Deq Cre .00 .00 WAITING sleeptime/ 268,566,527 passes dit: send senderid blkd 284 qref 0
Methods for Using Event Wait Data
Because this is a short article, let's confine our wait discussion to the most onerous wait event, the db file sequential read wait event.
In our example from the system-wide reports we noted that the number one wait in v$system_event is "db file sequential reads." This wait event is usually for index reads. If we look at the detail listing for these wait events, we see that the script returns the file number and block number where Oracle was waiting for the event.
SID Name Name Wait Time state P1 text Val Text ---- ---------------- ---------- ----- ----- ---------- ---------- ------------ ---------- P2 P3 P3 Val Text Val ------------ ---------- ------------ 75 CPA1_USER db file se .00 .00 WAITING file# 116 block# quential r ead 51,253 blocks 1
Here we see that the CPA1_USER was waiting on access to file number 16, block number 51,253. Given this information, we can drill down into dba_extents and dba_data_files to see the data file that is causing the waits.
Here is a simple script that will display the file name for any file number:
accept filenbr prompt 'Enter the file#: ' select tablespace_name, file_name from dba_data_files where file_id = &&filenbr ;
When we run the script we can quickly see the name of the file that is experiencing the "db file sequential reads" wait:
Enter the file#: 10 TABLESPACE_NAME ------------------------------ FILE_NAME -------------------------------------- USERS C:\ORACLE\ORADATA\DIOGENES\USERS02.DBF
To properly locate the source of the wait we need more than just the file name We can use the get_object_by_block_nbr.sql script to display all object on the target block:
set pages 999 set verify off set lines 80 col c1 heading 'Segment|Name' format a40 col c2 heading 'Segment|Type' format a10 col c3 heading 'First|Block|of Segment' format 999,999,999,999 col c4 heading 'Last|Block|of Segment' format 999,999,999,999 accept filenbr prompt 'Enter the file number: ' accept inblock prompt 'Enter the block number: ' select segment_name c1, segment_type c2, block_id c3 -- block_id+blocks c4 from dba_extents where &filenbr = file_id and &inblock >= block_id and &inblock <= block_id+blocks ;
Here we see the exact source of the wait event is an index called IDX_EVENTCASE_STATUS_OVERAGE.
Enter the file number: 116 Enter the block number: 51253 First Segment Segment Block Name Type of Segment ---------------------------------------- ---------- ---------------- IDX_EVENTCASE_STATUS_OVERAGE INDEX 51,205
So, why are sessions waiting on index access? The most common cause is a freelist shortage. Prior to Oracle bitmap freelists (automatic extent management), segment header contention can happen if multiple tasks compete to simultaneously update the same index. The script can easily check the number of freelists for this object:
clear columns col c1 heading 'Table|Name' format a20 col c2 heading 'Table|Freelists' format 99 col c3 heading 'Index|Name' format a20 col c4 heading 'Index|Freelists' format 99 select distinct t.table_name c1, t.freelists c2, index_name c3, i.freelists c4 from dba_tables t, dba_indexes i where t.table_name = i.table_name and i.index_name = 'IDX_EVENTCASE_STATUS_OVERAGE'
Of course, there could be other problems such as a high disk enqueue on the disk that contains the index, but we should always try to add freelists first. If more freelists do not solve the wait issue, then the index must be placed on a faster disk, a less busy disk, or striped across multiple disks.
Also, note that because we have the SID for the waiting task, we can use the SID to join into v$session and v$sql to see the SQL that was causing the wait condition.
Other Wait Analysis Techniques
This article would not be complete without mentioning the 10046 trace events. Proponents of the 10046 approach to wait-event tuning will say that using the 10046 approach is the best and easiest way to analyze Oracle event waits.
If you want to drill down into trace event analysis, there are several companies that sell software to analyze trace files for wait events.
Conclusion
Oracle wait event analysis is extremely complex and it is one of the most challenging aspects of database tuning. While the effort is high, the rewards can also be great, especially for database that have problems with disk I/O contention, or objects with sub-optimal parameters.
--
Donald K. Burleson is one of the world’s top Oracle Database experts with more than 20 years of full-time DBA experience. He specializes in creating database architectures for very large online databases and he has worked with some of the world’s most powerful and complex systems. A former Adjunct Professor, Don Burleson has written 15 books, published more than 100 articles in national magazines, serves as Editor-in-Chief of Oracle Internals and edits for Rampant TechPress. Don is a popular lecturer and teacher and is a frequent speaker at Oracle Openworld and other international database conferences. Don's Web sites include DBA-Oracle, Remote-DBA, Oracle-training, remote support and remote DBA.
Contributors : Donald K. Burleson
Last modified 2005-06-22 12:36 AM