Troubleshooting Oracle Performance – Part 1

Jonas Mason, Senior Consultant

In a blog post titled Oracle AWR and Statspack Reports: Performing an Initial High Level Review, I covered a general performance assessment for an Oracle instance. AWR and Statspack metrics covered in that blog are useful in understanding how much CPU, memory, and disk I/O are utilized by Oracle. These are important for current and future right sizing efforts, and can also help answer questions about what has changed when a performance problem is encountered and initial assessments are made.

In this blog post, I will consider the general troubleshooting approach when confronted with an end user who complains, “the database is slow.” I will also include a brief overview of the Method R approach to tuning and why the Oracle Wait Interface is so important.

The goal of this blog is to provide a context and methodology that encourages a problem resolving dialogue between DBAs, developers and business users. Troubleshooting any problem involves developing a narrative. The narrative, or multiple narratives, should be subjected to debate and modified or discarded when refuted. Since most in IT have either been in support or opened a support ticket, we can understand that having a written log of observations over time is useful, if not mandatory. The ticket notes should provide a logical explanation of a developing narrative.   The narrative and the success of that narrative in resolving performance issues depends in large part on the methodology used and interaction between technical and business resources.

The benefit of successfully troubleshooting performance problems is that a corresponding reduction in CPU, memory, and disk I/O utilization is possible. The methods described in this blog can also be utilized proactively to identify and tune resource intensive systems before users complain. In this case, the organization may inventory what is responsible for the costs associated with an Oracle instance and its server, and may realize greater efficiencies through tuning.

Reducing system resource utilization impacts the bottom line in that right-sized systems cost less than over-sized systems. In the case of CPU utilization and CPU counts required per instance, there is a direct impact on Oracle license cost maintenance. Clearly, using less CPU on your Oracle servers is in your organization’s best interest.

This blog draws heavily from the books below, which are recommended reading:

 

Troubleshooting Approach

Often a performance problem is brought to the attention of the DBA or developer with only a statement like, “It’s running slow” or “Oracle is broken.” Based on this general and vague statement, a series of questions and answers must occur to identify the root cause and resolve it.

When beginning performance troubleshooting, it is important to know what constitutes normal be it system load or SQL statement performance.

This baseline is then used to compare against the current problem performance scenario, which should always be defined as narrowly as possible.

  1. Establish Baselines – Establishing and saving baselines now on typical server, instance, and SQL behavior is useful for remediating poor performance later. Baselines on CPU and disk I/O utilization by an instance are valuable in helping answer questions of what changed and why it might matter. Several baselines may be useful, including statspack rollups by hour, day and week, for a period of time. Finding out what changed is usually easier when you know what metric values once were. These statistics are available through statspack, AWR, and Oracle instance views. Oracle Enterprise Manager and other third party performance monitoring tools pull from these same views.
  1. Narrow Time Frame – When tuning a system or statement, identifying as narrow a time frame as possible is important. While statspack by default collects one hour snapshots of time, if activity that is problematic lasts only 10 minutes, be sure to only review this 10 minute increment with a shorter snapshot. Important details can be lost if only 60 minutes or more of activity is reviewed.
  1. Identify Session – Sometimes one or more poorly performing SQL statements for a few sessions can impact an entire system. Sometimes the poor performance associated with a session does not impact other sessions, or result in a substantial increase in server resource utilization. In other cases, the person experiencing poor response times is being impacted by other sessions, batch processes, or reporting.
  1. Identify SQL – The execution of SQL statements consumes server resources such as memory, CPU and disk IOPS/throughput. Poor SQL response times often mean that more resources are being consumed inefficiently.

The first step with troubleshooting any performance problem is identifying the root cause. Once identified, the root cause can be remediated. The findings here could also turn into a server right sizing effort, though the focus in this blog series is on identifying a problem SQL statement and remediating it.

Method R – A Response Time-Based Performance Improvement Method

In Optimizing Oracle Performance, Cary Millsap recommends the Method R approach to tuning, which is “Work first to reduce the biggest response time component of a business’ most important user action.”

A DBA or System Administrator working on a server in isolation is not going to be successful with the Method R approach because they are not interacting with a business’ end users. Adding CPU or memory does not necessarily improve response times for SQL executed by an end user, though it might help with system right sizing. With Method R, the DBA and developer interact with the end user iteratively to provide targeted tuning efforts that yield tangible results that benefit the business.

Per Optimizing Oracle Performance:

  1. “Select the user actions for which the business needs improved performance.”
  2. “Collect properly scoped diagnostic data that will allow you to identify the causes of response time consumption for each selected user action while it is performing sub-optimally.”
  3. “Execute the candidate optimizations activity that will have the greatest net payoff to the business. If even the best net-payoff activity produces insufficient net payoff, then suspend your performance improvement activities until something changes.”
  4. “Go to step 1.”

The focus here is on improving response times common to most businesses, including batch processes, report generation, application response times, and transactional throughput.

Oracle Wait Interface

Oracle captures wait events for the system and SQL statements that allow DBAs and developers to identify bottlenecks. There are 13 wait classes made up of over a thousand distinct wait events.

Below are examples of Oracle instance wait classes:

  • Administrative
  • Application
  • Cluster
  • Commit
  • Concurrency
  • Configuration
  • Idle
  • Network
  • Other
  • Queueing
  • Scheduler
  • System I/O
  • User I/O

 

These wait classes cover a significant spectrum in terms of Oracle instance performance, and are very useful in identifying what the system is doing to respond to user requests.

Below are examples of typical wait events and their classes:

  • db file sequential read, wait class User I/O
  • enqueue, wait class Other
  • direct path write, wait class User I/O
  • log file sync, wait class Commit

 

These are system level wait events that result from SQL being executed on a system.

We primarily want to address significant wait events because they have the most impact on system and SQL performance. If the original wait event only impacted the system and/or SQL statement by 5%, then tuning to eliminate or reduce this 5% only yields up to a 5% improvement. If another wait event is consuming 70% of a system or a statement, tuning to reduce or eliminate this 70% yields the greater return.

In my next blog post, I will cover Oracle views, statspack SQL sections, and table and index statistics that help narrow the focus of a performance issue. Once a problem SQL statement is identified, I will review how to interpret an explain plan to identify the most significant waits encountered, including remediation methods.

References:

 

Table of Contents

Related Posts