The Oracle RAC Dilemma – Part I
Dave Welch (@OraVBCA), Chief Evangelist
Part I: Four Criteria for Introducing or Keeping RAC
VMware vSphere High Availability can provide significant levels of high availability for many workloads. VMware HA does so without the complexity, fragility, or cost of Oracle Real Application Clusters™ (RAC).
Oracle RAC can introduce significant complexity, expense, and risk into a system stack. Many shops experience worse net HA with RAC than they would with single instance Oracle databases on VMware HA.
When is Oracle RAC the Right Choice?
Your organization may benefit from introducing, or maintaining, RAC if you can identify with at least one of the following evaluation criteria.
- You have an explicit SLA that requires the database be down for less than four minutes in an emergency
- RAC’s rolling upgrade capability is needed
- You have configured and/or programmed to RAC’s Oracle Notification Service (ONS)
- The database load outstrips vSphere’s 128 vCPU limit per virtual machine
Let’s look at each of these scenarios more closely.
1) You have an explicit SLA that requires the database be down for less than four minutes in an emergency – VMware HA needs about four minutes to restart a virtual machine and have the database ready for connections. RAC can have a failed node’s transactions failed over to a surviving node and the RAC cluster un-paused in about one minute. A corollary criterion is whether applications can provide a scheduled maintenance window.
2) RAC’s rolling upgrade capability is needed – Because a RAC cluster involves multiple instances, many patches can be applied one node at a time with no downtime to the database. However, major patch sets usually update the database data dictionary, forcing downtime for the entire RAC cluster. Oracle Critical Patch Updates also almost always require database downtime.
One of our many clients, who leads their vertical worldwide, brought us in years ago for a statistical assessment of their various workloads’ need for RAC. They confessed that up to that point they had handed out RAC to business units based on request rather than metrics. Although they had assumed that their uptime requirements could not be met without RAC, the assessment determined that 80% of their RAC implementations were unnecessary: single instance on VMware HA kept downtime well within the required patching windows. Accordingly, those RAC implementations were reconfigured as single instance on VMware HA.
3) You have configured and/or programmed to RAC’s Oracle Notification Service (ONS) – ONS allows application middle tiers to take code branches in response to various RAC cluster notifications. Fast Connection Failover is also a capability of ONS. It and other capabilities can be leveraged by inserting code hooks and dependent logic and, in many cases, through middle tier configuration alone. RAC provides superior capabilities for application stacks to monitor and respond to load and high availability events. However, applications that run on RAC and also leverage ONS appear to be the exception. Despite my enthusiasm for this capability, and my promotion of ONS in the Oracle University RAC classes House of Brick led years ago, I have only ever heard of two organizations leveraging it.
4) The database load outstrips vSphere’s 128 vCPU limit per virtual machine – This criterion applies only when, after all appropriate performance tuning, the database load still exceeds vSphere 6’s limit of 128 vCPUs per virtual machine. In our informal observations, easily 98 percent of Tier 1 clients’ workloads fit within 128 vCPUs with scalability to spare. vSphere’s continuous march toward more per-VM compute power is rendering this criterion increasingly irrelevant.
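The four criteria above can be sketched as a simple screening check. This is a minimal illustration and not a House of Brick tool: the `Workload` fields and the `rac_criteria_met` function are hypothetical names invented for this example, and the only figures used are the ones quoted in this post (a roughly four-minute VMware HA restart and the 128 vCPU per-VM limit in vSphere 6). A real assessment would rest on measured workload statistics.

```python
from dataclasses import dataclass
from typing import List, Optional

# Rough planning figures quoted in this post, not guarantees.
VMWARE_HA_RESTART_MINUTES = 4.0  # approx. VM restart plus database ready for connections
VSPHERE_MAX_VCPUS_PER_VM = 128   # vSphere 6 per-VM limit cited above


@dataclass
class Workload:
    """Hypothetical description of one database workload (illustrative fields)."""
    name: str
    sla_max_outage_minutes: Optional[float]  # explicit emergency SLA, or None if none exists
    needs_rolling_upgrades: bool
    uses_ons: bool                           # configured or programmed to ONS
    tuned_peak_vcpus: int                    # vCPU demand after performance tuning


def rac_criteria_met(w: Workload) -> List[str]:
    """Return which of the four criteria this workload meets (empty list means none)."""
    met = []
    # Criterion 1: the explicit SLA is tighter than a VMware HA restart can deliver.
    if (w.sla_max_outage_minutes is not None
            and w.sla_max_outage_minutes < VMWARE_HA_RESTART_MINUTES):
        met.append("sla_under_four_minutes")
    # Criterion 2: rolling upgrades are required.
    if w.needs_rolling_upgrades:
        met.append("rolling_upgrades")
    # Criterion 3: the application stack actually uses ONS.
    if w.uses_ons:
        met.append("ons")
    # Criterion 4: the tuned load does not fit in a single virtual machine.
    if w.tuned_peak_vcpus > VSPHERE_MAX_VCPUS_PER_VM:
        met.append("exceeds_vcpu_limit")
    return met


if __name__ == "__main__":
    orders = Workload("orders", sla_max_outage_minutes=2.0,
                      needs_rolling_upgrades=False, uses_ons=False,
                      tuned_peak_vcpus=64)
    print(orders.name, rac_criteria_met(orders))
```

A workload that returns an empty list is a candidate for single instance on VMware HA. Meeting a criterion only means RAC may be worth evaluating, not that it is required.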
Stability and Data Corruption
A RAC cluster can become unstable with a configuration oversight in the wrong place. An unstable RAC cluster can reduce availability compared to single instance rather than increase it. That being said, in the hands of qualified delivery partners, a RAC cluster can always be stabilized.
It is worth noting that a clustered database is inherently at higher risk for data corruption than a single instance database. RAC provides no incremental data corruption protection, or corruption recovery mechanisms, compared to a single instance database. If data corruption occurs, the entire cluster may go down while the data is being repaired, and the database will definitely be down during a restore/recovery or database flashback operation.
Inherent Advantages of Single-Instance Oracle on VMware
Single-instance Oracle on VMware HA may be a reasonable alternative to RAC when considering high availability. Single-instance Oracle on VMware HA is:
- Far less complex
- Inherently capable of being more stable
- Far more approachable for a wider array of less expensive technical staff
- Considerably less expensive
- Cloneable in vSphere (without working around shared storage)
VMware HA is not RAC
Single-instance Oracle on VMware HA is not equivalent to RAC, however. Various IT industry conversations cite failure scenarios that RAC covers and that single instance Oracle on VMware HA does not. The list below leaves out those already addressed in this post:
- Listener crash or accidental shutdown
- Oracle instance crash or accidental shutdown
- Listener IP failure
- Oracle instance out of memory
- Oracle session crash
- ORA-600 errors
- Deletion of the Oracle binaries
We at HoB love the challenge of RAC. We’ve only ever encountered two kinds of RAC clusters: extremely stable and extremely unstable. We’ve never met a RAC cluster we couldn’t stabilize. Fifteen years ago, RAC stood relatively alone as the ultimate HA option. Today we no longer recommend RAC for easily two thirds of the workloads for which it would once have been the only solution, because single instance on VMware HA can handle them with far less complexity, operational overhead, and expense.
- The RAC Dilemma Part II: Four RAC Operational Best Practices
- The RAC Dilemma Part III: The HA Feat RAC Will Never Pull Off