Dave Welch, CTO and Chief Evangelist
If there were ever a system stack begging for comprehensive break/fix and performance/throughput testing, it’s Oracle Real Application Clusters, or RAC.
In my non-statistical observation, RAC remains justifiable for only about a third of our customers’ Tier-1 workloads for which it was the only viable High Availability (HA) solution a decade ago.
Let’s take a minute to discuss risk management for that third of the stacks that really do need RAC.
I can’t count the number of times I have told a prospect there are only two kinds of RAC clusters in the world – “extremely stable, and extremely unstable.” Shops that haven’t been through RAC break/fix testing are at much higher risk of being in the extremely unstable camp. Thus, their actual HA experience is often south of what they had projected it to be on RAC. That usually means they would have had a better HA experience had they stayed on single-instance Oracle.
Lack of Testing in Native Hardware Environments
It’s quite difficult, let alone expensive, to provision hardware and configuration for RAC testing in a native hardware environment. It’s not uncommon for us to encounter shops that don’t introduce RAC into the system stack until the last hop before production. Possibly because of that, most shops don’t complete either break/fix testing or RAC performance/throughput testing. Another reason may just be a lack of knowledge as to the operational threat that RAC presents. Instead, they roll the dice and hope things go well when they promote the stack to RAC in production.
RAC opens the door to two threats not shared by single-instance workloads: interconnect-induced performance problems, and node hard-codes. I resigned my work as a contract Oracle University RAC instructor back in the 10.2 days due to professional opportunity cost. What the O.U. RAC Admin manual should have said and didn’t was this: the RAC DBAs’ and RAC developers’ prime performance directive is to do everything possible to “keep the interconnect quiet.” Node hard-codes will render a workload RAC-impossible, meaning it can’t fail over to a surviving RAC node, but rather has to wait to come back up when the original node is repaired or the original node’s IP has been adopted by another node. The supplied packages DBMS_PIPE, DBMS_SIGNAL, and DBMS_ALERT are common culprits in node hard-codes.
RAC operational testing comprises validating a stack against both of those threats. About 85% of shops get away without such testing because their RAC-agnostic code winds up performing adequately. As for the other 15%, it can get expensive and embarrassing.
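As a cheap first pass at the node hard-code threat described above, a shop could scan its PL/SQL source for references to the supplied packages named earlier before any break/fix lab time is spent. The following is only a minimal sketch under stated assumptions: the function name find_suspect_calls and the sample PL/SQL are hypothetical, the comment stripping is naive, and a text match is a prompt for review, not proof of a node hard-code. It is no substitute for real failover testing.

```python
import re

# Package names taken from the article; extend the tuple as needed.
SUSPECT_PACKAGES = ("DBMS_PIPE", "DBMS_SIGNAL", "DBMS_ALERT")

# Whole-word, case-insensitive match on any suspect package name.
_PATTERN = re.compile(r"\b(" + "|".join(SUSPECT_PACKAGES) + r")\b", re.IGNORECASE)


def find_suspect_calls(source):
    """Return (line_number, package_name) pairs for each suspect reference.

    Hypothetical helper: it only inspects text and knows nothing about
    how the code actually behaves on a cluster.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Naive: drop everything after a single-line SQL comment marker.
        # This would misfire on '--' inside a string literal.
        code = line.split("--", 1)[0]
        for match in _PATTERN.finditer(code):
            hits.append((lineno, match.group(1).upper()))
    return hits


# Hypothetical PL/SQL fragment used purely for illustration.
sample = """BEGIN
  -- dbms_alert is mentioned only in this comment
  status := dbms_pipe.send_message('orders');
  DBMS_ALERT.SIGNAL('done', 'ok');
END;"""

for lineno, package in find_suspect_calls(sample):
    print(f"line {lineno}: references {package}")
```

Each hit is a place to ask whether the workload depends on state that lives on a single node. The answer still has to come from inducing a node failure and watching what happens.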
VMware Removes Common Barriers
VMware largely removes these barriers. It allows for the RAC stack to be forwarded all the way back to the earliest design/development stage with a fraction of the configuration dependencies and without the hardware dependencies. This gives shops the ability to stress test their code on RAC early on, and far more efficiently than they could on bare metal. VMware’s enablement of RAC break/fix lab activity allows motivated shops to really understand RAC’s operational characteristics, which leads to stable RAC stacks.
Other benefits in the production environment:
- Why bless a workload with RAC’s superior HA (when properly configured), but curse it with Oracle Data Guard’s inferior DR capability? Rather, when we shroud the production workload with VMware, we can use replication lower in the stack, thus approaching 100% DR reliability.
- VMware in production also offers a lot more system stack cloning options as compared to bare metal. But that’s more of a benefit to the pre-production lifecycle.
Finally, VMware allows RAC teams to achieve a higher level of RAC expertise and experience in less time because they’ve already intentionally induced every conceivable failure and appropriate remediation in the RAC stack during the break/fix process. In our observation, teams experienced with RAC-on-VMware break/fix are, on average, dramatically better than their bare metal counterparts at getting efficiently to root cause diagnostics in RAC emergencies.
Conclusion
If Oracle RAC is appropriate for your workload at all, RAC on VMware is a marriage made in heaven.