The Oracle RAC Dilemma – Part III

Dave Welch

July 13, 2020
Oracle, VMware

Dave Welch (@OraVBCA), Chief Evangelist

Part III: The HA Feat RAC Will Never Pull Off

Real Application Clusters had always been the high availability king. It surprises me that, 15 years after the ground-breaking VMware Fault Tolerance was introduced, I still encounter many IT operatives who have never heard of it. Occasionally I have an interaction where it is clear I’m not connecting with the person on the VMware Fault Tolerance proposition. I don’t fault them for their incredulity that what I am about to describe could even be possible.

Of all the sessions I have attended in my annual trips to VMworld since 2005, one of the top three was at VMworld 2011 U.S.: BCO2874, vSphere High Availability 5.0 and SMP Fault Tolerance, presented by Keith Farkas and Jim Chow.

I had not planned to attend the session, actually. I noticed it, happened to be close by, and wandered in on the spur of the moment. My interest in the session was really for the Site Recovery Manager piece.

I saw Diane Greene and her husband Mendel Rosenblum demonstrate Fault Tolerance at VMworld 2005. Amazing. But what good would single-vCPU capacity be to the majority of House of Brick’s clients running business-critical Oracle workloads? Years after the Fault Tolerance GA release, the only implementation we knew of was our reference client, GE Appliances and Lighting, protecting a famously vulnerable SAP queue in production with single-vCPU Fault Tolerance.

Back to the 2011 session. Only three weeks before, the VMware Engineering Team had solved the famously difficult problem of synchronizing replay of executions across multiple cores. And here they were chancing a live demo in a heavily attended session with the show’s video crew capturing the proceedings.

Then Jim Chow’s disclaimer: “Now, before I press ENTER here, I just want to say one thing (audience laughter). This is a developer prototype. I compiled it last week. I can say that I’ve seen it fail (more laughter, applause). It does fail. It can fail. And when it fails, it fails in strange and shocking ways (laughter), in ways that make you ashamed to put your name with, to associate yourself with it. But I can tell you we did the session on Tuesday and it worked flawlessly. So if things go horribly, horribly south, you can find yourself someone who went to the Tuesday session and they can speak in hushed, you know, with wonder and amazement at how well the thing worked. Whatever doesn’t kill you makes you stronger.” I restrained myself from leading a chant. Mentally, I was screaming, “Push the button. Push the button!”

Chow abended the OS where the database and Swingbench workload were executing, simulating full hardware failure under the database and 100% of its client connections. After a momentary dip in execution of less than a second, the Swingbench workload continued. The room erupted with applause and approval. No “ORA-03113: end-of-file on communication channel.” The workload never knew anything happened. See for yourself in the recorded session [start at 27:00].

There was a disclaimer about the substantial latency that the feature introduced into the protected workload. That didn’t bother me. It seemed to me that ever-faster chips and ever-faster storage would more than compensate for the latency.

After this demonstration, it was four agonizing years of waiting until we saw the SMP Fault Tolerance code released in production, with four vCPUs maximum in the protected VM. But with four 3.5 GHz cores and the hyper-threading lift, that’s easily 16 GHz of processing power. VMware’s tired slides of Capacity Planner workload specifications proved a substantial percentage of production Oracle workloads would fit in that.
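As a rough check of that arithmetic, here is a minimal Python sketch. The 20% hyper-threading uplift is purely an illustrative assumption of mine, not a measured figure, and the function name is hypothetical. Real uplift varies widely by workload.

```python
# Back-of-envelope aggregate clock capacity for a four-vCPU protected VM.
# ASSUMPTION: the 20% hyper-threading uplift is illustrative only.

def aggregate_ghz(cores: int, ghz_per_core: float, ht_uplift: float = 0.0) -> float:
    """Return nominal aggregate GHz, optionally scaled by a hyper-threading uplift."""
    return cores * ghz_per_core * (1.0 + ht_uplift)


raw = aggregate_ghz(4, 3.5)           # four 3.5 GHz cores, no hyper-threading lift
lifted = aggregate_ghz(4, 3.5, 0.20)  # same cores with the assumed 20% lift

print(f"raw: {raw:.1f} GHz, with assumed HT lift: {lifted:.1f} GHz")
```

Four cores at 3.5 GHz come to 14 GHz on their own, so the “easily 16 GHz” figure depends on a hyper-threading lift of roughly 15% or more.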

Meanwhile, at VMworld 2012, I was having lunch in Yerba Buena Gardens with professional associate and Oracle employee Michael Timpanaro-Perrotta. At the time, one of Michael’s responsibilities was to attend VMworld, show up in all the Oracle sessions, and provide a report back to Oracle management. I have a lot of respect for Michael despite the fact that it seemed we were on polar opposite sides in terms of our efforts to either advantage or disadvantage the cause of Oracle on VMware. He said there was an Oracle on VMware licensing session that he thought he’d go to. I said, “Nah, you know what they’re going to say. I’ve got something to show you that’s a lot more interesting.” So, Michael went with me to the 2012 reprise of the 2011 session. Same substantial latency caveats. Among other things, they admitted to having tested it on 16 vCPUs. Michael was ecstatic. “If both sides of that have to be licensed, we’ll sell a lot more RAC!” Indeed. Layering RAC on SMP Fault Tolerance would be the ultimate HA play.

By 2013, I was impatient for the code to be released.

SMP Fault Tolerance milestones:

Code released in vSphere 6 beta July 2014—four vCPUs per VM.
GA released in vSphere 6 March 2015.
vSphere 6.5 GA November 2016: fundamental rewrite, minimized SMP Fault Tolerance latency.
vSphere 6.7 GA April 2018: eight vCPUs per VM.

Here’s a high-level comparison of the HA offerings:

Single instance Oracle DB on VMware HA: 100% outage for about three to four minutes on modern x86 servers.
RAC: full loss of whatever percentage of the connections were running on the failed host, plus a minimum 30-second cluster-wide serialized pause while the surviving instances remaster the resources of the failed instance. Attempt to configure Clusterware tighter than 30 seconds and you risk RAC getting jumpy and invoking false node evictions.
SMP Fault Tolerance (up to eight vCPUs as of vSphere 6.7): no outage.
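To make the comparison concrete, the three options can be laid side by side in a small Python sketch. The figures come from the list above. The class and function names are hypothetical, and the even spread of RAC connections across nodes is an assumption of mine rather than anything the comparison guarantees.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailoverProfile:
    name: str
    connections_lost_pct: float  # share of client connections dropped by one host failure
    min_stall_seconds: float     # shortest stall cited for the workload


def ha_profiles(rac_nodes: int = 2) -> List[FailoverProfile]:
    """Build the three profiles from the figures in the comparison.

    ASSUMPTION: RAC connections are balanced evenly across rac_nodes,
    so one failed host drops 100 / rac_nodes percent of them.
    """
    if rac_nodes < 2:
        raise ValueError("RAC needs at least two nodes for this comparison")
    return [
        # "About three to four minutes": 180 s is the low end of that range.
        FailoverProfile("Single instance on VMware HA", 100.0, 180.0),
        # 30 s is the minimum cluster-wide pause while resources are remastered.
        FailoverProfile("RAC", 100.0 / rac_nodes, 30.0),
        # SMP Fault Tolerance: no outage.
        FailoverProfile("SMP Fault Tolerance", 0.0, 0.0),
    ]


for p in ha_profiles(rac_nodes=2):
    print(f"{p.name:30s} lost={p.connections_lost_pct:5.1f}%  stall>={p.min_stall_seconds:4.0f}s")
```

Adding RAC nodes shrinks the share of connections lost per failure, but under these figures the 30-second pause still hits every surviving instance.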

Need more horizontal scalability than eight vCPUs? Layer RAC on SMP Fault Tolerance.
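A rough sizing sketch for that layering, in Python. The per-VM ceilings are the ones in the milestone list above. The function name is hypothetical, and the linear-scaling assumption ignores RAC interconnect overhead, so treat the result as a floor rather than a recommendation.

```python
import math

# Per-VM vCPU ceilings for SMP Fault Tolerance, taken from the milestone list above.
FT_MAX_VCPUS = {"6.0": 4, "6.7": 8}


def rac_nodes_needed(required_vcpus: int, vsphere_version: str = "6.7") -> int:
    """Smallest number of FT-protected VMs whose combined vCPUs cover the requirement.

    ASSUMPTION: capacity scales linearly with node count; real RAC
    interconnect overhead is ignored.
    """
    if required_vcpus < 1:
        raise ValueError("required_vcpus must be positive")
    cap = FT_MAX_VCPUS[vsphere_version]
    return math.ceil(required_vcpus / cap)


print(rac_nodes_needed(8))   # 1: fits in a single protected VM, no RAC needed
print(rac_nodes_needed(20))  # 3: three RAC nodes of up to eight vCPUs each
```

Keep in mind that Fault Tolerance runs a secondary copy of each protected VM on another host, so the physical capacity consumed is roughly double what the node count alone suggests.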

Could RAC ever do what SMP Fault Tolerance does? Sure, beginning with a tooling handshake down into the hypervisor layer. Much to my disappointment, since Larry’s keynote Oracle VM (OVM) announcement at Oracle OpenWorld 2007, Oracle just hasn’t shown interest in maturing OVM anywhere near what would be needed for it to compete technically with SMP Fault Tolerance. Until that changes, I predict that RAC will never be able to pull off SMP Fault Tolerance’s HA feat.

Diane Greene, High Availability, hyper-threading, Jim Chow, Keith Farkas, Mendel Rosenblum, Michael Timpanaro-Perrotta, VMworld 2011 U.S. BCO2874, Oracle RAC, Oracle Real Application Clusters, Oracle VM, single instance on VMware HA, Single Instance on VMware SMP Fault Tolerance, Single instance Oracle Database on VMware HA, SMP Fault Tolerance, Swingbench, VMware Fault Tolerance, VMware Site Recovery Manager, vSphere 6

Dave Welch

Founding Partner, CTO and Chief Evangelist, Dave Welch is House of Brick’s Oracle Licensing practice lead. He specializes in Oracle enterprise license assessment, audit defense, enterprise infrastructure assessment, Business Continuity options, performance, scalability, and system architecture. Dave has been involved in the reduction and/or reallocation of millions of dollars of hardware and software budget. His mission-critical Oracle DBA experience began in 1994. Dave has been a prominent voice in the industry for the implementation of Business-Critical Application workloads on VMware virtualization technology. He is also a former Oracle University RAC instructor.
