
Oracle RAC on VMware – Beware Oversubscription

Oracle, VMware

by Cameron Cameron, Senior Consultant

At House of Brick, we’ve been virtualizing Oracle RAC on VMware for over twelve years. The technology is reliable. Of course, prerequisites and best practices exist in a virtual deployment, just like they do with physical hardware.

Last year, I was sent to a client site that was reporting RAC node evictions in its virtual environment. The client was predisposed to believe that the issues had to do with the cluster interconnect, so a VMware NSX expert was also onsite. Suffice it to say that no network issues were found. Within the first hour, I found that minor memory ballooning was occurring occasionally, and we determined that this was the cause of the node evictions. I spent the remainder of my visit educating the client on why memory pressure is so bad for Oracle RAC.

A significant reason why VMware is attractive to businesses is its ability to consolidate workloads and oversubscribe CPU and memory resources, so what’s the big deal? Well, it’s true, you can oversubscribe CPU, and sometimes memory, as long as you keep a few things in mind:

Tier-1 workloads are generally not good candidates for oversubscription
Oracle RAC is not a good candidate for oversubscription, regardless of its tier

At House of Brick, we have a set of Oracle on VMware best practices, which we advise our clients to follow, whether we’re assisting them in virtualizing existing workloads or performing health checks. Those best practices call for full memory reservations on Oracle Database VMs, or at least a reservation equal to the size of the SGA. For CPU, we acknowledge that it’s generally okay to oversubscribe up to a ratio of about 1.5:1.
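As a rough illustration of those two guidelines, here is a minimal Python sketch that classifies a VM's memory reservation and checks a host's CPU ratio. Everything in it is an assumption made for the example: the OracleVm fields, the function names, and in particular the reading of the CPU ratio as total vCPUs divided by physical cores. It does not query vSphere; the numbers would have to come from your own inventory.

```python
from dataclasses import dataclass

# Approximate ceiling taken from the "~1.5" guideline in the text.
MAX_CPU_RATIO = 1.5


@dataclass
class OracleVm:
    """Hypothetical record of one Oracle Database VM (all sizes in GB)."""
    name: str
    configured_memory_gb: float
    reserved_memory_gb: float
    sga_gb: float


def reservation_status(vm: OracleVm) -> str:
    """Classify the memory reservation against the two guidelines."""
    if vm.reserved_memory_gb >= vm.configured_memory_gb:
        return "full"            # full memory reservation
    if vm.reserved_memory_gb >= vm.sga_gb:
        return "sga-covered"     # at least the SGA is reserved
    return "insufficient"        # less than the SGA is reserved


def cpu_ratio(total_vcpus: int, physical_cores: int) -> float:
    """Assumed definition: vCPUs allocated on the host per physical core."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return total_vcpus / physical_cores


def cpu_within_guideline(total_vcpus: int, physical_cores: int) -> bool:
    """True when the host's CPU ratio is at or below the ~1.5 guideline."""
    return cpu_ratio(total_vcpus, physical_cores) <= MAX_CPU_RATIO
```

For example, a VM configured with 64 GB, a 32 GB SGA, and only a 16 GB reservation would be classified as "insufficient", and a host with 64 vCPUs allocated across 32 cores would fall outside the CPU guideline.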

Not all shops choose to follow our advice, however. For non-RAC Oracle Database VMs, the worst that can happen is a performance impact, which may or may not be noticed. Let me postulate that oversubscription equals latency. So why is resource oversubscription so bad for Oracle RAC? The short answer is Oracle Grid Infrastructure.

Consider running a two-node RAC cluster on physical machines, Node-A and Node-B. If you were able to induce a condition on Node-A that caused some kind of latency across the cluster interconnect, what would happen? It’s very likely that Node-A would be evicted from the cluster; in other words, Node-A would be rebooted. A key difference between Oracle Database and Oracle Grid Infrastructure is that Oracle GI has the ability to reboot the machine.

Running Oracle RAC in a VMware environment, in a situation where the VM is not guaranteed to get the memory resources it needs, could result in ballooning. Ballooning on the ESXi host at the client site corresponded to what I labeled as “unpredictable results” in Clusterware. The word “unpredictable” is appropriate because the result of the increased latency in the environment presented in a number of ways:

Loss of access to a CRS/voting disk
Unacceptably high network latency
Failure to communicate with the CSS daemon

Here are examples of errors shown in some of the Clusterware alert logs immediately following ESXi memory ballooning events:

2018-07-29 08:29:24.963 [OCSSD(12804)]CRS-1614: No I/O has completed after 75% of the maximum interval. Voting file ORCL:OCR_DATA_02 will be considered not functional in 4070 milliseconds

2018-07-29 08:29:27.951  [CSSDAGENT(12767)]CRS-1661: The CSS daemon is not responding. Reboot will occur in 5569 milliseconds;….

2018-08-19 04:37:57.859 [OCSSD(9589)]CRS-1612: Network communication with node [nodename] (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.030 seconds

2018-08-19 04:38:20.45 [OCSSD(5141)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; … The CSS daemon is terminating due to a fatal error;….
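If you want to line up messages like these against the times your monitoring recorded ballooning, a small parser is enough. The following Python sketch is only an illustration: the regular expression is inferred from the sample lines above, not from any Oracle specification, and the names CrsEvent, parse_alert_lines, and events_after are hypothetical. The ballooning timestamp is something you would supply yourself from vSphere or ESXi performance data.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple

# Pattern inferred from the sample alert log lines (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\[(?P<proc>\w+)\((?P<pid>\d+)\)\]"
    r"(?P<code>CRS-\d+):\s*(?P<msg>.*)$"
)

# The four message codes quoted in this post.
EVICTION_CODES = {"CRS-1609", "CRS-1612", "CRS-1614", "CRS-1661"}


class CrsEvent(NamedTuple):
    when: datetime
    process: str
    pid: int
    code: str
    message: str


def parse_alert_lines(lines: Iterable[str]) -> List[CrsEvent]:
    """Return the eviction-related events found in the given log lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("code") not in EVICTION_CODES:
            continue  # skip blank lines, other codes, and unknown formats
        when = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        events.append(CrsEvent(when, m.group("proc"), int(m.group("pid")),
                               m.group("code"), m.group("msg")))
    return events


def events_after(events: List[CrsEvent], start: datetime,
                 window: timedelta) -> List[CrsEvent]:
    """Events logged within `window` after `start` (e.g. a ballooning time)."""
    return [e for e in events if start <= e.when <= start + window]


# Shortened copies of three of the log lines quoted above.
SAMPLE = [
    "2018-07-29 08:29:24.963 [OCSSD(12804)]CRS-1614: No I/O has completed "
    "after 75% of the maximum interval.",
    "2018-07-29 08:29:27.951  [CSSDAGENT(12767)]CRS-1661: The CSS daemon is "
    "not responding.",
    "2018-08-19 04:37:57.859 [OCSSD(9589)]CRS-1612: Network communication "
    "with node [nodename] (2) missing for 50% of timeout interval.",
]
events = parse_alert_lines(SAMPLE)
```

Calling events_after(events, datetime(2018, 7, 29, 8, 29, 0), timedelta(seconds=60)) with that made-up ballooning time would return the two July 29 events and leave out the August 19 one.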

As I mentioned earlier, the client was predisposed to suspect that the cluster interconnect was the culprit. That suspicion is not surprising if they reviewed AWR reports during their forensic research. Resource contention on any one RAC node degrades performance there, which in turn affects interconnect traffic, so the surviving nodes’ AWR reports are naturally going to show latency issues.

Oracle RAC on VMware runs well, and at House of Brick we’ve never encountered a RAC on VMware issue that couldn’t be stabilized, as long as the proper resources were provisioned. For Oracle RAC, this means setting memory reservations for the Oracle VM, or running it on a host that is not oversubscribed.

