Why Run RAC?
Oracle Real Application Clusters (RAC) lets multiple database servers, each running its own instance, access a single database on shared storage. Compared with a single-instance database, this adds High Availability (HA) features in three areas:
- Hardware failures
- Rolling patches
- Horizontal scalability
Let’s explore these features briefly.
Hardware Failures
RAC protects against hardware failures by continuing to serve database connections to the application through the surviving nodes. If the application is RAC aware, this failover can be seamless. Most applications are not RAC aware, so a hardware failure typically still results in a partial outage: sessions connected to the failed node are lost and must reconnect.
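To make "RAC aware" a little more concrete, here is a minimal sketch of the application-side behavior: when a node stops answering, the application retries the same unit of work on the next node instead of surfacing an error. The function name `run_with_failover`, the `connect` callable, and the node names are all hypothetical placeholders, not part of any Oracle driver; real deployments usually lean on driver and listener features rather than hand-written loops.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_failover(
    nodes: Sequence[str],
    connect: Callable[[str], object],
    work: Callable[[object], T],
) -> T:
    """Run `work` against the first node that accepts a connection.

    `connect` is a hypothetical stand-in for whatever the database driver
    uses to open a session. Retrying is only safe when `work` is idempotent
    or is known not to have committed before the failure.
    """
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            conn = connect(node)
            return work(conn)
        except ConnectionError as exc:
            # This node is unavailable; fall through to the next survivor.
            last_error = exc
    raise ConnectionError(f"all nodes failed: {list(nodes)}") from last_error
```

An application without logic like this sees the lost session as an error, which is the partial outage described above.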
Rolling Patches
Some Oracle software patches can be applied one node at a time, without a total database outage. In this situation, one instance goes down, gets patched, comes back up, and rejoins the cluster. This is repeated until every node in the cluster is patched. The result is similar to the hardware failure scenario: if the application is RAC aware, there is no outage. Otherwise, there is a partial outage each time an instance goes down to be patched.
Horizontal Scalability
In a RAC cluster, connections to the database can be spread across the instances in the cluster. This provides horizontal scalability, because the database load is spread across multiple servers. This load balancing comes at a cost, however. Keeping the data blocks in sync across the nodes in the cluster involves significant overhead, so a single-instance database does the same work more efficiently.
HA Solutions for Oracle in AWS
While RAC is technically possible on native AWS infrastructure, thanks to recent developments by AWS, House of Brick recommends moving away from RAC wherever possible. RAC adds significant complexity to your Oracle database environments. Unless you can take full advantage of RAC functionality, it is an unnecessary complication to your infrastructure. Designing HA and DR solutions for Oracle single-instance databases in AWS is no different from designing HA and DR in AWS for any other application. The design needs to include both infrastructure redundancy and data redundancy.
Infrastructure Redundancy
Like all hyperscale public clouds, AWS manages infrastructure resources based on geographical region and datacenter. In AWS, a separate geographical region is, well, an AWS region. An Availability Zone (AZ) is a separate set of isolated resources (one or more datacenters) within a region.
For DR purposes, it is best to distribute workloads across regions. This provides the most insulation from disaster. AWS regions are very durable and contain multiple levels of redundant resources, but it is still possible for an entire region to experience an outage.
To make your workload highly available in AWS, it is advisable to distribute EC2 resources across AZs. This allows for close network connectivity while providing isolated server and storage redundancy.
While multi-AZ in AWS makes for a good HA solution for EC2, an EBS volume is isolated to a single AZ, so the application's data will need to be replicated.
Data Replication
AWS provides good DR options for data replication. A snapshot of an EBS volume is stored at the region level rather than in the volume's AZ, so it can be used to create a new volume in any AZ of that region. Once the snapshot exists, it can also be copied to a different region. Using a snapshot of an EBS volume requires some form of configuration, potentially recovery, and failover before it is useful to the application, which makes snapshots a DR solution, not an HA solution.
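The snapshot-then-copy flow can be sketched in a few lines. This is a minimal illustration, assuming `src_ec2` and `dst_ec2` are boto3 EC2 clients created for the source and destination regions (for example with `boto3.client("ec2", region_name=...)`); the function name, the description strings, and the region names in the usage comment are hypothetical.

```python
def snapshot_and_copy(src_ec2, dst_ec2, volume_id: str, src_region: str) -> str:
    """Snapshot an EBS volume, then copy the snapshot to another region.

    `src_ec2` and `dst_ec2` are assumed to be boto3 EC2 clients bound to the
    source and destination regions. Returns the ID of the copied snapshot.
    """
    # Take the snapshot in the source region.
    snap = src_ec2.create_snapshot(
        VolumeId=volume_id,
        Description="DR snapshot",
    )
    snap_id = snap["SnapshotId"]

    # A snapshot must be complete before it can be copied.
    src_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap_id])

    # The copy is requested from the destination region's client.
    copy = dst_ec2.copy_snapshot(
        SourceRegion=src_region,
        SourceSnapshotId=snap_id,
        Description=f"DR copy of {snap_id}",
    )
    return copy["SnapshotId"]


# Example usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   src = boto3.client("ec2", region_name="us-east-1")
#   dst = boto3.client("ec2", region_name="us-west-2")
#   snapshot_and_copy(src, dst, "vol-...", "us-east-1")
```

Nothing here makes the data usable by an application: a volume still has to be created from the copied snapshot, attached, and recovered, which is why this remains a DR step.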
HA requires that the data be immediately available upon disruption to the primary data source. In other words, if the primary database goes down there needs to be an active copy ready to accept connections. The only way to do this in AWS is to use Oracle database replication technologies to perform the data replication.
Data Guard
Traditional Data Guard replication to a standby database is physical data replication. That is, blocks of data are copied from the source to the target and then kept up to date continuously using Oracle media recovery. As blocks are modified on the source, Oracle records the deltas in redo logs and ships them to the standby database, where they are applied to roll the database forward. This is a very efficient process, as only the changes to the blocks are shipped. Data Guard is considered a warm standby solution. The database is kept continually in sync, but the standby database is in a mounted state during steady-state replication. In this case, the standby database must be activated and opened before it can accept connections from the applications.
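As a rough mental model of redo shipping and apply, the toy sketch below treats a database as a dictionary of blocks and a redo record as "block N now holds value V" stamped with a change number. It is a heavy simplification for illustration only: the class and field names are hypothetical, and real redo contains change vectors rather than whole block values.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class RedoRecord:
    scn: int        # change number that orders the changes
    block_id: int   # which block was modified
    new_value: str  # simplified stand-in for the change itself


@dataclass
class Database:
    blocks: Dict[int, str] = field(default_factory=dict)
    applied_scn: int = 0  # highest change number applied so far


def apply_redo(standby: Database, records: Iterable[RedoRecord]) -> None:
    """Roll the standby forward by applying shipped redo in change order."""
    for rec in sorted(records, key=lambda r: r.scn):
        if rec.scn <= standby.applied_scn:
            continue  # already applied; skip stale or duplicate redo
        standby.blocks[rec.block_id] = rec.new_value
        standby.applied_scn = rec.scn
```

Only the records travel between the two sites, never the full set of blocks, which is the source of the efficiency noted above.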
GoldenGate
Unlike Data Guard’s physical replication model, GoldenGate uses logical replication to keep the target database in sync. GoldenGate reads the source database’s redo logs and extracts the changes that were made to the database. Those changes are then shipped to the target database and applied there as SQL.
GoldenGate, when employed in this fashion, is considered a hot standby. The target database is open and accepting connections while at the same time, logical replication operations are keeping the database in sync.
There are multiple ways to employ GoldenGate for database replication:
- Extract, Transform, Load (ETL) – Extract and load a subset of data from source to target. This is typically used in data warehouse operations.
- Active/Passive – The source database is active and the target database only receives replicated changes. There are no active application connections to the target database.
- Active/Active – There are active connections to both databases and replication is bi-directional, keeping each database in sync with the other.
ETL applies to data warehousing rather than HA. Active/active requires logic in the application to prevent data integrity issues.
A GoldenGate active/passive database solution comes very close to a RAC level of availability. All application connections are directed to the active database. When the primary database goes down, whatever the reason, the connections to it are lost. If you are using an AWS load balancer, it can be configured to always connect to the primary database unless there is a failure. If connections cannot be made to the primary database, the load balancer redirects traffic to the secondary database. If your application server is configured to connect through the load balancer, all new connection requests are rerouted to the surviving database with no human intervention.
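The routing rule described above (prefer the primary, fall back to the secondary only when the primary fails its health check) can be sketched as follows. This is an illustration of the decision logic, not of how any AWS load balancer is configured; the function names and hostnames are hypothetical, and 1521 is used only because it is the default Oracle listener port.

```python
import socket
from typing import Callable, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP health check: can a connection be opened to host:port?

    This only shows that something is listening on the port, not that the
    database behind it is open and accepting sessions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_target(
    primary: Endpoint,
    secondary: Endpoint,
    check: Callable[[str, int], bool] = is_reachable,
) -> Endpoint:
    """Return the primary while it is healthy, otherwise the secondary."""
    if check(*primary):
        return primary
    if check(*secondary):
        return secondary
    raise ConnectionError("neither database is reachable")


# Example usage with placeholder hostnames:
#   target = choose_target(("db-primary.example.internal", 1521),
#                          ("db-standby.example.internal", 1521))
```

Only new connection requests benefit from this rule. Sessions that were open on the primary when it failed are still lost and have to be re-established by the application.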
Conclusion
There are alternatives to Oracle RAC regardless of the infrastructure you run on, and AWS itself is designed with a high level of fault tolerance. Everything in AWS is designed to eliminate, as much as possible, lengthy unscheduled outages. Your application is not going to be down for hours if there is a datacenter issue in AWS. However, if your application requires an elevated level of HA, there are ways to configure Oracle to get near-RAC availability. GoldenGate with active/passive databases will get you very close to RAC’s level of HA. The same disclaimer that applies to RAC applies to GoldenGate active/passive: if your application is not RAC aware, then no HA solution will provide zero downtime in the event of a failure.