Nick Walter, Principal Architect
Part Two – Architecting the Environment
“How do I architect my Aurora environment?” is a very frequent question I hear from clients who have decided to ditch Oracle and move to Aurora as part of an AWS migration. While the cost savings of Aurora can be enticing, the prospect of retraining DBA staff and suffering through the false starts of climbing the learning curve, may seem daunting. The good news I give all my concerned clients is that Aurora really isn’t that hard to set up properly. There are certainly fewer factors to consider than there are when approaching Oracle for the first time as a new DBA.
While in part one we focused on building the business case, in this second blog I want to share some of the key tips that I often give clients to help them set up their first Aurora stack in AWS. The focus will be on Aurora/Postgres, as opposed to Aurora/MySQL, as Aurora/Postgres is the most popular target for Oracle migrations (due to the similarity of the PostgreSQL feature set with Oracle).
Storage Sizing
Storage sizing in Aurora is extremely straightforward as there is nothing to calculate. There are no decisions to make since Aurora auto allocates storage as needed, with no user intervention. There are not multiple storage tiers available for Aurora, as there are with other RDS offerings. Aurora is built on a custom storage management layer, which uses SSD storage exclusively for very high IOPS and throughput capacity. Billing is based on used gigabytes and I/O per second metrics.
While calculating the necessary storage capacity is not a prerequisite of provisioning Aurora, due to auto provisioning, many House of Brick clients ask for guidance on determining how much Aurora storage will be used, in order to anticipate Aurora storage costs. Happily, it is pretty easy to come up with a rough calculation. While exact storage requirements can vary a little between an Oracle source database and an Aurora/Postgres target database, as a general rule the used disk space (datafiles + tempfiles + redo log files) for an Oracle database will be sufficient for the same database once it has been migrated to Aurora.
Instance Sizing
When architecting a new Aurora/Postgres environment for an Oracle migration, one of the first difficulties most clients encounter is appropriately sizing the environment. Sizing for any migration project is quite difficult, and can be especially difficult when the migration target is a different database engine. It can be nearly impossible to know in advance if the new database engine will be more or less RAM efficient, CPU efficient, or disk I/O efficient for the same query load. As a general rule, Aurora tends to need a slightly smaller instance size than the original Oracle database for a migrated workload. This is due to the disk I/O advantages of Aurora, which is built on high-speed SSD storage. Given higher disk IOPS and throughput, migrated databases typically need less RAM for caching data. OLTP workloads with lots of small queries really benefit from this, but so do reporting workloads that scan large tables. In my experience, the only workloads that cannot be reduced in size when moving to Aurora are heavy reporting workloads, such as Business Intelligence (BI) applications that perform extensive sorts and joins. Such sort and join queries are often very RAM/CPU hungry, and are not good candidates for downsizing as part of a move to Aurora.
The good news, when it comes to sizing, is that Aurora is very much a flexible public cloud offering, so Aurora clusters can be resized on the fly with minimal downtime. Thus, any missteps in sizing an Aurora cluster are very easily corrected. For non-production workloads, House of Brick’s best practice is to start with an instance on the small end of the estimated size range and then upsize as needed. For production workloads, the opposite is considered a best practice by House of Brick – start large and plan to size down once the system has been put under production load, and there is ample performance data to review to determine if the instance should be downsized. In order to illustrate how simple resizing can be, I first have to explain a bit about the Aurora architecture and how Aurora clustering works.
Aurora Clustering
All Aurora implementations are clusters, in some ways similar to Oracle RAC, in that the additional cluster nodes all share the same on-disk data. So as clusters grow, the data storage needs (and costs) do not grow. One notable difference from Oracle RAC however, is that unlike a shared-everything RAC implementation, an Aurora cluster has only one master cluster node with read/write access at a given time. Other cluster nodes have read-only access to the database data and can execute queries but they cannot write to the database. Any of the read-only nodes can, however, take over quickly as the master node should the master node become unavailable due to either a planned maintenance or an unplanned event that affects service. Aurora even offers a connection router, conceptually similar to the SCAN feature used in Oracle RAC, which will automatically reconfigure during a master node failover to allow reconnection to a new master node using the same cluster master address as the previous node.
There is no requirement of uniformity for sizing the cluster nodes, and it is common to dedicate a read-only node to some supporting purpose, such as ETL extracts or handling reporting queries. If multiple, differently-sized, read-only nodes are available it is possible to define a failover priority to determine which of the nodes can inherit the master node’s responsibilities should the original master become unavailable. From a practical standpoint, Aurora read-only cluster members can be considered similar to Oracle Active Data Guard standbys. They have the full dataset available for queries and reporting, and can also serve as immediate failover targets should an incident bring down the production master database. The chief practical difference between Aurora clustering and Oracle Active Data Guard however, is that Aurora clustering will perform the failovers quickly and automatically. Active Data Guard configurations typically require human intervention to failover, and then take several minutes to promote a standby to a master, whereas Aurora fails over in ~30 seconds in House of Brick’s experience.
Resizing on the Fly
Because it is easy to failover master responsibilities in an Aurora cluster in about 30 seconds, resizing a master to fit changes in workloads is quite easy. Simply add a new node to the cluster (of the desired new size) as a read-only node, make sure that it is set to have primary precedence in the event of a failover, and then induce a failover. Roughly 30 seconds later, that node will be the new cluster master node. With this flexibility in mind, it is fairly easy to adjust any Aurora instance in a cluster to the proper size fairly quickly and with minimal disruption to traffic.
High Availability
The clustered nature of Aurora makes high availability (HA) an automatically available feature. To get full, multi-availability-zone HA, one simply has to make sure that a cluster has at least one read-only node in a different availability zone than the master – that’s all it takes. The storage layer of Aurora is automatically replicated across multiple availability zones, so there is no risk in data loss from an AWS availability zone going down. The only risk is the that data wouldn’t be available if the availability zone containing the master instance of the cluster was in the impacted availability zone and became unavailable. An Aurora cluster, containing at least one read-only node in another availability zone, would survive the loss of an entire availability zone with barely a hiccup.
House of Brick best practice calls for all production Aurora clusters to have at least two nodes in two separate availability zones. In practice, business critical systems typically require more than one read-only node in order to keep servicing application traffic without incident, in the event of a loss of the master node. Many enterprises heavily utilize read-only cluster members for reporting, ETL extracts, or read-only application traffic.
All high availability architecting ultimately revolves around the tradeoff of incremental costs for additional redundancy and uptime versus cost to the business from an unplanned outage. Careful thought should go into whether the cluster loss of a read-only node would constitute an unacceptable loss of service or not. In environments where the read-only traffic is considered business critical, House of Brick advises adopting an N+1 approach, where an extra node, above and beyond the capacity needed for typical daily load on the read-only cluster members, is kept online and available. This allows the cluster to absorb the loss of a node without a reduction in capacity. In environments where brief interruptions of service to the read-only traffic are more acceptable, at least in the short term, House of Brick has seen companies not adopt N+1 design, but rather internally document a manual response procedure that includes having operational personnel add additional nodes to the cluster to regain lost capacity. Typically, this can address the loss of capacity within 30 minutes of an outage, and offers cost savings over keeping an extra node online beyond needed capacity.
Backup and Recovery
A good DBA should always be worried about the technical and operational rigor of their database’s backup and restore system. As database specialists, House of Brick has been called into more ugly database recovery emergency scenarios than we would like. Almost all of them were preventable emergencies caused by a lack of testing of restore processes, or a lack of monitoring for successful backup completion. The great news for a traditional DBA moving to Aurora is that Aurora handles nearly everything related to backups automatically. Aurora is built on the RDS platform, which has rich backup features. The RDS platform offers a snapshot backup feature, which includes both full database content and transaction logs, with backup retention period configurable up to 31 days (for automatic snapshots of data and transaction logs). The daily backup window is even configurable, so DBAs can control when their daily snapshot backups happen.
In addition to automated snapshots, it is also possible to create manual snapshots. Manual snapshots can be created either on-demand, as a one-off backup of the database, or by creating a converted copy of an existing automatic backup snapshot. Unlike automatic snapshots, which have a retention window that RDS automatically manages, manual snapshots can be retained for any length of time and be copied to other regions or other AWS accounts as desired.
House of Brick cautions DBAs, excited about having a reliable automatic backup/restore mechanism provided, that operational rigor and manual testing are still required. So, it remains important to spend some time during the design phase contemplating off-site backup, backup security, and documenting/testing restore processes.
Database Tuning
Proper database tuning is another area that causes experienced DBAs anxiety when moving to a new platform. This is quite understandable, as many popular commercial RDBMS engines require manual adjustment and tuning in their out-of-the-box state, in order to appropriately take advantage of the resources available to the server or VM hosting the database. Aurora dodges this issue entirely by virtue of being a managed service. This means that no matter what size instance is chosen for an Aurora cluster member, the default parameters for the database engine are already optimized for that instance size. Indeed, one of the many lessons I give to DBAs who are new to Aurora is to not change any parameters in the default parameter group until a clear need to do so arises. The default presumption should be that the parameters are well-tuned for the environment.
Conclusion
Putting all these lessons together should make it considerably more comfortable for a DBA faced with provisioning their first Aurora environment. It really is just as easy as:
- Compute the storage needs (if needed for cost anticipation)
- Compare the instance sizes available to the source DB, and then start with something a little smaller
- Ensure a second read-only node is added to the cluster in another availability zone (If the Aurora cluster will be used for production)
- Assign the default parameter and option groups to the cluster as a starting point
- Make sure to set the maintenance and backup windows, as well as an appropriate automatic backup retention
- Start migrating data
While there are certainly other decisions that need to be made when provisioning an Aurora cluster, such as network and security settings, these are the questions that House of Brick hears the most from our customers. Hopefully, this article helps to reassure any Oracle DBA looking at Aurora that it really is a simple platform to provision correctly.