House of Brick Principal Architect
Welcome to the sixth part in our series of posts regarding the virtualization of your business critical SQL Servers. In this post, I discuss Disaster Recovery of your SQL Servers while running on VMware. I wrap up the post with a discussion of techniques used to help demonstrate the power of virtualization and discuss the benefits to those individuals that might continue to fear (or not understand) virtualization.
High Availability and Disaster Recovery
To start the discussion, let’s begin with a reminder of the two definitions of High Availability (HA) and Disaster Recovery (DR).
High Availability (HA) is a design approach where systems are architected to meet a predetermined level of operational uptime, such as a Service Level Agreement (SLA). This means systems should have appropriate levels of redundancy while still keeping the environment as simple as possible as to not introduce more failure points.
Disaster Recovery (DR) is the process of preparing for, and recovering from, a technology infrastructure failure that is critical to a business. The core metrics of DR are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO is a measurement of the maximum time window where data could be lost in the event of a disaster. The RTO is the measurement of the maximum time that a core service or system can be offline after a disaster.
High availability is not disaster recovery, and disaster recovery is not high availability. Do not confuse the two, and do not assume that a solution for one is the same solution for the other.
In my previous post I talked about the options for SQL Server HA on VMware. We continue this topic now with a greater deep-dive on DR options for SQL Server on VMware.
Disaster Recovery Discussion Topics
A proper DR configuration must be geographically separated by a reasonable amount of distance. Here in the Midwest, the extreme weather plays a key role in determining DR distances. For example, a tornado can take out a few square miles of land, but what it does strike it will probably level. On the other hand, a blizzard could come through in the winter months and knock out power for large amounts of the area for days. In this scenario, the equipment suffers no real damage, but it is equally as unavailable. Both scenarios knock you offline for an indefinite period of time.
Be smart in your placement of your resources. One of the worst examples of this was from a friend whose company had its primary datacenter in Miami. The secondary datacenter was in New Orleans. Does anyone remember Hurricane Katrina? It was about to knock out their first datacenter, so the company failed over to the New Orleans site. Four days later it destroyed their secondary datacenter, and because the primary was still down, they were left with their datacenter on the third floor of a building with 15 feet of water in the lower floors. The business was down for weeks while off-site backups were gathered, a new datacenter and equipment allocated, and the business restored to the new site. It was an awful process.
The RPO and RTO come into more importance when discussing DR strategies. The recovery point and recovery time objectives must be determined before you start the DR planning process. They drive certain technologies as part of the architecture. These numbers can quickly eliminate several options and, sometimes, can help the business rethink the numbers.
One of the major points that few people discuss is the fail-back process. How do you get your data back to the primary datacenter after you have failed over and now have new transactions? What if your downtime requirements are tight?
So, without further ado, here are the methods for achieving DR with SQL Server on VMware.
VMware Site Recovery Manager
VMware Site Recovery Manager (SRM) and its little sister, vSphere Replication, are both means to handle SAN-to-SAN replication for disaster recovery purposes. The recovery time objectives with both of these products are low and, most importantly, the human steps in the failover process are few (or sometimes nonexistent).
SRM also handles failover and failback processes and can perform automatic testing of the replication process. It’s fantastic for those environments that have to routinely demonstrate the DR process without impacting the servers. Multiple failover and failback plans can be defined and carried out as required.
Single instance SQL Server on VMware works great with SRM, as long as you can demonstrate transactional integrity during the replication. Many SAN vendors have SRM guides for properly configuring the environment to work with SRM. Test transactional integrity with your normal (and maximum) workloads before putting these solutions into production.
However, the smallest RPO per VM or LUN replication is 15 minutes, and sometimes longer, depending on the SAN vendor. Keep this in mind. Sometimes, a combination of this strategy and a more real-time replication, such as database mirroring or AlwaysOn, could be used to complement the SRM strategy and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes. Can dictate order in which VMs are powered on.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, repeatable, and multiple plans can be defined for various situations.
Failback Process: Simple, repeatable, and multiple plans can be defined for various situations.
Pros: Low human intervention means failover process has a lower chance of errors during failover or failback. Can be tested and audited periodically without impacting operations.
Cons: RPO could be higher than business can allow.
vSphere Replication
vSphere Replication is a new feature of vSphere 5.1 that is included for free with all versions from Essentials Plus and above. Instead of orchestrating san-to-san LUN replication, this technology handles site-to-site replication of virtual machine change blocks of individual virtual machine disks. Failover is manual, but for a free product, this technology handles replication very well.
As with VMware SRM, the smallest RPO per VM replication is 15 minutes. Keep this in mind, and sometimes a combination of this strategy and a more real-time replication, such as database mirroring or AlwaysOn, could be used to complement the SRM strategy and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, manual, and repeatable.
Failback Process: A manual process but is simple and repeatable.
Pros: Included with the core vSphere 5.1 suite in almost all licensing levels, and is simple to setup DR with commodity hardware and a 15-minute RPO.
Cons: RPO could be higher than business can allow. Failback process is manual.
SQL Server Asynchronous Mirroring
SQL Server database mirroring has been around since 2005 Service Pack 1. Asynchronous database mirroring allows a near-real-time replication of data between the two nodes. If a failover occurs, the secondary node fires up and picks up where the other left off. However, data might be missing if failover occurs while data is still in transit.
When running in this configuration, a witness server should also be used to facilitate the automatic failover. Otherwise, the decision to promote a secondary server to principal server is not automatic.
Also, asynchronous mirroring is only available in the Enterprise editions of the product. Standard edition gets you ‘full safety’ mode, which is real-time synchronous replication only. If your bandwidth between the two sites is either slow or has a high latency, you will feel the performance impact rather quickly. You could even experience significant delay in your production operations because of this impact. Unless the bandwidth AND latency between the two sites exceeds the transactional rates required by your workload during ALL periods of the day, synchronous mirroring is not recommended as a solution for production DR.
You will also notice that if the bandwidth throughput between the two sites is lower than required, data can stack up waiting to be replicated and the recovery point objective might not be obtainable.
RTO: Very quick, as the roles are flipped and the secondary node fires right up.
RPO: Variable, depending on transactional volume and available bandwidth.
Failover Process: Quick and generally without human intervention.
Failback Process: Resynchronize and fail the nodes back to primary. It is very straightforward and simple.
Pros: Quick failover times. Simple and not error-prone failover or failback processes.
Cons: Expensive licensing. Application must support mirroring failover target. Failover is at the database level and not the instance, so applications with multiple databases might require scripting.
SQL Server 2012 AlwaysOn Asynchronous Replication
SQL Server 2012 AlwaysOn is a blend of the virtual IP address and failover of Microsoft Failover Clustering and an improved derivative of SQL Server database mirroring. Because of the options that I previously covered here in this post, AlwaysOn can serve as both a DR and an HA solution, depending on the configuration. Again, I stress asynchronous replication here because of the impact of the speed and latency logistics around WAN synchronous replication.
RTO: Failing a node over to a DR site could be measured in seconds, depending on transactional volume and replication rates.
RPO: Could be as low as a second or less, depending on transactional volume and replication rates.
Failover Process: As simple as it gets. Happens in seconds. The Availability Group moves over and the application reconnects to the same IP address as before.
Failback Process: Even simpler. It’s the same process as failover, and just works!
Pros: Easy to setup and easier to manage. Extremely simple failover and failback processes.
Cons: Potentially expensive licensing. Must go through the upgrade process to SQL Server 2012.
Transactional Replication
Transactional replication allows the replication of data and schema changes from one server to one or more other servers. Data can be replicated at various time intervals in a number of ways. One good note about transactional replication is that you can select which tables get replicated. If some tables need to be replicated and others can be ignored, you can do this!
No automatic means for failover currently exist, so failover is a manual process. Data sources must have their targets changed, or for the more technically savvy, DNS aliases would need to be updated to point to the new server. The failover process would need to be documented and heavily tested. No automated fail-back procedure exists, so this process would also need to be planned for and tested heavily as well.
RTO: Varies depending on the complexity of the environment and the applications that connect to the database(s).
RPO: Varies depending on the replication time period to the distributor plus the time to have the subscribers fetch and apply the transactions.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Can selectively replicate databases or data subsets to one or more target servers on whatever time period you require.
Cons: Potentially complex and can be cumbersome for failover and failback processes.
Log Shipping
Log shipping is a process for backing up, copying, and then restoring database transaction logs from a source database to a standby server. It has been around quite a while, and has been used successfully for DR purposes for years.
RTO: Relatively short, generally 15 minutes or less.
RPO: Varies, based on the configuration, and can be very short.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Reliable once configured and established.
Cons: Purely manual failover and failback. Downtime is guaranteed while failover occurs. Application must be reconfigured to point to the new server, or a DNS alias adjusted.
Backup and Restore
If a low RPO is not a requirement (and yes, those environments actually DO exist!), a simple replicated backup and an automated restore could suffice. If a somewhat lower RPO is needed, periodic transaction log backups could be replicated as well.
RTO: Varies. Many factors affect this figure: size of the database, speed of the servers and storage at the DR site, and the state of the replicated backup files, just to name a few.
RPO: Varies, and is probably poor. It could be as long as the time taken between backups AND / or the time to replicate the backup file(s) to the DR location.
Failover Process: Restore last known good and successfully replicated backup. If exist, restore any replicated post-backup transaction log backups.
Failback Process: Create a new backup and restore at the primary site.
Pros: Least complex and simple to manage.
Cons: Long RPO and possibly RTO. Manual process.
Why isn’t Clustering in this List?
Clustering is not a solid DR solution. Why not? Think about the following points.
Shared storage is available at a single location, practically speaking. Extending a SAN’s high speed interconnects across a WAN is impractical.
Your one shared-storage SAN is still a single point of failure. I don’t care what any SAN vendor says. It can still fail. It still needs maintenance. It still has a power cord that can be tripped over. Etc. etc.
Now ask that question again. Clustering is for high availability, not disaster recovery.
Now What? Demonstrating that Virtualization Helps
So now what? You now have a solid understanding of each component in the physical and virtual stacks. You know the methods for profiling the existing physical servers and building the appropriately sized virtual equivalents. You know about HA and DR strategies.
But what is the hardest part about virtualizing business-critical servers?
It’s the environment.
It’s people.
People come in many forms – from employees who were burned from a failed virtualization attempt. It can come from skeptics (my favorite challenge). It can come from an organization that is naturally resistant to change. It can be from those with no frame of reference and are fearful of adding another black-box layer to their environments.
So what do you do?
You educate.
You demonstrate.
You amplify.
You support.
Educate the stack and application owners and stakeholders. Identify and educate the organizational advocates that exist in any group of people. Educate a team on what the other teams expect from them. Educate a team on what advantages virtualization brings to them specifically. Educate them on the technologies and how to identify the key performance metrics that matter to them.
Education will make believers of most people.
Next, setup a virtualization proof of concept on similar, if not equivalent, hardware to what you run in production. Identify your strangest, largest, or most complicated servers, and then add the servers you fear. You know, those servers that have been cranky when changes so they sit isolated and no one touches them. Clone the databases and workloads onto the new, properly sized virtual machines. Simulate production-like workloads and measure all of the key performance metrics you identified when you profiled your physical server. I know that if you followed the steps outlined in the previous posts, you will find that the performance of those virtualized workloads at least matches the physical equivalent. The technology has caught up with the business need and is now functionally transparent.
Now that you have objectively demonstrated equivalent performance in the virtualized environment, run up and down every hallway at your organization shouting ‘Virtualization works!’ No, wait. That might be counter-productive in the long run. Instead, demonstrate the performance equivalent to everyone in the organization that needs to believe in the technology. Demonstrate core vSphere functionality, such as vMotion, snapshots, or resource changes, and every response is ‘Wow!’
Make the right people believers by letting them see it for themselves. These people can become your biggest virtualization advocates.
Finally, support everyone through the production migration process. Sure, there might be hiccups along the way, but the hardest part is done. Once the virtualization of your business-critical applications has completed, the process is complete. All new servers will be virtualized up-front, and discussions would happen only if the workload was not assumed to be suitable for virtualization (fringe cases do exist but it is rare). If performance issues arise, virtualization will not be immediately everyone’s scapegoat.
You have succeeded.
Acknowledgements
Special thanks go to my coworkers at House of Brick. We all fight the virtualization fight and enjoy the challenge. I believe we have developed the best business-critical workload virtualization team in the world, and thank you all for constantly pushing yourselves and each other to stay in front of this groundbreaking shift in IT’s core.
We all believe in the technology, and I hope it shows in our eagerness and determination to progress as the technologies continue to evolve. After reading this blog post series, I hope that you believe in the technology as much as we do. We will continue to post new and exciting technologies to this blog, all of which are furthering the quest for performance for your data. Stay tuned!