House of Brick Principal Architect
High Availability is one of those topics that many people either confuse or intermingle with Disaster Recovery, or just plain misunderstand. This post is an attempt to help differentiate between High Availability (HA), Fault Tolerance (FT), and Disaster Recovery (DR). I will also discuss some of the High Availability options present in Microsoft SQL Server that your organization should take into consideration when planning your business continuity strategy.
First, a few definitions are needed:
High availability is a system design approach and associated service implementation that ensures a pre-arranged level of operational performance will be met during a contractual measurement period. (Wikipedia – http://en.wikipedia.org/wiki/High_availability)
Fault-tolerance is a design that enables a system to continue operation, possibly at a reduced level (also known as graceful degradation), rather than failing completely, when some part of the system fails. (Wikipedia – http://en.wikipedia.org/wiki/Fault_tolerance)
Disaster recovery involves the processes, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster. (Wikipedia – http://en.wikipedia.org/wiki/Disaster_recovery)
When looking at High Availability options for Microsoft SQL Server (with an emphasis on VMware), there are several options available (in no particular order):
SQL Server 2008R2 Cluster on Windows Server 2008R2 Cluster
SQL Server 2008R2 Database Mirroring
VMware High Availability (HA)
VMware Fault Tolerance (FT)
SQL Server 2012 AlwaysOn Cluster
These options are complex and are not interchangeable.
Microsoft Failover Clustering (MFC) runs at the operating system level and requires two or more host servers with shared storage. An MFC can host cluster-aware applications, such as SQL Server, in cluster service groups that have their own name and IP address. In the event of a host failure – or a service failure – MFC can fail the cluster service group (including the name and IP) over to another host node. Throughout this document, I will refer to this as “traditional clustering” to differentiate it from SQL Server 2012 AlwaysOn.
Database Mirroring requires two or more host servers but does not require shared storage. It runs a primary database copy that is used by the application and a “mirror” copy that is constantly updated (synchronously or asynchronously) via transaction copies. With the use of a witness server, mirroring can be set up to automatically failover in the event of a host or service failure on the primary. The catch is that the secondary server has its own name and IP address so the application has to be able to handle the change. Some applications cannot be modified to point to a second SQL Server and accommodate the failover. The additional storage for the second copy of the data is sometimes cost-prohibitive.
VMware HA monitors for the failure of the ESXi host and, in the event of a failure, evacuates all guest VM’s on that host to other ESXi hosts. This is not an online operation where the VM stays running if a host fails, and will cause brief downtime. Normally, this downtime is less than three minutes, but does result in a VM being restarted on a new host.
VMware FT is similar to VMware HA in that it protects against the failure of the ESXi host, but it does it in a different way. VMware FT maintains a “shadow copy” of the VM’s CPU and memory footprints on another ESXi host, which is kept in “virtual lockstep” with the primary VM. In the event of a host failure, the “shadow copy” instantaneously takes over without a visible loss of service to the outside world. VMware FT is limited to a single vCPU at this time, so it not broadly applicable to production environments. VMworld 2011 brought a demo of vSMP FT, so we know VMware is actively working to add more vCPUs into the feature.
SQL Server 2012 AlwaysOn is a new technology that has taken the benefits of operating system clustering and database mirroring, engineered away the features that cause administrators so many headaches, and combines them into a new package. AlwaysOn clusters do not require shared storage (and is similar to mirroring). They also have a shared name and IP (like O/S clustering) so that the applications do not need to be aware of multiple database server names.
Before I start the discussion on any one of the above listed technologies, I first ask the following question:
Do you have a less than four minute Service Level Agreement on your application?
This four-minute mark is, in the event of a sudden VMware host failure, a conservative estimate of the time taken to restart a virtual machine on a surviving host and have the SQL Server service up and responding to requests. If you have an SLA greater than four minutes (or no SLA at all), then VMware HA alone could be speedy enough to meet your availability needs. If the answer to the four-minute SLA question is yes, then consider the following questions:
Do you need to utilize rolling maintenance to minimize downtime?
Is your application middle tier cluster-aware?
Does your company have the technical expertise to manage Microsoft clustering?
The answer to all of these questions needs to be in the affirmative to successfully implement clustering. Without rolling maintenance, cluster-aware applications, and technical expertise, traditional clustering can introduce more downtime than it will prevent.
Many organizations think they are forced into using an operating system-level solution (clustering or mirroring) over the VMware solutions (HA and FT) because the operating system solutions protect you from a failure of the application service itself inside the operating system. However, programs like Symantec ApplicationHA and Neverfail vAppHA can handle this task through the vSphere 5 Application Awareness API (download the GuestAppMonitor SDK if you wish to investigate the API behind this feature).
SQL Server 2012 comes with a new spin on HA clustering called AlwaysOn (which requires Enterprise Edition). AlwaysOn has several key differences from traditional clustering and database mirroring:
AlwaysOn uses separate storage for each SQL Server instance whereas traditional clustering relies on shared storage.
AlwaysOn uses “Availability Groups” to group one or more databases as a logical unit that moves as one unit, whereas traditional mirroring is done database-by-database either synchronously or asynchronously.
AlwaysOn Availability Groups can have an individual IP and DNS name that moves with the group, whereas traditional mirroring has a different IP and name for each server in the cluster.
AlwaysOn can also open up the secondary databases in a read-only manner for reporting and backups.
I’ve blogged about my high opinion of AlwaysOn before on this blog, and I’ll continue to endorse this technology.
If you need to cluster, whether traditional clustering or SQL Server 2012 Always On, then the next question often is: can you cluster on VMware, getting the best of both worlds? Yes, it is fully supported, but there are additional complexities and architectural limitations incurred by this decision.
You lose many of the traditional benefits of VMware (such as vMotion and Storage vMotion), and you gain new, more stringent requirements on storage, such as the requirement for either raw disk mapping (RDM) or in-Guest iSCSI rather than VMFS. (VMFS is only supported for “cluster in a box” where a single ESXi host hosts all cluster node VMs, which is not a High Availability solution due to the single point of failure of the single ESXi host.) What you gain is VMware HA (once the appropriate host affinity rules are configured to keep the guest cluster nodes on separate ESXi hosts by default).
HA recommendations are always specific to a particular environment, but due to the many advantages of AlwaysOn, as described above House of Brick’s general recommendation, if clustering is needed is to create a SQL Server 2012 AlwaysOn solution on VMware if possible. If clustering is not necessary based on the criteria contained in this document, VMware HA is our general recommendation.
But… what about DR?
HA is not DR. HA is not your Business Continuity Plan (BCP). HA is not your backup.
This is extremely important, so I’ll repeat this with more emphasis.
HA is not DR. HA is not your Business Continuity Plan (BCP). HA is not your backup.
HA is your first strategy to provide a buffer for a set of minor scenarios. It can help keep unexpected downtime to either minutes or seconds rather than hours or days. Disaster recovery ensures that your critical data and their respective systems are not lost in the event of a serious catastrophe.
So, how do the aforementioned technologies play in DR? MFC is an HA technology, not a DR technology for more than one datacenter. So are VMware HA and FT. So what do you do? Well, the SQL Server toolkit provides the following options for DR:
SQL Server Database Mirroring
SQL Server Transactional Replication
SQL Server Log Shipping
SQL Server 2012 AlwaysOn Cluster
The differences between these technologies are dramatic.
Database mirroring to a DR location is an asynchronous solution that is configured on a per-database basis. Transactions are queued up and replicated to the DR location. This solution works very well for DR if the following two conditions are met: First, the mirrored database needs to be identical to the principal database. Secondly, the WAN connection between the two sites must be able to transmit the volume of data in a reasonable time window.
Transactional replication and log shipping are similar, in that transactional data is copied and applied from the primary datacenter to the server in the DR site. No automatic failover detection exists. Fail-back from the DR site is most likely complicated. However, it can replicate only the specific data you are interested in, if that is one of your goals for DR. This can cut down on the bandwidth required for DR. It’s an interesting topic to discuss if you think either of these solutions are an option for your DR initiative.
SQL Server 2012 AlwaysOn can handle your DR replication and failover in a simple GUI. The full discussion on using SQL Server 2012 AlwaysOn for HA and DR together was previously presented in this blog post of mine: How SQL Server 2012 AlwaysOn Blurs the Lines Between High Availabilty and Disaster Recovery.
So, what do you use?
That’s simple. You use the best tool for the job. Every environment is different. Every organization treats these topics in a unique manner. Use the feature that makes the most sense for your environment, test – test – test (and test again for good measure), and you are on your way!