When the term High Availability is used in the context of a SQL Server deployment, features such as database mirroring and failover clustering are almost always the focus of attention. Whilst their contribution to a highly available SQL Server environment is beyond question, they should be seen as a single, albeit important, component of a much broader solution.
Almost every DBA function can be viewed through the prism of high availability. Security breaches, corrupt backups and poor maintenance practices can all contribute to unexpected outages and missed availability targets. In many ways, high availability is as much a state of mind as it is a feature or installation option.
A critical component of any SQL Server solution, particularly one designed for high availability, is a Service Level Agreement (SLA), which defines system attributes such as acceptable data loss, disaster recovery time and transaction performance targets. A common SLA entry is the availability target, usually expressed as a percentage. For example, a 99% availability target allows approximately 3.65 days of downtime per year. In contrast, a 99.999% target allows about 5 minutes. Each "9" added to the availability target significantly increases the cost of building an appropriate solution.
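The arithmetic behind an availability target is simple enough to sketch in a few lines of Python. This is only an illustration: the function name downtime_budget is my own, and it assumes a 365-day year with the target measured around the clock.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 24x7 service window


def downtime_budget(availability_pct: float) -> timedelta:
    """Return the downtime per year permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


# Print the yearly downtime budget for each extra "9".
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_budget(target).total_seconds() / 3600
    print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Each extra "9" cuts the downtime budget by a factor of ten, which is a useful way to frame the cost question for the business.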
I'm frequently involved in meetings in which availability targets are discussed with the business. In almost all cases, the conversation goes something like this:
DBA: What's the acceptable data loss and database downtime per year?
Business: Zero
At this point, one of two things usually happens. The DBA will disappear for a few weeks designing a system, come back with a price in the millions, and the business will freak out. Alternatively, the discussion will veer off into an analysis of how much each "9" costs, for example, how much extra it will cost to move from 99% to 99.9%. Such conversations are usually pointless, and miss the opportunity to focus on what's really important.
In my experience, the major mistakes made during SLA negotiations generally fall into three categories:
Unnecessary Expense: A lot of people tend to focus on availability targets with a religious zeal. For certain environments, that's fair enough, but for a lot of others, is there really much difference between 5 minutes of downtime per year (99.999%) and around 9 hours (99.9%)? Put another way, is the substantial amount of extra money required to build a "5 nines" system better spent on other things?
Scheduled Downtime: A common mistake is to only consider unexpected outages as downtime. For example, do you consider the outage associated with installing a service pack to be downtime? From a customer's point of view, downtime is downtime, regardless of the reason. The only thing worse than agreeing to a particular availability target without considering scheduled downtime is to avoid proactive maintenance in an attempt to meet the target. Availability targets are not, in themselves, the be-all and end-all. What's far more important is a stable, secure and reliable system. The business is far more likely to agree to a series of planned outages in return for a stable system. In contrast, running at 100% availability for a year or so before suffering a major catastrophe is not cool.
Scope: As mentioned earlier, highly available environments involve a lot more than features such as mirroring, log shipping and clustering. Here are a number of commonly overlooked items that are just as important:
- Validating storage systems for stability with SQLIOSIM before production implementation. Storage corruption problems are filthy and nasty. The only thing worse than physical corruption is getting into an argument with the SAN administrator/vendor about whose fault it is during downtime.
- Simulating small disasters such as dropping a table, and getting all DBA team members to practice recovering from backups up to the point of failure. Most disasters occur on a small scale, and it's quite astonishing how many sites don't validate their backup strategy by actually running through simulated small-scale disasters.
- Benchmark testing and baseline analysis to identify upcoming performance problems. Even though the database may be "available", if a user's query times out, they'll consider the database unavailable.
There are countless other examples of best practices which contribute to a highly available environment, most of which I've tried to address in my upcoming book.
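To put some numbers on the scheduled downtime point, here's a rough Python sketch comparing the availability figure you'd report if you only counted unexpected outages against the figure a customer actually experiences. The Outage class, the availability_pct function and the outage log itself are all made up for illustration. They're not taken from any real system.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year, 24x7 service window


@dataclass
class Outage:
    description: str
    minutes: float
    planned: bool


def availability_pct(outages, include_planned=True):
    """Availability over one year, optionally ignoring planned outages."""
    down = sum(o.minutes for o in outages if include_planned or not o.planned)
    return 100 * (1 - down / MINUTES_PER_YEAR)


# Hypothetical outage log for one year; the figures are illustrative only.
outages = [
    Outage("service pack install", 120, planned=True),
    Outage("storage firmware upgrade", 240, planned=True),
    Outage("unexpected failover", 30, planned=False),
]

print(f"unplanned only: {availability_pct(outages, include_planned=False):.3f}%")
print(f"all downtime:   {availability_pct(outages):.3f}%")
```

With these made-up numbers, the unplanned-only figure stays above 99.99% while the figure that includes maintenance drops below 99.95%, so it pays to agree up front which of the two the SLA is measuring.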
In closing this post, I'd like to share a technique I use when negotiating SLA targets with a business. As mentioned earlier, it's very easy to get bogged down in a "how many nines and how much will it cost" conversation. To circumvent this and sharpen the focus on what's really important, I like to prepare options papers for the major high availability categories. For example, with regard to a backup solution, I'd present something like this:
               OPTION A            OPTION B            OPTION C
----------------------------------------------------------------------------
NAME           SQL 2005 (native)   2008                SAN Snap/Clone
COST           -                   10                  20
BACKUP TIME    3 hrs               1.5 hrs             Instant*
RESTORE TIME   3 hrs               1.5 hrs             Instant*
COMPRESSION    no                  yes                 no
ENCRYPTION     no                  yes                 no
NOTES          -                   upgrade required    training required
Presented in this manner, a business can clearly see the pros, cons and cost of a number of options side by side, which makes the decision much easier for all involved. It's certainly a lot easier than an esoteric argument about how many nines the business wants.
In the next post, I'll float an idea I've had for a while, which uses end-of-year bonus payments (in lieu of on-call allowances) for meeting availability targets.
Cheers
Source: http://www.rodcolledge.com