“Availability”, as used in the world of computing, encompasses the concepts of system failures and recovery schemes and the impact of each on downtime and uptime. Availability is commonly quantified by the “number of nines”, meaning the percentage of time that a given system is active and working. For example, “2 nines” means 99% availability, and “3 nines” means 99.9% availability. The following table demonstrates the maximum system downtime required to achieve the coveted increase in nines.
AcceptablePerPerPerUptime (%)daymonthyear9572.00minutes36hours18.26days9914.40minutes7hours3.65days99.986.40seconds43minutes8.77hours99.998.64seconds4minutes52.60minutes99.9990.86seconds26seconds5.26minutes
As can be seen, to increase availability from “two nines” to “five nines” requires a decrease in system downtime from 14.40 minutes per day to only 0.86 seconds per day. Many customers require a certain level of system availability, from their service providers and typically specify this level of availability in a Service Level Agreement (SLA). The SLA may also specify what percentage of the time services will be available, the number of users that can be served simultaneously, performance benchmarks to which actual performance are periodically compared and the like. Often, financial penalties are levied for failure to meet these contractual requirements, thus providing a considerable incentive to service providers to increase system availability. Correspondingly there is a need for service providers to be able to predict availability levels with a considerable degree of accuracy and robustness.
One way to improve availability is by the use of clustering. A cluster is a group of independent computers that work together to run a common set of applications or services but appear to the client and application to be a single system. Clustered computers are physically connected by cables and are programmatically connected by specialized software, enabling features (such as load balancing and fail-over) that increase availability.
Load balancing distributes server loads across all the servers in the system, preventing one server from being overworked and enabling capacity to increase with demand. Network load balancing complements clustering by supporting availability and scalability for front-end applications and services such as Internet or intranet sites, Web-based applications, media streams and terminal-emulating server-based computing platforms.
Fail-over automatically transfers resources from a failing or offline cluster server to a functioning one, thereby providing users with constant access to resources. For example, a MICROSOFT SQL SERVER or MICROSOFT EXCHANGE SERVER, among others, could be implemented as a clustered server.
Current analysis methods used for calculating system availability typically consume massive amounts of time and hardware resources and thus can be enormously expensive. One or more servers are typically set up in the deployment and tests that are supposed to simulate expected usage are run. Availability statistics are collected and metrics such as Mean Time To Fail (MTTF) are computed. Not only are these tests expensive to run, the test results themselves are suspect because the code designers fix the errors encountered in the tests. Thus the simulation does not reflect the real world, and estimations of availability based on the simulation lack credibility. Additionally, in the case of calculating availability of Microsoft clustered systems, no known method has been developed whereby the connections between the server elements can be clearly expressed. Hence, there is a need in the art to calculate availability of such clustered systems in a less costly, more accurate and more credible manner. It would also be helpful to be able to realistically estimate availability to the order of precision required by the “number of nines” promised.