Conventional data storage systems and conventional service providing systems use distributed computing platforms to store data or host user services. Such systems include file storage systems, remote monitoring systems, remote control systems, etc. To maintain continuous operation, providers of these systems use conventional failure tolerance mechanisms.
Some system failures are planned and others are unplanned. Unplanned failures include hardware failures, power outages, software bugs, user errors, and other individual resource or network problems. To correct for unplanned system failures, some providers have designed their systems with fault tolerance mechanisms. Conventional fault tolerance mechanisms require human intervention along with some automated redundancy techniques.
For unplanned system faults, conventional automatic redundancy techniques include information redundancy, which uses error correction codes. Other conventional automatic redundancy techniques include time redundancy, in which a failed operation is performed several times, such as by retransmission. Other techniques include physical device redundancy, whereby one device is active and another is on standby to take over when the active device fails. Still other techniques include replication, where several units operate concurrently and a voting system selects the outcome. Typically, the more fault tolerance built into the system, the more costly the system is.
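Two of the techniques named above, time redundancy and replication with voting, can be sketched in a few lines. This is a minimal illustration only, not a description of any particular conventional system; the function names `retry` and `majority_vote`, the default of three attempts, and the strict-majority rule are assumptions chosen for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence, TypeVar

T = TypeVar("T")
H = TypeVar("H", bound=Hashable)


def retry(operation: Callable[[], T], attempts: int = 3) -> T:
    """Time redundancy: repeat a failing operation up to `attempts` times.

    Hypothetical helper; the attempt limit is an assumed policy.
    """
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # any failure triggers another attempt
            last_error = error
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error


def majority_vote(outputs: Sequence[H]) -> H:
    """Replication: return the value reported by a strict majority of units.

    Hypothetical helper; raises ValueError when no value has a strict majority.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value
```

With three replicas, `majority_vote` masks one faulty unit, which also shows why added tolerance adds cost: each extra masked fault requires additional concurrently operating units.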
An example of a planned system fault is when a resource in a system requires improvement via maintenance or an upgrade. Conventional systems are improved when a human physically disconnects the old resource, configures the software of the improved resource, and physically integrates the improved resource. This conventional technique causes system downtime.
Providers of conventional systems are under pressure to reduce downtime and maintain seamless operation during planned and unplanned system failures. To evaluate different failure tolerance mechanisms, providers have developed a measurement, termed system “availability,” to describe a desired or achieved level of fault tolerance, which results in continued system access for their users. Availability refers to the amount of time a system is functioning without interruption. The fraction of time the system is available may be expressed as a percentage. A system with a maximum of 5.26 minutes of downtime over an entire year is said to be highly fault tolerant or to have 99.999% availability. Telephone systems have the goal of achieving this level of tolerance. Other systems, such as an energy monitoring system, may experience approximately 44 to 88 hours of downtime in a year, which amounts to an availability of 99.5% to 99%, respectively.
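The relationship between an availability percentage and yearly downtime is simple arithmetic, sketched below. The function name `downtime_hours_per_year` is hypothetical, and the calculation assumes a 365-day year of 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.999% availability, expressed in minutes of downtime per year
    print(f"{downtime_hours_per_year(99.999) * 60:.2f} minutes")  # 5.26 minutes
    # 99.5% and 99% availability, expressed in hours of downtime per year
    print(f"{downtime_hours_per_year(99.5):.1f} hours")  # 43.8 hours
    print(f"{downtime_hours_per_year(99.0):.1f} hours")  # 87.6 hours
```

These values match the figures quoted above: about 5.26 minutes at 99.999% availability, and roughly 44 and 88 hours at 99.5% and 99% availability.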
On top of these operating demands, the complexity of storage and service host systems has grown with the increasing need for data storage, delivery, analysis, transformation, and presentation. In addition, commercial and customer data and software service providers have moved storage and hosting functionality from local to remote locations. Some rely on global networks to host their storage or service capabilities. Providers find themselves using larger and more complicated systems, in which maintaining system availability is more difficult.
There is a need for improved system availability in data storage and service providing systems.