1. Field of the Invention
The present invention relates generally to predicting a combined performance and availability characteristic of complex systems, a combination referred to hereinafter as “performability”. More specifically, the present invention relates to making efficient, computerized predictions of the performability of systems composed of components that can fail independently of one another. More particularly, the present invention includes a method and apparatus for determining whether a predetermined multi-component system can meet target performability requirements. An exemplary embodiment is described for predicting the multi-part performability of a predetermined, complex, multi-component data storage system.
2. Description of the Related Art
Systems composed of many parts that can fail independently are commonplace. Consider the simple example of a school bus system that uses a number of buses of different capacities; each bus can break down independently of the others. Almost all such systems have desired performance goals. In the example, the goal might be that all the students are brought to school at the start of the school day in time for the first class.
It is natural to want such systems to be able to provide full service all the time, but it is usually too expensive to ensure this. In such circumstances, people are usually willing to live with a lower level of performance for a certain period, in order to reduce the overall cost of the system. In the example, a school district might be able to afford a spare bus or two—but it would be unlikely to be able to keep a complete spare fleet of buses. Even if there were plenty of spare buses, if a bus failed on its rounds, some of the students might have to wait for a replacement bus to arrive, and so be late. It might even be acceptable for no bus to be available for a day or two if alternative mechanisms existed to get the students to school.
In general, in a system that can exhibit partial failures, the performance of the system with one or more partial or complete component failures is often lower than when the system is completely failure-free. The term “performability of a system” describes the resulting set of performance levels achieved (perhaps under some load) and the fractions of time that the system achieves them (as a result of partial or complete component failures). The performability of a system can be represented as a set of pairs “(r,f)”, where each “r” represents a performance level, and each “f” the fraction of time that performance “r” is achieved or bettered. It is common to include pairs for both “full performance” and “no performance” in the specification.
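The (r,f) representation can be computed mechanically once each failure scenario has a predicted performance level and a fraction of time. The sketch below is a minimal illustration under assumptions the text does not state: the scenarios are mutually exclusive, their time fractions sum to 1, and performance is a single number where larger is better. The function name `performability_pairs` and the bus figures are hypothetical.

```python
from typing import Iterable


def performability_pairs(
    scenarios: Iterable[tuple[float, float]],
) -> list[tuple[float, float]]:
    """Build (r, f) pairs from per-scenario predictions.

    Each item of `scenarios` is a hypothetical (performance, time_fraction)
    pair for one mutually exclusive failure scenario; the time fractions are
    assumed to sum to 1. For every distinct performance level r, f is the
    total fraction of time spent in scenarios whose performance is r or better.
    """
    scenarios = list(scenarios)
    levels = sorted({perf for perf, _ in scenarios}, reverse=True)
    return [
        (r, sum(frac for perf, frac in scenarios if perf >= r))
        for r in levels
    ]


# Hypothetical school-bus figures: (students on time, fraction of school days).
bus_scenarios = [(3000, 0.95), (2960, 0.04), (0, 0.01)]
print(performability_pairs(bus_scenarios))
```

Because f accumulates every scenario at level r or better, f never decreases as r falls, and the pair for the lowest level (“no performance” here) always has f equal to 1.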
Performability can be predicted, or actually achieved (or measured) in practice. “Performability requirements” are target goals for the performability of a system, and “performability specifications” are written versions of performability requirements. Similarly, a “performability function” is a representation of a performability requirement (or measurement or prediction). It is sometimes shown as a “performability curve”, which is a graphical rendering of the function. Performability requirements (or measurements or predictions) consist, by the previous definitions, of multiple parts; each different combination of the performance and the availability represents a different part. In what follows, we sometimes use the term “multi-part” without loss of generality, to bring this fact to the attention of the reader.
A system “meets (or satisfies, or performs, or fulfills) its performability requirements” (or “is predicted to . . .”) if the achieved (or predicted) performability of the system is at least as good as the performability requirement for it. An object of this invention is to provide a faster way to calculate whether this is the case.
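One natural reading of “at least as good as” is that, for every required pair (r,f), the predicted fraction of time at performance r or better must be at least f. A minimal sketch of that check follows; the function name `meets_requirements` is hypothetical, and both performability functions are assumed to be given as lists of (r,f) pairs.

```python
def meets_requirements(
    predicted: list[tuple[float, float]],
    required: list[tuple[float, float]],
) -> bool:
    """Return True if the predicted performability is at least as good as
    the required performability.

    For a required level r_req, the predicted fraction of time at r_req or
    better is the largest f among predicted pairs with r >= r_req (f cannot
    increase as r increases); if no predicted level reaches r_req, that
    fraction is 0.
    """
    for r_req, f_req in required:
        achieved = max((f for r, f in predicted if r >= r_req), default=0.0)
        if achieved < f_req:
            return False
    return True


predicted = [(3000, 0.95), (2960, 0.99), (0, 1.0)]
print(meets_requirements(predicted, [(3000, 0.90), (2960, 0.98)]))  # True
print(meets_requirements(predicted, [(3000, 0.97)]))                # False
```

The check itself is cheap; the expensive part, as discussed below, is producing the predicted pairs in the first place.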
In the example above, the ideal performance requirement is “all students get to school on time, every school day”. But, given data on the rate at which school buses fail and the costs of having standby buses, the school district is likely to accept a lower level of performance as long as the lower performance (fewer students on time) does not happen too often and does not last too long. Thus, the requirements for the school bus service in this example might read as follows (in part):
“Ideally, all 3000 students will get to school on time, every school day.
It is acceptable for up to 40 students to be delayed by 15 minutes, as long as this does not happen on more than 14 days a year.
It is acceptable for up to 20 students to be delayed by as much as an hour, as long as this does not happen on more than 7 days a year, provided that each occurrence affects no more than 3 consecutive school days.
And, at most once a year, it is acceptable for schools to close for one or two days if too many buses fail at the same time.”
These requirements combine performance (how many students get to school when) with availability (the likelihood of a given number of students arriving on time). A performability specification may contain or imply a “workload” as part of its performance specifications (the number of students in this example), but need not do so.
The performability specification is a concise way to cope with the fact that the average performance over a system's lifetime is often not a useful metric for many target-system designers. Even with all the data at hand, it is often difficult to work out whether a particular system design will be able to meet a performability specification. For example:
(a) the number of failure scenarios can be very large, and may have to encompass multiple concurrent failures, not just one failed component at a time; furthermore, each component may have different performance and failure characteristics, so it may not be enough simply to predict the effects of a single “representative” component failing;
(b) it may be expensive or difficult to predict the performance or likelihood of each failure scenario; and
(c) the performability specifications may themselves be complicated.
Each of these issues will be discussed briefly below.
(a) The number of failure scenarios can be very large. In the example system, assuming 100 buses in the fleet, there are 100 different cases of a single bus failing that need to be analyzed. Each combination of possible failures, including the special case of no failures, is a separate “failure scenario”. Analyzing 101 different failure scenarios is not too hard using a computer. But if the school buses fail often enough that there is a reasonable chance that two of them will be out of service at once, the number of combinations increases dramatically, to approximately 100 times 100 cases, or about 10,000. Coping with a third failed bus may involve almost 1 million failure scenarios to analyze. And each bus may have its own, distinct failure mechanisms (for example, one might have a petrol engine, and another a diesel engine).
There are, of course, systems that are much larger still. For example, some computer storage systems may contain thousands of components. The number of failure scenarios grows combinatorially with the number of components in the system and with the number of concurrent failures it can tolerate while still remaining “functional.” For example, if a computer storage system with 10,000 data storage disks could tolerate a maximum of two disk failures, then it would be necessary to evaluate its performance in on the order of 100,000,000 (10,000 times 10,000) different failure scenarios.
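The growth in failure scenarios can be counted directly. The sketch below, with the hypothetical function name `count_failure_scenarios`, treats each scenario as an unordered set of failed components and includes the failure-free case, so its totals are somewhat smaller than ordered estimates such as 100 times 100, though they grow just as steeply.

```python
from math import comb


def count_failure_scenarios(n_components: int, max_failures: int) -> int:
    """Count failure scenarios with at most `max_failures` concurrently
    failed components out of `n_components`, including the failure-free
    scenario. Each scenario is an unordered set of failed components."""
    return sum(comb(n_components, k) for k in range(max_failures + 1))


for n, k in [(100, 1), (100, 2), (100, 3), (10_000, 2)]:
    print(n, k, count_failure_scenarios(n, k))
```

This reports 101, 5,051 and 166,751 scenarios for up to one, two and three failed buses out of 100, and 50,005,001 for up to two failed disks out of 10,000.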
(b) It may be expensive or difficult to predict the performance or likelihood of each failure scenario. Any of the following, alone or in combination, may be relevant to performability: the “target system” being designed may be very complicated; the workloads may be very complicated; the components may themselves be complex systems; and the models used to predict performance may be very complicated, slow, or difficult to use. Analysis is even more difficult if different components have different failure rates or performance characteristics. Returning to the simple example system, a fleet may consist of different kinds of school buses, each with its own failure rate, speed, and capacity.
(c) The performability specifications may themselves be complicated. In the school bus example above, the number of students was constant; but it could also be the case that the number varies; perhaps the school population fluctuates at different times of year, or perhaps the number of students trying to get to school drops in very cold weather, at precisely the time when buses are more likely to break. Such complications make the problem of determining whether a given design (in this case, the number and type of buses) will meet its performability specification even more difficult.
In computer storage systems, the workload may be very complicated indeed: dozens of pieces of information about each portion of the workload may be needed to predict its performance accurately.
The number of possible combinations of different types of workload is effectively extremely large.
Therefore, there is a need for a computerized system that can determine, economically and efficiently, whether a given complex system design meets a given performability specification.
3. Problem Statement (Applied to an Exemplary Data Storage System)
This invention is especially suited for assisting with the design of multi-component computer-based systems (for brevity, generally referred to hereinafter merely as “systems” in the context of this document). Such systems typically comprise a collection of one or more data processing, communication, and storage components. The design, configuration, and maintenance of such computer systems are difficult. There is a need to predict the behavior of the systems in a novel manner that alleviates many system design problems, including: allowing installed systems to meet their requirements more often; reducing the cost of systems, because there would be less need for expensive over-design; and reducing the number of emergency repair and preemptive maintenance calls by providing a good understanding of which particular failures are relatively benign.
Generally, and as will be described in more depth hereinafter with respect to an exemplary embodiment, complex systems, such as those comprising hundreds or thousands of computers and attendant mass data storage apparatus, create complex problems of selection, configuration, and maintenance for system designers and information technology system administrators. The purchase and maintenance of such systems are significant business expenses. Configuring such systems before purchase, and modeling their performance without building prototypes, are complex tasks because of the enormous number of possible workloads and system configurations.
FIG. 1 shows a representative system, used hereinafter as an exemplary embodiment for discussion of the present invention: the target system 100 is a relatively small computer system having a disk array 132. We describe the invention in the context of a performability analysis of the data storage subsystem. No limitation on the scope of the invention is intended by the inventors, nor should any be implied from the use of this example.
As an example of how the system 100 of FIG. 1 may be used, consider a health maintenance organization (HMO) where host A 111 processes patient record data while host B 111′ processes provider and employer/employee data. Storing profiles for thousands of patients, together with the associated provider and employer/employee information, may require a storage capacity of hundreds of gigabytes (GB), or even more than a terabyte (TB). Based on currently available hardware technology, such a system 100 for the HMO might require seventy disk drives and ten controllers.
The data storage subsystem of the system 100 includes the disk array 132. The disk array 132 typically contains several disk drives 101-108 to store data; one or more controllers 121, 121′, both to communicate with the clients (host computers 111, 111′) and to control the operation of the disk array; one or more data caches 131, 131′ that temporarily hold data in fast-access memory, thereby improving performance; and appropriate buses 109 or other interconnections to join these components together. Other associated components, such as device drivers, disk array firmware, and modules that implement basic functions in the data path (e.g., parity calculation engines, direct memory access engines, buses, bus bridges, communication adapters for buses and external networks, and the like, not shown), are also part of a typical data storage subsystem.
There are many different designs of disk arrays, but most of them share one common trait: they are intended to increase the likelihood that data stored in them will survive the failure of one or more of the disk array's components. Because of this, such disk arrays are often used for large data storage and management applications, often storing very critical data.
The basic approach to achieving acceptable partial failure modes of operation is to provide redundancy: the provision of more components than strictly needed to accomplish the basic function. A redundant array of inexpensive disks (“RAID”) is often used for large data storage and management applications (slower, high-capacity tape drive arrays and optical disk drive apparatus can be employed similarly). RAID was first popularized by researchers at the University of California, Berkeley: see, e.g., D. Patterson, G. Gibson and R. Katz, A Case for Redundant Arrays of Inexpensive Disks (RAID), Proceedings of the 1988 SIGMOD International Conference on the Management of Data, Chicago, Ill., May 1988. These RAID ideas were initially applied only to the disk drives in a disk array; it is important to realize, however, that real failure scenarios also affect the other data storage subsystem components (namely, all of FIG. 1 excluding the Host A 111 and Host B 111′ computers). In one mode of RAID operation, multiple copies of the stored data are kept, each copy on a different disk drive. This is often referred to as “RAID 1 mode”, or “mirroring”. Although this increases the number of disk drives needed, it also increases the availability of the data: if one disk drive breaks, another is there to provide a copy of the data. Other RAID modes allow for partial redundancy: these provide failure tolerance at lower cost, but with more significant performance degradation after a failure and during certain normal-mode input/output (“IO”) operations.
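The availability benefit of mirroring can be illustrated with a textbook calculation, assuming (beyond what the text states) that disks fail and are repaired independently and that each disk is working a fixed fraction of the time. The function names are hypothetical, and the sketch deliberately ignores controllers, caches, and buses, which is exactly the limitation of disk-only RAID analysis pointed out in the paragraph above.

```python
def mirrored_availability(disk_availability: float, copies: int = 2) -> float:
    """Fraction of time at least one of `copies` mirrored disks is working,
    assuming independent disk failures and repairs."""
    return 1.0 - (1.0 - disk_availability) ** copies


def unprotected_availability(disk_availability: float, disks: int) -> float:
    """Fraction of time all `disks` are working, as required when data is
    spread across them with no redundancy (same independence assumption)."""
    return disk_availability ** disks


print(mirrored_availability(0.99))        # two mirrored copies
print(unprotected_availability(0.99, 4))  # four disks, no redundancy
```

With a per-disk availability of 0.99, two mirrored copies give about 0.9999, whereas four disks with no redundancy give about 0.961.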
Data inside the array is spread out onto what are referred to in the art as Logical Units (“LUs”), sometimes referred to as “LUNs” for Logical Unit Numbers, which are the names by which LUs are known. Each LU represents a subset of the total storage space available in the array, and is an aggregation of disk drives into a single logical construct, visible to the host computers that use the array. Each LU is used and managed independently of the other LUs. Typically, LUs can be constructed from any aggregation of the disk drives accessible to a controller inside an array. In FIG. 1, four disk drives 101, 102, 103, 104 are grouped into one such LU, and the remaining disk drives 105, 106, 107, 108 into another. (It is also common in the art to say that the LU is “placed on” the disk drives.) Data flow is indicated generally by phantom lines, showing that in the example system of FIG. 1, host computer A 111 uses the LU on disk drives 101, 102, 103, 104, and host computer B 111′ uses the LU on disk drives 105, 106, 107, 108.
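The mapping of LUs to disk drives and hosts in FIG. 1 is the kind of information a performability analysis needs in order to tell which users a given failure scenario affects. A minimal sketch, with hypothetical LU labels (`LU-1`, `LU-2`) and function name, which ignores any redundancy within each LU:

```python
# Hypothetical encoding of the FIG. 1 layout: LU label -> (host, disk drives).
LU_LAYOUT: dict[str, tuple[str, frozenset[int]]] = {
    "LU-1": ("host A 111", frozenset({101, 102, 103, 104})),
    "LU-2": ("host B 111'", frozenset({105, 106, 107, 108})),
}


def affected_lus(failed_disks: set[int], layout=LU_LAYOUT) -> dict[str, str]:
    """Return {LU label: host} for every LU placed on at least one failed disk."""
    return {
        lu: host
        for lu, (host, disks) in layout.items()
        if disks & failed_disks  # non-empty intersection means the LU is hit
    }


print(affected_lus({102}))
print(affected_lus({104, 105}))
```

For example, a failure of disk drive 102 affects only the LU used by host A 111, while simultaneous failures of drives 104 and 105 affect both LUs.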
The introduction of redundancy in disk arrays, which is primarily present to handle component failures, raises several issues about performance, including:
(a) although it may be possible to continue operation after a failure, the performance of the system usually degrades in this state;
(b) even in failure-free mode, data replication can improve performance (e.g., if two disks contain identical copies of the same data, it is possible to read it from the disk that is less busy to improve performance), or hurt it (e.g., with two copies, an update has to be sent to both before the operation is completed, and that takes longer than simply updating one copy); and
(c) when failures are being repaired, recovery may impact performance: for example, after a disk drive has failed, the system will begin rebuilding its contents on a spare disk drive, so accesses originated by the reconstruction task compete with normal accesses.
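Items (a) to (c) can be made concrete with a deliberately simple model of a mirrored (RAID 1) pair, which is an assumption of this sketch rather than a model described here: each copy has a current queueing delay, a read is served by the less busy copy, and a write finishes only when every copy has been updated. Rebuild traffic from item (c) would appear as a larger delay on the affected copy, and a failed copy from item (a) as a list containing only the surviving copy. The function name, parameters, and millisecond units are hypothetical.

```python
def mirrored_response_time(
    op: str, copy_delays: list[float], service_time: float
) -> float:
    """Response time of one IO on a mirrored LU under a simple model:
    `copy_delays` holds the current queueing delay at each working copy."""
    if op == "read":
        # A read can be served by whichever copy is less busy.
        return min(copy_delays) + service_time
    if op == "write":
        # A write completes only when every copy has been updated.
        return max(copy_delays) + service_time
    raise ValueError(f"unknown operation: {op!r}")


print(mirrored_response_time("read", [2.0, 5.0], 8.0))
print(mirrored_response_time("write", [2.0, 5.0], 8.0))
```

With copy delays of 2 ms and 5 ms and an 8 ms service time, a read takes 10 ms and a write 13 ms: mirroring helps the read and hurts the write, as item (b) describes.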
The workloads processed by computer storage systems are often themselves very complicated, requiring a great many parameters to characterize them accurately. Small variations in such workloads can sometimes have large effects on the performance and cost of the associated storage systems. In the system 100 of FIG. 1, a workload can be defined by the stream of IO requests (READ and WRITE commands) issued by hosts A 111 and B 111′ to the disk array. The workload can be characterized by parameters such as the typical IO request size and the rate at which such requests are generated. But this is not enough to predict the performance of the storage subsystem of FIG. 1 accurately. Doing so requires many other workload characteristics, such as the amount of the workload that can be cached in cache memories 131 and 131′, the degree of sequential accesses to the on-disk data, and in some circumstances, correlations in the access patterns between the two hosts 111, 111′. Each of these can have a significant impact on the performance and cost of the storage system needed to meet the needs of the workload.
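A workload description of the kind listed above might be captured as a record like the following. The class and field names are hypothetical, cross-host correlations are omitted, and the two derived quantities rely on the simplifying assumption that cache hits never reach the disks; a real performance model would need many more parameters than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadStream:
    """Hypothetical subset of the parameters describing one host's IO stream."""

    request_size_kb: float      # typical IO request size
    request_rate_per_s: float   # rate at which requests are issued
    read_fraction: float        # share of requests that are READs
    cache_hit_fraction: float   # share of requests served from cache 131/131'
    sequential_fraction: float  # degree of sequential access to on-disk data

    def offered_bandwidth_kb_per_s(self) -> float:
        """Data requested per second, before any caching."""
        return self.request_size_kb * self.request_rate_per_s

    def disk_requests_per_s(self) -> float:
        """Requests per second that miss the cache and must reach the disks."""
        return self.request_rate_per_s * (1.0 - self.cache_hit_fraction)


host_a = WorkloadStream(8.0, 500.0, 0.7, 0.25, 0.1)
print(host_a.offered_bandwidth_kb_per_s(), host_a.disk_requests_per_s())
```

For an 8 KB request size at 500 requests per second with a 25% cache-hit fraction, this stream requests 4,000 KB/s and sends 375 requests per second to the disks.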
Developing designs for such systems is hard enough when failures are not taken into account. There are many published papers on estimating the performance characteristics of disk arrays. One example is the paper by Lee and Katz on An Analytic Performance Model of Disk Arrays, published in the proceedings of the ACM SIGMETRICS conference, May 1993 (pages 98-109).
Since a primary reason for deploying disk arrays is to support continued operation in the presence of failures, it is often very important to be able to predict the performability of the system under the range of failures that the system is likely to encounter in practice. Doing so with manual techniques is slow, error-prone, and unlikely to be satisfactory. It also suffers from all of the problems outlined above.
4. Description of the Prior Art
Given the complexity of designing and managing large computer systems, system designers rely on very simple rules of thumb to make design, configuration and purchase decisions. This can lead to systems that do not satisfy their performance expectations, or to excessive system cost due to over-design, or both.
A common approach is to build an actual test system for a proposed workload in order to evaluate system performance empirically. Although this is often time-consuming, hard to do, and expensive, it is nonetheless often the method employed by a system supplier to demonstrate the viability of a proposed system to a potential customer. Because such test systems are difficult to work with, even simple failure modes are rarely explored: doing so is expensive, time-consuming, and may not even be possible.
Another technique in the prior art is to predict the performance of storage devices with a performance model, such as that described in the paper by Lee and Katz on An Analytic Performance Model of Disk Arrays. This approach is not a complete multi-part performability analysis as defined herein because it fails to take into account the multi-part performability requirements that are the essence of this invention. It is hard to predict the performance of a single configuration using these tools, let alone explore hundreds or thousands of failure scenarios.
Moreover, average performance over the system's lifetime is not a useful metric for many target-system designers.
There is a need for a more realistic metric for the suitability of a given system for a given task. Such a metric comes from putting together the concepts of availability and performance, during both failure-free and degraded modes of operation under a given workload.
Prior solutions generally treat the failure analysis of complex systems as Markov chain reward models or Petri net models. These methods work best when predicting whether a system meets a single-part performability specification, for example, the average bandwidth available in a disk array, taking into account the possibility of disk component failures. See, e.g., S. M. Rezaul Islam, Performability Analysis of Disk Arrays, Proceedings of 36th Midwest Symposium on Circuits and Systems, Detroit, Mich., Aug. 16-18, 1993, pp. 158-160, IEEE Publications; Ing-Ray Chen, Effect of Probabilistic Error Checking Procedures on Performability of Robust Objects, Proceedings of 8th ACM/SIGAPP Symposium on Applied Computing, Indianapolis, Ind., Feb. 14-16, 1993, pp. 677-681, ACM Publications. Both of these papers use a different definition of performability than we do. They represent the system as a Markov chain and associate a single-valued “reward” with each state in the Markov chain. They compute the expected value of this reward based on all the state probabilities. Our definition of performability is multi-part; it does not require that the underlying system be modeled as a Markov chain, and we do not require all state probabilities to be calculated.
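The difference between the single-valued reward measure and the multi-part definition can be seen on a small example. The sketch below is an illustration of our own, not taken from the cited papers: it models a mirrored disk pair as a three-state birth-death Markov chain (0, 1, or 2 failed disks), assuming independent exponential failures and a single repair facility, and assigns each state a hypothetical read throughput. Unlike the approach described here, it does compute every state probability; its only purpose is to contrast one expected value with a set of (r,f) pairs. All function names and rates are hypothetical.

```python
def mirrored_pair_state_probs(failure_rate: float, repair_rate: float) -> list[float]:
    """Steady-state probabilities of 0, 1 and 2 failed disks in a mirrored
    pair: each working disk fails at `failure_rate`, and one disk at a time
    is repaired at `repair_rate` (same time unit for both)."""
    w0 = 1.0
    w1 = w0 * 2 * failure_rate / repair_rate  # balance: pi0 * 2*lam = pi1 * mu
    w2 = w1 * failure_rate / repair_rate      # balance: pi1 * lam = pi2 * mu
    total = w0 + w1 + w2
    return [w0 / total, w1 / total, w2 / total]


def expected_reward(probs: list[float], rewards: list[float]) -> float:
    """Single-valued measure: probability-weighted mean reward over all states."""
    return sum(p * r for p, r in zip(probs, rewards))


def multi_part(probs: list[float], rewards: list[float]) -> list[tuple[float, float]]:
    """Multi-part measure: (r, f) pairs, f = fraction of time at r or better."""
    return [
        (r, sum(p for p, q in zip(probs, rewards) if q >= r))
        for r in sorted(set(rewards), reverse=True)
    ]


probs = mirrored_pair_state_probs(failure_rate=0.001, repair_rate=0.1)
rewards = [200.0, 100.0, 0.0]  # hypothetical reads/s with 0, 1, 2 failed disks
print(expected_reward(probs, rewards))
print(multi_part(probs, rewards))
```

The expected reward collapses to a single figure of about 198 reads per second, which hides how often the pair runs at half speed or not at all; the (r,f) pairs keep that information.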
None of the above approaches provides an efficient, effective, high-quality prediction of whether a target system will meet its performability goals.
Thus, given a description of a complex system, and the performability requirements of the end-user or target-system designer, there is a need for a method and apparatus to provide a multi-part performability assessment as to whether the candidate system meets the requirements.