1. Field of the Invention
The present invention relates to computing systems. More specifically, the present invention relates to systems for increasing the fault tolerance of computing systems.
While the present invention is described herein with reference to illustrative embodiments for particular applications, it should be understood that the invention is not limited thereto. Those having ordinary skill in the art and access to the teachings provided herein will recognize additional modifications, applications, and embodiments within the scope thereof and additional fields in which the present invention would be of significant utility.
2. Description of the Related Art
In large distributed computing systems, a plurality of host computers are typically connected to a number of direct access (permanent) storage devices (DASDs), such as a tape or disk drive unit, by a storage controller. Among other functions, the storage controller handles connection and disconnection between a particular computer and a DASD for transfer of a data record. In addition, the storage controller stores data in electronic memory for faster input and output operations.
The IBM Model 3990 storage controller is an example of a storage controller which controls connections between magnetic disk units and host computers. The host computers are typically mainframe systems such as the IBM 3090, the Model ES9000, or other comparable systems.
The IBM 3990 Model 3 type controller can handle up to sixteen channels from host computers and up to sixty-four magnetic storage units. Each host computer is connected to the storage controller by at least one and up to four channels. The storage controller typically has two storage clusters, each of which provides for selective connection between a host computer and a direct access storage device, with each cluster on a separate power boundary. The first cluster might include a multipath storage director with first and second storage paths, a shared control array (SCA) and a cache memory. The second cluster typically includes a second multipath storage director with first and second storage paths, a shared control array and a non-volatile store (NVS).
Thus, each storage path in the storage controller has access to three addressable memory devices used for supporting storage controller operation: the cache; the non-volatile store; and the shared control array. The three memory devices and asynchronous work elements (AWEs) comprise the shared structures of the 3990 control unit.
Cache is best known for its application as an adjunct to computer memory where it is used as a high speed storage for frequently accessed instructions and data. The length of time since last use of a record is used as an indicator of frequency of use. Cache is distinguished from system memory in that its contents are aged from the point of time of last use. In a computer memory address space, program data has to be released before data competing for space in the address space gains access. In cache, competition for space results in data falling out of the cache when they become the least recently used data. While infrequently accessed data periodically enter cache, they will tend to "age" and fall out of cache. Modified data in cache is duplicated in nonvolatile memory. Storage controller cache performs an analogous function for direct access storage devices and storage controllers. Reading data from (and writing data to) the magnetic media of the direct access storage devices is fairly time consuming. Among the factors slowing the read and write operations are time required for the magnetic disk to bring a record location into alignment with a transducer and the limited bandwidth of the magnetic transducer used to read and write the data. By duplicating frequently accessed data in cache, read time for data is reduced and data storage system throughput is considerably enhanced.
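The aging behavior described above can be illustrated with a minimal sketch in Python. It is not the 3990's cache management logic; the class name `LRUCache`, the `fetch_from_dasd` callback, and the fixed record capacity are assumptions introduced only to show how a record's age is reset on each use and how the least recently used record falls out when space is contended.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache keyed by record identifier (hypothetical)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Ordered from least recently used (front) to most recently used (back).
        self._records = OrderedDict()

    def read(self, key, fetch_from_dasd):
        """Return (data, hit). On a miss, fetch the record and cache it."""
        if key in self._records:
            # A hit resets the record's age: it becomes the most recently used.
            self._records.move_to_end(key)
            return self._records[key], True
        # Miss: take the slow path to the storage device (assumed callback).
        data = fetch_from_dasd(key)
        self._records[key] = data
        # Competition for space: the least recently used records fall out.
        while len(self._records) > self.capacity:
            self._records.popitem(last=False)
        return data, False

    def keys_oldest_first(self):
        """List cached record identifiers, least recently used first."""
        return list(self._records)
```

With a capacity of two, reading records A, B, A, and then C leaves A and C in the cache: B was the least recently used record when C arrived, so B is the one that falls out.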
Nonvolatile storage (NVS) serves as a backup to the cache for the buffering function. Access to NVS is faster than access to a direct access storage device, but generally slower than cache. Data are branched to cache and to NVS to back up the cache in case of power failure. Data written to NVS have been treated as being as safe as if written to magnetic media. Upon staging of a data record to NVS, indication is given to the host computer that the data are successfully stored. The NVS is required for Fast Write operations and to establish Dual Copy pairs. If cache is made unavailable, all Fast Write data will be destaged during the make-unavailable process and no new Fast Write data will be written to the NVS until cache is made available. When cache is unavailable, the NVS is still required to maintain the bit maps defining the cylinders that are out-of-sync between the primary and secondary devices for Dual Copy.
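The Fast Write behavior described above can be sketched as follows. This is an illustrative model only: the class `FastWriteController`, its method names, and the use of dictionaries for cache, NVS, and DASD are hypothetical. In particular, writing directly to DASD while cache is unavailable is an assumption made for the sketch, and the Dual Copy bit maps that remain in NVS are not modeled.

```python
class FastWriteController:
    """Toy model of Fast Write with cache, NVS, and DASD (all hypothetical)."""

    def __init__(self):
        self.cache = {}   # volatile, fastest
        self.nvs = {}     # nonvolatile backup of modified cache data
        self.dasd = {}    # stands in for the magnetic media
        self.cache_available = True

    def write(self, key, data):
        """Store a record and report which path acknowledged the write."""
        if self.cache_available:
            # Data are branched to cache and to NVS; once the record is in
            # NVS, the host can be told the data are successfully stored.
            self.cache[key] = data
            self.nvs[key] = data
            return "fast-write"
        # Assumption: with cache unavailable, the write goes straight to
        # DASD and no new Fast Write data are placed in NVS.
        self.dasd[key] = data
        return "write-through"

    def make_cache_unavailable(self):
        """Destage all Fast Write data, then stop using cache."""
        for key, data in list(self.nvs.items()):
            self.dasd[key] = data
        self.nvs.clear()
        self.cache.clear()
        self.cache_available = False

    def make_cache_available(self):
        """Resume Fast Write operation."""
        self.cache_available = True
```

The sketch makes the cost of a cache failure visible: once `make_cache_unavailable` has run, every write takes the slow path to DASD until cache is restored.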
A shared control array (SCA) is a memory array which is shared by all storage paths. There are typically two types of data in the SCA. The first is data to support the DASD and the second is data to support the caching and extended functions (i.e., Fast Write and Dual Copy).
Another resource available to the mainframe computer may be an asynchronous work element (AWE). An AWE is a task, which may be performed by any processor, by which data is taken from the cache and written, or "destaged," to DASD. These structures control the internal work elements which control the asynchronous functions required by the caching control unit (e.g., Pack Change, destaging of modified data, and cache space management).
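One way to picture asynchronous destaging is as a queue of pending work elements that any idle processor may service. The sketch below is a simplified illustration under that assumption; the class `DestageQueue` and its methods are hypothetical and do not describe the actual AWE structures of the 3990.

```python
from collections import deque


class DestageQueue:
    """Toy queue of pending destage work elements (hypothetical)."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, key, data):
        """Record that a modified record must eventually be written to DASD."""
        self._pending.append((key, data))

    def run_one(self, dasd):
        """Let whichever processor is idle perform the oldest work element.

        Returns the identifier of the destaged record, or None if no work
        is pending.
        """
        if not self._pending:
            return None
        key, data = self._pending.popleft()
        dasd[key] = data  # the destage itself: cache data written to DASD
        return key
```

Because the queue is decoupled from the host write, the host does not wait for the slow write to magnetic media; the work is completed later, in the order it was queued.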
The conventional storage control unit is typically designed so that no single point of failure in the unit will cause a failure of the entire system. The failure of certain components, however, can cause a degradation in performance of the control unit. A failure in cache, for example, typically results in such a performance degradation. Unfortunately, host systems have become so tuned to, and therefore so reliant on, the speed afforded by a fully functional cache that the performance degradation associated with a failure in cache has the same effect as a single point failure.
Thus, there is a need in the art for a system and technique for mitigating the performance degradation in a storage control unit that results from a failure in the cache memory associated therewith.