1. Field of Invention
This invention relates to data storage systems and in particular to an improved arrangement for recovering data stored in a memory unit which has failed.
2. Description of Prior Art
The typical data processing system generally involves one or more memory units which are connected to the Central Processor Unit (CPU) either directly or through a control unit and a channel. The function of these memory units is to store data and programs which the CPU uses in performing a given data processing task.
Various types of memory units are used in current data processing systems. The response times and capacities of memories vary significantly, and in order to maximize system throughput the choice of a particular type of memory unit generally involves matching its response time to that of the CPU and its capacity to the data storage needs of the data processing system. To minimize the impact on system throughput which may be caused by slow access storage devices, many data processing systems employ a number of different types of memory units. Since access time and capacity also affect the cost of storage, a typical system may include a fast access, small capacity, directly accessible monolithic memory for data that is used frequently and a string of tape units and/or a string of disk files which are connected to the system through respective control units for data which is used less frequently. The storage capacities of these latter units are generally several orders of magnitude greater than those of the monolithic memories, and hence their storage cost per byte of data is lower.
However, a problem exists if one of the large capacity memory units fails such that the information contained in that unit is no longer available to the system. Generally, such a failure will shut down the entire system.
The prior art has suggested several ways of solving the problem. The most straightforward way suggested involves providing a duplicate set of storage devices or memory units and keeping a duplicate file of all data. While such a solution solves the problem, it doubles the cost of storage, has some impact on system performance since any change to stored data requires writing two records, and adds the requirement of keeping track of where the duplicate records are kept in the event the primary records are not available.
In some systems, when the records are relatively small, it is possible to use error correcting codes which generate ECC syndrome bits that are appended to the record. With ECC syndrome bits it is possible to correct a small amount of data that may be read erroneously, but such codes are generally not suitable for correcting or recreating long records which are in error or unavailable.
Another solution suggested by the prior art involves the use of "check sums." In this solution, the contents of one memory unit subject to a failure would be "Exclusive ORed" with the contents of a second memory unit subject to a failure and the resulting "check sum" stored in a third memory unit. Such an arrangement has the advantage over the dual copy solution in that only one additional memory unit is required. However, each time that data is changed in either of the two units a new check sum has to be generated and rewritten on the third unit. Such an arrangement can be extended to more than two units since the "exclusive OR" operation to generate the check sum is merely repeated using the data in each of the added memory units.
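By way of illustration only, the check sum arrangement described above can be sketched in Python as follows. The function name `check_sum` and the sample byte values are assumptions made for this sketch and do not appear in the prior art discussed; each memory unit is modeled simply as a byte string of equal length. The sketch also relies on the general property of the exclusive OR operation that the contents of any one unit can be regenerated from the check sum and the contents of the remaining units.

```python
from functools import reduce


def check_sum(units):
    """Exclusive OR the contents of equal-length memory units, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


# Hypothetical contents of two memory units subject to failure.
unit_a = bytes([0x12, 0x34, 0x56])
unit_b = bytes([0xAB, 0xCD, 0xEF])

# The check sum that would be stored in the third memory unit.
parity = check_sum([unit_a, unit_b])

# If unit_a becomes unavailable, its contents are regenerated from
# the surviving unit and the check sum.
recovered_a = check_sum([unit_b, parity])

# Extending to more units only repeats the exclusive OR operation.
unit_c = bytes([0x01, 0x02, 0x03])
parity_3 = check_sum([unit_a, unit_b, unit_c])
recovered_c = check_sum([unit_a, unit_b, parity_3])
```

In this sketch the same function serves both to generate the check sum and to regenerate a missing unit, since both are the exclusive OR of all the other byte strings.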
The above arrangement has the disadvantage that each time a record is updated in one unit, the corresponding check sum stored in the check sum unit must be read, "Exclusive ORed" with the old data and then with the new data, and then both the updated record and the new check sum must be rewritten. If the memory unit has the ability to directly address only that part of the record to be changed and/or the failing memory unit has a very small capacity, the disadvantage in terms of impact on system throughput is relatively small. However, if the amount of data transferred to or from the CPU in response to one Input/Output instruction of the system, or the length of a record stored at one address, is large, then the disadvantage becomes significant in terms of impact on system throughput since the time required to generate the check sum becomes excessive.
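The update sequence described above may be sketched, again purely as an illustration, in Python. The names `xor_bytes` and `update_check_sum` and the sample values are hypothetical and are introduced only for this sketch; records are modeled as byte strings of equal length.

```python
def xor_bytes(a, b):
    """Exclusive OR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_check_sum(old_check_sum, old_data, new_data):
    """Remove the old record from the check sum, then fold in the new record."""
    return xor_bytes(xor_bytes(old_check_sum, old_data), new_data)


# Hypothetical records held in two memory units, and their check sum.
old_a = bytes([0x12, 0x34, 0x56])
unit_b = bytes([0xAB, 0xCD, 0xEF])
old_check_sum = xor_bytes(old_a, unit_b)

# Updating the record in the first unit: the old record and the old check sum
# are both read, and the new record and the new check sum are both written.
new_a = bytes([0x99, 0x88, 0x77])
new_check_sum = update_check_sum(old_check_sum, old_a, new_a)
```

Every byte of the old record, the new record, and the old check sum passes through the exclusive OR operation, so the work grows in direct proportion to the length of the record, which is consistent with the throughput concern stated above for long records.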
In those situations where transferring data to or from the memory unit takes substantial time, either because a large amount of data is involved in each transfer or because the memory unit cannot address smaller amounts of data, the prior art solutions discussed above are not practical commercial solutions.