1. Field of the Invention
The present invention is directed toward a method and apparatus for improving the performance of a disk array system in a computer system, and more particularly to a posted write memory used in conjunction with a disk array system to increase the efficiency of the disk array system.
2. Description of the Related Art
Personal computer systems have developed over the years and new uses are being discovered daily. The uses are varied and, as a result, have different requirements for various subsystems forming a complete computer system. With the increased performance of computer systems, mass storage subsystems, such as fixed disk drives, play an increasingly important role in the transfer of data to and from the computer system. In the past few years, a new trend in storage subsystems, referred to as a disk array subsystem, has emerged for improving data transfer performance, capacity, and reliability.
A number of reference articles on the design of disk arrays have been published in recent years. These include "Some Design Issues of Disk Arrays" by Spencer Ng, April, 1989 IEEE; "Disk Array Systems" by Wes E. Meador, April, 1989 IEEE; and "A Case for Redundant Arrays of Inexpensive Disks (RAID)" by D. Patterson, G. Gibson and R. Katz, Report No. UCB/CSD 87/391, December, 1987, Computer Science Division, University of California, Berkeley, Calif.
One reason for building a disk array subsystem is to create a logical device that has a very high data transfer rate. This may be accomplished by "ganging" multiple standard disk drives together and transferring data to or from these drives in parallel. Accordingly, data is stored "across" each of the disks comprising the disk array so that each disk holds a portion of the data comprising a data file. If n drives are ganged together, then the effective data transfer rate may be increased up to n times. This technique, known as striping, originated in the supercomputing environment where the transfer of large amounts of data to and from secondary storage is a frequent requirement.
In striping, a sequential data block is broken into segments of a unit length, such as sector size, and sequential segments are written to sequential disk drives, not to sequential locations on a single disk drive. The combination of corresponding sequential data segments across each of the n disks in an array is referred to as a stripe. The unit length or amount of data that is stored "across" each individual disk is referred to as the stripe size. The stripe size affects data transfer characteristics and access times and is generally chosen to optimize data transfers to and from the disk array. If the data block is longer than n unit lengths, the process repeats for the next stripe location on the respective disk drives. With this approach, the n physical drives become a single logical device.
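The mapping from a logical sector to a physical location under striping can be illustrated with a short sketch. This is only an illustration of the general technique described above, not a mapping taken from any particular controller; the function name `locate` and its parameters are hypothetical.

```python
def locate(sector, n_disks, sectors_per_unit=1):
    """Map a logical sector number to (disk index, sector offset on that disk).

    Hypothetical helper: sectors_per_unit plays the role of the stripe size,
    expressed in sectors.
    """
    unit = sector // sectors_per_unit   # which unit-length segment holds the sector
    disk = unit % n_disks               # sequential segments go to sequential drives
    row = unit // n_disks               # stripe number; advances once all n drives are used
    offset = row * sectors_per_unit + sector % sectors_per_unit
    return disk, offset


# Three drives, one sector per segment: sectors 0..5 visit drives 0, 1, 2, 0, 1, 2.
layout = [locate(s, n_disks=3) for s in range(6)]
```

With a unit length of one sector, a block longer than n sectors wraps to the next stripe location on the first drive, which is the repetition described above.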
One technique that is used to provide for data protection and recovery in disk array subsystems is referred to as a parity scheme. In a parity scheme, the data blocks being written to the various drives within the array are combined using a known EXCLUSIVE-OR (XOR) technique to create parity information, which is written to a reserved or parity drive within the array. The advantage of a parity scheme is that it may be used to minimize the amount of data storage dedicated to data redundancy and recovery purposes within the array. For example, FIG. 1 illustrates a traditional 3+1 mapping scheme wherein three disks, disk 0, disk 1 and disk 2, are used for data storage, and one disk, disk 3, is used to store parity information. In FIG. 1, each rectangle enclosing a number or the letter "p" coupled with a number corresponds to a sector, which is preferably 512 bytes. As shown in FIG. 1, each complete stripe uses four sectors from each of disks 0, 1 and 2 for a total of 12 sectors of data storage per stripe. Assuming a standard sector size of 512 bytes, the stripe size of each of these disk stripes, which is defined as the amount of storage allocated to a stripe on one of the disks comprising the stripe, is 2 kbytes (512 × 4). Thus each complete stripe, which includes the total of the portion of each of the disks allocated to a stripe, can store 6 kbytes of data. Disk 3 of each of the stripes is used to store parity information.
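The arithmetic of the 3+1 example and the XOR parity technique can be sketched as follows. The sector and stripe dimensions come from the example above; the byte values placed on each data disk and the helper name `xor_blocks` are hypothetical and chosen only for illustration.

```python
from functools import reduce

SECTOR_BYTES = 512
SECTORS_PER_DISK_STRIPE = 4
DATA_DISKS = 3

STRIPE_SIZE = SECTOR_BYTES * SECTORS_PER_DISK_STRIPE   # bytes held by one disk per stripe
STRIPE_DATA = STRIPE_SIZE * DATA_DISKS                  # data bytes in one complete stripe


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of the stripe portions on disks 0, 1 and 2.
data = [bytes([value]) * STRIPE_SIZE for value in (0x0F, 0x33, 0x55)]

# Parity for the stripe, which would be written to disk 3.
parity = xor_blocks(data)

# If disk 1 is lost, XORing the surviving data with the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, a single parity disk is sufficient to recover the contents of any one failed data disk in the stripe.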
However, in addition to the advantages of parity techniques in data protection and recovery, there are a number of disadvantages to the use of parity fault tolerance techniques in disk array systems. One disadvantage is that traditional operating systems perform many small writes to the disk subsystem which are often smaller than a complete stripe of the disk array. Such writes are referred to as partial stripe write operations. As discussed below, a large number of partial stripe write operations in conjunction with a parity redundancy scheme considerably reduces disk array performance.
Two very popular operating systems used in personal computer systems are MS-DOS (Microsoft disk operating system for use with IBM compatible personal computers) and UNIX. MS-DOS, or more simply DOS, is a single threaded software application, meaning that it can only perform one operation at a time. Therefore, when a host such as the system processor or a bus master performs a disk array write operation in a DOS system, the host is required to wait for a completion signal from the disk array before it is allowed to perform another operation, such as sending further write data to the disk array. In addition, in DOS, the quantity of data that can be transmitted to a disk array is relatively small, comprising a partial stripe write. Therefore, limitations imposed by DOS considerably reduce disk array performance because a large number of partial stripe writes are performed wherein the host must continually wait for the completion of each partial stripe write before instituting a new write operation.
UNIX also includes certain features which reduce disk array performance. More particularly, UNIX uses small data structures which represent the structure of files and directories in the free space within its file system. In the UNIX file system, this information is kept in a structure called an INODE (Index Node), which is generally two kilobytes in size. These structures are updated often, and since they are relatively small compared with typical data stripe sizes used in disk arrays, the updates result in a large number of partial stripe write operations.
Where a complete stripe of data is being written to the array, the parity information may be generated directly from the data being written to the drive array, and therefore no extra read of the disk stripe is required. However, as mentioned above, a problem occurs when the computer writes only a partial stripe to the disk array because the disk array controller does not have sufficient information from the data to be written to compute parity for the complete stripe. Thus, partial stripe write operations generally require that the data stored on the disk array first be read, modified by the process active on the host system to generate new parity, and written back to the same address on the data disk. This operation consists of a data disk READ, modification of the data, and a data disk WRITE to the same address. In addition to the time required to perform the actual operations, it will be appreciated that a READ operation followed by a WRITE operation to the same sector on a disk results in the loss of one disk revolution, or approximately 16.5 milliseconds for certain types of hard disk drives.
Therefore, in summary, when a large number of partial stripe write operations occur in a disk array, the performance of the disk subsystem is seriously impacted because the data or parity information currently on the disk must be read off of the disk in order to generate the new parity information. Either the remainder of the stripe that is not being written must be fetched or the existing parity information for the stripe must be read prior to the actual write of the information. This results in extra revolutions of the disk drive and causes delays in servicing the request. Accordingly, there exists a need for an improved method for performing disk array WRITE operations in a parity fault tolerant disk array in order to decrease the number of partial stripe write operations.
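The two alternatives mentioned above, reading the existing parity or fetching the remainder of the stripe, can be compared in a short sketch. The function names and the four-byte blocks are hypothetical; the sketch only shows that both routes yield the same new parity and that each requires reads before the write.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_from_old_parity(old_data, new_data, old_parity):
    """Route 1: read the old data and old parity, then fold in the change."""
    return xor(xor(old_parity, old_data), new_data)


def parity_from_rest_of_stripe(new_data, untouched_blocks):
    """Route 2: read the blocks not being written and recompute the parity."""
    result = new_data
    for block in untouched_blocks:
        result = xor(result, block)
    return result


# Hypothetical stripe on a 3+1 array, with only the block on disk 1 rewritten.
d0, d1, d2 = b"\x0f" * 4, b"\x33" * 4, b"\x55" * 4
old_parity = xor(xor(d0, d1), d2)
new_d1 = b"\xa5" * 4

p_route1 = parity_from_old_parity(d1, new_d1, old_parity)   # 2 reads: old data, old parity
p_route2 = parity_from_rest_of_stripe(new_d1, [d0, d2])     # 2 reads: the untouched blocks
```

In a full stripe write neither route is needed, since every data block of the stripe is already in hand when the parity is computed.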
One technique for improving disk array system performance in general is the use of disk caching programs. In disk caching programs, an amount of main memory is utilized as a cache for disk data. Since the cache memory is significantly faster than the disk drive, greatly improved performance results when the desired data is present in the cache. While disk caching can be readily applied to read operations, it is significantly more difficult to utilize with write operations. A technique known as write posting saves the write data in the cache and returns an operation complete indicator before the data is actually written to the disks. Then, during a less active time, the data is actually written to the disk.
Background on write posting operations in computer systems is deemed appropriate. An example of write posting occurs when a microprocessor performs a write operation to a device where the write cycle must pass through an intermediary device, such as a cache system or posting memory. The processor executes the write cycle to the intermediary device with the expectation that the intermediary device will complete the write operation to the device being accessed. If the intermediary device includes write posting capability, the intermediary device latches the address and data of the write cycle and immediately returns a ready signal to the processor, indicating that the operation has completed. If the device being accessed is currently performing other operations, then the intermediary device, i.e., the posting memory, need not interrupt the device being accessed to complete the write operation, but rather can complete the operation at a later, more convenient time. In addition, if the device being accessed has a relatively slow access time, such as a disk drive, the processor need not wait for the access to actually complete before proceeding with further operations. In this manner, the processor is not delayed by the slow access times of the device being accessed nor is it required to interrupt other operations of the device being accessed. The data that has been written to the intermediary device or posting memory that has not yet been written to the device being accessed is referred to as dirty data. Data stored in the posting memory that has already been written to the device being accessed is referred to as clean data.
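The behavior of a posting memory, including the distinction between dirty and clean data, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the apparatus of the present invention: the class `PostingMemory`, its method names, and the use of a dictionary to stand in for the slow device are all hypothetical.

```python
class PostingMemory:
    """Hypothetical intermediary that posts writes for a slower device."""

    def __init__(self, device):
        self.device = device   # dict standing in for the device being accessed
        self.entries = {}      # address -> (data, dirty flag)

    def write(self, address, data):
        """Latch the address and data, then report completion immediately."""
        self.entries[address] = (data, True)   # dirty: not yet on the device
        return "ready"

    def flush(self):
        """Complete the posted writes at a later, more convenient time."""
        for address, (data, dirty) in list(self.entries.items()):
            if dirty:
                self.device[address] = data
                self.entries[address] = (data, False)   # now clean

    def dirty_addresses(self):
        """Addresses whose data has not yet reached the device."""
        return sorted(a for a, (_, dirty) in self.entries.items() if dirty)


disk = {}
posting = PostingMemory(disk)
status = posting.write(7, b"abc")      # returns before the device is touched
pending = posting.dirty_addresses()    # address 7 is dirty at this point
posting.flush()                        # the device now holds the data
```

The writer sees the ready indication from `write` regardless of how long the later `flush` takes, which is the source of the perceived speedup.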
Therefore, when a posting memory is used in conjunction with a disk array subsystem and a write request is received from a host, i.e., a processor or bus master, the data is written immediately to the posting memory and the host is notified that the operation has completed. Thus, the basic principle of a posted write operation is that the host receives an indication that the requested data has been recorded or received by the device being accessed without the data actually having been received by the device. The advantage is that the data from the host can be stored in the posting memory much more quickly than it can be recorded in the device being accessed, such as the disk array, thus resulting in a relatively quick response time for the write operation as perceived by the host. However, if the write operation is a partial stripe write, which is usually the case, then additional reads are still necessary to generate the required parity information when the data is transferred from the posting memory to the drive array. Therefore, a method and apparatus are desired to efficiently implement a posting memory in conjunction with a drive array system to reduce the percentage of partial stripe write operations as well as reduce the number of overall operations to the drive array and increase disk array performance.
Background on other data integrity methods is deemed appropriate. Other methods that are used to provide data protection and recovery are mirroring techniques for disk data and battery backup techniques for semiconductor memory. Mirroring techniques require that one disk drive be set aside for the storage of data as would normally be done, and a second equivalent disk drive is used to "mirror" or identically back up the data stored. This method ensures that if the primary disk drive fails, the secondary or mirrored drive remains and can be used to recover the lost data. Battery backup techniques provide that if a power loss occurs to a memory system, the battery is enabled to maintain power for a period of time until an operator can ensure an orderly shutdown of the system.