1. Field of the Invention
The invention relates to the field of non-volatile storage device operation, and in particular, to the temporary storage of the contents of a direct access storage device (DASD) write command prior to the contents being written to their intended destination location on a disk surface, commonly referred to in the art as write caching.
2. Background Information
In data processing systems, processors and electronic memory devices generally operate at relatively high speeds. Volatile electronic memory devices can be written to and read from rapidly. However, when system power is removed, intentionally or accidentally, the contents of volatile memory, by definition, are not retained. Many data processing applications require long-term data storage and/or a high degree of data integrity, and these needs are met by non-volatile data storage devices. Non-volatile storage can be provided by a variety of devices, most commonly by direct access storage devices (DASDs), often also referred to as hard disk drives.
However, many non-volatile memory devices, such as DASDs and tape drives, for example, are inherently slower than data processors and volatile memory devices by virtue of their being mechanical devices having moving parts. A DASD generally has one or more magnetically readable and writable disks, which rotate, each disk having one or more electromagnetic transducers (heads) per readable/writable surface which can be positioned over a desired location to read and/or write data on a respective disk surface. A transducer (head) receives an electrical signal and produces a magnetic field to write data to a disk surface. Conversely, to read written data from a disk surface, a head is passed over magnetized areas on the disk surface and a magnetic field is thereby induced in the head producing an electrical signal output.
Hard disks are generally metal or glass platters coated with a magnetic oxide that can be magnetized to represent data. Floppy disks, by contrast, are generally made of a flexible plastic (e.g., Mylar) material, disposed in a protective hardened plastic jacket. A hard disk drive (DASD) generally has several disks, also called platters, assembled together in a disk pack. All of the platters rotate together on a common spindle. A series of access arms carrying read/write heads, one for each disk surface top and bottom, provide access to concentric tracks on the platters. The access arms generally move together and are sometimes referred to as a comb structure since they resemble the teeth of a comb.
A disk surface is typically formatted into a number of concentric tracks. The tracks may be subdivided into sectors, which are subdivided into blocks for storing user data and the like. The formatting enables a particular location on the disk surface to be reliably accessed by providing markers for accurately positioning a read and/or write head. A set of corresponding tracks on a set of disk platters is referred to as a cylinder. A set of contiguous tracks/cylinders on a platter/platters is referred to as a partition (see FIGS. 1B and 1C).
Of course, due to the physical properties of the DASD, data cannot be instantaneously written to a particular location on the rotating disk surface. A head cannot instantaneously move to the correct disk surface location to perform a write operation when desired, for example. Individual bits of data are written to respective individual locations in a serial fashion as the disk surface rotates under the head. Thus, when a data processing system writes data to a DASD, the writing process is naturally slower than when writing to an electronic device, such as volatile memory.
In a write-to-DASD operation of a data processing system, for fault tolerance and recovery reasons, the data processing system should wait for confirmation that the data has been written to the DASD before moving on to other tasks. However, the inherently slower operation of the DASD could cause the data processor to be idle for significant periods of time waiting for the DASD to complete the write operation unless measures are taken to avoid this.
Write caching is one method used to avoid slowing down a data processing system when a write to non-volatile storage, for example, to a DASD, is performed. This is generally accomplished by temporarily placing the contents of the write command (data) into a volatile cache memory associated with the DASD before writing them to the DASD surface. Once they are written to the cache memory, an inherently faster process than writing to the DASD, a write confirmation message is sent back to the processor indicating that the contents were written so that the processor can continue with other activities. However, in fact, they were not really written to the surface of the DASD. The data is really only cached in volatile memory and will actually be written to the destination DASD surface some time later.
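The acknowledge-before-write behavior of a conventional volatile write cache can be illustrated with a minimal sketch. The class name `VolatileWriteCache` and its methods are hypothetical and are not taken from any drive firmware; the sketch only models the window during which confirmed data exists solely in volatile memory.

```python
class VolatileWriteCache:
    """Toy model of a conventional write-back cache (hypothetical names)."""

    def __init__(self):
        self.cache = {}  # volatile memory: block address -> data
        self.disk = {}   # non-volatile disk surface: block address -> data

    def write(self, address, data):
        # The data is only placed in volatile memory here, yet the
        # initiator is told that the write has completed.
        self.cache[address] = data
        return "complete"

    def flush(self):
        # Some time later the cached data reaches its destination on disk.
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Volatile contents are not retained when power is removed.
        self.cache.clear()


drive = VolatileWriteCache()
status = drive.write(7, b"payload")  # confirmed to the initiator
drive.power_loss()                   # failure before flush()
lost = 7 not in drive.disk           # the confirmed block never reached disk
```

In this model, any block acknowledged by `write` but not yet covered by `flush` is exposed to loss.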
While such write caching has the effect of speeding up overall system operation under ordinary circumstances, there is the danger that should a power interruption, drive fault, or other error occur between the time the write confirmation message is sent to the processor and the time the data is actually written to the DASD surface, data will be lost. Measures are available to compensate for certain causes of such failures, such as uninterruptible power supplies (UPSs) to prevent a power source interruption from shutting down the data processing system before data writing can be completed.
For example, U.S. Pat. No. 5,748,874 to Hicksted et al. (May 5, 1998) titled "RESERVED CYLINDER FOR SCSI DEVICE WRITE BACK CACHE" describes a disk drive in a computer system equipped with a power storage unit that supplies power to the drive controller when there is a system power interruption, such as a power-down or a power failure. Once the controller is notified that system power has been interrupted, it immediately initiates a seek to a reserved location in the disk drive and stores the contents of the cache memory at the reserved location. After power has been restored to the system, the controller loads the contents of the reserved cylinders back to the cache memory and completes the pending write operations by writing all of the data items in the cache to their respective final locations in the drive.
Hardware manufacturers have generally addressed the problem of potential write cache data loss in one of three ways: ignore the problem, because of the large performance gains seen with write caching, and accept a higher risk of data loss; use additional hardware to provide a non-volatile write cache (including uninterruptible power supply arrangements); or simply disable the write caching feature altogether where a data loss would be catastrophic.
The additional hardware solution may take the form of an uninterruptible power supply or the like, such as in the Hicksted et al. patent referenced above, or some other type of non-volatile memory, such as flash memory, for example. However, even these types of additional hardware measures do not solve the problem of a hard drive deadlock, where the drive must be reinitialized. Deadlock occurs when the hard drive is unable to process commands and/or communicate with the initiator. This is a state that is never supposed to occur, but it can happen. This may be the fault of the drive or the initiator. Usually, the only solution to this deadlock is to restart the drive and possibly the entire system by an internal reset or by power-cycling the equipment. In either case, if data meant to be written to the disk is contained in the drive cache when this occurs, data will be lost. This would not be a problem if the initiator were aware that the data has not been written, but with a standard write cache, it is not. With the invention, the initiator is not notified that the data has been written until it has actually been written somewhere on the disk, i.e., to write-twice cache blocks (WTCBs). Although the data may not have been written to its final destination, it is present on the physical media when the drive is restarted, so that a recovery procedure can locate the data and write it to the correct location. By contrast, in the case of a standard write cache, the data was in volatile memory and was, therefore, lost when the drive was shut down due to the deadlock.
Therefore, a need exists for a way to minimize idle processor time during a write to non-volatile storage, as is accomplished by write caching to volatile memory, but minimize the risk of data loss and minimize the need for additional hardware at the same time.
As mentioned earlier, a disk surface area is generally divided into portions during a process called formatting in which boundary data is written to the disk surface so that a precise position of a disk head can be confirmed for reading and writing. Typically, each disk surface, top and bottom, is divided and sub-divided into concentric tracks, sectors within the tracks, and blocks of data within the sectors, for example. A collection of corresponding disk tracks on the disk platters in a drive is referred to as a cylinder. A DASD generally has a plurality of cylinders, disks, and disk surfaces, and therefore, has a plurality of disk heads. On such multi-surface, multi-disk, multi-cylinder DASDs, the heads may be arranged on arms of a mechanism referred to as a comb structure.
The time required for the arms of the head mechanism to arrive over the correct cylinder after being commanded to that cylinder location is called "seek time" or simply "seek," and the time required for the head to reach the correct location on a track, by virtue of the disk rotation, is called "rotational latency time" or simply "latency." The time to read the block or blocks of data as they pass under the head is called the "transmission time" or "block transfer time." These times are illustrated in FIG. 1A. These times are generally on the order of milliseconds, but clearly seek and latency are variable because they are dependent on where the block to be read or written is located relative to where the head is located when the command is given.
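As a worked illustration of these three components, the following sketch computes the revolution time and the average rotational latency from the spindle speed and sums the components into a total access time. The function names and the 7200 RPM figure are illustrative assumptions; the half-revolution average assumes the target location is uniformly distributed around the track.

```python
def revolution_time_ms(rpm):
    """Time for one full disk revolution, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm):
    """Average latency is half a revolution for a uniformly random target."""
    return revolution_time_ms(rpm) / 2.0


def access_time_ms(seek_ms, latency_ms, transfer_ms):
    """Total access time is the sum of seek, latency and block transfer."""
    return seek_ms + latency_ms + transfer_ms


# Illustrative values only: a 7200 RPM spindle turns once in about 8.33 ms,
# so the average rotational latency is about 4.17 ms.
rev = revolution_time_ms(7200)
avg_latency = average_rotational_latency_ms(7200)
total = access_time_ms(seek_ms=8.0, latency_ms=4.0, transfer_ms=0.5)
```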
Latency and seek may be collectively referred to as "idle time" since reading and writing to the desired surface location is not done during this time and the DASD is, in this respect, idle. Generally, only one surface is read from or written to at a time, and therefore, only one head is active at a given time. To read from a different surface, a switch from one head to another is performed. Although this is a relatively rapid process, there is still some disk rotation that occurs during this switch. Therefore, if one were to write a block to a first disk surface and then switch to a second surface to write a next block of data, there would be a rotational offset between the two blocks equal to the distance the disks travel (rotationally) during the time required to switch between heads.
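The rotational offset caused by a head switch can be expressed in sectors. The sketch below is a simple calculation under assumed parameters (the head switch time, spindle speed, and sectors per track are illustrative and not taken from the text); it rounds up because a partially passed sector cannot be used.

```python
import math


def head_switch_offset_sectors(head_switch_ms, rpm, sectors_per_track):
    """Sectors that pass under the heads while a head switch completes.

    One revolution takes 60000 / rpm milliseconds, so the fraction of a
    revolution covered during the switch is head_switch_ms * rpm / 60000.
    """
    return math.ceil(head_switch_ms * rpm * sectors_per_track / 60_000)


# Illustrative values: a 1 ms head switch at 6000 RPM (10 ms per revolution)
# spans one tenth of a 500-sector track.
offset = head_switch_offset_sectors(1, 6000, 500)
```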
It is known to use hard disk drives as a cache storage for even slower devices, such as CD-ROM (compact disk read only memory) devices. For example, U.S. Pat. No. 5,884,093 to Berenguel et al. (Mar. 16, 1999) entitled "HARD DISK CACHE FOR CD-ROM AND OTHER SLOW ACCESS TIME DEVICES" describes a controller system for a CD-ROM drive or other slow access device such as a magneto-optical device using a conventional hard disk drive as a cache memory. The hard disk cache is partitioned to use a portion thereof to clone the most often used data blocks on the CD-ROM disk such as the directory/file allocation table, while the balance of the hard disk is used to cache some or all of the balance of the CD-ROM disk using conventional cache memory least recently used rules. Three bus controllers for the host computer, CD-ROM drive and the hard disk cache are controlled by a microprocessor which runs a control program that implements the cloning and cache rules. The three bus controllers are connected by a DMA bus for faster transfer of data. The microprocessor controls the directions of the DMA transfers by data written to a control register.
Similarly, U.S. Pat. No. 5,806,085 to Berliner (Sep. 8, 1998) entitled "METHOD FOR NON-VOLATILE CACHING OF NETWORK AND CD-ROM FILE ACCESSES USING A CACHE DIRECTORY, POINTERS, FILE NAME CONVERSION, A LOCAL HARD DISK, AND SEPARATE SMALL DATABASE" describes a non-volatile caching system and a method for implementing the system, applied to rotating magnetic media, such as hard disk drives. The system retains data even in the event of system shut-down and re-boot, caching data from large, randomly accessed files, such as databases, in a space-efficient manner on the magnetic media. A conversion routine converts CD-ROM file names or network file names to local hard disk drive file names and back. A mini-database is created for each cached file on the hard disk drive. The mini-database maps randomly accessed blocks of data within the cached file on the local hard disk drive.
From the above, it is apparent that there is a need for a write caching system for DASDs and the like, which avoids the problem of potential data loss while at the same time minimizing any adverse impact on system time utilization efficiency, complexity and cost.
It is, therefore, a principal object of this invention to provide for fail-safe write caching.
It is another object of the invention to solve the above-mentioned problems so that write caching can be accomplished with reduced risk of data loss should an equipment or other failure occur, while at the same time minimizing any impact on cost, complexity and efficiency.
These and other objects of the present invention are accomplished by the method and apparatus disclosed herein.
According to one aspect of the invention, fail-safe write caching is provided for a direct access storage device (DASD) without the need for any additional hardware by utilizing specially arranged portions of the disks to write the cached data during DASD idle time before finally writing the data to its intended ultimate disk location. Advantageously, a DASD made according to this aspect of the invention has a competitive advantage over DASDs which do not provide a fail-safe write cache and incur the risk of losing cached data, those which provide fail-safe write caching through the use of additional hardware and thereby incur greater manufacturing cost, or those which do not implement any write caching at all and incur a lower performance level. These specially arranged portions of the disks have been given the name write-twice cache blocks (WTCBs) by the inventor.
According to another aspect of the invention, a 'virtual' write cache is created by virtue of the WTCBs. Once a block from the write cache memory has been written to a WTCB, that data can be removed from the write cache memory, and thereafter, the WTCB acts as the write cache for that data. Later, the data is read from the WTCB and written to the final destination. The net effect of the WTCB acting as a write cache is that more write commands can be cached than could otherwise be stored in the volatile write cache memory. Thus, the effect is the creation of a virtual cache.
According to another aspect of the invention, a fail-safe recovery method is provided in case a power interruption or a drive fault occurs, for example.
According to another aspect of the invention, a new type of data block is provided, as well as a strategy for placing the new data block and using it.
Advantageously, a system according to the invention solves the data integrity problem inherent in write caching. The data integrity problem occurs when a drive has indicated that a particular block was written to the media platter, but it has not really been written. In such a case, there is a period of time between when the drive has received the data and when the data is written, that the drive media is not really in the state that the initiator (the one that sent the data) believes it is. If, for some reason, the drive cannot write the data to the correct location, due to a power failure, drive fault, etc., then the drive will stay in this inconsistent state. This can lead to crashes/faults/etc. in the main system. The invention solves this problem by writing the data to a temporary location before indicating to the initiator that the data has been written. Then, if a failure does occur, during recovery, the data that was written to the temporary storage location on the disk media can be read and written to the correct location.
Advantageously, a system according to the invention does not require any additional hardware to be implemented.
Further, advantageously, a system utilizing write-twice cache blocks (WTCBs) according to the invention comes with only limited possible performance drawbacks. One possible performance drawback is a drop in sequential read or write performance. Since the WTCBs reduce the number of standard sectors on a track, the number of sectors read on each revolution is reduced. However, in an exemplary embodiment, the reduction is slight, on the order of 1% to 2% fewer sectors read per revolution.
Another possible performance drawback is an increase in bus overhead, because with an arrangement according to the invention in contrast with the prior systems, the disk drive write cache memory will not send back a command complete response immediately after receiving a write command from an initiator. Instead, a separate communication connection is made between the disk drive and the command initiator after the data has actually been written to a WTCB. Because establishing the connection takes some small but finite amount of time, overhead on the bus is increased. However, these small performance costs pale by comparison with the performance cost a loss of data could cause, which is minimized according to the invention.
According to another aspect of the invention, two properties of hard drives are taken advantage of: hard drives are a non-volatile storage medium; and the main performance bottleneck of hard drives is drive latency.
According to an aspect of the invention, the time between when the head mechanism has arrived over the correct cylinder (seek) and when the correct location on the track is reached (latency) is used to write portions of the write cache to the disk. In particular, the portions of the write cache are written to special blocks, referred to herein as Write-Twice Cache Blocks (WTCBs). Each WTCB holds: one block of data from the write cache, the data destination address, a time stamp which uniquely identifies the block of data as the latest entry, and a list of which WTCBs contain write cache information which has not yet been written to the respective ultimate destination.
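The four items held by each WTCB can be pictured as a small record. The following is a minimal sketch, assuming hypothetical field names and simple integer encodings for the destination address, time stamp, and WTCB identifiers; the text does not specify the on-disk layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WTCB:
    """Illustrative in-memory view of one write-twice cache block."""

    data: bytes                        # one block of data from the write cache
    destination: int                   # address of the final destination
    timestamp: int                     # identifies the latest entry
    outstanding: Tuple[int, ...] = ()  # WTCBs not yet written to destination


def latest(blocks):
    """Return the WTCB with the most recent time stamp."""
    return max(blocks, key=lambda block: block.timestamp)


first = WTCB(data=b"A", destination=100, timestamp=1, outstanding=(0,))
second = WTCB(data=b"B", destination=200, timestamp=2, outstanding=(0, 1))
newest = latest([first, second])  # second carries the current list
```

Because every WTCB carries the list of outstanding blocks, only the most recently written one needs to be trusted during recovery.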
According to another aspect of the invention, the WTCBs are located on each cylinder such that they are spaced evenly apart.
For example, a first WTCB is placed on a topmost track, i.e., a track of the topmost disk surface in the cylinder. Each successive WTCB is located on a next lower disk surface track, until a lowermost disk surface track is reached. The process then continues again from the first track until all WTCBs have been positioned.
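One way to read this placement rule is as a round robin over the surfaces of a cylinder combined with an even angular spacing. The sketch below is an interpretation under that assumption; the function name, the sector-based positions, and the example geometry are illustrative and not taken from the text.

```python
def place_wtcbs(num_wtcbs, num_surfaces, sectors_per_track):
    """Return (surface, start_sector) for each WTCB on one cylinder.

    Surfaces are numbered from 0 (topmost) downward. Successive WTCBs
    step to the next lower surface and wrap back to the topmost one,
    while their angular positions are spread evenly around the track.
    """
    spacing = sectors_per_track / num_wtcbs
    return [(i % num_surfaces, int(i * spacing)) for i in range(num_wtcbs)]


# Illustrative geometry: 8 WTCBs on a 4-surface cylinder, 400 sectors/track.
layout = place_wtcbs(8, 4, 400)
```

With this layout a WTCB on some surface comes up every 50 sectors, so the wait for the next available WTCB stays short wherever the heads happen to be.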
According to another aspect of the invention, on a cylinder basis, the minimum rotational distance between the end of one WTCB and the start of the next must take longer to traverse than a control switch from one head to another. The reason for this will become clear from the method flow set forth in the detailed description.
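This spacing constraint can be checked mechanically for a candidate layout. The sketch below assumes that positions and the head switch time are both expressed in sectors and that WTCBs do not overlap; the function and parameter names are hypothetical.

```python
def gaps_allow_head_switch(starts, wtcb_len, sectors_per_track, switch_sectors):
    """Check that every gap between consecutive WTCBs exceeds a head switch.

    starts: start sectors of the WTCBs on one cylinder (any surface).
    wtcb_len: length of a WTCB in sectors (WTCBs assumed not to overlap).
    switch_sectors: sectors that pass under the heads during a head switch.
    """
    ordered = sorted(starts)
    for i, start in enumerate(ordered):
        next_start = ordered[(i + 1) % len(ordered)]  # wraps around the track
        gap = (next_start - (start + wtcb_len)) % sectors_per_track
        if gap <= switch_sectors:
            return False
    return True


# Illustrative values: four WTCBs, 50 sectors apart, on a 200-sector track.
ok = gaps_allow_head_switch([0, 50, 100, 150], 1, 200, 10)
too_tight = gaps_allow_head_switch([0, 50, 100, 150], 1, 200, 49)
```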
The flow of an exemplary embodiment of the invention is as follows. When a write command is sent to the drive, the data will be requested from the initiator and placed in the write cache, assuming sufficient room exists. At this point, status for the command will NOT be sent back to the initiator, in contrast to how it is done in a standard write cache scheme. Instead, when a current operation, either a read or a write, finishes its seek, blocks of the write cache will be written to the WTCBs on that cylinder.
In order to write any of the WTCBs, sufficient latency must exist to allow a control switch from the currently selected head to the head with the shortest latency WTCB that does not already contain active data, time to write at least one WTCB block, and time to switch back to the original track. When the drive's scheduler issues a seek command, it has calculated the rotational distance, measured in Servo ID samples (SIDs), between the prior command and the one it is starting. Once the drive arrives at the destination cylinder, it continues to read the SIDs until the desired starting sector is reached. The remaining latency on arrival can therefore be calculated as the difference between the total rotational distance calculated earlier and the time elapsed.
Alternatively, the current SID can be compared with the current command's SID location to determine the latency. The factors that determine the remaining latency include the prior command's SID and cylinder location as well as the current command's SID and cylinder location. If enough latency exists, multiple WTCBs may be written by switching heads in a round robin fashion.
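The latency test described above can be sketched as follows, using the SID comparison approach. All quantities are assumed to be measured in SIDs, and the function and parameter names are hypothetical. The sketch checks that the head switch completes before the WTCB arrives and that the write plus the switch back finish before the target sector arrives.

```python
def remaining_latency_sids(current_sid, target_sid, sids_per_rev):
    """Rotational distance from the current SID to the command's target SID."""
    return (target_sid - current_sid) % sids_per_rev


def can_write_wtcb(latency, wait_to_wtcb, switch, wtcb_len):
    """Decide whether one WTCB fits into the remaining latency.

    latency: SIDs until the current command's starting sector arrives.
    wait_to_wtcb: SIDs until the candidate WTCB starts under its head.
    switch: SIDs consumed by one head switch.
    wtcb_len: SIDs needed to write one WTCB.
    """
    head_ready_in_time = wait_to_wtcb >= switch
    back_in_time = wait_to_wtcb + wtcb_len + switch <= latency
    return head_ready_in_time and back_in_time


# Illustrative values: 100 SIDs per revolution, head at SID 90, target at 10.
latency = remaining_latency_sids(90, 10, 100)
fits = can_write_wtcb(latency, wait_to_wtcb=5, switch=3, wtcb_len=2)
```

Repeating the check after each write, with the latency reduced accordingly, would model writing several WTCBs in round robin fashion.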
An advantageous orientation of the WTCBs according to an aspect of the invention is designed to minimize WTCB latency. By utilizing the latency from the operation, which would usually be spent idly, the write cache can be advantageously stored on the disk with no additional operating time.
When all of the blocks for a write command have been written to WTCBs (this may take several operations), status for the command will be returned to the initiator. At this point, should a power failure or drive fault occur, the necessary information to recover will be on the media.
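The deferred status behavior can be sketched with a small tracker that reports completion only once every block of a command has reached a WTCB. The class name, method name, and status string are hypothetical.

```python
class PendingWrite:
    """Tracks which blocks of one write command still lack a WTCB copy."""

    def __init__(self, block_ids):
        self.unsaved = set(block_ids)  # blocks held only in volatile cache
        self.status_sent = False

    def block_saved_to_wtcb(self, block_id):
        """Record a WTCB write; return a status only when all blocks are safe."""
        self.unsaved.discard(block_id)
        if not self.unsaved and not self.status_sent:
            self.status_sent = True
            return "command complete"
        return None


command = PendingWrite([1, 2])
first_status = command.block_saved_to_wtcb(1)   # still waiting on block 2
second_status = command.block_saved_to_wtcb(2)  # all blocks now on the media
```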
According to an aspect of the invention, when the data for a write command is finally placed at its destination, its outstanding WTCB entries will be removed from the list.
According to an aspect of the invention, the recovery procedure is straightforward. When the drive is given the command to stop operation, it will flush the write cache and write a flag to its reserved data area, indicating that there are no outstanding write cache blocks.
According to an aspect of the invention, when the drive is restarted, it will check this flag and start recovery if necessary. In order to perform recovery, the drive will need to examine every WTCB on the media. Given the layout scheme provided, one cylinder's WTCBs could be examined per revolution. Therefore, recovery time would be dependent on the number of cylinders and the operating RPM.
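The dependence of recovery time on cylinder count and spindle speed can be estimated with a one-line calculation. The sketch assumes exactly one revolution per cylinder and ignores the time to step between cylinders, which the text does not quantify; the example figures are illustrative.

```python
def recovery_scan_seconds(num_cylinders, rpm):
    """Estimated time to examine every WTCB at one cylinder per revolution.

    Assumes one revolution (60 / rpm seconds) per cylinder and ignores
    the seek time needed to step from one cylinder to the next.
    """
    return num_cylinders * 60.0 / rpm


# Illustrative values: 10,000 cylinders scanned at 6000 RPM.
scan_time = recovery_scan_seconds(10_000, 6000)
```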
According to an aspect of the invention, having the WTCBs spaced around the entire track is not a requirement. Instead, multiple cylinders could be used, requiring that short seeks (one or two cylinders) be performed to write some WTCBs. This would reduce the time required for recovery at the expense of reducing the average number of WTCBs that could be written. This can be worked out based on the performance and recovery requirements for a particular implementation. According to an aspect of the invention, once the latest WTCB is found, its list of outstanding WTCBs will be used to write the data contained in those WTCBs to the correct destinations, putting the drive into a consistent state.
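The final recovery step can be sketched as a replay driven by the newest WTCB. The dictionary layout and key names below are hypothetical stand-ins for whatever the drive reads from the media, and replaying in time stamp order is an assumption made so that a later write to the same destination wins.

```python
def recover(wtcbs, disk):
    """Replay outstanding WTCB data to its final destinations.

    wtcbs: dict mapping a WTCB id to a dict with the keys
           "data", "destination", "timestamp" and "outstanding".
    disk:  dict mapping a destination address to data (modified in place).
    """
    if not wtcbs:
        return disk
    # The WTCB with the newest time stamp holds the current outstanding list.
    newest = max(wtcbs.values(), key=lambda block: block["timestamp"])
    # Replay oldest first (assumption) so newer data overwrites older data.
    for wtcb_id in sorted(newest["outstanding"],
                          key=lambda i: wtcbs[i]["timestamp"]):
        block = wtcbs[wtcb_id]
        disk[block["destination"]] = block["data"]
    return disk


found = {
    0: {"data": b"old", "destination": 5, "timestamp": 1, "outstanding": [0]},
    1: {"data": b"new", "destination": 9, "timestamp": 2, "outstanding": [0, 1]},
    2: {"data": b"done", "destination": 3, "timestamp": 0, "outstanding": [2]},
}
state = recover(found, {})  # WTCB 2 is not in the newest list, so it is skipped
```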
These and other aspects, objects and advantages of the invention will become apparent from the detailed description of exemplary embodiments set forth below.