1. Field of the Invention
The present invention relates to the data processing environment, and more particularly to data caching within such an environment.
2. Background and Related Art
The business world has become extremely reliant upon information technology, and computers now form the backbone of an ever-increasing number of large organizations. While personal computers (PCs), running office applications such as word processors, spreadsheets, databases, etc., have probably the highest visibility, data storage and transaction processing systems have a huge role to play behind the scenes. For example, such a system will typically form the basis of a supermarket's stock control. As another example, a bank will store details of all its customers and their accounts using such a system. Thus, every time a customer withdraws money via an ATM, or accesses one of the machine's other services, their transaction may be recorded in a data storage subsystem.
FIG. 1 shows a data processing system according to the prior art. A host 10, which may for example be an RS/6000 available from the IBM Corporation, includes an adapter card 20 which will typically plug into a PCI slot within the machine. The adapter is accessed via device driver 30 and includes an on-board fast-write cache 40. The host is connected to a network 50 through which it accesses a plurality of customer disks 61 to 64. These disks may form a Redundant Array of Independent Disks (RAID). Each disk may store, for example, 10 Gigabytes of data and is divided into a number of logical blocks. When access to a portion of this data is required, this is specified via a logical block address (LBA).
The host machine may process thousands of transactions 5 an hour. As each transaction arrives at the host, it is stored in local memory (not shown). The device driver 30 is informed and creates a transaction control block (TCB) which includes information such as transaction type (e.g., read or write), the relevant disk number, the address on the disk (LBA), the length of the relevant data, etc. This information is then copied into a control block structure local to the adapter.
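By way of illustration only, a transaction control block of the kind described above might be modeled as follows. The field names, the `TransactionType` enumeration and the `copy_to_adapter` helper are hypothetical and are not taken from any actual device driver; the sketch merely shows the information the TCB is said to carry.

```python
from dataclasses import dataclass
from enum import Enum


class TransactionType(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class TransactionControlBlock:
    """Hypothetical TCB layout; field names are illustrative only."""
    transaction_type: TransactionType
    disk_number: int     # which customer disk the transaction targets
    lba: int             # logical block address on that disk
    length_blocks: int   # length of the relevant data, in logical blocks


def copy_to_adapter(tcb, adapter_table):
    """Copy the TCB into a control block structure local to the adapter.

    Returns the slot index at which the TCB was stored.
    """
    adapter_table.append(tcb)
    return len(adapter_table) - 1
```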
In a system without cache memory, the adapter subsequently looks in the appropriate TCB to determine the transaction type and other pertinent details (see above) and then requests data from a specified customer disk or retrieves data from local memory and writes it to a customer disk. This achieved, a completion status is returned to the host which is then ready to process the next transaction. For example, in a banking system if a customer wishes to take money out of their current account, it is first necessary to verify via the disk that they have the funds to do so. It is then necessary to subtract the appropriate amount from their account balance and to write the new total. Thus, a number of read and write transactions are invoked.
A conventional disk rotates at approximately 10,000 rpm. When a command is received from the adapter, the logical block address specified is mapped to a physical cylinder, head number and block number; the combination of cylinder and head numbers defines the track number. The first step is for the disk drive housing the disk to move its actuator to the correct cylinder on the disk. This is known as seek time. In parallel with the seek, the required head is selected so that, when the seek completes, the required track is present at the selected head. Subsequently, the actuator has to wait for the disk to rotate round to the specified block number before it can begin reading or writing to the disk. This is known as latency. The combination of seek time and latency means that an LBA access can take on average 10-15 ms. Since disk access in such a system is typically quite heavy, this time is simply unacceptable.
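The rotational part of this figure can be checked with simple arithmetic. At 10,000 rpm one revolution takes 6 ms, so on average the disk must turn half a revolution (3 ms) before the required block arrives under the head. The sketch below adds an assumed average seek time; the 8 ms value is an illustrative assumption chosen to fall within the 10-15 ms range quoted above, not a measured figure.

```python
def rotation_time_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm):
    """On average the target block is half a revolution away."""
    return rotation_time_ms(rpm) / 2.0


def average_access_ms(rpm, average_seek_ms):
    """Seek time plus average rotational latency."""
    return average_seek_ms + average_rotational_latency_ms(rpm)


# Assumed average seek time of 8 ms (illustrative only).
print(average_access_ms(10_000, 8.0))
```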
Data caching is used to alleviate this problem. As previously mentioned, the adapter 20 of FIG. 1 has an on-board fast-write cache 40. As before a transaction 5 is stored in host memory and the device driver creates a TCB for it. This is used to determine the transaction type; and if it is a write, then the relevant data is retrieved from host memory. This time however the data is stored in the cache for background processing and a completion status returned to the host 10 subsequent to the transfer into the on-board cache (but prior to the background processing). Such processing can take place at a later date when transaction data can be written to the appropriate customer disk from the on-board cache (known as destaging); meanwhile, the host and adapter can be receiving and dealing with further transaction requests.
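The fast-write behavior described above may be sketched as follows, under simplifying assumptions. The class and method names are hypothetical, and customer disks are modeled as plain dictionaries. The point illustrated is that the completion status is returned as soon as the data is in the cache, while destaging happens later.

```python
from collections import deque


class FastWriteCache:
    """Toy model of an on-board fast-write cache (names are illustrative)."""

    def __init__(self):
        self._pending = deque()      # cached writes not yet destaged
        self.customer_disks = {}     # disk number -> {lba: data}

    def write(self, disk_number, lba, data):
        """Cache the write and report completion before any disk access."""
        self._pending.append((disk_number, lba, data))
        return "complete"

    def destage_one(self):
        """Background step: move the oldest cached write to its customer disk."""
        if not self._pending:
            return False
        disk_number, lba, data = self._pending.popleft()
        self.customer_disks.setdefault(disk_number, {})[lba] = data
        return True

    def pending(self):
        return len(self._pending)
```

In this model any entries still pending when the cache contents are lost never reach a customer disk, even though the host has already been told that they completed.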
It will be appreciated that the cache 40 is typically a volatile random access memory (RAM). Hence if the host is powered down (whether on purpose or by accident) or the host simply dies, or indeed the adapter fails, the cached data is lost. This is not a problem if the data contained within has already been destaged to the appropriate customer disk; but if it hasn't, then the transaction could be lost forever since the host 10 has already received a completion status for it (i.e., as soon as the transaction was written to the on-board cache).
Thus, many storage subsystems have introduced an element of fault tolerance. A typical solution is to mirror the data to a second cache 85 on another adapter 80 plugged into a second host 70. If one adapter then becomes unavailable, the other can flush the data to the appropriate customer disk, thereby preserving data integrity. This is known as an owner/partner configuration.
Although adapters 20 and 80 have been shown as being on different machines, it is quite likely that they will both be plugged into one machine. If this is the case, then at least one cache or memory must have battery backup (not shown) so that data is preserved in the event of power loss on the host. Battery-backed RAM obviously has a performance advantage in terms of lower latency and higher bandwidth; however, it also comes fraught with problems. One issue is the very limited space available to house a battery on a PCI adapter card. Furthermore, NiCd batteries are toxic, yet other rechargeable technologies cannot survive the high temperatures encountered in an adapter environment. Adapters can easily reach 60° C. and at that temperature the battery capacity degrades more quickly. Thus, such batteries have to be recharged periodically. They have a shelf life of approximately two years; but, in use, they last only two days. Periodic replacement of the batteries is inconvenient. Moreover, if for example a machine is unplugged on a long holiday weekend, the data is likely to be lost as the batteries will simply not hold out until the following Tuesday.
Cache with a battery-backup may be termed non-volatile since the data is not automatically lost if power is lost. There is a further problem associated with such a solution in that in the event of a failure the medium has to then be manually moved to a replacement adapter, with consequent delay and risk of physical damage and data loss (e.g., via electrostatic discharge).
Moreover, it is likely that the host 70 including the second adapter and on-board cache will also be receiving and processing its own set of transactions 7. It will therefore require host 10 to mirror its data. Thus, cache utilization (i.e., for transaction processing) will be half the size of the smallest cache, with half being used by its host to store transaction data, and the other half being used to mirror the other host's data. Such a setup therefore significantly reduces a system's ability to cache data.
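The capacity penalty can be stated as a one-line calculation. The function name and the cache sizes used below are hypothetical and serve only to make the arithmetic concrete.

```python
def usable_cache_per_host(cache_sizes_mb):
    """Cache space each host can devote to its own transactions.

    Each cache is split in two (own data and the partner's mirror), and the
    smallest cache limits how much can be mirrored in either direction.
    """
    return min(cache_sizes_mb) / 2


# Assumed sizes: a 64 MB cache paired with a 32 MB cache.
print(usable_cache_per_host([64, 32]))
```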
There are additional problems associated with the owner/partner configuration of FIG. 1. Complex communication protocols are involved which are difficult to implement in a reliable way. For example, if the owner adapter fails, then the partner will take control. If, however, the owner resurfaces, there is an issue to resolve as to which adapter is caching and processing the transactions which originally belonged to the owner. Furthermore, many customers use four or even eight adapters, but this then prohibits the use of write caching. Processor utilization on the adapters is also high when mirroring data between adapters. Thus, overall system performance is reduced.
Rather than data being mirrored to a second adapter, it may instead be written to a dedicated disk (not shown) on the network which is accessible by all adapters in the system. In a system using RAID, it is common to designate one of the disks as a hotspare. This can be swapped in if a disk within the array fails. It will be appreciated that the hotspare can be used as the dedicated cache disk in order to save expense. If a customer disk within the array then fails, data caching can be forgone and the hotspare substituted for the failed disk. U.S. Pat. No. 5,708,668 discloses the use of such a disk.
It will be apparent that caching to such a dedicated disk will have the high latency and low bandwidth problems associated with writing directly to a customer disk. Writing directly to the customer disk is, however, unlikely to be sequential. U.S. Pat. No. 5,708,668 also teaches writing sequentially to the cache disk and incorporates a performance gap to reduce latency. This is appended to customer data and is of a predefined length. Its purpose is to ensure that by the time this performance gap has been written, another transaction is ready for processing immediately. However, if a transaction is available before the performance gap is completely written, latency is still a problem. Furthermore, this performance gap is mainly intended for the situation where there is a queue of transactions. It is therefore not helpful when the workload is light or not constant since the gap will not necessarily be long enough.
Another technique, described in U.S. Pat. No. 5,748,874, uses a reserved area of a disk for storing the contents of a controller cache. In this patent, cache data is only transferred to the reserved area on interruption of power to the controller.
It is an object of the present invention to provide a method, apparatus, and system for caching data that overcomes most or all of the problems described above as existing in the art. Accordingly, the present invention provides a method for caching data in a data processing system including a host computer and a storage subsystem including at least one customer disk and a cache disk, said method comprising the steps of: receiving write transactions specifying data to be written to the at least one customer disk; caching said write transactions in a volatile memory of said storage subsystem; writing said cached write transactions to the cache disk, said writing step including: when available, writing transaction data sequentially to the cache disk; and only when transaction data is unavailable, writing padding data sequentially to the cache disk.
Since transaction data is written on the cache disk whenever it is available, latency and seek time are greatly reduced. A specific address at which to write the data is no longer specified. Instead, such data is written at the point directly under a relevant actuator head of the disk drive housing the cache disk at that current moment in time. At all other times, padding data is written. Thus, even if the workload is light or just not constant, the data can still be written sequentially on the disk.
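The writing step may be pictured with the following minimal simulation, given purely as a sketch. It assumes that one block passes under the head per time step and that every write is exactly one block long. The names `write_stream`, `arrivals` and `PADDING` are hypothetical. At each step the block under the head receives transaction data if any is waiting, and padding otherwise, so the head never waits for a particular address.

```python
from collections import deque

PADDING = b"PAD"  # placeholder padding block (illustrative)


def write_stream(arrivals, ticks):
    """Return the sequence of blocks written to the cache disk.

    arrivals maps a time step to the list of transaction blocks that become
    available at that step; ticks is the number of block times simulated.
    """
    queue = deque()
    written = []
    for tick in range(ticks):
        queue.extend(arrivals.get(tick, []))
        # Write transaction data when available, padding only otherwise.
        written.append(queue.popleft() if queue else PADDING)
    return written
```

For example, with two blocks arriving at step 1 and one at step 4, six steps produce padding, two data blocks, padding, one data block and padding, all written at consecutive positions.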
The cache disk eliminates the need for a battery-backup to the storage subsystem's volatile memory and hence all the problems that are associated with that (as described above). It also removes the need for a second adapter's volatile memory to be mirroring the first's transactions and thus having significantly less volatile memory space for storing its own set of transactions.
In the preferred embodiment, assuming that no customer data is received, the padding data is written on the current track until the end of that track. The head then switches to the subsequent track and remains on it until transaction data is available. Note that this may mean writing the padding data and then overwriting it with new padding. This, however, ensures that a large portion of the disk is not wasted on unhelpful data.
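One possible reading of this idle behavior is sketched below. It is an interpretation offered for illustration rather than the definitive mechanism: the head advances to the next track only if the track just completed holds transaction data, and otherwise wraps around and overwrites the padding-only track. The function name, the one-block-per-step model and the tuple layout are assumptions.

```python
def simulate_idle_padding(blocks_per_track, events):
    """Trace head positions as (track, block, kind) tuples.

    events is an iterable of booleans, one per block time; True means
    transaction data is available for that block, False means padding.
    """
    track, block = 0, 0
    track_has_data = False
    log = []
    for has_data in events:
        log.append((track, block, "data" if has_data else "pad"))
        track_has_data = track_has_data or has_data
        block += 1
        if block == blocks_per_track:
            block = 0
            if track_has_data:
                # Track holds transaction data: move on to the next track.
                track += 1
                track_has_data = False
            # Otherwise stay on the padding-only track and overwrite it.
    return log
```

With four blocks per track, one data block followed by eight idle block times and then another data block leaves the head on track 1 throughout the idle period, so only a single track is consumed by padding.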
Preferably, the cache disk is divided into at least two regions. The first of these is filled with transaction and padding data as appropriate, before the next region is written to in the same manner. At a point subsequent to the first region being filled with data, but before the last of the disk regions is completely filled with data, the first region is invalidated, such that it is ready to receive new data by the time the other regions are full.
According to the preferred embodiment, invalidation involves ensuring that all data contained within the first region has been destaged from the storage subsystem's volatile memory to the appropriate customer disks. Separating the two functions of writing and invalidating means that writes to one region can continue sequentially without being disrupted by the invalidation process taking place with regard to another region.
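The interplay between filling regions and invalidating them may be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, each region holds a fixed number of records, padding is represented by `None`, and destaging is modeled by appending records to a list rather than by writing to customer disks.

```python
class RegionedCacheDisk:
    """Toy model of a cache disk divided into regions (illustrative only)."""

    def __init__(self, num_regions, region_size):
        self.regions = [[] for _ in range(num_regions)]
        self.region_size = region_size
        self.current = 0      # index of the region being written
        self.destaged = []    # records destaged to customer disks

    def write(self, record):
        """Append a record (or None for padding) sequentially."""
        if len(self.regions[self.current]) == self.region_size:
            nxt = (self.current + 1) % len(self.regions)
            if self.regions[nxt]:
                raise RuntimeError("next region has not been invalidated")
            self.current = nxt
        self.regions[self.current].append(record)

    def invalidate(self, index):
        """Destage a filled region's data, then mark the region reusable."""
        if index == self.current:
            raise ValueError("cannot invalidate the region being written")
        self.destaged.extend(r for r in self.regions[index] if r is not None)
        self.regions[index] = []
```

Because `invalidate` never touches the region currently being written, sequential writes to that region proceed undisturbed, which is the separation of functions described above.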
Preferably the transaction data is pre-pended with a header, which is used to distinguish the transaction data from the padding data on the cache disk. In the preferred embodiment, this header includes information about the transaction data such as the disk to which it is to be written, the LBA on that disk, the length of the data itself, a timestamp, etc. This information is used in the event of an unavailability of the data in volatile memory of the storage subsystem to destage the data contained on the cache disk to the appropriate customer disks. Thus, access to the customer data is always possible.
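A header of this kind, and its use during recovery, might look as follows. The sketch is illustrative only: the `MAGIC` marker, the field widths and the byte-wise scan are assumptions, and padding is assumed to consist of zero bytes. A practical implementation would more likely align records to block boundaries.

```python
import struct

MAGIC = b"TXN1"  # hypothetical marker distinguishing data from padding
# magic, disk number, LBA, data length in bytes, timestamp (all assumed widths)
HEADER = struct.Struct("<4sHQIQ")


def encode_record(disk_number, lba, data, timestamp):
    """Prepend a header to the transaction data."""
    return HEADER.pack(MAGIC, disk_number, lba, len(data), timestamp) + data


def recover(image):
    """Scan a cache disk image and return the records found on it.

    Each record is returned as (disk_number, lba, timestamp, data), which is
    the information needed to destage it to the appropriate customer disk.
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(image):
        magic, disk, lba, length, ts = HEADER.unpack_from(image, offset)
        if magic == MAGIC:
            start = offset + HEADER.size
            records.append((disk, lba, ts, image[start:start + length]))
            offset = start + length   # skip over the data just read
        else:
            offset += 1               # padding: keep scanning
    return records
```

The records returned by `recover` could then be replayed in timestamp order to bring the customer disks up to date after the volatile memory contents have been lost.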
According to another aspect, the invention provides a method for caching data in a data processing system including a host computer and a storage subsystem including at least one customer disk and a cache disk, said cache disk being divided into at least two regions, the method comprising the steps of: receiving write transactions specifying data to be written to the at least one customer disk; caching said write transactions in a volatile memory of said storage subsystem; writing said cached write transactions to the cache disk, said writing step comprising: filling the first of said at least two regions with transaction data, before writing to another of said at least two regions.
In yet another aspect, the invention provides an apparatus for caching data in a storage subsystem including at least one customer disk and a cache disk, said apparatus comprising: a volatile memory; means for receiving write transactions specifying data to be written to the at least one customer disk; means, responsive to the receipt of said write transactions, for caching said write transactions in said volatile memory; and means for writing said cached write transactions to the cache disk, said writing means comprising: means for writing write transaction data, when available, sequentially to the cache disk; and means for writing said padding data sequentially to the cache disk only when transaction data is unavailable.
In yet another aspect, the invention provides a storage subsystem for caching data including the apparatus described above, at least one customer disk and a cache disk.