1. Field of Invention
The present invention relates generally to computing systems, and more particularly, to a method for providing a single operational view of virtual storage allocation without regard to processor or memory cabinet boundaries.
2. Description of Related Art
Technological evolution often results from a series of seemingly unrelated technical developments. While these unrelated developments might be individually significant, when combined they can form the foundation of a major technology evolution. Historically, there has been uneven technology growth among components in large complex computer systems, including, for example, (1) the rapid advance in CPU performance relative to disk I/O performance, (2) evolving internal CPU architectures, and (3) interconnect fabrics.
Over the past ten years, disk I/O performance has been growing at a much slower rate overall than that of the node. CPU performance has increased at a rate of 40% to 100% per year, while disk seek times have only improved 7% per year. If this trend continues as expected, the number of disk drives that a typical server node can drive will rise to the point where disk drives become a dominant component in both quantity and value in most large systems. This phenomenon has already manifested itself in existing large-system installations.
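The size of the gap implied by these rates can be sketched with a short calculation. This is an illustration only: it assumes the quoted rates compound annually and that the 7% seek-time improvement can be treated as a 7% yearly gain in disk performance. The function name `relative_gap` and the ten-year horizon are hypothetical choices, not taken from the text.

```python
def relative_gap(cpu_rate: float, disk_rate: float, years: int) -> float:
    """Return how many times further CPU performance has grown than disk
    performance after `years`, assuming both rates compound annually."""
    return ((1.0 + cpu_rate) / (1.0 + disk_rate)) ** years


# Rates quoted in the text: CPU 40% to 100% per year, disk seek 7% per year.
low = relative_gap(0.40, 0.07, 10)
high = relative_gap(1.00, 0.07, 10)
print(f"CPU-to-disk gap after 10 years: {low:.1f}x to {high:.1f}x")
```

Under these assumptions the gap widens by at least an order of magnitude over a decade, which is consistent with the observation that ever more drives are needed per node.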
Uneven performance scaling is also occurring within the CPU. To improve CPU performance, CPU vendors are employing a combination of clock speed increases and architectural changes. Many of these architectural changes are proven technologies leveraged from the parallel processing community. These changes can create unbalanced performance, leading to less than expected performance increases. A simple example: the rate at which a CPU can vector interrupts is not scaling at the same rate as basic instructions. Thus, system functions that depend on interrupt performance (such as I/O) are not scaling with compute power.
Interconnect fabrics also demonstrate uneven technology growth characteristics. For years, they hovered around the 10-20 MB/sec performance level. Over the past year, however, bandwidth has leapt to 100 MB/sec and beyond. This large performance increase enables the economical deployment of massively parallel processing systems.
This uneven performance negatively affects application architectures and system configuration options. For example, with respect to application performance, attempts to increase the workload to take advantage of the performance improvement in some part of the system, such as increased CPU performance, are often hampered by the lack of equivalent performance scaling in the disk subsystem. While the CPU could generate twice the number of transactions per second, the disk subsystem can only handle a fraction of that increase. The CPU is perpetually waiting for the storage system. The overall impact of uneven hardware performance growth is that application performance is experiencing an increasing dependence on the characteristics of specific workloads.
Uneven growth in platform hardware technologies also creates another serious problem: a reduction in the number of available options for configuring multi-node systems. A good example is the way the software architecture of a TERADATA® four-node clique is influenced by changes in the technology of the storage interconnects. The TERADATA® clique model expects uniform storage connectivity among the nodes in a single clique: each disk drive can be accessed from every node. Thus, when a node fails, the storage dedicated to that node can be divided among the remaining nodes. The uneven growth in storage and node technology restricts the number of disks that can be connected per node in a shared storage environment. This restriction is created by the number of drives that can be connected to an I/O channel and the physical number of buses that can be connected in a four-node shared I/O topology. As node performance continues to improve, we must increase the number of disk spindles connected per node to realize the performance gain.
Cluster and massively parallel processing (MPP) designs are examples of multi-node system designs which attempt to solve the foregoing problems. Clusters suffer from limited expandability, while MPP systems require additional software to present a sufficiently simple application model (in commercial MPP systems, this software is usually a DBMS). MPP systems also need a form of internal clustering (cliques) to provide very high availability. Both solutions still create challenges in the management of the potentially large number of disk drives, which, being electromechanical devices, have fairly predictable failure rates. Issues of node interconnect are exacerbated in MPP systems, since the number of nodes is usually much larger. Both approaches also create challenges in disk connectivity, again fueled by the large number of drives needed to store very large databases.
The foregoing problems are ameliorated in an architecture wherein storage entities and compute entities, computing over a high performance connectivity fabric, act as architectural peers. This architecture allows increased flexibility in managing storage and compute resources. However, this flexibility presents some unique problems. One such problem is maintaining the speed and flexibility offered by the architecture while still assuring secure storage of data.
In traditional architectures, efficient storage of data is enabled by the technique of write back caching. Data normally written to the disk by the CPU is first written into a write back cache. The data is then written to the disk during idle CPU cycles. This technique improves performance because a write to the write back cache can occur faster than to the disk or to RAM.
The use of a write back cache for disks also adds a degree of risk, because the data stays in the volatile memory of the disk device for a longer period of time before it is written to the disk media. Even though the period of time involved is typically a few seconds at most, the data may be lost if there is a crash or system failure before the data can be written to non-volatile storage.
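The trade-off described above can be illustrated with a toy model. This is a minimal sketch, not a description of any real disk firmware: the class `WriteBackDisk` and its methods are hypothetical names, the cache and the media are modeled as dictionaries, and a crash is modeled simply as the loss of the volatile cache.

```python
from typing import Dict, Optional


class WriteBackDisk:
    """Toy disk with a volatile write back cache (all names are hypothetical)."""

    def __init__(self) -> None:
        self.media: Dict[int, bytes] = {}   # non-volatile storage
        self.cache: Dict[int, bytes] = {}   # volatile write back cache

    def write(self, block: int, data: bytes) -> None:
        # The write completes as soon as the data is in the cache.
        self.cache[block] = data

    def read(self, block: int) -> Optional[bytes]:
        # Serve the newest copy: cache first, then media.
        if block in self.cache:
            return self.cache[block]
        return self.media.get(block)

    def flush(self) -> None:
        # Destage cached data to the media, for example during idle time.
        self.media.update(self.cache)
        self.cache.clear()

    def crash(self) -> None:
        # A failure discards whatever is still in volatile memory.
        self.cache.clear()


disk = WriteBackDisk()
disk.write(0, b"committed")
disk.flush()                 # block 0 reaches the media
disk.write(1, b"pending")    # block 1 is only in the cache
disk.crash()
survived = disk.read(0)      # still readable after the crash
lost = disk.read(1)          # None: the data never reached the media
```

The window between `write` and `flush` is the period during which data is exposed to loss.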
Write back caching can be used with highly distributed architectures as well. However, when write back cache protocols are implemented in such architectures, they require considerable communication and transaction overhead between the compute nodes and the storage media, reducing the speed and efficiency of the system. What is needed is a protocol for efficient write-back caching of data in distributed architectures. The present invention satisfies that need.
The present invention describes a method and apparatus for write-back caching in a data storage and processing system. The method comprises the steps of receiving, in a first I/O node, a write request including write data from a compute node; forwarding the write data from the first I/O node to a second I/O node; and sending an acknowledgment message from the second I/O node to the compute node after the write data is received by the second I/O node. After the data is written into non-volatile storage of the first I/O node, a purge request or command is sent to the second I/O node to purge the write data from the volatile memory of the second I/O node. In one embodiment, the purge request is not sent until the first I/O node receives a second write request, in which case the purge request is sent in the same interrupt as the write data for the second write request.
The processing system comprises a first and a second I/O node, each with means for receiving a write request from the compute node and forwarding the write data to the other I/O node. Each I/O node also comprises means for sending an acknowledgment message directly back to the compute node, without routing the acknowledgment through the I/O node that sent the write data. The result is an I/O protocol that reduces the number of interrupts required to store data, while still implementing write back caching to improve storage speed and turnaround.
The invention can also be described in terms of a program storage device, such as a hard disk, floppy disk, or CD, which tangibly embodies instructions for practicing the invention.
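The message flow summarized above can be sketched as a small in-memory simulation. This is an illustrative model under stated assumptions, not the claimed implementation: the classes `ComputeNode` and `IONode`, their method names, and the `messages_sent` counter are all hypothetical, message delivery is modeled as a direct method call, and storage is modeled as dictionaries.

```python
class ComputeNode:
    """Hypothetical compute node that records the acknowledgments it receives."""

    def __init__(self):
        self.acks = []  # list of (block, name of the acknowledging I/O node)

    def on_ack(self, block, sender):
        self.acks.append((block, sender))


class IONode:
    """Hypothetical I/O node; either node of a pair can act as 'first' or 'second'."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.disk = {}            # non-volatile storage
        self.cache = {}           # volatile copies of this node's own pending writes
        self.mirror = {}          # volatile copies held on behalf of the peer
        self.pending_purges = []  # blocks the peer may drop, not yet announced
        self.messages_sent = 0

    def write(self, client, block, data):
        # Acting as the first I/O node: cache the data, then forward it to
        # the peer together with any purge requests that were deferred.
        self.cache[block] = data
        purges, self.pending_purges = self.pending_purges, []
        self.messages_sent += 1
        self.peer.on_forward(client, block, data, purges)

    def on_forward(self, client, block, data, purges):
        # Acting as the second I/O node: apply piggybacked purges first,
        # keep a copy of the new data, and acknowledge the client directly.
        for purged in purges:
            self.mirror.pop(purged, None)
        self.mirror[block] = data
        self.messages_sent += 1
        client.on_ack(block, self.name)

    def flush(self):
        # Destage to non-volatile storage; the purge is deferred until the
        # next write is forwarded, so no separate purge message is sent here.
        self.disk.update(self.cache)
        self.pending_purges.extend(self.cache)
        self.cache.clear()


a, b = IONode("ion-A"), IONode("ion-B")
a.peer, b.peer = b, a
client = ComputeNode()

a.write(client, 7, b"first")    # B mirrors block 7 and acknowledges directly
a.flush()                       # block 7 is on A's disk; its purge is deferred
a.write(client, 8, b"second")   # the purge of block 7 rides along with block 8
```

In this sketch each write costs one forward and one acknowledgment, and the purge of block 7 adds no message of its own because it travels with the forwarded data for block 8.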