1. Field of the Invention
This invention relates to distribution of storage data within computer nodes of a clustered computing environment, and more particularly, to an apparatus and method for I/O request shipping operable using peer-to-peer communications within host bus adapters of the clustered systems.
2. Discussion of Related Art
A cluster is, in general, a collection of interconnected whole computers utilized as a single computing resource whereby a communication network is used to interconnect the computers within the cluster. A cluster typically contains several computers. From the viewpoint of a computer within this collection of computers, the rest of the computers and their respective attached resources are deemed remote, whereas its own attached resources are deemed to be local.
Resource sharing is one benefit of a computing cluster. A computer within the cluster can access the resources of another computer within the cluster, and the computers of the cluster can thereby share any resource in the cluster. Combining the processing power and storage resources of the cluster into one virtual machine increases the availability and capacity of resources within the cluster. For example, if one resource, such as a processor, in the cluster were to fail, another processor within the cluster could take over the load of the failed processor. To the requester, the failure of the processor is transparent because another peer processor services its request load.
A common application of such a clustered environment is for the sharing of disk storage resources. For example, in high volume transaction processing applications (e.g., a database transaction system), a large number of processors may be added to a computing environment all of which share access to common storage devices containing the shared database. The transaction processing load may therefore be distributed over a large number of processors operating in parallel to perform the requisite transactions. Problems arise where multiple computers, operating in parallel, share data and storage devices. Clearly, a level of coordination is required to assure that each of the computers is aware of updates in the storage devices made by others of the computers in the cluster. In environments that share common storage resources, two fundamental architectures have arisen to coordinate the shared access to storage devices: file level shared access control and block level shared access control. That is, information may be distributed between disks at the file level or at the block level.
Entire files may be distributed throughout a cluster's storage subsystem by storing the files on a local disk or storing the files on remote storage subsystems. Software executed by the host coordinates the communications between the host computer requesting the file and the local or remote storage subsystem containing the file. This software executed by the host is implemented within each host's operating system or can be a software layer operating on each host to coordinate access to the files.
Just as entire files may be distributed throughout a cluster's storage subsystem, a file can be partitioned into a plurality of individual blocks that can similarly be distributed throughout the cluster's storage subsystems. This allows the parts of the file to be concurrently accessed locally and/or remotely. Software executed by the host coordinates the communications between the host computer requesting the blocks and the local or remote storage subsystem containing the blocks. As previously stated, this software is presently implemented within each host's operating system or as a software layer operating on each host in a cooperative distributed manner.
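By way of illustration only, the following sketch shows one way in which the blocks of a file might be mapped onto the storage subsystems of several nodes. The round-robin placement policy, the 4096-byte block size, and the names `BlockLocation`, `locate_block`, and `blocks_for_range` are assumptions made for this example and are not taken from any particular system described above.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 4096  # assumed block size in bytes (not specified in the text)


@dataclass(frozen=True)
class BlockLocation:
    node: int         # index of the node whose storage subsystem holds the block
    local_block: int  # block index within that node's storage


def locate_block(file_block: int, num_nodes: int) -> BlockLocation:
    """Map a file-relative block number to a node using round-robin placement."""
    return BlockLocation(node=file_block % num_nodes,
                         local_block=file_block // num_nodes)


def blocks_for_range(offset: int, length: int, num_nodes: int) -> List[BlockLocation]:
    """Return the location of every block touched by a byte range of the file."""
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [locate_block(b, num_nodes) for b in range(first, last + 1)]
```

Under this assumed policy, consecutive blocks of a file reside on different nodes, so a request spanning several blocks can be serviced by several storage subsystems concurrently.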
Both block level distribution and file level distribution can be performed with physically shared disks. In a physically shared disk architecture, each node (computer) in the cluster has direct access to all the disks within the cluster to thereby provide "any-to-any" connectivity among hosts and disks. A layer of host software provides the coordination to allow hosts to access data from any disk within the cluster. One such example is the Oracle Parallel Database Server running in a DEC VAX/Alpha cluster.
The Oracle Parallel Database Server maintains the consistency of the database by utilizing a proprietary protocol, the distributed lock manager (DLM), to allow nodes to access the shared storage concurrently. Utilizing DLM software in a cluster of physically shared disks allows all computer nodes to have access to all disks directly through their own I/O subsystem so that each disk appears to be physically local. Each computer node can cache and/or lock shared disk-based structures utilizing the DLM software. For example, if one node wants blocks X, Y, and Z within disks A and B, it must first ask the DLM software for permission. The DLM will grant permission only after it has ensured that blocks X, Y, and Z are current. If another node has made recent changes to blocks X, Y, and Z and has cached the modifications locally, the DLM first asks that node to flush the modifications to disks A and B.
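The flush-before-grant behavior described above may be sketched as follows. This is a simplified illustration only and does not represent the Oracle DLM protocol or its interfaces. The classes `Node` and `LockManager`, their methods, and the use of a dictionary to stand in for the shared disks are hypothetical.

```python
class Node:
    """A cluster node holding a write-back cache of modified blocks."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk   # shared dict standing in for the shared disks
        self.dirty = {}    # block id -> modified data not yet written to disk

    def write_cached(self, block, data):
        """Modify a block in the local cache without writing it to disk."""
        self.dirty[block] = data

    def flush(self, blocks):
        """Write any cached modifications for the given blocks to disk."""
        for block in blocks:
            if block in self.dirty:
                self.disk[block] = self.dirty.pop(block)


class LockManager:
    """Grants access to blocks only after the on-disk copies are current."""

    def __init__(self, nodes):
        self.nodes = nodes

    def request(self, requester, blocks):
        # Ask every other node to flush cached changes to the requested blocks.
        for node in self.nodes:
            if node is not requester:
                node.flush(blocks)
        # Permission granted; a real lock manager would also record ownership.
        return True


disk = {"X": "old-x", "Y": "old-y", "Z": "old-z"}
n1, n2 = Node("n1", disk), Node("n2", disk)
dlm = LockManager([n1, n2])

n2.write_cached("X", "new-x")               # n2 holds a newer copy of block X
granted = dlm.request(n1, ["X", "Y", "Z"])  # n2 is flushed before n1 is granted
```

After the request is granted, the shared disk holds the modification that node `n2` had cached, so node `n1` reads current data.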
Physically shared disks are simple to manage, provide fast data access, and are the dominant approach in the market today. However, in large configurations, expensive switches and multiplexing devices are required to maintain any-to-any connectivity between nodes. Because of these switches and interconnects, this architecture is expensive to scale. In particular, each computer or disk added to such a physically shared disk architecture may require, in turn, the addition of larger, more complex, and more costly switching or multiplexing devices.
Both block level distribution and file level distribution can also be performed with logically shared disks. In a logically shared disk architecture, disks are not shared physically but are distributed across the nodes of the cluster, each node owning a subset of the total disks. File level shared access control or block level shared access control at the host level retrieves data in a cluster where there is primarily networked connectivity between the computer nodes of the cluster. That is, in this type of cluster, the application data are partitioned among the nodes of the cluster so that each node has direct access to physically local disks but must establish network connectivity with other nodes to retrieve files or blocks from remote disks on another node in the cluster.
To retrieve files from a remote resource, software in the host intercepts file or block level I/O requests and determines whether the particular file or block is stored locally or remotely. If local, the software passes the request down to the local file system or block I/O driver. If remote, the software passes the request to the node owning the remote disk via the inter-node communication network.
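The local-versus-remote decision described above may be sketched as follows. This is a minimal illustration under stated assumptions: the class `IoRouter`, the ownership table, and the stub functions `local_driver` and `send_to_node` are hypothetical stand-ins for the host's block I/O driver and the inter-node communication network.

```python
class IoRouter:
    """Routes each I/O request to the local driver or to the owning node."""

    def __init__(self, local_node, owner_of_disk, local_driver, send_to_node):
        self.local_node = local_node        # name of the node running this code
        self.owner_of_disk = owner_of_disk  # disk id -> name of owning node
        self.local_driver = local_driver    # callable servicing local requests
        self.send_to_node = send_to_node    # callable forwarding remote requests

    def submit(self, disk_id, block, op="read"):
        owner = self.owner_of_disk[disk_id]
        if owner == self.local_node:
            # The disk is attached locally: pass the request down the stack.
            return self.local_driver(disk_id, block, op)
        # The disk belongs to another node: pass the request over the network.
        return self.send_to_node(owner, disk_id, block, op)


# Stubs that only report which path a request took.
def local_driver(disk_id, block, op):
    return ("local", disk_id, block, op)


def send_to_node(owner, disk_id, block, op):
    return ("remote", owner, disk_id, block, op)


router = IoRouter(
    local_node="nodeA",
    owner_of_disk={"d0": "nodeA", "d1": "nodeB"},
    local_driver=local_driver,
    send_to_node=send_to_node,
)
```

In this sketch, a request for disk `d0` is serviced by the local driver, whereas a request for disk `d1` is forwarded to `nodeB`. The requesting software calls `submit` in the same way in both cases.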
The key benefit of logically shared disk architecture is the ability to scale the number of nodes by simple replication of the required subsystems. Unlike the physically shared disk architecture that requires complex switches and multiplexing devices, the logically shared disk architecture enables simple and inexpensive scaling of the cluster capacity and size, since any-to-any connectivity between computer nodes and disks need not be maintained. Additional storage devices are accessed by non-local computers of the cluster via existing network interfaces interconnecting the computers of the cluster. In like manner, each additional computer has access to all storage in the cluster either locally or via existing network connections among the computers of the cluster.
Despite their higher cost, physically shared disk architectures are the most prevalent because of their higher performance as compared to logically shared disk architectures. Many environments therefore have significant investments in application programs and associated "middleware" (intermediate layers of software) which are designed presuming the simple, flexible, any-to-any connectivity of physically shared disks.
"I/O shipping" is a technique that has evolved to allow such application programs and middleware to operate in an environment that in fact does not provide physically shared disks. Rather, I/O shipping methods are used to emulate the physically shared disk architecture using a low-level layer of host software. In essence, I/O shipping is a technique to implement logically shared disks even though any-to-any connectivity does not exist.
I/O shipping is presently performed at a block driver layer of the host software to preserve the simplicity of management of physically shared disks while enjoying the economic and scalability benefits of the logically shared disk architecture. I/O shipping receives block level requests from higher layers of software, which presume an any-to-any connection architecture underlies their operation. The I/O shipping layer processes I/O requests locally if the requested blocks are stored on the local disks and passes the I/O request to other host systems if they are not. I/O shipping thus allows existing software that presumes physically shared disks to continue to work in a cluster with logically shared disks. That is, I/O shipping determines whether the block request made by a higher level application, which assumes all disks are physically shared, can be satisfied locally or must be satisfied remotely, in which case the I/O request is "shipped" to another host computer. To higher level software layers, I/O shipping in essence emulates physically shared disks using logically shared disks.
All the above known cluster configurations suffer from a common problem in that the disk sharing control and coordination is performed within the host systems and therefore imposes an overhead load on the host systems. The degree of overhead processing varies somewhat depending upon the specific architecture employed. Nevertheless, all the above noted prior techniques impose a significant overhead-processing load on the computers of the cluster. Consequently, a need exists for an improved apparatus and method to provide cluster computing disk sharing (or more generally resource sharing) with high I/O throughput performance, low host system processing overhead, and lower cost/complexity as compared to prior host-based techniques.