1. Field of the Invention
The present invention pertains to data transfers in multiprocessor computing systems, and, more particularly, to a method and apparatus for supporting concurrent system area network inter-process communication and input/output ("I/O").
2. Description of the Related Art
Even as the power of computers continues to increase, so does the demand for ever greater computational power. In digital computing's early days, a single computer comprising a single central processing unit ("CPU") executed a single program. Programming languages, even those in wide use today, were designed in this era, and generally specify the behavior of only a single "thread" of computational instructions. Computer engineers eventually realized that many large, complex programs typically could be broken into pieces that could be executed independently of each other under certain circumstances. This meant they could be executed simultaneously, or "in parallel." Thus, a computing technique known as parallel computing arose. Parallel computing typically involves breaking a program into several independent pieces, or "threads," that are executed independently on separate CPUs. Parallel computing is sometimes therefore referred to as "multiprocessing" since multiple processors are used. By allowing many different processors to execute different processes or threads of a given application program simultaneously, the execution speed of that application program may be greatly increased.
In the most general sense, multiprocessing is defined as the use of multiple processors to perform computing tasks. The term could apply to a set of networked computers in different locations, or to a single system containing several processors. However, the term is most often used to describe an architecture where two or more linked processors are contained in a single enclosure. Further, multiprocessing does not occur just because multiple processors are present. For example, a stack of PCs in a rack serving different tasks is not multiprocessing. Similarly, a server with one or more "standby" processors is not multiprocessing, either. The term "multiprocessing," therefore, applies only when two or more processors are working in a cooperative fashion on a task or set of tasks.
In theory, the performance of a multiprocessing system could be improved by simply increasing the number of processors in the multiprocessing system. In reality, the continued addition of processors past a certain saturation point serves merely to increase communication bottlenecks and thereby limit the overall performance of the system. Thus, although conceptually simple, the implementation of a parallel computing system is in fact very complicated, involving tradeoffs among single-processor performance, processor-to-processor communication performance, ease of application programming, and cost. Conventionally, a multiprocessing system is a computer system that has more than one processor, and that is typically designed for high-end workstation or file server usage. Such a system may include a high-performance bus, huge quantities of error-correcting memory, redundant array of inexpensive disks ("RAID") drive systems, advanced system architectures that reduce bottlenecks, and redundant features such as multiple power supplies.
There are many variations on the basic theme of multiprocessing. In general, the differences are related to how independently the various processors operate and how the workload among these processors is distributed. Two common multiprocessing techniques are symmetric multiprocessing ("SMP") systems and distributed memory systems. One characteristic distinguishing the two lies in the use of memory. In an SMP system, at least some portion of the high-speed electronic memory may be accessed, i.e., shared, by all the CPUs in the system. In a distributed memory system, none of the electronic memory is shared among the processors. In other words, each processor has direct access only to its own associated fast electronic memory, and must make requests to access memory associated with any other processor using some kind of electronic interconnection scheme involving the use of a software protocol. There are also some "hybrid" multiprocessing systems that try to take advantage of both SMP and distributed memory systems.
SMPs can be much faster, but at higher cost, and cannot practically be built to contain more than a modest number of CPUs, e.g., a few tens. Distributed memory systems can be cheaper and can be scaled arbitrarily, but program performance can be severely limited by the performance of the interconnect employed, since the interconnect (for example, Ethernet) can be several orders of magnitude slower than access to local memory. Hybrid systems are currently the fastest overall multiprocessor systems available on the market. Consequently, the problem of how to expose the maximum available performance to the applications programmer is an interesting and challenging exercise. This problem is exacerbated by the fact that most parallel programming applications are developed either for pure SMP systems, exploiting, for example, the "OpenMP" ("OMP") programming model, or for pure distributed memory systems, exploiting, for example, the Message Passing Interface ("MPI") programming model.
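By way of illustration only, the difference between the two programming styles can be sketched in a few lines of Python. The sketch uses neither OpenMP nor MPI; it merely imitates the shared-memory style with a lock-protected shared value and the message-passing style with a queue. The names `shared`, `mailbox`, `shared_worker`, `message_worker`, and `run` are hypothetical and chosen for this example.

```python
import queue
import threading

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # four independent pieces of work

# Shared-memory style: every worker updates one shared total under a lock.
shared = {"total": 0}
lock = threading.Lock()

def shared_worker(chunk):
    partial = sum(chunk)
    with lock:
        shared["total"] += partial

# Message-passing style: workers share no state and send results as messages.
mailbox = queue.Queue()

def message_worker(chunk):
    mailbox.put(sum(chunk))

def run(worker):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(shared_worker)
run(message_worker)
message_total = sum(mailbox.get() for _ in chunks)
print(shared["total"], message_total)
```

In the first style the workers coordinate through memory that all of them can reach; in the second, every result must travel over an explicit channel, which is where interconnect performance enters.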
However, even hybrid multiprocessing systems have drawbacks, and one significant drawback lies in bottlenecks encountered in retrieving data. In a hybrid system, multiple CPUs are usually grouped, or "clustered," into nodes. These nodes are referred to as SMP nodes. Each SMP node includes some private memory for the CPUs in that node. The shared memory is distributed across the SMP nodes, with each SMP node including at least some of the shared memory. The shared memory within a particular node is "local" to the CPUs within that node and "remote" to the CPUs in the other nodes. Because of the hardware involved and the way it operates, data transfer between a CPU and the local memory can be 10 to 100 times faster than data transfer between the CPU and the remote memory.
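The effect of this gap on average performance can be estimated with simple arithmetic. The following sketch assumes a weighted-average cost model, a local access cost normalized to 1 unit, and a hypothetical workload in which 10% of accesses are remote; none of these assumptions comes from the hardware described above, and the function name `average_access_cost` is chosen for this example.

```python
def average_access_cost(remote_fraction, remote_slowdown, local_cost=1.0):
    """Weighted-average cost per access, in units of one local access."""
    remote_cost = local_cost * remote_slowdown
    return (1.0 - remote_fraction) * local_cost + remote_fraction * remote_cost

# Evaluate both ends of the 10x to 100x range for a 10% remote workload.
for slowdown in (10, 100):
    cost = average_access_cost(remote_fraction=0.10, remote_slowdown=slowdown)
    print(f"remote {slowdown}x slower, 10% remote: {cost:.1f} units per access")
```

Under these assumptions the average cost per access rises to about 1.9 units at a 10x gap and about 10.9 units at a 100x gap, so even a small share of remote accesses can dominate.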
Thus, a clustered environment consists of a variety of components, such as servers, disks, and tape drives, integrated into a system-wide architecture via a System Area Network ("SAN") fabric. A SAN architecture employs a switched interconnection between multiple SMPs. A typical application of a SAN is the clustering of servers for high performance distributed computing. Exemplary switched interconnections include, but are not limited to, ServerNet and InfiniBand, a technical specification promulgated by the InfiniBand Trade Association.
Two types of data transfer are currently used for moving data across the various components of a cluster. The first, called inter-process communication ("IPC"), is mainly involved in providing communication between processes by performing memory-to-memory transfers. More particularly, IPC is a capability supported by some operating systems that allows one process to communicate with another process. A process is, in this context, an executing program or task. In some instances, a process might be an individual thread. IPC also allows several applications to share the same data without interfering with one another. The second type of data transfer involves at least one I/O device, e.g., inter-node memory-to-disk and disk-to-disk transfers of data.
FIG. 1 illustrates one physical architecture of a computing system 100 currently available to realize the three logical interconnections between two nodes that may arise from device data transfers. Each node 110 is shown including only a single CPU 125, but may include several CPUs 125. The computing system 100 is a "hybrid" system exhibiting characteristics of both SMP and distributed memory systems. Each node 110 includes shared memory 115, provided by the shared disk(s) 120 and accessible by all the CPUs 125 in the computing system 100, and private memory 130, provided by the private disks 135, for each individual CPU 125.
The three types of logical interconnections for internodal data transfer are:
memory to memory, e.g., from the host memory 140 in one node 110 to the host memory 140 in the other node 110;
memory to disk, e.g., from the host memory 140 in one node 110 to a shared disk 120 or a private disk 135 in the other node 110; and
disk to disk, e.g., from a shared disk 120 or a private disk 135 in one node 110 to a shared disk 120 or a private disk 135 in the other node 110.
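For illustration, the three logical interconnections listed above can be expressed as a small lookup keyed on the source and destination endpoints. This is a minimal sketch; the names `Endpoint`, `classify_transfer`, and `involves_io` are hypothetical, and only the three combinations enumerated above are accepted.

```python
from enum import Enum

class Endpoint(Enum):
    MEMORY = "memory"
    DISK = "disk"

# Only the three internodal interconnections listed above are modeled.
_KINDS = {
    (Endpoint.MEMORY, Endpoint.MEMORY): "memory to memory",
    (Endpoint.MEMORY, Endpoint.DISK): "memory to disk",
    (Endpoint.DISK, Endpoint.DISK): "disk to disk",
}

def classify_transfer(source, destination):
    """Return the name of the logical interconnection for a transfer."""
    try:
        return _KINDS[(source, destination)]
    except KeyError:
        raise ValueError("not one of the three listed interconnections") from None

def involves_io(source, destination):
    """A transfer involves I/O when at least one endpoint is a disk."""
    classify_transfer(source, destination)  # reject combinations not listed
    return Endpoint.DISK in (source, destination)

print(classify_transfer(Endpoint.MEMORY, Endpoint.DISK))
print(involves_io(Endpoint.MEMORY, Endpoint.MEMORY))
```

In this sketch a memory-to-memory transfer corresponds to IPC traffic, while the other two kinds correspond to transfers involving at least one I/O device.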
As can be seen from FIG. 1, all three logical connections will occur over the peripheral component interconnect ("PCI") buses 145. Under the protocols defining the operation of the PCI bus 145, each internodal data transfer will need to arbitrate with other computing resources for control of the PCI bus 145. Furthermore, if the CPU 125 were to need access to other devices, e.g., the device 150, sitting on the PCI bus 145, it too would be required to arbitrate.
This quickly results in the PCI bus 145 becoming a performance bottleneck. The conventional approach represented in FIG. 1 suffers from the following drawbacks:
only memory-to-memory or disk-to-disk transfers are possible at any given time;
memory-to-memory transfer access speeds are limited to PCI speeds (assuming serial interconnect speeds ramp up);
an access to memory would prevent any other access to devices on the PCI bus;
peer-to-peer access would result in non-accessibility of other devices on both PCI buses (e.g., the PCI buses 145, 155); and
only one inter-node transaction can occur at any given time.
Hence, there is a need for a technique that will permit concurrent access for memory-to-memory transfers between nodes, memory-to-device transfers within a node, and memory-to-disk or disk-to-disk transfers between nodes.
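The cost of forcing every transfer through one arbitrated bus can be illustrated with an idealized timing model. In the sketch below the transfer durations are hypothetical, arbitration overhead is ignored, and the concurrent case assumes fully independent paths; the function names are chosen for this example and do not describe any particular hardware.

```python
def serialized_time(durations):
    """One shared bus: transfers win arbitration and run one at a time."""
    return sum(durations)

def concurrent_time(durations):
    """Independent paths: transfers overlap, so the longest one dominates."""
    return max(durations)

# Hypothetical durations, in arbitrary time units, for three pending transfers.
transfers = {
    "memory-to-memory (between nodes)": 4,
    "memory-to-device (within a node)": 2,
    "disk-to-disk (between nodes)": 6,
}

durations = list(transfers.values())
print("single shared bus:", serialized_time(durations))  # 4 + 2 + 6
print("concurrent paths: ", concurrent_time(durations))  # max(4, 2, 6)
```

With these illustrative numbers the shared bus needs 12 time units to complete all three transfers, while fully concurrent paths would need 6.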
The present invention is directed to resolving, or at least reducing, one or all of the problems mentioned above.
A new technique for transferring data between nodes of a clustered computing system is disclosed. In one aspect, the invention includes a cluster node comprising a system bus; a memory device; and an internodal interconnect. The internodal interconnect is electrically connected to the system bus and includes a remote connection port. The internodal interconnect is capable of transferring data from the memory device and through the remote connection port. In a second aspect, the invention includes a method for internodal data transfer in a clustered computing system. Each of at least two cluster nodes includes an internodal interconnect electrically connected to a system bus and a memory device connected to the system bus. The method itself comprises requesting a data transfer and then transferring the requested data. The requested data is transferred from the memory device in a first cluster node to the memory device in a second cluster node via the internodal interconnects in the first and second cluster nodes.
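The two steps of the method, requesting a transfer and then transferring the requested data, can be modeled in software for illustration only. The sketch below is not the disclosed hardware: a dictionary stands in for the memory device, the system bus is omitted, and the names `Interconnect`, `ClusterNode`, `link`, and `transfer` are hypothetical.

```python
class Interconnect:
    """Stand-in for an internodal interconnect with one remote connection port."""

    def __init__(self, node):
        self.node = node
        self.remote = None  # peer interconnect reached through the remote port

    def send(self, key):
        # Read from the local memory device and push through the remote port.
        self.remote.receive(key, self.node.memory[key])

    def receive(self, key, value):
        # Write arriving data into the local memory device.
        self.node.memory[key] = value


class ClusterNode:
    def __init__(self, name):
        self.name = name
        self.memory = {}  # dictionary standing in for the memory device
        self.interconnect = Interconnect(self)


def link(a, b):
    """Connect the remote connection ports of two nodes to each other."""
    a.interconnect.remote = b.interconnect
    b.interconnect.remote = a.interconnect


def transfer(source, destination, key):
    # Step 1: request the transfer; here the request is only validated.
    if key not in source.memory:
        raise KeyError(f"{key!r} is not held by node {source.name}")
    # Step 2: move the data from memory to memory via both interconnects.
    source.interconnect.send(key)
    return destination.memory[key]


first, second = ClusterNode("first"), ClusterNode("second")
link(first, second)
first.memory["block0"] = b"payload"
print(transfer(first, second, "block0"))
```

In this model the data travels from the memory of the first node to the memory of the second node solely through the two interconnect objects, mirroring the path described above.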