A parallel processing computer cluster is made up of networked computers that form nodes of the cluster. Each node of the cluster can contain one or more processors, each including one or more cores. A computational task, received from a requesting system, is broken down into sub-tasks that are distributed to one or more nodes for processing. If a node has multiple processors and/or cores, the sub-task is further decomposed among them. Processing results from the cores are collected by the processors, and then collected by the node. From the node level, results are transmitted back to the requesting system. The methods of breaking down and distributing these sub-tasks, and then collecting results, vary based upon the type and configuration of the computer cluster as well as the algorithm being processed.
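The decomposition and collection described above can be illustrated with a minimal single-process sketch. It is not the method of any particular cluster: the function names (`split`, `run_on_core`, `run_on_processor`, `run_on_node`, `run_on_cluster`), the even contiguous split, and the sum-of-squares workload are all assumptions chosen for illustration, and no real networking or parallel execution takes place.

```python
from typing import List, Sequence


def split(items: Sequence[int], parts: int) -> List[List[int]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def run_on_core(sub_task: Sequence[int]) -> int:
    """Hypothetical workload: a core sums the squares of its share of the data."""
    return sum(x * x for x in sub_task)


def run_on_processor(sub_task: Sequence[int], cores: int) -> int:
    """A processor decomposes its sub-task across cores and collects their results."""
    return sum(run_on_core(chunk) for chunk in split(sub_task, cores))


def run_on_node(sub_task: Sequence[int], processors: int, cores: int) -> int:
    """A node decomposes its sub-task across processors and collects their results."""
    return sum(run_on_processor(chunk, cores) for chunk in split(sub_task, processors))


def run_on_cluster(task: Sequence[int], nodes: int, processors: int, cores: int) -> int:
    """The task is distributed to nodes; node results go back to the requester."""
    return sum(run_on_node(chunk, processors, cores) for chunk in split(task, nodes))
```

For example, `run_on_cluster(list(range(10)), nodes=2, processors=2, cores=2)` gives the same total as computing the sum of squares directly, whatever the cluster shape. In a real cluster each level of collection involves communication, which is where the costs discussed next arise.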
One constraint of current parallel processing computer clusters is presented by inter-node, inter-processor and inter-core communication. In particular, within each computer node, a processor or core that is used to process a sub-task is also used to process low-level communication operations and make communication decisions. The time cost of these communication decisions directly impacts the performance of the processing cores and processors, which in turn directly impacts the performance of the node.
Within a computer system, such as a personal computer or a server, a PCIe bus, known in the art, provides multiple point-to-point serial communication lanes with faster communication than a typical computer bus, such as the peripheral component interconnect (PCI) standard bus. For example, the PCIe bus supports simultaneous send and receive communications, and may be configured to use an appropriate number of serial communication lanes to match the communication requirements of an installed PCIe-format computer card. A low speed peripheral may require one PCIe serial communication lane, while a graphics card may require sixteen PCIe serial communication lanes. The PCIe bus may include zero, one or more PCIe format card slots, and may provide one, two, four, eight, sixteen or thirty-two serial communication lanes. PCIe communication is typically designated by the number of serial communication lanes used for communication (e.g., “x1” designates a single serial communication lane PCIe channel and “x4” designates a four serial communication lane PCIe channel), and by the PCIe format, for example PCIe 1.1 or PCIe 2.0.
Regarding the PCIe formats, PCIe 1.1 is the most commonly used format; PCIe version 2.0 was launched in 2007 and is twice as fast as version 1.1. A single PCIe 1.1 lane has a transfer rate of 250 MB/s (250 million bytes per second) in each direction, nearly twice that of a 32-bit PCI standard bus, which has a peak transfer rate of 133 MB/s (133 million bytes per second) and is half-duplex (i.e., it can only transmit or receive at any one time).
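The lane and format figures above combine by simple multiplication, which the following sketch makes concrete. It assumes the commonly quoted rate of 250 MB/s per lane per direction for PCIe 1.1 and double that for PCIe 2.0, and it ignores protocol overhead; the names `PER_LANE_MB_S` and `pcie_bandwidth_mb_s` are illustrative only.

```python
# Assumed per-lane, per-direction rates in MB/s (millions of bytes per second).
PER_LANE_MB_S = {"1.1": 250, "2.0": 500}

# Peak rate of a 32-bit PCI standard bus, which is half-duplex.
PCI_32BIT_PEAK_MB_S = 133

# Lane counts a PCIe channel may provide (x1, x2, x4, x8, x16, x32).
VALID_LANE_COUNTS = (1, 2, 4, 8, 16, 32)


def pcie_bandwidth_mb_s(version: str, lanes: int) -> int:
    """Return the nominal one-direction bandwidth of a PCIe channel in MB/s."""
    if version not in PER_LANE_MB_S:
        raise ValueError(f"unknown PCIe format: {version!r}")
    if lanes not in VALID_LANE_COUNTS:
        raise ValueError(f"unsupported lane count: x{lanes}")
    return PER_LANE_MB_S[version] * lanes


if __name__ == "__main__":
    for version in ("1.1", "2.0"):
        for lanes in (1, 4, 16):
            rate = pcie_bandwidth_mb_s(version, lanes)
            ratio = rate / PCI_32BIT_PEAK_MB_S
            print(f"PCIe {version} x{lanes}: {rate} MB/s per direction "
                  f"({ratio:.1f}x 32-bit PCI peak)")
```

Because PCIe sends and receives simultaneously, the aggregate rate across both directions is twice the value returned here, whereas the PCI figure is shared between transmit and receive.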
Within a parallel application, a message-passing interface (MPI) may include routines for implementing message passing. The MPI is typically called to execute the message passing routines of low-level protocols using hardware of the host computer to send and receive messages. Typically, MPI routines execute on the processor of the host computer.
In high performance computer clusters, cabling and switching between nodes or computers of a computer cluster may create significant issues. One approach to simplify cabling between nodes is blade technology, well known in the art, which uses a large backplane to provide connectivity between nodes. Blade technology has high cost and requires special techniques, such as grid technology, to interconnect large numbers of computer nodes. When connecting large numbers of nodes, however, grid technology introduces data transfer bottlenecks that reduce cluster performance. Furthermore, issues related to switching technology such as costs and interconnect limitations are not resolved by blade technology.