Recent developments in distributed computing platforms, called “compute clusters,” show that considerable progress has been made in increasing data processing speed and reducing platform sizes. In general, a compute cluster comprises a number of interconnected nodes, each of which is capable of performing a number of independent data processing tasks. A node can be a processor, memory, computer server, storage server, an external network connection or any other data processing or data transmitting device. Compute clusters are typically employed in either data reduction applications or data generation applications. In data reduction applications, large input data sets, such as data provided by a scientific instrument, are processed by identifying patterns and/or producing aggregate statistical descriptions of the input data. For example, in order to analyze and interpret the large amounts of image data obtained from an optical scan of a microarray, the image data can be reduced to smaller aggregate statistical descriptions. In data generation applications, small input data sets typically provide initial conditions for simulations that generate large output data sets that can be further analyzed or visualized. Combustion models, weather prediction, and computer graphics applications that generate animated films are examples of data generation applications.
Compute cluster applications are typically partitioned into hundreds, thousands or even millions of tasks by identifying specific individual tasks that can each be independently performed. Applications are often partitioned by a message-passing interface computer program and execution environment. Tasks can be distributed to different nodes based on the following criteria: (1) the order in which each task is received, (2) the configuration of the nodes in the cluster, (3) the computational demand of each task, (4) the amount of memory needed for each task, (5) the amount of data transmitted between nodes, and (6) the input/output requirements of the application.
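The task-distribution criteria above can be sketched as a simple placement routine. The `Node` and `Task` fields and the first-fit policy below are illustrative assumptions for exposition, not a specific scheduler described in this document; criteria such as inter-node data volume and I/O requirements are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class Node:
    free_cpu: float   # remaining compute capacity of the node
    free_mem: int     # remaining memory of the node (MB)

@dataclass
class Task:
    cpu: float        # computational demand of the task (criterion 3)
    mem: int          # memory needed for the task (criterion 4)

def assign(tasks, nodes):
    """Place each task, in arrival order (criterion 1), on the first
    node whose remaining capacity can accommodate it (criteria 2-4).
    Returns a mapping of task index -> node index."""
    placement = {}
    for i, task in enumerate(tasks):
        for j, node in enumerate(nodes):
            if node.free_cpu >= task.cpu and node.free_mem >= task.mem:
                node.free_cpu -= task.cpu
                node.free_mem -= task.mem
                placement[i] = j
                break
    return placement
```

For example, with two nodes of differing capacity, `assign([Task(1.0, 512), Task(2.0, 1024)], [Node(2.0, 1024), Node(4.0, 4096)])` places the first task on the first node and, since that node's remaining compute capacity is then exhausted, the second task on the second node.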
Compute cluster nodes are typically interconnected via a network of high-speed, low-latency electrical interconnections that transmit data between nodes through a switch fabric. FIG. 1A illustrates a representation of a 4-node switch-fabric architecture 100. In FIG. 1A, physical nodes are interconnected via a switch fabric 102, where each physical node is represented by a first virtual node and a second virtual node. The first virtual node represents an input connection with the switch fabric 102, and the second virtual node represents an output connection with the switch fabric 102. For example, an input connection between the physical Node 0 and the switch fabric 102 is represented by a rectangle 104 and a directional arrow 106, and an output connection between the switch fabric 102 and the physical Node 0 is represented by a rectangle 108 and a directional arrow 110. Switch fabrics provide interconnections so that nodes can simultaneously transmit data to different nodes in the compute cluster. For example, the switch fabric 102 provides interconnections so that the Node 1 can simultaneously transmit data to the Node 2 and the Node 3, as indicated by dashed-line directional arrows 112 and 114.
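The conflict-free, simultaneous transfers that a switch fabric provides can be sketched with a minimal crossbar model: each fabric configuration maps input ports to output ports one-to-one, which is what allows several nodes to transmit at the same time without contention. The function name and dictionary-based port mapping below are illustrative assumptions, not elements of FIG. 1A.

```python
def crossbar_transfer(packets_by_input, port_map):
    """Deliver one packet per input port through a conflict-free
    fabric configuration. `port_map` maps input port -> output port
    and must be one-to-one; distinct outputs are what permit all
    inputs to transmit simultaneously."""
    assert len(set(port_map.values())) == len(port_map), "output-port conflict"
    delivered = {}
    for inp, out in port_map.items():
        if inp in packets_by_input:
            delivered[out] = packets_by_input[inp]
    return delivered
```

For example, `crossbar_transfer({0: "p0", 1: "p1"}, {0: 2, 1: 3})` delivers both packets in the same fabric configuration because they target distinct output ports.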
The data processed by each node is typically partitioned into smaller fixed-size packets that are then distributed through the switch fabric to particular nodes for processing. FIG. 1B illustrates an example implementation of the switch fabric 102, shown in FIG. 1A. In FIG. 1B, the switch fabric 102 includes input and output line cards, such as an input line card 118 and an output line card 120, a permutation network 122, and an arbiter 124. Data is first transmitted from the nodes to the input line cards. The input line cards partition the data streams into fixed-size packets. The packets are then transmitted to the switch fabric 102 and distributed to one or more first-in-first-out electronic-based data structures called “virtual queues.” The arbiter 124 receives information regarding the packets stored at the head of each virtual queue and accordingly configures the interconnections within the permutation network 122 to distribute a first batch of packets stored at the head of each virtual queue to particular nodes. The output line cards assemble the packets received from the permutation network 122 and transmit the assembled packets to nodes for processing. After the arbiter 124 has distributed the first batch of packets, the arbiter 124 reconfigures the permutation network 122 in order to distribute a second batch of packets stored at the head of each virtual queue for processing.
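The batch scheduling described above can be sketched as head-of-line arbitration over virtual queues: the arbiter inspects only the packet at the head of each queue and grants a batch in which each input and each output appears at most once. The keying of queues by (input, output) pair and the greedy grant order are illustrative assumptions, not the specific arbitration policy of the arbiter 124.

```python
from collections import deque

def arbitrate(voqs):
    """voqs[(inp, out)] is a FIFO virtual queue holding packets that
    arrived at input `inp` and are destined for output `out`. Inspect
    only the head packet of each queue and grant a conflict-free batch:
    at most one packet per input and per output per configuration."""
    batch = []
    used_inputs, used_outputs = set(), set()
    for (inp, out), queue in voqs.items():
        if queue and inp not in used_inputs and out not in used_outputs:
            batch.append((inp, out, queue.popleft()))
            used_inputs.add(inp)
            used_outputs.add(out)
    return batch
```

Calling `arbitrate` repeatedly models successive reconfigurations of the permutation network: packets left ungranted in one batch (because their input or output was already claimed) remain at the head of their queue for the next batch.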
In general, switch fabrics uniformly distribute data between nodes. However, compute clusters often have a number of nodes that exchange large amounts of data more frequently than other nodes, and the low-latency interconnections provided by switch fabrics have limited bandwidths. As a result, the amount of data that can be transmitted between nodes is not well matched to the data transfer needs of the particular nodes at each point in time, resulting in data processing delays. In addition, arbiters can delay data processing because they typically must receive information regarding all of the packets located at the head of each virtual queue before distributing a batch of packets. Manufacturers, designers, and users of compute clusters have therefore recognized a need for an interconnection architecture that provides large-bandwidth, high-speed interconnections and a switch fabric that does not rely on an arbiter to distribute packets.