The present invention relates to data communications networking.
Despite recent advances in the processing power of individual computers and the speed of accessing them over high-speed communication links, there will always be some computing problems that are larger than any individual computer can handle in a reasonable time on its own. Thus, it is common in some fields such as the design of an aircraft's airframe and the exploration of subterranean petroleum fields to assign a relatively small group of tightly coupled processors, e.g. two to 20 processors, to perform such projects. However, in some cases, the project is too big for such groups of processors to handle.
Some large-scale computing projects (LSCPs) that have been or are being handled by multiple thousands of processors include projects being conducted under the name Search for Extraterrestrial Intelligence (SETI). To further the SETI projects, interested individuals install programs on their personal computers (PCs) instructing their PCs, when otherwise idle, to process portions of data being collected by radiotelescopes. In another example, thousands of individuals installed programs on PCs which were then used to decode a widely used encryption algorithm.
However, in both such cases, the goals of the project were achieved only because the LSCPs were capable of being parsed into smaller micro-projects capable of being handled by processors that operate substantially independently from each other in a loosely coupled network. Neither of these LSCPs required a high degree of collaboration between processors, in the manner that is required to design an airframe or model a subterranean petroleum field, for example. Moreover, neither of the LSCPs required sharing of information between processors in real-time to support a real-time service.
There are many instances in which a high degree of collaboration between processors is required in real-time, for providing services in real-time. One common feature of the LSCPs for both the airframe design and petroleum exploration examples above is that large amounts of image data must be processed in multiple dimensions. Thus, for each point of the image, data representing variables in the three spatial dimensions are processed, as well as variables within other dimensions of time, temperature, materials, stress, strain, etc. The number of variables that are processed multiplied by the number of points within the image (the “resolution”) determines the size of the LSCP, such that the size of the LSCP grows geometrically when the number of variables and the number of points are increased.
The simulation of the actual world to a user of a processing system as a “virtual world” is another LSCP which requires a high degree of collaboration between processors and real-time sharing between them to provide real-time services. In particular, a high degree of collaboration and real-time sharing are required to provide a virtual world which simulates the actual world and actual sensory experiences from locations around the world, while providing interactive play. In order to make experiences believable to the user, much sensory data needs to be collected in real-time from actual world sites, and “recreated” when the user “visits” the corresponding virtual site in the virtual world. Thus, data representing experiences such as sights (e.g. current images of the actual world site), current sounds, and indications of temperature, humidity, presence of wind, and even smells must be collected and made available to the user.
Because no processor network is capable of supporting it, such a virtual world remains an unfulfilled need. It is estimated that the processing requirements for such a virtual world would exceed the capabilities of the fastest supercomputer in the world, which is currently the “Earth Simulator”, a supercomputer in Japan having a speed of 82 teraflops and a latency of 10 μs. The Earth Simulator is believed to be incapable of supporting such a virtual world because of high latency, among other reasons. High latency can be caused by high protocol overhead in messaging between processors. Thus, a need exists to provide a network of processors which communicate via a low overhead communication protocol having reduced latency, so as to permit increased collaboration between processors and improved sharing of information in real-time.
FIGS. 1A and 1B illustrate conventional topologies of data communications networks. FIG. 1A illustrates a topology including a cross-bar switch 10, while FIG. 1B illustrates a topology of a hierarchical network 20. The cross-bar switch includes an array of switch fabric elements 14, each having a buffer, for transferring messages between selected ones of a plurality of devices D0 through D3 at an input end 16 and selected ones of a plurality of devices D0 through D3 at an output end 18 of the cross-bar switch 10. As indicated in FIG. 1A, sixteen switch fabric elements 14 are needed to provide full input-output connectivity between four devices D0 through D3. FIG. 1A illustrates a use of the cross-bar switch 10 in transferring messages on a plurality of paths 12 between selected ones of the devices D0 through D3 at the input end 16 and the output end 18. For example, D0 transmits a message to D3 on a path 12, while D1 transmits a message on a path 12 to D2, and so on.
The hierarchical network 20 includes a set of four first stage buffers 22 for buffering communications from each of four devices D0 through D3 and a set of four first stage buffers 24 for buffering communications from each of four communicating elements D4 through D7. The four buffers 22 and the four buffers 24 are connected to two second stage buffers 26 which function to transfer communications between the first stage buffers 22 and the first stage buffers 24.
From the point of view of connectivity, both the cross-bar switch 10 and the hierarchical network 20 provide the same function. Any one of the devices attached to the network can communicate with any other device. However, from the point of view of maximum simultaneous throughput, the cross-bar switch 10 is superior because it includes many switch fabric elements 14 each having a buffer. The theoretical capacity of the cross-bar switch 10 equals the number of switch fabric elements minus one. Stated another way, the theoretical capacity of a 4×4 cross-bar switch such as shown in FIG. 1A is fifteen messages. In actuality, the maximum usable capacity is only a percentage of the theoretical capacity, generally within a range of about 60-70%. Hence, about 10 messages can be communicated simultaneously by the cross-bar switch. In another example, suppose that the cross-bar switch 10 interconnects eight devices, having 8×8=64 switch fabric elements. Then, the maximum usable capacity becomes about 0.60×64≈38 simultaneous messages. By contrast, the capacity of the hierarchical network 20 is limited by the number of buffers at the highest level of the network. In the example shown in FIG. 1B, the maximum number of messages that can be simultaneously transmitted over the network 20 is two because there are only two second stage buffers 26. Comparing the two types of networks 10 and 20, the hierarchical network has a maximum capacity (2) which is only about 5% of the maximum capacity (38) of the cross-bar switch 10.
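The capacity arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the 60% utilization figure is simply the lower end of the range given in the text. The sketch applies the stated rule (switch fabric elements minus one) to the 8×8 case as well, whereas the text multiplies by 64; both round to 38.

```python
def theoretical_capacity(n_devices: int) -> int:
    """Theoretical simultaneous messages for an n x n cross-bar switch:
    the number of switch fabric elements (n * n) minus one."""
    return n_devices * n_devices - 1


def usable_capacity(n_devices: int, utilization: float = 0.60) -> float:
    """Usable capacity as a fraction of the theoretical capacity.
    The default of 0.60 is the low end of the 60-70% range in the text."""
    return utilization * theoretical_capacity(n_devices)


def hierarchical_capacity(n_top_stage_buffers: int) -> int:
    """A hierarchical network is limited by its highest-stage buffers."""
    return n_top_stage_buffers


if __name__ == "__main__":
    for n in (4, 8):
        print(n, theoretical_capacity(n), round(usable_capacity(n)))
    # Ratio of hierarchical to cross-bar capacity for eight devices.
    ratio = hierarchical_capacity(2) / round(usable_capacity(8))
    print(f"{ratio:.1%}")
```

For eight devices the ratio comes out slightly above 5%, which agrees with the comparison drawn above.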
On the other hand, a hierarchical network 20 has superior economy to a cross-bar switch 10 because it has far fewer switch elements (in the form of first and second stage buffers) and far fewer interconnections between buffers as well. In a hierarchical network 20 which interconnects eight devices as shown in FIG. 1B, only ten buffers 22, 24 and 26 are needed, in place of the 64 switch fabric elements that are needed for an 8×8 size cross-bar switch similar to switch 10. While the cross-bar switch 10 provides superior simultaneous throughput, it is expensive to implement, as it requires many more switch fabric elements 14 than the hierarchical network 20 requires buffers 22, 24, 26.
Accordingly, it would be desirable to provide a network having a cross-bar switch topology for interconnecting a large number of communicating elements, having high capacity for transmitting simultaneous messages, while reducing the number of switch fabric elements required to implement such network.
FIG. 2 illustrates a configuration of a bridge 30 that is background to the present invention but which is not admitted to be prior art. As shown in FIG. 2, a plurality of devices BE0 through BE3 are connected for communication by a bridge 30 to a switching network 32. The bridge 30 converts messages received from devices BE0 through BE3 using a first communication protocol into messages for transmission onto a switching network 32 using a second communication protocol. Devices BE0 . . . BE3 are desirably those shown and described as “broadband engines” in commonly owned, co-pending U.S. patent application Ser. No. 09/815,554, filed Mar. 22, 2001 (hereinafter, the '554 Application). The '554 Application is hereby incorporated by reference herein. The BEs have a built-in capability of communicating over a first communication protocol such as that described in the '554 application as the IOIF protocol. However, BEs lack the capability of directly communicating over another protocol stack such as the communication protocol licensed under the name Infiniband™ by the Infiniband Trade Association®. The switching network 32 is desirably a high speed, high capacity, serial communications network, having a topology of a cross-bar switch such as that described above with reference to FIG. 1A. Communications over the switching network 32 are required to be formatted for transport according to the physical link layer 34 of the Infiniband protocol stack. Bridge 30 includes an IOIF adapter 36 for communicating with BEs and an Infiniband adapter 38 for converting communications from IOIF protocol to Infiniband protocol for transport over switching network 32.
Protocol stacks are logically divided into “layers” according to the well-known Open Systems Interconnect (OSI) reference model. According to the OSI reference model, a protocol stack includes, from the bottom up, a physical layer which conveys the bit stream at the electrical level, e.g., the voltages, frequencies and other basic operation of the hardware which supports the network. Next, a data link layer operates above the physical layer. The third layer of the stack, the network layer, handles routing and forwarding of messages at the packet level from one node to other nodes to which it is directly connected. Usually, a fourth layer of the protocol stack, the transport layer, operates above the network layer, the transport layer controlling connections between non-directly connected devices of the network, and providing a mechanism for tracking the progress of transferring packets of a multiple-packet communication across the network.
The management of these protocol stack layers is represented in FIG. 2 as follows. Connected to switching network 32 is an Infiniband physical layer controller 40, managing the physical connection of the bridge 30 to the switching network 32 and performing the transmission and reception of signals. The Infiniband link layer controller 42 operates above the physical layer controller 40, managing link characteristics and providing network layer function. The Infiniband adapter 38 provides transport layer function, controlling communications involving multiple packets and connections between devices BE0 . . . BE3, etc., and other devices across the network 32. The Infiniband adapter 38 is robust, having the capability of maintaining connections such that few communications are dropped or prevented from succeeding.
The bridge 30 used for converting communications between the IOIF protocol and the Infiniband protocol to permit BEs to communicate with devices over the switching network 32 has a serious disadvantage. The upper layers of the Infiniband protocol stack, i.e., all layers above the network layer, have high latency. Stated another way, a multi-packet message being transmitted across the bridge 30 and switching network 32 is slowed down by the operation of the Infiniband adapter 38. As shown and described below relative to FIGS. 3A and 3B, high latency results from the Infiniband protocol needing several preparatory operations to be performed before allowing a message to be transmitted.
The high latency of the Infiniband protocol is undesirable. Large-scale computing projects require simultaneous processing by a large number of BEs, while also requiring the continuity and uniformity of shared memory to be maintained. High latency greatly reduces the efficiency of cooperative computing projects, effectively limiting the number of processors which can cooperate on a large-scale computing project.
Accordingly, it would be desirable to provide a bridge capable of supporting multiple protocol stacks, such that a more streamlined, low latency protocol stack is available for use, as appropriate, when devices such as BEs need to cooperate together on computing projects. In addition, the bridge should still support the upper layers of the Infiniband protocol stack when needed.
BEs communicate with each other over an input output interface (“IOIF”) to which they are attached. When BEs are directly attached to the same IOIF, the BEs are said to be “local” to the IOIF, or just “local BEs”. When BEs are not directly attached to the same IOIF, communications between them must traverse one or more networks, e.g., a switching network. In such case, the BEs are said to be “remote” from each other, or just “remote BEs”.
An IOIF communication protocol governs communications over the IOIF. The IOIF protocol provides a high degree of supervision of message traffic, which is beneficial for tracking communications across the IOIF.
It is desirable for BEs to utilize the IOIF protocol to communicate messages between remote BEs disposed at locations of the network requiring many machine cycles to reach. One goal is that communication between such remote BEs occurs without high latency. As mentioned above, high latency limits the ability of processors within a network to cooperate together on a large-scale computing project.
A particular example of communicating over a network using the IOIF communication protocol is illustrated in FIGS. 3A and 3B. The information shown in FIGS. 3A and 3B is background to the present invention, but is not admitted to be prior art. In FIGS. 3A and 3B, elapsed time runs in a vertical direction from the top to the bottom. FIG. 3A illustrates a read operation performed by a BE 50 acting as a master device. In this operation, BE 50 reads from a BE 54 acting as a slave device across the IOIF 52. As shown in FIG. 3A, a read command 56 is issued by the BE 50 when it has permission to present the command. This occurs after initial permission-establishing protocol signals 57, 58 and 59 are presented in that order. Following receipt of the read command, the BE slave device 54 performs operations 60, 61 and 62, and then returns an acknowledgement (ACK) 64. The BE slave device 54 also prepares the requested read data for presentation to the IOIF 52. However, the BE slave device 54 waits to provide the read data across the IOIF 52 until the ACK 64 has been delivered to the BE master device 50. In a cycle 68, subsequent to the delivery of the ACK 64, the data requested by the read command is delivered to the BE master device 50.
Similarly, FIG. 3B illustrates a write operation performed by a BE 50 acting as a master device to a BE 54 acting as a slave device, across the IOIF 52. As shown in FIG. 3B, a write command 156 is issued by the BE 50 when it has permission to present the command, after initial permission-establishing protocol signals 57, 58 and 59 are presented in that order. Then, at 155, the master BE 50 prepares data for transmission across the IOIF 52. Following receipt of the write command, the BE slave device 54 performs operations 160, 161 and 162, and then returns an acknowledgement (ACK) 164. During this time, the IOIF 52 also provides a “data credit” to the master BE 50, allowing it to transmit the write data across the IOIF 52. However, the master BE 50 must wait until the ACK 164 has been delivered before the write data can be transferred to the slave BE 54. In cycle 168, subsequent to the delivery of the ACK 164, the write data is delivered across the IOIF 52 to slave BE 54.
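The ordering constraint common to the read and write operations, namely that no data crosses the IOIF until the ACK has been delivered, can be illustrated with a minimal sketch. The event names and the treatment of each signal as one serialized step are assumptions made for illustration; the figures do not state cycle counts, and the data preparation and data credit steps, which overlap the slave operations, are omitted.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    actor: str  # "master" or "slave": which BE originates the event
    kind: str   # "permission", "command", "slave_op", "ack" or "data"


def transaction(op: str) -> List[Event]:
    """Return the serialized events of one IOIF read or write (illustrative)."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    # Three permission-establishing signals precede the command.
    events = [Event("master", "permission") for _ in range(3)]
    events.append(Event("master", "command"))
    # The slave performs three operations, then acknowledges.
    events += [Event("slave", "slave_op") for _ in range(3)]
    events.append(Event("slave", "ack"))
    # Data moves only after the ACK: from the slave on a read,
    # from the master on a write.
    data_sender = "slave" if op == "read" else "master"
    events.append(Event(data_sender, "data"))
    return events


def steps_before_data(events: List[Event]) -> int:
    """Count the serialized steps that complete before any data is transferred."""
    return next(i for i, e in enumerate(events) if e.kind == "data")


if __name__ == "__main__":
    for op in ("read", "write"):
        print(op, steps_before_data(transaction(op)))
```

Under these assumptions, both operations spend eight serialized steps on protocol signaling before the payload moves, which is the kind of overhead that contributes to latency when the devices are far apart.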
In large-scale networks, it is desirable to communicate messages between nodes with sufficient address bits to uniquely identify every node on the network. Otherwise, such networks must be broken up into smaller subnetworks having independent addressing domains, and a latency cost will be incurred when traversing various addressing domains between communicating devices. However, the number of addressing bits used by a physical hardware layer of a communicating device is always limited. It would be desirable to provide a way of converting communications between communicating devices from having a limited number of address bits to having a larger number of address bits used for communications in the large-scale network.
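One way to picture such a conversion is to prefix a device's limited local address with a node identifier, so that the combined value is unique across the network. The sketch below is only an illustration of that idea and is not taken from this description: the bit widths, the constant names and the function names are all hypothetical.

```python
LOCAL_ADDR_BITS = 32  # hypothetical width of a device's hardware address
NODE_ID_BITS = 16     # hypothetical width of a network-wide node identifier


def to_global(node_id: int, local_addr: int) -> int:
    """Widen a local address by placing the node identifier in the high bits."""
    if not 0 <= node_id < (1 << NODE_ID_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= local_addr < (1 << LOCAL_ADDR_BITS):
        raise ValueError("local_addr out of range")
    return (node_id << LOCAL_ADDR_BITS) | local_addr


def to_local(global_addr: int) -> tuple:
    """Split a widened address back into (node_id, local_addr)."""
    node_id = global_addr >> LOCAL_ADDR_BITS
    local_addr = global_addr & ((1 << LOCAL_ADDR_BITS) - 1)
    return node_id, local_addr


if __name__ == "__main__":
    g = to_global(3, 0x1000)
    print(hex(g), to_local(g))
```

With a single addressing domain of this kind, a message need not be re-addressed at subnetwork boundaries, which is the latency cost the paragraph above refers to.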
Moreover, communicating devices may need read access to any data stored in any directly accessible memory device of a large-scale network. It would be desirable to provide a way for a communicating device to perform a global remote direct memory access (global RDMA) from any other memory available to it on the network.