1. Field of the Invention
The present invention relates generally to the field of network connected multiprocessor systems and, more specifically, to a mechanism for improving the performance of data transmission through the network of such systems.
2. Discussion of the Prior Art
Network connected multiprocessor systems typically comprise nodes which communicate through a switching network. The nodes may be uni-processor workstations or bus based shared memory multiprocessor (SMP) systems. A node may also be an I/O subsystem, such as a disk array, which itself contains an I/O processor. In such systems, a variety of traffic may be communicated over the interconnection network: shared memory coherence and data messages, TCP/IP packets, disk blocks, user level messaging, etc. Each type of traffic relies on certain properties of the network to provide the type of service the producers and consumers of that traffic expect. In some cases latency is critical, such as with shared memory coherence traffic and with some types of user level messages. In other cases, throughput is more critical than latency, such as with disk accesses. In some cases, a quality of service guarantee in terms of latency or bandwidth is required. The challenge for the interconnection network in such systems is to provide appropriate characteristics for each data type. Except for the quality of service case, this typically involves balancing the requirements of one data type against those of another.

Various types of interconnection networks have been devised to address the general problem of providing good latency and throughput for a variety of traffic types. Most of these techniques have been developed in the context of packet switched (as opposed to circuit switched) networks. In these networks, the original message to be transmitted is decomposed into smaller units at two levels. At the first level, a message is broken into packets, which may be fixed or variable in length. At the next level, packets are broken into fixed size ‘flits’. A flit is the fundamental data unit that can be interleaved at the lowest level of the network (i.e., the switching elements and the physical wires that connect them). The flit is also the level at which most techniques for enhancing network latency and throughput have been deployed.
The earliest packet switched networks were “store and forward” networks. In a store and forward network, entire packets are passed from switching element to switching element. A subsequent packet is not transmitted until the entire packet in front of it has completed transmission. A later enhancement to this basic approach was “wormhole routing”, which introduced the notion of a flit. Instead of waiting for an entire packet to be received into a switching element before forwarding, the first flits of the packet can be transmitted to a downstream switching element even before the later flits have been received from the upstream switching element. In this way, a packet can be stretched across the entire network through a ‘wormhole route’. Wormhole routing significantly improves latency in lightly loaded networks; however, it can severely degrade network throughput by blocking links that unrelated traffic could otherwise have used. A third type of network, called “virtual cut-through”, alleviates the blocking problem by providing enough buffering in the switching elements that, when a route is blocked, an entire packet is guaranteed space and can be held entirely within the switching element. Of course, this guarantee comes at the expense of considerable space on the switching element if it is to work efficiently.
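By way of illustration only, the latency benefit of forwarding flits before the whole packet has arrived may be sketched as follows. The sketch assumes an unloaded network in which every link carries one flit per cycle and routing decisions take no time; these assumptions, the function names and the parameter names are hypothetical and are not taken from the description above.

```python
def store_and_forward_cycles(hops: int, flits_per_packet: int) -> int:
    """Cycles to deliver one packet when every switching element must receive
    the whole packet before forwarding it (assumed: one flit per link per cycle)."""
    return hops * flits_per_packet


def wormhole_cycles(hops: int, flits_per_packet: int) -> int:
    """Cycles to deliver one packet when flits are forwarded as they arrive:
    the first flit needs `hops` cycles and the rest follow one cycle apart."""
    return hops + flits_per_packet - 1
```

Under these assumptions, an 8 flit packet crossing 4 links takes 32 cycles with store and forward but 11 cycles with wormhole routing. The model deliberately ignores contention, which is where the blocking problem described above arises.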
A more recent development in packet switched networks for multi-processors is “virtual channels”. Each physical channel, i.e., wire link between switching elements, is conceptually partitioned amongst multiple ‘virtual’ channels. Each virtual channel includes a physical virtual channel buffer on the switching element. The virtual channels are multiplexed across common physical channels, but otherwise operate independently. A blocked packet flit on one virtual channel does not block packet flits on a different virtual channel over a common physical channel.
Virtual channel networks provide better network utilization and reduce average communication latency by allowing data on one virtual channel (or lane) to overtake data on a different virtual channel when there is contention downstream on one channel but not on another. Other desirable properties are guaranteed ordering of transmissions on each channel and the ability to prioritize different data types. One factor that limits the improvement in network utilization is fragmentation of bandwidth on network links due to underutilized network flits. This can occur when data types assigned to different virtual channels are smaller than the flit. If these data types are communicated frequently, but not frequently enough to allow several of them to be packed into a flit, the flits become underutilized, which can result in network underutilization. Furthermore, there is a motivation to make flits large to increase the payload to overhead ratio, which only exacerbates the problem. Also, if the flit size is optimized for communication of large objects such as IP packets, the network may not be suitable for communication of smaller objects such as cache lines.
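The fragmentation effect can be made concrete with a small calculation. The following is a minimal sketch that assumes a message travels alone in its flits and that ignores any per-flit header overhead; the function name and the byte sizes are illustrative assumptions only.

```python
import math


def flit_utilization(message_bytes: int, flit_bytes: int) -> float:
    """Fraction of the transmitted flit capacity that carries payload when a
    message is sent on its own, without sharing flits with other messages."""
    flits_needed = math.ceil(message_bytes / flit_bytes)
    return message_bytes / (flits_needed * flit_bytes)
```

For a hypothetical 8 byte message, a 32 byte flit is one quarter full and a 64 byte flit is one eighth full, so enlarging the flit to improve the payload to overhead ratio for large objects halves the utilization for the small one.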
It would be highly desirable to provide a network interface scheme that improves utilization in virtual channel networks and provides greater flexibility in how different data types are handled by the network.
It is an object of the invention to provide a network interface scheme that improves utilization in virtual channel networks and provides greater flexibility in how different data types are handled by the network.
According to the invention, there is provided a second level of virtual channels at the network interface, and particularly, the designation of many second level channels which may share a single first level channel on the network. First level channels operate within the network at the switch level and are only used for network decongestion. In general, virtual channels also provide prioritization of different types of communication, which capability can become essential when, for example, one data type blocking another leads to deadlocks. However, in the design of the invention, prioritization is handled by the second level channels which operate at the network interfaces, not at the switch level. The network switching elements are oblivious to the existence of second level channels.
Thus, the provision of a two level virtual channels network interface scheme decouples the issue of network congestion from the issue of how different types of communication are to be handled (with respect to the role of virtual channels in the network). The nodes of the system may be randomly assigned to a first level virtual channel and each node defines independently its own second level virtual channels. Alternatively, nodes may be assigned to virtual channels on the basis of how the system is partitioned, which reduces network interference between partitions. The number of second level channels and how they are managed may differ from node to node, as long as all the nodes that communicate with one another maintain a consistent policy amongst themselves. Each message type that a node recognizes is assigned to its own second level virtual channel. The network interface is responsible for packing the different message types from the second level channels onto its assigned first level channel. The ability to pack more than one message type on a single first level channel allows for more efficient use of network flits, which is how two level virtual channels can provide higher network utilization than single level virtual channels.
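The packing of several second level channels onto one first level channel may be sketched as follows. This is an illustration under stated assumptions rather than the packing policy of the invention: the greedy round-robin order, the representation of messages by their sizes alone, the absence of per-message headers and the channel names used in the example are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def pack_flits(
    second_level_queues: Dict[str, Sequence[int]], flit_bytes: int
) -> List[List[Tuple[str, int]]]:
    """Pack queued message sizes from several second level channels into
    flits for a single first level channel, keeping per-channel order."""
    pending = {cid: deque(sizes) for cid, sizes in second_level_queues.items()}
    flits: List[List[Tuple[str, int]]] = []
    while any(pending.values()):
        flit: List[Tuple[str, int]] = []
        free = flit_bytes
        progress = True
        while progress:  # keep sweeping the channels until nothing more fits
            progress = False
            for cid, queue in pending.items():
                if queue and queue[0] <= free:
                    size = queue.popleft()
                    flit.append((cid, size))
                    free -= size
                    progress = True
        if not flit:
            # Assumption of this sketch: every message fits within one flit.
            raise ValueError("a queued message is larger than a flit")
        flits.append(flit)
    return flits
```

With 32 byte flits and hypothetical queues `{"coherence": [8, 8], "ack": [4], "interrupt": [16]}`, the four messages occupy two flits, whereas sending each message in its own flit would occupy four.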
According to this scheme, three (3) message classes are supported by the mechanism: Latency Sensitive, Bandwidth Sensitive and Bi-Modal. Latency sensitive messages are messages of a size smaller than some M byte limit. Bi-Modal messages are those messages that comprise a latency sensitive component and a bandwidth sensitive component. The first N bytes of the message are considered latency sensitive and the remainder of the message is considered bandwidth sensitive. It is understood that both M and N are configurable through a trusted software agent. An application may specify a particular message class for any message that it sends, but the network interface hardware will reclassify a latency sensitive message as a bandwidth sensitive one if it is longer than M bytes. Furthermore, for a given flit of size L, the network interface hardware restricts M and N to be less than L (i.e., M < L and N < L). A latency sensitive message is further reclassified as bandwidth sensitive if M is greater than L (M > L).
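The classification rules of this paragraph may be sketched as follows. The sketch models only the reclassification behaviour described here; the enumeration, the function names and the decision to report the limit check as a boolean are assumptions made for illustration.

```python
from enum import Enum


class MessageClass(Enum):
    LATENCY_SENSITIVE = "latency sensitive"
    BANDWIDTH_SENSITIVE = "bandwidth sensitive"
    BI_MODAL = "bi-modal"


def limits_are_valid(m: int, n: int, flit_bytes: int) -> bool:
    """True when the configured limits satisfy M < L and N < L."""
    return m < flit_bytes and n < flit_bytes


def classify(
    requested: MessageClass, length: int, m: int, flit_bytes: int
) -> MessageClass:
    """Return the class actually applied to a message of `length` bytes.
    A latency sensitive request is downgraded when the message is longer
    than M bytes or when M itself exceeds the flit size L."""
    if requested is MessageClass.LATENCY_SENSITIVE and (
        length > m or m > flit_bytes
    ):
        return MessageClass.BANDWIDTH_SENSITIVE
    return requested
```

For example, with hypothetical values M = 16 and L = 32, a 24 byte message requested as latency sensitive is treated as bandwidth sensitive, while a 12 byte message keeps its requested class.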
Further according to this scheme, first level channels are divided into two (2) channel classes: Latency Sensitive and Bandwidth Sensitive. Within each class each first level channel is assigned a unique priority. Flits on higher priority first level channels overtake flits on lower priority channels. Second level channels provide a dedicated connection between two system nodes. The end points of a second level channel may or may not reside on different system nodes. When two end points are connected, a second level channel is formed and assigned a globally unique second level channel id. First level channels flow control flits at the link level. Second level channels flow control packets at the network interface level (i.e., end to end). The transport agent breaks messages into packets, if necessary, and passes them to the second level virtual channels specified by the agent above the transport level (e.g., a Non-Uniform Memory Access (“NUMA”) controller, the session layer of a TCP/IP stack, etc.). If a latency sensitive message has a length greater than M, the transport agent rejects the request and returns an error condition code. Second level channels split bi-modal messages into two parts. The first N bytes of the message are passed to a first level latency sensitive channel and the remainder is sent to a first level bandwidth sensitive channel nearest in priority.
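The handling of a bi-modal message may be sketched as follows. The representation of priorities as integers, the reading of "nearest in priority" as the smallest absolute difference from the latency sensitive channel's priority, and the tie-breaking rule (the first candidate listed wins) are assumptions of this sketch, as are the function names.

```python
from typing import Sequence, Tuple


def split_bimodal(message: bytes, n: int) -> Tuple[bytes, bytes]:
    """Split a bi-modal message into its latency sensitive part (the first
    N bytes) and its bandwidth sensitive remainder."""
    return message[:n], message[n:]


def nearest_priority(
    latency_priority: int, bandwidth_priorities: Sequence[int]
) -> int:
    """Pick the bandwidth sensitive channel priority closest to the priority
    of the latency sensitive channel (assumed: integer priorities)."""
    return min(bandwidth_priorities, key=lambda p: abs(p - latency_priority))
```

For instance, `split_bimodal(b"HEADERpayload", 6)` yields `b"HEADER"` for the latency sensitive channel and `b"payload"` for the bandwidth sensitive channel, and `nearest_priority(3, [1, 4, 7])` selects the channel with priority 4.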
Advantageously, the system of the invention achieves greater network utilization in systems that require fine grain communication, such as coherence controllers, or in systems that perform coarse grain communication, such as TCP/IP.