1. Technical Field of the Invention
This invention relates generally to data communications and more particularly to high-speed wired data communications.
2. Description of the Related Art
With the continued evolution of semiconductor processing technologies, systems on a chip (SOC) are now integrating numerous embedded processors to perform a number of specialized functions. SOCs are in turn being ganged together on system boards to further leverage the processing power of the SOCs for various applications that require processing intensive performance, such as high performance networking and communications systems, servers, graphics and high-definition streaming video.
While the performance of these embedded processors continues to increase at a pace that doubles every 18 months (current clock frequencies are approaching and will soon exceed 1 GHz), the I/O process (the input and output of data between processors and other devices) continues to be a drag on the ability of systems designers to exploit these advances in processor performance. Over the years, numerous I/O buses have been developed in an attempt to facilitate the I/O process between processors and other devices. Some examples of these buses are the ISA (Industry Standard Architecture) bus, VL-Bus (VESA (Video Electronics Standards Association) Bus), SPI, 1394, USB 2.0, 1 Gbit Ethernet, the AGP (Accelerated Graphics Port) bus, the LPC (Low Pin Count) bus, and the peripheral component interconnect buses PCI-32/33 and PCI-X. These buses often must be bridged together to support a varying array of devices on a chip or between chips. However, trying to integrate these buses increases system complexity, increases circuit size because additional circuitry must be devoted to arbitration and bridge functions, and generally delivers less than optimal performance overall.
One relatively new approach to providing higher performance I/O processing between complex processing devices is to employ a packetized, high-speed, low-latency, point-to-point mezzanine bus protocol that can interface with legacy buses like those mentioned above, as well as next generation buses including AGP 8×, Infiniband, PCI 3.0, and 10 Gbit Ethernet. Two examples of these newer high-speed bus protocols are Rapid I/O and the HyperTransport (HT) bus, previously known as the Lightning Data Transport (LDT) bus. Revision 1.04 of the HT specification is made available on the HyperTransport Technology Consortium's web site at www.hypertransport.org. The Rapid I/O specification is available at www.rapidio.org/home.
The HT protocol uses a software transparent load/store memory mapped addressing scheme to provide an interconnect between CPU, memory, and I/O devices that is high speed, low latency, and packetized, and that is scalable from a Symmetrical Multiprocessing (SMP) server architecture down to desktop personal computers and embedded systems. The protocol layer includes the I/O commands and the three virtual channels in which they are transported over dual, independent, unidirectional point-to-point links. The links can be 2, 4, 8 or 16 bits wide. The protocol is packet based, with all packets being multiples of 4 bytes and permitting a maximum payload of 64 bytes.
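The size constraints just described can be illustrated with a minimal Python sketch. The numeric limits (link widths of 2, 4, 8 or 16 bits, 4-byte packet granularity, 64-byte maximum payload) come from the description above; the function names and the zero-padding policy are assumptions made only for illustration and are not taken from the HT specification.

```python
from typing import List

VALID_LINK_WIDTHS = (2, 4, 8, 16)  # link widths in bits, per the description
PACKET_GRANULE = 4                 # all packets are multiples of 4 bytes
MAX_PAYLOAD = 64                   # maximum payload in bytes


def is_valid_link_width(bits: int) -> bool:
    """Return True if the width is one of the listed link widths."""
    return bits in VALID_LINK_WIDTHS


def is_valid_payload(length: int) -> bool:
    """Return True if a payload length is a non-zero multiple of 4, at most 64."""
    return 0 < length <= MAX_PAYLOAD and length % PACKET_GRANULE == 0


def split_into_payloads(data: bytes) -> List[bytes]:
    """Split data into payloads that satisfy the size rules.

    Assumption: data is zero-padded up to the next 4-byte boundary.
    """
    padded_len = -(-len(data) // PACKET_GRANULE) * PACKET_GRANULE  # round up
    padded = data.ljust(padded_len, b"\x00")
    return [padded[i:i + MAX_PAYLOAD] for i in range(0, padded_len, MAX_PAYLOAD)]
```

For example, a 150-byte buffer would be padded to 152 bytes and carried as three payloads of 64, 64 and 24 bytes, each of which satisfies is_valid_payload.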
All HyperTransport commands are either four or eight bytes long and begin with a 6-bit command type field. The most commonly used commands are Read Request, Read Response, and Write. The basic commands are summarized in the table of FIG. 1, listed by the virtual channel 100 within which the command 102 is transported over the HT link. The format for a Posted Sized Write command (or control packet) is illustrated in the table of FIG. 2. The format for a Response command is illustrated in the table of FIG. 3, and the format for a payload or I/O data packet is illustrated in the table of FIG. 4.
A virtual channel is an abstract connection through a single medium. An example of the implementation of virtual channels between two devices is shown in FIG. 5. Virtual channels 502 and 508 are realized by introducing separate flow controls for each abstract channel, and adding buffers 502a,b and 508a,b on each side of the physical medium 500 (i.e. a source 504 and target 506).
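The following Python sketch illustrates how separate buffers and separate flow control let several virtual channels share one physical medium. It assumes a simple credit scheme in which a channel may send only while the target-side buffer has a free slot; the class name, the channel names and the buffer sizes are hypothetical and serve only as an illustration.

```python
from collections import deque


class VirtualChannel:
    """One abstract channel with its own buffers and its own flow control."""

    def __init__(self, name, rx_capacity):
        self.name = name
        self.tx = deque()           # source-side buffer
        self.rx = deque()           # target-side buffer
        self.credits = rx_capacity  # free slots left in the target-side buffer

    def can_send(self):
        return bool(self.tx) and self.credits > 0

    def send_one(self):
        # Moving a packet to the target consumes one credit.
        self.credits -= 1
        self.rx.append(self.tx.popleft())

    def drain_one(self):
        # The target consuming a packet returns one credit to the source.
        packet = self.rx.popleft()
        self.credits += 1
        return packet


def link_cycle(channels):
    """Move at most one packet over the shared medium per cycle.

    Returns the name of the channel that sent, or None if none could send.
    """
    for channel in channels:
        if channel.can_send():
            channel.send_one()
            return channel.name
    return None


def demo():
    posted = VirtualChannel("posted", rx_capacity=1)
    response = VirtualChannel("response", rx_capacity=2)
    posted.tx.extend(["p0", "p1", "p2"])
    response.tx.extend(["r0", "r1"])
    # The target never drains the posted channel, so it stalls after one
    # packet, while the response channel keeps making progress.
    return [link_cycle([posted, response]) for _ in range(4)]
```

Under these assumptions, demo() shows the stalled channel holding back only its own traffic: the shared medium remains available to the other channel until that channel's own buffer fills.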
Because of the separate flow controls and the fact that transactions can be split into different categories, it is possible to introduce different levels of priority. Moreover, the use of virtual channels is the means by which deadlocks are prevented. A deadlock is a condition where forward progress cannot be made due to agents conflicting with one another for resources. The classic example of a deadlock involves a circular dependency with two agents that both require the same two resources, but in a different order. If enough virtual channels are used in a system (along with their requisite buffer resources) to eliminate dependencies, deadlocks should be avoided.
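The classic two-agent deadlock can be sketched in Python as a circular wait between agents. The function below is an illustrative assumption, not part of any bus protocol: each agent is modeled as holding at most one resource and waiting on at most one other, and the agent and resource names are hypothetical.

```python
def find_deadlock(holds, wants):
    """Return the agents forming a circular wait, or [] if there is none.

    holds: dict mapping each agent to the resource it currently holds.
    wants: dict mapping an agent to the resource it is waiting for.
    """
    owner = {resource: agent for agent, resource in holds.items()}
    for start in holds:
        seen = []
        agent = start
        # Follow "waits for the owner of" links until they end or repeat.
        while agent is not None and agent not in seen:
            seen.append(agent)
            agent = owner.get(wants.get(agent))
        if agent is not None and agent == start:
            return seen
    return []


# Two agents need the same two resources but acquire them in opposite order.
shared = find_deadlock(holds={"A": "R1", "B": "R2"},
                       wants={"A": "R2", "B": "R1"})

# With a dedicated resource per agent (for example, one buffer per virtual
# channel), neither agent waits on something the other holds.
separate = find_deadlock(holds={"A": "R1", "B": "R2"},
                         wants={"A": "R3", "B": "R4"})
```

In this sketch, shared contains both agents because each waits on a resource the other holds, while separate is empty because the dependency between the agents has been removed.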
The HT link has been implemented in numerous application specific topologies. One such configuration is the HT chain, one example of which is illustrated in FIG. 6. The chain is a series connection via the HT link 616 of multiple HyperTransport input/output host bridge 608 and/or tunnel devices 604, 606, 610 and 612 through a host processor 602 and a single physical channel over HT link 616. Typically, all transactions are initiated by host 602 or the host bridge 608. FIG. 7 illustrates a more commercial example of an HT host processor 700 coupling the two processors 702 and 704 to a number of other tunnel devices through series HT link 716 and I/O hub 712.
Another possible application is the HT switch, where a HyperTransport I/O switch handles multiple HyperTransport I/O data streams and manages the interconnection between the attached HyperTransport devices. For example, a four-port HyperTransport switch could aggregate data from multiple downstream ports into a single high-speed uplink, or it could route port-to-port connections. A switched environment allows multiple high-speed data paths to be linked while simultaneously supporting slower speed buses.
One popular arrangement of processing resources that is particularly useful in applications requiring significant processing power (e.g. server and mass storage applications) is a symmetric multiprocessor (SMP) arrangement that shares a memory between several processing resources over a shared multiprocessor bus (MP bus). As illustrated in FIG. 8, in an SMP system 800 (sometimes integrated as a system on a chip (SOC)), the physical memory 812 is both physically and logically contiguous. All of the processing units 802, 804, 806 and 808 can access any location in the memory with virtually uniform access times over the MP bus 814.
SMP systems also typically incorporate instruction/data caches 832, 834, 836 and 838 to buffer the "most accessed" instructions and data. This decreases the time required to access these instructions and data by avoiding an access to the larger and therefore much slower memory resource 812 (usually implemented off-chip as dynamic random access memory (DRAM)) almost every time a processor fetches instructions or data. To exploit the spatial locality of instructions and data, cache memories are designed to bring instructions/data from the memory resource in blocks or lines. Typically, the memory content is updated on a block by block basis. A level two (L2) cache 210 may also be implemented on chip to further reduce the number of accesses to the off-chip memory 812.
One of the additional complexities characteristic of SMP systems, where each processor caches blocks of content from the memory resource, is that of coherency. That is, if two of the processors 802, 804, 806 and 808 cache the same data line in their caches 832, 834, 836 and 838 respectively, and one of the processors then updates its own copy, the other processor will read stale data if it reads its version of the line in its cache after the update. To prevent this from occurring, coherence protocols are typically implemented over the MP bus 814 to ensure that the data is ultimately always coherent within the shared memory 812.
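The stale-read problem can be illustrated with a minimal Python sketch. It assumes a write-through cache and a simple write-invalidate policy over a shared bus, chosen only as one example of a coherence protocol; the description above does not state which protocol is used, and all class, function and address names are hypothetical.

```python
class Bus:
    """Shared bus connecting the caches to one memory."""

    def __init__(self, memory, coherent):
        self.memory = memory      # dict: address -> value
        self.coherent = coherent  # whether writes invalidate other copies
        self.caches = []


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}           # locally cached copies
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]                   # hit: may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value             # write-through (assumption)
        if self.bus.coherent:
            # Write-invalidate: every other cache drops its copy of the line.
            for other in self.bus.caches:
                if other is not self:
                    other.lines.pop(addr, None)


def read_after_remote_write(coherent):
    bus = Bus(memory={0x40: 1}, coherent=coherent)
    first, second = Cache(bus), Cache(bus)
    first.read(0x40)
    second.read(0x40)          # both caches now hold the line
    first.write(0x40, 2)       # one processor updates its own copy
    return second.read(0x40)   # what the other processor then observes
```

Without invalidation, read_after_remote_write(False) returns the old value 1 from the second cache; with invalidation enabled, the second cache misses, refetches the line and returns 2.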
Applications such as server and high-speed communications systems continue to demand more processing power. The SMP architecture, however, does not scale well above a certain number of processing units because the link (typically a shared multiprocessing bus) that endeavors to service the memory accesses for all of the processors with equal access to the entire memory becomes a bottleneck. Not only does the MP bus 814 service the memory access requests of all of the processors, but it also must handle I/O traffic that is initiated by the chip and that must be transported between the processing units and I/O buses and/or devices on or off chip (not shown).
To scale processing resources even further to achieve greater processing power, an alternate multiprocessing architecture called non-uniform memory access (NUMA) is often employed. This architecture provides a processing resource with its own locally associated physical memory, but also provides that processing resource access to the local physical memories of all of the other processing resources in the system as well. The memory access is non-uniform, as opposed to the SMP architecture described above, because the access time seen by a processing resource will be significantly shorter for local memory accesses than it will be for remote accesses made to the local memories of other processing resources. This architecture is also known as a distributed shared memory (DSM) system, because while the memory is physically broken up into separate local physical memories, they are shared logically as a contiguous memory from the perspective of each processing resource in the system. A representative DSM architecture is illustrated in FIG. 9.
Thus, each processing node 901, 903 and 905 can have an SMP processor resource 908, 910 and 912 coupled to its own physical memory 902, 904 and 906 respectively. Each processing node is coupled through an I/O port of its processor (not shown) to an interconnect fabric 914. Because the whole memory (i.e. the combination of memory resources 902, 904 and 906) is seen as logically contiguous by each processor 908, 910 and 912, each processor can access any shared memory location by simply using the logical address of that location, which is mapped to a physical location within one of the physical memory resources 902, 904, 906. Thus, each processor can access its own local physical memory resource directly (i.e. within its home processing node), or it can access data lines within the memory resources of remote processing nodes through the interconnect fabric. The home node is therefore responsible for its own section of the entire logical memory space and all read/write requests to a logical address that maps to its section will be sent to and processed by that node.
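The mapping from a logical address to its home node can be sketched in Python as follows. The sketch assumes three nodes that each own an equally sized, contiguous section of the logical address space; the section size, node count and function names are illustrative assumptions, not details taken from the description.

```python
NODE_MEMORY_SIZE = 1 << 30  # assumed 1 GiB of local memory per node
NUM_NODES = 3               # three processing nodes, as in the example above


def home_node(logical_addr):
    """Return (home node index, offset within that node's local memory)."""
    node, offset = divmod(logical_addr, NODE_MEMORY_SIZE)
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside the shared logical memory space")
    return node, offset


def is_local(requester_node, logical_addr):
    """True if the request is served by the requester's own local memory;
    otherwise it must travel through the interconnect fabric."""
    return home_node(logical_addr)[0] == requester_node
```

Under these assumptions, address 0x4000_0010 maps to node 1 at offset 0x10, so a request from node 1 is a local access while the same request from node 0 is a remote access routed to node 1.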
Even greater processing capability can be achieved if the SMP architecture as generally described in FIG. 8 is implemented on a first level (often the chip level), with the DSM NUMA architecture being implemented between the SMP chips on an inter-chip level. Thus, the processor resources 902a–c of the NUMA architecture of FIG. 9 can be implemented as the SMP processor systems 800 of FIG. 8, with the physical memories 912a–c of the NUMA architecture in FIG. 9 each being equivalent to the memory 812 in FIG. 8. The memory buses 940a–c of the NUMA architecture in FIG. 9 are equivalent to the memory bus 940 of the SMP system 800 in FIG. 8.
Of course, the complexity of maintaining coherence between cache and memory becomes significantly greater when two tiers of coherency must be maintained. Not only does the coherence have to be maintained on the SMP multiprocessor system level (i.e. intranode cache coherence), it must also be maintained between the processing nodes 900a–c at the NUMA level of the architecture (internode cache coherence). Those of average skill in the art will recognize that the SMP system 800 of FIG. 8 will also have one or more I/O ports (not shown) by which the SMP system 800 can communicate with external processing and I/O devices.
The I/O ports required of the SMP architecture on the intranode level as a result of their implementation within a two-tiered multiprocessing scheme are more sophisticated, as they provide an interface by which the status of memory blocks are lent to remote nodes by way of internode coherent memory transactions. This interface must also be able to translate internode level coherent memory transactions into the requisite local intranode coherent memory transactions, especially when the two coherency protocols are not identical.
Clearly, given the sheer volume of I/O and coherent memory transactions that must be serviced between the nodes, a bus that does not have sufficient bandwidth would quickly degrade the throughput of the structure and diminish the advantage in processing power that the architecture might otherwise provide. The DSM NUMA architecture of FIG. 9 could therefore benefit from the implementation of the interconnect fabric 914 as a high speed packetized I/O link such as the HT link described above. However, for the processing nodes to access memory data lines from anywhere in the distributed memory, the packetized I/O link must be able to transport the coherent memory transactions necessary to maintain coherency between the various physical memories associated with each of the processing nodes.
At present, the HT packetized I/O bus protocol does not support the transport of coherent memory transactions. Therefore, a need exists for a high speed packetized I/O link that can transport the coherency transactions necessary to implement a two-tiered NUMA multiprocessing system, while meeting the bandwidth requirements necessary to leverage the speed at which the processing resources of such systems can presently operate and at which they will operate in the future. There is further a need for an interface between such a packetized I/O bus and the processing nodes coupled thereto by which coherency at the node level can be maintained and by which coherent memory transactions at the internode level can be translated to coherent memory transactions at the intranode level and vice versa.