The term computer architecture in a very broad sense connotes the interconnection of a core set of functional units that include a processing subsystem that executes instructions and acts upon data, a memory subsystem that cooperates with the processing subsystem to enable selected data and instructions to be stored and transferred between the two subsystems, and an input/output (I/O) subsystem that allows at least the processing subsystem to exchange data and instructions with the network and peripheral environment external to the computer. This core set of functional units can be configured into different computer system topologies using various communication interconnection arrangements that govern the interchange of communications between the functional units. For example, a processor and its memories can be locally coupled on a circuit card, or they can be distributed across a system chassis via a backplane interconnection.
The Personal Computer (PC) represents the most successful and widely used computer architecture. Architecturally, not much has changed since the PC was first introduced in the 1980s. At its core, a typical PC is comprised of a single circuit board, referred to as a motherboard, that includes a microprocessor which acts as the central processing unit (CPU), a system memory and a local or system bus that provides the interconnection between the CPU chip and the system memory chips located on the motherboard and the I/O ports that are typically defined by connectors along an edge of the motherboard. One of the key reasons for the success of the PC architecture was the industry-standardized manner by which the components were interconnected.
A more recent example of a popular chassis-based computer architecture can be found in the area of high performance computing (HPC). One of the architectural innovations in the HPC area has been the adoption of the server blade configuration, in which one or more blades (such as server blades, memory blades, I/O blades, or PC blades) are plugged into a common rack that is based on industry standards. Instead of putting all of the chips for a computer system on a single motherboard, the functional elements of the computer system are broken out into smaller circuit cards, referred to as blades, that are then coupled together by a backplane that routes the largest amounts of data among the different blades. In most of these HPC blade configurations, the backplane fabric for the common rack has been implemented by a standardized parallel bus interconnection technology such as the PCI bus. Breaking out the functional components onto blades permits more flexibility in terms of configurations of components, while the use of a standardized interconnection such as the PCI bus permits blades from different providers to be configured together in the same common rack. As with the successful PC architecture, the use of a standardized local or system bus interface such as the PCI bus has been critical to the success of the blade architecture for HPC and server computer systems.
One of the parameters that has a significant impact on system performance and implementation is the memory access method used by processors. There are two fundamental architectures for accessing memory. The first is the Von Neumann architecture, wherein one shared memory is used to store both instructions (the program) and data, with one data bus and one address bus between processor and memory. This architecture requires that instructions and data be fetched sequentially, introducing a limitation in operation bandwidth that is often termed the “Von Neumann bottleneck”. The second is the Harvard architecture, which uses physically separate memories and dedicated buses for instructions and data. Instructions and operands can therefore be fetched simultaneously. Both architectures involve a bus or buses to transfer information between the processor and memory. It will be appreciated by those skilled in the art that, regardless of the processor and memory speeds, the speed of information transfer between the processor and memory can substantially impact the performance of the computer system.
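The bandwidth difference between the two memory access methods can be illustrated with a deliberately idealized cycle count. The sketch below assumes that every instruction needs exactly one instruction fetch and one operand fetch, that each fetch occupies its bus for one cycle, and that there are no caches or pipelining effects. The function names are hypothetical and chosen only for illustration.

```python
def von_neumann_bus_cycles(num_instructions: int) -> int:
    """Shared bus: the instruction fetch and the operand fetch are serialized."""
    fetches_per_instruction = 2  # assumption: one instruction word plus one operand
    return num_instructions * fetches_per_instruction


def harvard_bus_cycles(num_instructions: int) -> int:
    """Separate buses: the instruction fetch and operand fetch overlap in one cycle."""
    return num_instructions


n = 1_000
shared = von_neumann_bus_cycles(n)
split = harvard_bus_cycles(n)
print(f"shared bus: {shared} cycles, separate buses: {split} cycles")
```

Under these assumptions the shared bus needs twice as many bus cycles for the same instruction stream. Real processors narrow this gap with caches and prefetching, so the ratio is an upper bound for this toy model rather than a measured result.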
While there have been significant strides with respect to the available CPU power, memory capacity, and memory speeds for the individual components of a computer system, progress in processor-memory interconnections and memory access, in terms of the speed of the local or system parallel bus, has lagged far behind. Processors and memories that can operate at clock rates upwards of 3 GHz are known, but local system buses that can operate as a parallel bus interconnection at speeds that match the processor speeds are very rare, as such high speed buses are difficult to implement. For example, the system bus, referred to as the front side bus, that is used to externally interface to a Pentium 4 microprocessor chip operates slower than the speed of the processor. Conventionally, I/O devices external to the motherboard communicate over a slow speed I/O bus, such as the Peripheral Component Interconnect (PCI) bus, that is connected to a chipset on the motherboard, referred to as a bridge, which in turn communicates with the CPU over the front side bus. While this approach has worked well when I/O devices communicate at speeds that are much slower than the speeds of processors and main memory, current developments in I/O technologies, such as InfiniBand and multi-gigabit Ethernet, can deliver I/O communications at rates of several gigabits per second. These developments have blurred the conventional distinctions between CPU-memory and CPU-I/O transactions and negated the rationale for relegating I/O communications to a separate, slower legacy I/O bus such as the PCI bus.
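For a sense of scale, the peak transfer rate of a parallel bus is roughly its width in bytes multiplied by its clock rate and the number of transfers per clock. The sketch below applies that arithmetic to a conventional 32-bit PCI bus clocked at about 33.33 MHz. The helper name and the default of one transfer per clock are assumptions made for illustration, and a real bus delivers less than this peak because of arbitration and protocol overhead.

```python
def peak_bandwidth_mb_per_s(width_bits: int, clock_hz: float,
                            transfers_per_clock: int = 1) -> float:
    """Idealized peak rate of a parallel bus in megabytes per second."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_hz * transfers_per_clock / 1e6


# Conventional PCI: 32 data lines, about 33.33 MHz, one transfer per clock.
pci_peak = peak_bandwidth_mb_per_s(width_bits=32, clock_hz=33.33e6)
print(f"32-bit PCI peak: {pci_peak:.0f} MB/s")
```

The same function shows why widening or speeding up a parallel bus is the obvious lever for more bandwidth, and therefore why the difficulty of clocking wide buses quickly becomes the limiting factor.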
One of the challenges in attempting to increase the speed of I/O buses, such as the PCI bus and the PCI Extended (PCI-X) bus, is that a parallel bus arrangement is prone to clock skew between data flowing in the separate parallel data paths, which may, for example, differ from each other by a very small path length. Clock recovery and data reconstruction prove to be increasingly problematic and unreliable as path lengths, data transfer speeds and/or the number of parallel paths are increased. Additionally, parallel buses take up considerable circuit board real estate.
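The effect of a path-length mismatch can be estimated with simple arithmetic. The sketch below assumes a signal propagation speed of roughly 150 mm per nanosecond (about half the speed of light, a common rule of thumb for circuit board traces) and one bit per clock period on each line. Both values, the 15 mm mismatch, and all names are illustrative assumptions rather than measured figures.

```python
PROPAGATION_MM_PER_NS = 150.0  # assumption: rule-of-thumb speed on a board trace


def skew_ns(length_mismatch_mm: float) -> float:
    """Arrival-time difference caused by a trace-length mismatch."""
    return length_mismatch_mm / PROPAGATION_MM_PER_NS


def skew_fraction_of_bit_period(length_mismatch_mm: float, clock_hz: float) -> float:
    """Skew expressed as a fraction of one bit period (one bit per clock assumed)."""
    bit_period_ns = 1e9 / clock_hz
    return skew_ns(length_mismatch_mm) / bit_period_ns


mismatch_mm = 15.0  # hypothetical mismatch between two lines of the same bus
for clock_hz in (33e6, 133e6, 1e9, 3e9):
    fraction = skew_fraction_of_bit_period(mismatch_mm, clock_hz)
    print(f"{clock_hz / 1e6:7.0f} MHz: skew is {fraction:.1%} of a bit period")
```

The same fixed mismatch that is negligible at tens of megahertz consumes a large share of the bit period at gigahertz rates, which is consistent with the difficulty described above.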
Prior art solutions to the problems posed by increasing speeds on parallel buses for both front side buses and I/O buses have involved, for the most part, the use of proprietary protocols that are specific to a given provider of microprocessor chips and chipsets. For example, an advanced version of the front side bus on the Athlon 64/FX/Opteron, by Advanced Micro Devices, can operate at speeds approaching 1 GHz for a theoretical bandwidth of 14400 MB/s for a parallel bus that is 32 bits wide. Unfortunately, this is a proprietary solution that is incompatible with the general trend of migrating to industry-wide standards that encourage vendors to develop products which are interoperable with other vendors' solutions so as to reduce time and cost to market for new products.
The problem created by this divergence between processor speeds and memory access speeds is well known and has been referred to in the prior art as the memory gap or memory wall problem. See, e.g., Cuppu et al., “Organizational Design Trade-Offs at the DRAM, Memory Bus and Memory Controller Level: Initial Results”, University of Maryland Systems & Computer Architecture Group Technical Report UMD-SCA-1999-2, November 1999. The memory gap problem is further compounded by the need to address a large memory capacity. One solution employed in the prior art to overcome the memory wall/memory gap problem is to eliminate the parallel bus interface between the processor and memory and use a serial backplane interface instead of a parallel bus like the PCI bus.
One early attempt to establish a standardized serial backplane interface between processors and memories was the Scalable Coherent Interface. See Gustavson, D. and Li, Q., “The Scalable Coherent Interface (SCI)”, IEEE Communications (August 1996). Unfortunately, this proposal was not widely adopted.
More recently, proprietary high-speed serial interfaces between processors and memory have been developed by chip manufacturers, such as the AMD® HyperTransport and the Intel® Fully Buffered DIMM (FB-DIMM). Other alternatives have been proposed in the form of serial chip-to-chip interfaces, such as described by Trynosky, “Serial Backplane Interface to a Shared Memory,” Application Note: Virtex-II Pro FPGA Family, XILINX, Nov. 30, 2004, and multiple single byte serial processor-to-memory interfaces, as described by Davis, “The Memory Channel,” Summit Computer Systems, Inc., Sep. 19, 2004.
The migration from parallel to serial interfaces among components in a computing architecture is not unique to the processor/memory interface. Serial interfaces have also become the standard for almost all I/O communication channels, including backplanes. Examples include Advanced Switching Interconnect (ASI) switching fabrics that utilize hierarchies and multiple high speed clocked serial data lanes, and proprietary packet switched DMA techniques as described, for example, in U.S. Pat. No. 6,766,383. Industry standard I/O protocols, such as InfiniBand, Fibre Channel and Gigabit Ethernet, can deliver I/O communications at rates of several gigabits per second.
While the speeds of a serial I/O protocol theoretically could approach the speeds needed for the processor/memory interface, the communication overhead associated with serial I/O protocols has curtailed any serious attempts to consider using serial I/O protocols as a basis for a processor/memory interface. Serial I/O communication protocols generally have larger packet and address sizes that are better suited for accessing large amounts of data stored on disk or over a network. The larger packet and address sizes result in an increased communication overhead penalty. The processor/memory interface conventionally has required the ability to transfer data between the processor and memory for a single address location, a requirement for which the overhead of I/O transfers and protocols has been seen as massive overkill. In addition, there are many more transmission blocking and memory contention concerns that need to be addressed for I/O communications than for processor-to-memory interfaces.
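The overhead penalty can be made concrete by comparing the payload bytes of a transfer with the total bytes placed on the link. In the sketch below, the 40-byte per-packet overhead and the payload sizes are hypothetical round numbers chosen for illustration, not figures for any particular protocol.

```python
def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of transmitted bytes that carry useful payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


OVERHEAD_BYTES = 40  # hypothetical header, addressing and framing cost per packet

# An 8-byte payload stands in for a single memory word; the larger payloads
# stand in for the block transfers that I/O protocols are designed around.
for payload in (8, 64, 1024):
    efficiency = payload_efficiency(payload, OVERHEAD_BYTES)
    print(f"{payload:5d}-byte payload: {efficiency:.1%} of link bytes are data")
```

With a fixed per-packet cost, a single-word access spends most of the link capacity on overhead, while a large block transfer amortizes the same cost almost completely. This is the asymmetry that makes a packet format tuned for disk or network transfers a poor fit for single-address memory accesses.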
Some alternatives that utilize a serial I/O interface protocol for backplane connections instead of parallel bus interconnection technologies have been proposed. U.S. Publ. Appl. No. 20050091304 discloses a control system for a telecommunication portal that includes a modular chassis having an Ethernet backplane and a platform management bus, which houses at least one application module, at least one functional module, and a portal executive. In this patent application, a 1000BASE-T (Gigabit Ethernet) backplane provides a packet-switched network wherein each of the connected modules acts as an individual node on a network, in contrast to a conventional parallel bus connection such as a PCI bus.
U.S. Publ. Appl. No. 20060123021 discloses a hierarchical packaging arrangement for electronic equipment that utilizes an Advanced Telecommunication Computing Architecture (ATCA) arrangement of daughter boards in the form of Advanced Mezzanine Cards (AMCs) that are interconnected with a hierarchical packet-based interconnection fabric such as Ethernet, RapidIO, PCI Express or InfiniBand. In this arrangement, the AMCs in each local cube are connected in a hierarchical configuration by a first, lower speed interface such as Gigabit Ethernet for connections within the local cube and by a second, higher speed interface such as 10 G Ethernet for connections among cubes.
The problems of Ethernet switched backplane architectures in terms of latency, flow control, congestion management and quality of service are well known and described, for example, by Lee, “Computation and Communication Systems Need Advanced Switching,” Embedded Intel Solutions, Winter 2005. These issues have generally discouraged the adoption of serial I/O protocols for communications between processors and memory even as such serial I/O protocols are being used in the smaller physical dimensions of a circuit board or a computer or communication rack or cabinet having multiple cards/blades interconnected by a backplane. Instead, the trend has been to increase the capacity of individual chips and the physical size of each of the server blades in order to accommodate more processors and memory on a single chip or circuit board, thereby reducing the need for processor and memory interconnection that must be mediated across the backplane.
As processor speeds, memory speeds and network speeds continue to increase, and as external I/O is increasingly capable of delivering data at rates exceeding gigabit speeds, the current architectures for arranging the subsystems within a computing and communication architecture are no longer efficient. The memory access limitations of the Von Neumann and Harvard architectures are further aggravated by the presence of multiple processor cores within a chip, which places additional demands on processor and memory interconnect technology. There is therefore a need for a computing and communication chip architecture that is not constrained by the current architectural limitations, that is compatible with industry configuration standards, and that is scalable to match the speed, capacity and processing core requirements of the converged computing environment of next generation computers and communications equipment.