In essence, a conventional computer system that realizes the ‘Von Neumann Architecture’ comprises a core set of communicatively interconnected functional units which may be viewed as the fundamental operational blocks of a computer system. The functional units, singly or in combination with other functional units, are capable of performing one or more operations. The interconnections can be physical, logical or both. The core functional units include a processing subsystem that executes instructions and acts upon data, a memory subsystem that cooperates with the processing subsystem to enable selected data and instructions to be stored and transferred between the two subsystems, an input/output (I/O) subsystem that allows at least the processing subsystem to exchange data and instructions with the network and peripheral environment external to the computer, and a bus system over which the data and instruction interchanges occur.
This set of functional units can be configured into different computer system structures using various communication interconnection arrangements that govern the interchange of communications and the interactions between the functional units. Each such structure has associated with it a computer system architecture and a computer system organization. System architecture represents those attributes of the structure that are related to the logical execution of a given program on the system. The instruction set, the word length, data types, bus protocol, memory addressing, I/O modalities and other attributes that factor into the design of software for the particular system may be considered features of a specific system architecture. Computer organization, on the other hand, refers to a topology comprising hardware units and their interconnections that are operative to realize one or more of the system architectures. For example, the Central Processing Unit (CPU), the main memory organization, and the I/O and bus systems may be interconnected to realize the Personal Computer (PC) architecture as an example of one of the many kinds of computer architectures.
The Personal Computer (PC) represents the most successful and widely used computer architecture. Architecturally, not much has changed since the PC was first introduced in the 1980s. From a system organization perspective, a typical PC is comprised of a single circuit board, referred to as a motherboard, that includes a microprocessor which acts as the central processing unit (CPU), a system memory and a local or system bus that provides the interconnection between the CPU and the system memory and I/O ports that are typically defined by connectors along an edge of the motherboard. One of the key reasons for the success of the PC architecture is the standardized manner by which the components are interconnected.
A more recent example of another computer architecture based on industry standards is the server blade based system architecture popular in the high performance computing (HPC) arena. The server blade architecture is based upon a computer organization where circuit boards or cards containing circuitry, referred to as blades, are adapted to deliver specialized functionality and are co-located within a unitary housing and coupled together by a backplane. Typically, a blade can be replaced during operation, without interrupting the computer's operation, by another blade of the same or different functionality. Exemplary blades may include server blades, memory blades, I/O blades, PC blades, management blades, and storage blades. The backplane routes large amounts of data among different blades. In most of these HPC blade configurations, the backplane fabric is implemented by a standardized parallel bus interconnection technology such as the PCI bus.
The fundamental operational blocks of a computer system may be organized in the form of multiprocessor based, multi-core based, single-instruction-multiple-data (SIMD) or multiple-instruction-multiple-data (MIMD) capable parallel processor interconnections, message passing structures and other arrangements well known in the art. Each such computer organization supports a computer architecture requiring data operations involving one or more central-processing units (CPUs) and a general-purpose “main memory.” Any computer organization is likely to include at least a few basic arithmetic logic units as part of the at least one CPU that are configured to communicate with memory using a memory access operation(s) generally transparent to the program running on the CPU.
The technology enabling the memory access operation is often referred to as memory access technology (MAT) and is transparent to the program or code executing on the CPU. The term “memory” itself conventionally denotes a plurality of memories forming a memory hierarchy to allow the CPU the fastest access possible to the largest amount of memory and the fastest transfer rate. The memory hierarchy includes at least one general-purpose, relatively low-cost “main memory.” Memories in the memory hierarchy that are above the main memory are typically small, high-cost memories that provide faster access and transfer times than the main memory. General purpose registers and the various levels of cache memory, which comprise Static RAM (SRAM) for example, are fast memories. Fast memories are generally co-located with the arithmetic logic unit (ALU) within the CPU package to allow fast access and transfer rates by the CPU.
Conventional computer architectures are configured to dynamically move data within the various levels of memory in the memory hierarchy responsive to the data requirements of the CPU unfolding during program execution. The main memory is the first memory in the memory hierarchy which can be explicitly accessed under program control. Accesses and transfers from memories higher up in the memory hierarchy than the main memory are generally independent of program control, although a program can indirectly control movement of data to and from these memories by appropriately structuring the program to influence temporal and spatial locality of instructions and data that need to be fetched and stored in the fast memory. Main memory technologies include, for example, Dual Inline Memory Modules (DIMMs), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM) and Double Data Rate SDRAM (DDR SDRAM). The main memory and all memories above the main memory in the memory hierarchy are directly accessible by the CPU. The memories below the main memory may be accessed as input/output (I/O). Hard disk drives, flash drives, peripheral device memories and network accessible storage are examples of such lower level memories. Transfers and accesses from and to these memories are relatively slow, but they make a large memory capacity available at lower cost. The main memory may store data and/or instructions. While the main memory is hierarchically lower than the fast register and cache memories that are generally considered part of the CPU, the main memory represents a balance between access times, transfer rates, capacity and cost and is a workhorse among all the memories in the memory hierarchy.
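The way program structure influences spatial locality can be illustrated with a small sketch. It assumes a two-dimensional array stored row by row in a flat address space; the function names visit_order and strides are hypothetical and introduced only for this illustration. The sketch lists the linear offsets a program touches under two loop orders and does not measure cache behavior.

```python
def visit_order(rows, cols, row_major=True):
    """Return the flat offsets touched while traversing a rows x cols
    array that is stored row by row (an assumed layout)."""
    if row_major:
        # Inner loop walks along a row: neighbors in memory are visited in turn.
        return [r * cols + c for r in range(rows) for c in range(cols)]
    # Inner loop walks down a column: each step jumps a full row ahead.
    return [r * cols + c for c in range(cols) for r in range(rows)]


def strides(offsets):
    """Distance between consecutive offsets in a traversal."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


row_walk = visit_order(3, 4, row_major=True)
col_walk = visit_order(3, 4, row_major=False)
print(strides(row_walk))  # every step moves to the adjacent offset
print(strides(col_walk))  # most steps jump by the row length
```

A unit stride means that consecutive accesses tend to fall within data already fetched into the fast memory, whereas the larger stride of the column-wise walk tends to require a fresh transfer from main memory for each access once a row is longer than a cache line.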
Moore's law conjectures that transistor densities on board a processor chip double every 18 months or so, thereby doubling the clock rates of the processor chips. The pace of evolution of processor clock rates remains unmatched by memory clock rates, which double over a much longer period of time. Consequently, data transfer rates from the main memory to the processor remain much slower than the rates at which the processor can process the fetched data. This phenomenon, in which advances in memory and bus technologies have lagged behind advances in CPU speed, is known as the bandwidth bottleneck. Processors and memories that can operate at upwards of 3 GHz clock are now common, but local system buses that can operate as a parallel bus interconnection at speeds approaching the processor speeds are unknown. While there have also been significant strides with respect to the available memory capacity, technologies to effectively exploit the capacity without constraining CPU throughput remain elusive. For example, the system bus on a Pentium 4 microprocessor, referred to as the front side bus, operates upwards of 800 MHz, while the processor operates at multiple GHz clock speeds. This bandwidth bottleneck, caused by the latencies introduced by memory access and transfer over current parallel bus memory access technologies, severely limits the total throughput a contemporary CPU can deliver. The problem created by this divergence between processor speeds and memory access speeds is well known and has been referred to as the memory gap or memory wall problem. See, e.g., Cuppu et al., “Organizational Design Trade-Offs at the DRAM, Memory Bus and Memory Controller Level: Initial Results”, University of Maryland Systems & Computer Architecture Group Technical Report UMD-SCA-1999-2, November 1999.
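The scale of the mismatch can be estimated with simple arithmetic. The sketch below takes the 800 MHz bus rate and a 3 GHz processor clock from the discussion above; the 64-bit bus width and the figure of 8 bytes consumed per processor cycle are assumptions chosen for illustration, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(transfers_per_s, width_bits):
    """Peak bytes per second for a parallel bus of the given width."""
    return transfers_per_s * width_bits // 8


BUS_TRANSFERS_PER_S = 800_000_000   # front side bus rate cited in the text
BUS_WIDTH_BITS = 64                 # assumption: 64-bit data path
CPU_CLOCK_HZ = 3_000_000_000        # 3 GHz processor clock cited in the text
CPU_BYTES_PER_CYCLE = 8             # assumption: one 64-bit operand per cycle

bus_bw = peak_bandwidth_bytes_per_s(BUS_TRANSFERS_PER_S, BUS_WIDTH_BITS)
cpu_demand = CPU_CLOCK_HZ * CPU_BYTES_PER_CYCLE
shortfall = cpu_demand / bus_bw

print(bus_bw, cpu_demand, shortfall)
```

Under these assumptions the processor could consume data several times faster than the bus can deliver it at peak, before any access latency is taken into account.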
One of the methodologies employed in the prior art to overcome the memory wall is to directly interconnect the CPU and the main memory so that data and instructions move over relatively short distances. Parallel bus architectures are conventionally the most common means for communications between the CPU and the main memory. An arrangement that positions main memory as close as possible to the CPU provides maximum bandwidth at minimum latency by reducing bus-related latencies. Capacity of available memory can be increased to a certain extent by expanding the bus-width between the CPU and the main memory so as to allow a larger amount of memory to be addressed as well as to increase the overall throughput. However, limitations in pin counts available for coupling the CPU to the main memory severely curtail the size of memory that can be so coupled.
One prior art technique attempts to bridge the processor-memory performance gap by using a three dimensional integrated circuit technology that allows various memory sub-modules to be located proximate to the CPU in layered arrangements within a single package and interconnected to the CPU by short vertical wires. An exemplary model is described in Christianto C. Liu, Ilya Ganusov, Martin Burtscher, and Sandip Tiwari, “Bridging the Processor-Memory Performance Gap with 3D IC Technology,” IEEE Design and Test of Computers, November-December 2005, pp. 556-564. While this technique has the potential to deliver gains in terms of speed of memory access and transfer, the technique is still restricted by the size (alternatively the capacity) of memory that can be cost-effectively implemented within a monolithic package given the logic density and heat dissipation issues that may need to be resolved.
In addition to packaging related issues, there are other parallel bus design issues that depend on the distance separating the CPU from the main memory. Depending on whether the CPU and a relevant main memory are resident on the same board, on different boards, or part of different systems, bus-related latencies and the resultant degradation in the throughput of the CPU may be significantly different. Parallel bus architectures have inherent limitations that restrict the separation between the CPU and the main memory and also limit the number of parallel lanes (i.e. the width) of the parallel bus. For example, signals traveling on separate traces are prone to degradation by signal attenuation, noise, crosstalk and clocking skew. In addition, the parallel traces can take up a large amount of the circuit board real-estate. The energy expended in pushing the data bits at high data rates through the traces of the bus can lead to increased ground bounce and noise problems. The parallel traces for a parallel bus may need to be constructed with special path-lengthening convolutions to equalize minute differences in path lengths introduced by routing the bus along a curved path on the circuit board. The variation in the path lengths of the traces of the parallel bus will introduce timing discrepancy between signals whose effects are exacerbated at high data transfer rates. Moreover, since each physical trace is bi-directional, the bus has to switch between transmitting and receiving which inherently adds to the bus-latency.
One solution to the memory wall/memory gap problem is to replace the parallel bus interface between CPU and main memory with serialized bus technology. Serialized bus technology generally involves paired, uni-directional, point-to-point interconnects which carry packetized data. The data or command word intended for the parallel bus architecture is first recast into a plurality of packets, which are serially transferred over one of the point-to-point interconnects and reconstructed into the data or command word at the receiving end. To obtain higher throughput, multiple serial links configured in the form of a narrow bus may be used. Each link is clocked independently of the rest, making the set of links more skew tolerant than conventional parallel bus technology.
An early attempt to establish a standardized serial interface between processors and memories was the Scalable Coherent Interface. Gustavson, D. and Li, Q., “The Scalable Coherent Interface (SCI)”, IEEE Communications (August 1996). Unfortunately, this proposal was ahead of its time and was not widely adopted.
Several proprietary high-speed serial interfaces between processors and memory have been developed by chip manufacturers. Exemplary serial bus implementations include the AMD® HyperTransport and the Intel® Advanced Switching Interconnect (ASI) switching fabrics that utilize hierarchies and multiple high speed clocked serial data channels or proprietary packet switched Direct Memory Access (DMA) techniques as described, for example, in U.S. Pat. No. 6,766,383. HyperTransport protocol requires a root-complex and operates in a master-slave mode. This protocol also requires an external clock to be transmitted with the communications, thus making it unsuitable for out-of-the-box system-to-system communication over a network. Another prior art attempt to address the memory bottleneck is the recent fully buffered DIMM (FB-DIMM) memory access technology. FB-DIMM buffers the DRAM data pins from the channel through an advanced memory buffer (AMB) and uses point-to-point links with serial signaling to eliminate the stub bus. This serial bus architecture allows DIMM modules to be connected in series to allow a throughput upwards of 8.2 Gbs with a DDR2-800, for example. The serial signaling is similar to PCI-Express and, like PCI-Express, restricts the distance at which the main memory modules are located from the processor chip.
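The recasting of a parallel word into serially transferred packets, and its reconstruction at the receiving end, can be sketched as follows. The packet layout (a sequence number plus a 16-bit payload), the 64-bit word size and the function names packetize and reassemble are assumptions made for illustration and do not correspond to any particular serial bus standard.

```python
def packetize(word, word_bits=64, payload_bits=16):
    """Split a parallel word into (sequence, payload) packets,
    most significant slice first. The layout is hypothetical."""
    if word_bits % payload_bits:
        raise ValueError("payload size must divide the word size")
    if not 0 <= word < (1 << word_bits):
        raise ValueError("word does not fit in word_bits")
    count = word_bits // payload_bits
    mask = (1 << payload_bits) - 1
    return [
        (seq, (word >> (payload_bits * (count - 1 - seq))) & mask)
        for seq in range(count)
    ]


def reassemble(packets, payload_bits=16):
    """Rebuild the word at the receiver, using the sequence numbers
    to restore order regardless of arrival order."""
    word = 0
    for _, payload in sorted(packets):
        word = (word << payload_bits) | payload
    return word


word = 0x0123456789ABCDEF
packets = packetize(word)
# Deliver the packets out of order to show that reconstruction still succeeds.
received = reassemble(list(reversed(packets)))
print(hex(received))
```

Carrying a sequence number in each packet is one simple way to tolerate reordering; a real link would typically also add framing and error detection, which this sketch omits.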
The migration from parallel to serial interfaces among components in a computing architecture is not unique to the processor/memory interface. Serial interfaces have also become the standard for almost all I/O communication channels. Industry standard I/O protocols, such as RapidIO, Infiniband, Fibre Channel and Gigabit Ethernet, can deliver I/O communications at rates of several gigabits per second.
While the speeds of a serial I/O protocol theoretically could approach the speeds needed for the processor/memory interface, these serial I/O communication protocols generally have larger packet and address sizes that are better suited for accessing large amounts of data stored on disk or over a network. The larger packet and address sizes result in an increased communication overhead penalty. In addition, there are different kinds of transmission blocking and memory contention concerns for I/O communications than for processor-to-memory interfaces.
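The overhead penalty can be made concrete by comparing the useful payload against the total number of bytes placed on the link. In the sketch below, the 64-byte transfer size (roughly one cache line), the 38-byte and 4-byte overhead figures and the function name link_efficiency are illustrative assumptions and are not taken from any of the protocols named above.

```python
def link_efficiency(payload_bytes, overhead_bytes):
    """Fraction of transmitted bytes that carry useful payload."""
    if payload_bytes <= 0 or overhead_bytes < 0:
        raise ValueError("payload must be positive and overhead non-negative")
    return payload_bytes / (payload_bytes + overhead_bytes)


CACHE_LINE_BYTES = 64      # assumption: size of a typical memory transfer
IO_OVERHEAD_BYTES = 38     # assumption: framing overhead of an I/O-style packet
LIGHT_OVERHEAD_BYTES = 4   # assumption: overhead of a lightweight memory link

io_style = link_efficiency(CACHE_LINE_BYTES, IO_OVERHEAD_BYTES)
lightweight = link_efficiency(CACHE_LINE_BYTES, LIGHT_OVERHEAD_BYTES)
bulk = link_efficiency(1500, IO_OVERHEAD_BYTES)

print(io_style, lightweight, bulk)
```

With these numbers the same fixed overhead that is nearly negligible for a large bulk transfer consumes a substantial share of the link when the payload is a single small memory transfer.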
U.S. Pub. App. No. 20050091304 discloses a control system for a telecommunication portal that includes a modular chassis having an Ethernet backplane and a platform management bus which houses at least one application module, at least one functional module, and a portal executive. In this patent application, a 1000BaseT (Gigabit Ethernet) backplane provides a packet-switched network wherein each of the connected modules acts as an individual node on a network in contrast to a conventional parallel bus connection such as a PCI bus.
U.S. Pub. App. No. 20060123021 discloses a hierarchical packaging arrangement for electronic equipment that utilizes an Advanced Telecommunication Computing Architecture (ATCA) arrangement of daughter boards in the form of Advanced Mezzanine Cards (AMCs) that are interconnected with a hierarchical packet-based interconnection fabric such as Ethernet, RapidIO, PCI Express or Infiniband. In this arrangement, the AMCs in each local cube are connected in a hierarchical configuration by a first, lower speed interface such as Gigabit Ethernet for connections within the local cube and by a second, higher speed interface such as 10G Ethernet for connections among cubes.
The problems of Ethernet switched backplane architectures in terms of latency, flow control, congestion management and quality of service are well known and described, for example, by Lee, “Computation and Communication Systems Need Advanced Switching,” Embedded Intel Solutions, Winter 2005. These issues have generally discouraged the adoption of serial I/O protocols for communications between processors and memory that would typically be limited to the smaller physical dimensions of a circuit board or a computer or communication rack or cabinet having multiple cards/blades interconnected by a backplane. Instead, the trend has been to increase the capacity of individual chips and the size of each of the server blades in order to accommodate more processors and memory on a single chip or circuit board, thereby reducing the need for processor and memory interconnection that must be mediated across the backplane.
As processor speeds, memory speeds and network speeds continue to increase, and as the external I/O is increasingly capable of delivering data at rates exceeding gigabit speeds, the current architectures for arranging the subsystems within a computing and communication architecture are no longer efficient. There is therefore a need for a computing and communication architecture that is not constrained by the current limitations and can provide a solution that is compatible with industry configuration standards and is scalable to match the speed and capacity requirements of a converged computing environment internal, as well as external, to the motherboards of the next generation computers and communications equipment.