The present invention relates to digital data processing, and in particular to the design of high-speed communication buses for linking processors, memory and other components of a computer system.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing component number, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of circuits, and by various other techniques. However, designers can see that physical size reductions cannot continue indefinitely, and there are limits to their ability to continue to increase clock speeds of processors. Attention has therefore been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. However, one does not simply double a system's throughput by going from one processor to two. The introduction of multiple processors to a system creates numerous architectural problems. For example, the multiple processors will typically share the same main memory (although each processor may have its own cache). It is therefore necessary to devise mechanisms that avoid memory access conflicts, and assure that extra copies of data in caches are tracked in a coherent fashion. Furthermore, each processor puts additional demands on the other components of the system such as storage, I/O, memory, and particularly, the communications buses that connect various components. As more processors are introduced, there is greater likelihood that processors will spend significant time waiting for some resource being used by another processor.
All of these issues and more are known by system designers, and have been addressed in one form or another. While perfect solutions are not available, improvements in this field continue to be made.
Of particular interest herein is the design of communications buses. In simple computer systems, all major components such as processor, memory, storage controllers, and I/O are connected on a single multi-drop communications bus. Physically, such a multi-drop bus is a common set of parallel conductors, and each component is connected to these conductors through logic drivers or gates. The architecture of such a bus permits any arbitrary component connected to the bus to send data to any other arbitrary component connected to the bus, although it is not necessarily the case that all possible combinations are actually used. Since only one component may send data at any time, the component sending data must first obtain control of the bus, a process known as arbitration. The bus typically has an address portion for specifying the receiving device(s), and a data portion for specifying the data being transferred. It may also have various control lines.
The clock speed at which a multi-drop bus operates is limited by the number of attached devices, their physical configuration with respect to one another, the speed at which individual devices operate, and other factors. For this reason, many computer systems have multiple buses. In particular, processors and memory may be coupled to a relatively high-speed bus, while storage and I/O devices may be coupled to a slower bus. Since processors and memory typically require a higher speed, and are physically close enough to support higher speed bus operation, isolation of processors and memory from the lower speed devices such as storage and I/O by using a special processor-memory bus supports bus operation at higher speed and improves system performance.
However, the demand for increased system throughput continues. It is desirable to increase the number of processors in a computer system to increase system throughput. However, the high-speed multi-drop processor-memory bus was intended for a relatively small number of attached components. As the number of processors attached to such a bus increases, it becomes difficult or impossible to operate the bus at the higher clock speeds necessary to support communication among the various components. Moreover, the simple creation of wider buses or of additional (parallel) buses is not always a practical solution. Wider or additional buses mean that each processor must have additional I/O pins, where the number of I/O pins is already extremely constrained.
Some designers have attempted to address this problem using hierarchical buses, in which each processor is assigned to a node, all processors within a node being on the same local bus coupled to a node controller, wherein the node controller handles communications with devices in other nodes through a separate remote bus. However, these designs require a great deal of complexity on the part of the node controller, with attendant cost and collateral issues.
There is a need for an alternative high-speed communication path architecture in a computer system for supporting communication among larger numbers of processors and memory.
It is, therefore, an object of the present invention to provide an enhanced multiprocessor computer system.
Another object of this invention is to provide an enhanced processor-to-memory communication path for a multiprocessor computer system.
Another object of this invention is to support an increased number of processors in a multiprocessor computer system.
Another object of this invention is to reduce the number of I/O pins and other hardware required to support communications in a multiprocessor system.
An internal communication network for supporting data communication among multiple processors and memory within a computer system comprises a command portion for transmitting addresses and commands, having a unidirectional input bus portion for transmitting commands to a central command repeater unit, and a unidirectional broadcast bus portion for broadcasting commands from the central command repeater unit. The input portion comprises a plurality of links running from different devices, wherein each link is less than the full width of the broadcast bus portion. A command is transmitted over the input portion in a plurality of bus cycles, and broadcast over the broadcast portion in a single bus cycle. Since multiple input links connect to the central command repeater, it is possible to keep the broadcast bus full notwithstanding the fact that multiple bus cycles are required to transmit an individual command on the input portion.
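The claim that several narrow input links can keep the wide broadcast bus full can be checked with a small cycle-level model. The following is a minimal sketch, not the patented design: it assumes every link always has a command waiting (a saturated load), that a central arbiter grants at most one new command per cycle in round-robin order, and that a granted command is broadcast in the cycle immediately after its last input cycle. The function name `broadcast_utilization` and its parameters are hypothetical.

```python
def broadcast_utilization(num_links, cycles_per_command, total_cycles):
    """Fraction of broadcast-bus cycles that carry a command.

    ASSUMPTIONS (illustrative only): saturated load, one grant per cycle,
    round-robin link selection, broadcast in the cycle right after the
    command's final input cycle.
    """
    link_free_at = [0] * num_links  # first cycle at which each link is idle
    broadcast_used = [False] * (total_cycles + cycles_per_command)
    next_link = 0

    for cycle in range(total_cycles):
        # Grant the first idle link, searching round-robin from next_link.
        for i in range(num_links):
            link = (next_link + i) % num_links
            if link_free_at[link] <= cycle:
                # The link is occupied for cycles_per_command cycles.
                link_free_at[link] = cycle + cycles_per_command
                # The command occupies one broadcast slot once fully received.
                broadcast_used[cycle + cycles_per_command] = True
                next_link = (link + 1) % num_links
                break

    # Skip the start-up cycles before the first command can be broadcast.
    window = broadcast_used[cycles_per_command:total_cycles]
    return sum(window) / len(window)


if __name__ == "__main__":
    for links in (1, 2, 4):
        print(links, broadcast_utilization(links, 2, 100))
```

Under these assumptions, a single half-width link (two input cycles per command) can fill only every other broadcast slot, while two or more such links, with their transmissions staggered by one cycle, fill every slot.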
In the preferred embodiment, the links are arranged hierarchically. A series of unidirectional links runs between processors and local address repeater units (ARPs), and between the ARPs and the central command repeater, called an address switch unit (ASW), all of these links being half-width. A data transfer command propagates from a requesting device to its local ARP, to the ASW, requiring two bus cycles for each stage. From the ASW, the command is broadcast to all component devices on the network in a single bus cycle by transmitting to all ARPs or directly attached memory on a separate set of full-width unidirectional links. The ARPs then repeat the transmission to all attached processors or other units on another set of unidirectional links.
In the preferred embodiment, the ASW globally arbitrates the command bus. A request by a processor to transmit an address to the ARP must be granted first by the ASW. Once granted, the command will propagate in a pre-defined number of clock cycles to the ASW through the ARP, and thus addresses are not buffered in the ARP (although they are held in a register for a small number of cycles during re-propagation). The command is then broadcast, again at pre-defined bus cycles from initial bus grant.
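Because every stage has a fixed latency, the broadcast cycle of a command is a fixed offset from its grant cycle, which is why the repeaters need no buffering. The sketch below illustrates that reasoning. The two-cycle link latency comes from the description above; the `hold_cycles` parameter (extra cycles a command is held in a repeater register) and the function name `command_timeline` are assumptions made for illustration, since the text does not give exact register delays.

```python
LINK_CYCLES = 2  # half-width link: two bus cycles per stage (from the text)


def command_timeline(grant_cycle, hold_cycles=0):
    """Return the cycle at which a granted command reaches each stage.

    hold_cycles is an ASSUMED per-repeater register delay; the source says
    only that commands are held for "a small number of cycles".
    """
    # First cycle at which the ARP holds the complete command.
    arp_has_command = grant_cycle + LINK_CYCLES
    # First cycle at which the ASW holds the complete command.
    asw_has_command = arp_has_command + hold_cycles + LINK_CYCLES
    # Cycle in which the ASW drives the full-width broadcast links.
    broadcast_cycle = asw_has_command + hold_cycles
    return {
        "arp_has_command": arp_has_command,
        "asw_has_command": asw_has_command,
        "broadcast_cycle": broadcast_cycle,
    }


if __name__ == "__main__":
    # Grants issued in distinct cycles map to distinct broadcast cycles,
    # so two commands never contend for the same broadcast slot.
    slots = [command_timeline(g)["broadcast_cycle"] for g in range(10)]
    print(slots)
```

Since the mapping from grant cycle to broadcast cycle is a constant shift, an arbiter that issues at most one grant per cycle guarantees at most one command per broadcast cycle, without any queue in the ARP.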
In the preferred embodiment, addresses/commands and data are transmitted on essentially separate paths having different topologies, and at different times, and are arbitrated separately. The data portion of the network comprises a set of bidirectional links from the processors to a local data switch unit (DSW). The local DSW is further linked directly to memory via bi-directional links. In fact, the data portion of the network contains multiple independent data paths supporting multiple simultaneous data transfers, all of which are supported by a single logical hierarchical address bus portion. No address is transmitted with the data; rather, a tag is transmitted which identifies the command with which the data is associated.
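The tag mechanism can be pictured as a small lookup table kept by the requester: the address travels only on the command path, and returning data carries just a tag that is matched against the outstanding command. The following is a hypothetical sketch; the class name `OutstandingCommands`, its methods, and the tag and address values are illustrative and are not taken from the described embodiment.

```python
class OutstandingCommands:
    """Tracks commands awaiting data, keyed by tag (illustrative only)."""

    def __init__(self):
        self._by_tag = {}

    def issue(self, tag, address, command):
        """Record a command sent on the address path under the given tag."""
        if tag in self._by_tag:
            raise ValueError(f"tag {tag} is already in use")
        self._by_tag[tag] = (address, command)

    def data_arrived(self, tag, payload):
        """Match data from the data path to its command; the tag is then free."""
        address, command = self._by_tag.pop(tag)
        return address, command, payload


if __name__ == "__main__":
    table = OutstandingCommands()
    table.issue(tag=7, address=0x1000, command="read")
    # The data path delivers only (tag, payload); the address is looked up.
    print(table.data_arrived(7, b"\x01\x02\x03\x04"))
```

A design of this kind lets the data links stay narrow, because they never carry a full address, at the cost of each requester remembering which tags are in flight.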
Consistent with commonly understood terminology, the network is herein referred to collectively as a "bus" or "memory bus" (the latter to distinguish it from I/O buses). The portion of the network which transmits addresses and commands is sometimes referred to as the "address bus", while the data portion is referred to as the "data bus", and other portions of the network are similarly designated. It will be understood that the communications network described herein is not physically the same as a classical multi-drop bus, although it performs the analogous function.
The bus described herein has characteristics of a pipeline, wherein high clock speed (and high throughput) is achieved by staging bus operations over a number of cycles. The full-width broadcast bus is kept full because multiple commands can be received by the central repeater in an overlapping fashion on the multiple input links. The use of multiple hierarchical links enables the entire memory bus to operate at high clock speed. Furthermore, because the repeaters and ASW do not buffer data or determine destination through complex directories, the design is greatly simplified vis-a-vis typical prior art hierarchical designs. All of these considerations make it possible to support a relatively large number of processors at a high throughput.