1. Field of the Invention
The present invention relates to the design of digital circuits. More specifically, the present invention relates to an apparatus and a method for sequencing memory operations to and from memory devices connected to an asynchronous switch fabric. Example memory devices are random access memories (RAMs) and last-in, first-out (LIFO) memories, also known as stack memories.
2. Related Art
It is often necessary in computing and communication equipment to send data from many sources to many destinations. This need appears in the central processing unit of computer systems where information may flow: from a register file to any one of a number of arithmetic or logical elements or to a memory controller; from one arithmetic element to another; or from an arithmetic element or memory controller to the register file. This need also appears in the input-output systems of computers where information must flow between and among various units, including processors, memories and secondary storage devices.
One common means for satisfying this need is known as a bus. A bus consists of a number of wires that extend between all communicating units. Each unit that wishes to send data places the data on the data bus so that any of the receiving units may receive it. Such bus structures are widely used both inside central computing units and in the input-output systems for computers.
There are a number of drawbacks to such a bus structure. First, each destination must attach some transistors to the bus in order to sense the state of the bus, and because there are many destinations, these sensing transistors collectively represent a large electrical load. Second, each source must attach driving transistors to the bus to drive data onto the bus, and even though all but one such drive transistor per bus wire is shut off when the bus changes state, the many inactive drive transistors connected to the bus also place considerable electrical load on the wires in the bus. Third, the bus wires themselves tend to be physically long and thus intrinsically represent further electrical load. The combined load on the bus wires from drivers, receivers and the wires themselves results in communication paths that are generally slow in comparison with other logical structures. Furthermore, only a single piece of information can flow per bus cycle, which limits the achievable communication rate.
One alternative to bus structure is the cross-bar switch. For each bit of communication, a cross-bar switch provides a grid of “horizontal” and “vertical” conductors, wherein each source drives a horizontal conductor and each destination senses the state of a vertical conductor. At each intersection of horizontal and vertical conductors in the cross-bar, a transistor or other switching element connects the conductors. This grid structure is repeated for as many bits as are to be transmitted at any one time.
The cross-bar switch has several advantages over the bus structure. First, each source drives only the capacitive load on the horizontal wire, which amounts to one receiving switch mechanism per destination. The many drivers that would have to be connected to each wire in a bus structure are here replaced by a single driver on the source wire. Because this driver drives only the source wire and its switches, it can be as large as desired, and can thus drive its load very quickly. Moreover, the wire for each destination has a load of only one sensing transistor, though it may be connected to many inactive intersection switches. Thus, the cross-bar switch divides the inherent loading in a simple bus into two parts, the horizontal wire pathway, and the vertical wire pathway, thereby speeding up the flow of information.
A further advantage of the cross-bar switch is that it can deliver several pieces of information concurrently. Several different sources can each deliver information to several different destinations at the same time provided no two sources and no two destinations are the same, because each such communication uses a different switch to connect its horizontal source wire to its vertical destination wire. That is, two or more switches may be active at any one time provided that no two switches in the same row or in the same column are active.
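The concurrency condition just stated can be sketched as a short check. This is only an illustration: the function name and the representation of a connection as a (source, destination) pair of integer indices are assumptions, not part of any described apparatus.

```python
def crossbar_conflicts(connections):
    """Return True if two active switches share a row (source) or a column (destination).

    `connections` is a list of (source, destination) index pairs, one per
    active switch. This representation is hypothetical.
    """
    sources = [s for s, _ in connections]
    destinations = [d for _, d in connections]
    return (len(set(sources)) != len(sources)
            or len(set(destinations)) != len(destinations))


# Three transfers on distinct rows and distinct columns may proceed together.
print(crossbar_conflicts([(0, 2), (1, 0), (3, 1)]))  # False
# Two sources addressing the same destination column conflict.
print(crossbar_conflicts([(0, 2), (1, 2)]))          # True
```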
The disadvantage of the cross-bar switch lies in its large number of switching transistors. While each bit of the bus structure has only one drive element per source and one receiving element per destination, the number of switch points in a cross-bar switch is the product of the number of sources and the number of destinations. Not only do these many switch points require chip area and consume power, but also they require control information. The difficulty of controlling so many switches turns out to be a disadvantage in implementation.
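The difference in element count can be made concrete with a small calculation. The sizes used below (64 sources and 32 destinations) are illustrative values chosen for this sketch, and the function names are hypothetical; the counts are per bit of communication.

```python
def bus_elements(sources, destinations):
    """One drive element per source plus one receiving element per destination."""
    return sources + destinations


def crossbar_switch_points(sources, destinations):
    """One switch point at every source/destination intersection."""
    return sources * destinations


print(bus_elements(64, 32))            # 96
print(crossbar_switch_points(64, 32))  # 2048
```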
A second alternative to the bus structure is to use point-to-point wiring between each source and each destination. Point-to-point wiring is returning to more common use in modern systems because it simplifies the electrical properties of the transmission lines used. In a point-to-point system, each destination must be prepared to receive signals along transmission lines that begin at each source, so that the number of receivers at each destination equals the number of sources. Similarly, each source must be able to send information to each destination. Thus, the number of sending and receiving mechanisms required is the same as the number of switch points in the cross-bar switch. The point-to-point mechanism can be thought of as a physical rearrangement of the cross-bar switches, wherein the horizontal and vertical wires in the cross-bar have become very short, and each switch at an intersection is replaced by a transmission line running from one source to one destination.
The point-to-point mechanism can be very fast. However, like the cross-bar it suffers from the need for a great deal of control information. Moreover, it is generally hard to find space for the large number of transmission lines required.
A third alternative to simple busses is to use some kind of network interconnection scheme. Ethernet, for example, is essentially a bus structure that uses itself for control and transmits data serially. Other networks, including those with complex computer-controlled switches, are well known and widely used. Such switches appear, for example, in the Internet. Generally, however, their control is very complex and their throughput is much less than that of an equivalent bus structure.
In an effort to overcome these problems, designers have created a structure that provides high throughput through a tree-structured multiplexing-and-amplifying system (see the related application by inventors Ivan E. Sutherland, William S. Coates and Ian W. Jones, entitled “Switch Fabric For Asynchronously Transferring Data Within A Circuit,” having Ser. No. 09/685,009, and filing date of Oct. 5, 2000). Because the stray capacitance of any wire in commonly used circuitry (such as CMOS) can store data, it is possible to store many values in a multiplexer tree structure and additional values in an amplification tree structure. The invention in the related application uses this storage to permit several communications to proceed concurrently in different parts of the structure. In this related invention, a new communication can be launched as soon as the wires it requires are no longer needed for the previous communication.
Instead of using a single-level bus structure, one embodiment in the related application uses a multiple-level structure. Consider, for example, a single-level bus structure for 64 sources and 32 destinations. Each of the 64 sources must have suitable drive transistors that can put data onto the bus. Thus, the drive structure to the bus is, in effect, a multiplexer with 64 inputs. Similarly, each of the 32 destinations must have a sensing transistor connected to the bus so that any of them can accept data values from the bus. Thus, the output structure is, in effect, a 32-way fan-out from the bus to the 32 destinations.
In CMOS technology, multiplexers with many inputs can be broken into tree structures of multiplexers with fewer inputs. Although such tree structures of multiplexers contain more levels of logic than a single multiplexer, they can nevertheless be faster because each level of logic is simpler. In fact, in the book Logical Effort, by Ivan Sutherland, Bob Sproull and David Harris, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1999, chapter 11.4.1 teaches that in CMOS circuits the fastest multiplexing structure is a tree in which each level joins approximately four inputs. Thus, the 64-input multiplexer of the example might better be replaced with a three-level tree. The first level gathers groups of four sources together onto several short “level-1” busses; in the example there would be 64/4=16 such level-1 busses. The second level of 4-input multiplexers gathers together groups of four such level-1 busses into somewhat longer “level-2” busses; the example requires 16/4=4 such level-2 busses. Finally, a third level of 4-input multiplexers gathers these level-2 busses together into a single “level-3” bus, which need be only long enough to reach all of the inputs from the nearest part of the level-2 busses.
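The bus counts in this example follow from repeatedly dividing by the fan-in of each multiplexer. The following sketch reproduces that arithmetic; the function name and the use of ceiling division for input counts that are not powers of four are assumptions made for illustration.

```python
def mux_tree_levels(n_inputs, fan_in=4):
    """Return the number of busses at each level of a multiplexer tree.

    Each level groups up to `fan_in` inputs onto one bus. Ceiling division
    handles input counts that do not divide evenly (an assumption; the text
    only discusses the 64-input case).
    """
    levels = []
    remaining = n_inputs
    while remaining > 1:
        remaining = -(-remaining // fan_in)  # ceiling division
        levels.append(remaining)
    return levels


# 64 sources: 16 level-1 busses, 4 level-2 busses, 1 level-3 bus.
print(mux_tree_levels(64))  # [16, 4, 1]
```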
Furthermore, a series of amplifiers can be used to deliver a particular signal to many destinations. Such a set of amplifiers can easily be arranged into a tree structure, much like the multiplexer tree but in reverse. In the example of 32 destinations, the information on the level-3 bus might be amplified and sent to two level-4 busses. Four amplifiers on each such level-4 bus might amplify the signal again, delivering it to a total of eight level-5 busses. Again, four amplifiers on each level-5 bus might be used to amplify the signal, each delivering its output to one destination, for a total of 8×4=32 destinations. In spite of the fact that more stages of amplification are involved, such structures are faster than a single stage of amplification can be.
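The growth of the amplification tree can be tallied stage by stage. The sketch below assumes per-stage fan-outs of 2, 4 and 4, and further assumes that each final-stage amplifier output feeds a single destination so that the total matches the 32 destinations of the example; the function name is hypothetical.

```python
from itertools import accumulate
from operator import mul


def fanout_counts(fanouts):
    """Return the cumulative number of outputs after each amplification stage."""
    return list(accumulate(fanouts, mul))


# Trunk -> 2 level-4 busses -> 8 level-5 busses -> 32 amplifier outputs.
print(fanout_counts([2, 4, 4]))  # [2, 8, 32]
```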
These multi-level structures have an advantage of speed, but they require extra wires to accommodate the different bus levels. Thus, the design of such a structure is always a compromise between the desired speed and the space cost of extra wiring.
A further point must be made here: it requires energy to change the value on any wire in a CMOS system. Thus, always delivering information to all destinations consumes more power than delivering the same information only to its intended destination and leaving static the state of wires that do not participate in that particular communication. The invention in the related application takes advantage of this potential saving in power.
Returning to the example of 64 sources, at the same time that the level-2 bus delivers information to the level-3 bus, a new source can deliver information to the level-1 bus provided the new information is kept from overwriting the previous command data. By overlapping in time the actions of different levels, the structure can achieve higher data throughput rates. In fact, the throughput of such a structure is limited mainly by its ability to turn the multiplexers on and off quickly enough.
Furthermore, consecutive communications from the same source to the same destination can overlap in time. For example, as soon as the first has cleared the level-1 bus, the second may use that bus. Naturally, a small time gap between communications is required; in the limit, however, there may be as many communications underway as there are levels in the tree structures.
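The limit on overlapping communications can be illustrated with an idealized schedule. The model below is an assumption made purely for illustration: every level takes one time step, a new communication is launched on every step, and the small time gap between communications is ignored.

```python
def in_flight(n_transfers, n_levels):
    """Count communications underway at each time step of an idealized pipeline.

    Transfer i is assumed to occupy level j during time step i + j, so it is
    in flight during steps i .. i + n_levels - 1.
    """
    last_step = n_transfers + n_levels - 1
    counts = []
    for t in range(last_step):
        active = sum(1 for i in range(n_transfers) if 0 <= t - i < n_levels)
        counts.append(active)
    return counts


# Five transfers through a three-level tree: never more than three underway.
print(in_flight(5, 3))  # [1, 2, 3, 3, 3, 2, 1]
```

Under these assumptions the peak value equals the number of levels, which is the limit described above.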
Similarly, one can store information in the structure that amplifies and delivers data from the main bus to the destinations. Such an amplification structure consists of several levels of amplification, each fanning out to a next set of amplifiers and finally to the destinations themselves. Each such level can also serve as a place to store information. Thus, for example, one can overlap in time the delivery of a data item from the level-3 bus to the first level of amplification, the level-4 bus, while delivering the previously transmitted data item from the level-5 bus to its final destination.
A further advantage of the invention in the related application is that it can operate asynchronously in time. For example, a data element launched from a particular source to a particular destination can flow along a certain path through the multiplexing structure, through the highest-level bus—also known as the “trunk”—and thence through the amplifying structure to its destination. While it is in flight, some other data element launched from a different source and at an unrelated time may take its own route to its own particular destination. Two such communications will not interfere with each other except where they require a common communication path. The invention in the related application permits each to proceed as far as it can without interfering with others, dealing with such potential interference by controlling only the sequence in which the conflicting communication actions may use the common path.
Yet a further aspect of the invention in the related application involves automatically stalling the communication mechanism when a source is not ready to provide information or a destination is not ready to receive it. Because the interconnection structure contains storage at every level, actions already underway may proceed without waiting for a stalled source or destination irrelevant to their action. Delay in one source need not retard the communications emanating from a different source, nor need delay in accepting previous data at a destination retard delivery to other destinations, except, of course, as such other communications require the use of pathways common to the stalled communication.
Naturally, the control of such a switching structure with internal storage presents its own set of challenges. One part of the invention described in the related application involves a simple set of control structures, which, also configured hierarchically, asynchronously control the concurrent flow of data through the switching structure from source to destination. The “switching directive” for each communication action includes a “source address,” indicating the particular source for this communication and a “destination address,” indicating the particular destination that is to receive this data item. A stream of such address pairs thus controls the dynamic operation of the data-switching network of the invention in the related application.
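A switching directive and the path it selects can be sketched as follows. Everything in this sketch beyond the source address and destination address is an assumption made for illustration: the class and function names, the zero-based numbering, and the mapping of addresses to busses (groups of four per level, following the 64-source, 32-destination example) are hypothetical and are not the control structure of the related application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchingDirective:
    source: int       # source address, 0..63 in the running example
    destination: int  # destination address, 0..31 in the running example


def route(directive):
    """Return the (level, bus index) segments a communication would use."""
    return [
        ("level-1", directive.source // 4),        # one of 16 level-1 busses
        ("level-2", directive.source // 16),       # one of 4 level-2 busses
        ("level-3", 0),                            # the single trunk
        ("level-4", directive.destination // 16),  # one of 2 level-4 busses
        ("level-5", directive.destination // 4),   # one of 8 level-5 busses
    ]


def shared_segments(a, b):
    """Return the segments two communications have in common."""
    other = route(b)
    return [segment for segment in route(a) if segment in other]


first = SwitchingDirective(source=5, destination=30)
second = SwitchingDirective(source=60, destination=2)
print(shared_segments(first, second))  # [('level-3', 0)]
```

In this example the two communications share only the trunk, so under the stated assumptions only their order of use of the trunk would need to be controlled.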
The asynchronous nature of this switching structure is an advantage when addressing elements with first-in, first-out (FIFO) semantics. If a read instruction appears before data has been written to a FIFO element, the instruction simply stalls until the data has been written. Additionally, the reads and writes to a FIFO element will always be ordered in the sequence directed by the instruction stream.
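The stalling behavior of a FIFO element can be modeled in a few lines. This is a software sketch only, with hypothetical names; a stalled read is modeled by returning it to the back of a queue of pending operations and retrying it later.

```python
from collections import deque


class FifoElement:
    """Hypothetical model of a FIFO element attached to the switch fabric."""

    def __init__(self):
        self._items = deque()

    def write(self, value):
        self._items.append(value)

    def try_read(self):
        """Return (True, value), or (False, None) if the read must stall."""
        if not self._items:
            return (False, None)
        return (True, self._items.popleft())


def run(arrivals):
    """Apply operations in arrival order, retrying reads that stall."""
    fifo, results, pending, stalled = FifoElement(), [], deque(arrivals), 0
    while pending:
        op = pending.popleft()
        if op[0] == "write":
            fifo.write(op[1])
            stalled = 0
        else:
            ok, value = fifo.try_read()
            if ok:
                results.append(value)
                stalled = 0
            else:
                pending.append(op)  # the read stalls and is retried later
                stalled += 1
                if stalled >= len(pending):
                    raise RuntimeError("read can never complete: no write pending")
    return results


# The read returns the same data whether it arrives before or after the write.
print(run([("write", "a"), ("read",)]))  # ['a']
print(run([("read",), ("write", "a")]))  # ['a']
```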
However, reading from and writing to elements that do not preserve FIFO semantics, such as a random access memory (RAM) device or a device with last-in, first-out (LIFO) semantics such as a stack, presents a problem in this asynchronous architecture. The problem arises because the read and write ports of these devices are connected to different locations in the switch fabric (the read port of the device is connected as a data source for the switch fabric, while the write port of the device is connected to a destination address of the switch fabric), and the switch fabric does not preserve instruction order at these different locations.
This can cause what are known as read-after-write hazards and write-after-read hazards. For example, with a RAM device, a read following a write instruction to the same memory address might return the previous data rather than the newly written data. Similarly, if read/write instruction order is not preserved, then a write following a read instruction to the same memory address could cause the read to return the newly written data value rather than the previous data value in that memory location, as the instruction order indicated.
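Both hazards can be reproduced with a simple software model of a RAM. The sketch below is illustrative only: the function name, the tuple encoding of operations, and the memory contents are assumptions, and reordering is modeled simply by reversing the order in which two operations reach the device.

```python
def run_ram(ops, memory=None):
    """Apply reads and writes to a RAM model in the order they arrive.

    Each op is ("write", address, value) or ("read", address).
    Returns the list of values returned by the reads.
    """
    memory = dict(memory or {})
    reads = []
    for kind, address, *value in ops:
        if kind == "write":
            memory[address] = value[0]
        else:
            reads.append(memory.get(address))
    return reads


initial = {7: "old"}

# Read-after-write: the program writes address 7, then reads it.
raw_program = [("write", 7, "new"), ("read", 7)]
print(run_ram(raw_program, initial))        # ['new'] (intended)
print(run_ram(raw_program[::-1], initial))  # ['old'] (read arrived first)

# Write-after-read: the program reads address 7, then writes it.
war_program = [("read", 7), ("write", 7, "new")]
print(run_ram(war_program, initial))        # ['old'] (intended)
print(run_ram(war_program[::-1], initial))  # ['new'] (write arrived first)
```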
Consider, for example, that a push instruction has previously written data to the stack. While these data are on the stack, assume that a push instruction followed closely by a pop instruction is in the instruction stream. It is possible for the pop to arrive at the stack element prior to the associated push instruction, thereby popping the wrong data from the stack. Such non-deterministic behavior can be undesirable in many applications of the switch fabric.
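The stack scenario can be modeled in the same way. The names and data values below are hypothetical; the stack is assumed to already hold one item, "A", from the earlier push.

```python
def run_stack(ops, initial):
    """Apply pushes and pops to a stack model in the order they arrive.

    Each op is ("push", value) or ("pop",). Returns the popped values.
    """
    stack = list(initial)
    popped = []
    for op in ops:
        if op[0] == "push":
            stack.append(op[1])
        else:
            popped.append(stack.pop())
    return popped


program = [("push", "B"), ("pop",)]

# In instruction order, the pop returns the item just pushed.
print(run_stack(program, ["A"]))        # ['B']
# If the pop reaches the stack first, it returns the older item instead.
print(run_stack(program[::-1], ["A"]))  # ['A']
```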
What is needed is an apparatus and a method to preserve instruction order of reads and writes to memory devices connected to the asynchronous switch fabric.