Portions of the disclosure of this patent document contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to logical circuit design, and in particular the invention is directed to an asynchronous pulse logic circuit.
2. Background Art
VLSI (Very Large Scale Integration) system design is the process of implementing and realizing a system specification, the architecture, as an electronic circuit. We shall assume that the architecture is given to us and that the fabrication is not our concern. Longtime tradition divides the design process into two stages beyond computer architecture: implementation of the architecture by a micro-architecture and realization of the micro-architecture by a physical circuit design. The border is an artificial demarcation drawn for political purposes. The VLSI border traditionally serves to separate high-level logical reasoning from electronic-circuit design, tasks usually performed by different people, or at least by different software systems.
From Physics to Computer Science
It has slowly been realized that, as Carver Mead suggested, VLSI system design contains aspects of both software design and electrical engineering. In VLSI, the imagination of the mathematician and enthusiasm of the programmer finally meet with the pragmaticism of the engineer. c, we are told, is the speed limit; λ is the accuracy that we can build things with. But most of us would rather ignore the problems of others. So when we imagine and program a VLSI system, we do not allow c and λ to constrain our imagination or to damp our enthusiasm. We design our systems as if c and λ did not exist, and then we tell the engineer, “Implement this.” When the wafers return, we say that the poor performance is not our fault: we cannot be blamed for any failure to deal with c and λ since we left this task to our friend, the engineer.
Asynchronous digital design
Poor performance is usually unacceptable for a VLSI system. Optimists have long studied asynchronous design techniques, hoping that they have found at least a partial solution to the design problem. While it is true that proponents of asynchronous design like to claim that asynchronous circuits offer speed and power advantages, the main advantage of asynchronous design is more subtle than these: it is the designer's ability to easily compose circuits that operate at different points in the design space (characterized by speed, power, and design effort) without destroying the beneficial properties of any of the circuits.
A system is asynchronous if, in short, it does not use a clock for sequencing its actions. What unites all methods of asynchronous circuit design is that they all strive to make the speed of computing dependent on the operations that are being carried out. A slow operation is allowed to take longer than a fast one; the system continues to the next operation only once the previous one is completed. It is as if we could assemble a troika consisting of an Arabian, a Shetland pony, and a draught horse, without losing the useful qualities of the individual horses. If we should try this with real horses, the harness would act much as the clock does in a synchronous system and render the exercise pointless. But the asynchronous troika may be able to pull its load better than even a well-matched synchronous team, because the horses are not harnessed together by the clock: the draught horse does not have to keep up with the Arabian, and we do not have to feed the big horses if we only have need for the pony. By allowing us to divide up a system into smaller, more independent pieces, the asynchronous design technique simplifies the large-system design problem.
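The difference in sequencing can be illustrated with a small timing model. The sketch below is only an idealization under stated assumptions: the function names and the delay values are hypothetical, and handshake overhead in the asynchronous case is ignored. It compares a clocked sequence, where every step is allotted the delay of the slowest operation, with a sequence in which each operation takes only as long as it needs.

```python
from typing import Sequence


def synchronous_time(delays: Sequence[float]) -> float:
    """Total time when every step gets one clock period.

    The clock period must cover the slowest operation, so each of the
    len(delays) steps costs max(delays).
    """
    return len(delays) * max(delays) if delays else 0.0


def asynchronous_time(delays: Sequence[float]) -> float:
    """Total time when each operation takes only its own delay.

    Idealized assumption: no handshake overhead between operations.
    """
    return float(sum(delays))


# Hypothetical delays (arbitrary units): three fast operations, one slow.
op_delays = [1.0, 1.0, 4.0, 1.0]
print(synchronous_time(op_delays))   # slowest operation sets every step
print(asynchronous_time(op_delays))  # each operation sets its own pace
```

In a real circuit the handshake itself costs time, so the gap between the two figures would be smaller than this model suggests.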
Asynchronous design-styles
In a synchronous system, it is easy to know when a computation is done. When the clock edge arrives, we read out the results of the computation. If it is not finished by then, we say that the system is wrong and throw it on the trash heap. (Or - less violently - adjust the clock speed.) The computation must necessarily be done by the time the clock edge arrives, or else the synchronous model would not make sense.
In contrast, the chief difficulty in asynchronous design is knowing when a specific computation is done. If we encode data in the same way as in a synchronous system, e.g., using two's-complement numbers, and start an operation, and the number “5” should appear on the result bus of our asynchronous system, how are we to know that it signifies the result of the present computation, and not of the previous? Worse, might it not be the bitwise combination of the results of the previous and current computations?
Bundled-data design
The early asynchronous computers were designed in what we shall call the bundled-data style. Designing in this style, the designer assumes that he can build a delay that matches whatever the delay is of the computation that he is really interested in. This matched delay is used as an “alarm clock” that is started when ζ(x) is started and that rings when we can be sure that ζ(x) has been completely computed. The design style is called bundled data because the data travels in a “bundle” whose timing is governed by the control signal that we called the “alarm clock.” As one might guess, arranging for the matched delay is the Achilles' heel of the bundled-data style. If the delay is too short, the system will not work; if too long, then it will work slowly. Especially if computation times are data-dependent, the matched delay can easily become a designer's nightmare. The matched delay mechanism's working rests on a form of a priori knowledge of relative timing; we shall call making use of such knowledge a timing assumption.
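The trade-off in choosing a matched delay can be sketched as follows. This is a minimal model and not a description of any particular circuit: the names `MatchedDelayReport` and `evaluate_matched_delay` and all delay values are hypothetical. The matched delay is treated as correct only if it is at least as long as the slowest data-dependent computation, and the idle time measures how long, on average, valid data sits waiting for the alarm clock to ring.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MatchedDelayReport:
    correct: bool          # True if the delay covers every computation
    mean_idle_time: float  # average wait after the data is already valid


def evaluate_matched_delay(
    compute_delays: Sequence[float], matched_delay: float
) -> MatchedDelayReport:
    """Check one matched delay against a set of data-dependent delays."""
    if not compute_delays:
        raise ValueError("at least one computation delay is required")
    # Too short: some result would be read before it is complete.
    if matched_delay < max(compute_delays):
        return MatchedDelayReport(correct=False, mean_idle_time=0.0)
    # Long enough: the system works, but faster computations wait.
    idle = sum(matched_delay - d for d in compute_delays) / len(compute_delays)
    return MatchedDelayReport(correct=True, mean_idle_time=idle)


# Hypothetical data-dependent delays in arbitrary units.
delays = [2.0, 4.0, 6.0]
print(evaluate_matched_delay(delays, 5.0))  # too short for the slowest case
print(evaluate_matched_delay(delays, 6.0))  # works, with idle time on average
```

The wider the spread of computation times, the more time a safe matched delay wastes, which is the difficulty the paragraph above describes for data-dependent delays.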
Delay-insensitive design-styles
Originally conceived of at about the same time as the bundled-data design-style, delay-insensitive logic design attempts to use the data bits themselves for sequencing. By making every input transition (change in logic level) cause, either in itself or within a cohort of input transitions, an output transition or a detectable pattern of output transitions, we can at least make interfaces between processes delay-insensitive.
Systems built using the delay-insensitive philosophy range from the speed-independent investigated by D. E. Muller in the 1950s, which work under the assumption that all wire delays are negligible compared with the operator delays (which may be of any length), to the truly delay-insensitive, in which both operator delays and wire delays may be arbitrary. Martin has shown that, using a reasonable operator model, truly delay-insensitive systems are of little use; the work in our research group has mainly been within the quasi delay-insensitive (QDI) model, which is essentially Muller's speed-independent model with information added for distinguishing between wires whose delays must be short compared with the operator delays and wires whose delays may be arbitrarily long.
Assembling a working system out of QDI parts is almost frighteningly easy: start from a correct sequential program, decompose it into communicating processes, compile these processes into circuits, put the pieces together, and everything works. The chief advantage of this design method is that once we have decomposed, the design style is completely modular: there is no implicit use of global information (i.e., no clock), and the different parts can be designed independently.
There is one difficulty with QDI design: the requirement that the circuits work properly even if all operator delays were to vary unboundedly is a difficult one to satisfy; our satisfying it involves inserting much circuitry whose only purpose is checking for the occurrences of transitions that we may know would in any case take place. We should say that QDI systems must still be designed “within reason”: it is possible to make things not work by designing them very poorly; likewise, it still takes considerable work and skill to achieve good performance.
The present invention is a class of circuits named asynchronous pulse logic (APL) circuits, together with methods for designing such circuits.
The present invention is a design style that allows making use of limited amounts of timing information, i.e., limited use of timing assumptions, without destroying the most important, system-simplifying property of QDI design, namely that of the data's carrying its own timing information. The present invention does this by replacing two of the four-phase (return-to-zero) handshakes in a QDI circuit with pulses, thus breaking the timing dependencies that are the source of the performance problems of QDI circuits. One object of the present invention is that of improving the performance of modular asynchronous systems so much that it becomes possible to use asynchronous techniques for implementing large systems that perform well, yet are easy to design.
The APL scheme of the present invention takes a simple approach: we use a single-track external handshake, and we minimize the number of timing assumptions at the interfaces between processes; internally, in contrast, we design the circuits so that they generate predictably timed internal pulses. This is a separation of concerns: most of the variable parts of an APL circuit (i.e., those parts that vary depending on what CHP is being implemented) are arranged so that their delays do not matter much for the correct operation of the circuit; conversely, the pulse generator, whose internal delays do matter for the correct operation of the circuit, does not vary much from one circuit to the next.
The invariability of the pulse length brings a great benefit: since the pulse length varies so little (this is a different way of saying that the pulse repeater has a high length-gain), we commit only a minor infraction if we assume that the length is constant. The simplifying power of this assumption can hardly be overstated: once we have assumed that the pulse length is given, we need only verify that the circuitry generating the pulse and the circuitry latching the pulse work properly given that pulse length, and (this is the important part) we need not consider the effects of the inputs and outputs on the pulse length. This means that we can verify our timing properties locally. In effect, we have reduced a problem consisting of verifying the properties of the solution to a system of N coupled nonlinear equations into one involving N uncoupled nonlinear equations: we have gone from a task that seems insurmountable to one that is (in theory at least) easy.
One embodiment of the present invention is a class of circuit design called the single-track-handshake-asynchronous-pulse-logic (STAPL) circuit. STAPL serves as a new target for the compilation of CHP (Communicating Hardware Processes) programs. In STAPL circuits, the acknowledgement and data-reset phases of the four-phase handshake protocol are removed. In place of these two phases is pulse-generating circuitry that regulates the timing assumptions that ensure the proper functioning of the circuits without these two phases. STAPL circuits have requirements that set the maximum single-track hold time and minimum single-track setup time of nodes in the circuits and guarantee that the minimum single-track setup time is greater than or equal to the maximum single-track hold time. In one embodiment, a five-stage pulse generator is used to create a circuit that runs at 10 transitions per cycle.
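The setup and hold requirement stated above amounts to a single inequality, which can be checked mechanically. The sketch below is only an illustration: the function name `stapl_timing_ok`, the idea of collecting per-node timing figures into lists, and the numeric values are all assumptions made for the example, not part of the disclosed design flow.

```python
from typing import Sequence


def stapl_timing_ok(
    setup_times: Sequence[float], hold_times: Sequence[float]
) -> bool:
    """Return True if min(setup time) >= max(hold time) over all nodes.

    setup_times: single-track setup time observed at each node
    hold_times:  single-track hold time observed at each node
    Both are assumed to be in the same (arbitrary) time unit.
    """
    if not setup_times or not hold_times:
        raise ValueError("both timing lists must be non-empty")
    return min(setup_times) >= max(hold_times)


# Hypothetical per-node figures, in arbitrary units.
print(stapl_timing_ok([5.0, 6.0, 5.5], [3.0, 4.5]))  # constraint holds
print(stapl_timing_ok([5.0, 4.0, 5.5], [3.0, 4.5]))  # one setup time too short
```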
Embodiments of the present invention include essential circuitry such as arbiters, state-holding circuitry, buffers, and conditional and unconditional communication components, all implemented in accordance with the STAPL design style.
An object of the present invention is to improve the ease of design in circuits. In terms of ease of design, STAPL circuits are shown to be as easy to design as their QDI counterparts. STAPL circuits are more sensitive to sizing. It is not clear how important this is for the designer, since QDI sizing must also be verified before fabrication.
Another object of the present invention is improved circuit performance. In terms of speed, STAPL circuits are undoubtedly faster than QDI circuits. An embodiment of the present invention is a microprocessor, called the SPAM processor, which demonstrates the gain in performance that can be achieved by using STAPL circuits. The embodiment shows that something as large as a microprocessor can be designed with circuits that all run at 10 transitions per cycle, whereas it would be very difficult to do so in less than 18 with only QDI circuits. The reason for the difference is that STAPL circuits remove many waits that are necessary for maintaining QDI protocols and replace them with timing assumptions. Furthermore, STAPL circuits load their inputs less than do QDI circuits, because they generally do not need the completion circuitry that is needed in QDI circuits. The SPAM processor parts that we have simulated run three times as fast as similar parts from the MiniMIPS, a well-known prior art microprocessor.
In terms of energy consumption, STAPL circuits have most of the paths that are present in QDI circuits. This is because the logic is the same and much of the output completion is the same. There is no input completion, nor are there acknowledge wires, but on the other hand, the QDI circuits do not have pulse generators. One way to compare STAPL and QDI circuits is the Et² metric, where E is the energy consumed and t is the cycle time. This metric captures the fact that by varying the supply voltage of a CMOS circuit, any speed improvement can be traded for roughly twice that improvement in energy. A conservative estimate based on test circuits (E → 2E, t → t/3) shows an improvement in Et² for STAPL circuits by a factor of about five. To first order, the change in At² would be about the same, where A is the area of the circuit.
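The factor of about five can be checked with a short calculation. The sketch below reads the estimate as energy doubling while cycle time falls to one third; that reading, and the function name `et2_improvement`, are assumptions made for this illustration. Exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction


def et2_improvement(energy_factor: Fraction, time_factor: Fraction) -> Fraction:
    """Improvement in E*t^2 when E scales by energy_factor and t by time_factor.

    The new metric is energy_factor * time_factor**2 times the old one,
    so the improvement is the reciprocal of that product.
    """
    return 1 / (energy_factor * time_factor ** 2)


# Assumed reading of the estimate: E -> 2E, t -> t/3.
improvement = et2_improvement(Fraction(2), Fraction(1, 3))
print(improvement)         # 9/2
print(float(improvement))  # 4.5
```

Under these assumptions the new Et² is 2/9 of the old value, an improvement of 4.5, which is consistent with the stated factor of about five.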
Other advantages of STAPL circuits include a simplified solution to the charge-sharing problem and less loading from p-transistors (no input-completion circuitry in most cases, and even when it is present, it has no p-transistors).