The present invention relates to digital data processing systems, and in particular to high-speed latches used in register memory of digital computing devices.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing component number, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of circuits, and by various other techniques. However, designers can see that physical size reductions cannot continue indefinitely, and there are limits to their ability to continue to increase clock speeds of processors. Attention has therefore been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. However, one does not simply double a system's throughput by going from one processor to two. The introduction of multiple processors to a system creates numerous architectural problems. For example, the multiple processors will typically share the same main memory (although each processor may have its own cache). It is therefore necessary to devise mechanisms that avoid memory access conflicts, and assure that extra copies of data in caches are tracked in a coherent fashion. Furthermore, each processor puts additional demands on the other components of the system, such as storage, I/O, memory, and particularly, the communications buses that connect various components. As more processors are introduced, there is greater likelihood that processors will spend significant time waiting for some resource being used by another processor.
Without delving into further architectural complications of multiple processor systems, it can still be observed that there are many reasons to improve the speed of the individual CPU, whether a system uses multiple CPUs or a single CPU. If the CPU clock speed is given, it is possible to further increase the work done by the individual CPU, i.e., the number of operations executed per unit time, by increasing the average number of operations executed per clock cycle.
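The relationship stated above can be written as a one-line calculation. The following is only an illustrative sketch: the function name and the example figures are assumptions chosen for the illustration and do not describe any particular processor.

```python
def operations_per_second(clock_hz, ops_per_cycle):
    """Work done per unit time = clock rate x average operations per clock cycle."""
    return clock_hz * ops_per_cycle


# Illustrative figures only: the same 1 GHz clock with two different averages.
baseline = operations_per_second(1e9, 1.0)
improved = operations_per_second(1e9, 1.5)
```

With the clock speed held fixed, the only way to raise the result is to raise the average number of operations executed per cycle, which is the approach pursued in the remainder of this discussion.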
In order to boost CPU speed, it is common in high performance processor designs to employ instruction pipelining, as well as one or more levels of cache memory. Pipeline instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. Cache memories store frequently used and other data nearer the processor and allow instruction execution to continue, in most cases, without waiting the full access time of a main memory access.
Pipelines will stall under certain circumstances. An instruction that is dependent upon the results of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For instance, instructions dependent on a load/store instruction in which the necessary data is not in the cache, i.e., a cache miss, cannot be executed until the data becomes available in the cache. Maintaining the requisite data in the cache necessary for continued execution and sustaining a high hit ratio (i.e., the number of times the requested data was readily available in the cache compared to the total number of requests for data) is not trivial, especially for computations involving large data structures. A cache miss can cause the pipelines to stall for several cycles, and the total amount of memory latency will be severe if the data is not available most of the time. Although memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors is becoming increasingly large. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for resolution of cache misses.
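The effect of the hit ratio on memory latency can be illustrated with the commonly used average-access-time estimate (hit time plus miss rate times miss penalty). The sketch below is an assumption-laden illustration: the function names and the cycle counts are hypothetical and are not taken from any particular design.

```python
def hit_ratio(hits, requests):
    """Fraction of data requests that were satisfied from the cache."""
    return hits / requests


def average_access_cycles(ratio, hit_cycles, miss_penalty_cycles):
    """Average cycles per access: every access pays the hit time,
    and the fraction that misses also pays the miss penalty."""
    return hit_cycles + (1.0 - ratio) * miss_penalty_cycles


# Hypothetical figures: 1-cycle hit, 100-cycle miss penalty.
good = average_access_cycles(hit_ratio(95, 100), 1, 100)
poor = average_access_cycles(hit_ratio(50, 100), 1, 100)
```

Under these assumed figures, even a small drop in the hit ratio multiplies the average access time, which is why time spent resolving cache misses dominates when the gap between processor and main memory speed is large.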
Reducing the amount of time that the processor is idle waiting for certain events, such as re-filling a pipeline or retrieving data from memory, will increase the average number of operations per clock cycle. One architectural innovation directed to this problem is called “hardware multithreading” or simply “multithreading”. This technique involves concurrently maintaining the state of multiple executable sequences of instructions, called threads, within a single CPU. As a result, it is relatively simple and fast to switch threads.
The term “multithreading” as defined in the computer architecture community is not the same as the software use of the term. In the case of software, “multithreading” refers to one task being subdivided into multiple related threads. In the hardware definition, the threads being concurrently maintained in a processor are merely arbitrary sequences of instructions, which don't necessarily have any relationship with one another. Therefore the term “hardware multithreading” is often used to distinguish the two uses of the term. As used herein, “multithreading” will refer to hardware multithreading.
There are two basic forms of multithreading. In the more traditional form, sometimes called “fine-grained multithreading”, the processor executes N threads concurrently by interleaving execution on a regular basis, such as interleaving cycle-by-cycle. This creates a gap in time between the execution of each instruction within a single thread, which removes the need for the processor to wait for certain short term latency events, such as re-filling an instruction pipeline. In the second form of multithreading, sometimes called “coarse-grained multithreading”, multiple instructions in a single thread are sequentially executed until the processor encounters some longer term latency event, such as a cache miss, which triggers a switch to another thread.
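The difference between the two forms can be seen in the order in which instructions from different threads are issued. The following is a simplified sketch only: the function names, the representation of a thread as a list of instruction labels, and the `is_miss` predicate are assumptions made for illustration, and the model ignores the actual duration of the latency event.

```python
def fine_grained(threads):
    """Issue one instruction per thread in turn, round-robin."""
    order = []
    pcs = [0] * len(threads)                 # next instruction index per thread
    remaining = sum(len(t) for t in threads)
    t = 0
    while remaining:
        if pcs[t] < len(threads[t]):
            order.append((t, threads[t][pcs[t]]))
            pcs[t] += 1
            remaining -= 1
        t = (t + 1) % len(threads)           # switch threads on every step
    return order


def coarse_grained(threads, is_miss):
    """Stay on one thread until it issues an instruction that is_miss flags
    as a long-latency event (or it runs out), then switch to the next thread."""
    order = []
    pcs = [0] * len(threads)
    remaining = sum(len(t) for t in threads)
    t = 0
    while remaining:
        if pcs[t] >= len(threads[t]):        # thread finished: move on
            t = (t + 1) % len(threads)
            continue
        instr = threads[t][pcs[t]]
        order.append((t, instr))
        pcs[t] += 1
        remaining -= 1
        if is_miss(instr):                   # long-latency event triggers a switch
            t = (t + 1) % len(threads)
    return order


# Hypothetical two-thread workload; "a1_miss" stands for a load that misses the cache.
threads = [["a0", "a1_miss", "a2"], ["b0", "b1"]]
fine = fine_grained(threads)
coarse = coarse_grained(threads, lambda i: i.endswith("_miss"))
```

In the fine-grained order the two threads alternate on every step, whereas in the coarse-grained order thread 0 runs until its flagged instruction and only then yields to thread 1.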
Like any innovation, multithreading comes with a price. Typically, multithreading involves replicating the processor registers for each thread in order to maintain the state of multiple threads. For instance, for a processor implementing the architecture sold under the trade name PowerPC™ to perform multithreading, it will generally be necessary to replicate the following registers for each thread: general purpose registers, floating point registers, condition registers, floating point status and control register, count register, link register, exception register, save/restore registers, and special purpose registers. Additionally, the special buffers, such as a segment lookaside buffer, can be replicated or each entry can be tagged with the thread number (or alternatively, be flushed on every thread switch). Some branch prediction mechanisms, e.g., the correlation register and the return stack, should also be replicated.
The replication of so many registers consumes a significant amount of chip area. Since chip area is typically in great demand, the hardware designer must face difficult choices. One can reduce cache sizes, reduce the number of general purpose registers available to each thread, or make other significant concessions, but none of these choices is desirable. There is a need for an improved method of dealing with the proliferation of registers which accompanies multithreading.
It is therefore an object of the present invention to provide an improved multithreaded processor.
Another object of this invention is to provide an improved master-slave latch circuit for supporting hardware multithreading operation of a digital data computing device.
Another object of this invention is to reduce the size and complexity of latch circuitry for supporting hardware multithreading operation of a digital data computing device.
In a digital processor supporting hardware multithreading, a master-slave latch circuit stores information for multiple threads. The basic cell contains multiple master elements, each corresponding to a respective thread, selection logic coupled to the master elements for selecting a single one of the master outputs, and a single slave element coupled to the selection logic.
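The organization of the basic cell can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not a description of the circuit itself: the class name, the method names, and the split into separate master and slave capture steps are hypothetical, and no timing or electrical behavior is modeled.

```python
class MultithreadLatchCell:
    """Behavioral model: one master element per thread, selection logic,
    and a single slave element shared by all threads."""

    def __init__(self, num_threads):
        self.masters = [0] * num_threads   # one stored bit per thread
        self.slave = 0                     # single shared slave element

    def capture_master(self, thread, bit):
        # Assumed master phase: only the addressed thread's master takes the input.
        self.masters[thread] = bit & 1

    def capture_slave(self, active_thread):
        # Assumed slave phase: the selection logic passes exactly one
        # master output to the slave, which then drives the cell output.
        self.slave = self.masters[active_thread]
        return self.slave


# Two threads store different bits in the same cell.
cell = MultithreadLatchCell(num_threads=2)
cell.capture_master(0, 1)
cell.capture_master(1, 0)
out_thread0 = cell.capture_slave(0)
out_thread1 = cell.capture_slave(1)
```

The point of the model is that switching the active thread changes only which master the selection logic passes through; the per-thread bits remain in their masters, and only one slave is needed regardless of the number of threads.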
In the preferred embodiment, the circuit supports operation in a scan mode for testing purposes. In scan mode, cells are paired. One cell of each pair contains one or more elements which normally function as master elements, but which may also function as slave elements during scan mode operation. These dual function elements are coupled to master elements of the other cell of the pair. When operating in scan mode using this arrangement, the number of master elements in the pair of cells equals the number of slave elements, even though the number of master elements exceeds the number of slave elements during normal operation. This permits data to be successively scanned through all elements of the circuit, ensuring thorough testing.
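The element count that makes the pairing work can be checked with simple arithmetic. The sketch below assumes that each cell of a pair has the same number of master elements and exactly one slave element, and that every dual function element sits in one cell of the pair; the function name and return layout are hypothetical.

```python
def scan_mode_roles(masters_per_cell):
    """Return (dual_function_elements, scan_masters, scan_slaves) for one cell pair."""
    total_elements = 2 * masters_per_cell + 2   # both cells: masters plus one slave each
    scan_slaves = total_elements // 2           # scan mode pairs every master with a slave
    scan_masters = total_elements - scan_slaves
    dual_function = scan_slaves - 2             # slaves needed beyond the two dedicated ones
    return dual_function, scan_masters, scan_slaves


# Two threads per cell: 4 masters and 2 slaves in normal operation.
two_thread_pair = scan_mode_roles(2)
# Four threads per cell: 8 masters and 2 slaves in normal operation.
four_thread_pair = scan_mode_roles(4)
```

Under these assumptions, a pair of cells with N master elements each needs N - 1 elements that change role in scan mode, giving N + 1 masters and N + 1 slaves, so that every element lies on the scan path.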
In an alternative embodiment, elements function as in scan mode during a HOLD mode of operation, and a feedback loop controlled by a HOLD signal is added to each pair of master/slave elements. The feedback loop drives the master element with the value of the slave.