A modern computer system comprises a microprocessor, memory, and peripheral computer resources, e.g., a monitor, a keyboard, and software programs. The microprocessor comprises arithmetic, logic, and control circuitry that interpret and execute instructions from a computer program. FIG. 1 shows a prior art diagram of an example of a computer's microprocessor (20) that has, among other components, a central processing unit (“CPU”) (22), a memory controller (24), also known as a load/store unit, and on-board, or level 1, cache memory (26). The microprocessor (20) is connected to external, or level 2, cache memory (28), and is also connected to the main memory (30) of the computer system. Cache memory is a region of fast memory that holds copies of data.
One goal of the computer system is to execute instructions provided by the computer's users and software programs. The execution of instructions is carried out by the CPU (22). Data needed by the CPU (22) to carry out an instruction are fetched by the memory controller (24) and loaded into the internal registers (32) of the CPU (22). Upon command from the CPU (22), the memory controller searches for data first in the fast on-board cache memory (26), then in the slower external cache memory (28), and if those searches turn out unsuccessful, then the memory controller (24) retrieves the data from the slowest form of memory, the main memory (30).
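The search order described above can be illustrated with a minimal sketch. The function name `fetch`, the use of dictionaries to stand in for the cache memories (26), (28) and the main memory (30), and the policy of copying data into the faster caches after a miss are assumptions made for illustration only; they are not taken from FIG. 1.

```python
def fetch(address, l1_cache, l2_cache, main_memory):
    """Search on-board cache, then external cache, then main memory.

    Returns (value, source), where source names the level that supplied
    the data. Dictionaries model each memory (illustrative assumption).
    """
    if address in l1_cache:            # fastest: on-board (level 1) cache
        return l1_cache[address], "L1"
    if address in l2_cache:            # slower: external (level 2) cache
        value = l2_cache[address]
        l1_cache[address] = value      # assumed fill policy, not from the source
        return value, "L2"
    value = main_memory[address]       # slowest: main memory
    l2_cache[address] = value          # assumed fill policy, not from the source
    l1_cache[address] = value
    return value, "main"
```

In this sketch, a first request for an address that is in neither cache is served from main memory, and a repeated request for the same address is then served from the on-board cache.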
The time between a CPU request for data and when the data is retrieved and available for use by the CPU is referred to as the “latency” of the system. If requested data is found in cache memory, i.e., a data “hit” occurs, then the requested data can be accessed at the speed of the cache memory and the overall latency of the system is decreased. On the other hand, if requested data is not found in the cache memory, i.e., a data “miss” occurs, then the data must be retrieved from the relatively slow main memory, and the overall latency of the system is increased.
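The effect of hits and misses on overall latency can be approximated with a simple weighted average. The following sketch assumes a single cache level, a fixed cost per hit, and a fixed cost per miss; the function name `average_latency` and the cycle counts in the usage lines are hypothetical values chosen for illustration.

```python
def average_latency(hit_rate, cache_cycles, memory_cycles):
    """Average cycles per data request under a simplified one-level model.

    A hit costs cache_cycles; a miss costs memory_cycles (assumption:
    the miss cost already includes the failed cache lookup).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Hypothetical costs: 2 cycles for a cache hit, 100 cycles for main memory.
mostly_hits = average_latency(0.5, 2, 100)     # 51.0 cycles on average
all_misses = average_latency(0.0, 2, 100)      # 100.0 cycles on average
```

Under these assumed costs, every increase in the hit rate lowers the average latency, which is the behavior described above.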
Because the CPU runs at significantly greater speeds than either cache memory or main memory, a significant portion of the CPU's time is spent waiting for data to be retrieved from one of the various forms of memory. In order to combat this performance-inhibiting phenomenon, various techniques have been employed to increase computing performance and efficiency. For example, many processors now incorporate superscalar architecture. Superscalar processors allow the simultaneous execution of multiple instructions. Additionally, processors now fetch multiple instructions, via an instruction fetch unit and an instruction scheduler, instead of executing one instruction and waiting for the next instruction to be fetched. A program sequence of instructions is referred to as a “process thread.”
Another technique that has been employed to increase computer performance involves combining multiple processors into a single system. Each processor is capable of executing a particular sequence of instructions in a program or program segment. This technique is often referred to as “horizontal” multi-threading.
An alternative processor performance enhancing technique is “vertical” multi-threading. Vertical multi-threading is a technique in which a single processing pathway, known as a “pipeline,” is used by more than one process thread. A capacity for vertical multi-threading exists because a process thread is not always actively executing. A process may be in a wait state awaiting either data or an event, such as a trap or interrupt. Because some applications have frequent cache misses, which result in heavy clock penalties, i.e., increased latency, a most desirable condition is that a second process thread should utilize the processor while a first process thread is waiting for the arrival of data or an event.
For example, in data processing applications with frequent cache misses, data is accessed through a secondary memory storage structure, often the main memory, resulting in heavy clock penalties, i.e., higher latency. During data accessing delays, a beneficial usage of the pipeline is to allow a second process thread to execute. The second process thread can take over the idle pipeline by saving all useful states of the first process thread in some location and assigning new states to the new process thread. When the second process thread becomes idle and the first process thread returns to processing, saved states are returned to the pipeline and the pipeline resumes its execution of the first process thread.
Vertical multi-threading requires that states for the first process thread be saved in some location before execution of the second process thread. Additionally, states for the second process thread must be saved in some location before returning to the execution of the first process thread.
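The save-and-restore requirement can be sketched as follows. The class name `Pipeline`, the dictionary used as the save location, and the representation of a thread's states as a dictionary are assumptions for illustration; the source does not specify where or how the states are stored.

```python
class Pipeline:
    """Toy model of one pipeline shared by several process threads."""

    def __init__(self):
        self.active_thread = None   # identifier of the running thread
        self.state = None           # states currently held in the pipeline
        self.saved = {}             # save location (illustrative assumption)

    def switch_to(self, thread_id, new_state=None):
        """Save the running thread's states, then load those of thread_id.

        If thread_id has saved states they are restored; otherwise
        new_state is assigned to the incoming thread.
        """
        if self.active_thread is not None:
            self.saved[self.active_thread] = self.state
        self.state = self.saved.pop(thread_id, new_state)
        self.active_thread = thread_id


pipe = Pipeline()
pipe.switch_to("T1", {"pc": 100})   # first thread starts executing
pipe.state["pc"] = 104              # first thread makes progress, then stalls
pipe.switch_to("T2", {"pc": 500})   # second thread takes over the idle pipeline
pipe.switch_to("T1")                # first thread resumes from its saved states
```

In this sketch the first thread resumes with the states it held when it stalled, while the states of the second thread remain in the save location until that thread is scheduled again.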
A vertical multi-threading processor includes one or more execution pipelines that are formed from a plurality of multiple-bit flip-flops (discussed below). The flip-flops contain multiple storage bits. These individual bits of the flip-flops store data for one of the many process threads that are in a pipeline at any given time. When an executing process thread halts due to a stall condition, such as a cache miss, an active bit of the multiple-bit flip-flop at that stage is correspondingly stalled, removed from activity on the pipeline, and a previously inactive bit becomes active for executing a previously inactive process thread. Vertical multi-threading is thus attained by inserting multiple-bit flip-flops at sequential stages in a pipeline.
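A behavioral sketch of one such pipeline stage may help. The class name `MultiBitFlipFlop` and its methods are hypothetical, and the model abstracts away all timing; it shows only that one bit is active at a time while the remaining bits retain the data of the stalled threads.

```python
class MultiBitFlipFlop:
    """Behavioral model of one pipeline stage with one storage bit per thread."""

    def __init__(self, num_threads):
        self.bits = [0] * num_threads   # one storage bit per process thread
        self.active = 0                 # index of the currently active bit

    def clock(self, data):
        """Latch data into the active bit; inactive bits hold their values."""
        self.bits[self.active] = data

    def switch(self, thread):
        """Make a previously inactive bit active, e.g. after a stall."""
        if not 0 <= thread < len(self.bits):
            raise IndexError("no storage bit for that thread")
        self.active = thread

    def output(self):
        """Value presented to the next stage: the active thread's bit."""
        return self.bits[self.active]


stage = MultiBitFlipFlop(num_threads=2)
stage.clock(1)     # thread 0 latches a 1
stage.switch(1)    # thread 0 stalls; the bit for thread 1 becomes active
stage.clock(0)     # thread 1 latches a 0; the bit for thread 0 is untouched
stage.switch(0)    # thread 0 resumes and finds its bit still set to 1
```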
Referring to FIG. 2, a prior art multiple-bit flip-flop is shown. The multiple-bit flip-flop (34) is an integrated circuit device that has two representative blocks: a header block (also known as the driver block) (36) and a data storage block (38). The header block (36) is coupled to the data storage block (38) and drives, i.e., controls, the data storage block (38). The data storage block (38) comprises a plurality of storage elements that hold data for multiple process threads.
The input signals to the header block (36) include a clock (“L4CLK”) signal, a scan enable (“SE”) signal, and a clock enable (“CE”) signal. The header block (36) outputs a scan clock (“SCLK”) signal, an inverse scan clock (“SCLK_L”) signal, a pulse clock (“PCLK”) signal, and an inverse pulse clock (“PCLK_L”) signal. The output signals from the header block (36) serve as inputs to the data storage block (38), in addition to a data (“DATA”) signal and a scan chain in (“SI”) signal, both of which come from circuitry external to the multiple-bit flip-flop (34).
L4CLK is a timing signal that is generated from a CPU clock frequency. L4CLK is provided to be used as a time basis for the header block (36) in generating different timing signals to the data storage block (38). SE, the scan enable signal, is used by the header block (36) to determine when the multiple-bit flip-flop (34) should enter into a scan mode. The scan mode is necessary when the contents of the data within the data storage block (38) need to be scanned. When SE is asserted, the header block (36) pulses SCLK and SCLK_L to indicate to the data storage block (38) to select the SI input and scan data using SCLK and SCLK_L as time references.
CE, the clock enable signal, is used by the header block (36) to determine when the multiple-bit flip-flop (34) should operate in normal (non-scan) mode. When CE is asserted, the header block (36) pulses the PCLK and PCLK_L to indicate to the data storage block (38) to select the DATA input and input data using PCLK and PCLK_L as time references. The above discussion regarding the scan mode and normal mode operations of the multiple-bit flip-flop (34) indicates that SE and CE are mutually exclusive and that only one can be asserted at any given time.
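The scan mode and normal mode behavior can be summarized in a small behavioral sketch. The function names `header_block` and `storage_update` are hypothetical, each clock pair is reduced to a single boolean meaning "pulsed during this L4CLK cycle", and the inverse clocks SCLK_L and PCLK_L are omitted because they carry no additional information at this level of abstraction.

```python
def header_block(se, ce):
    """Return which clock pair the header block pulses this cycle.

    SE and CE are treated as mutually exclusive, as described above.
    """
    if se and ce:
        raise ValueError("SE and CE must not be asserted together")
    return {"SCLK": bool(se), "PCLK": bool(ce)}


def storage_update(stored, se, ce, data, si):
    """Next value of one storage element in the data storage block."""
    clocks = header_block(se, ce)
    if clocks["SCLK"]:      # scan mode: select the SI input
        return si
    if clocks["PCLK"]:      # normal mode: select the DATA input
        return data
    return stored           # neither enable asserted: hold the stored value
```

For example, `storage_update(0, se=False, ce=True, data=1, si=0)` takes the DATA input, while the same call with `se=True, ce=False` takes the SI input instead.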
In order to facilitate vertical multi-threading using the multiple-bit flip-flop (34) with the discussed header block (36) and data storage block (38), a mechanism is needed to convey to the stages of the processor when to switch from one process thread to another process thread. The header block (36) comprises circuitry that is capable of driving a storage element in the data storage block (38) when that storage element is selected by a switch signal, while data in the storage elements that are not selected are held in their respective storage elements. This switch signal is generated by a state machine and is routed to different stages of the processor. Consequently, the signal for process thread switching is hard-wired into the CPU. Hard-wiring an additional signal into the CPU requires that the layout of the existing CPU be modified to accommodate the additional signal wire (or connection).
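The role of the state machine can be sketched as follows. The class name `SwitchStateMachine`, the round-robin choice of the next thread, and the list standing in for the dedicated wire to each stage are assumptions for illustration; the source states only that a state machine generates the switch signal and that the signal is routed to the stages.

```python
class SwitchStateMachine:
    """Toy state machine that generates the thread switch signal."""

    def __init__(self, num_threads, num_stages):
        self.num_threads = num_threads
        self.selected = 0                       # thread currently selected
        self.stage_select = [0] * num_stages    # one routed signal per stage

    def step(self, stall):
        """Advance one cycle; on a stall, select the next thread.

        Round-robin selection is an assumption, not taken from the source.
        """
        if stall:
            self.selected = (self.selected + 1) % self.num_threads
        # Every stage receives the same hard-wired switch signal.
        self.stage_select = [self.selected] * len(self.stage_select)
        return self.selected


sm = SwitchStateMachine(num_threads=2, num_stages=3)
sm.step(stall=False)   # no stall: all stages keep driving thread 0
sm.step(stall=True)    # stall: all stages are told to switch to thread 1
```

Because each entry of `stage_select` corresponds to a physical connection in this model, adding the switch signal to an existing design means adding one such connection per stage, which is the layout cost noted above.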