1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to data load prediction in a multithreaded architecture.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term “clock cycle” refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term “instruction processing pipeline” is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Another aspect of microprocessors which may impact performance is related to system memory accesses. Instructions and data which are to be utilized by a microprocessor are typically stored on a fixed disk medium. Once a request is made by a user to execute a program, the program is loaded into the computer's system memory, which usually comprises dynamic random access memory (DRAM) devices. The processor then executes the program code by fetching an instruction from system memory, receiving the instruction over a system bus, performing the function dictated by the instruction, fetching the next instruction, and so on. In addition, data which is operated on by these instructions is ordinarily fetched from memory as well.
Generally, whenever system memory is accessed, there is a potential for delay between the time the request to memory is made (either to read or write data) and the time when the memory access is completed. This delay is referred to as “latency” and can limit the performance of the computer. There are many sources of latency. For example, operational constraints with respect to DRAM devices cause latency. Specifically, the speed of memory circuits is typically based upon two timing parameters. The first parameter is memory access time, which is the minimum time required by the memory circuit to set up a memory address and produce or capture data on or from the data bus. The second parameter is memory cycle time, which is the minimum time required between two consecutive accesses to a memory circuit. Upon accessing system memory, today's processors may have to wait 20 or more clock cycles before receiving the requested data and may be stalled in the meantime. In addition to the delays caused by access and cycle times, DRAM circuits also require periodic refresh cycles to protect the integrity of the stored data. These cycles may consume approximately 5 to 10% of the time available for memory accesses. If the DRAM circuit is not refreshed periodically, the data stored in the DRAM circuit will be lost. Thus, memory accesses may be halted while a refresh cycle is performed.
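By way of illustration only, the effect of the stall and refresh figures mentioned above can be worked through with simple arithmetic. The following sketch assumes a hypothetical processor that otherwise completes one instruction per clock cycle, that every memory access stalls for the full 20 cycles, and that refresh overhead is a fixed share of a time window. The function names and workload numbers are assumptions chosen for the example.

```python
def total_cycles(instructions, memory_accesses, stall_cycles=20):
    """Cycles to run a program when each memory access stalls the processor.

    Assumes one cycle per instruction plus a fixed stall per memory access.
    """
    return instructions + memory_accesses * stall_cycles


def cycles_available(window_cycles, refresh_percent):
    """Cycles left for memory accesses after refresh takes its share.

    refresh_percent is an integer percentage of the window (e.g. 5 or 10).
    """
    return window_cycles - window_cycles * refresh_percent // 100


# Hypothetical workload: 1000 instructions, 300 of which access system memory.
stalled = total_cycles(1000, 300)            # 1000 + 300 * 20 = 7000 cycles
best_case = cycles_available(1_000_000, 5)   # 950000 cycles remain
worst_case = cycles_available(1_000_000, 10) # 900000 cycles remain
```

Under these assumptions, the stalls account for 6000 of the 7000 cycles, which suggests why memory latency may dominate execution time when no cache is present.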
To expedite memory transfers, most computer systems today incorporate cache memory subsystems. Cache memory is a high-speed memory unit interposed between a slower system DRAM memory and a processor. Cache memory devices usually have speeds comparable to the speed of the processor and are much faster than system DRAM memory. The cache concept anticipates the likely reuse by the microprocessor of selected data in system memory by storing a copy of the selected data in the cache memory. When a read request for data is initiated by the processor, a cache controller determines whether the requested information resides in the cache memory. If the information is not in the cache, then the system memory is accessed for the data and a copy of the data may be written to the cache for possible subsequent use. If, however, the information resides in the cache, it is retrieved from the cache and given to the processor. Retrieving data from the cache is faster than retrieving data from system memory, where access latencies may be 100 times that of a first level cache.
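The hit and miss behavior described above may be sketched as a small simulation. The following is a minimal example only, assuming a direct-mapped organization with four 16-byte lines. The class name, the sizes, and the replacement behavior are assumptions made for illustration and are not taken from any particular cache design.

```python
class DirectMappedCache:
    """Toy direct-mapped cache; sizes and policy are illustrative assumptions."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # tag held by each line; None means empty
        self.hits = 0
        self.misses = 0

    def read(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = address // self.line_size   # which memory block holds the address
        index = block % self.num_lines      # the single line the block may occupy
        tag = block // self.num_lines       # distinguishes blocks sharing a line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: system memory would be accessed here, and a copy kept for reuse.
        self.misses += 1
        self.tags[index] = tag
        return False


cache = DirectMappedCache()
results = [cache.read(address) for address in (0, 4, 16, 0, 20)]
# results is [False, True, False, True, True]: the first touch of each
# block misses, and later reads of the same block hit.
```

In this sketch only the first access to each block goes to system memory, so three of the five reads are served from the cache.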
Because latencies between the cache and processor are much lower than those between system memory and the processor, increasing the proportion of time that requested data is present in the cache is highly desirable. One possible method is to predict what data will be required and prefetch the data to the cache. If the prediction is correct, then the data will be readily available and the system memory access latency will have been eliminated. However, if the prediction is incorrect, access must be made to system memory and a load latency is incurred.
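The benefit of a correct prediction, and the cost of an incorrect one, may be illustrated with a simple next-block prefetch sketch. The example assumes a 1-cycle cache hit and a 100-cycle system memory access (the ratio cited above), an unbounded cache, and a prefetch that always completes before the next access. The prediction rule, "the next sequential block will be needed," is a common textbook scheme chosen here for illustration and is not asserted to be the method of any particular processor.

```python
HIT_CYCLES = 1      # assumed first-level cache latency
MISS_CYCLES = 100   # assumed system memory latency (100x the cache latency)


def total_latency(blocks, prefetch_next=False):
    """Sum the access latency over a sequence of block numbers.

    With prefetch_next=True, every access also brings block + 1 into the
    cache, modeling a prediction that the next sequential block is needed.
    """
    cached = set()  # unbounded cache: an assumption to keep the sketch short
    cycles = 0
    for block in blocks:
        if block in cached:
            cycles += HIT_CYCLES
        else:
            cycles += MISS_CYCLES
            cached.add(block)
        if prefetch_next:
            cached.add(block + 1)
    return cycles


sequential = [0, 1, 2, 3]   # predictions are correct after the first access
scattered = [0, 7, 3, 12]   # predictions are always wrong

no_prefetch = total_latency(sequential)                         # 400 cycles
good_prediction = total_latency(sequential, prefetch_next=True) # 103 cycles
bad_prediction = total_latency(scattered, prefetch_next=True)   # 400 cycles
```

For the sequential pattern the prefetches hide all but the first miss, while for the scattered pattern every access still incurs the full load latency.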
An important feature of microprocessors is the degree to which they can take advantage of parallelism. Parallelism is the execution of instructions in parallel, rather than serially. Superscalar processors are able to identify and utilize fine-grained instruction-level parallelism by executing certain instructions in parallel. However, this type of parallelism is limited by data dependencies between instructions. By identifying higher levels of parallelism, computer systems may execute larger segments of code, or threads, in parallel. Because microprocessors and operating systems typically cannot identify the segments of code which are amenable to multithreaded execution, such segments are frequently identified by the application code itself. However, this requires either that the application programmer specifically code an application to take advantage of multithreading or that the compiler identify such threads.
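As an illustration of the programmer burden noted above, the following sketch shows application code that explicitly divides a computation into independent segments and assigns each to a thread. It uses the Python standard library's thread pool. The function names and the choice of four threads are assumptions for the example, and the sketch is intended to show only the explicit partitioning, not to suggest any particular speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one thread; it depends on no other chunk's data."""
    return sum(chunk)


def threaded_sum(values, num_threads=4):
    """Sum values by explicitly splitting them into per-thread chunks."""
    # The programmer, not the hardware, decides where the segments begin and end.
    size = max(1, (len(values) + num_threads - 1) // num_threads)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each chunk is handed to a worker thread; results return in order.
        return sum(pool.map(partial_sum, chunks))


total = threaded_sum(list(range(1000)))
```

The partitioning is valid here only because the chunks share no data dependencies, which is the property the programmer or compiler must establish before code can be divided into threads.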