1. Field of the Invention
This invention relates to the field of processors and more particularly to the use of a scout thread processor to prefetch data into caches for a main thread processor.
2. Description of the Related Art
Computer systems typically include, amongst other things, a memory system and one or more processors and/or execution units. The memory system serves as a repository of information, while a processor reads information from the memory system, operates on the information, and stores results to the memory system. A memory system can include one or more caches, main memory, and disk drives. Caches hold most recently accessed information and have low access latencies. Because main memory can have an access latency of 100 cycles or more, information is ideally stored in cache or in internal registers on the processor.
A cache is a small, fast memory located close to the processor that holds the most recently accessed code or data. A cache hit occurs when the processor finds requested content (data/instruction) in the cache. In the case of a cache miss, the processor needs to load the content from the main memory. The typical wait time for a processor, before it resumes processing, is between fifty and one hundred cycles. Access times can be even longer if the processor must contend with other devices for access to memory. The fraction of time the processor is idle due to cache misses can be significant, for example, as high as 80%.
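The idle-time figure can be illustrated with a simple cycle-accounting sketch. The function name and the workload numbers below (one cycle per instruction when every access hits, four misses per hundred instructions, a 100-cycle miss penalty) are assumptions chosen for illustration and do not come from any measured system.

```python
def stall_fraction(base_cpi, misses_per_instr, miss_penalty_cycles):
    """Fraction of all cycles that the processor spends waiting on cache misses."""
    stall_cycles = misses_per_instr * miss_penalty_cycles  # stall cycles per instruction
    return stall_cycles / (base_cpi + stall_cycles)

# Assumed workload: base CPI of 1.0, 0.04 misses per instruction, 100-cycle penalty.
print(round(stall_fraction(1.0, 0.04, 100), 2))
```

Under these assumed numbers the processor stalls for roughly four of every five cycles, which is consistent with the 80% figure given as an example.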
While memory access latency is a design concern for computer system designers, processing power typically is not. Advances in Very Large Scale Integration (VLSI) technology provide an increased number of transistors on a single die over older technologies. There is now enough space on integrated circuits to put more than one processor on a single chip. These chips with multiple processors are called chip multi-processors (CMPs). Alternatively, the additional space can be utilized by multi-threaded processors utilizing simultaneous multi-threading (SMT), wherein the multiple threads share pipeline resources. A parallelized program (one that contains multiple threads of execution) can take advantage of the CMP or SMT system to improve the performance of the program. A non-parallelized, single-threaded program has no easy way to utilize the extra processors on a CMP or SMT system and thus has a performance disadvantage.
Scout thread processing has been proposed as a technique to improve performance by reducing the occurrence of delays due to memory access latency. Scout thread processing utilizes the processing power of an otherwise idle processor. A scout thread can be executed on a processor several cycles ahead of a main thread that is executed on another processor or during a stall in the main thread. A processor that executes the scout thread is referred to as the scout thread processor. The main thread contains a sequence of instructions, typically from the executable file of the program. The scout thread contains a subset of the main thread's sequence of instructions. The scout thread does not include the entire set of main thread instructions, but includes only, for example, instructions that access memory and calculate addresses. Thus, scout thread processing brings data into the cache, resulting in the main thread processor having fewer cache misses and therefore shorter latencies. Even if scout thread execution is only a few cycles ahead of main thread execution, those few cycles improve the main thread execution time. The scout thread “warms up” the caches for the main thread, but otherwise has no visible side effect.
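The cache warm-up effect can be sketched with a small simulation. This is a simplified model under stated assumptions: the class `LRUCache`, the function `main_thread_misses`, the cache capacity, and the parameter `scout_lead` are all hypothetical names and values, and the model assumes that a line touched by the scout thread has fully arrived in the cache by the time the main thread requests it.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully associative cache with least-recently-used replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit; on a miss, load the line and evict the LRU line if full."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)
        return hit

def main_thread_misses(trace, capacity, scout_lead):
    """Count main-thread misses when a scout runs scout_lead accesses ahead (0 = no scout)."""
    cache = LRUCache(capacity)
    misses = 0
    for i, line in enumerate(trace):
        # The scout touches the line that the main thread will need later.
        if scout_lead > 0 and i + scout_lead < len(trace):
            cache.access(trace[i + scout_lead])
        if not cache.access(line):
            misses += 1
    return misses

trace = list(range(64))  # assumed access pattern: 64 distinct cache lines, each used once
print(main_thread_misses(trace, capacity=16, scout_lead=0))
print(main_thread_misses(trace, capacity=16, scout_lead=4))
```

In this assumed trace every access misses without a scout, while with a scout running four accesses ahead the main thread misses only on the lines it reaches before the scout has had a chance to fetch them.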
One proposed way of creating a scout thread is to create a “slice” of the normal program that contains only the code needed to form the addresses and to prefetch the data. A scout thread program includes a subset of the instructions in the main thread. For example, the scout thread can include program control and memory access operations but not floating point instructions from the main program.
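The slicing approach can be sketched as a filter over the main program. The instruction representation below (pairs of a kind label and a text form), the kind names, and the function `make_scout_slice` are hypothetical and serve only to illustrate the idea; an actual slicing tool would operate on real machine instructions and their data dependences.

```python
# Instruction kinds kept in the scout slice (assumed labels, for illustration only).
SCOUT_KINDS = {"control", "address", "memory"}

def make_scout_slice(main_program):
    """Keep only control, address-forming, and memory-access instructions."""
    return [(kind, text) for kind, text in main_program if kind in SCOUT_KINDS]

# Hypothetical loop body of a main thread.
main_program = [
    ("address", "addr = base + i * 8"),
    ("memory",  "x = load(addr)"),
    ("float",   "y = x * 3.14"),
    ("float",   "acc = acc + y"),
    ("address", "i = i + 1"),
    ("control", "branch loop if i < n"),
]

for kind, text in make_scout_slice(main_program):
    print(kind, text)
```

The resulting slice still walks the same addresses and follows the same loop control as the main program, but it omits the floating point work, which is what would allow it to run ahead.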
Another proposed way of creating a scout thread is to utilize a hardware mechanism that automatically detects portions of the code to be executed on the scout thread processor. Circuitry is provided on the scout thread processor that identifies instructions performing address generation and executes those instructions. The synchronization of the main thread and the scout thread is triggered by a cache miss: the scout circuitry uses stored information about address generation, executes a stream of instructions that will generate the next few addresses, and fetches the corresponding data into the cache. This type of scout thread can execute on the same processor as the main thread and therefore benefit from information about which instructions (of the main thread program) to execute.
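The miss-triggered behavior can be approximated in software with a much simpler stand-in. The sketch below replaces the replay of address-generating instructions with a constant-stride predictor, which is a deliberate simplification and not the mechanism described above; the class `MissTriggeredScout`, its method `on_miss`, and the `depth` parameter are hypothetical names.

```python
class MissTriggeredScout:
    """On each cache miss, predict the next few addresses from the observed miss stride."""
    def __init__(self, depth=4):
        self.depth = depth        # how many addresses to generate ahead (assumed value)
        self.last_miss = None     # address of the previous miss
        self.stride = None        # distance between the two most recent misses

    def on_miss(self, addr):
        """Record the miss and return the list of addresses to prefetch (possibly empty)."""
        prefetches = []
        if self.last_miss is not None:
            stride = addr - self.last_miss
            # Prefetch only once the same nonzero stride has been seen twice in a row.
            if stride != 0 and stride == self.stride:
                prefetches = [addr + stride * k for k in range(1, self.depth + 1)]
            self.stride = stride
        self.last_miss = addr
        return prefetches

scout = MissTriggeredScout(depth=3)
for miss_addr in (0x1000, 0x1040, 0x1080):
    print(hex(miss_addr), [hex(a) for a in scout.on_miss(miss_addr)])
```

The first two misses only establish the stride; from the third miss onward the sketch emits the next few addresses, mirroring how the miss serves as the trigger that starts address generation.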