1. Field of the Invention
The present invention generally relates to computer systems, and in particular to instruction fetching within a processor of a data processing system. Still more particularly, the present invention relates to a method and system for providing efficient instruction pre-fetching for a multithreaded program.
2. Description of the Related Art
The basic structure of a conventional computer system includes a system bus or a direct channel that connects one or more processors to input/output (I/O) devices (e.g., display monitor, keyboard and mouse), a permanent memory device for storing the operating system and user applications, and a temporary memory device that is utilized by the processors to execute program instructions.
When a user program is executed on a computer, the computer's operating system (OS) first loads the program files into system memory. The program files include data objects and instructions for handling the data and other parameters which may be inputted during program execution.
The operating system creates a process to run a user program. The process comprises a set of resources, including (but not limited to) values in RAM, process limits, permissions, registers, and at least one execution stream, which is commonly termed a “thread.” The utilization of threads in user applications is well known. Threads allow multiple execution paths within a single address space to run on a processor. This technique is called “multithreading” and increases throughput and modularity in both multiprocessor and uniprocessor systems. For example, if a first thread of an executing program has to wait for the occurrence of an event, then the processor halts execution of that thread and executes another thread to prevent stoppages in processor operation and thus optimize utilization of processor resources. The event that causes execution to switch from one thread to another is typically a long latency operation, such as a disk or remote memory access or a producer-consumer type data exchange. The wait associated with long latency operations is masked by the computation performed on other threads available to the processor. In a multiprocessor computer system, multithreaded programs may also exploit the availability of multiple processors by running different threads of the application program in parallel. Parallel execution reduces response time and improves throughput in multiprocessor systems.
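The latency masking described above can be illustrated with a small cycle-count model. This is a minimal sketch under stated assumptions, not a description of any particular processor: the step format, the function name `total_cycles`, and the cycle figures in `example_threads` are all hypothetical, and context-switch overhead is assumed to be zero.

```python
def total_cycles(threads, switch_on_wait):
    """Cycles needed to finish all threads on one processor.

    Each thread is a list of ("compute", n) or ("wait", n) steps.
    This step format is a hypothetical model, not taken from the text.
    """
    if not switch_on_wait:
        # The processor stalls through every wait, so all durations add up.
        return sum(n for steps in threads for _, n in steps)

    clock = 0
    next_step = [0] * len(threads)  # index of each thread's next step
    ready_at = [0] * len(threads)   # cycle at which each thread is runnable
    while True:
        pending = [t for t in range(len(threads))
                   if next_step[t] < len(threads[t])]
        if not pending:
            break
        runnable = [t for t in pending if ready_at[t] <= clock]
        if not runnable:
            # Every remaining thread is waiting: idle until one wakes up.
            clock = min(ready_at[t] for t in pending)
            continue
        t = runnable[0]
        kind, n = threads[t][next_step[t]]
        next_step[t] += 1
        if kind == "compute":
            clock += n                # the processor is busy for n cycles
        else:
            ready_at[t] = clock + n   # long latency event: switch away
    # A thread whose last step is a wait finishes when that wait completes.
    return max([clock] + ready_at)


# Hypothetical workload: thread 0 waits 50 cycles on a long latency event.
example_threads = [
    [("compute", 10), ("wait", 50), ("compute", 10)],
    [("compute", 30), ("compute", 30)],
]
```

With this workload the stalled processor needs the sum of all step durations, while the switching processor overlaps the wait of the first thread with the computation of the second.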
In a superscalar processor operating at high frequencies, execution of a program typically involves pre-fetching of instructions from the memory or instruction cache to enable a continuous flow of instructions to the processor's execution units. Instructions are “pipelined” utilizing an instruction fetching unit (IFU) that is a hardware component of the processor. The operational characteristics of the IFU depend on changes to the flow of instruction execution due to branches, the depth of the processing core, and the memory access latency to fetch new sets of instructions. Further, the IFU is hardware intensive and is typically not scalable for high frequency processor designs. Also, current IFUs typically fetch instructions in a single-threaded fashion, i.e., they fetch all instructions for a first thread before fetching the instructions for another thread. With the movement towards multithreaded programs and multiprocessor computer systems, this latter characteristic of IFU operation, along with the other limitations, results in a dampening of overall processing efficiency and reduced throughput.
Typically, instruction pre-fetching is used on single-threaded executions. Given that a multithreaded execution involves maintenance of separate (and at times shared) address spaces among threads, the single-threaded pre-fetching technique is not easily extended to execution of a multithreaded program. Two approaches to providing multithreaded architectures are von Neumann execution based multithreading and dataflow based multithreading. For dataflow based multithreading, all inputs of a thread are fetched before execution of that thread commences. Thus, on a probable context switch, a set of fetch operations is issued to bring the thread (code and data) into the on-chip caches, and the whole thread has to be brought in. This approach is very hardware and compiler intensive because there needs to be a mechanism to determine the possible input sources of the thread, and all inputs have to arrive before a thread can be scheduled for execution. Also, performance is inhibited by the synchronization required to ensure that all inputs have been received. To reduce this performance degradation, such threads tend to be small, and the number of inputs for each thread is small as well. However, the simpler pre-fetching scheme cannot be easily extended to current multithreading operations.
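The scheduling constraint of dataflow based multithreading, namely that a thread becomes eligible only once every one of its inputs has arrived, can be sketched as follows. The class `DataflowThread` and the helper `schedulable` are hypothetical names introduced only for illustration; real dataflow machines implement this check in hardware and with compiler support.

```python
class DataflowThread:
    """A thread that may run only after all of its declared inputs arrive.

    Hypothetical model of the dataflow firing rule described in the text.
    """

    def __init__(self, name, required_inputs):
        self.name = name
        self.required = set(required_inputs)  # input sources known up front
        self.arrived = {}                     # input name -> delivered value

    def deliver(self, input_name, value):
        """Record the arrival of one input (code or data) for this thread."""
        if input_name not in self.required:
            raise KeyError(f"{self.name} has no input named {input_name!r}")
        self.arrived[input_name] = value

    def is_ready(self):
        """True only when every required input has been received."""
        return self.required.issubset(self.arrived.keys())


def schedulable(threads):
    """Names of the threads that satisfy the all-inputs-arrived condition."""
    return [t.name for t in threads if t.is_ready()]
```

In this model a thread with inputs "a" and "b" stays unschedulable after only "a" is delivered, which is the synchronization cost the paragraph above refers to. Keeping the set of required inputs small shortens the time a thread spends in that state.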
Von Neumann execution based multithreading is exemplified by the Simultaneous Multithreading technique. This type of multithreading uses a program counter to track program execution, and each thread is assumed to be independent of the others. That is, the benefits of warm caches (due to execution of one thread) for the execution of another thread are limited. Such multithreading can benefit from simple pre-fetching schemes. U.S. Pat. No. 5,809,450 offers one proposed pre-fetch scheme. According to that patent, the latency of a remote memory access is calibrated using an on-chip performance measurement scheme and is utilized to insert pre-fetches at empirically determined places in the code. This approach is also hardware intensive, and results vary with the configuration of the processor system due to changes in the memory and network access latencies.
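As a general illustration of why a latency-calibrated placement varies with system configuration, consider the commonly used pre-fetch distance rule for loops: issue the pre-fetch enough iterations ahead that the access completes before the data is needed. This is not the procedure of the cited patent, whose placement details are not given here; the function name and the cycle figures are assumptions chosen for the example.

```python
def prefetch_distance(latency_cycles, cycles_per_iteration):
    """Loop iterations ahead at which a pre-fetch should be issued.

    Generic rule of thumb: ceil(latency / work per iteration).
    Both arguments are integer cycle counts (hypothetical measurements).
    """
    if latency_cycles < 0 or cycles_per_iteration <= 0:
        raise ValueError("latency must be >= 0 and iteration cost > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (latency_cycles + cycles_per_iteration - 1) // cycles_per_iteration


# The same code needs different placements on systems with different latency.
local_distance = prefetch_distance(200, 25)    # assumed local-memory latency
remote_distance = prefetch_distance(1000, 25)  # assumed remote-memory latency
```

Because the distance scales with the measured latency, a placement tuned for one memory and network configuration is too early or too late on another.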
The present invention recognizes that it would be desirable to have a method, system, and processor that enable greater efficiency in handling execution of multithreaded programs. A method, system, and processor architecture that provide more efficient pre-fetching of instructions for multithreaded program execution would be a welcome improvement. It would be further desirable for such a method to be scalable to higher frequency processor designs without requiring significant hardware upgrades. These and other benefits are provided in the present invention as described herein.