1. Field of the Invention
The present invention relates generally to the field of processor technology. More specifically, the present invention relates to a method and apparatus for maintaining processor ordering in a processor.
2. Background Information
Various multithreaded processors and multi-processor systems have been considered in recent times to further improve the performance of processors, especially to provide for a more effective utilization of various processor resources and to speed up the performance of the overall system. In a multithreaded processor, by executing multiple threads in parallel, the various processor resources are more fully utilized, which in turn enhances the overall performance of the respective processor. For example, if some of the processor resources are idle due to a stall condition or other delay associated with the execution of a particular thread, these resources can be utilized to process another thread. Without multithreading capabilities, these resources would remain idle during a long-latency operation, for example, a memory access operation that retrieves from main memory the data needed to resolve a cache miss condition. In a multi-processor system, tasks or workloads can be distributed among the various processors to reduce the workload on each processor in the system and to take advantage of the parallelism that may exist in certain programs and applications, which in turn improves the overall performance of the system. For example, a program or an application may contain two or more processes (also referred to as threads herein) that can be executed concurrently. In this instance, instead of running the entire program or application on one processor, the two or more processes can be run separately and concurrently on the various processors in the multi-processor system, which results in faster response time and better overall performance.
Multithreaded processors may generally be classified into two broad categories, fine or coarse designs, based upon the particular thread interleaving or switching scheme employed within the respective processor. In general, fine multithreaded designs support multiple active threads within a processor and typically interleave two different threads on a cycle-by-cycle basis. Coarse multithreaded designs, on the other hand, typically interleave the instructions of different threads on the occurrence of some long-latency event, such as a cache miss. A coarse multithreaded design is discussed in Eickmayer, R., Johnson, R. et al. "Evaluation of Multithreaded Uniprocessors for Commercial Application Environments", The 23rd Annual International Symposium on Computer Architecture, pp. 203-212, May 1996. The distinctions between fine and coarse designs are further discussed in Laudon, J., Gupta, A. "Architectural and Implementation Tradeoffs in the Design of Multiple-Context Processors", Multithreaded Computer Architectures: A Summary of the State of the Art, edited by R. A. Iannuci et al., pp. 167-200, Kluwer Academic Publishers, Norwell, Mass., 1994.
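By way of illustration only, the two interleaving schemes described above may be sketched as follows. This is a simplified model, not a description of any particular processor: the function names `fine_grained` and `coarse_grained`, the `is_long_latency` predicate, and the representation of each thread as a list of instruction labels are all assumptions made for the example. The coarse model also assumes that a thread which was switched out is resumed only when the scheduler returns to it.

```python
def fine_grained(threads):
    """Interleave the threads cycle by cycle in round-robin order."""
    queues = [list(t) for t in threads]
    schedule = []
    while any(queues):
        for queue in queues:
            if queue:
                schedule.append(queue.pop(0))
    return schedule


def coarse_grained(threads, is_long_latency):
    """Run one thread until it issues a long-latency instruction, then switch."""
    queues = [list(t) for t in threads]
    schedule = []
    current = 0
    while any(queues):
        if not queues[current]:
            # The current thread has finished; move on to the next one.
            current = (current + 1) % len(queues)
            continue
        instruction = queues[current].pop(0)
        schedule.append(instruction)
        if is_long_latency(instruction):
            # Hypothetical switch trigger, e.g. a cache miss.
            current = (current + 1) % len(queues)
    return schedule


thread_a = ["A0", "A1_miss", "A2"]
thread_b = ["B0", "B1", "B2"]

print(fine_grained([thread_a, thread_b]))
print(coarse_grained([thread_a, thread_b], lambda i: i.endswith("_miss")))
```

In this sketch the fine design alternates between the two threads on every cycle, whereas the coarse design stays with the first thread until the instruction labeled as a miss is issued.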
While multithreaded processors and multi-processor systems offer advantages over single-threaded processors and single-processor systems, respectively, there are certain challenges and issues associated with the design and implementation of these systems. There are some particular issues that arise with respect to the concept of multithreading and multithreaded processor design, especially with respect to the parallel or concurrent execution of instructions. One of the difficult issues that arise in connection with multithreading and/or multiprocessing systems is the coordination and synchronization of memory accesses by the different threads in a multithreaded and/or multi-processor environment. In particular, it is a complex problem to maintain processor ordering or memory ordering among the different threads and/or different processors in a processing system in which the different threads and/or different processors share a common memory. In this situation, the various threads and/or processors communicate using data or variables in a shared memory via various memory access instructions or commands such as reads (loads) and writes (stores). Processor ordering or memory ordering is an important aspect of a multithreaded processor and/or a multi-processor system. Processor ordering or memory ordering refers to the ability of a system to perform or execute memory instructions correctly. Processor ordering or memory ordering is maintained properly if the value or data obtained by a read (load) instruction from a particular memory location is the same value that was written to (stored in) that particular memory location by the most recent write (store) instruction. Likewise, processor or memory ordering requires that an older load instruction cannot get data which is newer than the data obtained by a younger load instruction.
The problem is further complicated by the fact that each of the processors in the system may execute instructions, and access data, speculatively and out-of-order. For example, assume that a program contains two store instructions and two load instructions in the following logical sequence order (the original program order):
Store 1: Store 1000 X (store the value X in memory location 1000)
Load 1: Load 1000 (read the value stored at memory location 1000)
Store 2: Store 1000 Y (store the value Y in memory location 1000)
Load 2: Load 1000 (read the value stored at memory location 1000)
It can be appreciated that maintaining processor or memory ordering with respect to the four instructions in this example is not an easy task, considering that these four instructions may be executed speculatively out-of-order in multiple threads on multiple processors. Depending on the order in which these four instructions are executed, the results may or may not violate the processor or memory ordering rule.
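The dependence of the results on execution order can be illustrated with a short sketch that executes the four instructions of the example in every possible order against a single shared memory and compares the values obtained by the two loads with the values obtained in the original program order. The names `PROGRAM`, `run`, and `matching_orders` are hypothetical, and the model is deliberately simplified: it assumes one flat memory with no caches or store buffers, and it compares only the values returned by the loads.

```python
from itertools import permutations

ADDRESS = 1000

# The four instructions of the example, in original program order.
PROGRAM = [
    ("Store 1", "store", ADDRESS, "X"),
    ("Load 1", "load", ADDRESS, None),
    ("Store 2", "store", ADDRESS, "Y"),
    ("Load 2", "load", ADDRESS, None),
]


def run(order):
    """Execute the instructions in the given order; return the value each load read."""
    memory = {}
    loaded = {}
    for name, kind, address, value in order:
        if kind == "store":
            memory[address] = value
        else:
            loaded[name] = memory.get(address)  # None if nothing has been stored yet
    return loaded


# Values the loads obtain when the original program order is respected.
EXPECTED = run(PROGRAM)


def matching_orders():
    """Return the execution orders whose load results equal the program-order results."""
    return [
        tuple(instruction[0] for instruction in order)
        for order in permutations(PROGRAM)
        if run(order) == EXPECTED
    ]


for order in matching_orders():
    print(" -> ".join(order))
```

Under these assumptions, a load obtains the expected value only if it executes after its corresponding store and before the other store, so most of the possible execution orders give at least one load a value that differs from the program-order result.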
According to one aspect of the invention, a method is provided in which store addresses of store instructions dispatched during a last predetermined number of cycles are maintained in a first data structure of a first processor. It is determined whether a load address of a first load instruction matches one of the store addresses in the first data structure. The first load instruction is replayed if the load address of the first load instruction matches one of the store addresses in the first data structure.
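The method summarized above may be sketched, purely for illustration, as a small software model. The class name `RecentStoreAddresses`, its method names, and the use of a queue of (cycle, address) pairs are assumptions made for this example; they are not taken from the description above, which does not specify how the first data structure is organized or how the predetermined number of cycles is counted. The sketch also assumes that cycle numbers are supplied in non-decreasing order and that a match requires the load address to equal the store address exactly.

```python
from collections import deque


class RecentStoreAddresses:
    """Model of a structure holding store addresses dispatched in the last N cycles."""

    def __init__(self, window_cycles):
        self.window_cycles = window_cycles
        self._entries = deque()  # (dispatch_cycle, store_address), oldest first

    def _expire(self, current_cycle):
        # Drop stores dispatched window_cycles or more cycles ago.
        oldest_kept = current_cycle - self.window_cycles
        while self._entries and self._entries[0][0] <= oldest_kept:
            self._entries.popleft()

    def record_store(self, cycle, address):
        """Remember the address of a store dispatched in the given cycle."""
        self._expire(cycle)
        self._entries.append((cycle, address))

    def load_must_replay(self, cycle, load_address):
        """Return True if the load address matches a recently dispatched store."""
        self._expire(cycle)
        return any(address == load_address for _, address in self._entries)


window = RecentStoreAddresses(window_cycles=4)
window.record_store(cycle=10, address=1000)
print(window.load_must_replay(cycle=12, load_address=1000))  # store still in window
print(window.load_must_replay(cycle=14, load_address=1000))  # store has aged out
```

A queue ordered by dispatch cycle is used here so that expired entries can be discarded from the front; a hardware implementation would more likely use a fixed-size structure, which this sketch does not attempt to model.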