1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to instructions that prevent memory hazards in vector or parallel processing.
2. Related Art
There are many impediments to the parallelization of computational operations in parallel processing systems. Among these impediments, one of the more difficult problems to address is memory hazards, such as address hazards, in which different memory references refer to the same address. The potential for memory hazards often restricts exploitation of many features available in modern high-performance processors. For example, memory hazards may block instruction-level parallelism (ILP) by preventing load instructions from being hoisted above store instructions. Furthermore, memory hazards may block data-level parallelism (DLP) by preventing compilers from vectorizing loops, or may block thread-level parallelism by preventing threads from being spawned.
In the case of ILP, existing processors typically attempt to move loads upward in the instruction stream with the goal of initiating memory transactions as early as possible while the processor performs other work in parallel. For example, out-of-order processors often use hardware mechanisms to hoist loads. All such processors implement some form of dynamic (runtime) memory disambiguation in hardware, for example, by using a memory order buffer (MOB) to prevent the processor from erroneously moving a load ahead of a preceding store that turns out to be directed to the same address.
In contrast, in-order processors use a compiler to explicitly hoist loads. However, these compilers operate without the benefit of runtime information and, therefore, cannot always predetermine whether moving a load ahead of a store will be safe. This uncertainty forces these compilers to be conservative in hoisting loads, which greatly sacrifices performance. This also greatly limits performance in superscalar in-order processors, such as those that implement very-long-instruction-word (VLIW) architectures. To address this problem, some of these processors include hardware mechanisms that enable their compilers to hoist loads more aggressively. In particular, these mechanisms enable the compiler to speculatively hoist a load by providing a hardware-checking mechanism which either verifies at runtime that the movement of the load was legitimate or generates an exception after a memory-hazard problem is encountered, allowing software to repair the problem.
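As a purely illustrative sketch (not part of the described hardware), the following Python fragment models memory as a list and shows why hoisting a load above a store is safe only when the two addresses differ. The function names `in_order`, `hoisted`, and `checked_hoist` are hypothetical, and the runtime check in `checked_hoist` is only a software stand-in for the kind of hardware-checking mechanism mentioned above.

```python
def in_order(mem, store_addr, load_addr, value):
    """Original program order: the store executes before the load."""
    mem = list(mem)              # work on a copy so callers can reuse mem
    mem[store_addr] = value      # store
    return mem[load_addr]        # load sees the store if addresses match


def hoisted(mem, store_addr, load_addr, value):
    """Load moved above the store, with no check at all."""
    mem = list(mem)
    loaded = mem[load_addr]      # load issued early
    mem[store_addr] = value      # store
    return loaded                # stale if store_addr == load_addr


def checked_hoist(mem, store_addr, load_addr, value):
    """Speculatively hoisted load plus a runtime address check (assumed model)."""
    mem = list(mem)
    loaded = mem[load_addr]      # speculative early load
    mem[store_addr] = value      # store
    if store_addr == load_addr:  # hazard detected at runtime
        loaded = mem[load_addr]  # repair by redoing the load
    return loaded
```

With distinct addresses, all three functions return the same value. When the store and the load target the same address, `hoisted` returns the stale value while `in_order` and `checked_hoist` return the newly stored one, which is the case a compiler cannot rule out without runtime information.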
In the case of DLP, existing autovectorizing compilers cannot freely vectorize code for exactly the same memory-hazard-related reasons that scalar and superscalar processors cannot freely reorder loads and stores. In particular, aggregating a set of temporally sequential operations (such as loop iterations) into a spatially parallel vector creates essentially the same problem as reordering the loads and stores. In either case, the sequential semantics of the program are potentially violated. Just as compilers cannot always predetermine when it is safe to reorder loads above stores, a vectorizing compiler cannot always predetermine when it is safe to group sequential operations into a parallel vector of operations. However, in the case of vector processors, the ramifications are more than a mere incremental performance loss: the entire advantage behind vector processing is defeated. Consequently, vector processors are rarely built, and those with short-vector facilities, such as Single-Instruction-Multiple-Data (SIMD) processors, are often underutilized. The underlying problem for these processors is that existing compilers are severely limited in their ability to automatically vectorize code due to their inability to statically disambiguate memory references.
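The following Python sketch, offered only as an illustration under simplifying assumptions, contrasts a sequential loop with a naive "vectorized" form that performs all of its loads before any of its stores. The loop body `a[i + d] = a[i] + 1` and the names `sequential` and `vectorized` are hypothetical examples chosen for this sketch; the offset `d` plays the role of an address relationship that the compiler may not know statically.

```python
def sequential(a, d, n):
    """Reference semantics: iterations run one after another."""
    a = list(a)
    for i in range(n):
        a[i + d] = a[i] + 1      # may read a value written by an earlier iteration
    return a


def vectorized(a, d, n):
    """Naive vector form: one wide load, then one wide store (assumed model)."""
    a = list(a)
    loaded = [a[i] for i in range(n)]   # all n loads happen first
    for i in range(n):
        a[i + d] = loaded[i] + 1        # all n stores happen afterwards
    return a
```

When `d` is at least `n`, no iteration reads a location that an earlier iteration wrote, so both forms agree. When `d` is 1, each iteration depends on the previous one, and the vectorized form produces a different result, which is why a compiler that cannot determine `d` statically must decline to vectorize.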
Similarly, in the case of thread-level parallelism, existing multithreading compilers are often prevented from spawning multiple parallel threads due to the potential for memory hazards. This limitation may not be a large problem for existing multi-core and multithreaded processors because they currently operate using coarse-grain threads and depend upon explicit parallelization by human programmers. Unfortunately, it is difficult to scale these manual parallelization techniques. Consequently, to facilitate fine-grain multithreaded execution (in which each iteration of a loop may be processed by a different processor or core), compilers will need to overcome memory-address-hazard problems to automatically parallelize programs.
Hence, what is needed is a technique to facilitate vector or parallel processing in the presence of memory hazards without the above-described problems.