Conventional computerized devices include a processor (e.g., microprocessor) or other circuitry that is capable of executing machine language instructions stored in a memory system operating within the computerized device. A collection of such machine language instructions or code in the memory system can form a computer program or software application. A software developer may use a text editor to write the computer program in a low-level language such as assembly language. The software developer can then process the assembly language with a program called an assembler to convert the assembly language instructions into machine language instructions that the processor can natively execute. Alternatively, a software developer can write a software application using a high-level language such as C or C++ and can operate a program called a compiler which reads the C or C++ code and which automatically generates assembly level instructions which the compiler (or another program such as a linker) then assembles into machine language instructions that the processor can execute. Assembly language statements or instructions and their machine language counterparts thus represent the lowest level of programming that a software developer or a compiler can create that can then be executed by a processor in a computerized device.
Some compilers are equipped with a capability of producing optimized code. As an example, depending upon the target processor for which a compiler is generating code, the compiler may reorder certain instructions in the code such that when the target processor executes such re-ordered instructions, the target processor will execute them somewhat faster than it would have executed the instructions in their original order. Optimizing compilers thus attempt to arrange instructions in a more optimal manner for faster execution on the target processor. Optimization processing may determine that certain instructions should be placed ahead of others, even though the original high-level software code might have indicated a different order for the high-level instructions that map or correspond to the re-ordered low-level machine instructions.
As an example of how an optimization procedure might reorder instructions, consider the following fragment of C code:

        *ptr1 = a;
        b = *ptr2;

The first statement “*ptr1=a” indicates that the value of variable “a” should be stored at a memory location identified by pointer 1 (*ptr1). The second statement “b=*ptr2” indicates that the value of variable “b” should be set to the contents of memory loaded from a memory location referenced by pointer 2 (*ptr2). For one type of microprocessor, a compiler might compile the aforementioned fragment of C code into the following assembly language or machine language equivalent set of instructions:

        store R1->[R2]
        load [R3]->R4

Upon execution by the target processor, the “store” instruction causes the target processor to place the contents of register R1 (i.e., containing the value of variable “a”) into a memory location defined in register R2. In other words, the store instruction causes the target processor to write a value to memory. The “load” instruction causes the target processor to obtain data from memory at the location defined in register R3 and to place this data into register R4 (which represents the “b” variable). That is, the load instruction causes the target processor to read a value from memory.
Due to the nature of how instructions are executed in certain target processors, it might be preferable (i.e., faster) for the target processor to begin execution processing of a load instruction prior to a store instruction that appears before the load instruction in the code. This may be the case, perhaps, because the target processor requires more processing cycles to completely execute a load instruction whereas a store instruction might take fewer cycles to execute. Another reason to perform a load before a store might be that internal processor resources used by the load instruction, such as a memory channel, might be available before the store instruction but might not be available immediately after the store instruction. Accordingly, if the target processor can begin execution of the load instruction first, followed by the store instruction, the net result might be that both instructions are executed in a shorter total amount of time as compared to if these instructions had been executed in the original order shown above.
Some optimizing compilers are aware of this fact and may thus automatically reorder the instructions during compiling of the high-level language source code such that the load instruction precedes the store instruction in the machine language code. In other cases, a compiler may be unaware of efficiencies gained from reordering such instructions and the processor itself may reorder the instructions during its instruction execution procedure (i.e., during runtime). In such cases, as the processor encounters a store instruction followed by a load instruction, the processor may be configured to reorder the instructions so that the processor always executes the load before the store regardless of the order in which the compiler originally arranged the instructions.
In either instance, where either the compiler or the processor reorders load and store instructions, certain problems may arise.
Referring to the example store and load code shown above, suppose that R2 and R3 happen to refer to the same memory address. In such cases, moving the load prior to the store instruction will result in the load instruction fetching an incorrect (stale) value. This problem is called a read-after-write (RAW) hazard. Certain optimizing compilers that are capable of reordering load and store instructions can attempt to perform an alias analysis technique on the code in an attempt to determine, at compile time, if R2 and R3 are disjoint or distinct from each other (i.e., that a read-after-write hazard does not exist). In some cases, the alias analysis technique used by such compilers can break down and cannot guarantee that R2 and R3 are disjoint. When alias analysis fails, for reasons of safety and correctness, the compiler must be conservative and forgo reordering the load and store instructions.
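By way of a simplified illustration only (the variable names below are hypothetical and chosen solely for this sketch), the following C fragment shows the aliasing situation that gives rise to a read-after-write hazard: because ptr1 and ptr2 refer to the same location, the load must observe the value just stored, and hoisting the load above the store would return a stale value.

    #include <stdio.h>

    /* Illustrative sketch of the read-after-write (RAW) hazard discussed
     * above.  Here ptr1 and ptr2 deliberately alias the same variable, so
     * the load "b = *ptr2" must observe the value just stored through ptr1;
     * if the load were hoisted above the store, b would receive the stale
     * value 0 instead of 5. */
    int main(void)
    {
        int x = 0;
        int a = 5;
        int b;
        int *ptr1 = &x;
        int *ptr2 = &x;     /* aliases ptr1; compile-time alias analysis may
                               be unable to prove whether this is the case */

        *ptr1 = a;          /* store */
        b = *ptr2;          /* load: in program order, b correctly becomes 5 */

        printf("b = %d\n", b);
        return 0;
    }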
Alternatively, some compilers are equipped to produce code that performs an explicit check at runtime to determine if R2 and R3 are disjoint. During runtime, if such explicit checking code determines that the R2 and R3 memory references are unique (i.e., are distinct), then the code follows an execution path in which the load instruction is executed prior to the store instruction (i.e., the load and store are reordered to improve execution speed). If the explicit check code determines that memory references associated with R2 and R3 are possibly related to one another, then the code marks one of the memory references as “volatile” and the processor is not permitted to execute the load instruction prior to the store instruction when this volatile memory reference is used in such instructions.
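By way of a simplified illustration only (the function and parameter names below are hypothetical), the following C sketch shows the general shape of such runtime checking code: the reordered, faster path is taken only when a runtime address comparison shows that the two memory references are disjoint.

    /* Illustrative sketch of compiler-generated runtime checking code.  The
     * addresses are compared at runtime; only when they are distinct is the
     * load permitted to execute ahead of the store. */
    void store_then_load(int *ptr1, int *ptr2, int a, int *b)
    {
        if (ptr1 != ptr2) {
            /* Disjoint references: safe to reorder -- load first, then store. */
            *b = *ptr2;
            *ptr1 = a;
        } else {
            /* Possibly aliased: preserve program order to avoid a RAW hazard. */
            *ptr1 = a;
            *b = *ptr2;
        }
    }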
If a conventional processor performs reordering of load and store instructions during runtime execution of code in order to increase performance, the processor can include circuitry that can check for read-after-write hazards within itself after reordering and executing the reordered load and store instructions. In other words, the processor can “speculatively” execute load instructions before store instructions and can then perform a read-after-write hazard check after the fact to determine if the load store processing occurred without memory reference errors. If the processor determines that a read-after-write hazard did occur, the processor can reset its state to a state that existed just before the reorder operation and can then re-execute that portion of speculatively executed code in the original order by first executing the store instruction followed by the load instruction. As an example, upon detecting a read-after-write hazard, the processor can re-execute the offending instructions in program order (i.e., in the order in which they appear in the program), or the processor can discard a speculative state associated with the load instruction and can reissue the load instruction for execution along with any instructions that were dependent upon the load.
When a processor reorders a load instruction to execute prior to a store instruction and then performs read-after-write hazard checking, this processing is not visible to the executing program. In other words, from the programmer's perspective, the processor appears to have executed the instructions in program order as produced by the compiler. Processor designers have determined that in most cases a load instruction can be reordered to execute before a store instruction without incurring a read-after-write hazard. As such, conventional wisdom is that computer systems obtain better execution performance by having processors perform such reorder operations while only occasionally incurring the performance penalty of having to undo and re-execute the instructions in the original order.
The aforementioned load and store reordering techniques may be used in conventional “uniaccess” execution environments comprising a single processor, or in which a computerized device includes multiple processors that execute processes that do not access regions of memory related to one another. Uniaccess execution environments may also include multiprocessor computer systems in which different threads of a process are constrained or bound to a single processor, or in which different processes that share memory are constrained to execute on the same processor. Such example configurations are referred to herein as “uniaccess” execution environments.
Other conventional computerized devices can include multiple concurrently operating processors. In a “multiaccess” execution environment, different software processes may concurrently execute on different processors, or a single multi-threaded process may execute by having different threads concurrently execute on different processors (e.g., in a multiprocessor equipped computer system). A multiaccess execution environment can include memory such as shared memory or other memory that is potentially accessible by the different threads of the multi-threaded process or by different concurrently and/or simultaneously operating processes. In other words, a multiaccess execution environment is one in which the same memory location may potentially be accessed by two different portions of processing code (e.g., threads or processes). Multiaccess execution environments thus afford the ability to allow simultaneous (e.g., in multiprocessor computer systems) access to the same region of memory by different threads or processes. Two processes or threads might require access to the same memory area, for example, to allow such processes or threads to synchronize or exchange state with each other or for other reasons.
One conventional synchronization scheme is known as Dekker's Algorithm and begins operation in a process by using a store instruction followed by a load instruction. As discussed above, if a conventional compiler or one of the processors operating such a process or thread attempts to optimize execution of the Dekker's Algorithm store/load code by reordering the load instruction to execute before the store instruction, the algorithm can fail to maintain synchronization between processes or threads in a proper manner. In multiaccess execution environments, then, a more complicated version of the read-after-write hazard can exist in which two different processes may contain store and load instructions that reference a common memory location (i.e., that are non-disjoint), and thus reordering in such cases should not be allowed in either processor. The problem discussed above is thus how to reorder and execute load instructions before store instructions to gain the performance benefit while still being able to detect memory references between multiple processes that can cause hazards.
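By way of a simplified illustration only (the names flag, turn, lock and unlock are hypothetical, and C11 atomic operations are used solely to make the sketch self-contained), the following C fragment shows the store-followed-by-load structure at the heart of Dekker's Algorithm. Each thread first stores to its own flag and then loads the other thread's flag; if either the compiler or a processor hoisted that load above the store, both threads could observe a zero flag and enter the critical section at the same time.

    #include <stdatomic.h>

    /* Illustrative two-thread entry/exit protocol in the spirit of Dekker's
     * Algorithm.  The sequentially consistent ordering of the C11 atomic
     * operations is what prevents the store/load reordering described above. */
    static atomic_int flag[2];     /* flag[i] = 1: thread i wants to enter   */
    static atomic_int turn;        /* whose turn it is to proceed            */

    void lock(int self)
    {
        int other = 1 - self;
        atomic_store(&flag[self], 1);            /* store: announce intent   */
        while (atomic_load(&flag[other])) {      /* load: check the peer     */
            if (atomic_load(&turn) != self) {
                atomic_store(&flag[self], 0);    /* back off                 */
                while (atomic_load(&turn) != self)
                    ;                            /* wait for our turn        */
                atomic_store(&flag[self], 1);    /* try again                */
            }
        }
        /* critical section begins here */
    }

    void unlock(int self)
    {
        atomic_store(&turn, 1 - self);           /* give the other thread priority */
        atomic_store(&flag[self], 0);
    }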
Some conventional processors, such as Intel's 32-bit Pentium line of microprocessors (manufactured by Intel Corporation, Pentium being a registered trademark of Intel Corporation), solve this problem by providing a structure (e.g., circuitry) called a memory order buffer or MOB. The memory order buffer operates in a processor to track or “snoop for” write accesses to shared memory locations performed by other processors, for any shared memory addresses that the processor (i.e., the processor operating the memory order buffer) has previously speculatively read. If the processor using the memory order buffer detects any such writes from other processors to previously speculatively read memory addresses, the processor is said to have detected a “memory ordering violation.” To overcome this violation, the processor can use the memory order buffer to cancel or restart the affected reads as well as all speculative actions that might have depended on those reads. Like the read-after-write recovery circuitry discussed above (i.e., for uniaccess execution environments), the operation of the memory order buffer is inaccessible to programmers and is thus invisible to the executing program's operation.
Other conventional processors such as Intel's 64-bit IA-64 line of microprocessors, of which the Itanium is a member, provide another structure called an Advanced Load Address Table (ALAT) that permits a compiler to explicitly “move up” or advance loads before other instructions (e.g., before stores) to permit the loads to execute earlier than would otherwise be possible. The ALAT is visible to the programmer and to the executing program. As such, the compiler must explicitly reorder the load instruction before the store instruction, as the processor containing the ALAT will not do so on its own. Since the compiler reorders these instructions for a target processor containing an ALAT, the compiler also produces special “check code” that is inserted into the code after reordering the load and store instructions. The purpose of the check code is to consult the ALAT to make sure that a value returned by an “advanced” (i.e., reordered) load instruction is still coherent. If the check code detects a memory violation, the code branches to a recover operation in order to correct the memory violation problem. The ALAT mechanism is also responsible for performing the read-after-write hazard checking discussed above.
Another conventional technology related to the present invention concerns the way in which processors in a computerized device access memory in order to read or write information (i.e., load or store data) to the memory. In many computer systems, a memory system such as random access memory (RAM) is logically divided into a set or series of pages of a predetermined size. A processor such as a central processing unit, microprocessor or other circuitry operating in such a conventional computer system includes a memory management unit or MMU that manages these pages. The memory management unit controls or governs access to pages of memory on behalf of program code being executed within a processor. Generally, when a processor causes the computer system to load program code and data into memory in order to begin execution of such code, the memory management unit allocates a number of pages of memory to the program (i.e., to store the program code and any data related to the program). The memory management unit may store the program code and data in pages of memory that span a wide range of physical memory addresses. However, the program code itself may contain instructions that reference the data associated with that program code over a set of logical memory addresses. Accordingly, typical conventional computer systems also include a page table that contains a series of page table entries. Each page table entry provides a mapping between a set of logical addresses and a set of physical addresses for each page of memory. A conventional memory management unit associated with a processor is generally responsible for maintaining the contents of the page table entries in the page table on behalf of programs executing on the processor associated with that memory management unit. In other words, when a processor loads and begins execution of program code that references data stored in memory at various logical addresses (addresses relative to that program), the memory management unit and operating system for that processor establish and maintain page table entries that identify which pages of physical memory contain the physical addresses (and hence the actual code or data) that map to the logical addresses referenced by the program code.
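By way of a simplified illustration only (the page size, table size and field names below are assumptions made for this sketch and do not correspond to any particular processor), the following C fragment shows how a page table entry can be used to translate a logical address into a physical address by replacing the logical page number with the physical page number while preserving the offset within the page.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12u                        /* assume 4 KB pages        */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_PAGES  1024u                      /* size of this toy table   */

    typedef struct {
        uint32_t physical_page;                   /* physical page number     */
        bool     valid;                           /* is a mapping present?    */
    } page_table_entry;

    static page_table_entry page_table[NUM_PAGES];

    /* Translate a logical address by splitting it into a page number (upper
     * bits) and an offset (lower bits) and substituting the physical page
     * number recorded in the corresponding page table entry.  Returns false
     * when no mapping exists (a page fault in a real memory management unit). */
    bool translate(uint32_t logical, uint32_t *physical)
    {
        uint32_t page   = logical >> PAGE_SHIFT;
        uint32_t offset = logical & (PAGE_SIZE - 1);

        if (page >= NUM_PAGES || !page_table[page].valid)
            return false;

        *physical = (page_table[page].physical_page << PAGE_SHIFT) | offset;
        return true;
    }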
A page table containing all page table entries can become quite large in a computer system since one or more processors can concurrently execute many different programs. This fact, in combination with the fact that a memory management unit operating within a processor must access page table entries in the page table stored in physical memory using a shared interconnection mechanism such as a data bus that can consume valuable processing time, has caused computing system developers to create a processor-internal page table entry buffer or cache called a “translation lookaside buffer” or TLB. A memory management unit can utilize the high-speed access characteristics of a processor-internal translation lookaside buffer containing recently accessed page table entries to increase program code execution speed.
A typical conventional translation lookaside buffer might contain between 16 and 64 page table entries that map recently used logical addresses (i.e., addresses used by a program executing in that processor) to corresponding physical memory addresses or pages. The set of physical pages mapped by a memory management unit in its associated translation lookaside buffer, which are thus readily accessible to the processor associated with that memory management unit, is called the translation lookaside buffer span. The translation lookaside buffer span is thus a small subset of the entire amount of memory address space accessible to a processor. When a program executes on a processor and references an instruction or data contained within a page of memory mapped by a page table entry that is not in the translation lookaside buffer, the memory management unit fetches the required page table entry from the page table resident in memory through a conventional data caching access technique that allows a processor to more rapidly access memory (via a data cache). In other words, to increase the speed of access to memory, conventional processors use memory caching techniques that provide processor-resident memory caches. A memory management unit in a processor performs accesses to page table entries for insertion into its translation lookaside buffer using the same techniques as other memory accesses (i.e., using caching).
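By way of a simplified illustration only (the sizes, field names and round-robin replacement policy below are assumptions made for this sketch, and walk_page_table stands in for a full page table walk such as the one sketched above), the following C fragment shows a translation lookaside buffer lookup that falls back to the page table on a miss and caches the fetched mapping for subsequent accesses.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 16u
    #define PAGE_SHIFT  12u                       /* assume 4 KB pages        */

    typedef struct {
        uint32_t logical_page;
        uint32_t physical_page;
        bool     valid;
    } tlb_entry;

    static tlb_entry tlb[TLB_ENTRIES];
    static unsigned  next_victim;                 /* round-robin replacement  */

    /* Assumed helper that performs the full page table walk in memory. */
    bool walk_page_table(uint32_t logical_page, uint32_t *physical_page);

    bool tlb_translate(uint32_t logical, uint32_t *physical)
    {
        uint32_t page   = logical >> PAGE_SHIFT;
        uint32_t offset = logical & ((1u << PAGE_SHIFT) - 1);
        uint32_t phys_page;

        /* TLB hit: the mapping is already cached inside the processor. */
        for (unsigned i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].logical_page == page) {
                *physical = (tlb[i].physical_page << PAGE_SHIFT) | offset;
                return true;
            }
        }

        /* TLB miss: fetch the page table entry from memory, then cache it. */
        if (!walk_page_table(page, &phys_page))
            return false;

        tlb[next_victim] = (tlb_entry){ page, phys_page, true };
        next_victim = (next_victim + 1) % TLB_ENTRIES;

        *physical = (phys_page << PAGE_SHIFT) | offset;
        return true;
    }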
In yet another conventional technology that is related to the present invention, in multiprocessor computer systems, each independently operating processor (e.g., each CPU or microprocessor) can maintain its own processor-resident memory cache. The memory cache provides a processor with high-speed access to data within the cache, as compared to having to access the required data from main memory over a bus. The cache also reduces bus traffic, and thus improves overall system throughput. Since multiprocessing computing system environments can allow processors to share data, computing system developers have created cache coherency protocols that ensure that the contents of processor-local caches and their corresponding main memory locations are properly maintained or synchronized if, for example, two caches maintained in respective separate processors contain references to a common or shared memory location.
One example of a conventional cache coherency protocol is the MESI cache coherence protocol, where “MESI” stands for Modified, Exclusive, Shared and Invalid. These terms represent the possible states of a processor cache line in a cache. The MESI protocol, and others like it, are generally considered “snooping” protocols that maintain coherency for cache entries between all processors “attached” to a memory subsystem by snooping or monitoring the memory transactions of other processors. In doing so, if a first processor operates a MESI protocol to “snoop” a memory transaction of another processor and detects that this transaction reflects a change to a memory location associated with an entry in the first processor's cache, the first processor can appropriately update its cache entry based on the modified memory location to ensure that its cache accurately reflects the contents of the associated memory location (i.e., to maintain cache coherency).
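By way of a simplified illustration only (the state transitions are deliberately reduced and the names below are chosen solely for this sketch), the following C fragment shows how a snooping processor might update the MESI state of one of its cache lines upon observing another processor's transaction that touches the same memory block.

    /* Illustrative sketch of a snooping update to a cache line's MESI state. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    typedef struct {
        unsigned long tag;          /* which memory block this line holds     */
        mesi_state    state;
    } cache_line;

    /* Called when this processor snoops a transaction by another processor
     * that refers to the memory block held in "line". */
    void snoop(cache_line *line, int other_processor_writes)
    {
        if (other_processor_writes) {
            /* Another processor is modifying the block, so this copy is stale;
             * a MODIFIED line would first be written back to memory. */
            line->state = INVALID;
        } else {
            /* Another processor is reading the block, so it can no longer be
             * held exclusively by this processor. */
            if (line->state == EXCLUSIVE || line->state == MODIFIED)
                line->state = SHARED;
        }
    }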