Multiprocessor systems, such as symmetric multi-processors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data sets and communicating values. Interprocessor synchronization is typically accomplished in such a shared memory system by processors performing read and write operations to "synchronization variables" either before or after accesses to "data variables".
For instance, consider computer program example #1 wherein a processor P1 updates a data structure and processor P2 reads the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term "subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an inter-reference order on its memory reference operations.
Computer program example #1

    P1                          P2
    Store Data, New-value       L1: Load Flag
    Store Flag, 0                   BNZ L1
                                    Load Data
The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models include sequential consistency and weak-ordering, although those skilled in the art will recognize that there are other models, such as release consistency, that may be employed.
In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order. Consider computer program example #1 above. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.
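The guarantee described above can be checked by brute force. The following is a minimal sketch, not part of the described system: it models sequential consistency as every interleaving of the two I-streams that keeps each processor's own order, and it models a single pass of P2's flag loop. The names `interleavings`, `run`, and `stale_reads`, the initial memory contents, and the value `"new"` are illustrative assumptions.

```python
from itertools import combinations

NEW_VALUE = "new"  # hypothetical stand-in for "New-value"

# Each operation is (kind, variable, value); list order is I-stream order.
P1 = [("store", "Data", NEW_VALUE), ("store", "Flag", 0)]
P2 = [("load", "Flag", None), ("load", "Data", None)]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one schedule against a single shared memory; return P2's loads."""
    memory = {"Data": "old", "Flag": 1}  # assumed start: flag not yet set
    seen = {}
    for kind, var, value in schedule:
        if kind == "store":
            memory[var] = value
        else:
            seen[var] = memory[var]
    return seen

def stale_reads(p1, p2):
    """Outcomes where P2 saw Flag == 0 but did not see the new data."""
    return [r for r in map(run, interleavings(p1, p2))
            if r["Flag"] == 0 and r["Data"] != NEW_VALUE]
```

Under these assumptions, `stale_reads(P1, P2)` is empty, whereas reversing P1's two stores (setting the flag before updating the data) leaves one schedule in which P2 sees the flag set but reads old data.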
In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB instruction defined by the Alpha® 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB instructions). However, no order is required between reference instructions that are not separated by an MB, except in specific circumstances such as when two references are directed to the same address.
Computer program example #2

    P1:                         P2:
    Store Data1, New-value1     L1: Load Flag
    Store Data2, New-value2         BNZ L1
    MB                              MB
    Store Flag, 0                   Load Data1
                                    Load Data2
In the above example, the MB instruction implies that each of P1's two pre-MB store instructions is ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly, P2's two post-MB load instructions are ordered after the Load flag; yet, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references, thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
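The role of the MB in computer program example #2 can be illustrated with a small exhaustive model. This is a sketch under stated assumptions rather than a description of any actual processor: references within an MB-delimited group may be permuted freely, groups stay in sequence, and one pass of P2's flag loop is modeled. The same-address exception is not modeled because no processor here touches an address twice. The names `legal_orders`, `stale_outcomes`, and `without_mb`, and the values `"new1"`/`"new2"`, are hypothetical.

```python
from itertools import combinations, permutations, product

MB = "MB"
NEW = {"Data1": "new1", "Data2": "new2"}  # stand-ins for New-value1/2

P1 = [("store", "Data1"), ("store", "Data2"), MB, ("store", "Flag")]
P2 = [("load", "Flag"), MB, ("load", "Data1"), ("load", "Data2")]

def legal_orders(program):
    """Orders allowed by the model: free within a group, groups in sequence."""
    groups, current = [], []
    for op in program:
        if op == MB:
            groups.append(current)
            current = []
        else:
            current.append(op)
    groups.append(current)
    for choice in product(*(permutations(g) for g in groups)):
        yield [op for group in choice for op in group]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one schedule on a shared memory; return the values P2 loaded."""
    memory = {"Data1": "old1", "Data2": "old2", "Flag": 1}
    seen = {}
    for kind, var in schedule:
        if kind == "store":
            memory[var] = 0 if var == "Flag" else NEW[var]
        else:
            seen[var] = memory[var]
    return seen

def stale_outcomes(p1, p2):
    """(Data1, Data2) pairs P2 can load after seeing Flag == 0 that are not both new."""
    bad = set()
    for o1, o2 in product(legal_orders(p1), legal_orders(p2)):
        for schedule in interleavings(o1, o2):
            r = run(schedule)
            if r["Flag"] == 0 and (r["Data1"], r["Data2"]) != ("new1", "new2"):
                bad.add((r["Data1"], r["Data2"]))
    return bad

def without_mb(program):
    """Same program with the MB instructions removed."""
    return [op for op in program if op != MB]
```

In this model, `stale_outcomes(P1, P2)` is the empty set, while `stale_outcomes(without_mb(P1), without_mb(P2))` contains every combination in which at least one data value is old. The two data stores (and the two data loads) remain free to reorder in both cases, which is the latitude weak ordering is meant to provide.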
Most computer systems use virtual memory to effectively manage physical memory of the systems. In a virtual memory system, programs use virtual addresses to address memory space allocated to them. The virtual addresses are translated to physical addresses which denote the actual locations in physical memory. A common process for managing virtual memory is to divide the virtual and physical memory into equal-sized pages. A system disk participates in the implementation of virtual memory by storing pages of the program not currently in physical memory. The loading of pages from the disk to physical memory is managed by the operating system.
When a program references an address in virtual memory, the processor calculates the corresponding main memory physical address in order to access data at that address. The processor typically includes a memory management unit that performs the translation of the virtual address to a physical address. Specifically, for each program there is a page table containing a list of mapping entries, i.e., page table entries (PTEs), which, in turn, contain the physical address of each virtual page of the program. FIG. 1 is a schematic diagram of a prior art page table 100 containing a plurality of PTEs 110. An upper portion, i.e., the virtual page number (VPN) 122, of a virtual address 120 is used to index into the page table 100 to access a particular PTE 110; the PTE contains a page frame number (PFN) 112 identifying the location of the page in main memory. A lower portion, i.e., page offset 124, of the virtual address 120 is concatenated to the PFN 112 to form the physical address 130 corresponding to the virtual address. Because of its large size, the page table is generally stored in main memory; thus, every program reference to access data in the system typically requires an additional memory access to obtain the physical address, which increases the time to perform the address translation.
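The translation just described reduces to a shift, a table lookup, and a concatenation. The sketch below assumes 8 KB pages (13 offset bits) and a tiny hypothetical page table; neither the page size nor the table contents come from FIG. 1, and the function name `translate` is illustrative.

```python
PAGE_OFFSET_BITS = 13              # assumption: 8 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

# Hypothetical page table: index is the VPN, entry is the PFN.
# None marks a page that is not currently in physical memory.
page_table = [7, None, 3, 12]

def translate(virtual_address, table):
    """Split the virtual address, index the table by VPN, rebuild the address."""
    vpn = virtual_address >> PAGE_OFFSET_BITS          # upper portion
    offset = virtual_address & (PAGE_SIZE - 1)         # lower portion
    pfn = table[vpn]       # the additional memory access noted above
    if pfn is None:
        raise LookupError(f"virtual page {vpn} is not resident")
    return (pfn << PAGE_OFFSET_BITS) | offset          # PFN concatenated with offset
```

For example, a virtual address in virtual page 2 with offset `0x1A4` maps to page frame 3 with the same offset, since only the upper portion of the address is replaced.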
To reduce address translation time, a translation buffer (TB) is used to store translation maps of recently accessed virtual addresses. The TB is similar to a data cache in that the TB contains a plurality of entries, each of which includes a tag field for holding portions of the virtual address and a data field for holding a PFN; thus, the TB functions as a cache for storing most-recently-used PTEs. When the processor requires data, the virtual address is provided to the TB and, if there is a match with the contents of the tag field, the virtual address is translated into a physical address which is used to access the data cache. If there is not a match between the virtual address and contents of the tag field, a TB miss occurs. In response to the TB miss, the memory management unit fetches (from cache or memory) appropriate PTE mapping information and loads it into the TB. The read operation for the PTE is generally followed by a subsequent read or write operation for data at the mapped physical address. Handling of TB misses may be implemented in hardware or software; in the latter case, the instructions used to handle the TB miss constitute a TB miss flow routine.
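A TB of this kind can be sketched as a small cache keyed by virtual page number. The class below is illustrative only: the capacity, the 13-bit page offset, the least-recently-used replacement policy, and the hit/miss counters are assumptions, and the `TranslationBuffer` name and its methods are hypothetical. The page table is any mapping from VPN to PFN.

```python
from collections import OrderedDict

class TranslationBuffer:
    """Tiny TB model: tag is the VPN, data is the PFN, LRU replacement."""

    def __init__(self, page_table, capacity=2, offset_bits=13):
        self.page_table = page_table      # backing page table (VPN -> PFN)
        self.capacity = capacity
        self.offset_bits = offset_bits
        self.entries = OrderedDict()      # oldest entry first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn = virtual_address >> self.offset_bits
        offset = virtual_address & ((1 << self.offset_bits) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)              # mark as most recently used
        else:
            self.misses += 1
            # TB miss: fetch the PTE mapping and load it into the TB.
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)       # evict least recently used
        return (self.entries[vpn] << self.offset_bits) | offset
```

Only the miss path touches the page table, which corresponds to the extra PTE read that is followed by the data reference at the mapped physical address.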
In order to increase performance, modern processors do not execute memory reference instructions one at a time. In fact, it is desirable that a processor keep a large number of memory reference operations outstanding and issue, as well as complete, those operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a large number of outstanding references, while preserving the logical order implied by the consistency model.
In prior systems, an MB instruction is typically contingent upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of the data at other processors and, possibly, to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and, when required, the data is received at the source processor.
Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations may be considered. Completion of an operation essentially requires actual completion of all activity, including receipt of data and acknowledgments for probes, corresponding to the operation. The TB miss flow routine described above creates an inter-reference ordering issue if another processor substantially simultaneously creates the PTE that will be read during the routine. For example, consider the computer program example #3 wherein each page contains one thousand (1000) address locations.
Computer program example #3

    P1:                         P2:
    Write @2000, data1          Read R1, @1000
    Write @2001, data2          Read R2, @1001
    MB                          Read R3, @2001
    Write PTE (2000)            Read R4, @3001
                                Read R5, @4002
Processor P1 may be performing write data operations to update the data of a page (e.g., page 2 having addresses 2000-2999) followed by an MB operation and a write PTE operation to update the mapping information in the PTE. The MB instruction imposes order such that creation of the page data is completed before creation of the map of the page so that other processors can access the page data. This creates an inter-reference ordering problem if processor P2 substantially simultaneously attempts to read an address (e.g., address 2001) of the page.
As shown in computer program example #4, P2 issues memory reference operations to addresses 1000 and 1001 until it executes the read operation to address 2001. At this point, P2 incurs a TB miss and jumps to a typical TB miss handler routine. (Note that a TB miss may be handled entirely in hardware or software and that the hardware implementation implicitly performs the equivalent of an MB operation.)

[Computer program example #4: P2's instruction stream with the TB miss flow routine; listing not reproduced]
In the TB miss flow routine, P2 fetches the PTE corresponding to address 2001, loads it into the TB and executes the MB instruction. P2 then returns from the routine to read the data at address 2001 and, thereafter, issues read operations to addresses 3001 and 4002. The MB instruction in the TB miss flow is used to enforce inter-reference ordering between the fetch of the PTE and subsequent read/write operations to addresses within the page corresponding to the PTE. As noted, all pre-MB operations of a typical weak-ordering system can be overlapped prior to reaching the MB instruction with the only requirement being that each of the pre-MB operations complete before passing the MB instruction. Such an arrangement is inefficient and, in the context of the TB miss flow routine, adversely affects latency because the MB unnecessarily constrains references to addresses located in other pages.
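The inefficiency can be made concrete with the page arithmetic of computer program example #3. The sketch below uses the 1000-location page size stated for that example and separates P2's reads into those that fall in the page mapped by the fetched PTE and those that fall in other pages; the function names `page_of` and `split_by_page` are hypothetical.

```python
PAGE_SIZE = 1000  # from example #3: each page holds 1000 address locations

# P2's reads from example #3 as (destination register, address) pairs.
P2_READS = [("R1", 1000), ("R2", 1001), ("R3", 2001), ("R4", 3001), ("R5", 4002)]

def page_of(address):
    """Page number that contains the given address."""
    return address // PAGE_SIZE

def split_by_page(reads, missed_address):
    """Separate reads in the page of the missed address from all other reads."""
    missed_page = page_of(missed_address)
    same_page = [reg for reg, addr in reads if page_of(addr) == missed_page]
    other_pages = [reg for reg, addr in reads if page_of(addr) != missed_page]
    return same_page, other_pages
```

For the TB miss on address 2001, only R3 lies in page 2; R1, R2, R4, and R5 reference other pages, yet a completion-based MB in the miss flow holds all of them to the same ordering requirement.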