1. Field of the Invention
The present invention relates to the field of computer systems, and more specifically, to controlling the timing and order of the execution of memory instructions in certain regions of computer memory to allow for increased processor efficiency.
2. Description of the Related Art
Current computer systems typically employ a hierarchical memory structure to facilitate the processing of memory access operations. This design practice evolved as an approach to minimize the processing delay and accompanying performance penalty that can occur while the processor is waiting for data or instructions (collectively, "memory operands") to be provided from memory, while still supplying a large physical storage space. Even in very sophisticated computers, processing can come to a complete halt (a "stall") until the memory request is serviced. Memory-related stalls may range from a few nanoseconds to several seconds, depending upon the computer's memory design, the storage location of the memory operand, and the memory access method. Designers adopted the hierarchical approach because a properly designed and managed hierarchical memory arrangement can minimize processor stalls and improve performance, without sacrificing system cost, size, or power requirements.
A commonly employed memory hierarchy is shown in FIG. 1. This design approach includes a special, high-speed memory, known as a cache, in addition to main memory and a swap space. In this hierarchical memory system, the main memory is effectively a cache for the much slower swap space storage. While the cache near the CPU is generally controlled by hardware in the CPU, the main memory is a software-controlled cache. In this discussion, unless otherwise specified, the term "cache" will always refer to the hardware-controlled cache, and the term "main memory" will always refer to the software-controlled main memory.
In a hierarchical memory design, memory operands can be dynamically shifted among swap space, main memory, and cache to ensure that the operand is provided to the processor as quickly as possible (i.e., from cache) as often as possible. This memory control function is handled by a memory management system that is commonly implemented in both hardware and software (the "memory manager"). The hardware portion of the memory manager is known as the memory management unit (MMU), and may be implemented within the processor, or it may be a separate component, either integrated on the same silicon as the processor or on an entirely different integrated circuit that interfaces with the processor and memory. FIG. 2 illustrates both the hardware and software portions of a typical memory management system. For clarity, portions of the memory management system not directly relevant to this discussion are not shown.
Referring to FIG. 2, memory management unit (MMU) 122 is shown within processor 120. MMU 122 interfaces with bus interface unit 123, which accesses main memory 131 and the system's I/O devices 132 via the system bus 133. Hardware components of the MMU 122 that are relevant to this discussion include the translation lookaside buffer (TLB) 126 and the TLB miss handler 125. Other memory management system hardware components and software functions include the load queue 127 and the store queue 128, both shown in FIG. 2 as implemented in the processor hardware, and the page miss handler (PMH) 124 and page table 129, shown in FIG. 2 in a typical software implementation.
Extending the hierarchical memory design approach resulted in the concept of "virtual memory," and the separation of a memory operand's "address" from its physical location in memory. Virtual memory is an abstraction that is used to handle three primary tasks in a computer system: (1) memory hierarchy abstraction; (2) memory protection; and (3) memory fragmentation. Virtual memory was originally conceived as a way to apply the hardware-based concept of a hierarchical memory to software; i.e., allowing software to control and use main memory as a cache for the storage space that is implemented in large, slow magnetic or optical media. Using the concept of virtual memory, an operating system (OS) can create the illusion of an accessible memory space that is much larger than the actual main memory size implemented in DRAM in the hardware. The OS maps portions of the external physical storage space (the disk drives or other storage media) into main memory, and then swaps only those blocks of the storage space currently needed into main memory (hence the term "swap space"). Swap space is also sometimes referred to as paging space, and the blocks of memory that are swapped in and out of main memory are often called pages.
Since the most recently accessed pages also contain the memory operands that are likely to be accessed soon, the OS ensures that the most recently accessed pages are available to the processor in relatively fast DRAM. Therefore, the apparent latency of the large amount of accessible memory is quite low, compared to what the latency would be if all accesses went to the hard disk or other external storage media. A programmer can write his program assuming that he has much more memory available than is implemented in DRAM on the system on which the program is running.
The amount of memory apparently available to a program is dependent upon the OS. Although the OS may create a very large amount of virtual memory, it may only allow a particular program to use a portion of the virtual memory space. If the OS supports multitasking, then it must partition and assign the virtual memory space to each of the concurrent processes. Memory protection and fragmentation are two tasks that are a necessary part of the OS's memory partitioning and assignment function. The OS protects a process's assigned memory by ensuring that it cannot be corrupted by another concurrent process. Protecting memory from corruption in this manner increases system reliability.
The OS also keeps track of memory fragments actually used by a process and "stitches together" those fragments such that the process is unaware that it is actually using anything other than a contiguous block of memory. In other words, the OS does not actually assign contiguous blocks of memory to each process, because such an approach would likely result in nonuse of a large portion of assigned memory. Instead, the OS assigns available, non-contiguous fragments of virtual memory, and translates the programmer's view of his address space (sometimes called an effective address space or linear address space) to the address space represented by the assigned fragments. While the programmer's effective address space is contiguous, the virtual addresses that the OS assigns to those effective addresses may not be contiguous. This memory fragmentation function increases system efficiency by ensuring that limited memory space is not wasted by being assigned and protected but unused.
To access a memory operand, the processor must take the programmer's effective address, translate it to a virtual address (the address within the large virtual memory space), and then finally, translate the virtual address into a physical address (the address within the main memory space). The physical address can then be used to access either the hardware controlled SRAM cache or the DRAM main memory.
The primary focus of this discussion is the translation from virtual to physical address. The virtual address consists of a virtual page number, plus an offset. The virtual page number identifies the relevant page of virtual memory, and the offset identifies the storage location of the operand on the page. Similarly, a physical address consists of a page frame number, plus an offset. Like the virtual page number, the page frame number identifies the appropriate block of SRAM or DRAM, and the offset identifies the actual storage location within the identified memory block.
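The split of an address into a page number and an offset can be sketched in a few lines of Python. The 4 KiB page size, the function names, and the page frame number used below are assumptions chosen for illustration; the text does not specify any of them.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (not specified in the text)
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 offset bits for a 4 KiB page


def split_address(address):
    """Split an address into (page number, offset within the page)."""
    return address >> OFFSET_BITS, address & (PAGE_SIZE - 1)


def join_address(page_number, offset):
    """Rebuild an address from a page (or page frame) number and an offset."""
    return (page_number << OFFSET_BITS) | offset


vpn, offset = split_address(0x12345)
# Translation replaces the virtual page number with a page frame number;
# the offset is carried over unchanged.
physical = join_address(0x7, offset)  # 0x7 is a hypothetical page frame number
```

In this sketch only the upper bits of the address are translated, which is why the page table needs one entry per page rather than one per operand.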
The main structure used for the virtual address to physical address translation is the page table (129 in FIG. 2). The page table contains a virtual page number that correlates each page of memory currently in main memory to a corresponding page in the swap space. The OS is responsible for updating the page table when it swaps a new page of memory from the swap space into the main memory. To translate from a virtual to a physical address, the MMU (122 in FIG. 2) looks for the virtual page number in the page table. If the page table contains the virtual page number, then the processor knows that the page of memory containing the desired operand is currently in the main memory.
To speed up the translation process, the most recently accessed page table entries are usually cached in the processor. This translation cache is called a translation lookaside buffer (TLB--126 in FIG. 2). The TLB is commonly controlled in hardware by the MMU 122, but it can be controlled by the OS. The virtual-to-physical address translation process is shown in the context of a memory access in the flow chart in FIG. 3.
Referring to FIG. 3, if a memory operand's virtual page number (generated from the effective address called for by the program) is not located in the TLB, then a TLB miss has occurred and the page table must be accessed to obtain the proper translation (a "page table walk"). The translation will be found in the page table only if the page has been loaded into main memory (a "page hit"). If a page miss occurs, the operand being accessed is not currently mapped into the main memory of the system, and must be retrieved from the swap space. Control passes from the MMU to the OS, which decides which page must be deleted from main memory (swapped out) to make room for the page that contains the operand the processor is trying to access. The OS swaps in the new page, and updates the page table and TLB. Finally, the original program resumes execution, and the operand called for is now located in main memory. The software that handles swapping the pages from the swap space into the main memory is called the page miss handler (PMH--124 in FIG. 2).
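The lookup sequence described above (TLB, then page table walk, then page miss handler) can be sketched as a small Python model. The class and method names, the 4 KiB page size, and the least-recently-used replacement policy are all assumptions made for illustration; the text says only that the OS decides which page to swap out.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class MemoryManager:
    """Toy model of the TLB / page table / page miss handler flow."""

    def __init__(self, num_frames, tlb_size):
        self.tlb = OrderedDict()         # vpn -> frame, oldest entry first
        self.tlb_size = tlb_size
        self.page_table = OrderedDict()  # resident pages: vpn -> frame, LRU first
        self.free_frames = list(range(num_frames))
        self.events = []                 # trace of hits and misses

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.events.append("tlb hit")
            self.tlb.move_to_end(vpn)
            self.page_table.move_to_end(vpn)
            frame = self.tlb[vpn]
        else:
            self.events.append("tlb miss")
            if vpn in self.page_table:   # page table walk finds the page
                self.events.append("page hit")
                self.page_table.move_to_end(vpn)
            else:                        # page must come from swap space
                self.events.append("page miss")
                self._page_miss_handler(vpn)
            frame = self.page_table[vpn]
            self._tlb_fill(vpn, frame)
        return frame * PAGE_SIZE + offset

    def _page_miss_handler(self, vpn):
        if not self.free_frames:
            # Assumed policy: swap out the least recently used page.
            victim, frame = self.page_table.popitem(last=False)
            self.tlb.pop(victim, None)   # drop any stale translation
            self.free_frames.append(frame)
        self.page_table[vpn] = self.free_frames.pop(0)

    def _tlb_fill(self, vpn, frame):
        if len(self.tlb) >= self.tlb_size:
            self.tlb.popitem(last=False)
        self.tlb[vpn] = frame


mm = MemoryManager(num_frames=2, tlb_size=1)
for vaddr in (0x0010, 0x0020, 0x1000, 0x0030, 0x2008):
    mm.translate(vaddr)
```

With two page frames and a one-entry TLB, the five accesses above exercise every path: a TLB hit, a TLB miss that is a page hit, and page misses with and without a swap-out.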
Along with the virtual to physical address translation, the page table generally contains additional information in order to support memory protection. For instance, a page may be marked as read-only or read-write, or a certain privilege level may be required to access data in a page. This protection and permission information is also included in the TLB entries stored in the TLB.
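A minimal sketch of how protection information carried in a TLB entry might be checked on each access follows. The field names and the numeric privilege encoding are hypothetical; the text states only that read-only/read-write status and a required privilege level may be recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TlbEntry:
    frame: int          # page frame number
    writable: bool      # False means the page is read-only
    min_privilege: int  # hypothetical encoding: 0 = user, 1 = supervisor


def access_allowed(entry, is_write, privilege):
    """Return True if an access at the given privilege level is permitted."""
    if privilege < entry.min_privilege:
        return False
    if is_write and not entry.writable:
        return False
    return True


read_only_page = TlbEntry(frame=3, writable=False, min_privilege=0)
supervisor_page = TlbEntry(frame=4, writable=True, min_privilege=1)
```

Keeping these bits in the TLB entry means the check can be made during translation, without a separate page table access.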
Since data is commonly provided to the processor from cache, rather than from the main memory, updates to memory operands must be accompanied by a corresponding update in cache, either when the update occurs, or when a miss is indicated. This dual-update requirement is referred to as data coherency.
Systems with multiple memory owning clients must generally be designed in such a manner as to maintain data coherency between all the clients, and also to track ownership of memory to prevent the clients from corrupting data written by another client. (For the purposes of this discussion, any device that can be granted temporary ownership of regions of memory is considered to be a "memory owning client." While memory owning clients will most likely be processors, they could also be intelligent peripheral controllers that can be granted temporary ownership of regions of memory.) Hardware designers utilizing multiple memory owning clients and hierarchical memory must ensure that each client receives proper data updates, that memory operands are owned (i.e., can be overwritten) by just one client at a time, and that the ownership of a memory operand is transferred from one client to another in the proper order. Data coherency and memory ownership are simple concepts, and have been addressed in the prior art by various update and ownership protocols, the most popular of which is the MESI protocol. However, the order in which ownership is transferred and access is granted is a much more complex issue that existing protocols do not address. Controlling and enforcing memory order is the subject of the present invention. (The term "memory order" is sometimes used to describe endian-ness as well as the order in which memory transactions are processed. In this discussion, memory order does not describe or relate to endian-ness.)
Often memory accesses do not actually occur when the program thinks they occur. Due to varying latencies in the memory subsystem, memory accesses may be performed out of order, and may be perceived by the various clients as occurring at different times. For example, in a common sequence of events, an operand writing client thinks a memory operand is updated at time N, the memory system thinks it is updated at time N+M, and another client thinks it is updated at time N+M+L. If the second client reads the operand sometime between N and N+M+L, it may read a stale value. The second client's memory accesses have thus occurred out of order.
In general, as long as each client observes the same sequence of updates to a memory operand, the fact that the clients observe the operand updates occurring at different times is ordinarily not problematic. (There is an exception to the rule that each client must observe the same sequence of updates: a line owner might update a line a number of times, but only inform the memory system or other clients after the final update. However, as far as data coherency is concerned, such behavior is still considered correct.) The problem occurs when multiple clients read two different memory locations and either the elapsed time between the updates of the two locations or the relative order of the updates of the two locations is important. Due to the system latencies, each client observes a different update time for each location and a different relative update time between the two locations. This can result in a client observing the updates as occurring in the opposite of the order in which they actually occur. In other words, if two clients, X and Y, read two lines, A and B, it may be possible for X to observe A being updated first while Y observes B being updated first. Alternatively, either client could read the wrong value for A or B if either the read or the update operation is delayed and the sequence is performed out of order.
To illustrate how this can happen, imagine client X writing line A and then line B at times N and N+1 respectively. X believes the lines are updated in the order A, B. If line A requires five units of time to appear in Y's cache, and line B requires just two units of time, then Y will observe line A being updated at time N+5 and line B being updated at time N+1+2. From Y's perspective, line B is updated first, and Y now thinks the lines were updated in the order B, A. If Y reads A and B at time N+4, Y will read the correct value for B but a stale value for A. Programs in which the order of data updates matters require a memory model called "strong," meaning that memory operations performed by multiple clients must occur in the same order in which the programs running on the clients call for the operation.
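The arithmetic in this example can be checked with a short Python sketch. The variable and function names are illustrative only; the write times and latencies are the ones given above, with N set to 0 for convenience.

```python
def observed_value(write_time, latency, read_time):
    """Return 'new' if the update has reached the reader by read_time."""
    return "new" if read_time >= write_time + latency else "stale"


N = 0  # arbitrary starting time
# line -> (time at which X writes it, time units needed to reach Y's cache)
writes = {"A": (N, 5), "B": (N + 1, 2)}

# Time at which Y observes each update: A at N+5, B at N+3.
arrival = {line: t + latency for line, (t, latency) in writes.items()}

# Order in which Y sees the updates, earliest first.
order_seen_by_y = sorted(arrival, key=arrival.get)

# Y reads both lines at time N+4.
read_at = N + 4
seen = {line: observed_value(t, latency, read_at)
        for line, (t, latency) in writes.items()}
```

Here `order_seen_by_y` comes out as B before A, and a read at N+4 finds B updated but A stale, matching the description.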
We can understand memory ordering by observing the three possible ways in which different processors may observe updates in different orders, causing a failure in a program or configuration that requires a strong memory model:
Case 1                    Case 2                    Case 3
Processor1  Processor2    Processor1  Processor2    Processor1  Processor2
Write_A     B_new         Write_A     Write_B       B_new       A_new
Write_B     A_old         B_old       A_old         Write_A     Write_B
In this table, Write_A indicates that a processor updates line A and thinks line A contains the updated value. Likewise, Write_B indicates that a processor updates line B and thinks line B contains the updated value. B_new indicates that a processor reads line B and finds it to contain the updated value. Similarly, A_old indicates that a processor reads line A before it knows that line A has been updated, and thus reads a stale value.
Consider a program that includes an operation on A and B, executes that operation using both Processor1 and Processor2, and requires both Processor1 and Processor2 to execute the operation using the updated values. The program requires a strong memory model to function properly. Case 1 shows the simplest failure sequence, where Processor1 performs the write operations in the order A, B, and therefore thinks that A and B have been updated in the order A, B. However, Processor2 observes the lines being updated in the wrong order and reads both lines before it sees a new value for A. Under this scenario, Processor2 will improperly execute its part of the program operation using a stale value for A.
Case 2 shows a sequence where Processor1 writes a new value for A and Processor2 writes a new value for B. However, Processor1 then reads B before it knows that B has been updated, and Processor2 reads A before it knows that A has been updated. Processor1 will perform its part of the program operation using an updated value for A but a stale value for B, while Processor2 will perform its part of the program operation using an updated value for B and a stale value for A.
Case 3 illustrates a slightly different failure mechanism. In Case 3, both processors are running a program in which they are to read one value, and then update the other. Processor1 reads B and then updates A. Processor2 reads A and then updates B. In the failure illustrated in Case 3, the read operation in both processors experiences a queue delay, but the writes proceed. By the time the reads are actually performed, the values have been updated and the "old" value--the correct value for the read operation--has been overwritten. Had the Case 3 memory operations been strongly ordered, the updates would not have occurred until the reads had been completed.
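Case 2 can be reproduced with a toy Python model in which each processor's writes wait in a private queue before becoming visible to the other processor. The queue is a hypothetical mechanism introduced here for illustration; the text attributes the failure only to varying latencies, and all names below are assumptions.

```python
class Processor:
    """Toy processor whose writes are delayed in a private queue."""

    def __init__(self, memory):
        self.memory = memory  # shared dict: line -> value
        self.pending = []     # writes issued but not yet visible to others

    def write(self, line, value):
        self.pending.append((line, value))

    def read(self, line):
        # A processor always sees its own most recent pending write.
        for pending_line, value in reversed(self.pending):
            if pending_line == line:
                return value
        return self.memory[line]

    def drain(self):
        """Make all pending writes visible in shared memory."""
        for line, value in self.pending:
            self.memory[line] = value
        self.pending.clear()


def run_case_2(drain_before_read):
    memory = {"A": "old", "B": "old"}
    p1, p2 = Processor(memory), Processor(memory)
    p1.write("A", "new")  # Processor1: Write_A
    p2.write("B", "new")  # Processor2: Write_B
    if drain_before_read:  # stand-in for enforcing strong ordering
        p1.drain()
        p2.drain()
    result = (p1.read("B"), p2.read("A"))
    p1.drain()
    p2.drain()
    return result


weak_result = run_case_2(drain_before_read=False)
strong_result = run_case_2(drain_before_read=True)
```

Without the drain, both processors read stale values (B_old and A_old), as in Case 2; forcing the writes to become visible before the reads yields the updated values on both sides.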
Not every computer program or operation requires a strong memory model to function correctly, and thus other memory models exist. The "weak" memory model places very few requirements on the order of operand updates. Typically, a weak memory model requires the sequence of physical updates to any individual memory location to occur in the same order in which the program on the processor calls for updates, but places no requirements on the interrelationship of order between different memory locations. There are a host of intermediate models too numerous to discuss here.
The stronger the memory model, the poorer the system performance. Strong memory models allow much less freedom in the way in which memory accesses are performed. For example, DRAM components have columns and banks, and when one of these is "open," further accesses to it can occur at a much greater rate. An optimized memory configuration will attempt to perform transactions to DRAM in an order which corresponds to the addresses most quickly accessed, not in the order in which requests are made. A weak memory model allows great freedom in reordering requests. A strong memory model allows little if any freedom in reordering requests.
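The performance effect of reordering can be illustrated with a rough cost model in Python. The cost figures and the notion of a single "open" region are assumptions made for this sketch; real DRAM timing is considerably more involved.

```python
OPEN_COST = 1    # assumed cost when the targeted region is already open
SWITCH_COST = 3  # assumed cost when a different region must be opened first


def total_cost(regions):
    """Cost of servicing requests in the given order, one region open at a time."""
    cost, open_region = 0, None
    for region in regions:
        cost += OPEN_COST if region == open_region else SWITCH_COST
        open_region = region
    return cost


def reorder(regions):
    """Group requests by region, keeping regions in first-appearance order."""
    first_seen = {}
    for index, region in enumerate(regions):
        first_seen.setdefault(region, index)
    # sorted() is stable, so requests to the same region keep their order.
    return sorted(regions, key=first_seen.get)


requests = [0, 1, 0, 1, 0, 1]            # requests alternate between two regions
in_order_cost = total_cost(requests)     # order required by a strong model
reordered_cost = total_cost(reorder(requests))  # order permitted by a weak model
```

Under these assumed costs, servicing the requests as issued costs 18 units, while grouping them by region costs 10, which is the kind of freedom a weak memory model permits.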
Most software running today was written long before either multiprocessor configurations or out-of-order execution of memory accesses became popular, and therefore without regard to any specific memory model. As a result, it is difficult to determine if a program, or even if an operating system, will operate in a multiclient system with a weak memory ordering model. The practical result is that in a multiclient system, the strong memory model is generally required as the default memory model, with the option to "weaken" it for newer applications. Practical implementations must support either a variety of memory models, or forfeit performance and operate only in the most restrictive model. Therefore, it would be desirable to employ a memory management approach that dynamically determines when the memory model can be relaxed, and enforces a strong memory model only at those times and for those regions of memory in which a strong memory model is required to avoid program failures.
For example, if a memory region is known to be read-only memory, then there is no need to enforce strong memory ordering because no client can write updates to that region of memory, and data in read-only memory is never stale. Similarly, if a region of memory is known to be accessible by just one CPU, then that region of memory can be treated as weakly ordered because no other client will ever see out-of-order updates. Even if a memory region is shared by multiple clients, that region can be weakly ordered if no line in the region is ever shared by multiple clients. If a region contains lines that are shared by multiple clients, but no line is immediately loadable into more than one client's cache, then that region can be weakly ordered. Finally, if it is known that a program is not sensitive to memory ordering, then data at memory locations accessed by that program can be handled in accordance with a weakly ordered memory model. There are other instances, not described here, where relaxed memory ordering is possible.
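The conditions listed above can be collected into a single decision function. This is a minimal sketch: the attribute names are hypothetical, the defaults are chosen conservatively, and it covers only the cases named in this paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    # Defaults are conservative: nothing is assumed that would permit relaxing.
    read_only: bool = False                # no client can write the region
    single_cpu: bool = False               # accessible by just one CPU
    lines_shared: bool = True              # some line is shared by several clients
    line_in_multiple_caches: bool = True   # a line can be loaded into more than one cache
    program_order_insensitive: bool = False  # program is known not to care about order


def memory_model(region):
    """Return 'weak' if any listed relaxation condition holds, else 'strong'."""
    if (region.read_only
            or region.single_cpu
            or not region.lines_shared
            or not region.line_in_multiple_caches
            or region.program_order_insensitive):
        return "weak"
    return "strong"


default_model = memory_model(Region())
rom_model = memory_model(Region(read_only=True))
```

A region about which nothing is known stays strongly ordered, consistent with treating the strong model as the default.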
On the other hand, if a system includes multiple processors that share data under conditions such that one of the ordering failures discussed above could occur, that data must be handled according to a strong memory model. Likewise, in systems with one or more processors capable of performing out-of-order or speculative instructions, reordered or speculative loads will be sent to memory by such processors in unusual orders. Therefore, these loads cannot be sent to regions of memory in which a strong memory model is being enforced.
The present invention discloses a method and apparatus that identify regions of memory that can be weakly ordered or must be strongly ordered and enforce the appropriate memory model in those regions of memory. Using extensions to the TLB entry, the present invention dynamically associates one of two different memory ordering models for executing memory operations with specific pages of memory. Such identification and memory model enforcement allows for more efficient execution of memory instructions in cases where memory instructions can be executed out of order. An initial memory model is associated with each page of memory in the page table utilized in a system with a hierarchical memory design. The memory manager enforces and updates the memory model by constructing a TLB entry and loading the TLB with the memory model appropriate for each page during TLB updates. In the preferred embodiment, the TLB is a global TLB. Alternatively, the TLB may comprise either multiple distributed TLBs with shared knowledge, each assigned to a different processor, or a combination of multiple local TLBs, each assigned to a different processor, that exchange information with a global TLB.
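The idea of carrying a per-page memory model in the TLB entry can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the type names, the field layout, and the choice of the strong model as the initial default are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryModel(Enum):
    STRONG = 0
    WEAK = 1


@dataclass
class PageTableEntry:
    frame: int
    model: MemoryModel = MemoryModel.STRONG  # assumed initial model: most restrictive


@dataclass(frozen=True)
class TlbEntry:
    frame: int
    model: MemoryModel  # the extension: ordering model for this page


def build_tlb_entry(pte):
    """Construct a TLB entry that carries the page's current memory model."""
    return TlbEntry(frame=pte.frame, model=pte.model)


def may_reorder(tlb, vpn):
    """Memory operations to a page may be reordered only if it is weakly ordered."""
    return tlb[vpn].model is MemoryModel.WEAK


page_table = {
    0x10: PageTableEntry(frame=3),
    0x11: PageTableEntry(frame=4, model=MemoryModel.WEAK),
}
tlb = {vpn: build_tlb_entry(pte) for vpn, pte in page_table.items()}
before = may_reorder(tlb, 0x10)

# The memory manager later decides page 0x10 can be relaxed and reloads the TLB.
page_table[0x10].model = MemoryModel.WEAK
tlb[0x10] = build_tlb_entry(page_table[0x10])
after = may_reorder(tlb, 0x10)
```

Because the model travels with the translation, the ordering decision in this sketch is available at the same point in the access path as the physical address itself.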