The present invention relates to memory access operations in computer systems. More specifically, the present invention relates to atomic memory update operations typically used to access semaphores.
In computer systems, it is common for two or more processes to contend for the same resource. For example, two or more processes may attempt to write a particular sequence of commands to a video controller. The processes may be executed by a single central processing unit (CPU), or may be executed by two or more CPUs in a multi-processor computer system. The terms CPU and processor will be used herein interchangeably.
Since the processes cannot access the resource at the same time, the operating system of the computer must provide some mechanism to schedule access to the resource. One common mechanism known in the art is the “take-a-number” scheduling algorithm. This algorithm is somewhat analogous to a group of customers that wish to be serviced by a single store clerk. When a customer enters the store, the customer takes a number. When the clerk calls that number, the customer is serviced by the clerk.
Using this analogy, the mechanism that provides the “number” to the process is known in the art as a semaphore. Typically, a semaphore is stored in a memory location. A process seeking to access the semaphore first reads the memory location, increments the value read from the memory location, and stores the result back in the memory location. The value read from the memory location acts as the “number” for the process, and the result written back to the memory location acts as the next “number” for the next process that attempts to access the resource. When the operating system indicates that the holder of a particular “number” may access the resource, the process holding that “number” does so.
For the “take-a-number” scheduling algorithm to operate correctly, it is critical that the memory read, increment, and memory write operations occur “atomically”. In other words, there must be no chance that a second process can read the memory location holding the semaphore between the point at which the first process reads the memory location and the point at which the first process writes the incremented value back to the memory location. If such a read operation by the second process occurred, then the first and second processes would each have the same “number”, and may try to access the resource concurrently.
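By way of illustration only, the “take-a-number” algorithm described above may be sketched in C using the C11 atomic operations. The type `ticket_lock` and the functions `ticket_init`, `take_a_number`, `may_enter`, and `done` are hypothetical names introduced for this sketch and do not appear in the description above; the sketch assumes a compiler that provides `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical ticket dispenser; next_ticket plays the role of the semaphore. */
typedef struct {
    atomic_uint next_ticket;  /* "number" handed to the next arriving process */
    atomic_uint now_serving;  /* "number" currently allowed to use the resource */
} ticket_lock;

void ticket_init(ticket_lock *l) {
    assert(l != NULL);
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Read the semaphore and increment it as one atomic read-modify-write.
   The value read is the caller's "number". */
unsigned take_a_number(ticket_lock *l) {
    assert(l != NULL);
    return atomic_fetch_add(&l->next_ticket, 1);
}

/* Nonzero when the holder of my_number may access the resource. */
int may_enter(ticket_lock *l, unsigned my_number) {
    assert(l != NULL);
    return atomic_load(&l->now_serving) == my_number;
}

/* Called by the current holder when it is finished with the resource. */
void done(ticket_lock *l) {
    assert(l != NULL);
    atomic_fetch_add(&l->now_serving, 1);
}
```

Because the read and the increment in `take_a_number` form a single atomic operation, two callers cannot receive the same “number”. If the read, increment, and write were instead three separate steps, the duplicate-number hazard described above could arise.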
Ensuring that semaphore operations occur atomically is relatively simple in a single CPU computer system in which no other devices coupled to the bus perform direct memory access (DMA) operations. For example, the 32-bit Intel® architecture (IA-32), which is used by the Intel® i486™, Pentium®, Pentium® Pro, Pentium® II, and Celeron™ CPUs, includes the “exchange and add” (XADD) instruction. When using this instruction to access a memory location containing a semaphore, the XADD instruction is typically used as follows:
XADD destination memory location, source register
This instruction stores the sum of the values contained in the destination memory location and the source register in a temporary register, stores the contents of the destination memory location in the source register, and stores the contents of the temporary register in the destination memory location. Accordingly, if the value “1” is stored in the source register when the instruction is executed, then when the instruction is completed the value in the destination memory location will be incremented by “1” and the value originally in the destination memory location will be stored in the source register. Since an interrupt will not be processed until an instruction is complete and the computer system in this example has a single CPU (and no other devices are performing DMA operations), no other process can access the semaphore during the read-modify-write operation performed by the XADD instruction. Accordingly, the semaphore operation occurs atomically. The IA-32 exchange (XCHG) instruction and compare and exchange (CMPXCHG) instruction are also commonly used to ensure atomic access to semaphores.
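The three steps performed by the XADD instruction can be modeled in C as follows. This is a sketch of the data movement only: the function `xadd_model` is a hypothetical name, the 32-bit operand width is an assumption, and the C function is not itself atomic. In the single CPU example above, atomicity comes from the fact that the hardware completes the instruction before servicing an interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models "XADD destination memory location, source register".
   dest stands for the memory location, src for the source register. */
void xadd_model(uint32_t *dest, uint32_t *src) {
    assert(dest != NULL && src != NULL);
    uint32_t temp = *dest + *src;  /* sum is stored in a temporary register */
    *src = *dest;                  /* original destination goes to the source register */
    *dest = temp;                  /* sum is stored in the destination memory location */
}
```

For example, with a semaphore value of 7 in the destination and “1” in the source register, the destination holds 8 afterward and the source register holds 7, which is the “number” obtained by the process.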
In multi-processor computer systems and systems having devices that perform DMA operations, assuring atomicity is more complex because it is possible that a second CPU or device may attempt to access the semaphore before the first CPU increments and writes the semaphore back to the memory location. In such computer systems, atomicity is provided either by a bus lock mechanism or a cache coherency mechanism. Before discussing these mechanisms in detail, it is helpful to first consider the operation of CPU cache memories.
Cache memories are relatively small and fast memories that hold a subset of the contents of main memory. For example, a computer system based on a Pentium® II CPU has a level one (L1) cache on the same integrated circuit (IC) as the CPU, and a level two (L2) cache on the same module as the CPU, but on a separate IC. The L1 cache is smaller and faster than the L2 cache. Main memory contents are stored in cache memories in units called cache lines. The cache line size of the L1 and L2 caches in a Pentium® CPU is 32 bytes.
The Intel® i486™ CPU uses a “write-through” L1 cache. In such a cache, a memory write from the CPU is written to the cache and main memory concurrently. Beginning with the Intel® Pentium® CPU, Intel® processors provide support for “write-back” caches. In a write-back cache, a memory write from the CPU is only written to the cache. The cache mechanism then determines whether (and when) the memory write is actually committed to main memory. This increases performance because the write to main memory can be deferred until main memory is not busy. In addition, it is possible that the memory operand may change several times before it is necessary to write the memory operand back to main memory. Also, it provides an opportunity for a cache to assemble a complete cache line of changes before writing the cache line back to memory, which is known in the art as coalescing.
Cache coherency mechanisms ensure that memory contents stored in CPU caches and main memory remain coherent. For example, if the cache of a first CPU contains a cache line having changed (or “dirty”) contents that have not been written back to main memory, and a second CPU attempts to read the corresponding memory location from main memory, the cache coherency mechanism ensures that the second CPU is provided with the correct contents from the cache of the first CPU, not the incorrect contents currently stored in main memory. The cache coherency mechanism can accomplish this in several ways. One technique is to simply force the cache of the first CPU to write the changed cache line back to main memory. Another technique allows the cache of a second CPU to “snoop” changes to the cache of the first CPU, thereby allowing the second CPU cache to be continually updated with the changes made in the first CPU cache.
Furthermore, a CPU can request that a cache line be loaded as “shared” or “exclusive”. A shared cache line cannot be changed by the CPU, and therefore is advantageously used in situations where it is known that the contents of the cache line will not be changed (e.g., program code). An exclusive (or alternatively, “private”) cache line can be changed by the CPU. Typically, a “dirty bit” is associated with an exclusive cache line to indicate if the contents have changed. If the dirty bit is set to indicate that the cache line has changed, the cache line must be written back to main memory. If the dirty bit is cleared to indicate that the cache line has not changed, the cache line can be discarded without being written back to main memory. Typically only one CPU can hold a particular cache line as exclusive at any given time.
Returning to the topic of atomicity, early IA-32 CPUs provide atomicity by storing semaphores in non-cacheable memory or memory cached using the write-through method, and by issuing a “bus lock” when accessing the semaphore. A bus lock ensures that a single CPU has exclusive ownership of the bus during the read-modify-write transactions required by a semaphore operation. This method exacts a rather heavy performance penalty since all other CPUs are blocked from accessing the bus during the pendency of the read-modify-write transaction, even though the other CPUs may not need to access the region of memory containing the semaphore. Note that in high-end multi-processor systems employing a variety of interconnection fabrics, the notion of a “bus” and therefore a “bus lock” may disappear entirely. For example, in a multi-processor system having pods comprised of four processors, with each of the processors in a pod coupled via a conventional bus, and with each of the pods interconnected via a ring topology, a CPU in one pod will typically not be able to lock the bus in another pod.
Later IA-32 CPUs provide atomicity via the cache coherency mechanism. When a CPU accesses a semaphore, the L1 cache of the CPU requests exclusive use of a cache line that includes the memory location holding the semaphore. Therefore, the CPU can perform the read-modify-write transaction required by the semaphore operation without the possibility that another CPU can access the semaphore during the transaction. Accordingly, other CPUs can continue to access the bus, and therefore memory. In essence, an “in-cache” atomic update is performed via an “address lock”, since the only region of main memory not accessible to the other CPUs is the cache line held as exclusive in the cache of the CPU performing the semaphore operation. Note that since the whole cache line is held as exclusive, it is often desirable to not store multiple semaphores in a single cache line.
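One way software can avoid placing multiple semaphores in a single cache line is to align each semaphore to the cache line size, as in the following C11 sketch. The constant `CACHE_LINE_SIZE`, the type `padded_semaphore`, the array `semaphores`, and the helper `same_cache_line` are hypothetical names, and the value 64 is an assumption chosen for illustration; the actual line size depends on the CPU (32 bytes in the Pentium® example above).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumed line size; adjust for the target CPU */

/* Aligning the member forces the struct's size and alignment up to one
   full cache line, so adjacent array elements never share a line. */
typedef struct {
    alignas(CACHE_LINE_SIZE) atomic_uint value;
} padded_semaphore;

static_assert(sizeof(padded_semaphore) == CACHE_LINE_SIZE,
              "each semaphore should occupy exactly one cache line");

padded_semaphore semaphores[4];

/* Nonzero if two addresses fall within the same cache line. */
int same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE_SIZE == (uintptr_t)b / CACHE_LINE_SIZE;
}
```

The cost of this layout is memory: most of each cache line is unused padding, which is the inefficiency that arises when atomicity depends on exclusive ownership of a whole line.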
While providing atomicity via cache coherency provides much better performance than providing atomicity via bus locks, “semaphore cache line thrashing” can still limit performance. Semaphore cache line thrashing occurs when two or more CPUs continually compete for the same resource, and therefore the same semaphore. Accordingly, each CPU continually tries to obtain exclusive control over the cache line containing the semaphore, resulting in the cache line being continually loaded into and written out of each CPU's cache. Typically, while a CPU is waiting to gain exclusive access to a cache line containing a semaphore, the CPU cannot make progress.
In the prior art, some large multi-processor systems have addressed this problem using a “fetch and add” instruction (FETCHADD). The “increment” operation associated with the FETCHADD instruction is exported to a centralized location, such as a memory controller. Accordingly, when a CPU executes a FETCHADD instruction referencing a semaphore stored in a memory location, the memory controller provides the semaphore value stored in the memory location to the CPU. Furthermore, the memory controller increments the semaphore and stores the result back in the memory location. Therefore, the CPU never needs to acquire exclusive access to the cache line containing the semaphore because the CPU never needs to write the memory location containing the semaphore, thereby eliminating semaphore cache line thrashing. In addition, it is possible to store semaphores in memory more efficiently, since more than one semaphore can exist within a cache line boundary without incurring a performance penalty.
In the computer industry, there is a continuing positive trend toward high-performance hardware. However, there is also a somewhat conflicting positive trend toward low-cost “off-the-shelf shrink-wrapped” operating systems (and other software) that can execute on a wide variety of hardware architectures, including hardware architectures that provide atomicity via bus locks, cache coherency mechanisms, and exportation of instructions designed to provide atomic semaphore updates. However, prior art methods of providing atomicity generally assume that the software is “aware” of the method by which atomicity is provided. Accordingly, software designed to access semaphores using bus locks will not be able to take advantage of the superior semaphore performance provided by cache coherency mechanisms and exportation of instructions designed to provide atomic semaphore updates. Similarly, software designed to access semaphores using cache coherency mechanisms will not be able to take advantage of the superior semaphore performance provided by exportation of instructions designed to provide atomic semaphore updates. What is needed in the art is a computer architecture that allows low-cost “off-the-shelf shrink-wrapped” software to access the highest performing atomic update method provided by the computer system hardware on which it is executing, without the software having to be explicitly coded to exploit particular atomic update methods.
The present invention provides a 64-bit architectural framework in which IA-32 instructions requiring bus locks will execute efficiently on computer hardware that provides superior methods of providing atomicity. In addition, the present invention provides an architectural framework that defines an exportable 64-bit fetch and add (FETCHADD) instruction, which can be coded into “off-the-shelf shrink-wrap” software, and a programmable method by which the hardware can ensure atomicity in executing the FETCHADD instruction by exporting the instruction, or by using a cache coherency mechanism.
In the IA-32 instruction set, the LOCK prefix can be prepended to the following instructions, and only to those forms of the instructions that access a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. In accordance with the present invention, a CPU includes a default control register that includes an IA-32 lock check enable bit (LC). When the LC bit is set to “1”, and an IA-32 atomic memory reference requires a read-modify-write operation external to the processor under an external bus lock (i.e., the instruction includes the LOCK prefix), an IA-32 intercept lock fault is raised, and an IA-32 intercept lock fault handler is invoked. The fault handler examines the IA-32 instruction that caused the interruption and branches to appropriate code to atomically emulate the instruction. Accordingly, the present invention allows a computer system having a 64-bit architecture in accordance with the present invention to maintain binary compatibility with IA-32 instructions, while maintaining the superior performance provided by the 64-bit architecture by not locking the bus.
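The two conditions described above, namely which instructions may carry the LOCK prefix and when the intercept lock fault is raised, can be expressed as simple predicates. The following C sketch is a software model only. The function names `accepts_lock_prefix` and `raises_intercept_lock_fault` are hypothetical, and representing instructions by mnemonic strings is an assumption made for readability; actual hardware decodes opcodes rather than names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Instructions listed above as accepting the LOCK prefix. */
static const char *const lockable[] = {
    "ADD", "ADC", "AND", "BTC", "BTR", "BTS", "CMPXCHG", "DEC", "INC",
    "NEG", "NOT", "OR", "SBB", "SUB", "XOR", "XADD", "XCHG"
};

/* True only for a listed instruction in a form that accesses memory. */
bool accepts_lock_prefix(const char *mnemonic, bool has_memory_operand) {
    assert(mnemonic != NULL);
    if (!has_memory_operand)
        return false;
    for (size_t i = 0; i < sizeof lockable / sizeof lockable[0]; i++) {
        if (strcmp(mnemonic, lockable[i]) == 0)
            return true;
    }
    return false;
}

/* The fault is raised when LC is 1 and the atomic memory reference would
   require a read-modify-write under an external bus lock. */
bool raises_intercept_lock_fault(bool lc_bit, bool needs_external_bus_lock) {
    return lc_bit && needs_external_bus_lock;
}
```

When the second predicate is true, the fault handler would examine the faulting instruction and branch to code that emulates it atomically without locking the bus; that emulation code is outside the scope of this sketch.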
Furthermore, the present invention defines an exportable fetch and add instruction having the following format:
FETCHADD R1=[R3], INC
This instruction reads the memory location indexed by register R3, places the contents read from the memory location in register R1, adds the value INC to the contents read from the memory location, and stores the sum back in the memory location.
Associated with each virtual memory page is a memory attribute that can assume a state of “cacheable using a write-back policy” (WB), “uncacheable” (UC), or “uncacheable and exportable” (UCE). When a FETCHADD instruction is executed and the memory location accessed is in a page having an attribute set to WB, the FETCHADD instruction is atomically executed by the CPU by obtaining exclusive use of the cache line containing the memory location. However, when a FETCHADD instruction is executed and the memory location accessed is in a page having an attribute set to UCE, the FETCHADD instruction is atomically executed by exporting the FETCHADD instruction to a centralized location, such as a memory controller, thereby eliminating semaphore cache line thrashing.
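The selection of an atomic update method by page attribute, together with the FETCHADD data movement, can be modeled in C as shown below. This is a single-threaded software model and not a hardware implementation. The names `mem_attr`, `update_path`, `page_model`, and `fetchadd_model` are hypothetical, and modeling a page as a single 64-bit word is an assumption made for brevity. Because the behavior for a UC page is not stated in this passage, the model reports that case as unhandled and leaves memory unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ATTR_WB, ATTR_UC, ATTR_UCE } mem_attr;
typedef enum { PATH_NONE, PATH_CACHE_EXCLUSIVE, PATH_EXPORTED } update_path;

/* Hypothetical model of a page holding one semaphore. */
typedef struct {
    mem_attr attr;  /* memory attribute of the page */
    uint64_t word;  /* the semaphore stored in the page */
} page_model;

/* Models FETCHADD R1=[R3], INC. Writes the value read to *r1 and returns
   the method that would provide atomicity for this page. */
update_path fetchadd_model(page_model *page, int64_t inc, uint64_t *r1) {
    assert(page != NULL && r1 != NULL);
    update_path path;
    switch (page->attr) {
    case ATTR_WB:   /* CPU obtains exclusive use of the cache line */
        path = PATH_CACHE_EXCLUSIVE;
        break;
    case ATTR_UCE:  /* instruction is exported, e.g. to a memory controller */
        path = PATH_EXPORTED;
        break;
    default:        /* UC: not described in this passage, left unhandled */
        return PATH_NONE;
    }
    *r1 = page->word;             /* contents read are placed in R1 */
    page->word += (uint64_t)inc;  /* sum is stored back in the memory location */
    return path;
}
```

The caller's code is identical for both attributes; only the attribute, which software such as the operating system sets per page, determines which path the hardware takes.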
Accordingly, the present invention provides an architectural framework in which “off-the-shelf shrink-wrap” software can be encoded with semaphores accessed by FETCHADD instructions, even though the software “does not know” whether atomicity will be provided by the cache coherency mechanism, or by exporting the FETCHADD instruction to a centralized location, such as a memory controller. Therefore, such software will be able to access the fastest method of providing atomic update operations available on the computer hardware, without the software requiring individual code segments for each method.