The present invention relates to data processing and, more particularly, data processing involving load and store operations. A major objective of the invention is to provide for improved implementation of superword-size atomic load and store operations in a multiprocessor environment.
Much of modern progress is associated with advances in computer technology. Generally, computers have one or more data processors and memory. Each data processor fetches instructions from memory and manipulates data, typically stored in memory, in accordance with the instructions. While a processor typically treats instructions as distinct from data, instructions can be manipulated as data in accordance with other instructions.
A typical data processor includes a general-purpose register file that provides for temporary holding of data loaded from memory, data calculated by the processor, and data to be stored in memory. The general-purpose register file typically includes a number of registers having a common bit width. The common register bit width defines the word size, the largest amount of data typically transferable with one instruction. Early processors had 8-bit registers, but 64-bit registers are presently typical.
In addition to general-purpose registers, some processors have special-purpose registers. For example, the Intel® Itanium® architecture provides a set of application registers including an “ar.csd” register used to ensure compatibility with previous-generation 32-bit processors. Additionally, the Intel Itanium architecture provides an ar.ccv register that is typically used to store values for comparison. For example, this register is used by a word-size (eight-byte) compare and exchange semaphore instruction (cmpxchg8). When this instruction is executed, a word at an explicitly specified memory location is transferred to a specified general-purpose register. The transferred value is then compared with the value in the ar.ccv register. If the values are equal, a value from another specified general-purpose register is stored to the same specified memory address. “Semaphore” refers to a key that grants exclusive access to a section of memory to a processor that holds it. Thus, the semaphore operations are performed atomically; that is, they are done as a single memory operation, and no other memory operations can occur in the middle.
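The compare and exchange behavior described above can be sketched in portable C using the C11 `atomic_compare_exchange_strong` operation. This is only an illustration of the semantics, not the Itanium encoding: the variable `ar_ccv` and the function `cmpxchg8_sketch` are hypothetical names introduced here, and a software variable stands in for the ar.ccv application register.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the ar.ccv application register. */
static uint64_t ar_ccv;

/* Sketch of a word-size compare and exchange: returns the word that was
 * in memory, and stores new_value only if that word equaled ar_ccv.
 * The read, compare, and conditional store happen as one atomic step. */
static uint64_t cmpxchg8_sketch(_Atomic uint64_t *addr, uint64_t new_value)
{
    uint64_t expected = ar_ccv;
    /* On a mismatch, the C11 call writes the observed memory value into
     * 'expected'; on a match, 'expected' already equals the old value. */
    atomic_compare_exchange_strong(addr, &expected, new_value);
    return expected;
}
```

A caller would set `ar_ccv` to the value it expects to find, call `cmpxchg8_sketch`, and compare the returned word with `ar_ccv` to learn whether the store took place.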
Computer processors typically implement memory operations that transfer data between memory and processor with a transfer size equal to or smaller than the size of the processor registers. For example, most modern 64-bit RISC (“reduced instruction set computer”) processors can load and store information from and to memory in units of 64 bits, 32 bits, 16 bits, or 8 bits.
There are performance advantages to providing for superword-size (typically double-word, but generally, anything greater than one word) transfers as well. Modern processors typically provide a wider data path to a level-1 cache than the word-width data path normally provided to the general-purpose registers. For example, a 64-bit processor can have a 128-bit (16-byte) data path to the level-1 cache. For word-size transfers, 64 bits of the level-1 cache data path are selected for transfer. Providing a way to transfer all 128 bits available on the level-1 cache data path into the processor registers can increase the maximum cache-to-register bandwidth, which can, in turn, increase overall performance in some processors.
There are further advantages to superword transfers in the context of multiprocessor systems in which multiple processors may be attempting to read from and write to the same region of memory. For example, in the Intel Itanium architecture, the basic unit of program instruction is the “bundle,” which is 128 bits in size and holds three instructions. A program running on one processor may attempt to modify a bundle in memory, while, at the same time, another processor may attempt to execute that same bundle. The writing processor acts to ensure that, if the executing processor fetches the bundle, it will fetch either the entire old (unmodified) bundle, or the entire newly written bundle, and not a combination of 64 old bits and 64 new bits.
In other words, there is a need for “atomic” multiword transfers. In an “atomic” transfer, the entire access is done as a single inseparable unit, as opposed to multiple, distinct accesses. In practical terms, storing a quantity atomically means that another processor that may be reading the same memory simultaneously sees either all the bytes being stored or none of them, depending upon whether its load of that memory happens to occur after or before the store. Similarly, loading a quantity atomically means that all of the bytes being read are read with no intervening stores by other processors.
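The hazard that atomicity rules out can be illustrated with a small, single-threaded C sketch that replays one unlucky interleaving by hand. The type `bundle128` and the function `torn_read_demo` are hypothetical names used only for this illustration; a real race would involve two processors, whereas here the ordering of the four word-size accesses is simply written out in sequence.

```c
#include <stdint.h>

/* A 128-bit quantity held as two 64-bit words, as it would be when only
 * word-size loads and stores are available. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} bundle128;

/* Replays an interleaving in which the reader's two word-size loads fall
 * between the writer's two word-size stores. Returns what the reader saw. */
static bundle128 torn_read_demo(bundle128 *mem, bundle128 new_value)
{
    bundle128 seen;
    mem->lo = new_value.lo;  /* writer: first word-size store        */
    seen.lo = mem->lo;       /* reader: first load sees the new word */
    seen.hi = mem->hi;       /* reader: second load sees the old word */
    mem->hi = new_value.hi;  /* writer: second word-size store       */
    return seen;
}
```

With memory initially holding an old value and `new_value` differing in both words, the reader in this sketch ends up with the new low word paired with the old high word, which is exactly the mixed result an atomic 128-bit transfer would prevent.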
If only word-sized transfers are permitted, two load instructions are required to read a bundle. To ensure consistency, some sort of memory locking mechanism must be implemented, either in hardware or software, during the read or the write so that the two loads result in consistent data. Unfortunately, such locking mechanisms can be very complex to implement and can cause performance scalability problems as the number of processors in a system increases.
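As a rough sketch of the software variant of such a locking mechanism, the following C code guards a two-word quantity with a C11 `atomic_flag` spinlock so that the pair of word-size accesses is never interleaved with another locked access. The type `locked_pair` and the functions `pair_lock`, `pair_unlock`, `pair_store`, and `pair_load` are hypothetical names, and the sketch assumes every reader and writer cooperates by taking the lock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Two 64-bit words protected by a simple spinlock. */
typedef struct {
    atomic_flag lock;
    uint64_t lo;
    uint64_t hi;
} locked_pair;

static void pair_lock(locked_pair *p)
{
    /* Spin until the flag was previously clear, i.e. we acquired it. */
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void pair_unlock(locked_pair *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Writes both words while holding the lock. */
static void pair_store(locked_pair *p, uint64_t lo, uint64_t hi)
{
    pair_lock(p);
    p->lo = lo;
    p->hi = hi;
    pair_unlock(p);
}

/* Reads both words while holding the lock, so they form a consistent pair. */
static void pair_load(locked_pair *p, uint64_t *lo, uint64_t *hi)
{
    pair_lock(p);
    *lo = p->lo;
    *hi = p->hi;
    pair_unlock(p);
}
```

Every access now pays for acquiring and releasing the lock, and all processors touching the same pair contend for the same flag, which hints at the scalability cost noted above.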
Prior systems have permitted single-instruction superword memory accesses, thus reducing the need for complex locking mechanisms. For example, the transfer can be between memory and a pair of general-purpose registers. However, such transfers typically require additional read and write ports to the general-purpose register file. Such extra ports are costly to implement and adversely impact instruction cycle time (and thus performance). It is generally a poor tradeoff to add such read or write ports unless they will be used by a large percentage of instructions.
Additionally, the two general-purpose registers must somehow be specified. Due to instruction encoding space constraints, it is generally not possible to specify an additional register explicitly in a single instruction. Alternatively, the additional register can be implicitly specified; for example, it can be a register adjacent to the one that is specified. However, implied general-purpose register sources or targets create complexity for software, such as exception-handling software, which must then manage the registers as pairs in situations where the larger operations are used. Accordingly, there remains a need for a system that can efficiently perform superword memory accesses atomically.