1. Field of the Invention
This invention is related to the field of processors and, more particularly, to load/store units within processors.
2. Description of the Related Art
Processors are increasingly designed using techniques that raise the number of instructions executed per second. Superscalar techniques involve providing multiple execution units and attempting to execute multiple instructions in parallel. Pipelining, or superpipelining, techniques involve overlapping the execution of different instructions using pipeline stages. Each stage performs a portion of the instruction execution process (involving fetch, decode, execution, and result commit, among others), and passes the instruction on to the next stage. While each instruction still executes in the same amount of time, the overlapping of instruction execution allows the effective execution rate to be higher. Typical processors employ a combination of these techniques and others to increase the instruction execution rate.
As processors employ wider superscalar configurations and/or deeper instruction pipelines, memory latency becomes an even larger issue than it was previously. While virtually all modern processors employ one or more caches to decrease memory latency, even access to these caches is beginning to impact performance.
More particularly, as processors allow larger numbers of instructions to be in-flight within the processors, the number of load and store memory operations which are in-flight increases as well. As used here, an instruction is “in-flight” if the instruction has been fetched into the instruction pipeline (either speculatively or non-speculatively) but has not yet completed execution by committing its results (either to architected registers or memory locations). Additionally, the term “memory operation” refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as “loads”, and similarly store memory operations may be referred to as “stores”. Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein as a “data address” generally, or a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as “instruction addresses”.
Since memory operations are part of the instruction stream, having more instructions in-flight leads to having more memory operations in-flight. Unfortunately, adding ports to the data cache to allow more operations to occur in parallel is generally not feasible beyond a few ports (e.g. 2) due to increases in both cache access time and area occupied by the data cache circuitry. Accordingly, relatively larger buffers for memory operations are often employed. Scanning these buffers for memory operations to access the data cache is generally complex and, accordingly, slow. The scanning may substantially impact the load memory operation latency, even for cache hits.
Additionally, data caches are finite storage for which some loads and stores will miss. A memory operation is a “hit” in a cache if the data accessed by the memory operation is stored in cache at the time of access, and is a “miss” if the data accessed by the memory operation is not stored in cache at the time of access. When a load memory operation misses a data cache, the data is typically loaded into the cache. Store memory operations which miss the data cache may or may not cause the data to be loaded into the cache. Data is stored in caches in units referred to as “cache lines”, which are the minimum number of contiguous bytes to be allocated and deallocated storage within the cache. Since many memory operations are being attempted, it becomes more likely that numerous cache misses will be experienced. Furthermore, in many common cases, one miss within a cache line may rapidly be followed by a large number of additional misses to that cache line. These misses may fill, or come close to filling, the buffers allocated within the processor for memory operations. An efficient scheme for buffering memory operations is therefore needed.
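As an illustration only, the following Python sketch shows how several memory operations issued before a fill completes can all miss the same cache line and so occupy several buffer entries. The 64-byte line size and the names `LINE_SIZE`, `line_base`, and `misses_per_line` are assumptions made for this sketch and do not come from the description above.

```python
from collections import Counter

LINE_SIZE = 64  # assumed cache line size in bytes (hypothetical value)


def line_base(addr: int) -> int:
    """Return the address of the first byte of the cache line holding addr."""
    return addr - (addr % LINE_SIZE)


def misses_per_line(addresses, cached_lines):
    """Count misses per cache line for accesses issued before any fill completes.

    addresses    -- data addresses of the in-flight memory operations
    cached_lines -- set of line base addresses currently held in the cache
    """
    counts = Counter()
    for addr in addresses:
        base = line_base(addr)
        if base not in cached_lines:  # a miss: the line is not in the cache
            counts[base] += 1
    return counts


# Three accesses fall in the uncached line at 0x1000; one hits the line at 0x2000.
pending = misses_per_line([0x1000, 0x1008, 0x1010, 0x2000], {0x2000})
```

In this sketch a single uncached line accounts for three buffered misses, which is the situation that motivates an efficient buffering scheme.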
An additional problem which becomes even more onerous as processors employ wider superscalar configurations and/or deeper pipelines is the issue of self-modifying code (SMC) checks. Self-modifying code is code which performs store memory operations to memory locations which are subsequently fetched as instructions to be executed. Pipelined and/or wide issue processors may have fetched numerous instructions subsequent to the stores and hence may have fetched from the memory locations being stored to prior to the store being performed. In order to operate correctly, the store data must be fetched as the instructions from the updated memory locations. Accordingly, SMC checks are performed for stores to determine if one or more instructions updated by the store have been fetched prior to completion of the store, so that corrective action may be taken to fetch the correct instructions. As the number of instructions in-flight within processors increases, the difficulty in performing SMC checks (and hence the amount of time these checks may take) increases. Additionally, the likelihood that an SMC check will indicate that the subsequent instructions have been fetched when self-modifying code is executing is increased. A mechanism which minimizes the number of explicit SMC checks which are performed but which still guarantees correct self-modifying code execution is therefore desired.
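A minimal Python sketch of what an explicit SMC check has to decide is given below. It assumes byte-granular comparison of the store's address range against every in-flight instruction's address range; an actual processor might compare at a coarser granularity or use a different structure entirely. The names `ranges_overlap` and `smc_check` are hypothetical.

```python
def ranges_overlap(start_a: int, len_a: int, start_b: int, len_b: int) -> bool:
    """True if byte ranges [start_a, start_a+len_a) and [start_b, start_b+len_b) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


def smc_check(store_addr: int, store_size: int, in_flight) -> bool:
    """Return True if the store updates bytes of any in-flight instruction.

    in_flight -- iterable of (instruction_address, instruction_length) pairs
    """
    return any(
        ranges_overlap(store_addr, store_size, instr_addr, instr_len)
        for instr_addr, instr_len in in_flight
    )


# A 4-byte store to 0x104 touches the instruction occupying 0x103..0x107.
needs_refetch = smc_check(0x104, 4, [(0x100, 3), (0x103, 5)])
```

The cost of this comparison grows with the number of in-flight instructions, which is why reducing the number of explicit checks is attractive.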
The problems outlined above are in large part solved by a processor employing an SMC check apparatus as described herein. The SMC check apparatus may minimize the number of explicit SMC checks performed for non-cacheable stores. Cacheable stores may be handled using any suitable mechanism. For non-cacheable stores, the processor tracks whether or not the in-flight instructions are cached. Upon encountering a non-cacheable store, the processor inhibits an SMC check if the in-flight instructions are cached. Since, for performance reasons, the code stream is often cached, non-cacheable stores may frequently be able to skip an explicit, complex, and time-consuming SMC check. Performance of non-cacheable stores (and memory throughput overall) may be increased. The handling of non-cacheable stores as described herein may be particularly beneficial to video data manipulations, which may frequently be of a non-cacheable memory type and which may be important to the overall performance of a computer system.
Broadly speaking, a processor is contemplated comprising a load/store unit. The load/store unit includes a buffer and a control logic. The buffer is configured to store a store memory operation and a corresponding cacheability indication. The control logic is configured to set the cacheability indication according to a translation of a store address corresponding to the store memory operation. The control logic is coupled to receive a signal indicative of whether one or more instructions in-flight within the processor are non-cacheable. The control logic is configured to inhibit a self-modifying code (SMC) check for the store memory operation responsive to: (i) the cacheability indication indicating non-cacheable; and (ii) the signal indicating cacheable. A computer system is also contemplated including the processor and an input/output (I/O) device. The I/O device provides communication between the computer system and another computer system to which the I/O device is coupled.
Additionally, a method for performing self-modifying code (SMC) checks in a processor is contemplated. The method determines that a store memory operation is non-cacheable. A signal is asserted if one or more instructions in-flight within the processor are non-cacheable. An SMC check for the store memory operation is inhibited if the store memory operation is non-cacheable and the signal is deasserted.
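The inhibit condition stated in the method can be sketched in Python as follows. This is a software illustration under stated assumptions, not the hardware implementation: the `StoreEntry` record and the function name `smc_check_inhibited` are hypothetical, and the signal is modeled as a boolean that is True (asserted) when one or more in-flight instructions are non-cacheable.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    """Hypothetical buffer entry: a store plus its cacheability indication."""
    address: int
    non_cacheable: bool  # assumed to be set from the store address translation


def smc_check_inhibited(entry: StoreEntry, in_flight_non_cacheable: bool) -> bool:
    """Return True if the explicit SMC check may be skipped for this store.

    The check is inhibited only when the store is non-cacheable and the
    signal is deasserted (no in-flight instruction is non-cacheable).
    Cacheable stores are not covered by this rule and return False here.
    """
    return entry.non_cacheable and not in_flight_non_cacheable


# A non-cacheable store (address chosen arbitrarily) while all in-flight code is cached.
frame_store = StoreEntry(address=0xA0000, non_cacheable=True)
skip = smc_check_inhibited(frame_store, in_flight_non_cacheable=False)
```

The rationale follows from the text: a non-cacheable store cannot be writing to instructions that were all fetched from cacheable memory, so no overlap with in-flight instructions is possible in that case.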