1. Field of the Invention
This invention relates to the field of superscalar microprocessors and, more particularly, to a method for transferring data between a pair of caches within such a microprocessor.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. Memory elements (such as registers and arrays) capture the resulting values according to a clock signal defining the clock cycle.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system (i.e. a memory system that can provide a large number of bytes in a short period of time) is required to provide instructions and data to the superscalar microprocessor. Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, and would then execute the received instructions and/or the instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. Unfortunately, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, a DRAM main memory typically provides a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, a DRAM main memory provides a relatively small number of bytes in a relatively long period of time, and does not form a high bandwidth memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches. Caches are storage devices containing multiple blocks of storage locations, configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. The bytes can be transferred from the cache to the destination (e.g. a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
Caches may be organized into an "associative" structure (also referred to as "set associative"). In an associative structure, the blocks of storage locations are accessed as a two-dimensional array having rows and columns. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The addresses associated with bytes stored in the multiple blocks of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the cache provides the associated bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the bytes are transferred from the memory system into the cache. The addresses associated with bytes stored in the cache are also stored. These stored addresses are referred to as "tags" or "tag addresses". It is noted that an "address" is indicative of a storage location within the main memory of a computer system at which a particular value is stored.
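By way of illustration only, the indexing and tag comparison described above may be sketched in software as follows. The cache geometry (32-byte blocks, 128 rows, 4 blocks per row), the round-robin replacement choice, and all names such as `split_address` and `AssociativeCache` are assumptions made for this sketch and are not taken from any particular microprocessor.

```python
# Hypothetical geometry, chosen only for illustration.
LINE_BYTES = 32   # bytes per block of storage locations
NUM_ROWS = 128    # rows selected by the index
NUM_WAYS = 4      # blocks per row

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 5 bits select a byte in a block
INDEX_BITS = NUM_ROWS.bit_length() - 1     # 7 bits select a row


def split_address(address):
    """Split a byte address into (tag, index, offset)."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class AssociativeCache:
    def __init__(self):
        # Each row holds NUM_WAYS entries: (tag, block bytes), or None if empty.
        self.rows = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]
        # Round-robin replacement is an assumption of this sketch.
        self.next_victim = [0] * NUM_ROWS

    def lookup(self, address):
        """Return the addressed byte on a hit, or None on a miss."""
        tag, index, offset = split_address(address)
        for entry in self.rows[index]:          # examine every tag in the row
            if entry is not None and entry[0] == tag:
                return entry[1][offset]         # hit
        return None                             # miss

    def fill(self, address, line):
        """Store a block fetched from the memory system, along with its tag."""
        if len(line) != LINE_BYTES:
            raise ValueError("line must be exactly LINE_BYTES long")
        tag, index, _ = split_address(address)
        way = self.next_victim[index]
        self.rows[index][way] = (tag, bytes(line))
        self.next_victim[index] = (way + 1) % NUM_WAYS
```

In this sketch, a lookup that returns `None` corresponds to a miss, after which `fill` models the transfer of the bytes from the memory system into the cache together with the stored tag.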
Several blocks of storage locations are configured into a row of an associative cache. Each block of storage locations is referred to as a "way"; multiple ways comprise a row. The way is selected by providing a way value to the cache. The way value is determined by examining the tags for a row and finding a match between one of the tags and the requested address. A cache designed with one way per row is referred to as a "direct-mapped cache". In a direct-mapped cache, the tag must be examined to determine if an access is a hit, but the tag examination is not required to select which bytes are transferred to the outputs of the cache. Since only an index is required to select bytes from a direct-mapped cache, the direct-mapped cache is a "linear array" requiring only a single value to select a storage location within it. It is noted that a set of contiguous bytes which may fill a block of storage locations within the cache is often referred to as a "cache line".
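The distinction drawn above for a direct-mapped cache may be illustrated with the following sketch, in which the bytes are selected by the index alone and the tag serves only to qualify the access as a hit or a miss. The geometry (32-byte cache lines, 256 rows) and the name `DirectMappedCache` are assumptions made for illustration.

```python
# Hypothetical geometry, chosen only for illustration.
LINE_BYTES = 32
NUM_ROWS = 256
OFFSET_BITS = 5   # log2(LINE_BYTES)
INDEX_BITS = 8    # log2(NUM_ROWS)


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_ROWS                 # one tag per row
        self.data = [bytes(LINE_BYTES)] * NUM_ROWS    # one cache line per row

    def access(self, address):
        """Return (hit, byte). The byte is selected without consulting the tag."""
        offset = address & (LINE_BYTES - 1)
        index = (address >> OFFSET_BITS) & (NUM_ROWS - 1)
        tag = address >> (OFFSET_BITS + INDEX_BITS)
        byte = self.data[index][offset]   # linear array: the index alone selects
        hit = self.tags[index] == tag     # the tag only determines hit or miss
        return hit, byte

    def fill(self, address, line):
        """Store a cache line and its tag in the row selected by the index."""
        if len(line) != LINE_BYTES:
            raise ValueError("line must be exactly LINE_BYTES long")
        index = (address >> OFFSET_BITS) & (NUM_ROWS - 1)
        self.tags[index] = address >> (OFFSET_BITS + INDEX_BITS)
        self.data[index] = bytes(line)
```

In this sketch, an access to a different address that maps to the same row still produces the stored byte at the outputs, but the access is reported as a miss, so the consumer of the data must discard it.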
A high bandwidth memory system is particularly important to a microprocessor implementing the x86 microprocessor architecture. The x86 architecture implements a relatively small register set including several registers which are not general purpose. Registers which are not general purpose may not be used to store an arbitrary value because the value they store has a specific interpretation for certain instructions. Consequently, many data values which a program is manipulating are stored within a stack. As will be appreciated by those of skill in the art, a stack is a data storage structure implementing a last-in, first-out storage mechanism. Data is "pushed" onto a stack (i.e. the data is stored into the stack data structure) and "popped" from the stack (i.e. the data is removed from the stack data structure). When the stack is popped, the data removed is the data that was most recently pushed. The ESP register of the x86 architecture stores the address of the "top" of a stack within main memory. The top of the stack is the storage location which is storing the data that would be provided if the stack were popped.
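The push and pop operations described above may be sketched as follows. The sketch assumes 32-bit operands and a stack that grows toward lower addresses, as is the case for the x86 architecture: a push decrements ESP and then stores, and a pop loads and then increments ESP. The 64-byte memory region and the name `StackModel` are hypothetical, and bounds checking is omitted for brevity.

```python
import struct


class StackModel:
    def __init__(self, size=64):
        self.memory = bytearray(size)  # stands in for main memory (assumption)
        self.esp = size                # empty stack: ESP is just past the region

    def push(self, value):
        """Decrement ESP by the operand size, then store at the new top."""
        self.esp -= 4
        struct.pack_into("<I", self.memory, self.esp, value & 0xFFFFFFFF)

    def pop(self):
        """Load from the top of the stack, then increment ESP."""
        (value,) = struct.unpack_from("<I", self.memory, self.esp)
        self.esp += 4
        return value
```

For example, after pushing the values 1 and 2 in that order, the first pop returns 2 and the second returns 1, consistent with the last-in, first-out behavior of the stack.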
Since data on the stack is manipulated often, a method for providing relatively quick access to data on the stack is desired. In particular, accessing stack data as early as possible in the instruction processing pipeline may improve microprocessor performance by allowing instructions which access the stack to fetch their operands early. As used herein, the term "instruction processing pipeline" refers to a pipeline which performs instruction processing. Instruction processing may include fetching, decoding, executing, and writing the results of each instruction. An instruction processing pipeline is formed by a number of pipeline stages in which portions of instruction processing are performed. A particular stage may require more than one clock cycle to perform its function. Often, such a stage includes several memory elements through which the instruction may flow. A decode stage of the instruction processing pipeline performs the decoding of an instruction. Decoding may include determining what type of instruction is to be executed and accessing the register operands. An execute stage of an instruction processing pipeline may include executing the decoded instruction to produce a result. Many other stages may be defined for a particular instruction processing pipeline.
Typically, memory operands (both stack and non-stack) are accessed from the execute stage of the instruction processing pipeline. As used herein, the term "operand" refers to a value which an instruction is intended to manipulate. Operands may be memory operands (which are stored in memory) or register operands (which are stored in registers).