1. Field of the Invention
This invention relates to the field of cache memories. More particularly, this invention is directed to a method and apparatus for speeding sequential access of a set-associative cache in a pipelined processor.
2. Description of Related Art
Currently, many processors are implemented with a pipelined architecture. These pipelined processors may also include one or more caches, such as an instruction cache for storing instructions for faster access. In such an architecture, there are distinct steps that an instruction undergoes as it is processed. Typically, these include: (1) loading the instruction from main memory into the instruction cache; (2) fetching the instruction from the instruction cache; (3) decoding the instruction; and (4) executing the instruction.
One instruction that often needs to be processed is an instruction pointer (IP) relative branch instruction. IP-relative branch instructions are conditional or unconditional branch instructions that compute the target address of the branch (the address of the next instruction to be executed if the branch is taken, also known as the "branch target") by adding a value extracted from the branch instruction itself to the address contained in the instruction pointer (which is the address of the branch instruction). The IP-relative branch instruction is a common form of branch instruction in many computer instruction sets. In this context, branch instructions may include call and branch-and-link instructions.
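The target computation described above can be sketched in a few lines of Python. This is a minimal illustrative model, not any particular instruction set; the function name and the specific addresses are hypothetical, and the displacement is assumed to be an already-extracted signed value.

```python
def branch_target(ip: int, displacement: int) -> int:
    """Compute an IP-relative branch target: the signed displacement
    extracted from the branch instruction is added to the instruction
    pointer, i.e. the address of the branch instruction itself."""
    return ip + displacement

# A branch at address 0x1000 with displacement -16 (a backward branch,
# as at the bottom of a loop) targets address 0x0FF0.
print(hex(branch_target(0x1000, -16)))  # → 0xff0
```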
The computation of the target address typically occurs in either the instruction decode step or in a subsequent execution step. FIG. 1 shows a target address computation unit 100 that illustrates how this processing occurs. Target address computation unit 100 contains an IP register 110; a direct-mapped cache 112 including a set of cache tags 114, cache data 116 and a hit detection unit 117; a decoder 118; an adder 120 for calculating the branch target address; an incrementer 122 for handling non-branch cases; and a selector 124 that determines whether the next IP is obtained from incrementer 122 or adder 120. For an IP-relative branch, the circuit shown in FIG. 1 begins by sending some number of the less significant bits (LSb's) of the IP (though not necessarily the least significant bits), sufficient to select one line from the plurality of lines in cache 112, to the address inputs of the storage arrays that hold the set of cache tags 114 and cache data 116. The remaining more significant bits (MSb's) of the IP are sent to hit detection unit 117, where they are compared with the tag of the accessed cache line to determine whether a cache hit occurred (that is, whether the instruction referenced by the IP is contained in cache 112).
Assuming a hit occurred, decoder 118 determines whether the instruction is a branch instruction and sends the displacement from the instruction word that is read out of cache 112 to adder 120. The displacement is then added to the IP. Assuming the branch is taken, selector 124, which is under the control of other logic (not shown), routes the output of adder 120 to IP register 110, where it becomes the next IP when the load signal is sent to IP register 110. If the instruction is not a branch instruction or if the instruction contains a branch that is not taken, then the output of incrementer 122 becomes the next IP.
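The selection between the incrementer path and the adder path can be sketched as follows. This is a hypothetical behavioral model only (real designs implement this as hardware, and a fixed four-byte instruction width is assumed here for the incrementer):

```python
INSTRUCTION_BYTES = 4  # hypothetical fixed instruction width

def compute_next_ip(ip: int, is_branch: bool, branch_taken: bool,
                    displacement: int) -> int:
    """Model of selector 124: route either the adder 120 output (taken
    branch) or the incrementer 122 output (sequential fetch) into IP
    register 110."""
    if is_branch and branch_taken:
        return ip + displacement          # adder 120 path
    return ip + INSTRUCTION_BYTES         # incrementer 122 path

print(hex(compute_next_ip(0x100, True, True, 0x40)))    # → 0x140
print(hex(compute_next_ip(0x100, True, False, 0x40)))   # → 0x104
```

Note that in the taken-branch case the model simply reuses the IP-relative addition; a not-taken branch falls through to the sequential path exactly like a non-branch instruction.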
FIG. 1 also illustrates the operation of a typical instruction cache. Two separate processes are involved: the process by which information is placed in the instruction cache, and the process by which information is fetched from the instruction cache. The two arrows coming into the top of cache 112 represent the process of putting information into cache 112. The information consists of an address and the data that is contained in that address in a main memory. Assuming cache 112 is a simple cache where cache lines are filled all at once, information is placed in the cache by the following process:
1) The set that the address belongs to is determined. The set is a function of a group of bits in the address. Generally, these are the least significant bits beyond those bits needed to identify an offset within the cache line, sufficient to address all the sets in the cache. For example, if a cache line is 32 bytes, then 5 bits of address are needed to identify where in the cache line data is, so the sixth bit and up are used to select which set to use. If there are 128 sets (the number of sets is essentially always a power of two), then 7 bits would be needed to select one set. In this example, the sixth through twelfth least significant bits would select which set to use.
2) A line within that set is chosen. The selection of which line to use can be done via a number of different algorithms, such as Least-Recently-Used, Least-Recently-Written, Circular, or another algorithm that is known to one of ordinary skill in the art. In a direct-mapped cache, there is only one line per set, so the selection process is trivial.
3) The data is written into the selected cache line's data storage, and the more significant bits (those not used in the selection of a byte within a line nor in the selection of which set to use) are placed in the tag for that line.
One exemplary process of fetching information from cache 112 proceeds as follows:
1) The address of the next information desired from cache 112 is generated. If a branch has not occurred, this is done by incrementing the previous address that was used. If a branch has occurred, then the computed target address of the branch is used. For the type of branches covered by this invention, the branch target address is generated by adding a displacement contained within the instruction to the address of the instruction (or, in some architectures, the address of the next instruction that would have been executed if the branch had not been taken). The displacement is extracted from the instruction via decoder 118. Incrementer 122, adder 120, selector 124 and IP register 110 of FIG. 1 are used to perform the function of generating the needed address, under the control of logic not shown in the diagram. If another form of branch is possible (such as a return from subroutine, or a branch to an address specified by a general-purpose register), then another input to selector 124 would be used to route the address generated by that branch into the circuit. This is not shown in the diagram but is obvious to one of ordinary skill in the art.
2) The bits of the generated address needed to select the set are fed into cache 112. In the example given above, these are the sixth through twelfth least significant bits. These bits select the lines belonging to one set, all of which then drive their tags and data out.
3) The tags coming out of the lines are compared to the more significant bits (in the example, the thirteenth least significant bit and all more significant bits) to determine whether a cache hit has occurred. If one of the lines from the set produces a hit, then the needed data is selected from its data using the least significant bits below those used to select the set (the fifth least significant bit and those bits less significant than it).
4) This fetched data is then fed into decoder 118, as described above.
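Using the parameters from the example above (32-byte lines, 128 sets), the address decomposition into tag, set index, and line offset can be sketched as follows. The specific address used at the end is arbitrary and purely illustrative:

```python
LINE_BYTES = 32   # => 5 offset bits (the five least significant bits)
NUM_SETS = 128    # => 7 set-index bits (the sixth through twelfth bits)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5
INDEX_BITS = NUM_SETS.bit_length() - 1      # 7

def split_address(addr: int):
    """Decompose an address into (tag, set index, offset within line)."""
    offset = addr & (LINE_BYTES - 1)                     # byte within line
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # which set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)             # remaining MSb's
    return tag, set_index, offset

tag, set_index, offset = split_address(0x1234)
print(tag, set_index, offset)  # → 1 17 20
```

On a fetch, the set index drives the storage arrays, the tag is what hit detection compares against, and the offset selects the needed bytes from the hitting line's data.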
In an address computation unit such as target address computation unit 100 of FIG. 1, the cache access and the decode/adder stages would often be implemented in separate pipestages. Thus, a target address of a branch instruction is not immediately available at the beginning of the clock cycle after the cycle in which the instruction was fetched from cache 112. This in turn means that the instruction that would be executed immediately after the branch, if the branch is taken, cannot be fetched in the cycle after the branch instruction was fetched. This causes one or more clock cycles during which no new instruction is issued into the execution stages of the pipeline. This problem has been termed a "control hazard".
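The one-cycle bubble can be made concrete with a small timing sketch. A hypothetical two-pipestage front end is assumed (cache access in one cycle, decode/add in the next); real pipelines may stall for more cycles:

```python
def taken_branch_schedule(fetch_cycle: int) -> dict:
    """Cycle-by-cycle timing for a taken branch in a front end whose
    cache access and decode/adder stages occupy separate pipestages."""
    return {
        "branch fetched from cache": fetch_cycle,
        "target computed (decode/add)": fetch_cycle + 1,
        "target instruction fetched": fetch_cycle + 2,
        # The cycle between fetch and target fetch issues no new
        # instruction into the pipeline: the control hazard.
        "bubble cycles": 1,
    }

print(taken_branch_schedule(0))
```

The key point is that the target fetch cannot begin until the adder output exists, which is one full cycle after the branch itself was fetched.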
In one attempt to overcome problems such as control hazards, a "branch-target buffer" may be provided. However, branch-target buffers are generally expensive and complex devices. Thus, branch-target buffers generally contain many devices, which also adds cost and complexity to the testing and deployment phases of the integrated circuit.
A more efficient mechanism for speeding sequential access of a set-associative cache in a pipelined processor is thus desired.