1. Field of the Invention
This invention is related to the field of processors and, more particularly, to caching mechanisms within processors.
2. Description of the Related Art
Superscalar processors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term “clock cycle” refers to an interval of time during which the pipeline stages of a processor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage. Clocked storage devices (e.g. registers, latches, flops, etc.) may capture their values in response to a clock signal defining the clock cycle.
To reduce effective memory latency, processors typically include caches. Caches are high speed memories used to store previously fetched instruction and/or data bytes. The cache memories may be capable of providing substantially lower memory latency than the main memory employed within a computer system including the processor.
Caches may be organized into a “set associative” structure. In a set associative structure, the cache is organized as a two-dimensional array having rows (often referred to as “sets”) and columns (often referred to as “ways”). When a cache is searched for bytes residing at an address, a number of bits from the address are used as an “index” into the cache. The index selects a particular set within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of sets configured into the cache. The act of selecting a set via an index is referred to as “indexing”. Each way of the cache has one cache line storage location which is a member of the selected set (where a cache line is a number of contiguous bytes treated as a unit for storage in the cache, and may typically be in the range of 16-64 bytes, although any number of bytes may be defined to compose a cache line). The addresses associated with bytes stored in the ways of the selected set are examined to determine if any of the addresses stored in the set match the requested address. If a match is found, the access is said to be a “hit”, and the cache provides the associated bytes. If a match is not found, the access is said to be a “miss”. When a miss is detected, the bytes are transferred from the main memory system into the cache. The addresses associated with bytes stored in the cache are also stored. These stored addresses are referred to as “tags” or “tag addresses”.
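The indexing and tag comparison described above can be illustrated with a short software model. This is only a sketch under stated assumptions: the line size, set count, way count, round-robin replacement, and all names (`split_address`, `SetAssociativeCache`, `lookup`, `fill`) are hypothetical choices made for illustration and are not taken from the disclosure.

```python
LINE_BYTES = 64   # assumed cache line size (the text gives 16-64 bytes as typical)
NUM_SETS = 128    # assumed number of sets (rows)
NUM_WAYS = 4      # assumed number of ways (columns)

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2(64) = 6 offset bits
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2(128) = 7 index bits


def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    """Tag-only model: tracks which lines are present, not their bytes."""

    def __init__(self):
        # tags[set][way] is None while that storage location is empty.
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        # Round-robin replacement is an assumption made for this sketch.
        self.next_victim = [0] * NUM_SETS

    def lookup(self, addr):
        """Return the hitting way, or None on a miss."""
        tag, index, _ = split_address(addr)
        for way, stored_tag in enumerate(self.tags[index]):
            if stored_tag == tag:
                return way
        return None

    def fill(self, addr):
        """Install the line containing addr after a miss; return the way used."""
        tag, index, _ = split_address(addr)
        way = self.next_victim[index]
        self.tags[index][way] = tag
        self.next_victim[index] = (way + 1) % NUM_WAYS
        return way
```

With these assumed parameters, address 0x12345 splits into tag 9, index 13, and offset 5. A first lookup misses, and after a fill any address within the same 64-byte line hits in the way that was filled.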
As mentioned above, a cache line storage location from each way of the cache is a member of the selected set (i.e. is accessed in response to selecting the set). Information stored in one of the ways is provided as the output of the cache, and that way is selected by providing a way selection to the cache. The way selection identifies the way to be selected as an output. In a typical set associative cache, the way selection is determined by examining the tags within a set and finding a match between one of the tags and the requested address.
Unfortunately, set associative caches may have higher latency than a direct mapped cache (which provides one cache line storage location per index) due to the tag comparison to determine the way selection for the output. Furthermore, since the way selection is not known prior to the access, each way is typically accessed and the corresponding way selection is used to late select the output bytes if a hit is detected. Accessing all of the ways may cause undesirably high power consumption. Limiting power consumption is rapidly becoming as important as increasing operating speed (or frequency) in modern processors. Accordingly, a low latency, low power consuming method for accessing a set associative cache is desired.
The problems outlined above are in large part solved by a cache as disclosed herein. The cache is coupled to receive an input address and a corresponding way prediction. The cache provides output bytes in response to the predicted way (instead of performing tag comparisons to select the output bytes), and thus may reduce access latency as compared to performing the tag comparisons. Furthermore, a tag may be read from the predicted way and only partial tags are read from the non-predicted ways. The tag is compared to the tag portion of the input address, and the partial tags are compared to a corresponding partial tag portion of the input address. If the tag matches the tag portion of the input address, a hit in the predicted way is detected and the bytes provided in response to the predicted way are correct. If the tag does not match the tag portion of the input address, a miss in the predicted way is detected. If none of the partial tags match the corresponding partial tag portion of the input address, a miss in the cache is determined. On the other hand, if one or more of the partial tags match the corresponding partial tag portion of the input address, the cache searches the corresponding ways to determine whether or not the input address hits or misses in the cache (e.g. by performing full tag comparisons for the ways in which a partial tag match is detected). Because partial tags are read from the non-predicted ways, power may be conserved as compared to reading the full tags from each of the ways. Advantageously, both access latency and power consumption may be reduced. Furthermore, by providing partial tags, the other ways to be searched may be identified and the number of ways to be searched may be reduced (e.g. each way having a partial tag miss may not be searched).
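The decision flow described above (full tag check in the predicted way, partial tag filtering of the remaining ways, then full tag comparison only for partial tag matches) may be sketched in software as follows. This is an illustrative model rather than the disclosed circuit: the partial tag width, the choice of low-order tag bits as the partial tag, the use of `None` for an empty way, and the names `partial_tag` and `predicted_lookup` are all assumptions.

```python
PARTIAL_TAG_BITS = 8  # assumed partial tag width; the text does not fix one


def partial_tag(tag):
    """Low-order bits of a tag, used here as the partial tag (an assumption)."""
    return tag & ((1 << PARTIAL_TAG_BITS) - 1)


def predicted_lookup(set_tags, addr_tag, predicted_way):
    """Model one access to the selected set.

    set_tags: full tags (or None for an empty way) of the selected set.
    Returns (hit_way, full_tag_reads); hit_way is None on a cache miss.
    """
    full_tag_reads = 1  # the full tag is read only from the predicted way
    if set_tags[predicted_way] == addr_tag:
        return predicted_way, full_tag_reads  # hit in the predicted way

    # Miss in the predicted way: only ways whose partial tag matches the
    # corresponding portion of the address remain candidates for a search.
    candidates = [
        way
        for way, tag in enumerate(set_tags)
        if way != predicted_way
        and tag is not None
        and partial_tag(tag) == partial_tag(addr_tag)
    ]

    # Search the candidate ways with full tag comparisons.
    for way in candidates:
        full_tag_reads += 1
        if set_tags[way] == addr_tag:
            return way, full_tag_reads
    return None, full_tag_reads  # miss in the cache
```

For example, with `set_tags = [0x1A2B, 0x3C2B, 0x55FF, None]`, a request for tag 0x3C2B with way 0 predicted misses in the predicted way, finds a partial tag match only in way 1, and hits there after two full tag reads. A request for tag 0x7700 matches no partial tag and is reported as a miss without searching any other way.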
Broadly speaking, a cache is contemplated. The cache comprises a tag array and a control circuit coupled to the tag array. The tag array is coupled to receive an index of a read address and a way selection, and comprises a plurality of ways. The tag array is configured to output a plurality of partial tags, each of which is from one of the plurality of ways. The control circuit is configured to generate a search way selection identifying a search way responsive to the read address missing in a first way of the plurality of ways of the tag array. The first way is identified by the way selection. A first partial tag from the search way matches a corresponding portion of the read address.
Additionally, a processor is contemplated. The processor comprises a line predictor configured to provide a way prediction responsive to a fetch address, and an instruction cache coupled to receive the way prediction and the fetch address. The instruction cache is set associative and includes a tag array configured to output a plurality of partial tags responsive to an index of the fetch address. The instruction cache is configured, responsive to the fetch address missing in a first way identified by the way prediction, to search a second way of the tag array for which a corresponding partial tag of the plurality of partial tags matches a corresponding portion of the fetch address. Still further, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable.
Moreover, a method is contemplated. A plurality of partial tags are read from a cache responsive to a fetch address. Whether or not the fetch address hits in a predicted way of the cache is determined. A second way of the cache is searched for a hit in response to determining that the fetch address does not hit in the predicted way of the cache. The second way is different from the predicted way and a first partial tag of the plurality of partial tags corresponding to the second way matches a corresponding portion of the fetch address.