The ever-growing requirement for high performance computers demands that state-of-the-art microprocessors execute instructions in the minimum amount of time. Over the years, efforts to increase microprocessor speeds have followed different approaches. One approach is to increase the speed of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption increases electrical costs and depletes batteries in portable computers more rapidly, while high circuit temperatures may damage the processor. Furthermore, processor clock speed may not increase beyond a threshold physical speed at which signals may traverse the processor. Simply stated, there is a practical maximum to the clock speed that is acceptable to conventional processors.
An alternate approach to improving processor speeds is to reduce the number of clock cycles required to perform a given instruction. Under this approach, instructions will execute faster and overall processor "throughput" will thereby increase, even if the clock speed remains the same. One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster.
"Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, for example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can be processed simultaneously in the pipeline, with the processing of one instruction completed during each clock cycle. Therefore, the instruction throughput of an N stage pipelined architecture is, in theory, N times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every N clock cycles.
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (i.e., the execution of an instruction does not depend upon the execution of any other instruction) processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
One of the most frequently employed techniques for increasing overall processor throughput is to minimize the number of cache misses and to minimize the cache access time in a processor that implements a cache memory. There is a wealth of information describing cache memories and the general theory of operation of cache memories is widely understood. This is particularly true of cache memories implemented in x86 microprocessor architectures. A cache memory is a small but very fast memory that holds a limited number of instructions and data for use by the processor. The lower the cache access time, the faster the processor can run. Also, the lower the cache miss rate, the less often the processor is stalled while the requested data is retrieved from main memory and the higher the processor throughput is. Many techniques have been employed to reduce the access time of cache memories. However, the cache access time is still limited by the rate at which data can be examined in, and retrieved from, the SRAM circuits that are internal to a conventional cache memory. This is in part due to the rate at which the translation look-aside buffer (TLB) translates logical memory addresses into physical memory addresses. The cache access time is also limited by the speed at which a cache line can be selected after a hit has occurred in the cache. The cache access time is also limited by the speed at which a TLB entry can be selected after a hit has occurred in the TLB.
Therefore, there is a need in the art for improved cache memories that maximize processor throughput. More particularly, there is a need in the art for improved cache memories having a reduced access time. There is also a need for hit determination circuits that are able to more rapidly select and retrieve a data array from a TLB or cache.