The ever-growing requirement for high-performance computers demands that state-of-the-art microprocessors execute instructions in the minimum amount of time. Over the years, efforts to increase microprocessor speeds have followed different approaches. One approach is to increase the speed of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption raises electrical costs and depletes batteries in portable computers more rapidly, while high circuit temperatures may damage the processor. Furthermore, processor clock speed cannot increase beyond a limit set by the physical speed at which signals can traverse the processor. Simply stated, there is a practical maximum to the clock speed that conventional processors can sustain.
An alternate approach to improving processor speeds is to reduce the number of clock cycles required to perform a given instruction. Under this approach, instructions will execute faster and overall processor "throughput" will thereby increase, even if the clock speed remains the same. One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster.
"Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, for example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can be processed simultaneously in the pipeline, with the processing of one instruction completed during each clock cycle. Therefore, the instruction throughput of an N stage pipelined architecture is, in theory, N times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every N clock cycles.
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (i.e., the execution of an instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
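The dependence of superscalar throughput on instruction independence can be illustrated with a toy issue model. This is a sketch only, assuming in-order issue, a one-cycle result latency, and a hypothetical `deps` list in which `deps[i]` holds the indices of earlier instructions that instruction `i` needs; it does not model any particular processor.

```python
def issue_cycles(deps: list[set[int]], degree: int) -> int:
    """Count cycles needed to issue all instructions in program order.

    Up to `degree` instructions issue per cycle. Issue stops early in a
    cycle when the next instruction depends on one issued in that same
    cycle (its result is assumed to be available only in the next cycle).
    """
    if degree < 1:
        raise ValueError("degree must be at least 1")
    cycles = 0
    next_instr = 0
    total = len(deps)
    while next_instr < total:
        issued_this_cycle: set[int] = set()
        while (next_instr < total
               and len(issued_this_cycle) < degree
               and not (deps[next_instr] & issued_this_cycle)):
            issued_this_cycle.add(next_instr)
            next_instr += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    independent = [set() for _ in range(9)]
    chain = [set()] + [{i - 1} for i in range(1, 9)]
    print(issue_cycles(independent, 3))  # nine independent instructions
    print(issue_cycles(chain, 3))        # each depends on its predecessor
```

With nine independent instructions and degree three, the model needs three cycles, matching the theoretical tripling. When every instruction depends on its predecessor, the same model needs nine cycles, so the extra issue slots provide no benefit.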
A cache memory is a small but very fast memory that holds a limited number of instructions and data for use by the processor. One of the most frequently employed techniques for increasing overall processor throughput is to minimize both the number of cache misses and the cache access time in a processor that implements a cache memory. The lower the cache access time, the faster the processor can run. Likewise, the lower the cache miss rate, the less often the processor stalls while requested data is retrieved from main memory, and the higher the processor throughput. There is a wealth of information describing cache memories, and the general theory of operation of cache memories is widely understood. This is particularly true of cache memories implemented in x86 microprocessor architectures.
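The combined effect of access time and miss rate is commonly summarized by the textbook average memory access time relation: hit time plus miss rate multiplied by miss penalty. The sketch below applies that relation; the function name and the cycle counts are illustrative assumptions and are not taken from any specific processor.

```python
def average_access_cycles(hit_cycles: float,
                          miss_rate: float,
                          miss_penalty_cycles: float) -> float:
    """Average memory access time: hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Illustrative values only: 1-cycle hit, 20-cycle main memory penalty.
    for rate in (0.10, 0.05, 0.01):
        print(rate, average_access_cycles(1, rate, 20))
```

Under these assumed numbers, lowering either the hit time or the miss rate reduces the average access time, which is the point made in the paragraph above.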
Many techniques have been employed to reduce the access time of cache memories. However, the cache access time is still limited by the rate at which data can be examined in, and retrieved from, the RAM circuits that are internal to a conventional cache memory. This is in part due to the rate at which address translation devices, such as the translation look-aside buffer (TLB), translate linear (or logical) memory addresses into physical memory addresses. If the TLB has a comparatively long access time for retrieving data, then the translation of the logical memory address into a physical address is comparatively slow. The slower this translation is, the slower the cache memory is in its overall operation.
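The role of the TLB in address translation can be sketched in software. The following toy model assumes 4 KB pages, a flat dictionary standing in for the page tables, and oldest-first replacement; the class name `TinyTLB` and all of its members are hypothetical and do not describe any actual hardware design.

```python
PAGE_SHIFT = 12                      # assumes 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1    # low 12 bits are the page offset


class TinyTLB:
    """Toy translation look-aside buffer backed by a dictionary page table."""

    def __init__(self, page_table: dict[int, int], capacity: int = 4):
        self.page_table = page_table      # virtual page -> physical frame
        self.capacity = capacity
        self.entries: dict[int, int] = {} # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, linear_address: int) -> int:
        page = linear_address >> PAGE_SHIFT
        offset = linear_address & PAGE_MASK
        if page in self.entries:
            self.hits += 1                # fast path: translation is cached
        else:
            self.misses += 1              # slow path: consult the page table
            frame = self.page_table[page] # KeyError here stands in for a fault
            if len(self.entries) >= self.capacity:
                del self.entries[next(iter(self.entries))]  # evict oldest
            self.entries[page] = frame
        return (self.entries[page] << PAGE_SHIFT) | offset


if __name__ == "__main__":
    tlb = TinyTLB({0x12345: 0x00ABC})
    print(hex(tlb.translate(0x12345678)), tlb.hits, tlb.misses)
    print(hex(tlb.translate(0x12345FFF)), tlb.hits, tlb.misses)
```

Every cache access that needs a physical address must wait for `translate` to return, so in this model any delay in the lookup is added directly to the cache access time.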
A significant portion of the latency of a cache memory and its associated TLB is attributable to the complex switching and multiplexing networks that interconnect the main cache memory, the TLB, and other parts of the processor. In conventional x86 processors, the cache memory and its TLB receive addresses from a number of address generating sources within the processor. Some of the addresses are generated when the processor is operating in real mode and do not require translation by the TLB. Other addresses are generated when the processor is operating in paging-enabled mode and must be translated in the TLB. Thus, there are frequently multiple paths interconnecting the same address generating sources with the cache memory and/or the TLB in order to service both real mode and paging mode. This results in complex switching and multiplexing gate arrays that add delay to the time required to translate addresses and retrieve data from the cache memory.
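The two address paths described above can be pictured as a selection between a bypass and a translated route. The sketch below is a software analogy only, assuming 4 KB pages and a hypothetical `page_map` dictionary in place of the TLB; it is not a model of the gate-level multiplexing itself.

```python
PAGE_SHIFT = 12                      # assumes 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def to_physical(address: int,
                paging_enabled: bool,
                page_map: dict[int, int]) -> int:
    """Select between the untranslated path and the translated path."""
    if not paging_enabled:
        # Real-mode style path: the address bypasses translation entirely.
        return address
    # Paging path: replace the page number, keep the offset within the page.
    frame = page_map[address >> PAGE_SHIFT]
    return (frame << PAGE_SHIFT) | (address & PAGE_MASK)


if __name__ == "__main__":
    page_map = {0x1: 0x80}
    print(hex(to_physical(0x1234, False, page_map)))
    print(hex(to_physical(0x1234, True, page_map)))
```

In hardware, the branch in `to_physical` corresponds to selection logic that must exist for each address generating source, which is where the additional switching delay arises.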
Therefore, there is a need in the art for improved cache memories that maximize processor throughput. There is a further need in the art for improved cache memories having a reduced access time. In particular, there is a need for improved cache memories that minimize cache latencies related to switching circuitry used to service both real mode and paging mode.