The ever-growing requirement for high performance computers demands that state-of-the-art microprocessors execute instructions in the minimum amount of time. Over the years, efforts to increase microprocessor speeds have followed different approaches. One approach is to increase the speed of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption increases electrical costs and depletes batteries in portable computers more rapidly, while high circuit temperatures may damage the processor. Furthermore, processor clock speed may not increase beyond a threshold physical speed at which signals may traverse the processor. Simply stated, there is a practical maximum to the clock speed that is acceptable to conventional processors.
An alternate approach to improving processor speeds is to reduce the number of clock cycles required to perform a given instruction. Under this approach, instructions will execute faster and overall processor "throughput" will thereby increase, even if the clock speed remains the same. One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster.
"Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, for example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can be processed simultaneously in the pipeline, with the processing of one instruction completed during each clock cycle. Therefore, the instruction throughput of an N stage pipelined architecture is, in theory, N times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every N clock cycles.
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (i.e., the execution of an instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
A cache memory is a small but very fast memory that holds a limited number of instructions and data for use by the processor. One of the most frequently employed techniques for increasing overall processor throughput is to minimize the number of cache misses and to minimize the cache access time in a processor that implements a cache memory. The lower the cache access time, the faster the processor can run. Also, the lower the cache miss rate, the less often the processor is stalled while the requested data is retrieved from main memory and the higher the processor throughput is. There is a wealth of information describing cache memories and the general theory of operation of cache memories is widely understood. This is particularly true of cache memories implemented in x86 microprocessor architectures.
Many techniques have been employed to reduce the access time of cache memories. However, the cache access time is still limited by the rate at which data can be examined in, and retrieved from, the RAM circuits that are internal to a conventional cache memory. This is in part due to the rate at which address translation devices, such as the translation look-aside buffer (TLB), translate linear (or logical) memory addresses into physical memory addresses. For example, the TLB is itself a RAM circuit that uses the linear memory address to select a memory location that contains the corresponding physical memory address that is needed by the cache memory. If the RAM array of the TLB has a comparatively long access time for retrieving data from the TLB memory locations, then the translation of the logical memory address into a physical address is comparatively slow. The slower this translation is, the slower the cache memory is in its overall operation.
Therefore, there is a need in the art for improved cache memories that maximize processor throughput. More particularly, there is a need in the art for improved cache memories having a reduced access time. There also is a need for improved cache memories that are not limited by the rate at which data can be retrieved and examined from the RAM circuits in the cache memories. There is a further need in the art for cache memories implementing improved address memory devices that more rapidly translate linear memory addresses into physical memory addresses. In particular, there is a need in the art for an improved TLB using a very fast RAM circuit having a low data access time.