Traditionally, the development of processor and memory devices has proceeded independently. Advances in process technology, circuit design, and integrated circuit (IC) architecture have led to a near-exponential increase in processor speed and memory capacity. However, memory device latencies have not improved as dramatically, and access times are increasingly becoming the limiter of processor performance. This problem is known as the Memory Wall and is more fully described in "Hitting the Memory Wall: Implications of the Obvious," by William A. Wulf and Sally A. McKee, ACM Computer Architecture News, Vol. 23, No. 1, March 1995, which is hereby explicitly incorporated by reference.
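The Memory Wall can be illustrated with the simple average access time model used in the Wulf and McKee paper, in which the average time per access is the cache hit probability times the cache access time plus the miss probability times the main memory access time. The following Python sketch is illustrative only. The function name, the 99 percent hit rate, and the cycle counts are assumptions chosen for the example and are not taken from the paper or from any measured system.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Average access time: hit_rate * cache_time + (1 - hit_rate) * memory_time."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


if __name__ == "__main__":
    # Hypothetical values: times are in processor cycles, with a 1-cycle cache.
    # memory_time grows as the processor gets faster relative to main memory.
    for memory_time in (4, 16, 64, 256):
        t_avg = average_access_time(hit_rate=0.99, cache_time=1, memory_time=memory_time)
        print(f"memory_time={memory_time:4d} cycles -> average {t_avg:.2f} cycles")
```

Under these assumed values, even a cache that hits on 99 percent of accesses sees its average access time rise steadily as the main memory latency, measured in processor cycles, increases.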
Current high performance processors, which use complex superscalar central processing units (CPUs) that interface to external off-chip main memory through a hierarchy of caches, are particularly affected by the Memory Wall problem. In fact, this CPU-centric design approach requires a large amount of power and chip area to bridge the gap between CPU and memory speeds.
The Memory Wall problem is commonly addressed by adding several levels of cache to the memory system so that small, high-speed, static random access memory (SRAM) devices feed the CPU at low latencies. Combined with latency hiding techniques, such as prefetching and proper code scheduling, it is possible to run a high performance processor at reasonable efficiencies for applications with enough locality for the caches. However, while achieving impressive performance on applications that fit nicely into their caches, these processors have become increasingly application sensitive. For example, large applications such as CAD programs, database applications, or scientific applications often fail to meet CPU-based speed expectations by a wide margin.
Moreover, the CPU-centric design approach has led to very complex superscalar processors with deep pipelines. Much of this complexity, such as out-of-order execution and register scoreboarding, is devoted to hiding memory system latency. In addition, these processors demand a large amount of support logic in terms of caches, controllers, and data paths to talk to the external main memory. This adds considerable cost, power dissipation, and design complexity.
A large memory system is required to fully utilize a superscalar processor. Such a memory system creates a bottleneck by increasing the distance between the CPU and main memory. Specifically, it adds interfaces and chip boundaries, which reduce the available memory bandwidth due to packaging and connection constraints.
Integrating the processor with the memory device, however, avoids most of the problems of the CPU-centric design approach. Doing so also offers a number of advantages that effectively compensate for the technological limitations of a single-chip design.
Specifically, in CPU-centric processor designs, the instruction and data cache lines have a width that is significantly less than the width of the main memory. This is primarily because the time needed to fill wider cache lines from the off-chip main memory would introduce severe second-order contention effects at the memory interface of the processor. As a result, such less-than-full-width caches are unable to take advantage of the often high spatial locality of instruction and data streams.
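The relationship between cache line width and spatial locality can be sketched with a minimal direct-mapped cache model. The Python example below is an illustration under stated assumptions, not a description of any particular design: the function name, the 4096-byte capacity, the line widths, and the sequential access stream are all hypothetical, and the model counts misses only, ignoring the line fill time that limits line width when main memory is off-chip.

```python
def count_misses(addresses, line_bytes, num_lines):
    """Count misses in a direct-mapped cache with the given line size."""
    tags = [None] * num_lines          # block currently held by each cache line
    misses = 0
    for address in addresses:
        block = address // line_bytes  # memory block containing this address
        index = block % num_lines      # direct-mapped placement
        if tags[index] != block:
            tags[index] = block        # fill the line on a miss
            misses += 1
    return misses


if __name__ == "__main__":
    # Hypothetical sequential stream: 4-byte words over a 4096-byte region.
    stream = range(0, 4096, 4)
    for line_bytes in (16, 64, 512):
        # Hold the total cache capacity fixed at 4096 bytes (assumed value).
        num_lines = 4096 // line_bytes
        misses = count_misses(stream, line_bytes, num_lines)
        print(f"line width {line_bytes:3d} bytes -> {misses} misses")
```

For a purely sequential stream, each miss brings in one full line, so in this model the miss count falls in proportion to the line width. Real instruction and data streams are less regular, so the benefit in practice depends on how much spatial locality the application actually exhibits.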
Thus, there is a need for full width instruction and data caches that take advantage of the high spatial locality of instruction and data streams in many applications. Moreover, the corresponding U.S. Pat. No. 5,900,011, issued May 4, 1999, and hereby explicitly incorporated by reference, describes and claims the use of a victim data cache to further improve the miss rate of such a full width data cache.