Traditionally, the development of processor and memory devices has proceeded independently. Advances in process technology, circuit design, and integrated circuit (IC) architecture have led to a near exponential increase in processor speed and memory capacity. However, memory device latencies have not improved as dramatically, and access times are increasingly becoming the limiter of processor performance. This problem is known as the Memory Wall and is more fully described in "Hitting the Memory Wall: Implications of the Obvious", by William A. Wulf and Sally A. McKee, ACM Computer Architecture News, Vol. 23, No. 1, Mar. 1995, which is hereby explicitly incorporated by reference.
Current high performance processors, which use complex superscalar central processing units (CPUs) that interface to external off-chip main memory through a hierarchy of caches, are particularly affected by the Memory Wall problem. In fact, this CPU-centric design approach requires a large amount of power and chip area to bridge the gap between CPU and memory speeds.
The Memory Wall problem is commonly addressed by adding several levels of cache to the memory system so that small, high speed, static random access memory (SRAM) devices feed the CPU at low latencies. Combined with latency hiding techniques, such as prefetching and proper code scheduling, this makes it possible to run a high performance processor at reasonable efficiencies for applications with enough locality for the caches. However, while achieving impressive performance on applications that fit nicely into their caches, these processors have become increasingly application sensitive. For example, large applications such as CAD programs, database applications, or scientific applications often fail to meet CPU-based speed expectations by a wide margin.
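The application sensitivity described above can be illustrated with the usual average access time model, in which hits are served at cache latency and misses at main memory latency. The following is a minimal sketch only; the function name and the latencies of 1 and 50 cycles are assumptions chosen for illustration and do not describe any particular device.

```python
def average_access_time(hit_rate, cache_cycles, memory_cycles):
    """Average cycles per access: hits cost cache_cycles, misses cost memory_cycles."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Assumed, illustrative latencies: 1-cycle cache, 50-cycle off-chip main memory.
for hit_rate in (0.99, 0.95, 0.80):
    cycles = average_access_time(hit_rate, cache_cycles=1, memory_cycles=50)
    print(f"hit rate {hit_rate:.2f}: {cycles:.2f} cycles per access")
```

Under these assumed figures, a drop in hit rate from 99% to 80% raises the average access time from about 1.5 cycles to about 10.8 cycles, which is why an application that does not fit its caches can fall far short of the speed the CPU alone would suggest.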
Moreover, the CPU-centric design approach has led to very complex superscalar processors with deep pipelines. Much of this complexity, such as out-of-order execution and register scoreboarding, is devoted to hiding memory system latency. In addition, these processors demand a large amount of support logic in terms of caches, controllers, and data paths to talk to the external main memory. This adds considerable cost, power dissipation, and design complexity.
To fully utilize a superscalar processor, a large memory system is required. Such a memory system creates a bottleneck because it increases the distance between the CPU and main memory. Specifically, it adds interfaces and chip boundaries, which reduce the available memory bandwidth due to packaging and connection constraints.
However, integrating the processor with the memory device avoids most of the problems of the CPU-centric design approach. Doing so also offers a number of advantages that effectively compensate for the technological limitations of a single chip design.
Specifically, in CPU-centric processor designs, the instruction and data cache lines have a width that is significantly less than the width of the main memory. This is primarily because filling wider cache lines from the off-chip main memory would take long enough to introduce severe second order contention effects at the memory interface of the processor. As a result, such less-than-full-width caches are unable to take advantage of the often high spatial locality of instruction and data streams.
In the corresponding U.S. patent application, Ser. No. 08/675,254, entitled "INTEGRATED PROCESSOR/MEMORY DEVICE WITH FULL WIDTH CACHE", by Ashley Saulsbury, Andreas Nowatzyk, and Fong Pong, filed Jul. 1, 1996 (pending) and hereby explicitly incorporated by reference, there is described and claimed an integrated processor/memory (P/M) device with small on-chip data and instruction caches that are full width. With such an instruction cache, the generally good locality of instruction streams means that prefetching long instruction cache lines works well. On the other hand, such a full width data cache needs to cope with much more complex access patterns in order to be effective. Thus, while for some kinds of applications the long data cache lines may result in low data cache miss rates, in other applications the long data cache lines may cause high data cache miss rates.
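The dependence of data cache miss rate on access pattern can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a model of the referenced device: the direct-mapped organization, the 8 KB capacity, the 32-byte and 512-byte line sizes, and both address traces are hypothetical values chosen only to contrast short and long lines at equal capacity.

```python
def miss_rate(addresses, cache_bytes, line_bytes):
    """Miss rate of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines      # memory line currently held in each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes      # memory line containing this byte
        slot = line % num_lines        # direct-mapped placement
        if resident[slot] != line:
            resident[slot] = line      # fill the slot on a miss
            misses += 1
    return misses / len(addresses)


CACHE_BYTES = 8 * 1024                 # assumed capacity, same for both line sizes

# Assumed trace 1: one sequential sweep over 4096 consecutive 4-byte words.
sequential = [4 * i for i in range(4096)]

# Assumed trace 2: 64 words spaced 544 bytes apart, revisited 16 times.
scattered = [544 * i for i in range(64)] * 16

for name, trace in (("sequential", sequential), ("scattered", scattered)):
    for line_bytes in (32, 512):
        rate = miss_rate(trace, CACHE_BYTES, line_bytes)
        print(f"{name:10s} {line_bytes:3d}-byte lines: miss rate {rate:.4f}")
```

In this sketch the sequential sweep benefits from long lines, since each miss brings in 128 useful words instead of 8. The scattered trace behaves the opposite way: at fixed capacity the long lines leave only 16 slots, the 64 scattered words conflict with one another, and every access misses, whereas the short-line cache holds all 64 words after the first pass.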