A typical processing system with video/graphics display capability includes a central processing unit (CPU), a display controller coupled to the CPU by a CPU local bus (directly and/or through core logic), a system memory coupled to the CPU local bus through core logic, a frame buffer memory coupled to the display controller via a peripheral local bus (e.g., PCI bus), peripheral circuitry (e.g., clock drivers and signal converters, display driver circuitry), and a display unit. The CPU is the system master and generally provides overall system control in conjunction with the software operating system. Among other things, the CPU communicates, normally through core logic, with the system memory, which holds the instructions and data necessary for program execution. Typically, the core logic comprises two to seven chips, with one or more chips being "address intensive" and one or more other chips being "data path intensive." The CPU also, in response to user commands and program instructions, controls the contents of the graphics images to be displayed on the display unit by the display controller. The system and display frame buffer memories are typically constructed from dynamic random access memory devices (DRAMs), since DRAMs are typically less expensive, consume substantially less power, and provide more bits per given chip space (i.e., have a higher density) than the alternatives. DRAMs, however, are substantially slower than other types of memories, in particular static random access memories (SRAMs). As a result, the system memory and frame buffer bandwidths are normally limited.
To account for limited system and/or frame buffer memory bandwidth, one or more levels of data cache memory may be provided. The level 1 (L1) data cache is normally on the CPU chip itself. When used, the level 2 (L2) and level 3 (L3) caches are normally off-chip and coupled to the CPU by the CPU local bus. Cache memories are typically constructed from SRAMs, which provide shorter access time and higher bandwidth, although they consume more power, are more expensive to fabricate, and provide fewer cells (bits) per given chip space. For example, a typical SRAM cache may have a cycle time of 3 to 10 nsecs for a random access, while a random cycle of a typical DRAM memory device may require 110 to 130 nsecs. In other words, the "latency" for a typical DRAM is roughly 10 times or more that of the typical SRAM.
During cache operations, blocks of data are read from the system memory and written into the cache in anticipation of the data needs of the CPU. This "encachement" is typically done by the operating system as a function of such factors as the spatial and/or temporal locality of the data required by the CPU during a sequence of operations. If the CPU requires data for a given operation, and that data is already part of the encached block (i.e., a "cache hit" occurs), it can be accessed much faster than from the slower system memory. By selecting latency and density ratios between the system memory and the cache memory to be on the order of 10 to 1, and depending on the partitioning of the system memory by the operating system, the cache hit rate for reads to memory by the CPU can exceed 95%. When required data is not found encached, a cache "miss" occurs, and the CPU must directly access the system memory.
Even with cache hit rates of 95%, state of the art processors running at high clock rates are still confronted with a substantial number of cache misses. Thus, a significant number of direct accesses to the lower bandwidth system memory cannot be avoided. The problem is further compounded in "clock doubling" and "clock tripling" CPUs. In sum, state of the art CPUs simply require more bandwidth than can be accommodated by presently available memory devices and architectures, including those implementing one or more cache memories.
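The effect of the remaining misses can be illustrated with a simple weighted average. The following is only a sketch under stated assumptions: the function names are hypothetical, a miss is assumed to cost exactly one DRAM access with no added lookup overhead, and the 10 nsec and 120 nsec figures are illustrative picks from within the SRAM and DRAM ranges given above.

```python
def average_access_time(hit_rate, cache_ns, memory_ns):
    """Expected time per access when hits cost cache_ns and misses cost memory_ns.

    Assumed model: a miss costs one system memory access and nothing more.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


def miss_share(hit_rate, cache_ns, memory_ns):
    """Fraction of the total access time that is spent servicing misses."""
    total = average_access_time(hit_rate, cache_ns, memory_ns)
    return (1.0 - hit_rate) * memory_ns / total


if __name__ == "__main__":
    # Illustrative values only: 95% hit rate, 10 nsec SRAM, 120 nsec DRAM.
    avg = average_access_time(0.95, 10.0, 120.0)
    share = miss_share(0.95, 10.0, 120.0)
    print(f"average access time: {avg:.1f} ns")
    print(f"share of time spent on misses: {share:.0%}")
```

Under these assumed values, the 5% of accesses that miss account for well over a third of the total access time, which is consistent with the observation that a high hit rate alone does not remove the dependence on system memory bandwidth.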
In addition to memory bandwidth considerations, other access requirements must be considered for each memory subsystem. For example, accesses to the system memory are normally made in either bursts or long streams of data. Typically, the bus is seized only for a short period of time and is then run at peak speed. The display frame buffer memory, on the other hand, is accessed on an almost continuous basis, since 70% of the time the frame buffer is supporting display screen refresh. In a Unified Memory Architecture (UMA), the unified memory maintains both the system memory and the display frame buffer, and therefore the frame buffer and system memory requirements must be balanced.
Further, the CPU and the peripheral controllers may demand that the memory subsystems support priority operations. During priority operations, the CPU or processor may request that a given memory operation be performed before another to meet some processing goal. For example, the CPU may request that a read operation be performed before a write operation, even though the write operation is currently earlier in the instruction queue, because the CPU requires data to complete the current processing operation. As another example, if a write operation and a read operation are time queued, the write may be executed first, since a read operation typically takes longer.
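The first priority example above, in which a read is served ahead of writes that were queued earlier, can be sketched as follows. This is a minimal illustration and not a description of any particular memory controller: the Request and RequestQueue names, the numeric priority convention (a lower value is served first), and the addresses are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Request:
    priority: int                      # lower value is served first (assumed convention)
    seq: int                           # arrival order; breaks ties first come, first served
    op: str = field(compare=False)     # "read" or "write"
    addr: int = field(compare=False)   # hypothetical target address


class RequestQueue:
    """Hypothetical queue that serves requests by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrivals = count()

    def submit(self, op, addr, priority=1):
        heapq.heappush(self._heap, Request(priority, next(self._arrivals), op, addr))

    def drain(self):
        """Return (op, addr) pairs in the order they would be serviced."""
        order = []
        while self._heap:
            request = heapq.heappop(self._heap)
            order.append((request.op, request.addr))
        return order


def demo_order():
    queue = RequestQueue()
    queue.submit("write", 0x100)            # queued first, normal priority
    queue.submit("write", 0x104)            # queued second, normal priority
    queue.submit("read", 0x200, priority=0) # queued last, but the CPU needs this data now
    return queue.drain()


if __name__ == "__main__":
    for op, addr in demo_order():
        print(f"{op} {addr:#x}")
```

In this sketch the read is serviced first even though it arrived last, while the two writes keep their original relative order.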
Thus, the need has arisen for circuits, systems, and methods for constructing and operating memory devices and subsystems. Such circuits, systems and methods should be applicable to the design and construction of devices and subsystems for use in state of the art processing systems, but not necessarily limited thereto. Among the particular considerations to be addressed should be memory latency, priority and access type.