1. Field of the Invention
This invention relates generally to digital computer systems and, more specifically, to a digital computer system having one or more buffer devices external to a central processing unit (CPU) for increasing the throughput of the digital computer system and methods therefor.
2. Description of the Related Art
Improving the performance of digital computer systems has always been a challenge for system architects. Specifically, much design work has focused on decreasing the time, or number of clock cycles, a given CPU must spend communicating with the relatively large main memory, or RAM, of a digital computer when executing read-from-memory and write-to-memory operations.
Most modern microprocessors, such as Intel's 486 CPU, include a small (8K bytes for the 486) internal first level cache memory (L1) to increase system performance. Cache memories are fast memory storage devices that utilize the principle of locality of reference to improve CPU read-from-memory efficiency and, therefore, overall system performance. Whenever the CPU accesses the main memory for code or data, additional bytes "surrounding" the byte(s) being fetched are brought into the cache in the form of a cache line. The principle of locality of reference predicts that the CPU will very probably use these additional bytes after using the code or data originally fetched, and quite possibly multiple times. Multiple uses of the same code or data occur during program loops, for example. These subsequent accesses will be "hits" in the relatively small and fast cache, and will therefore speed up execution because each "hit" reduces by one the number of CPU accesses to the relatively large and slow main memory. In the event of a "miss" in the cache, the CPU must access the main memory for its required code or data, and the cache is loaded with a new cache line of memory that "surrounds" this required code or data for potential subsequent use by the CPU.
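The hit and miss behavior described above can be illustrated with a small software model. The following is a minimal sketch only, not a description of any particular CPU: the direct-mapped organization, the 16-byte line size, the 512-line capacity (8K bytes in total) and the names DirectMappedCache and run_loop are all assumptions made for illustration. It counts hits and misses for a program loop that repeatedly reads the same 64 bytes.

```python
LINE_SIZE = 16    # bytes per cache line (assumed for illustration)
NUM_LINES = 512   # 512 lines x 16 bytes = 8K bytes (assumed organization)


class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only, not the data itself."""

    def __init__(self, line_size=LINE_SIZE, num_lines=NUM_LINES):
        self.line_size = line_size
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # one tag per cache line slot
        self.hits = 0
        self.misses = 0

    def read(self, address):
        """Return True on a hit; on a miss, load the surrounding line."""
        line_number = address // self.line_size
        index = line_number % self.num_lines
        tag = line_number // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: the whole line "surrounding" the address is brought in.
        self.tags[index] = tag
        self.misses += 1
        return False


def run_loop(cache, start, length, iterations):
    """Model a program loop that reads the same byte range repeatedly."""
    for _ in range(iterations):
        for address in range(start, start + length):
            cache.read(address)


cache = DirectMappedCache()
run_loop(cache, start=0x1000, length=64, iterations=10)
print(cache.hits, cache.misses)  # prints "636 4"
```

Under these assumptions, the 64 bytes span four cache lines, so only the first touch of each line goes to main memory; the remaining 636 of the 640 reads are served from the cache.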
To further decrease memory latency and increase system performance, some higher performance PC systems, whether desktop or notebook, include a large (e.g. 128K bytes) second level cache (L2) in their memory subsystems. An L2 cache performs the same basic task as an L1 cache but is much larger and external to the CPU. Therefore, an L2 cache not only decreases memory latency, but also helps to reduce memory bus utilization, allowing Direct Memory Access (DMA) devices more access to the system memory and thereby further increasing system throughput.
L2 cache subsystems, however, can be quite expensive, large and power hungry. Typical L2 caches hold 128K bytes of memory or more and can cost hundreds of dollars. As today's personal computers continue to decrease in size with the proliferation of portable, battery-powered notebook, sub-notebook and hand-held computers, and as prices plummet, L2 cache subsystems have become cost prohibitive in terms of price, real estate and power consumption for many of these smaller systems.
Accordingly, a definite need has evolved to provide alternatives to L2 cache memory subsystems that would be small and low cost, that would consume little power, that would be external to the CPU and that would significantly increase CPU read-from-memory and write-to-memory efficiency and, consequently, the throughput of personal computer systems.
Another well-known approach to improving CPU performance is the Posted Write Buffer (PWB). In a basic computer system without a PWB, when the CPU generates a data write-to-memory operation, it must wait for the write cycle to complete before starting a subsequent cycle on the bus. If the cycle following the write happens to be a data read cycle, the CPU is forced to wait until the write cycle finishes and the read cycle then completes and returns data to the CPU. The CPU can be stalled for a significant amount of time if the read follows multiple write cycles. The PWB is designed to decrease the latency of write-to-memory cycles generated by the CPU by providing a buffer into which the CPU may quickly "dump" data for temporary storage in a first-in first-out (FIFO) configuration for subsequent submission to the main memory whenever the memory is available. The CPU, therefore, sees a very fast response to sequential write operations going to the main memory. If a subsequent read is allowed to go to the memory while the writes wait in the PWB, the stall time of the CPU is significantly reduced. This can be done by incorporating enough "intelligence" in the PWB to ensure that the CPU gets the latest data when the read completes (Read Around Write with Merge). The efficiency of the system can be further improved by combining, in the PWB, multiple writes going to the same Dword in main memory (combine and store). Multiple CPU writes can thereby result in a single write to main memory when the PWB writes out the "combined" data.
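The PWB behavior described above (FIFO posting, combining of writes to the same Dword, and Read Around Write with Merge) can be sketched as a simple software model. This is an illustrative sketch under stated assumptions, not the hardware design of any actual PWB: the class name PostedWriteBuffer, the four-entry depth, the single-byte write granularity and the dictionary that stands in for main memory are all hypothetical.

```python
from collections import OrderedDict


class PostedWriteBuffer:
    """Toy model of a posted write buffer in front of a slow main memory."""

    def __init__(self, memory, depth=4):
        self.memory = memory          # dict: Dword address -> list of 4 bytes
        self.depth = depth            # number of Dword entries (assumed)
        self.entries = OrderedDict()  # Dword address -> {offset: byte}, oldest first
        self.memory_writes = 0        # write cycles actually sent to main memory

    def write(self, address, value):
        """Post a single-byte CPU write without waiting for main memory."""
        dword, offset = address & ~3, address & 3
        if dword in self.entries:
            # Combine and store: merge into the pending entry for this Dword.
            self.entries[dword][offset] = value
            return
        if len(self.entries) >= self.depth:
            self.drain_one()          # buffer full: retire the oldest entry
        self.entries[dword] = {offset: value}

    def drain_one(self):
        """Write the oldest pending Dword out to main memory (FIFO order)."""
        dword, pending = self.entries.popitem(last=False)
        data = self.memory.setdefault(dword, [0, 0, 0, 0])
        for offset, value in pending.items():
            data[offset] = value
        self.memory_writes += 1

    def flush(self):
        """Drain every pending entry."""
        while self.entries:
            self.drain_one()

    def read(self, address):
        """Read around pending writes, merging in any newer buffered bytes."""
        dword, offset = address & ~3, address & 3
        data = list(self.memory.get(dword, [0, 0, 0, 0]))
        for pending_offset, value in self.entries.get(dword, {}).items():
            data[pending_offset] = value
        return data[offset]


memory = {0x100: [0xAA, 0xBB, 0xCC, 0xDD]}
pwb = PostedWriteBuffer(memory)
pwb.write(0x100, 0x11)             # posted
pwb.write(0x101, 0x22)             # combined into the same Dword entry
pwb.write(0x200, 0x33)             # second entry
merged_read = pwb.read(0x101)      # latest data comes from the buffer
passthrough_read = pwb.read(0x102) # untouched byte comes from memory
pwb.flush()
print(hex(merged_read), hex(passthrough_read), pwb.memory_writes)
```

In this model the three CPU writes result in only two writes to main memory, and the read of address 0x101 returns the buffered value even though main memory still holds the older byte at the time of the read.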
While the Intelligent Posted Write Buffer (IPWB) increases CPU efficiency through its fast response to CPU writes and its ability to move memory writes out of the way of subsequent reads, there exists a need for speeding up the response to CPU reads so that they, too, can move out of the way of subsequent reads and writes and the CPU is not stalled.