1. Field of the Invention
The present invention generally relates to controlling the movement of data entries in a hierarchical buffer system.
2. Description of Background
Modern computer systems typically contain several integrated circuits (ICs), including a processor which may be used to process information in the computer system. The data processed by a processor may include computer instructions which are executed by the processor as well as data which is manipulated by the processor using the computer instructions. The computer instructions and data are typically stored in a main memory in the computer system.
Processors typically process instructions by executing each instruction in a series of small steps. In some cases, to increase the number of instructions being processed by the processor (and therefore increase the speed of the processor), the processor may be pipelined. Pipelining refers to providing separate stages in a processor where each stage performs one or more of the small steps necessary to execute an instruction. In some cases, the pipeline (in addition to other circuitry) may be placed in a portion of the processor referred to as the processor core. Some processors may have multiple processor cores, and in some cases, each processor core may have multiple pipelines. Where a processor core has multiple pipelines, groups of instructions (referred to as issue groups) may be issued to the multiple pipelines in parallel and executed by each of the pipelines in parallel.
As the number of processing cores increases, so does the demand on the memory subsystem to deliver the required data bandwidth. Since there is a practical limit to the number of channels a processor can directly attach to memory devices, a common architectural solution involves one or more memory buffer chips present on the channel. A primary role of the buffer chip is to forward a stream of read operations to a plurality of ranks and banks, attached to one or more memory ports, and to buffer the returning read data for transmission back to the processor cores. Often the DRAM frequency differs from the memory channel frequency, and this necessitates buffering and speed matching of the returning data.
As the number of read buffers increases, along with the operating frequency of the buffer chip itself, a new problem emerges. The product of the number of data bursts and the number of outstanding read requests gives the number of data sources which need to be multiplexed onto the memory channel. For example, a buffer chip with 4 read buffers, each capable of holding a burst length 8 (i.e. BL8) DRAM read, results in 32 bursts of data which must be delivered to the memory channel. With channel frequencies surpassing 2 GHz, the buffer data flow must now run at frequencies exceeding 1.5 GHz.
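The arithmetic in the example above can be written out as a short sketch. The function name mux_source_count is hypothetical and introduced only for illustration; the values 4 and 8 are taken from the example in the text.

```python
def mux_source_count(read_buffers: int, burst_length: int) -> int:
    """Number of data sources to multiplex onto the memory channel.

    Each read buffer holds one burst-length read, and every burst
    position in every buffer is a separate source for the output mux.
    """
    return read_buffers * burst_length


# Example from the text: 4 read buffers, each holding a BL8 DRAM read.
print(mux_source_count(4, 8))  # 32
```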
Furthermore, high performance processors are capable of generating continuous read streams which require the buffer chip to support some number of outstanding reads greater than the actual number of physical read buffers. For instance, if the buffer chip has 4 read buffers, the sophisticated scheduling schemes employed by the memory controller will typically launch a 5th read before the 1st read departs the buffer chip. This exploits the known fixed latencies in the memory channel, buffer chip and DRAM devices to pipeline additional read operations and stress the buffers.
The simplest (i.e. brute force) solution is to overdesign the buffer chip data flow and instantiate additional buffers. By using simple round robin schemes, the buffer management logic is easy to implement, but at a physical design cost of additional real estate. This creates a huge problem if the data sources are scattered around the chip, since it would necessitate adding pipelining stages just to transport the data either to or from the buffer pool. This approach would also aggravate the problem of having to select from all of the data sources by introducing even more sources into the data flow muxing.
The more common approach is to only employ the required number of buffers (4 in this example), but to use a more sophisticated buffer controller which supports pipelining. As data is being read out of the first buffer, the returning DRAM data from the 5th read simultaneously begins loading into the first buffer. Then the returning data from a 6th read can pipeline into the second buffer and so on. This solution permits the memory controller to send a continuous read stream, and depending on the ratio of the DRAM frequency to the channel frequency, a sustained bandwidth of twice the number of actual read buffers can typically be achieved. However, the problem of outgating 32 sources still remains. With data flows running at 1-2 GHz, this often requires additional pipelining stages between the buffer pool and the memory channel. Unfortunately, this method impacts the latency of the start of data delivery.
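The buffer reuse described above can be illustrated with a minimal sketch. It is not a description of any particular buffer controller. The names buffer_for_read and can_overlap, the per-burst timing model, and the assumption that a burst slot may be rewritten in the same time step in which it is unloaded are all hypothetical and introduced only for illustration.

```python
NUM_BUFFERS = 4    # from the example in the text
BURST_LENGTH = 8   # BL8, from the example in the text


def buffer_for_read(read_number: int, num_buffers: int = NUM_BUFFERS) -> int:
    """Map a 1-based read number to a 1-based buffer, reusing buffers in order."""
    return (read_number - 1) % num_buffers + 1


def can_overlap(unload_start: int, load_start: int,
                unload_period: int = 1, load_period: int = 1,
                burst_length: int = BURST_LENGTH) -> bool:
    """Check that loading a new read never overwrites a burst not yet unloaded.

    Assumed model: burst i of the old read leaves the buffer at
    unload_start + i * unload_period, and burst i of the new read is
    written at load_start + i * load_period. A slot may be rewritten in
    the same time step in which it is unloaded (an assumption).
    """
    return all(
        load_start + i * load_period >= unload_start + i * unload_period
        for i in range(burst_length)
    )


# The 5th read lands in the first buffer and the 6th in the second.
print([buffer_for_read(n) for n in range(1, 7)])  # [1, 2, 3, 4, 1, 2]

# Loading starts in the same step as unloading, at matching rates.
print(can_overlap(unload_start=0, load_start=0))  # True

# If the new data arrives faster than the old data drains, the load overruns.
print(can_overlap(unload_start=0, load_start=0, unload_period=2))  # False
```

Under this model, whether the overlap is safe depends on the relative rates of loading and unloading, which is consistent with the dependence on the ratio of the DRAM frequency to the channel frequency noted above.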