1. Field of the Invention
The present invention relates to a data supply technique useful for efficiently supplying data in a computer system.
2. Description of the Related Art
In recent years, improved processing performance and cost reduction have been required for various apparatuses. Generally, a computer system includes a processor which executes an application, a data processing circuit, and a storage device, such as a memory, used for storing a program or data to be processed. Ideally, the storage device in a computer system would be capable of reading/writing all the programs and data used by the processor and the data processing circuit at a high speed. For example, if a memory unit with comparatively short access latency, such as a static random access memory (SRAM), is provided as a dedicated local memory for each processor and data processing circuit, the processing performance can be easily improved.
On the other hand, in realizing a cost reduction of apparatuses, it is desirable that a single storage device be shared by many processors and data processing circuits so that the number of storage devices can be reduced. Further, when a memory is used as a storage device, in most cases an inexpensive, widely available dynamic random access memory (DRAM) is used as the memory.
However, if an inexpensive DRAM is used, access latency will be increased compared to the SRAM described above. Further, if a single storage device is shared among many processors and data processing circuits, contention arises among them for reading/writing the storage device. In such a case, each access is arbitrated and, as a result, the access latency of the processors or data processing circuits will be increased. Thus, the processing performance of each processor or data processing circuit is reduced.
In order to prevent such a performance reduction of the processor or data processing circuit, a cache device is generally provided between the processor or the data processing circuit and the storage device. So long as the desired data can be read out from the cache device, the processor or data processing circuit does not access the storage device (submit a data request). In this manner, access to the storage device from each processor or data processing circuit is reduced and the total access bandwidth can be reduced.
Although the circuit size naturally increases according to the use of a data supply mechanism such as the cache device, the circuit size is still small compared to when a dedicated local memory is used as described above. By using an optimum cache device in the computer system, a low cost apparatus with a high processing performance can be realized.
If the desired data exists in the cache device (a cache hit), the processor or the data processing circuit does not need to access the storage device for data, and thus the access latency is reduced. On the other hand, if the desired data does not exist in the cache device (a cache miss), naturally, the processor or the data processing circuit accesses the storage device (submits a data request) for the desired data. In this case, the access latency is similar to a case where a cache device is not provided.
Generally, the processor or the data processing circuit processes data in order. Thus, when a cache miss occurs, the processor or the data processing circuit temporarily stops operating until the desired data is read out from the storage device. Naturally, the processing performance of the processor or the data processing circuit is reduced by such stopping of operation. This is called a blocking operation. Further, the process of reading out data from the storage device when a cache miss occurs is called a “refill” and the data which is read out is called “refill data”. Further, the unit of data read at a time is called the “refill length” and the length of the reading time is called the “refill latency”.
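The blocking operation described above can be sketched as follows. This is an illustrative model, not the implementation of any cited patent; the direct-mapped organization and all sizes (line size, line count, refill latency) are assumed example values.

```python
# Minimal sketch of a blocking cache: on a miss, the processor stalls
# for the full refill latency before continuing. All parameters are
# illustrative assumptions.

LINE_SIZE = 64        # refill length in bytes (one cache line)
NUM_LINES = 16        # number of cache lines (assumed direct-mapped)
REFILL_LATENCY = 100  # cycles to read one line from the storage device

class BlockingCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES  # cache tag stored per line
        self.cycles = 0                 # total cycles consumed

    def access(self, addr):
        line = addr // LINE_SIZE
        index = line % NUM_LINES
        self.cycles += 1                 # hit/miss determination
        if self.tags[index] == line:     # cache hit: data supplied at once
            return "hit"
        # Cache miss: the processor stops ("blocks") until one
        # refill-length unit is read out, then the cache is updated.
        self.cycles += REFILL_LATENCY
        self.tags[index] = line
        return "miss"

cache = BlockingCache()
results = [cache.access(a) for a in (0, 8, 64, 0)]
print(results)       # first access to each line misses; repeats hit
print(cache.cycles)  # two misses dominate the cycle count
```

The example shows why blocking hurts throughput: two misses out of four accesses cost roughly two full refill latencies.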
In order to enhance the processing performance, Japanese Patent No. 3846638 discusses a data supply device with a cache mechanism which can hide the above-described refill latency. First, the pipeline processor discussed in Japanese Patent No. 3846638 determines, in a preceding stage (pre-processing) of a predetermined pipeline stage, whether a cache miss occurs for the data necessary for the processing in that pipeline stage. If a cache miss is determined, the necessary data is requested in the preceding stage (pre-processing) and the refill is executed.
At that time, the pipeline processor discussed in Japanese Patent No. 3846638 includes an intermediate queue (FIFO) that is longer than the refill latency. The pipeline processor discussed in Japanese Patent No. 3846638 sequentially stores the subsequent processing, including the processing being “refilled”, in the intermediate queue (FIFO). In other words, the pipeline processor discussed in Japanese Patent No. 3846638 can continue the cache miss/hit determination of the next processing while storing the processing in the intermediate queue (FIFO). Thus, unlike the above-described blocking operation, the processing of the processor is not temporarily stopped each time a cache miss occurs.
On the other hand, each time a cache miss occurs, the pipeline processor discussed in Japanese Patent No. 3846638 needs to temporarily store the refill data read out from the storage device in a fill FIFO before updating the cache memory. Since cache-hit data that precedes the cache-miss data processing exists in the intermediate queue (FIFO), the cache memory cannot be updated until the data processing of the cache hit is finished in the predetermined pipeline stage. Thus, the pipeline processor discussed in Japanese Patent No. 3846638 necessarily includes the above-described fill FIFO. An operation that continues the cache miss/hit determination of the next data processing by using an intermediate queue (FIFO) in this way is called a non-blocking operation.
A data processing command is delayed in the intermediate queue (FIFO). If the refill is completed during the delay and the refill data for the cache miss is stored in the fill FIFO, the refill data can be supplied from the fill FIFO and the data processing can be executed. In other words, the data supply device having the cache mechanism discussed in Japanese Patent No. 3846638 can continue data processing while hiding the refill latency during the cache miss without temporarily stopping the processing.
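The non-blocking flow above can be sketched as two stages joined by the intermediate queue (FIFO), with miss data parked in a fill FIFO until the command reaches the pipeline stage that consumes it. The function names, queue contents, and sizes below are illustrative assumptions, not the structure of the cited patent.

```python
# Sketch of the non-blocking operation: hit/miss determination never
# stalls; a flag per command goes into the intermediate queue (FIFO),
# and refill data for misses waits in a fill FIFO. Names and sizes are
# illustrative assumptions.
from collections import deque

LINE_SIZE = 64
NUM_LINES = 16

tags = [None] * NUM_LINES
intermediate_queue = deque()  # holds (hit/miss flag, cache index)
fill_fifo = deque()           # holds refill data awaiting cache update

def prefetch_stage(addr):
    """Hit/miss determination: launches a refill on a miss, never stops."""
    line = addr // LINE_SIZE
    index = line % NUM_LINES
    if tags[index] == line:
        intermediate_queue.append(("hit", index))
    else:
        tags[index] = line  # tag updated at determination time
        # Modeled as if the refill has completed by consumption time.
        fill_fifo.append((index, f"refill data for line {line}"))
        intermediate_queue.append(("miss", index))

def fetch_stage(cache_data):
    """Consumes one delayed command; on a miss, data is supplied from the
    fill FIFO and the cache memory is updated only at this point."""
    flag, index = intermediate_queue.popleft()
    if flag == "miss":
        idx, data = fill_fifo.popleft()
        cache_data[idx] = data  # cache memory updated from the fill FIFO
    return flag, cache_data[index]

cache_data = [None] * NUM_LINES
for addr in (0, 64, 0):  # determinations proceed without stopping
    prefetch_stage(addr)
processed = [fetch_stage(cache_data) for _ in range(3)]
print(processed)
```

Note how the third command hits on line 0 at determination time, yet its data is only valid because the earlier miss for that line updated the cache memory from the fill FIFO before the hit was consumed, which is exactly the ordering constraint the fill FIFO exists to enforce.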
However, according to the technique discussed in Japanese Patent No. 3846638, a fill FIFO for temporarily storing the refill data is required in addition to a cache memory.
A low-cost DRAM is used as the storage device backing the cache data. Generally, from the viewpoint of memory bandwidth efficiency, it is preferable to submit data requests so that reading/writing of the DRAM is performed collectively over consecutive storage regions. Such a data request is called a burst access. Thus, it is desirable that the DRAM is accessed and read/written in this unit of burst access.
Due to advances in semiconductor process miniaturization and product demands in manufacturing DRAMs, the internal operating frequency of DRAMs increases with each manufacturing generation. Naturally, the unit of reading/writing by burst access is also increasing year by year. Given the growing demand for high-performance devices, the reading/writing unit of DRAMs is expected to continue to increase.
Regarding a cache device, cache data (cache line) corresponding to one cache tag (cache address) is often adjusted to an integral multiple of this reading/writing unit of burst access. The reading/writing unit of refill data (refill length) that corresponds to one cache miss will be the same as the cache line. For example, the reading/writing unit of refill data in relation to the above-described DRAMs is 32 to 128 bytes.
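The relationship between the burst-access unit and the cache line can be stated as simple arithmetic; the burst unit and multiplier below are assumed example values within the 32-to-128-byte range mentioned above.

```python
# Illustrative check: the cache line (and hence the refill length for one
# cache miss) is an integral multiple of the DRAM burst-access unit.
# Values are assumed examples, consistent with the 32-128 byte range above.
burst_unit = 32               # bytes per burst access (assumed)
cache_line = 4 * burst_unit   # cache line adjusted to an integral multiple
refill_length = cache_line    # one cache miss refills exactly one line
print(refill_length, refill_length % burst_unit)  # 128 bytes, remainder 0
```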
The above-described fill FIFO needs a capacity sufficient to store the refill data corresponding to the number of cache-miss commands held in the intermediate queue (FIFO). The refill latency of a device that implements a cache device is tens to hundreds of cycles, and the number of stages of the intermediate queue (FIFO) corresponds to that number of cycles.
For example, if the cache hit ratio is 75%, 25% of the intermediate queue (FIFO) will be a cache miss. If the intermediate queue (FIFO) includes 128 stages, the fill FIFO will be 25% of 128 stages. Accordingly, 32 stages will be necessary for the fill FIFO. Considering the reading unit of refill data described above, the capacity of the fill FIFO is 1K to 4K bytes. This is not small enough to be ignored in a device that implements a cache device.
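The sizing in the example above works out as follows, using the figures given in the text (a 75% hit ratio, a 128-stage intermediate queue, and a 32-to-128-byte refill length):

```python
# Worked example of the fill FIFO sizing, with the numbers from the text.
queue_stages = 128
miss_ratio = 1.0 - 0.75            # cache hit ratio of 75%
fill_fifo_stages = int(queue_stages * miss_ratio)
print(fill_fifo_stages)            # 32 stages needed for the fill FIFO

# Each stage holds one refill-length unit of 32 to 128 bytes:
min_bytes = fill_fifo_stages * 32
max_bytes = fill_fifo_stages * 128
print(min_bytes, max_bytes)        # 1024 (1K) to 4096 (4K) bytes
```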
The cache device discussed in Japanese Patent No. 3846638 includes the following storage regions:
(1) a storage region of a cache tag used for determining a cache hit/miss by prefetch logic;
(2) a storage region of the intermediate queue (FIFO);
(3) a storage region of a fetch logic fill FIFO; and
(4) a storage region of a cache memory for storing fetch logic cache data.
As described above, the storage regions that impact the circuit size are (3) the "fill FIFO", with its long refill length, and (4) the "cache memory". If (3) "fill FIFO" and (4) "cache memory" exist as separate hardware devices, as is discussed in Japanese Patent No. 3846638, the circuit size will be increased. Although the number of FIFO stages in (2) "intermediate queue (FIFO)" is large, since the intermediate queue is used only for transferring a flag indicating the result of the cache hit/miss determination and an address where the data is stored in the cache memory, the data width of the FIFO itself is very small compared to the refill length described above.