1. Field of the Invention
The technique described herein relates to a cache technique for high-speed program execution in the fields of high-performance processors and high-performance computing.
2. Description of the Related Art
Recently, improvements in operating frequency have made the delay time for a memory access relatively longer, to the point where it affects the performance of the overall system. In order to hide the memory access delay time, processors are often provided with a small-capacity, high-speed memory called a cache memory.
FIG. 1 shows an operational outline of a set associative cache memory, which is currently the most popular type. A cache memory 1401 comprises a plurality of sets. Each of the sets is controlled by dividing it into a plurality of cache ways 1402 (hereinafter, the cache way may simply be called a “way”), for example, cache ways 1402(#1) through 1402(#4). Thus, the example in FIG. 1 illustrates a 4-way set associative cache memory.
Each of the cache ways 1402 comprises a plurality of cache blocks 1403 (hereinafter, the cache block may simply be called a “block”), for example cache blocks 1403(#1) through 1403(#n), the value of n being, for example, 1024.
Each of the cache blocks 1403 comprises a validity flag that shows validity/invalidity, a tag, and data. The field sizes are, for example, 1 bit for the validity flag, 15 bits for the tag, and 128 bytes for the data.
The size of the cache memory 1401 is, for example, 512 kilobytes, calculated, for example, as “the size of a cache block×the number of the cache blocks×the number of the cache ways=128 bytes×1024 blocks×4 ways”.
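The size calculation above can be checked with a short sketch. The constant and function names are illustrative assumptions, not part of the described hardware; the figure counts only the data portion of each block and excludes the tag and validity flag bits.

```python
# Example parameters taken from the description above.
BLOCK_BYTES = 128       # data bytes per cache block
BLOCKS_PER_WAY = 1024   # n, the number of cache blocks in each way
NUM_WAYS = 4            # 4-way set associative


def cache_size_bytes(block_bytes, blocks_per_way, num_ways):
    """Data capacity: block size x number of blocks x number of ways."""
    return block_bytes * blocks_per_way * num_ways


size = cache_size_bytes(BLOCK_BYTES, BLOCKS_PER_WAY, NUM_WAYS)
print(size, "bytes =", size // 1024, "kilobytes")
```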
Meanwhile, an address 1405 comprises 32 bits specified by the program for the memory access. In the 32 bits of the address 1405, the top 15 bits are used as a tag, the next 10 bits are used as an index, and the last 7 bits are used as an offset within a cache block.
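The division of the 32-bit address into tag, index, and offset can be sketched as follows. The function name `split_address` and the sample address are hypothetical; only the 15/10/7 bit widths come from the description. The index is returned as a 0-based value, so index 0 would correspond to block #1.

```python
TAG_BITS = 15     # top 15 bits
INDEX_BITS = 10   # next 10 bits
OFFSET_BITS = 7   # last 7 bits (128-byte block)


def split_address(addr):
    """Split a 32-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


# Sample address chosen only for illustration.
tag, index, offset = split_address(0x12345678)
print(tag, index, offset)  # prints: 2330 172 120
```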
According to the above configuration, when a data read out for the address 1405 is specified, one of the block numbers #1 through #n is specified by the 10-bit index in the address 1405. Let this number be #i.
As a result, the cache block 1403 (#i) corresponding to the specified block number #i is read out from each of the cache ways 1402 (#1) through (#4). The read-out cache blocks 1403 (#i) are then input to comparators 1404 (#1) through (#4), respectively.
The comparators 1404 (#1) through (#4) detect match/mismatch between the tag value in each of the read-out cache blocks 1403 (#i) and the tag value in the specified address 1405. A cache hit occurs in the cache block 1403 (#i) that is read out for the one of the comparators 1404 (#1) through (#4) in which the match is detected. As a result, the data in that cache block 1403 (#i) is read out. The above configuration thus enables a data read out at a higher speed than a read out from the main memory.
When no match is detected in any of the comparators 1404, or when the validity flag indicates invalidity even if a match is detected, the cache hit does not occur. In this case, data is read out from the address 1405 in the main memory.
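The read-out procedure can be modeled in software roughly as follows. This is a minimal sketch under stated assumptions: the `Block` class, `make_cache`, and `read` are hypothetical names, the hardware compares all four ways in parallel whereas the loop below checks them one by one, and a return value of `None` stands for the miss case in which data would be read from the main memory.

```python
from dataclasses import dataclass

NUM_WAYS = 4
NUM_BLOCKS = 1024
BLOCK_BYTES = 128


@dataclass
class Block:
    valid: bool = False              # validity flag
    tag: int = 0                     # 15-bit tag
    data: bytes = bytes(BLOCK_BYTES) # 128 bytes of data


def make_cache():
    """cache[way][index] holds one cache block."""
    return [[Block() for _ in range(NUM_BLOCKS)] for _ in range(NUM_WAYS)]


def read(cache, addr):
    """Return the byte at addr on a cache hit, or None on a miss."""
    tag = (addr >> 17) & 0x7FFF
    index = (addr >> 7) & 0x3FF
    offset = addr & 0x7F
    for way in cache:                # one comparator per way
        block = way[index]
        # A hit needs both a tag match and a set validity flag.
        if block.valid and block.tag == tag:
            return block.data[offset]
    return None                      # miss: fall back to main memory


cache = make_cache()
miss = read(cache, 0x12345678)
cache[2][172] = Block(valid=True, tag=2330, data=bytes(range(128)))
hit = read(cache, 0x12345678)
print(miss, hit)
```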
Meanwhile, when a data write in for the address 1405 is specified, #i as one of the block numbers #1 through #n is specified by the 10-bit index in the address 1405, in the same manner as for the read out.
Next, a replacement way selection circuit 1501 as shown in FIG. 2 selects, from the four cache blocks 1403 (#i) corresponding to the block number #i specified respectively in the cache ways 1402 (#1) through (#4), a block which is not yet used (in which the tag is not specified), or a block with its validity flag indicating invalidity, or when all the blocks are currently used, a block in a way determined in accordance with a predetermined algorithm. Then the replacement way selection circuit 1501 outputs a 4-bit selection signal as shown in FIG. 2. In accordance with the selection signal output from the replacement way selection circuit 1501 as described above, the data is written into the cache block 1403 in the selected one of the four ways (#1) through (#4) having the specified block number #i.
When all of the blocks are currently used, the selection from the four cache ways 1402 (#1) through (#4) is made in accordance with, for example, the LRU (Least Recently Used) algorithm. According to the algorithm, the cache block data in the cache way that was least recently used is selected and replaced (removed).
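The selection logic of the replacement way selection circuit 1501 can be sketched as follows. Several details are assumptions made for illustration: the 4-bit selection signal is modeled as one-hot (bit k set means way #(k+1) is selected), unused blocks and blocks flagged invalid are both represented by `valid = False`, and LRU state is modeled as a per-way timestamp. The source does not specify any of these encodings.

```python
def select_replacement_way(valid, last_used):
    """Return a 4-bit one-hot selection signal (assumed encoding).

    valid:     list of validity flags, one per way
    last_used: list of timestamps, one per way (larger = more recent)
    """
    # Prefer a block that is unused or marked invalid.
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return 1 << way
    # All blocks in use: pick the least recently used way.
    lru_way = min(range(len(valid)), key=lambda w: last_used[w])
    return 1 << lru_way


# Way index 2 is invalid, so it is chosen regardless of LRU state.
free_sel = select_replacement_way([True, True, False, True], [5, 1, 0, 3])
# All ways valid: way index 1 has the oldest timestamp.
lru_sel = select_replacement_way([True, True, True, True], [5, 1, 7, 3])
print(format(free_sel, "04b"), format(lru_sel, "04b"))  # prints: 0100 0010
```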
As is apparent from the above operation description, when the object of the write in is large-size data, a plurality of pieces of data may have the same index value in the address 1405, causing cache conflicts between the pieces of data. However, in a set associative cache memory, even if the same cache block 1403 is specified from the cache blocks 1403 (#1) through (#n) by the index, the cache block selection can be made from a plurality of cache ways. Therefore, for example, the 4-way cache memory 1401 shown in FIG. 1 can handle a maximum of four pieces of data having the same index.
In the common cache memory having the configuration as described above, a programmer cannot explicitly specify the data arrangement, for example to keep predetermined pieces of data in the cache memory so that a high-speed access can be made to the data. For this reason, there has been a problem that processing performance deteriorates due to unintended data removal (replacement).
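As an illustration of such a conflict, with the example layout of a 7-bit offset and a 10-bit index, addresses that differ by a multiple of 2^17 bytes (128 kilobytes) share the same index. The sketch below uses hypothetical names and an arbitrary base address to show five such addresses competing for the four ways available at one index.

```python
INDEX_STRIDE = 1 << 17  # 128 KB: addresses this far apart share an index
NUM_WAYS = 4


def index_of(addr):
    """10-bit index field of a 32-bit address."""
    return (addr >> 7) & 0x3FF


def tag_of(addr):
    """15-bit tag field of a 32-bit address."""
    return (addr >> 17) & 0x7FFF


# Five addresses spaced 128 KB apart (base address is arbitrary).
addrs = [0x1000 + k * INDEX_STRIDE for k in range(5)]
indices = {index_of(a) for a in addrs}
tags = {tag_of(a) for a in addrs}

# One shared index, five distinct tags, but only four ways to hold them.
overflow = len(tags) - NUM_WAYS
print(indices, len(tags), overflow)
```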
Methods using a local memory, scratch pad or cache line (way) lock have been proposed to solve the above problem.
Japanese Patent Application Publication No. 10-187533 discloses a conventional art in which a cache memory can be divided, for use, into (1) a normal cache memory area and (2) a scratch pad (or local memory) area.
Japanese Patent Application Publication No. 4-175946 also discloses the use of a cache memory while dividing it into a normal cache memory area and a local memory area. According to the conventional art, an address space of the main memory is given respectively to the cache memory and to the local memory area, to maintain data consistency by distinguishing which area is accessed at the time of a memory access.
Other methods have been proposed for enabling object data to continuously exist in a cache memory by locking a certain cache memory or a cache way in the cache memory, instead of dividing the cache memory as described above.
Meanwhile, the weakest way method has been proposed, as a conventional art mainly for avoiding unintended data removal with an access to stream data. According to the method, when transferring a piece of data to a cache memory in accordance with a memory access instruction, the piece of data can be specified as the data to be removed first, among pieces of data having the same index, which makes it possible to remove data that is used only once, such as stream data, prior to the other data.
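The effect of the weakest way method on one index can be illustrated with a simplified model. This is only a sketch under assumptions: the `CacheSet` class, the `weakest` parameter, and the ordered-list representation are hypothetical, and the actual mechanism of the method is not detailed in this description. The model keeps the tags of one set in removal order and places a block marked as weakest at the front, so that it is removed first.

```python
class CacheSet:
    """Tags held at one index, ordered from next-to-remove to last."""

    def __init__(self, num_ways=4):
        self.num_ways = num_ways
        self.order = []  # position 0 is removed first

    def access(self, tag, weakest=False):
        """Access a tag; return True on a hit, False on a miss."""
        hit = tag in self.order
        if hit:
            self.order.remove(tag)
        elif len(self.order) == self.num_ways:
            self.order.pop(0)          # remove the first candidate
        if weakest:
            self.order.insert(0, tag)  # marked to be removed first
        else:
            self.order.append(tag)     # ordinary LRU behavior
        return hit


s = CacheSet()
for t in ["A", "B", "C", "D"]:
    s.access(t)
# Stream data touched once each and marked as weakest.
s.access("S1", weakest=True)
s.access("S2", weakest=True)
print(s.order)  # prints: ['S2', 'B', 'C', 'D']
```

In this model, only one resident block ("A") is displaced by the stream; with plain LRU insertion, the second stream block would have displaced "B" as well.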
However, the conventional art described in Japanese Patent Application Publication No. 10-187533 requires special instructions for the scratch pad area (loading instruction to read in data from the main memory and write-back instruction to write the data back into the main memory). It also has problems such as a need for control to maintain data consistency between the cache area, scratch pad area and the main memory.
The conventional art described in Japanese Patent Application Publication No. 4-175946 has problems such as the need for an area judgment circuit. Furthermore, it requires control by an operating system to manage the memory space.
In addition, both Japanese Patent Application Publication No. 10-187533 and Japanese Patent Application Publication No. 4-175946 have a problem that operations such as changing the area size during program execution involve a large performance overhead.
Meanwhile, the conventional art of locking a cache line or cache way can easily cause problems: when the programmer forgets to perform the unlocking operation, or locks all cache areas by mistake, the cache system does not operate properly, leading to a system shutdown. It also has a problem that a dedicated hardware mechanism needs to be provided, and the additional hardware is costly.
Furthermore, the conventional art adopting the weakest way method has a problem that it cannot be implemented with the conventional art related to the local memory function.
Japanese Patent Application Publication No. 2003-296191 has also been disclosed as a conventional art.