1. Field of the Invention
The present invention relates to a cache provided between a processor and a main memory which stores therein programs or various kinds of data or information, and more particularly to a multi-port integrated cache in which a plurality of types of caches including an instruction cache are integrated.
2. Description of the Related Art
Generally, as shown in FIG. 1A, in order to improve the processing speed of a processor 1, a cache 2 is interposed between the processor 1 and a main memory 3. Of the various kinds of information (data) stored in the main memory 3, the information which the processor 1 frequently accesses is copied to the cache 2. Because the processor 1 then accesses the cache 2 instead of the main memory 3, high-speed processing by the processor is enabled.
However, when the processor 1 accesses the cache 2 and the target information (data) is not stored in the cache 2, a cache miss occurs. When a cache miss occurs, the processor 1 reads the target information (data) from the main memory 3 and writes it in the cache 2. The minimum unit of information (data) transferred between the main memory 3 and the cache 2 is referred to as a unit block.
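The miss handling described above can be illustrated with a minimal sketch. The class name SimpleCache, the block size of four words, the two-block capacity and the first-in first-out replacement are all assumptions made for illustration only; the text does not specify them.

```python
class SimpleCache:
    """Toy cache that copies whole unit blocks from main memory on a miss.

    Assumptions (not from the text): fully associative placement,
    FIFO replacement, 4 words per unit block, room for 2 blocks.
    """

    def __init__(self, main_memory, block_size=4, num_blocks=2):
        self.main_memory = main_memory
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.blocks = {}  # block number -> list of words (insertion order = age)
        self.misses = 0

    def read(self, address):
        block_no = address // self.block_size
        if block_no not in self.blocks:
            # Cache miss: fetch the whole unit block from main memory.
            self.misses += 1
            if len(self.blocks) >= self.num_blocks:
                oldest = next(iter(self.blocks))  # dicts keep insertion order
                del self.blocks[oldest]
            start = block_no * self.block_size
            self.blocks[block_no] = self.main_memory[start:start + self.block_size]
        return self.blocks[block_no][address % self.block_size]


memory = list(range(100, 116))  # 16 words of hypothetical main memory
cache = SimpleCache(memory)
values = [cache.read(a) for a in (0, 1, 2, 9, 0)]
print(values, "misses:", cache.misses)
```

Only the first access to each unit block goes to main memory; the later accesses to addresses 1, 2 and 0 are served from the cache.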
In recent years, in order to improve the processing speed of the processor 1, parallel processors which execute a plurality of types of processing in one clock cycle, as typified by the superscalar processor, have come into practical use. A processor 1 performing such parallel processing is required, for example, to simultaneously access an instruction (machine instruction) and data (arithmetic operation data) in the cache 2. In order to simultaneously access a plurality of sets of information (data) in one memory, that memory must have a plurality of ports (write/read terminals).
However, since there is no technique to create a large-capacity multi-port memory which can be used for the cache, independent one-port caches are provided by utilizing the fact that the access patterns for machine instructions and for arithmetic operation data differ from each other. For example, FIG. 1B shows an example in which the cache 2 depicted in FIG. 1A is divided into an instruction cache 4 which stores therein only instructions (machine instructions) and a data cache 5 which stores therein only data (arithmetic operation data).
It is to be noted that the difference between the access pattern of the instructions and the access pattern of the data is as follows. One instruction includes a plurality of steps which cannot be divided, and contiguous addresses are accessed. Therefore, in the access pattern of instructions, the required data width (number of bits read at a time) is large. In contrast, data is often accessed relatively randomly, and hence the required data width (number of bits read at a time) is small in the access pattern of the data.
However, the optimal storage capacities for the respective caches 4 and 5 differ in accordance with each program stored in the main memory 3. Therefore, compared with one cache 2 whose capacity is the sum of the capacities of the respective caches 4 and 5, fragmentation occurs and the utilization efficiency of the storage capacity is reduced. Furthermore, the cache miss ratio increases when a program with a large working set is executed.
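The fragmentation effect can be illustrated with a small sketch. The function names and all capacities and working-set sizes below are hypothetical figures chosen for illustration, not values from the text.

```python
def overflow_split(instr_need, data_need, instr_cap, data_cap):
    """Lines that do not fit when each cache has its own fixed capacity."""
    return max(0, instr_need - instr_cap) + max(0, data_need - data_cap)


def overflow_unified(instr_need, data_need, total_cap):
    """Lines that do not fit when one cache of the summed capacity is shared."""
    return max(0, instr_need + data_need - total_cap)


# Hypothetical program: needs 6 instruction lines but only 2 data lines.
split = overflow_split(6, 2, instr_cap=4, data_cap=4)
unified = overflow_unified(6, 2, total_cap=8)
print("split caches overflow:", split)
print("unified cache overflow:", unified)
```

In this example the split arrangement leaves two data-cache lines idle while two instruction lines do not fit, whereas a single cache of the same total capacity holds the whole working set.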
Moreover, it is generally known that when the number of accessible ports of a memory is increased, the necessary area of the memory increases in proportion to the square of the number of ports, as described in the following cited reference: H. J. Mattausch, K. Kishi and T. Gyohten, “Area efficient multi-port SRAMs for on-chip storage high random-access bandwidth,” IEICE Trans. on Electronics, vol. E84-C, No. 3, 2001, pp. 410-417.
Therefore, since the area cost and the wiring delay increase, it is hard to configure a large-capacity multi-port cache.
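The quadratic growth stated above can be made concrete with a short calculation. This is only a sketch: the function name relative_area is hypothetical, the capacity is assumed to be held constant, and the area is normalized so that a one-port memory has area 1.

```python
def relative_area(num_ports):
    """Area relative to a one-port memory of the same capacity,
    assuming area grows with the square of the port count."""
    return num_ports ** 2


for ports in (1, 2, 4, 8):
    print(f"{ports} port(s): {relative_area(ports)}x the one-port area")
```

Under this assumption, doubling the number of ports quadruples the area, which is why simply adding ports to a large cache quickly becomes impractical.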
Additionally, as shown in FIG. 1C, the speed of the processor 1 can be increased by providing a trace cache 6 in addition to the instruction cache 4. A string of instructions which has once been executed by the processor 1 is stored in the trace cache 6. When the processor 1 newly executes an instruction, it searches the instruction cache 4 and the trace cache 6 at the same time by using an address (fetch address). The processor 1 adopts the instruction string of the trace cache 6 when there are hits in both caches, and adopts the instruction in the instruction cache 4 when there is no hit in the trace cache 6.
The detailed operations of the instruction cache 4 and the trace cache 6 will now be described with reference to FIG. 2.
Basic blocks A to E corresponding to respective instructions are stored in a program 7 saved in the main memory 3. In the execution order, the basic block A comes first, B is skipped, and the processing branches to the basic blocks C and D.
In such a state, the basic blocks A to E in the program 7 are stored sequentially, line by line from the top, in the instruction cache 4. On the other hand, the basic blocks A, C and D which have actually been executed are sequentially stored in the trace cache 6.
Next, consider a case in which execution is again performed from the basic block A, following the previous execution history of A, C and D. In this case, the respective basic blocks of the instructions are stored in the instruction cache 4 in the same order as the storage order in the memory. Therefore, the processor 1 first fetches one line including the basic blocks A, B and C, then discards B, and fetches one line including the basic blocks C, D and E. Consequently, the processor 1 requires two cycles in order to fetch the target basic blocks A, C and D.
On the other hand, since the instruction string (basic blocks A, C and D) which has once been executed is stored in the trace cache 6, it is possible to cope with the segmentation of the instruction string (basic block string), and the fetch efficiency of the processor 1 can be improved.
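The two-cycle versus one-cycle comparison for the basic blocks of FIG. 2 can be sketched as follows. This is a simplified model under stated assumptions: a line holds three basic blocks, one fetch delivers only a run of blocks that is contiguous in memory, and the trace cache is modeled as a dictionary keyed by the first basic block of a stored trace. The function names are hypothetical.

```python
def icache_fetch_cycles(program_order, executed, line_width=3):
    """Count fetches needed when each fetch can only deliver blocks
    that are contiguous in memory (at most line_width per fetch)."""
    position = {block: i for i, block in enumerate(program_order)}
    cycles = 0
    i = 0
    while i < len(executed):
        cycles += 1
        taken = 1
        # Extend the run while the next executed block directly follows
        # the previous one in memory and the line still has room.
        while (i + taken < len(executed)
               and taken < line_width
               and position[executed[i + taken]]
               == position[executed[i + taken - 1]] + 1):
            taken += 1
        i += taken
    return cycles


def trace_cache_fetch_cycles(trace_cache, executed):
    """One fetch if the whole executed string is stored under its first block."""
    return 1 if trace_cache.get(executed[0]) == executed else None


program = ["A", "B", "C", "D", "E"]
executed = ["A", "C", "D"]
trace_cache = {"A": ["A", "C", "D"]}

print("instruction cache cycles:", icache_fetch_cycles(program, executed))
print("trace cache cycles:", trace_cache_fetch_cycles(trace_cache, executed))
```

The branch from A to C breaks the contiguous run, so the instruction cache model needs a second fetch, while the stored trace supplies A, C and D at once.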
As described above, when only the instruction cache 4 is used, the fetch efficiency is lowered because the instruction string is segmented by branch instructions, three or four of which are assumed to exist in every 16 instructions. Therefore, the trace cache 6 is provided. Furthermore, as described above, the processor 1 confirms the hit statuses of the two caches 4 and 6, fetches the target instruction string (basic block string) from the trace cache 6 when there is a hit in the trace cache 6, and fetches the target instruction string (basic block string) from the instruction cache 4 when there is a cache miss in the trace cache 6.
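The selection between the two caches can be sketched as below. Both caches are modeled as plain dictionaries keyed by fetch address, and the function name fetch and the sample addresses are hypothetical; real hardware would probe the two caches in parallel rather than one after the other.

```python
def fetch(fetch_address, instruction_cache, trace_cache):
    """Probe both caches with the same fetch address and prefer the trace cache."""
    trace_hit = trace_cache.get(fetch_address)
    line_hit = instruction_cache.get(fetch_address)
    if trace_hit is not None:
        return "trace cache", trace_hit
    if line_hit is not None:
        return "instruction cache", line_hit
    return "miss", None


instruction_cache = {0x100: ["A", "B", "C"], 0x108: ["C", "D", "E"]}
trace_cache = {0x100: ["A", "C", "D"]}

print(fetch(0x100, instruction_cache, trace_cache))  # hit in both caches
print(fetch(0x108, instruction_cache, trace_cache))  # trace cache miss
print(fetch(0x200, instruction_cache, trace_cache))  # miss in both caches
```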
However, the following problems still occur even if the trace cache 6 is provided in addition to the instruction cache 4 as mentioned above.
Although the cache capacities required by the trace cache 6 and the instruction cache 4 vary over time, the respective capacities of the trace cache 6 and the instruction cache 4 are fixed, and hence the capacity ratio cannot be dynamically changed. Therefore, the utilization efficiency of the caches as a whole is lowered.
Since duplicate instructions (basic blocks) exist in the instruction cache 4 and the trace cache 6, the utilization efficiency of the caches as a whole is reduced.
When it is predicted that the basic block A branches to the basic block B by a branch prediction, the instruction of only the basic block A is issued (fetched) from the trace cache 6.
Since the basic blocks are stored in the trace cache 6 with one basic block determined as a top, data strings with C and D determined as tops may additionally be stored in the trace cache 6 when the executed instruction string A, C and D exists as shown in FIG. 2. Therefore, an overlap of data (basic blocks) occurs in the trace cache 6, and the effective utilization ratio of the caches is lowered.
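The overlap inside the trace cache can be illustrated by counting stored copies. The sketch assumes, purely for illustration, that traces beginning at A, C and D have all been stored for the executed string A, C and D; the dictionary layout and the function name stored_copies are hypothetical.

```python
from collections import Counter


def stored_copies(traces):
    """Count how many times each basic block is held across all traces."""
    return Counter(block for trace in traces.values() for block in trace)


# Hypothetical trace cache contents, keyed by the top basic block.
traces = {
    "A": ["A", "C", "D"],
    "C": ["C", "D"],
    "D": ["D"],
}

copies = stored_copies(traces)
total_entries = sum(copies.values())
unique_blocks = len(copies)
print(dict(copies))
print(f"{total_entries} entries hold only {unique_blocks} distinct basic blocks")
```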
As described above, in a cache system in which the instruction cache 4 and the data cache 5 shown in FIG. 1B are individually provided, or in a cache system in which the trace cache 6 is provided in addition to the instruction cache 4 as shown in FIG. 1C, each cache has a small capacity, excess capacity cannot be transferred between the respective caches, and the cache miss ratio is increased as a whole. Moreover, overlapping storage of data (basic blocks) occurs between the instruction cache and the trace cache, thereby reducing the effective utilization ratio of the caches.