The present invention relates to an information processing unit having a hierarchical cache structure, and more particularly to an information processing unit equipped with a software prefetch instruction which can reduce the overhead due to a cache miss by transferring, in advance, operand data to be used for an operation from a main memory to a cache.
In general, an information processing unit having a cache reads an operand from a main memory and uses it when the operand referred to by an instruction does not exist in the cache, that is, when a cache miss has occurred. Usually, reading the operand from the main memory takes several times to dozens of times as long as a cache access. Accordingly, an information processing unit of this type has a problem in that, when a cache miss has occurred, the execution of a succeeding instruction is delayed until the operand has been read from the main memory. Thus, the execution time of the information processing unit is extended, thereby restricting its performance.
To solve the above-described problem, a technique is known for transferring operand data to be used in the future from the main memory to a cache in advance, thereby reducing the penalty of a cache miss by achieving a cache hit at the time when the operand data is used. Research on a software prefetch instruction for achieving this has been carried out, and the results are being used in various information processing units.
As a conventional technique relating to the software prefetch instruction, a technique described in the following paper, for example, is known: Callahan, D., Kennedy, K., Porterfield, A., “Software Prefetching”, Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 40-52.
The operation of the software prefetching instruction by an information processing unit according to the conventional technique will be explained below with reference to drawings.
FIG. 5 is a block diagram for showing an example of the configuration of the information processing unit according to the conventional technique, and FIGS. 6A and 6B are time charts for explaining a processing operation according to a presence or absence of a software prefetch. In FIG. 5, 21 denotes a CPU (central processing unit), 22 a primary cache, 24 a SCU (storage control unit) and 25 a main memory.
The conventional technique shown in FIG. 5 is an example of the information processing unit having a cache of one hierarchical level. This information processing unit comprises a CPU 21 for carrying out information processing, a primary cache 22, a main memory 25 and an SCU 24 for controlling the writing and reading of information to and from the main memory 25. In this information processing unit, the CPU 21 searches the primary cache 22 for operand data in order to refer to the operand data. When a cache miss has occurred, the CPU 21 issues a request for transferring the operand data to the SCU 24 through a request line 201 and an address line 202. The SCU 24 reads the operand data from the main memory 25 and transfers this data to the CPU 21 through a data line 204. The CPU 21 stores the received operand data in the primary cache 22 and also uses this data for an operation.
The operation of the information processing unit shown in FIG. 5 will be explained next by assuming that the processing of an instruction by the CPU 21 is carried out at five pipeline stages of IF (instruction fetch), D (decoding), E (execution), A (operand access) and W (register writing).
FIG. 6A shows a time chart for the case where instructions 0 to 3 are sequentially inputted to the pipelines and a software prefetching is not carried out, in the information processing unit having the above-described structure.
The example shown in FIG. 6A is a case where a cache miss has occurred in the operand access of the instruction 1, which is a load instruction. In this case, the instruction 1 is kept waiting at the stage A, which carries out the operand access, until the operand has been read from the main memory, and its processing at the stage W is delayed accordingly. Consequently, the processing of the instructions 2 and 3 at the stages A and E, respectively, and at the subsequent stages is also kept waiting, with the result that the entire time required for reading the operand from the main memory appears as a penalty due to the cache miss.
The time chart shown in FIG. 6B shows a case where a software prefetch is carried out. In this case, prior to the execution of the instruction 1, which is a load instruction, a software prefetch instruction corresponding to the instruction 1 is executed earlier by the time period required for transferring the operand from the main memory. As a result, the instructions which follow the software prefetch instruction continue through the pipeline stages without interruption while the operand is being transferred from the main memory according to the software prefetch. By the time the instruction 1 accesses the operand data, the operand data required by the instruction 1 has been stored in the primary cache 22 by the software prefetch instruction, and a cache hit is achieved. Thus, it is possible to reduce the penalty attributable to the cache miss.
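The difference between FIGS. 6A and 6B can be illustrated by a minimal sketch. The function name stall_cycles, its parameters, and the assumption that a prefetch issued too late hides only part of the transfer time are illustrative choices and are not taken from the figures.

```python
def stall_cycles(miss_penalty, prefetch_lead=None):
    """Cycles a load waits at stage A for its operand (hypothetical model).

    miss_penalty  -- cycles needed to transfer the operand from main memory
    prefetch_lead -- cycles by which a software prefetch precedes the load,
                     or None when no software prefetch is issued
    """
    if prefetch_lead is None:
        # FIG. 6A case: the whole transfer time appears as a penalty.
        return miss_penalty
    # FIG. 6B case: the transfer overlaps with the following instructions.
    # Assumption: a late prefetch still hides part of the transfer time.
    return max(0, miss_penalty - prefetch_lead)


no_prefetch = stall_cycles(40)        # full penalty
full_lead = stall_cycles(40, 40)      # penalty fully hidden
short_lead = stall_cycles(40, 10)     # penalty partly hidden
```

Under this model the penalty disappears only when the prefetch leads the load by at least the transfer time, which is the condition stated above.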
FIG. 7 is a block diagram for showing another example of the configuration of the information processing unit according to the conventional technique, and FIGS. 8A and 8B are time charts for explaining a processing operation according to a case where a software prefetch is carried out. In FIG. 7, 23 denotes a secondary cache and all other symbols denote the same items as those in FIG. 5.
The conventional technique shown in FIG. 7 shows an example of the configuration of the information processing unit having caches of two hierarchical levels.
This information processing unit has the same structure as that of the conventional technique shown in FIG. 5, except that the primary cache 22 is incorporated within the CPU 21 and a secondary cache 23 is provided.
According to the conventional technique shown in FIG. 7, the CPU 21 first searches the primary cache 22 in order to refer to operand data. When a miss occurs in the primary cache 22, the CPU 21 searches the secondary cache 23. If a hit occurs in the secondary cache 23, the operand data is transferred from the secondary cache 23 to the primary cache 22. In the subsequent explanation, this transfer will be called a block transfer and the operand data to be transferred will be called a block.
When a miss occurs in the secondary cache 23, the SCU 24 reads the operand data from the main memory 25 and transfers this data to the CPU 21. In the subsequent explanation, this transfer will be called a line transfer and the data to be transferred will be called a line. Usually, the data quantity of the block is smaller than the data quantity of the line, and at the time of the line transfer, the data of the whole line is stored in the secondary cache 23 while only the data of the block referred to is stored in the primary cache 22.
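The block transfer and line transfer described above can be sketched as a small model. The class name TwoLevelCache and its method are hypothetical, the block and line sizes of 32 B and 128 B are borrowed from the worked example given later in this description, and eviction is not modeled.

```python
BLOCK = 32    # bytes per primary-cache block (assumed size)
LINE = 128    # bytes per secondary-cache line (assumed size)


class TwoLevelCache:
    """Hypothetical model of the primary/secondary lookup order."""

    def __init__(self):
        self.primary = set()     # block numbers held in the primary cache
        self.secondary = set()   # line numbers held in the secondary cache

    def access(self, addr):
        block, line = addr // BLOCK, addr // LINE
        if block in self.primary:
            return "primary hit"
        if line in self.secondary:
            # Block transfer: secondary cache -> primary cache.
            self.primary.add(block)
            return "block transfer"
        # Line transfer: the whole line goes to the secondary cache,
        # but only the referenced block goes to the primary cache.
        self.secondary.add(line)
        self.primary.add(block)
        return "line transfer"


cache = TwoLevelCache()
results = [cache.access(a) for a in (0, 8, 32, 128)]
```

In this sketch the access to address 32 needs only a block transfer, because the earlier line transfer for address 0 already placed the surrounding 128 B line in the secondary cache.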
The operation of the information processing unit shown in FIG. 7 will be explained next by assuming that the processing of an instruction by the CPU 21 is carried out at five pipeline stages of IF (instruction fetch), D (decoding), E (execution), A (operand access) and W (register writing), in a manner similar to that explained above.
FIGS. 8A and 8B show a case where instructions 0 to 3 are sequentially inputted to the pipelines and are processed, in the information processing unit having the above-described structure. FIG. 8A is a time chart of the case for executing a software prefetch instruction for preventing a primary cache miss, and FIG. 8B is a time chart of the case for executing a software prefetch instruction for preventing both the primary cache miss and the secondary cache miss.
Usually, there is a considerably large difference between the number of penalty cycles for a primary cache miss and the number of penalty cycles for a secondary cache miss. For example, a primary cache miss costs four to five cycles and a secondary cache miss costs 30 to 40 cycles. Accordingly, in the information processing unit having the above-described cache structure of two hierarchical levels, a software prefetch instruction for avoiding a primary cache miss on operand data needs to be executed at least five cycles prior to the execution of the instruction that actually refers to the operand data. However, a software prefetch instruction for avoiding a secondary cache miss on operand data needs to be executed at least forty cycles prior to the execution of the instruction that actually refers to the operand data. Therefore, according to the above-described conventional technique, the time interval between the execution of the software prefetch instruction and the execution of the instruction that actually refers to the operand data must be varied depending on whether a primary cache miss or a secondary cache miss is to be avoided.
In the above-described information processing unit of the conventional technique having a cache structure of at least two hierarchical levels, it is necessary to change the timing for executing a software prefetch instruction depending on which hierarchical level of cache holds the operand data, that is, in which cache a hit occurs. Accordingly, the above-described conventional technique has a problem in that the control required when a compiler generates a software prefetch instruction becomes complex, and it is difficult to make full use of the effect of the software prefetch instruction.
FIG. 9 is a diagram showing the address relationship of an array of data which occupies a continuous area of memory and which is the object of a software prefetch operation, and FIG. 10 is a time chart for explaining an operation of transferring the operand data of the array shown in FIG. 9 by using a software prefetch instruction.
The above-described problems of the conventional technique will be explained below by assuming that a size of a block is 32 Bytes (B), a size of a line is 128 B, a transfer of the block requires four cycles and a transfer of the line requires forty cycles in the unit shown in FIG. 7.
As an example of using a software prefetch instruction, consider a case of sequentially referring to an array A (i) (i=0, 1, 2, . . . ) which occupies a continuous area of memory. When the size of individual data of the array A (i) is 8 B, the relationship of address between the line and the block is as shown in FIG. 9.
Assume that, in the initial state, all the data of the array A (i) miss in both the primary cache and the secondary cache. In this case, at first, a software prefetch instruction is executed forty cycles prior to the execution of the instruction for referring to the data of A (0), in order to achieve a line transfer of the data of A (0) from the main memory. By this execution, 32 B of data corresponding to one block, that is, the data from A (0) to A (3), becomes a hit in the primary cache. Similarly, 128 B of data corresponding to one line, that is, the data from A (0) to A (15), becomes a hit in the secondary cache.
Accordingly, a software prefetch instruction is not necessary to refer to the data from A (1) to A (3) in the processing of the subsequent instructions since these data already exist in the primary cache 22. However, when the data of A (4) becomes necessary, a software prefetch instruction is executed four cycles prior to the execution of the instruction for referring to the data of A (4) in order to carry out a block transfer from the secondary cache, since the data of A (4) exists in only the secondary cache. By this execution, data of 32 B corresponding to one block, that is the data from A (4) to A (7), is hit in the primary cache. Reference to the data from A (8) afterward becomes possible by repeating an issue of a software prefetch instruction similar to the one described above. FIG. 10 shows a time chart of the above-described operation.
As is clear from FIG. 10 and as already explained in FIGS. 8A and 8B, a time interval for executing the software prefetch instructions is not constant but is irregular. This is because the time interval between the execution of the software prefetch instruction for carrying out a line transfer from the main memory and the execution of the software prefetch instruction for carrying out a block transfer from the secondary cache is not consistent.
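The irregular schedule can be tabulated with a short sketch using the sizes and cycle counts assumed above (8 B elements, 32 B blocks, 128 B lines, four cycles per block transfer and forty cycles per line transfer). The function name prefetch_plan is hypothetical, and the sketch assumes that A (0) starts on a line boundary, as in the example.

```python
ELEM, BLOCK, LINE = 8, 32, 128     # sizes in bytes, from the example above
LINE_LEAD, BLOCK_LEAD = 40, 4      # cycles a prefetch must lead its use


def prefetch_plan(n):
    """List (index, transfer kind, lead cycles) for A(0) .. A(n-1)."""
    plan = []
    for i in range(n):
        addr = i * ELEM
        if addr % LINE == 0:
            # First element of a line: line transfer from main memory.
            plan.append((i, "line", LINE_LEAD))
        elif addr % BLOCK == 0:
            # First element of a later block: block transfer from the
            # secondary cache.
            plan.append((i, "block", BLOCK_LEAD))
        # All other elements already hit in the primary cache.
    return plan


plan = prefetch_plan(20)
```

For the first twenty elements the plan contains a line prefetch at A (0) and A (16) and block prefetches at A (4), A (8) and A (12), so both the kind of transfer and the required lead time change from one prefetch to the next.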
On the other hand, the accessing of and operating on data having the above-described array structure are repeated extremely regularly, and therefore they are usually executed by an instruction sequence structured as a loop using a branch instruction. When the time interval between the executions of the software prefetch instructions is irregular, it becomes difficult to build these software prefetch instructions into the loop. Further, an instruction for determining whether a line transfer or a block transfer is to be executed becomes necessary. This instruction increases the number of instructions and hinders the improvement in the performance of the processing unit that the software prefetch instructions are intended to provide. Further, when all the data of the line are handled by one loop, the number of instructions similarly increases, negatively affecting the performance of the processing unit.
It is an object of the present invention to provide an information processing unit having a cache structure of at least two hierarchical levels which can eliminate the above-described problems of the conventional technique, can provide a software prefetch instruction to help a compiler generate an instruction sequence easily, and can reduce the negative effects on the performance of the information processing unit due to a cache miss.
In an information processing unit having a software prefetch instruction for transferring operand data to be used for an operation from a main memory to a cache prior to the execution of the operation, the above-described object of the present invention can be achieved by providing one or a plurality of indication bits for indicating the content of a prefetch operation. The indication bits are included in either an operation code or an operand address of the software prefetch instruction. The type or content of the prefetch operation performed is based on the indication bits.
Further, the above-described object can be accomplished as follows. When the information processing unit has caches of two hierarchical levels, the above-described indication bits can indicate at least one of a hierarchical level of a cache to which the operand data is to be prefetched and a quantity of the operand data to be prefetched. The indication bits can also indicate that, when a certain quantity of operand data is transferred from the main memory to the secondary cache, operand data of the same quantity is to be prefetched to the primary cache, or that operand data of an integer multiple of the quantity of data transferred by a normal cache access instruction is to be transferred when the software prefetch instruction is executed.
According to the present invention, when a compiler generates an instruction in the information processing unit having a cache structure of at least two hierarchical levels, it is possible to clearly specify, by using the above-described indication bits, the hierarchical level of the cache to which data is to be transferred or the size of the data to be transferred. Accordingly, it becomes possible to generate software prefetch instructions at regular intervals. Therefore, the present invention can eliminate the necessity of generating instructions for determining an address relation when an instruction sequence is structured as a loop of instructions, such as an access to an array of data and an operation performed on the array of data.
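As a minimal sketch of what a regular schedule could look like, suppose the indication bits request that a full 128 B line be prefetched to both the secondary cache and the primary cache, which is one of the options described above. The function name regular_plan, the 8 B element size, the forty-cycle lead and the alignment of A (0) to a line boundary are assumptions carried over from the earlier array example, not a definitive code generation scheme.

```python
ELEM, LINE = 8, 128    # bytes per element and per line (assumed sizes)
LEAD = 40              # cycles needed for a line transfer (assumed)


def regular_plan(n):
    """One prefetch per line, always with the same lead time.

    Assumes each prefetch brings the whole line into both cache levels,
    so no separate block prefetch and no address test is needed.
    """
    per_line = LINE // ELEM            # 16 elements per prefetch
    return [(i, LEAD) for i in range(0, n, per_line)]


plan = regular_plan(48)
```

Here every prefetch is issued once per sixteen elements with the same lead time, so a loop processing sixteen elements per iteration could contain exactly one prefetch at a fixed position.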
Further, according to the present invention, data of a plurality of lines can be transferred at one time by one software prefetch instruction when accessing operand data having continuous addresses. Therefore, the number of software prefetch instructions can be reduced.
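The reduction can be estimated with simple arithmetic. In the sketch below, the function name prefetch_count, the 128 B line size and the example array size are illustrative assumptions.

```python
import math


def prefetch_count(array_bytes, line_bytes=128, lines_per_prefetch=1):
    """Number of software prefetch instructions needed to cover an array
    of continuous addresses (hypothetical helper)."""
    return math.ceil(array_bytes / (line_bytes * lines_per_prefetch))


single = prefetch_count(1024)                        # one line per prefetch
multi = prefetch_count(1024, lines_per_prefetch=4)   # four lines per prefetch
```

For a 1024 B array, transferring four lines per instruction reduces the count from eight prefetch instructions to two under these assumptions.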
Furthermore, compatibility between an instruction set architecture according to the present invention and an instruction set architecture according to the conventional technique can be achieved easily. For example, when the block size is 32 bytes, the lower five bits of the operand address of a software prefetch instruction are never used as an address. When the above-described indications are given by using these five bits, compatibility can be maintained without extending the instruction set architecture.
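Carrying the indication bits in the lower five bits of the operand address can be sketched as follows. The function names and the treatment of the five bits as an opaque value are assumptions for illustration; the description does not fix a particular bit assignment here.

```python
BLOCK = 32               # block size in bytes, as in the example above
HINT_MASK = BLOCK - 1    # 0b11111: the lower five address bits


def encode_prefetch_operand(addr, hint):
    """Place a 5-bit hint in the unused low bits of a prefetch address."""
    if not 0 <= hint <= HINT_MASK:
        raise ValueError("hint must fit in five bits")
    return (addr & ~HINT_MASK) | hint


def decode_prefetch_operand(operand):
    """Split a prefetch operand into (block address, 5-bit hint)."""
    return operand & ~HINT_MASK, operand & HINT_MASK


operand = encode_prefetch_operand(0x1000, 0b00011)
block_addr, hint = decode_prefetch_operand(operand)
```

A hint of zero leaves a block-aligned address unchanged, which is how an operand formed without indication bits would still decode to the same block address in this sketch.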
As described above, according to the present invention, a software prefetch instruction can be used effectively, and this can improve the performance of the information processing unit.