The present invention relates to an information processing system and to a cache flash control method used for the same. More particularly, the present invention relates to a cache flash control method suitable for an information processing system adopting a vector scheme.
The delay of clock distribution circuits has become a bottleneck in keeping up with the recent increases in the operating frequencies of information processing systems. As a result, it has become difficult to control different divisional units (such as CPUs (Central Processing Units) or LSIs (Large-Scale Integrated circuits)) with in-phase clocks.
One approach to solving the above-mentioned problem is to synchronize, by software, CPUs operated with asynchronous clocks. For example, there is a method that utilizes hardware functions such as barrier sync/communication registers and dispatches plural processes, divided by a compiler or by manual specification, to CPUs on which different OSs (Operating Systems) are installed.
This method is premised on a group of processes operating with completely different timings. Hence, even if the CPUs are clocked asynchronously, no malfunction attributable to the hardware functions occurs. Such a scheme is realized in scalar-type parallel supercomputers.
The parallel processing support function employing the above-mentioned hardware can be realized more easily and inexpensively than measures such as increasing the clock rate of a microprocessor or expanding the CPU-to-memory bandwidth. This approach is generally used to increase the apparent performance-to-cost ratio.
However, increasing the degree of parallel processing by software makes programs difficult to parallelize. Specifically, the extent to which a non-parallel program can be parallelized depends on the type of the program.
Moreover, even if programs can be parallelized, debugging them is far more difficult than debugging non-parallel programs. The debugging is carried out through performance tuning and requires a high level of skill in parallel computer technology. Such a task has to be repeated every time the hardware is upgraded.
Moreover, for the above-mentioned reasons, the enormous body of existing program resources cannot be fully exploited. Even if the technical problems of parallel processing are overcome, many problems remain, such as a shortage of personnel able to apply it in the field.
One approach to such problems is parallel processing by hardware, a specific example of which is the product called a vector-type supercomputer. The vector scheme is a parallel processing scheme (of the SIMD (Single Instruction stream, Multiple Data stream) type) in which plural sets of data are subjected to the same operation or memory access.
The section storing plural sets of data is called a vector register. A command that specifies the same operation, the same memory access, or the same transfer for all sets of element data in the vector register is called a vector command. For example, the number of elements to be operated on, 128, is set in the VL (vector length register) in accordance with the LVL (Load VL) command, as shown below.
LVL VL<−128
VADD V7<−V5+V4
Thereafter, in accordance with the VADD (vector addition) command, the 128 elements of V5 and the 128 elements of V4 are added element by element, and the results are set in the 128 element areas of V7.
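The effect of the LVL and VADD commands can be illustrated with a minimal software sketch. The register count (8), the register length (256 elements), and the names make_state, lvl, and vadd are assumptions introduced only for illustration; they are not taken from the specification.

```python
# Illustrative model only: register count and register length are assumptions.
NUM_REGISTERS = 8
REGISTER_LENGTH = 256


def make_state():
    """Create a vector length register VL and vector registers V0..V7."""
    return {"VL": 0, "V": [[0] * REGISTER_LENGTH for _ in range(NUM_REGISTERS)]}


def lvl(state, length):
    """LVL: set the number of elements that later vector commands operate on."""
    if not 0 <= length <= REGISTER_LENGTH:
        raise ValueError("vector length out of range")
    state["VL"] = length


def vadd(state, dst, src_a, src_b):
    """VADD: add the first VL elements of two registers element by element."""
    regs = state["V"]
    for i in range(state["VL"]):
        regs[dst][i] = regs[src_a][i] + regs[src_b][i]
```

Calling lvl(state, 128) followed by vadd(state, 7, 5, 4) corresponds to the two commands above: only the first 128 element areas of V7 are written, and the remaining elements are left unchanged.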
As described above, the vector scheme does not require synchronization between processes, so that the parallel processing can be realized as an extension of a single CPU. The vector scheme is a time-proven parallelizing technique, and automatic vectorizing compiler technology has matured to the product level.
Even if the vector unit, which executes the vector commands, is realized so as to extend over plural asynchronous operation units, the CPU is synchronized by hardware, like the vector computer. Thus, a programmer can easily realize the high-speed operations by the parallel processing.
In order to realize an effective performance gain, the vector scheme requires a bandwidth between the CPU and the memory commensurate with that performance. However, the number of pins per LSI that can be connected to a memory is limited. One approach to this problem is to split the vector unit across plural LSIs, thus curbing to some extent the increase in the number of pins of the CPU.
However, even in this approach, when the operation clock reaches several hundred MHz, an increase in the clock skew makes it difficult to operate the split LSIs in sync with the clocks.
The conventional information processing systems described above have the following problems with the VSC (vector scan) command when the vector unit is realized over plural asynchronous operation units.
The VSC command is very important in vector-type computers. The specification of the VSC command will now be explained by referring to FIG. 6. The content of each element of the vector register Vz specified in the z field is stored in the memory at the address given by the content of the corresponding element of the vector register Vy specified in the y field.
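The store behavior described for the VSC command can be sketched as follows. This is a minimal model that assumes the memory is a Python dictionary keyed by address and that the elements are processed in order; the function name vsc and its parameters are hypothetical.

```python
def vsc(memory, vy, vz, vl):
    """Store element i of Vz at the memory address held in element i of Vy.

    memory: dict mapping address -> data (an assumption of this sketch)
    vy, vz: sequences holding the address elements and the data elements
    vl:     number of elements to process (the vector length)
    """
    for i in range(vl):
        memory[vy[i]] = vz[i]
    return memory
```

Because each element carries its own address, the stores may land in unrelated cache lines, which is why the coherency handling described next matters for this command.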
First, the command processing of a vector-type CPU will be briefly explained here. In order to transfer a large volume of data at a time, a vector command accesses the memory directly, without accessing a cache memory (hereinafter referred to as a cache). However, a conventional program contains a mixture of scalar operations and vector operations. The scalar operation is performed using the data in the cache.
The coherency between the cache and the memory is maintained in consideration of the case where common data is used in both the scalar operation and the vector operation. In the case of the scalar store command, the store data is simultaneously written into the cache and the memory. In the case of the vector store command, while data is being directly written (or stored) into the memory, the addresses held in the cache are compared with the vector store address. When an address held in the cache matches the vector store address, the corresponding cache line is invalidated (or flashed).
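The coherency rule above can be illustrated with a toy model. It is only a sketch under stated assumptions: the class ToySystem and its methods are hypothetical, the cache is modeled as a dictionary of lines with a 256 B line size, capacity limits are ignored, and a scalar store updates the cache only when the line is already present.

```python
LINE_BYTES = 256  # line size assumed for this sketch


class ToySystem:
    """Toy memory plus cache; capacity and associativity are not modeled."""

    def __init__(self):
        self.memory = {}  # address -> data
        self.cache = {}   # line number -> {address: data}

    def scalar_load(self, addr):
        line = addr // LINE_BYTES
        if line not in self.cache:  # cache miss: fetch the whole line
            base = line * LINE_BYTES
            self.cache[line] = {
                a: self.memory.get(a, 0) for a in range(base, base + LINE_BYTES)
            }
        return self.cache[line][addr]

    def scalar_store(self, addr, value):
        # Scalar store: data is written to the memory and to the cache.
        self.memory[addr] = value
        line = addr // LINE_BYTES
        if line in self.cache:  # assumption: no allocation on a store miss
            self.cache[line][addr] = value

    def vector_store(self, addrs, values):
        # Vector store: data goes directly to the memory, and any cached
        # line holding a matching address is invalidated (flashed).
        for addr, value in zip(addrs, values):
            self.memory[addr] = value
            self.cache.pop(addr // LINE_BYTES, None)
```

After a vector store to an address whose line is cached, the next scalar load misses and fetches the new data from the memory, so the scalar operation never observes stale data.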
Here, the configuration of the cache required in the present invention will be explained by referring to FIG. 8. Now, it is assumed that an address has 40 bits and that the cache is a 64 KB, two-way set associative cache with a line size of 256 B. In the case of the cache shown in FIG. 7, the interblock address is 8 bits (because 256 B = 2^8 B when the minimum access unit is 1 B). The index address (the line address of the cache) is 7 bits (because the cache has a two-way, 64 KB configuration, so that 2^7 = 128 = (32 KB per way)/(256 B per line)). The tag address is 25 bits (= 40−8−7).
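The bit widths stated above follow directly from the assumed cache parameters, as the following sketch shows. The constant names and the helper split_address are introduced only for illustration.

```python
ADDRESS_BITS = 40
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 256

# 256 B per line -> 8 interblock address bits
OFFSET_BITS = (LINE_BYTES - 1).bit_length()
# (32 KB per way) / (256 B per line) = 128 lines -> 7 index address bits
LINES_PER_WAY = CACHE_BYTES // WAYS // LINE_BYTES
INDEX_BITS = (LINES_PER_WAY - 1).bit_length()
# Remaining upper bits form the tag address: 40 - 8 - 7 = 25
TAG_BITS = ADDRESS_BITS - OFFSET_BITS - INDEX_BITS


def split_address(addr):
    """Split a 40-bit address into (tag, index, interblock) fields."""
    interblock = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (LINES_PER_WAY - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, interblock
```

For example, split_address((0x123 << 15) | (0x45 << 8) | 0x67) separates the address into the tag 0x123, the index 0x45, and the interblock address 0x67.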
Referring to FIG. 8, a cache is formed of an address array (hereinafter referred to as AA) and a data array (hereinafter referred to as DA). The AA stores tag addresses. The DA stores data. Each line has a VALID bit (hereinafter referred to as a V bit) indicating whether the line is valid. In this case, the cache line is valid with V bit=1 but is invalid with V bit=0.
Next, the cache flash process of the VSC command will be explained by referring to FIG. 9. The cache flash is the process of invalidating the corresponding cache line by a memory direct-accessing command, such as a vector store command, when a mismatch occurs between data of a cache and data in a memory.
Because the cache flash is performed per cache line, it is unnecessary to subject the interblock address (the lower 8 bits of the 40 bits) to comparison, as shown in FIG. 7. Only the upper 32 bits of the store address of the VSC command need be subjected to comparison (hereinafter, the store address used for a flash process of the vector command is referred to as a flash address).
The lower 7 bits of the flash address of the VSC command are compared with the 7-bit index address of the flash address array (a copy of the address array used in the flash process, hereinafter referred to as FAA). The 25-bit tag address and the V bit of the corresponding line are output from the FAA.
The 25-bit tag address output from the FAA is compared with the upper 25 bits of the flash address of the VSC command. As a result, when all bits match and the V bit is 1, the V bit of the corresponding index address of the AA is set to 0 and the cache line is flashed.
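The flash process just described can be sketched in software as follows. The arrays are modeled as Python lists, the names make_array and flash are hypothetical, and clearing the V bit in the FAA as well as in the AA is an assumption made so that the copy stays consistent.

```python
LINES = 128  # 7-bit index address
WAYS = 2


def make_array():
    """Build an address array: WAYS x LINES entries of tag address and V bit."""
    return [[{"tag": 0, "v": 0} for _ in range(LINES)] for _ in range(WAYS)]


def flash(aa, faa, flash_addr):
    """Invalidate the cache line matching a 32-bit flash address, if any.

    flash_addr holds the 25-bit tag address in its upper bits and the
    7-bit index address in its lower bits.
    """
    index = flash_addr & 0x7F
    tag = (flash_addr >> 7) & 0x1FFFFFF
    flashed = False
    for way in range(WAYS):
        entry = faa[way][index]          # tag address and V bit read from FAA
        if entry["v"] == 1 and entry["tag"] == tag:
            aa[way][index]["v"] = 0      # flash the line in the AA
            faa[way][index]["v"] = 0     # assumption: the FAA copy follows
            flashed = True
    return flashed
```

The function returns True when a line was flashed, which makes it easy to see that a second flash of the same address finds nothing left to invalidate.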
Next, the cache updating process (updating process) at a cache miss time will be explained by referring to FIG. 10. When a cache miss occurs, the content of the cache has to be updated. Let us now consider the updating of the AA. In a cache configuration with two or more ways, the way to be updated is determined by a cache push-out algorithm (e.g. LRU (Least Recently Used)). The two ways are represented with 2 bits.
The lower 8 bits of the address of 40 bits of the load command in a cache-miss state are an interblock address and hence become unnecessary. The next lower 7 bits correspond to an index address of a cache line to be updated and a tag address of 25 bits is written to a corresponding line.
The updated cache line is made valid by setting V bit=1. Thus, the updating process is completed. Note that, because the FAA is a copy of the AA, the FAA has to be updated at the same time as the AA.
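The updating process at a cache miss time can be sketched as follows. The names make_array and update_on_miss are hypothetical, and the way to be updated is passed in as a parameter, so the push-out algorithm itself (e.g. LRU) is outside the scope of this sketch.

```python
LINES = 128  # 7-bit index address
WAYS = 2


def make_array():
    """Build an address array: WAYS x LINES entries of tag address and V bit."""
    return [[{"tag": 0, "v": 0} for _ in range(LINES)] for _ in range(WAYS)]


def update_on_miss(aa, faa, load_addr, way):
    """Register the line of a missed 40-bit load address in the AA and the FAA."""
    # The lower 8 bits are the interblock address and are not needed.
    index = (load_addr >> 8) & 0x7F        # next lower 7 bits: index address
    tag = (load_addr >> 15) & 0x1FFFFFF    # upper 25 bits: tag address
    for array in (aa, faa):                # the FAA is updated together with the AA
        array[way][index] = {"tag": tag, "v": 1}
    return index, tag
```

Writing the same tag address and V bit into both arrays in one step reflects the requirement that the FAA remain an exact copy of the AA.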
In consideration of such things, the conventional technique where a vector unit is realized in plural asynchronous operation units will be explained below by referring to (b) part of FIG. 2 and FIG. 5. As shown in FIG. 5, CPU 5 is formed of a master unit 6 and a slave unit 7. In this configuration, the cache has to be shared to maintain the coherency of the cache and simplify the control.
In the case of a shared cache, only the address array 65 in the master unit 6 is used. In such a configuration, when the VSC command is executed, the store data Vz is written into the memory unit 8 via the signal line 500 connected to the vector unit 61 in the master unit 6 and via the signal line 600 connected to the vector unit 71 in the slave unit 7 (S31 in (b) part of FIG. 2).
At the same time, the vector unit 61 outputs the flash address Vy to FAA 64 using the signal line 501 (S32 in (b) part of FIG. 2). The vector unit 71 outputs the flash address to FAA 64, using the signal line 601 (S33 in (b) part of FIG. 2). When the flash address has been output to the last vector element, the vector unit 61 outputs the END signal to the cache control section 63 using the signal line 502. The vector unit 71 outputs the END signal to the cache control section 63 using the signal line 602.
The FAA 64 in the master unit 6 compares the address stored in the cache with the flash address (S34 in (b) part of FIG. 2). When both the addresses match each other, a coincidence address is sent to the AA 65 (S35 in (b) part of FIG. 2).
Finally, the corresponding address of the AA 65 is flashed. When the cache control section 63 has received the END signals from both the master unit 6 and the slave unit 7, the flash process ends (S36 in (b) part of FIG. 2).
It is assumed that both the master unit 6 and the slave unit 7 can process 256 vector elements. The signal line 601 interconnects the vector unit 71 and the FAA 64. When the flash address is output at a rate of one element per 1T (one clock), signal lines corresponding to 32 bits (=an index address of 7 bits+a tag address of 25 bits) are required.
In this case, if the number of vector elements is 256, outputting all of the flash addresses takes a time period of 256 T, which is an impractical processing rate. Until the flash address and the cache address of the VSC command have been completely compared, the cache flash process of the next command and the updating process of the cache occurring at a cache miss time cannot be performed. Consequently, a high-rate address comparing process is essential to improve the performance.
To process n vector elements per 1T at a high rate, signal lines 601 of (32 bits×n) are required. This leads to a sharp increase in the number of pins of an LSI. This problem spoils the advantage that splitting the vector unit into plural LSIs acts as a considerable brake on the increasing number of pins of the CPU.
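The trade-off between the number of signal lines and the transfer time can be checked with a short calculation. The function name flash_transfer_cost is hypothetical, and the sketch assumes that every flash address occupies 32 signal lines for one clock.

```python
import math

FLASH_ADDRESS_BITS = 32  # 7-bit index address + 25-bit tag address


def flash_transfer_cost(vector_elements, elements_per_clock):
    """Return (signal lines required, clocks needed) for the flash addresses."""
    signal_lines = FLASH_ADDRESS_BITS * elements_per_clock
    clocks = math.ceil(vector_elements / elements_per_clock)
    return signal_lines, clocks
```

With 256 vector elements, one element per clock needs 32 signal lines and 256 T, whereas eight elements per clock reduces the time to 32 T but requires 256 signal lines on the signal line 601 alone.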