1. Technical Field
The present invention relates to multi-core processors in general, and more particularly, to a method for performing a matrix calculation using a multi-core processor.
2. Description of Related Art
In many scientific and engineering computations, such as fluid analysis or structural analysis, the product of a sparse matrix and a vector is calculated frequently. For example, in a conjugate gradient method, the solution of a system of linear equations in several variables is calculated by an iterative solver, in which the product of a sparse matrix and a vector is calculated repeatedly until the solution converges. Several techniques have been proposed for performing such calculations efficiently. For example, when the vector data cannot all be stored in a cache memory of a processor, the occurrence of cache misses is reduced by reordering rows or columns in advance so that the parts of the vector data stored in the cache memory can be accessed contiguously. Moreover, read-out time is saved by reading out four elements of a sparse matrix at once as a block of two elements by two elements, instead of reading out the non-zero elements of the sparse matrix one by one from a memory or a hard disk drive.
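By way of illustration only, the iterative calculation described above may be sketched as follows. The sketch assumes that the sparse matrix is symmetric and positive definite and is stored in compressed sparse row (CSR) form. The storage format, the function names (csr_matvec, dot, conjugate_gradient) and the tolerance values are assumptions introduced for this illustration and are not taken from any particular prior-art implementation.

```python
from math import sqrt


def csr_matvec(values, col_idx, row_ptr, x):
    """Return y = A * x for a sparse matrix A held in CSR form (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Only the stored (non-zero) elements of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A; return (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the initial guess x = 0
    p = list(r)                      # initial search direction
    rs_old = dot(r, r)
    if sqrt(rs_old) < tol:           # b is (numerically) zero: x = 0 already solves it
        return x, 0
    for it in range(1, max_iter + 1):
        # One sparse matrix-vector product is performed per iteration.
        ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if sqrt(rs_new) < tol:       # converged: residual norm is below tolerance
            return x, it
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

In this sketch the sparse matrix-vector product inside the loop is the only operation that touches the matrix, which is why the efficiency of that product tends to dominate the cost of the solver as a whole.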
In recent years, multi-core processors have been widely employed. A multi-core processor includes multiple processor elements formed on a single processor chip. Each of the processor elements operates independently of, and in parallel with, the other processor elements. In addition, a multi-core processor includes cache memories within the respective processor elements. Since the processor elements can access the cache memories quickly, an arithmetic operation can also be processed quickly if the data to be used by the processor elements during the same period of time can be stored in the cache memories in advance. However, it is not easy for the multi-core processor to maintain data consistency between a cache memory and a system memory, or between one cache memory and another. Because this process requires a large amount of hardware resources, the structure of the multi-core processor often becomes complicated.
For this reason, in an attempt to simplify the structure of the processor by having a software program manage data consistency, a multi-core processor has been developed from which the function of maintaining data consistency is removed. For example, a Cell processor (Cell Broadband Engine™) includes, within the processor, a local memory that has no function for maintaining data consistency, instead of a cache memory. The hardware structure inside the processor can thereby be simplified, so that other functions can be provided to the processor and its operation speed can be improved. When data consistency must be maintained, however, control by a software program is necessary. In addition, since a processor having such a structure is a completely new kind of processor, previously studied software techniques cannot be applied to it without modification. The same applies to matrix operations.
For example, since a Cell processor does not include a cache memory, the technique for reducing the occurrence of cache misses cannot be applied to a Cell processor without modification. In addition, comparing a Cell processor to a conventional multiprocessor parallel computer system, one might think that conventional techniques for parallel computer systems can be applied to a Cell processor. Such an idea is invalid, however. First, since the amount of data that can be stored in a local memory is extremely small compared with that of a memory of a parallel computer system, the content of the local memory may have to be updated frequently when the same techniques are applied to a Cell processor. Second, while a Cell processor achieves an extremely fast communication speed between local memories in comparison with the speed of accessing a system memory, a parallel computer system achieves a communication speed between nodes that is approximately the same as, or even slower than, that of accessing a system memory. For this reason, although reducing the amount of communication between nodes is advantageous in a parallel computer system, in a Cell processor it is preferable to make active use of communication.
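By way of illustration only, the effect of a small, software-managed local memory may be sketched as follows. The sketch is a simplified model and does not use any actual Cell processor programming interface: ordinary Python lists stand in for the system memory and the local memory, slicing stands in for a software-issued transfer, and the function name staged_dot and the parameter local_capacity are hypothetical names introduced for this illustration.

```python
def staged_dot(a, b, local_capacity):
    """Dot product computed through a buffer of at most local_capacity values.

    Returns (result, transfers), where transfers counts how many times the
    modeled local memory had to be refilled from the modeled system memory.
    """
    if local_capacity < 2:
        raise ValueError("local memory must hold at least one element per operand")
    chunk = local_capacity // 2      # half of the buffer is given to each operand
    total = 0.0
    transfers = 0
    for start in range(0, len(a), chunk):
        # Explicit copies model transfers that software, not hardware, must issue.
        local_a = a[start:start + chunk]
        local_b = b[start:start + chunk]
        transfers += 1
        # Arithmetic touches only the data currently held in the local buffers.
        total += sum(x * y for x, y in zip(local_a, local_b))
    return total, transfers
```

Under these assumptions, two vectors of ten elements each require five refills when local_capacity is 4 and a single refill when local_capacity is 20, which illustrates how a smaller local memory leads to more frequent updates of its content.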