While the processing performance of a single processor (information processing apparatus) improves year by year, the rate of improvement is becoming saturated. Multi-core (many-core) technology, which makes a plurality of cores carry out a desired process in parallel, is one of the technologies for achieving further performance improvement. With multi-core technology, raising the efficiency of parallel processing is important for improving the processing performance.
Numerical simulation is one of the fields in which users demand high processing performance from a processor. For example, in structural analysis, a user expresses a simulation object (for example, a building) as a partial differential equation and then simulates the object based on that equation. To carry out the simulation on an information processing apparatus, the partial differential equation must be discretized. For example, with the finite element method, the partial differential equation is converted into simultaneous linear equations with a large-scale sparse coefficient matrix.
In this case, non-zero elements appear at random positions in the coefficient matrix. Multiplying a zero element by a variable contributes nothing to the result and is therefore a wasted calculation. A calculation method that uses a list vector reduces the number of such wasted calculations by accessing only the non-zero elements. The list vector is an array storing only non-zero elements of the coefficient matrix.
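The saving can be illustrated with a minimal sketch. The row values, the vector x, and the names cols and vals are hypothetical and chosen only for illustration; the sketch assumes the positions of the non-zero elements are kept alongside their values.

```python
# Hypothetical single row of a sparse coefficient matrix and a vector of variables.
row = [0.0, 2.0, 0.0, 0.0, 5.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Dense product: every element is multiplied, including the zeros.
dense_result = sum(a * b for a, b in zip(row, x))
dense_multiplications = len(row)

# Compressed form: keep only the non-zero values and their positions (assumed layout).
cols = [j for j, a in enumerate(row) if a != 0.0]
vals = [row[j] for j in cols]

# Only the non-zero elements are accessed.
sparse_result = sum(v * x[j] for v, j in zip(vals, cols))
sparse_multiplications = len(vals)
```

Both forms give the same result, but the compressed form performs two multiplications here instead of six.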
An example of a program that uses the list vector is shown as program 1.

  Do i=1, K×N
    S(A(i))=S(A(i))+X(i)

(here, K and N are positive integers, S is an array, A is a list vector, and X is a variable.) (program 1)
It is assumed in program 1 that the array S has M elements. In this case, 1≦A(i)≦M (here, 1≦i≦K×N; hereinafter, the symbol ‘×’ denotes multiplication and the symbol ‘/’ denotes division).
Program 1 refers to and updates the values of the elements of the array S that are specified by the first through (K×N)-th elements of the list vector A.
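As a rough illustration, the loop of program 1 might be written as follows. This is a sketch, not the original program: the function name program1 and the sample data are assumptions, and the list vector A is assumed to hold 1-based indices as in the text, so each access is shifted by one for Python's 0-based lists.

```python
def program1(S, A, X):
    """Apply S(A(i)) = S(A(i)) + X(i) for every element i of the list vector A.

    S: array with M elements
    A: list vector of 1-based indices, each with 1 <= A(i) <= M
    X: values to add, same length as A
    """
    for i in range(len(A)):
        # A[i] is 1-based, so subtract 1 to address the Python list.
        S[A[i] - 1] = S[A[i] - 1] + X[i]
    return S


# Hypothetical data with M = 4 and K×N = 4.
result = program1([0, 0, 0, 0], [1, 3, 3, 2], [10, 20, 30, 40])
```

In this example the third element of S is named twice by A and therefore receives both 20 and 30.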
An information processing apparatus copies values of the array S from a main storage apparatus to a register, and then stores the values held in the register back to the main storage apparatus, in accordance with the list vector A. The function of a scatter instruction is to copy the values in the register to the main storage apparatus in accordance with the list vector. The function of a gather instruction is to read values of the array S from the main storage apparatus and to write them to the register in accordance with the list vector.
That is, the function of the gather instruction is to copy the value of the A(i)-th (here, 1≦i≦K×N) element of the array S stored in the main storage apparatus to the register (the process related to S(A(i)) on the right-hand side of program 1). The function of the scatter instruction is to copy the value of the A(i)-th (here, 1≦i≦K×N) element of the array S in the register to the main storage apparatus (the process related to S(A(i)) on the left-hand side of program 1).
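The two functions can be sketched as follows. The names gather and scatter, the Python list standing in for the register, and the sample data are all assumptions made for illustration; they do not correspond to any particular instruction set.

```python
def gather(S, A):
    """Read S(A(i)) from main storage into a register (modeled as a list)."""
    return [S[a - 1] for a in A]  # A holds 1-based indices


def scatter(S, A, register):
    """Write register values back to S(A(i)) in main storage."""
    for a, value in zip(A, register):
        S[a - 1] = value


# Hypothetical data. The indices in A are distinct here: if the same index
# appeared twice, a whole-vector gather, add, scatter would keep only one update.
S = [1, 2, 3, 4]
A = [4, 2]
X = [10, 20]

register = gather(S, A)                        # right-hand side S(A(i))
register = [r + x for r, x in zip(register, X)]
scatter(S, A, register)                        # left-hand side S(A(i))
```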
When program 1 is executed in parallel, for example, the k-th (here, 1≦k≦K) core included in an information processing apparatus operates on the (N×(k−1)+1)-th through (N×k)-th elements. In this way, each core processes the part of the scatter instruction and the gather instruction allocated to it.
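The allocation of elements to cores can be sketched as below. The function names core_range and run_core and the sample data are hypothetical, and the cores are simulated one after another in a plain loop; a real parallel execution would additionally have to deal with several cores addressing the same element of S.

```python
def core_range(k, N):
    """Return the 1-based first and last element handled by the k-th core."""
    return N * (k - 1) + 1, N * k


def run_core(S, A, X, k, N):
    """Process the part of program 1 allocated to the k-th core."""
    first, last = core_range(k, N)
    for i in range(first, last + 1):
        # i and A(i) are 1-based; shift by one for Python lists.
        S[A[i - 1] - 1] += X[i - 1]


# Hypothetical data: K = 2 cores, N = 2 elements per core, M = 4.
K, N = 2, 2
S = [0, 0, 0, 0]
A = [1, 3, 3, 2]
X = [10, 20, 30, 40]

# Cores are simulated sequentially here, not actually run in parallel.
for k in range(1, K + 1):
    run_core(S, A, X, k, N)
```

With these values, core 1 handles elements 1 and 2 and core 2 handles elements 3 and 4; both cores update S(3).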
Patent documents 1 to 3 disclose technologies for parallel programming.
A compiler disclosed in patent document 1 compiles a source program including a list vector into a parallelized object program for a distributed memory processor system. The compiler inserts into the object program a preprocessing instruction for collecting information on the list vector referred to by each processor. Based on the information collected by the preprocessing instruction, the compiler inserts communication operations for the parallelization into the object program.
Patent document 2 discloses a method that enables a parallel computer to carry out a process such as LU decomposition repeatedly in a short time. LU decomposition is a method for solving simultaneous linear equations with a dense coefficient matrix.
A compiler disclosed in patent document 3 compiles a source program including a list vector into a parallelized object program in accordance with a domain decomposition technique selected by a user.
Patent document 1: Japanese Patent Application Laid-Open No. 1991-203256
Patent document 2: Japanese Patent Application Laid-Open No. 1996-227405
Patent document 3: Japanese Patent Application Laid-Open No. 1995-044508