1. Field of the Invention
The present invention relates to an information processing apparatus, and more particularly to a vector information processing apparatus for parallel processing of information on a hardware basis.
2. Description of the Related Art
In recent years, information processing apparatuses have suffered greater signal delays caused by transmission lines as their operating frequencies have risen. In such apparatuses, it is very difficult to operate a plurality of semiconductor integrated circuits (CPUs, LSI circuits, etc.) with clock signals that are kept in phase with each other.
One proposed solution to the above problem is a method of synchronizing, on a software basis, the processes carried out by a plurality of CPUs that operate with asynchronous clock signals. For example, there is known a method of dispatching a plurality of designated processes to CPUs that operate under different OSs (Operating Systems) using a hardware function referred to as a barrier synchronization/communication register. Since this method is based on the premise that the plural processes operate at entirely different timings, operation failures on account of the hardware function do not occur even if the clock signals of the respective CPUs are out of synchronism with each other. The method has been implemented, for example, in products called scalar parallel computers.
The above method, which synchronizes a plurality of processes on a software basis, has an increased apparent performance vs. cost ratio because it can be achieved much more inexpensively than attempts to improve hardware performance such as CPU operating speeds and data transfer speeds between CPUs and memories.
However, the above method is problematic in that it is highly difficult to parallelize programs. The difficulty arises from the fact that instructions used in programs have a wide variety of different limitations on parallelization. Even if programs can be parallelized, debugging them is much more difficult than debugging programs that are not parallelized. The debugging process is generally carried out when performance tuning is effected on the information processing apparatus, and requires a high level of skill in parallel processing technology. Inasmuch as the difficult debugging process needs to be carried out each time hardware improvements are introduced, vast program resources are made useless. Even if technical goals for parallelizing programs are accomplished, another problem is encountered in that sufficient human resources are not available for operating the programs on site.
The above problems may be solved by the parallel processing of information on a hardware basis. One specific example of such a solution is known as a vector information processing apparatus.
A vector process is a process (Single Instruction Multiple Data stream: SIMD) for simultaneously processing a plurality of regularly arranged data. A register which stores such a plurality of regularly arranged data is referred to as a vector register, and instructions for performing the same operation on, effecting memory access to, and transferring, all the elements stored in the vector register are referred to as vector instructions.
A vector instruction is described, for example, as:
LVL VL <- 128
VADD V7 <- V5 + V4
In this example, the number of elements to be processed (128 elements) is stored in a VL (vector length register) using an LVL (Load VL) instruction, after which the corresponding elements (128 elements) in vector registers V5 and V4 are added using a VADD (vector addition) instruction, and the resultant sums are stored in a vector register V7.
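The effect of the two instructions above may be illustrated by the following minimal sketch, which models the vector registers as Python lists. The function name vadd and the sample register contents are hypothetical and are given for illustration only; the sketch does not represent the actual hardware implementation.

```python
def vadd(v5, v4, vl):
    """Model of VADD V7 <- V5 + V4: add the first vl elements pairwise."""
    if vl > len(v5) or vl > len(v4):
        raise ValueError("vector length exceeds register size")
    # One vector instruction applies the same operation to every element.
    return [v5[i] + v4[i] for i in range(vl)]


VL = 128                # models LVL VL <- 128
V5 = list(range(VL))    # hypothetical contents: 0, 1, ..., 127
V4 = [10] * VL          # hypothetical contents: 10 in every element
V7 = vadd(V5, V4, VL)   # models VADD V7 <- V5 + V4
```

In this model, the element i of V7 holds the sum of the element i of V5 and the element i of V4 for each of the 128 elements.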
According to the vector process, since software-based synchronization between processes is not required, software can be written in the same manner as for a single CPU. The vector process has actually been used effectively as a parallelizing process, and a compiler for parallelization already exists.
For improving performance with the vector process, however, a bandwidth (data transfer speed) commensurate with the performance to be improved needs to be maintained between a CPU and a memory. If the CPU comprises a plurality of vector units for executing vector instructions and the vector units are operated in parallel with each other, then processing operations can be performed at a higher speed.
The vector information processing apparatus which has a CPU comprising a plurality of vector units suffers problems to be described below when a VSC (vector scatter) instruction is executed.
The VSC instruction is a very important instruction in the vector information processing apparatus. Specifications of the VSC instruction will be described below with reference to FIG. 1 of the accompanying drawings.
As shown in FIG. 1, the VSC instruction is an instruction which uses elements in a vector register Vy designated by a Y field as memory addresses, and stores the corresponding elements in a vector register Vz designated by a Z field at those addresses in a memory. In FIG. 1, an OPC field is an operation code indicative of a VSC code, and an X field is an invalid area which is not used.
In a process according to a VSC instruction, elements are successively written into a memory in the sequence of element numbers. For storing a plurality of elements at the same address, in particular, priority has to be given to the writing of an element having a larger element number. For example, when an element n and an element n+1 are to be stored at the same memory address, it is necessary to give priority to the writing of the element n+1 and invalidate the element n. If the process is carried out by a single unit or a plurality of units that operate in synchronism with each other as is conventional, the above limitation is not required to be taken into account since writing requests are issued in the sequence of element numbers from one port.
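The writing sequence described above may be illustrated by the following minimal sketch of a scatter operation performed from a single port. It is assumed here, for illustration, that each element of Vy is used as a memory address and the memory is modeled as a Python list; the function name vsc and the sample values are hypothetical.

```python
def vsc(memory, vy, vz, vl):
    """Model of a VSC instruction issued from one port.

    Elements are written in the sequence of element numbers, so when two
    elements share an address, the larger element number is written last
    and its value is the one that remains in memory.
    """
    for i in range(vl):
        memory[vy[i]] = vz[i]
    return memory


memory = [0] * 8
vy = [3, 5, 5, 1]        # elements 1 and 2 designate the same address 5
vz = [30, 50, 51, 10]
vsc(memory, vy, vz, 4)   # address 5 finally holds 51 (element 2), not 50
```

Because the requests are processed strictly in the sequence of element numbers, the value of element 2 remains at address 5 without any special handling.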
In the vector information processing apparatus where the CPU comprises a plurality of asynchronously operating units, since the sequence of processing based on element writing requests (hereinafter referred to as element requests) issued from the units is not guaranteed, the sequence of writing requests in the memory may be reversed.
For example, as shown in FIG. 2 of the accompanying drawings, assume that a CPU comprises a master unit and a slave unit which are asynchronously operating units, and that element requests of adjacent element numbers (an element n and an element n+1) are distributed to and issued from the master and slave units. If the element requests are requests for storing the element n and the element n+1 at the same memory address, then a memory controller for controlling the writing of elements in the memory may possibly process the element n+1 prior to the element n. If the element n+1 is written prior to the element n, then the subsequent writing of the element n overwrites the element n+1, and the element n, which should have been invalidated, remains in the memory.
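The reversal described above may be illustrated by the following minimal sketch, in which a memory controller is modeled as simply applying element requests in the sequence in which they arrive. The function name apply_requests, the request tuple layout, and the element numbers and address used are hypothetical and serve only to illustrate the problem.

```python
def apply_requests(memory_size, requests):
    """Apply element requests to memory strictly in arrival order.

    Each request is a tuple (element_number, address, value). The model
    ignores element numbers, as a controller that cannot see the issue
    sequence across asynchronous units would.
    """
    memory = [None] * memory_size
    for _element_number, address, value in requests:
        memory[address] = value
    return memory


req_n = (4, 2, "element n")       # issued from the master unit (assumed)
req_n1 = (5, 2, "element n+1")    # issued from the slave unit (assumed)

in_order = apply_requests(4, [req_n, req_n1])    # expected arrival order
reversed_ = apply_requests(4, [req_n1, req_n])   # slave request arrives first
```

In this model, in_order[2] holds the element n+1 as required, whereas reversed_[2] holds the element n, which corresponds to the erroneous result described above.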
The above problem may be solved by synchronizing element requests issued from a plurality of asynchronously operating units. This solution, however, requires increased overhead for synchronizing element requests, and results in increased intervals at which the element requests are issued. These drawbacks cancel out the advantages provided by a high-speed processing apparatus based on parallel operation of the master and slave units.