1. Field of the Invention
The present invention relates to a parallel computer architecture able to implement an SIMD (Single Instruction Stream, Multiple Data Stream) architecture, and more specifically relates to a computer architecture that is able to perform general-purpose parallel processing by means of appropriate and high-speed memory control.
2. Description of the Prior Art
Now that computers have been introduced into every aspect of society and the Internet and other networks have become pervasive, data is being accumulated on a large scale. Vast amounts of computing power are required in order to process data on such large scales, so attempts to introduce parallel processing are natural.
Now, parallel processing architectures are divided into “shared memory” types and “distributed memory” types. The former (“shared memory” types) are architectures wherein a plurality of processors shares a single enormous memory space. In this architecture, traffic between the group of processors and the shared memory becomes a bottleneck, so it is not easy to construct practical systems that use more than 100 processors. Accordingly, when calculating the square roots of 1 billion floating-point numbers, for example, processing can be performed at no more than 100 times the speed of a single CPU. Empirically, the upper limit is found to be roughly 30 times.
In the latter (“distributed memory” types), each processor has its own local memory and these are linked to construct a system. With this architecture, it is possible to design a hardware system that incorporates even several hundred to tens of thousands of processors. Accordingly, when calculating the aforementioned square roots of 1 billion floating-point numbers, processing can be performed at several hundred to tens of thousands of times the speed of a single CPU. However, the latter type also has several problems, as will be described later.
The present invention pertains to the “distributed memory” type, so we shall make comparisons with the prior art while first adding some description of this architecture.
[Problem 1: Division of Management of Large Arrays]
The first problem with “distributed memory” type architectures is the problem of the division of management of data.
Huge amounts of data (typically consisting of arrays, so hereinafter described in terms of arrays) cannot be stored in the local memory belonging to a single processor, so they must necessarily be divided among, and managed across, a plurality of local memories. Unless an effective and flexible mechanism for this division of management is introduced, various obstacles will evidently arise in the development and execution of programs.
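As an illustration only, the division of one large array among a plurality of local memories can be sketched as follows. The sketch assumes simple block partitioning, which is merely one conventional scheme and is not asserted to be the scheme of the present invention; the function name `locate` and its parameters are hypothetical.

```python
def locate(global_index: int, array_length: int, num_processors: int) -> tuple[int, int]:
    """Map a global array index to (processor id, local offset).

    Assumption: block partitioning, i.e. each processor holds one
    contiguous block of ceil(array_length / num_processors) elements.
    """
    if not 0 <= global_index < array_length:
        raise IndexError("global index out of range")
    block_size = -(-array_length // num_processors)  # ceiling division
    return global_index // block_size, global_index % block_size


if __name__ == "__main__":
    # A 10-element array divided among 4 processors (block size 3).
    for i in range(10):
        print(i, locate(i, 10, 4))
```

Even in this simple scheme, every access must translate a global index into a processor and a local offset, and any insertion or deletion shifts elements across block boundaries, which suggests why the mechanism for division of management matters.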
[Problem 2: Poor Efficiency of Interprocessor Communication]
When the various processors in a distributed memory type system are to access huge arrays, each processor can quickly access the array elements in its own local memory, but interprocessor communication becomes vital for accessing array elements belonging to other processors. This interprocessor communication has extremely low performance in comparison to communication with local memory, being said to require a minimum of 100 clock cycles. For this reason, performance is severely degraded when sorting is performed, because lookups range over the entire scope of a huge array and interprocessor communication therefore occurs frequently.
Here follows a detailed description of this problem. As of the year 1999, personal computers use between one and several CPUs in a “shared memory” type architecture. The standard CPU used in these personal computers operates with an internal clock speed roughly 5–6 times that of the memory bus and is equipped with automatic internal parallel execution functions and pipeline processing functions, so that one piece of data can be processed in roughly one clock cycle of the memory bus.
When a sort process is performed on a huge array in a “shared memory” type personal computer, one clock cycle (memory bus) is required for one piece of data, so such a computer is thought to achieve 100 times the performance of a “distributed memory” type multiprocessor system that requires 100 clock cycles (memory bus) for one piece of data.
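The comparison above can be expressed as a small cost model. The figures of 1 clock cycle for a local access and 100 clock cycles for an interprocessor access are taken from the description above; the weighted-average model and the name `average_cycles_per_access` are assumptions introduced purely for illustration.

```python
def average_cycles_per_access(remote_fraction: float,
                              local_cycles: float = 1.0,
                              remote_cycles: float = 100.0) -> float:
    """Average memory-bus cycles per data item.

    Assumption: each access is either local (local_cycles) or requires
    interprocessor communication (remote_cycles), in the given proportion.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie between 0 and 1")
    return (1.0 - remote_fraction) * local_cycles + remote_fraction * remote_cycles


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, average_cycles_per_access(fraction))
```

Under this model, a sort whose lookups almost always reach other processors approaches the full 100-fold penalty per data item.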
[Problem 3: Supply of Programs]
The third problem with the “distributed memory” type architecture is the problem of how programs are to be supplied to the plurality of processors.
In an architecture wherein programs are loaded separately to an extremely large number of processors and the whole is operated cooperatively (MIMD: Multiple Instruction Stream, Multiple Data Stream), the creating, compiling and distributing of programs poses a major burden.
On the other hand, in an architecture wherein many processors are operated with the same program (SIMD: Single Instruction Stream, Multiple Data Stream), the degree of freedom in programming is reduced, so situations in which programs that give the desired results cannot be developed are also conceivable.
The present invention provides a method and computer architecture for solving Problems 1 through 3 of the “distributed memory” type described above. Problem 1, the division of management of large arrays, can be solved by a method of divided management in which the layout (physical addresses) of the various elements within the array is uniform across the various processor modules. By means of this technique, the need for garbage collection is eliminated, the insertion or deletion of array elements is completed in several clock cycles, and the implicit (non-explicit) division of processing among the various processors, which is essential for the implementation of SIMD, can be achieved. This method will be described later under the concept of “multi-space memory.”
Problem 2 with the poor efficiency of interprocessor communication can be solved by reconnecting the various processors depending on the processing that is to be achieved, and performing one-directional continuous transfer of stipulated types of data in a stipulated order on each connection route, thereby scheduling communication so that nearly 100% of the capacity of the bus can be used, and simultaneously achieving massively parallel pipeline processing.
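The benefit of one-directional continuous transfer can be suggested by a simple cycle count. This is a sketch under stated assumptions, not the bus design of the invention: it assumes a fixed latency per transfer and a route that accepts one item per clock cycle once the stream is flowing. All function names are hypothetical.

```python
def request_response_cycles(num_items: int, latency_cycles: int) -> int:
    """Cycles when every item pays the full latency before the next is requested."""
    return num_items * latency_cycles


def streamed_cycles(num_items: int, latency_cycles: int) -> int:
    """Cycles when items are sent back to back in one direction.

    Assumption: the latency is paid once for the first item, after
    which one further item arrives on every cycle.
    """
    if num_items == 0:
        return 0
    return latency_cycles + num_items - 1


def bus_utilization(num_items: int, latency_cycles: int) -> float:
    """Fraction of cycles in which the route delivers an item while streaming."""
    total = streamed_cycles(num_items, latency_cycles)
    return num_items / total if total else 0.0


if __name__ == "__main__":
    n, latency = 1_000_000, 100
    print(request_response_cycles(n, latency))
    print(streamed_cycles(n, latency))
    print(bus_utilization(n, latency))
```

In this model the fixed latency is amortized over the whole stream, so utilization approaches 100% as the number of items transferred on a route grows.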
In order to demonstrate the effectiveness of this method, which will be described later as “bus reconfiguration,” we shall later present an example of constructing a realistically designed system wherein a sort of 1 billion records is completed in roughly one second. This is more than 100,000 times the speed of the fastest known device.
Problem 3, the “supply of programs,” can be solved by adopting the SIMD scheme. In the case of SIMD, the largest problem is how to achieve the implicit (non-explicit) division of processing among the various processors, but this division of processing is handled automatically by the aforementioned “multi-space memory” technique, so the degree of freedom of programming can be maintained even with SIMD.
To wit, the present invention has as its object to provide a distributed memory type computer architecture wherein the input/output of elements within an array stored in various types of memory can be performed with a single instruction, and extremely high-speed parallel processing is achievable.