Many scientific data processing tasks involve extensive arithmetic manipulation of ordered arrays of data. Commonly, this type of manipulation, or "vector" processing, involves performing the same operation repetitively on each successive element of a set of data. Most computers are organized with an arithmetic unit which can communicate with a memory and with input-output (I/O). To perform an arithmetic function, each of the operands must be successively brought to the arithmetic unit from memory, the function must be performed, and the result must be returned to the memory. Machines utilizing this type of organization, i.e., "scalar" machines, have been found too slow and too hardware-inefficient for practical use in large-scale vector processing tasks.
In order to increase processing speed and hardware efficiency when dealing with ordered arrays of data, "vector" machines have been developed. Basically, a vector machine is one which deals with ordered arrays of data by virtue of its hardware organization, rather than by a software program and indexing, thus attaining higher speed of operation. One such vector machine is disclosed in U.S. Pat. No. 4,128,880, issued Dec. 5, 1978 to Cray. The vector processing machine of the Cray patent employs one or more registers for receiving vector data sets from a central memory and supplying the same at clock speed to segmented functional units, wherein arithmetic operations are performed. More particularly, Cray provides eight vector registers, each adapted for holding up to sixty-four vector elements. Each of these registers may be selectively connected to any one of a plurality of functional units, and one or more operands may be supplied thereto on each clock period. Similarly, each of the vector registers may be selectively connected for receiving results. In a typical operation, two vector registers are employed to provide operands to a functional unit and a third vector register is employed to receive the results from the functional unit.
Cray further provides a single port memory connected to each of the vector registers through a data bus for data transfers between the vector registers and the memory. Thus, a block of vector data may be transferred into vector registers from memory and operations may be accomplished in the functional units using data directly from the vector registers. This vector processing provides a substantial reduction in memory usage, where repeated computation on the same data is required, thus eliminating inherent control memory start up delays for these computations.
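The register-to-register operation described above may be illustrated by the following minimal sketch, which is a software model only and not the hardware of the Cray patent. It assumes the eight vector registers of sixty-four elements each stated above; the function name `vector_add`, the flat list used as memory, and the choice of registers 0, 1 and 2 are hypothetical and are introduced solely for illustration.

```python
VECTOR_LENGTH = 64         # elements per vector register, as stated above
NUM_VECTOR_REGISTERS = 8   # number of vector registers, as stated above


def vector_add(memory, a_addr, b_addr, out_addr, n):
    """Add two n-element arrays held in `memory`, 64 elements at a time.

    Hypothetical model: `memory` is a plain Python list standing in for
    the central memory, and registers 0 and 1 supply operands while
    register 2 receives results.
    """
    v = [[0] * VECTOR_LENGTH for _ in range(NUM_VECTOR_REGISTERS)]
    for start in range(0, n, VECTOR_LENGTH):
        length = min(VECTOR_LENGTH, n - start)
        # Block transfers from memory into two operand registers.
        v[0][:length] = memory[a_addr + start:a_addr + start + length]
        v[1][:length] = memory[b_addr + start:b_addr + start + length]
        # The functional unit produces one result per element; the
        # hardware would deliver one result per clock period.
        for i in range(length):
            v[2][i] = v[0][i] + v[1][i]
        # Block transfer of the result register back to memory.
        memory[out_addr + start:out_addr + start + length] = v[2][:length]
```

In this model an array longer than sixty-four elements is simply processed in successive blocks, each block passing through the registers once.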
Scalar operation is also possible in the Cray system, and scalar registers and functional units are provided therefor. The scalar registers, along with address registers and instruction buffers, are employed to minimize memory transfer operations and speed up instruction execution. Transfer intensity is further reduced by two additional buffers, one between the memory and the scalar registers and one between the memory and the address registers. Thus, memory transfers are accomplished on a block transfer basis, which minimizes computational delays associated therewith.
Further processing concurrency may also be accomplished in the Cray system using a process called "chaining". In this process, a vector result register becomes the operand register for a succeeding functional operation. This type of chaining is restricted to a particular clock period or "chain slot" time in which all issue conditions are met. Chaining of this nature is to some extent dependent upon the order in which instructions are issued and the functional unit timing.
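The benefit of chaining may be estimated with a simplified timing model, sketched below. The model is an assumption made for illustration: each functional unit is taken to have a fixed start-up latency in clock periods and thereafter to deliver one result per clock period, and instruction issue overhead and the "chain slot" restriction are ignored. The function names and the latency values used in the accompanying remark are hypothetical and are not taken from the Cray patent.

```python
def unchained_clocks(n, latency_first, latency_second):
    """Clock periods for two dependent vector operations without chaining.

    The second operation starts only after all n results of the first
    operation have been written to the result register.
    """
    return (latency_first + n) + (latency_second + n)


def chained_clocks(n, latency_first, latency_second):
    """Clock periods for the same two operations with chaining.

    The second functional unit consumes each result of the first as it
    arrives, so the two start-up latencies add but the n element times
    overlap.
    """
    return latency_first + latency_second + n
```

Under these assumptions, with sixty-four elements and hypothetical latencies of six and seven clock periods, the unchained sequence takes 141 clock periods and the chained sequence takes 77.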
Thus, the system of U.S. Pat. No. 4,128,880 accomplishes a significant increase in processing speed over conventional scalar processing for the large class of problems which can be vectorized. The use of register-to-register vector instructions, the concept of chaining, and the use of the plurality of independent segmented functional units provide a large amount of concurrency of processing. Further, since the start-up time for vector operations is nominal, the benefits of vector processing are obtainable even for short vectors.
The present invention employs an improved version of Cray's vector processing machine to provide a general purpose multiprocessor system for multitasking applications. In operation, independent tasks of different jobs or related tasks of a single job may be run on multiple processors. While multiprocessor organization has been accomplished in the prior art, inter-CPU communication in these prior art machines has been accomplished through the main memory, in a "loosely coupled" manner. Inter-CPU communication of this nature is hampered by the need to repetitively resort to relatively slow main or central memory references, and by access conflicts between the processors.
The multiprocessor of the present invention overcomes the substantial delays and software coordination problems associated with loosely coupled multiprocessing by providing a "tight coupling" communication circuit between the CPUs which is independent of the shared or central memory. The tight coupling communication circuits provide a set of shared registers which may be accessed by either CPU at rates commensurate with intra-CPU operation. Thus, the shared registers provide a fast inter-CPU communication path to minimize overhead for multitasking of small tasks with frequent data interchange. The present multiprocessor system may also couple tasks through the shared memory as provided by the prior art. However, the tight coupling communication circuits provide a hardware synchronization device through which loosely coupled tasks as well as tightly coupled tasks may be coordinated efficiently.
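The coordination of two CPUs through shared registers may be modelled in software as sketched below. This is a toy model under stated assumptions, not the circuit of the present invention: the class `SharedRegisterCluster`, its single synchronization flag, the atomic test-and-set operation, and the register count are all hypothetical, and Python threads stand in for the two CPUs.

```python
import threading
import time


class SharedRegisterCluster:
    """Toy model of registers shared by two CPUs (all names hypothetical)."""

    def __init__(self, num_registers=8):
        self.registers = [0] * num_registers
        self._flag = False            # one synchronization flag
        self._hw = threading.Lock()   # stands in for hardware arbitration

    def test_and_set(self):
        """Atomically set the flag; return True if it was already set."""
        with self._hw:
            previous = self._flag
            self._flag = True
            return previous

    def clear(self):
        """Release the flag so the other CPU may proceed."""
        with self._hw:
            self._flag = False


def cpu_task(cluster, increments):
    """Work done by one modelled CPU: guarded updates of shared register 0."""
    for _ in range(increments):
        while cluster.test_and_set():
            time.sleep(0)             # flag held by the other CPU; retry
        cluster.registers[0] += 1     # critical section
        cluster.clear()


def run_two_cpus(increments=500):
    """Run two modelled CPUs and return the final value of register 0."""
    cluster = SharedRegisterCluster()
    cpus = [threading.Thread(target=cpu_task, args=(cluster, increments))
            for _ in range(2)]
    for cpu in cpus:
        cpu.start()
    for cpu in cpus:
        cpu.join()
    return cluster.registers[0]
```

Because each update is guarded by the flag, the final register value equals the total number of increments performed by both modelled CPUs, and no reference to the central memory is involved in the coordination.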
Typically, prior art multiprocessors are characterized by a master-slave relationship between the processors. In this organization, the master processor must initiate and control multitasking operations, so that only one job may be run at a time. Because many jobs do not require multiprocessor efficiency, this type of master-slave organization often results in underutilization of the multiprocessor.
In the present multiprocessor system all processors are identical and symmetric in their programming functions, so that no master-slave relationship is required. Thus, one or more processors may be selectively "clustered" and assigned to perform related tasks of a single job by the operating system. This organization also allows each processor to operate independently whereby independent tasks of different jobs may be performed. Accordingly, the present multiprocessor system avoids the problems of underutilization and provides higher system throughput.
The multiprocessor system of the present invention is also uniquely adapted to minimize central memory access time and access conflicts involving memory-to-CPU and memory-to-I/O operations. This is accomplished by organizing the central memory in interleaved memory banks, each of which is independently accessible, in parallel with the others, during each machine clock period. Each processor is provided with a plurality of parallel memory ports connected to the central memory through hardware-controlled access conflict resolution logic capable of minimizing memory access delays and maintaining the integrity of conflicting memory references. This interleaved and multiport memory design, coupled with a short memory cycle time, provides a high-performance and balanced memory organization with sufficient bandwidth to support simultaneous high-speed CPU and I/O operations.
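The effect of interleaving on access conflicts may be illustrated by the following simplified model. It is a sketch under assumptions not stated in the text above: a single port issues at most one reference per clock period, the bank is selected by the address modulo the number of banks, and a referenced bank remains busy for a fixed number of clock periods. The function name and the parameter values mentioned in the accompanying remark are hypothetical.

```python
def clocks_for_references(addresses, num_banks, bank_busy_clocks):
    """Return the clock period count needed to issue all references.

    A reference to a busy bank is held until that bank is free, which
    models conflict resolution for a single port in the simplest way.
    """
    free_at = [0] * num_banks      # clock at which each bank is next free
    clock = 0
    for address in addresses:
        bank = address % num_banks          # assumed low-order interleaving
        clock = max(clock, free_at[bank])   # hold until the bank is free
        free_at[bank] = clock + bank_busy_clocks
        clock += 1                          # next reference issues no earlier
    return clock
```

With four banks and a bank busy time of four clock periods, sixteen consecutive addresses issue in 16 clock periods with no holds, whereas sixteen addresses spaced four apart all fall in the same bank and require 61 clock periods.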
Processing speed and efficiency are further improved in the present system through a new vector register design and organization which provides additional memory-vector register data transfer paths and substantially enhanced hardware automatic "flexible" chaining capability. This new vector register organization and the parallel memory port configuration allow simultaneous memory fetches, arithmetic, and memory store operations in a series of related vector operations which heretofore could not be accomplished. Thus, the present multiprocessor design provides higher speed and more balanced vector processing capabilities for both long and short vectors, characterized by heavy register-to-register or heavy memory-to-memory vector operations.