A vector computer has a plurality of vector pipelines for processing a plurality of pieces of data simultaneously at each machine clock.
A single vector arithmetic instruction is divided among the plurality of vector pipelines for processing.
For example, page 214 of Non-Patent Document 1 depicts parallel pipelines and multiple parallel pipelines.
When a single vector arithmetic instruction is divided and executed in a plurality of vector pipelines, both the input and output vectors are also divided among the respective vector pipelines for data exchange.
In order to equalize the processing time of the vector pipelines even when the number of elements (vector length) is small, the data string is interleaved element by element when it is retained in a vector register.
The simultaneous processing of a plurality of pieces of data at each machine clock applies not only to vector arithmetic instructions but also to main memory access instructions.
Various techniques may be used to exchange a plurality of pieces of data between a central processing unit and a main memory device at each machine clock. In general, the central processing unit is provided with a plurality of interfaces to the main memory device, called ports.
The plurality of ports can be simultaneously operated in a single machine clock, which makes it possible to transfer a plurality of pieces of data to the main memory device or receive a plurality of pieces of data from the main memory device at every machine clock.
Here, the main memory device has data storing sections that are divided port by port.
The data storing sections and the ports correspond to each other in a one-to-one relationship. If a main memory address to be accessed is specified, the data storing section and port to be accessed are determined uniquely.
The main memory device of the vector computer generally stores data in an interleaved manner, with a data element as the unit of interleaving.
Such a design is intended to perform data transfer at maximum speed when a plurality of vector pipelines simultaneously access a plurality of pieces of data that are stored in consecutive addresses.
A configuration example of a vector processor with such a plurality of vector pipelines and a plurality of ports is described, for example, in Patent Document 1.
Now, FIG. 1 shows a configuration example of a vector processing apparatus of the background art.
A central processing unit 101 includes an instruction issuance control section 102, a vector processing section 103, an address calculating section 104, and CPU input/output ports 105.
The CPU input/output ports 105 are connected to main memory input/output ports 107 of a main memory device 106, respectively.
The main memory input/output ports 107 are connected to main memory data storing sections 108, respectively. The main memory data storing sections 108 have a multibank configuration with a data element as the unit of interleaving.
The vector processing section 103 includes a plurality of vector pipelines 109 and a crossbar 110.
Each vector pipeline 109 includes a vector arithmetic section 111 and a register bank (any one of register banks 112-0 to 112-7 in FIG. 1). The whole of the register banks 112-0 to 112-7 is a single vector register. The vector register is divided into a plurality of banks to constitute the respective register banks 112-0 to 112-7.
Each of the register banks 112-0 to 112-7 has a read pointer 113. The crossbar 110 can connect the inputs and outputs of the plurality of vector pipelines 109 and the plurality of CPU input/output ports 105 in any combination.
The register banks 112-0 to 112-7 are distributed among the plurality of vector pipelines, one register bank per vector pipeline.
In the present example, the vector register retains 256 elements, which are divided among eight vector pipelines.
The division is element by element such that the element 0 is on the register bank 112-0, the element 1 is on the register bank 112-1, . . . , the element 7 is on the register bank 112-7, the element 8 is on the register bank 112-0, and so on.
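The element-by-element division described above can be illustrated with a minimal sketch. The function names `bank_of`, `slot_of`, and `bank_contents` are hypothetical and introduced only for illustration; the constants follow the present example (eight register banks, 256 elements).

```python
NUM_BANKS = 8        # register banks 112-0 to 112-7
VECTOR_LENGTH = 256  # elements retained by the vector register

def bank_of(element):
    """Return the number n of the register bank 112-n that holds the element."""
    return element % NUM_BANKS

def slot_of(element):
    """Return the position of the element inside its bank (0 is the top element)."""
    return element // NUM_BANKS

def bank_contents(bank):
    """List the element numbers held by register bank 112-<bank>, top first."""
    return list(range(bank, VECTOR_LENGTH, NUM_BANKS))

if __name__ == "__main__":
    for element in (0, 1, 7, 8):
        print(element, "-> bank", bank_of(element), "slot", slot_of(element))
```

Under these assumptions each register bank holds 256 / 8 = 32 elements.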
When the vector processing section 103 receives a vector instruction from the instruction issuance control section 102, the vector processing section 103 activates the vector register's register bank(s) that is/are designated in the instruction word.
Each of the register banks 112-0 to 112-7 exchanges data with each CPU input/output port 105 through the crossbar 110.
When the address calculating section 104 receives a main memory access instruction such as a vector load instruction or a vector store instruction from the instruction issuance control section 102, the address calculating section 104 calculates main memory addresses according to the designation of the instruction word.
In order for the vector pipelines 109 to access the main memory device 106 independently, the address calculating section 104 is configured so that it can calculate as many main memory addresses as there are vector pipelines in a single machine clock.
The central processing unit 101 has a plurality of CPU input/output ports 105, and the main memory device 106 is divided into as many sections as there are CPU input/output ports.
In the present example, the central processing unit 101 has thirty-two CPU input/output ports 105-0 to 105-31, and the main memory device 106 is divided into thirty-two sections.
The CPU input/output ports 105-0 to 105-31 are connected to the main memory input/output ports 107-0 to 107-31 in a one-to-one relationship, respectively. The main memory input/output ports 107-0 to 107-31 are connected to the main memory data storing sections 108-0 to 108-31 in a one-to-one relationship, respectively.
The CPU input/output ports 105 and the main memory input/output ports 107 are operable with the same machine clock.
The main memory data storing sections 108 are divided into banks with an element as the unit of interleaving.
For example, the main memory data storing section 108-0 contains the element 0, and the main memory data storing section 108-1 contains the element 1.
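The unique correspondence between a main memory address and a port can be sketched as follows. This is only an illustration under two assumptions not stated explicitly in the text: addresses are counted in units of one element, and the interleave continues cyclically (address modulo the number of ports). The function name `port_of` is hypothetical.

```python
NUM_PORTS = 32  # ports 105-0..105-31 / 107-0..107-31, sections 108-0..108-31

def port_of(address):
    """Return the number n of the port and data storing section 108-n
    that serves the given element address (assumed cyclic interleave)."""
    return address % NUM_PORTS

if __name__ == "__main__":
    # Eight consecutive addresses, as accessed by eight pipelines in one clock.
    ports = [port_of(a) for a in range(8)]
    print(ports)
    print("all distinct:", len(set(ports)) == len(ports))
```

With this mapping, pieces of data stored at consecutive addresses fall on different ports, which is what allows them to be transferred in the same machine clock.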
Next, an example of the configuration and operation of the address calculating section 104 will be described with reference to FIG. 2.
A start address is given from the instruction issuance control section 102 through a signal line 601, and is recorded into a start address retaining section 602.
The value of the start address retaining section 602 is passed to eight address calculating circuits 608, which calculate addresses corresponding to the outputs from the eight vector pipelines 109, respectively.
A pipeline offset 603 retains the product of each pipeline number and the stride value of the vector store instruction.
An adder 604 calculates the sum of the start address and the pipeline offset 603 to determine the address of the top element of the register bank. The address is transmitted to the crossbar 110 through an address signal line 607.
The calculated address is also input to an adder 606, which adds the value of an offset 605 to it, and the resulting address is output at the next machine clock. The value of the offset 605 is the product of the stride value of the vector store instruction and the number of pipelines.
By adding such a value, the address value of the next element in the register bank 112-n (here, “n” indicates one of numbers from 0 to 7) can be calculated.
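The operation of one address calculating circuit 608 can be sketched in software as follows. This is a behavioral illustration only, not a description of the actual circuit; the class name `AddressCalculatingCircuit` and the method name `clock` are hypothetical, and addresses and strides are assumed to be in element units.

```python
NUM_PIPELINES = 8  # vector pipelines 109 in the present example

class AddressCalculatingCircuit:
    """Behavioral model of one address calculating circuit 608."""

    def __init__(self, start_address, pipeline_number, stride):
        pipeline_offset = pipeline_number * stride      # pipeline offset 603
        self.offset = stride * NUM_PIPELINES            # offset 605
        # Adder 604: address of the top element of the register bank.
        self.address = start_address + pipeline_offset

    def clock(self):
        """Output the current address, then advance it by the offset 605
        (adder 606) so that the next element's address is output next clock."""
        current = self.address
        self.address = current + self.offset
        return current

if __name__ == "__main__":
    circuits = [AddressCalculatingCircuit(0, n, 1) for n in range(NUM_PIPELINES)]
    for t in range(2):
        print("clock", t, [c.clock() for c in circuits])
```

With a start address of 0 and a stride of 1, the eight circuits output the addresses 0 to 7 at the first machine clock and 8 to 15 at the next.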
Next, specific description will be given of the operation of the vector store instruction in the vector processing apparatus of the background art.
When a vector store instruction is given, the instruction issuance control section 102 activates the vector processing section 103 and the address calculating section 104.
Each vector pipeline 109 of the vector processing section 103 reads the element that the read pointer 113 of the designated register bank points to, and sends the element to the crossbar 110.
The read pointer 113, which indicates the read position, initially points to the top element and is controlled so as to point to the next element after each read.
That is, at the first machine clock, the element 0 is output from the register bank 112-0, the element 1 is output from the register bank 112-1, and so on. At the next machine clock, the element 8 is output from the register bank 112-0, the element 9 is output from the register bank 112-1, and so on.
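The read sequence above can be reproduced with a short sketch. The class name `RegisterBank` and the method name `read` are hypothetical; the bank contents assume the element-by-element division of the present example (eight banks, 256 elements).

```python
class RegisterBank:
    """Behavioral model of one register bank 112-n with its read pointer 113."""

    def __init__(self, bank_number, num_banks=8, vector_length=256):
        # Elements n, n + 8, n + 16, ... are held by register bank 112-n.
        self.elements = list(range(bank_number, vector_length, num_banks))
        self.read_pointer = 0  # initially points to the top element

    def read(self):
        """Return the element the read pointer points to, then advance it."""
        element = self.elements[self.read_pointer]
        self.read_pointer += 1
        return element

if __name__ == "__main__":
    banks = [RegisterBank(n) for n in range(8)]
    for t in range(2):
        print("clock", t, [bank.read() for bank in banks])
```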
Meanwhile, the address calculating section 104 calculates the write addresses of the data output from the respective vector pipelines 109.
The address calculating section 104 can calculate as many addresses as there are vector pipelines 109 in a single machine clock.
Consequently, all the outputs from the vector pipelines 109 are simultaneously input to the crossbar 110.
In the present example, the elements 0, 1, 2, . . . on the register banks 112-0 to 112-7 are assumed to be stored at addresses 0, 1, 2, . . . on the main memory data storing sections 108.
The data input to the crossbar 110 is transmitted through the CPU input/output ports 105 and main memory input/output ports 107 to the main memory data storing sections 108, and written to the main memory. If a conflict occurs here at any output port of the crossbar 110, the data to be output wait in the crossbar 110.
FIG. 3 shows a specific operation time chart.
The upper half of FIG. 3 shows the arrangement of the elements on the register. The lower half of FIG. 3 shows the numbers of the elements to be output at each time.
The elements 0 to 7 are output at the first machine clock, and the elements 8 to 15 are output at the next machine clock.
Since the pieces of data are transferred to respective different output ports element by element, no conflict occurs in the crossbar 110. As a result, it takes 32 machine clocks before all the elements are output.
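The figure of 32 machine clocks and the absence of conflicts can be checked with a small simulation. This is a sketch under stated assumptions: the function name `simulate_store` is hypothetical, addresses are in element units, the port is taken as the address modulo the number of ports, and only the clocks at which elements are issued are counted (any waiting inside the crossbar 110 is not modeled).

```python
def simulate_store(vector_length=256, num_pipelines=8, num_ports=32,
                   start_address=0, stride=1):
    """Return (issue_clocks, conflicts) for one vector store instruction.

    A conflict is counted whenever two elements issued in the same machine
    clock are directed to the same output port (assumed: address % num_ports).
    """
    clocks = 0
    conflicts = 0
    for first in range(0, vector_length, num_pipelines):
        elements = range(first, min(first + num_pipelines, vector_length))
        ports = [(start_address + e * stride) % num_ports for e in elements]
        conflicts += len(ports) - len(set(ports))
        clocks += 1
    return clocks, conflicts

if __name__ == "__main__":
    print(simulate_store())  # 256 elements, 8 pipelines, stride 1
```

For the present example (256 elements, eight pipelines, stride 1), the function returns 32 issue clocks and no conflicts, consistent with 256 / 8 = 32.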
According to the background art, when performing a vector store instruction, the vector pipelines can thus simultaneously output data and transfer the data through different CPU input/output ports to execute the vector store instruction with high throughput.
Patent Document 1: JP-A-2005-038185
Non-Patent Document 1: Sidney Fernbach, “Supercomputers” (translated by Shigeo Nagashima, 1988, Personal Media Corp.)