The present invention relates to a vector processor for concurrently processing a plurality of vector elements of vector data, and in particular, to a vector processor adopting a memory skew scheme for preventing deterioration in the accessing performance or efficiency when vector elements are consecutively accessed with a particular address increment value.
In a vector processor, data in the form of an array have usually been stored as vector data such that the sequence of elements of each row of the array is arranged in a continuous memory area. Consequently, accessing the sequence of elements of one row of the array uses a string of consecutive addresses. Namely, the addresses thus employed form a sequence whose values increase by an identical increment value (to be called a stride herebelow).
However, as is well known, the method of addressing the respective storage locations in a plurality of memory modules constituting a main memory is attended with a problem of variation in the execution speed of accesses to these addresses. For example, an interleaving method has been commonly employed as an ordinary memory addressing method. FIG. 30 shows an example of addresses allocated in a main memory including four memory modules in accordance with the interleaving method. The four memory modules are respectively assigned numbers 0 to 3, denoted MM#, namely, MM0 to MM3. In these interleaved addresses, consecutive addresses are sequentially assigned to mutually different memory modules.
Here, an address in this example is a number assigned to a memory position to be utilized as an accessing unit. This definition of an address applies to the following description unless otherwise specified. Moreover, it is assumed that the memory accessing unit in the vector processor of the prior art and in the vector processor of each embodiment according to the present invention is equal to the data length of an element of vector data. Namely, the accessing operation is accomplished for each vector data element.
In the interleaving method in which addresses are sequentially assigned to the different memory modules, when data is accessed with consecutive addresses, the access operation is achieved through the different memory modules in a sequential manner. This leads to a very rapid access to the data. However, it has been known that when equally separated locations are accessed based on addresses having an identical stride therebetween, the accesses may be concentrated on a particular memory module, which leads to deterioration of the accessing efficiency. FIG. 31 shows relationships between the stride and the memory access efficiency. For example, the problem above occurs when the value of the stride is four in the address configuration of FIG. 30. When a sequential access is carried out with the stride set to four beginning from address "0", the accessing is carried out through the addresses "0", "4", "8", "12", and so on. Each of these addresses is assigned to the memory module MM0, and hence the accessing speed is lowered due to congestion of the accesses.
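The concentration of accesses on one module can be checked with a short sketch. It assumes that the four-module layout of FIG. 30 is the usual low-order interleave, MM# = ADR mod 4, which is consistent with addresses 0, 4, 8, and 12 all falling in MM0. The function names are illustrative only and do not come from the cited documents.

```python
from collections import Counter

NUM_MODULES = 4  # four memory modules, as in the example of FIG. 30


def interleaved_module(adr: int, num_modules: int = NUM_MODULES) -> int:
    """Module number under plain interleaving (assumed to be ADR mod M)."""
    return adr % num_modules


def module_sequence(start: int, stride: int, count: int) -> list[int]:
    """Modules touched by `count` accesses from `start` with a fixed stride."""
    return [interleaved_module(start + i * stride) for i in range(count)]


if __name__ == "__main__":
    # Compare how evenly different strides spread accesses over the modules.
    for stride in (1, 2, 4):
        seq = module_sequence(0, stride, 8)
        print(f"stride {stride}: modules {seq}, load {dict(Counter(seq))}")
```

Under this assumption a stride of one visits all four modules in turn, while a stride of four sends every access to module 0.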
As means for minimizing the deterioration in the accessing efficiency, there has been known a so-called memory skew scheme. The mathematical basis of the memory skew scheme is given in articles such as D. J. Kuck: "ILLIAC IV Software and Application Programming", IEEE Transactions on Computers, Vol. C-17, No. 8, pp. 758-770, August 1968 or P. Budnik and D. J. Kuck, "Organization and Use of a Parallel Memories", IEEE Transactions on Computers, pp. 1566-1569, December 1971. Moreover, the memory skew scheme is employed in a number of variations. Some of such variations have been described in articles such as D. T. Harper, III and J. R. Jump, "Performance Evaluation of Vector Access in Parallel Memories using a Skewed Storage Scheme", IEEE Transactions on Computers, C-36 (12), pp. 1440-1449, December 1987; D. T. Harper, III and J. R. Jump, "Performance Evaluation of Vector Access in Parallel Memories using a Skewed Storage Scheme", Conf Proc of the 13th Annual International Symposium on Computer Architecture, pp. 324-238, June 1986; and the U.S. Pat. No. 4,918,600. Description will now be given of variation examples of the skew scheme in a system of four vector processors using the skew scheme and four memory modules.
Vector processors employing the skew scheme have been described in the U.S. Pat. Nos. 4,370,732 and 4,918,600. FIG. 32 shows the vector processor disclosed in the U.S. Pat. No. 4,918,600.
The constitution of FIG. 32 includes a processor 500 for sequentially issuing access requests, memory modules 510 to 513, buffers 520 to 523 each for temporarily keeping or storing therein access requests issued from the processor 500, and buffers 530 to 533 for temporarily keeping data respectively read from the memory modules 510 to 513. A reference numeral 540 indicates an address mapping circuit responsive to address information contained in an access request for selecting a memory module to which the access request is to be sent. The processor 500 can send out an access request in each cycle. An access request includes address information, and based thereon the access request is issued to the destination memory module. In the known example, although four cycles are required to access a memory module, an access request can be processed in each cycle because the four modules are accessed in sequence. Moreover, even in a case where access requests are successively sent to an identical memory module, while the memory module is being accessed in response to one access request, the succeeding access requests are provisionally kept in the buffers 520 to 523. Namely, succeeding access requests can be issued until these buffers become full. The buffers 530 to 533 are disposed to temporarily store the data acquired from the memory modules 510 to 513 so that the data can be returned to the processor 500 in the order in which the corresponding access requests were issued by the processor 500.
FIG. 33 shows an addressing applied to memory modules according to a first skew scheme described in the U.S. Pat. No. 4,918,600.
In the skew scheme of FIG. 33, each time the value of the access address is incremented by the number of the disposed memory modules (four in this example), the memory module to which the addresses are assigned is shifted. The relationship between the memory module number MM#, the address ADR, and the number M of memory modules is represented by expression (1) as follows.
$$\mathrm{MM\#} = (\mathrm{ADR} + \mathrm{ADR} \div M) \bmod 4 \qquad (1)$$
where mod 4 denotes computing a value modulo 4. In this case, since M takes the value 4, addresses 0, 4, and 8 are assigned to the modules 0, 1, and 2, respectively.
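Expression (1) can be tried out with a minimal sketch. It assumes that the division in the expression is integer division with the remainder discarded, which agrees with the stated assignment of addresses 0, 4, and 8 to modules 0, 1, and 2. The names `skew1_module` and `layout` are hypothetical.

```python
NUM_MODULES = 4  # M in expression (1); the modulus 4 equals M in this example


def skew1_module(adr: int, m: int = NUM_MODULES) -> int:
    """MM# = (ADR + ADR div M) mod M, with div assumed to be integer division."""
    return (adr + adr // m) % m


def layout(rows: int, m: int = NUM_MODULES) -> list[list[int]]:
    """Module numbers for addresses 0 .. rows*m - 1, m addresses per row."""
    return [[skew1_module(r * m + c, m) for c in range(m)] for r in range(rows)]


if __name__ == "__main__":
    # Each row of M consecutive addresses is rotated by one module.
    for r, row in enumerate(layout(4)):
        print(f"addresses {r * NUM_MODULES:2d}..{r * NUM_MODULES + 3:2d} -> modules {row}")
    # A stride of four now advances through the modules instead of staying on one.
    print([skew1_module(a) for a in (0, 4, 8, 12)])
```

Under this reading, each group of four consecutive addresses still covers all four modules, and the stride-four sequence 0, 4, 8, 12 is spread over four different modules.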
Next, FIGS. 34 and 35 show an addressing for memory modules according to the second skew scheme shown in the drawings of the U.S. Pat. No. 4,918,600.
In this skew scheme, each time the address value is increased by twice the memory module count (= 8), the next memory module is used for the memory assignment. In this situation, the relationship between the memory module number MM#, the address ADR, and the number M of memory modules is represented by expression (2) as follows.
$$\mathrm{MM\#} = (\mathrm{ADR} + \mathrm{ADR} \div (M \times 2)) \bmod 4 \qquad (2)$$
For example, addresses 0, 8, and 16 are assigned to the memory modules 0, 1, and 2, respectively.
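The two skew expressions differ only in the divisor, so they can be compared with one parameterized sketch. Here `period` is a hypothetical name for that divisor (M for expression (1), M×2 for expression (2)), and the division is again assumed to be integer division.

```python
NUM_MODULES = 4  # M, the number of memory modules


def skew_module(adr: int, period: int, m: int = NUM_MODULES) -> int:
    """MM# = (ADR + ADR div period) mod M; period = M or 2*M (assumed form)."""
    return (adr + adr // period) % m


def stride_modules(stride: int, period: int, count: int = 4) -> list[int]:
    """Modules touched by `count` accesses from address 0 with a fixed stride."""
    return [skew_module(i * stride, period) for i in range(count)]


if __name__ == "__main__":
    for stride in (4, 8):
        first = stride_modules(stride, period=NUM_MODULES)        # expression (1)
        second = stride_modules(stride, period=2 * NUM_MODULES)   # expression (2)
        print(f"stride {stride}: scheme 1 -> {first}, scheme 2 -> {second}")
```

With these assumptions, a stride of eight alternates between two modules under expression (1) but visits four different modules under expression (2), while a stride of four behaves the other way round. Which scheme is preferable therefore depends on the strides that actually occur.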
FIG. 36 shows the relationship between the address interval, i.e., the stride, and the access performance in a case where the first skew scheme represented by expression (1) and shown in FIG. 33 is used in the apparatus of FIG. 32. FIG. 37 shows the relationship between the stride and the access performance in a case where the second skew scheme represented by expression (2) and shown in FIGS. 34 and 35 is employed. Here, the access performance is assumed to be the value attained once the number of access requests being processed has become substantially constant, a sufficiently long period of time after the processor 500 starts issuing access requests. Moreover, the value of the access performance is represented as one when one element is accessed in each cycle. As can be understood by comparing FIGS. 36 and 37 with FIG. 31, when the first skew scheme of expression (1) or FIG. 33 is adopted, the number of stride values for which the performance is lowered is reduced. When the second skew scheme of expression (2) or FIGS. 34 and 35 is utilized, this advantage is still more pronounced.
In this connection, the vector processor of the U.S. Pat. No. 4,918,600 is a vector processor which sequentially issues an access request in each cycle. However, to increase the processing efficiency of the vector processor, there has already been employed a vector processor which simultaneously processes a plurality of elements of identical vector data in response to a single instruction. The simultaneous processing above is called a concurrent element processing. The number of elements to be concurrently processed is to be referred to as an element concurrence grade herebelow. A vector processing other than the concurrent element processing is to be called a sequential processing. Next, description will be given of a conventional vector processor accomplishing a concurrent element processing. FIG. 38 shows a vector processor having the element concurrence grade set to four, which has been described in the JP-A-63-66661 filed by the applicant of the present invention.
In this diagram, a reference numeral 14 denotes a vector register unit, which includes four vector data controllers 14-0, 14-1, 14-2, and 14-3 and a group of vector registers, not shown.
A reference numeral 15 indicates a requester module for transmitting access requests to a main memory. The requester module 15 includes four request control units 1 to 4 respectively connected to the vector data controllers 14-0 to 14-3.
Reference numerals 9 to 12 designate priority units for deciding priority of access requests for the access processing thereof.
The main memory 13 includes memory modules MM0 to MM3 respectively assigned with numbers "0" to "3". Addresses of these memory modules are allocated such that a continuous memory space is constituted with separate address spaces respectively of the four memory modules.
Subsequently, description will be given of a processing flow to be applied to an access request.
First, access requests are concurrently issued from the request control units 1 to 4 to the request buffer units 5 to 8, respectively. However, if no empty area is available in the respective request buffers 5-2 of the request buffer units 5 to 8, no access request is sent out from the request control units 1 to 4.
In the request buffer unit 5, an address decode unit 5-1 decodes an address signal belonging to an access request to determine a number assigned to a memory module to be accessed. The priority unit 9, 10, 11, or 12 corresponding to the memory module is then selected such that the request is notified to a request send-out unit 5-3 of the request buffer unit 5. Moreover, the access request is transferred to the request buffer 5-2 thereof.
When the request buffer 5-2 contains access requests, the request send-out unit 5-3 selects the oldest one of the access requests to transmit the request to one of the priority units 9 to 12 associated with the memory module as the destination of the access request.
The other request send-out units 6-3, 7-3, and 8-3 also conduct the similar processing.
In the priority unit 9, a priority logic or circuit 9-1 determines priority for processing access requests from the respective request buffer units 5 to 8.
The request accept controller 9-2 accepts the access request having the highest priority level. On accepting an access request, the controller 9-2 sends a request accept signal to the request buffer unit 5, 6, 7, or 8 from which the access request has been issued.
The priority logic circuits of the other priority units 10 to 12 also perform similar processing.
In the request buffer unit 5, 6, 7, or 8 having received the request accept signal, the request send-out unit 5-3, 6-3, 7-3, or 8-3 transmits a subsequent access request to the associated priority unit 9, 10, 11, or 12.
On receiving an access request from either one of the request buffer units 5 to 8, the priority unit 9 sends the access request to the memory module MM0.
The other priority units also carry out the similar processing.
The four vector elements can be thereby processed in a concurrent manner.
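The flow described above (address decode, send-out of the oldest request, priority decision, acceptance) can be sketched for a single cycle as follows. This is a simplified model under stated assumptions and not the circuit of FIG. 38: the address decode is taken to be plain interleaving (ADR mod 4), the priority logic is taken to favour the lowest-numbered request buffer, and all function names are hypothetical.

```python
from collections import deque

NUM_MODULES = 4  # memory modules MM0 to MM3


def decode_module(adr: int) -> int:
    """Address decode step; plain interleaving (ADR mod 4) is assumed."""
    return adr % NUM_MODULES


def arbitrate_cycle(buffers: list[deque]) -> dict[int, tuple[int, int]]:
    """Run one arbitration cycle over the request buffers.

    Each non-empty buffer offers its oldest request to the priority unit of
    the decoded memory module. Each priority unit accepts one request, with
    the lowest buffer index winning (an assumed fixed priority). Accepted
    requests are removed from their buffers; the others wait.
    Returns {module: (buffer_index, address)}.
    """
    accepted: dict[int, tuple[int, int]] = {}
    for idx, buf in enumerate(buffers):
        if not buf:
            continue
        module = decode_module(buf[0])
        if module not in accepted:  # this priority unit has not accepted yet
            accepted[module] = (idx, buf[0])
    for idx, _adr in accepted.values():
        buffers[idx].popleft()  # corresponds to the request accept signal
    return accepted


if __name__ == "__main__":
    # Four elements with stride 1 go to four different modules in one cycle.
    print(arbitrate_cycle([deque([0]), deque([1]), deque([2]), deque([3])]))
    # Four elements with stride 4 all target module 0; only one is accepted.
    print(arbitrate_cycle([deque([0]), deque([4]), deque([8]), deque([12])]))
```

In this model the four elements are handled concurrently only when their decoded module numbers differ; otherwise the losing requests remain in their buffers for a later cycle.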
In order to improve the access performance of the vector processor, if it is desired to increase the element concurrence grade in the conventional vector processor of FIG. 38, it is necessary to increase the number of request control units (1 to 4 in the example) and the number of request buffer units (5 to 8 therein). As a result, the number of signal lines connecting these request control units to a memory control unit 20 and the number of signal lines linking the memory control unit 20 with the main memory 13 increase. It has been found that this leads to a first problem, namely, that the overall memory control unit cannot easily be accommodated on a device mounting substrate. To overcome this problem, a vector processor has been desired which has a higher element concurrence grade and which can be easily mounted on the substrate.
Moreover, as a result of an attempt to implement such a vector processor which has a higher element concurrence grade and which can be easily mounted on the substrate, it has been found that a second problem appears. Namely, when the vector processor conducts memory accesses with various kinds of access strides, the access performance is decreased.
Furthermore, in the vector processor of the prior art shown in FIG. 38, even when the element concurrence grade is low, the performance may possibly be deteriorated depending on the kinds of the employed skew schemes. It has been found that this phenomenon accordingly causes a third problem of limitation of available skew schemes.
These problems have been found by the inventor of the present invention.