1. Field of the Invention
The present invention relates to a numerical arithmetic processing unit for processing numeric data, and more particularly, to a unit for processing vast amounts of numerical operations at a high speed. More specifically, the present invention relates to a numerical arithmetic processing unit for efficiently performing vast amounts of numerical operations which must be repeatedly executed at a high speed, such as sum-of-products operations and load value update operations employed in a neural network mathematical model, for example.
2. Description of the Background Art
With the development of semiconductor integrated circuit fabrication techniques, the processing speed of numerical arithmetic processing units has increased. Namely, a semiconductor integrated circuit is improved in degree of integration and in speed as the elements and the wires electrically interconnecting the elements are further miniaturized, thereby increasing the speed of the numerical arithmetic processing unit. However, the miniaturization of semiconductor elements now approaches its physical limit, and it is extremely difficult to remarkably improve the degree of integration and the speed of a semiconductor integrated circuit by such process techniques alone.
To this end, a technique of improving the overall arithmetic processing speed by executing arithmetic operations in parallel has recently attracted attention.
In general, a conventional parallel processing technique extracts only parallel-processable operations from a program described in the form of a serial sequential procedure, and distributes them to a plurality of arithmetic processing units. In relation to such a method of distributing parallel-processable operations to a plurality of arithmetic processing units, there have been developed an MIMD (multiple instruction, multiple data stream) system, which distributes a plurality of arithmetic operations having different processing contents to arithmetic processing units for parallel processing, an SIMD (single instruction, multiple data stream) system, which distributes a number of arithmetic operations having common contents to a plurality of arithmetic processing units for parallel processing, and the like.
In general, however, the MIMD system is inferior in parallelization efficiency, although average meritorious effects can be expected for various applications, programs can be readily described, and the parallelization efficiency fluctuates little regardless of how the procedures are described. According to this system, operations having different contents are processed in parallel with each other, and hence it is difficult to extract a large number of operations which are processable in parallel, leading to a reduction in the number of arithmetic processing units actually operating in parallel, i.e., in parallelization efficiency.
In the SIMD system, on the other hand, the parallelization efficiency is remarkably influenced by the contents of the operations, although its hardware structure is relatively simple and its design is simplified since a plurality of arithmetic processing units execute a common instruction. This is because the parallelization efficiency depends on whether or not a set of data to be processed in parallel can be efficiently prepared. In simulation of a natural phenomenon or in processing along a mathematical model expression of a neural network or the like, however, it is necessary to operate on vast amounts of numerical values repeatedly, i.e., to repeatedly execute the same arithmetic operation. In such fields, therefore, process distribution in the SIMD system is simple and effective for implementing high parallelization efficiency. A unit for executing such vast amounts of numerical arithmetic operations is indispensable for future development of the field of information processing, and improvement in the performance of parallel arithmetic processing units of the SIMD system is therefore expected.
On the other hand, the neural network has attracted attention as an information processing technique simulating the operation principle of living neurons. When such a neural network is employed, it is possible to construct a flexible system having high failure immunity, which has been hard to implement in a conventional programmed information processing system. In particular, the neural network is highly effective in recognition systems for images, characters and sounds, which are hard to program, in multi-degree-of-freedom control systems and the like. However, the technique of implementing a neural network is still under development, and many neural networks are still at the stage of low-speed, small-scale systems utilizing general purpose microcomputers. Thus, a parallel processor architecture suitable for a high-speed, large-scale neural network is awaited.
Against this background, the parallel processing technique of the SIMD system can be regarded as the architecture most suitable for a neural network. The reason for this resides in the arithmetic structure of the neural network.
In the neural network, all arithmetic elements (neurons) perform weighted average processing and nonlinear processing, as described later in detail. The processed data, such as synapse load values and neuron state output values, vary with the arithmetic elements. Thus, it is possible to supply an instruction to all arithmetic elements (neurons) in common. This condition meets the requirement of a single instruction operating on multiple data, which characterizes the SIMD system.
FIG. 52 illustrates a conceptual structure of each arithmetic element, i.e., a neuron 950, which is employed in a neural network. Referring to FIG. 52, the neuron 950 includes a synapse load part 952 which weighs output state values Sp, Sq, . . . , Sr supplied from other neurons with prescribed synapse load values, a total sum part 954 which obtains the total sum of load signals received from the synapse load part 952, and a nonlinear conversion part 956 which nonlinearly processes an output of the total sum part 954.
The synapse load part 952, which stores weight values Wia (a=p, q, . . . , r) for respective ones of related neurons, weighs the received output state values Sp, Sq, . . . Sr with the corresponding load values Wia and supplies the same to the total sum part 954. Each synapse load Wia indicates coupling strength between a neuron a and the neuron 950 (neuron i).
The total sum part 954 obtains the total sum of the load state values Wia·Sa received from the synapse load part 952. The total sum value ΣWia·Sa outputted from the total sum part 954 provides the membrane potential ui of this neuron 950 (neuron i). The total sum Σ is executed over every one of the related neuron units a.
The nonlinear conversion part 956 applies a prescribed nonlinear function f to the membrane potential ui received from the total sum part 954, to form the output state value Si = f(ui) of the neuron 950. The nonlinear function f() employed in the nonlinear conversion part 956 is generally a monotone non-decreasing function such as a step function or a sigmoid function.
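The processing of the neuron of FIG. 52 may be sketched as follows. This is an illustrative software model, not the claimed hardware; the sigmoid function is assumed here as one example of the nonlinear function f:

```python
import math

def neuron_output(weights, states):
    """Model of the neuron 950 of FIG. 52: the synapse load part and the
    total sum part form the membrane potential u = sum of Wia*Sa, and the
    nonlinear conversion part applies f (here, a sigmoid) to obtain Si."""
    u = sum(w * s for w, s in zip(weights, states))  # total sum part 954
    return 1.0 / (1.0 + math.exp(-u))                # nonlinear conversion part 956
```

For instance, with zero weights the membrane potential is 0 and the sigmoid yields the midpoint output 0.5.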
A neural network employs a plurality of neurons each having the function shown in FIG. 52. In a hierarchical neural network, the neurons are grouped and the respective groups are layered, to provide a neural network of hierarchical structure.
FIG. 53 shows an exemplary structure of a three-layer neural network. A hierarchical neural network includes an input layer, an intermediate layer (hidden layer) and an output layer. The intermediate layer may include an arbitrary number of layers. FIG. 53 shows layers I, J and K. These layers I, J and K are arbitrary layers satisfying conditions of being adjacent to each other in a neural network. The layer I includes neurons X1, X2, X3 and X4 and the layer J includes neurons Y1, Y2, Y3 and Y4, while the layer K includes neurons Z1, Z2 and Z3. The neurons Xa (a=1 to 4) of the layer I are coupled with the neurons Yb (b=1 to 4) of the layer J with weights Wbaj. The neurons Yb of the layer J are coupled with the neurons Zc (c=1 to 3) of the layer K with weights Wcbk.
One of the features of the neural network resides in that the weights indicating coupling strength between neurons can be set at optimum values by "learning". One such learning method is called "back propagation", which is a form of learning with an educator (supervised learning). This back propagation method is now briefly described.
When a certain input pattern P is supplied, the respective neurons asynchronously operate to change their output state values. In the hierarchical neural network, the neuron output state values are transmitted in the order of the input layer → the intermediate layer → the output layer (feed-forward structure). Namely, when the output states Xa (for the purpose of convenience, neurons and corresponding output state values are denoted by the same symbols) of the neurons X1 to X4 of the layer I are outputted, the neurons Yb of the layer J have the following membrane potential ub:

ub = ΣWbaj·Xa
and their output state values Yb are as follows:

Yb = f(ub)
When the output state values Yb of the neurons Y1 to Y4 provided in the layer J are transmitted, the neurons Z1 to Z3 of the layer K have the following membrane potential uc:

uc = ΣWcbk·Yb
and their output state values Zc are as follows:

Zc = f(uc)
where the total sum Σ is obtained with respect to every one of the neurons included in the lower layer.
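The feed-forward pass through the layers I, J and K of FIG. 53 may be sketched as follows. This is an illustrative model under the assumption that f is a sigmoid; the index convention Wj[b][a] (coupling neuron Xa to neuron Yb) and Wk[c][b] (coupling Yb to Zc) follows the Wbaj and Wcbk notation above:

```python
import math

def sigmoid(u):
    """Assumed nonlinear function f (a monotone non-decreasing function)."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(X, Wj, Wk):
    """Feed-forward pass of FIG. 53: ub = sum of Wbaj*Xa, Yb = f(ub);
    uc = sum of Wcbk*Yb, Zc = f(uc)."""
    Y = [sigmoid(sum(Wj[b][a] * X[a] for a in range(len(X))))
         for b in range(len(Wj))]
    Z = [sigmoid(sum(Wk[c][b] * Y[b] for b in range(len(Y))))
         for c in range(len(Wk))]
    return Y, Z
```

With four neurons in each of the layers I and J and three in the layer K, Wj is a 4×4 matrix and Wk a 3×4 matrix, matching FIG. 53.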
In learning, errors between an output pattern S (S1 to Sk, where k represents the number of the neurons included in the output layer) outputted from the output layer and an educator pattern T are obtained. The educator pattern shows the output pattern which is expected for the input pattern P. Assuming that the layer K shown in FIG. 53 is the output layer, an error ec of the output state value of each neuron Zc is provided as follows:

ec = Tc − Zc

An effective error δc is obtained from the error ec as follows:

δc = ec·df(uc)/duc

where df(uc)/duc represents the differential of the output state value of the neuron Zc with respect to the membrane potential uc. These effective errors δc are transmitted to each neuron Yb of the layer J, so that an error eb of the output state value of the neuron Yb is obtained as follows:

eb = ΣWcbk·δc

where the total sum Σ is obtained with respect to every neuron of the layer K. An effective error δb with respect to the neuron Yb of the layer J is obtained from the error eb as follows:

δb = eb·df(ub)/dub
Such errors e are successively propagated from the upper layer to the lower layer, so that the weights W indicating coupling strength levels between the neurons are corrected. The weight Wcbk between each neuron Yb of the layer J and each neuron Zc of the layer K is corrected in accordance with the following equations:

ΔWcbk = α·ΔWcbk(t−1) + η·δc·Yb
Wcbk = ΔWcbk + Wcbk(t−1)

where ΔWcbk(t−1) and Wcbk(t−1) represent the weight correction value and the weight value obtained in the preceding weight correction cycle, and α and η represent prescribed coefficients. Similarly, the weight Wbaj indicating coupling strength between each neuron Xa of the layer I and each neuron Yb of the layer J is corrected in accordance with the following equations:

ΔWbaj = α·ΔWbaj(t−1) + η·δb·Xa
Wbaj = ΔWbaj + Wbaj(t−1)
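The weight correction equations above, with the momentum term α·ΔW(t−1), may be sketched as follows. The function name and argument layout are illustrative only:

```python
def update_weights(W, dW_prev, deltas, states, alpha, eta):
    """Weight correction with momentum, per the equations above:
       dW(t)  = alpha * dW(t-1) + eta * delta_c * Y_b
       W(t)   = W(t-1) + dW(t)
    W and dW_prev are matrices indexed [c][b]; deltas are the effective
    errors of the upper layer, states the output state values of the lower
    layer, alpha and eta the prescribed coefficients."""
    dW = [[alpha * dW_prev[c][b] + eta * deltas[c] * states[b]
           for b in range(len(states))]
          for c in range(len(deltas))]
    W_new = [[W[c][b] + dW[c][b] for b in range(len(states))]
             for c in range(len(deltas))]
    return W_new, dW
```

The returned correction matrix dW is fed back as dW_prev in the next weight correction cycle, realizing the momentum term.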
In back propagation, the errors are successively propagated from the upper layer to the lower layer, so that the weights W of the neurons of the respective layers are corrected in accordance with the errors. The weights W are repeatedly corrected to minimize the errors with respect to the educator pattern T.
In such a neural network, the output state values are successively transmitted from the lower layer to the upper layer in an input pattern recognizing operation. At this time, membrane potentials and output state values are calculated in the respective neurons. These calculations correspond to weighted average processing and nonlinear conversion processing. Also in a weight correcting operation, the same arithmetic operation is executed in the respective neurons for correcting the weights. In the weight correcting operation, the aforementioned feed forward processing, error back propagation processing and weight correcting processing are repeatedly executed until errors between the educator pattern and the output patterns are minimized or reduced below a prescribed threshold value. Thus, it is possible to execute such operations in parallel with each other, with the neurons regarded as arithmetic elements (units) in accordance with the SIMD system.
FIG. 54 shows an exemplary structure of a conventional SIMD system arithmetic processing unit. The structure shown in FIG. 54 is described in an article entitled "Parallel Architectures for Artificial Neural Nets" by S. Y. Kung et al., Proceedings of ICNN (International Conference on Neural Network) 1988, IEEE, vol. II pp. 165 to 172, for example.
Referring to FIG. 54, the parallel processing unit includes three processing units P#1 to P#3. The processing units P#1 to P#3, which are identical in structure to each other, include local memories LM1 to LM3 storing weight data, registers R1 to R3 for storing numeric data (output state values) to be processed, and arithmetic parts AU1 to AU3 executing arithmetic operations decided by instructions received through a control bus CB on the weight data read from the local memories LM1 to LM3 and the numeric data stored in the registers R1 to R3.
The registers R1 to R3 are cascade-connected with each other, and an output part of the register R3 is connected to an input part of the register R1 through a register R4. The registers R1 to R4 have data transfer functions, and form a ring register.
The processing units P#1 to P#3 are supplied with address signals for addressing the local memories LM1 to LM3 in common from a controller (not shown) through an address bus AB. The processing units P#1 to P#3 are also supplied with instructions specifying arithmetic operations to be executed from the controller (not shown) through a control bus CB. Referring to FIG. 54, the instructions which are received through the control bus CB are supplied to the arithmetic parts AU1 to AU3. In the structure shown in FIG. 54, therefore, the same address positions are specified in the local memories LM1 to LM3, and the same instructions are executed in the processing units P#1 to P#3.
The arithmetic processing units shown in FIG. 54 equivalently express the neurons Z1, Z2 and Z3 of the layer K shown in FIG. 53. The processing units P#1 to P#3 correspond to the neurons Z1 to Z3, respectively. The local memories LM1 to LM3 each store weight data in prescribed order. The output state values stored in the registers R1 to R3 are successively shifted, and accordingly positions of the weight data stored in the local memories LM1 to LM3 are adjusted, since the same addresses are specified. The operation is now described.
FIG. 55 shows a state of a first cycle. Referring to FIG. 55, output state values S1 to S3 are stored in the registers R1 to R3 respectively. Weight data W11, W22 and W33 are read from the local memories LM1 to LM3 respectively. The arithmetic parts AU1 to AU3 calculate the products of the weight data W11, W22 and W33 read from the corresponding local memories LM1 to LM3 and the output state values S1 to S3 stored in the corresponding registers R1 to R3 respectively. Thus, the arithmetic parts AU1 to AU3 generate load values W11·S1, W22·S2 and W33·S3 respectively. The load values calculated by the arithmetic parts AU1 to AU3 are stored in internal registers (not shown).
FIG. 56 shows a state of a second cycle. In the second cycle, the output state values S1 to S4 stored in the registers R1 to R4 are shifted anticlockwise. Thus, the registers R1 to R3 store the output state values S2 to S4 respectively. The address supplied to the address bus AB is incremented by 1, and the next weight data W12, W23 and W34 are read from the local memories LM1 to LM3. In the arithmetic parts AU1 to AU3, load values W12·S2, W23·S3 and W34·S4 are calculated and added to the previously calculated load values. This operation is repeated also in a third cycle, and the registers R1 to R3 store the output state values S4, S1 and S2 in a fourth cycle as shown in FIG. 57. Weight data W14, W21 and W32 are read from the local memories LM1 to LM3. The arithmetic parts AU1, AU2 and AU3 calculate load values W14·S4, W21·S1 and W32·S2, and accumulate these values on the sums of the load values calculated in the preceding cycles. Thus, membrane potentials ΣW1j·Sj, ΣW2j·Sj and ΣW3j·Sj are obtained in the arithmetic parts AU1, AU2 and AU3 after completion of the fourth cycle, as shown in FIG. 54.
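The cycle-by-cycle operation of FIGS. 55 to 57 may be simulated as follows. This is a conceptual sketch, simplified to an equal number n of processing units and circulating state values (the figures use three units and four state values); in cycle t, unit i holds the state value S[(i+t) mod n] and reads the weight W[i][(i+t) mod n] from its local memory, so after n cycles each unit has accumulated its full membrane potential:

```python
def ring_simd_forward(W, S):
    """SIMD ring-register simulation: all units execute the same
    multiply-accumulate instruction in each cycle while the state values
    circulate; acc[i] ends up holding the membrane potential sum of
    W[i][j]*S[j] over j."""
    n = len(S)
    ring = list(S)               # registers R1..Rn
    acc = [0.0] * n              # internal accumulators of AU1..AUn
    for t in range(n):
        for i in range(n):       # common instruction on all units
            acc[i] += W[i][(i + t) % n] * ring[i]
        ring = ring[1:] + ring[:1]   # anticlockwise shift of the ring
    return acc
```

Every unit performs useful work in every cycle, which is the "peak performance" operating point of this architecture.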
The above processing is executed for each neuron of the neural network. The membrane potential u obtained for each neuron is nonlinearly converted to provide the output state value of each neuron in one layer. The above operation is repeatedly executed for each layer of the neural network, to decide the output state value of every neuron. After the output state values of the output layer are decided, it is necessary to back-propagate the errors to correct the weight data of the respective neurons in learning. The errors employed for correcting the weight data are propagated as follows:
First, the effective errors δ with respect to the neuron units of the upper layer are calculated and supplied to the arithmetic parts AU1 to AU3 respectively. In the first cycle, the weight data W11, W22 and W33 are read from the local memories LM1 to LM3 as shown in FIG. 58. The arithmetic parts AU1 to AU3 calculate the products W11·δ1, W22·δ2 and W33·δ3 respectively. The products are transmitted to the corresponding registers R1 to R3. Thus, a single term of each of the errors e1 to e3 is obtained.
In the next cycle, the error components stored in the registers R1 to R4 are shifted anticlockwise as shown in FIG. 59. In synchronization with this shifting, the next weight data W12, W23 and W34 are read from the local memories LM1 to LM3. The arithmetic parts AU1 to AU3 calculate the products W12·δ1, W23·δ2 and W34·δ3, which in turn are added to the error components stored in the corresponding registers R1 to R3, to be stored in the corresponding registers R1 to R3 again. Thus, the next components of the errors are obtained. This operation is repeated in the third cycle. In the fourth cycle, the arithmetic parts AU1 to AU3 calculate the products W14·δ1, W21·δ2 and W32·δ3 from the weight data W14, W21 and W32 read from the local memories LM1 to LM3, as shown in FIG. 60. The products are added to the error components stored in the corresponding registers R1 to R3, to be stored in the corresponding registers R1 to R3 again. After completion of the four cycles, the registers R1 to R4 store ΣWj1·δj, ΣWj2·δj, ΣWj3·δj and ΣWj4·δj, as shown in FIG. 61. The respective total sums are obtained with respect to the subscript j. Thus, the errors e1, e2, e3 and e4 with respect to the neurons of the lower layer are obtained. Effective errors are calculated from these errors, to correct the synapse load values, i.e., the weight data of the neurons, in accordance with the above equations expressing the weight correction amounts.
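The error back propagation of FIGS. 58 to 61 may be simulated in the same simplified square form (n units, n circulating registers). Here the effective errors δi remain fixed in the units while the partial error sums circulate in the ring; after n cycles, register j holds the lower-layer error ej = Σi W[i][j]·δi:

```python
def ring_simd_backward(W, delta):
    """SIMD ring-register error back propagation: unit i adds
    W[i][(i+t) % n] * delta[i] to whichever ring register is currently at
    its position; after a full rotation, ring[j] holds the error ej."""
    n = len(delta)
    ring = [0.0] * n             # error accumulators circulating in the ring
    for t in range(n):
        for i in range(n):       # common instruction on all units
            ring[i] += W[i][(i + t) % n] * delta[i]
        ring = ring[1:] + ring[:1]   # anticlockwise shift
    return ring                  # n shifts = one full rotation
```

The common memory addressing is identical to the forward pass; only the roles of the fixed and circulating operands are exchanged.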
In such a conventional SIMD system arithmetic processing unit, an instruction is supplied to the control bus CB so that the processing units P#1, P#2 and P#3 execute the same arithmetic processing, thereby calculating the membrane potentials of the neurons and the load correction amounts in parallel with each other. All processing units P#1 to P#3 operate in the respective arithmetic cycles, whereby the arithmetic operations can be efficiently carried out.
In the processing systems and/or processing mechanisms which have been proposed in general, various contrivances are made in consideration of an improvement in utilization efficiency of the available resources (arithmetic processing units). However, no consideration is given to the contents of the arithmetic operations to be executed, and it is premised that all arithmetic operations are executed. In the aforementioned parallel arithmetic processing unit, for example, local memory addresses are supplied to the processing units P# in common to read the weight data, and arithmetic operations are executed on all output state values S and all effective errors δ. In the conventional parallelization method, therefore, it is impossible in principle to attain high-speed processing exceeding the so-called "peak performance", i.e., the processability when all resources operate in parallel.
Such non-dependence on the contents of individual arithmetic operations guarantees versatility of the contents of processing, and is regarded as necessary for ensuring ease of design through standardization of the processing mechanism.
When the contents of the arithmetic operations are predictable to some extent, however, it may be possible to tailor the individual processing contents to effectively improve the processability. The term "content" of each processing indicates "the content of single integrated processing", such as "calculation of the membrane potential of every neuron", for example, while the term "content" of each arithmetic processing indicates "the content of each individual arithmetic operation executed in certain processing", such as the individual input data and arithmetic result data of the product of one weight data value (synapse load value) and one output state value in membrane potential calculation, for example.
Namely, when it is determinable that a predicted content of arithmetic processing has only a negligible influence on the processing, it is possible to omit this arithmetic processing and any subsequent processing utilizing its result, thereby reducing the processing time. However, a structure which changes a subsequent procedure in consideration of the content of individual arithmetic processing has not yet been implemented.
In the prior art shown in FIG. 54, for example, the output state values S and the effective errors δ are stored in the ring register and successively shifted to execute the arithmetic operations. When certain output state values Si or certain effective errors δi have values so small that they exert no influence on the membrane potential calculation and the error calculation, it is possible to reduce the number of arithmetic processing operations by omitting the arithmetic operations related to these numeric data, thereby reducing the processing time. In the structure of this prior art, however, the addresses are supplied to the local memories in common to successively read the weight data. When the data are stored in the ring register with small numeric data values omitted, therefore, the weight data cannot be read from the local memories in correspondence to the numeric data stored in the ring register, and hence no correct arithmetic operations can be performed. In this prior art, further, no consideration is given to a structure which omits fine data values when output state values or effective errors reach small values in the course of the arithmetic operations.
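The omission idea described above may be sketched conceptually as follows. This is not the FIG. 54 hardware (which cannot perform it, for the addressing reason just stated) but an illustrative software model: if each state value carries its own index j, terms below a threshold can be skipped while the matching weight W[i][j] is still addressed correctly, so only the active terms are computed:

```python
def sparse_membrane_potentials(W, S, threshold):
    """Conceptual sketch of omitting negligible terms: pair each output
    state value with its index so that skipping small values does not
    break the correspondence between state values and weight data."""
    # keep only the (index, value) pairs whose magnitude matters
    active = [(j, s) for j, s in enumerate(S) if abs(s) > threshold]
    # each unit i accumulates only over the active terms
    return [sum(W[i][j] * s for j, s in active) for i in range(len(W))]
```

The number of multiply-accumulate operations drops from n to the number of active values, which is the processing-time reduction the text contemplates; the common-address scheme of the prior art cannot realize this correspondence.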