1. Field of the Invention
The present invention relates to a neural net signal processor, particularly to a neural net LSI (large-scale integration) processor for realizing neuro-computing models based on biological nervous systems. Possible applications for such a processor include fields such as image recognition, speech recognition and speech synthesis in which decisions based on large quantities of input information need to be computed with speed and flexibility.
2. Description of the Prior Art
Examples of the prior art in this field include the following two references.
(1) Koike et al., "Special Purpose Machines for Neurocomputing", Journal of the Information Processing Society of Japan, Vol. 29, No. 9 (September 1988), pp. 974-983.
(2) S. Y. Kung, "Parallel Architectures for Artificial Neural Nets", Digest of the IEEE International Conference on Systolic Arrays, 1988.
Neural nets consist of simple computing elements interconnected by weighted, directional links into networks for parallel processing of input information. These neural net systems do not use programs such as those used for conventional computers. Instead, neural nets are adapted to the various processing tasks according to a prescribed set of initial values for weights and functions, and in accordance with specific learning rules.
Weights and functions are computed by neural circuits (neurons) such as the one shown in FIG. 25. A neural net is constituted as an interconnected plurality of these neurons. Inputting N inputs {X1, X2, . . . XN} to the circuit produces the output Yi shown by equation (1).

$$Y_i = f\left(\sum_{j=1}^{N} m_{ij} X_j\right) \qquad (1)$$
Here, coefficient mij is the weight applied to input Xj, and f( ) is the function applied to the weighted sum.
f( ) is a nonlinear function that is important inasmuch as it determines the convergence of the neural net. Generally the sigmoid function shown in equation (2), which is analogous to a biological neuron, is employed.

$$f(u) = \frac{1}{1 + e^{-u}} \qquad (2)$$
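The neuron computation of equations (1) and (2) can be illustrated in software. The following is a minimal sketch only, not the circuit of FIG. 25. The function names `sigmoid` and `neuron_output` and the numeric values are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(u):
    """Equation (2): f(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def neuron_output(weights, inputs):
    """Equation (1) for one neuron i: f(sum over j of m_ij * X_j)."""
    u = sum(m * x for m, x in zip(weights, inputs))
    return sigmoid(u)


# Hypothetical example values (not taken from the specification).
x = [1.0, 0.5, -1.0]      # inputs X1..X3
m_i = [0.2, -0.4, 0.1]    # weights m_i1..m_i3 of a single neuron
y_i = neuron_output(m_i, x)
print(y_i)
```

Here the weighted sum is 0.2 - 0.2 - 0.1 = -0.1, so the output lies slightly below the sigmoid midpoint of 0.5.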
In the case of a single layer neural net, the above neuron is connected as shown in FIG. 26. Use of such neural nets makes image and voice pattern recognition possible. In such applications, the patterns to be matched are established by applying neuron weight coefficients.
Here, each of the inputs X1 to XN is supplied to each of M neurons 25 and Y1 to YM are output in parallel. With this arrangement, each input has to drive M neurons (nodes).
The total number of links NM involved becomes very large. To receive an input pattern comprised of 100 by 100 pixels, for example, and sort it into any of 1,000 categories requires a minimum of N=10,000 input terminals and M=1,000 neurons. And, as each of the 10,000 input terminals has to drive 1,000 neurons, a total of 10 million links is required.
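The fully connected single layer arrangement and its link count can be sketched as follows. This is an illustrative software model under simplifying assumptions, not the circuit of FIG. 26. The names `single_layer` and `link_count` and the small weight matrix are hypothetical.

```python
import math


def single_layer(weights, inputs):
    """Compute Y_1..Y_M, where weights[i][j] plays the role of m_ij."""
    outputs = []
    for row in weights:                      # one row per neuron
        u = sum(m * x for m, x in zip(row, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-u)))
    return outputs


def link_count(n_inputs, m_neurons):
    """Every input drives every neuron, so the net needs N * M links."""
    return n_inputs * m_neurons


# Hypothetical 2-neuron, 3-input example; both weighted sums are zero.
weights = [[0.0, 0.0, 0.0],
           [1.0, -1.0, 0.0]]
inputs = [2.0, 2.0, 5.0]
outputs = single_layer(weights, inputs)

# The example from the text: 100 x 100 pixels sorted into 1,000 categories.
total_links = link_count(100 * 100, 1000)
print(outputs, total_links)
```

The second call reproduces the figure of N x M = 10,000 x 1,000 links for the 100 by 100 pixel example.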
The most direct approach to realizing such a neural net is through the use of analog ICs. However, as pointed out in reference (1), with current technology the large numbers of connections involved preclude further increases in density, so one chip can accommodate only several tens or several hundred neurons. This means that, for the above example, at least 10 chips would be required. Propagating analog values between chips with a high level of precision also involves advanced technology.
Weights have to be implemented using resistances or transistor conductance, which makes it difficult to realize neural nets with learning capabilities in which the weight coefficients have to be programmable.
The use of digital circuitry to form neural nets solves the problems of chip-to-chip signal propagation and programmable weights. One neuron can be processed by one computing unit, or a plurality of neurons can be processed by one computing unit using virtual techniques. The former approach involves extensive hardware requirements, as each unit has to be provided with multiply-accumulate arithmetic capabilities and function capabilities, which limits the number of neurons to several tens of thousands.
With the latter approach, hardware is no problem, but having to process a plurality of neurons with a single computing unit results in lower processing rates than those of an analog system. However, as such systems can be adapted to the various neural net configurations merely by reprogramming, nearly all of the digital processing systems currently being proposed adopt this approach.
Owing to the large numbers of node interconnections in a neural net, in both approaches there is a large volume of communication between processors, so the choice of the connection configuration has a major effect on the overall performance of the system.
At present the latter system is implemented using DSPs (digital signal processors) and floating-point arithmetic processors. However, system performance is constrained by the use of standard conventional chips. Even with fifteen M68020 CPUs with their floating-point coprocessors, performance is limited to a maximum of around 500 kilolinks/sec. (One link corresponds to one weight calculation.)
A performance of 8 megalinks/sec can be achieved with the arrangement according to Newman (Reference (1)) in which eight processors are used, but this requires a total of 16 floating-point coprocessors, two for each processor element.
Real-time processing of video signals requires the ability to process 15 frames of information per second, each frame consisting of 512 by 512 pixels for a total of approximately 256,000 pixels. For this, the simple single layer net comprised of M neurons as illustrated in FIG. 26 needs to be capable of a processing speed of about 3.8 x M megalinks/sec (15 frames x 256,000 pixels, or about 3.8 million links per second, for each neuron). Even with one hundred neurons a very high-speed capability of 380 megalinks/sec would be required, which is impossible with conventional chips. Even if it were possible, the scale of the circuitry involved would be impractically large.
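The required processing rate follows from simple arithmetic, which the sketch below checks. It assumes one weight calculation (one link) per pixel per neuron per frame and uses the rounded figure of 256,000 pixels given in the text (512 x 512 is exactly 262,144). The function and constant names are hypothetical.

```python
def required_links_per_second(pixels_per_frame, frames_per_second, m_neurons):
    """Links/sec for a single layer net: one link per pixel per neuron per frame."""
    return pixels_per_frame * frames_per_second * m_neurons


PIXELS_PER_FRAME = 256_000   # rounded figure used in the text
FRAMES_PER_SECOND = 15

per_neuron = required_links_per_second(PIXELS_PER_FRAME, FRAMES_PER_SECOND, 1)
hundred_neurons = required_links_per_second(PIXELS_PER_FRAME, FRAMES_PER_SECOND, 100)

print(per_neuron / 1e6, "megalinks/sec per neuron")
print(hundred_neurons / 1e6, "megalinks/sec for 100 neurons")
```

Under these assumptions the rate is 3.84 megalinks/sec per neuron and 384 megalinks/sec for one hundred neurons, in line with the rounded value of 380 megalinks/sec quoted above.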
One solution is the approach of reference (2), which describes the use of a systolic array. FIG. 27 shows the systolic processor element (SPE-i) used to form the array, and FIG. 28 shows an example of the single layer neural net of FIG. 26 formed using these SPEs.
Each of the SPEs 136 comprises: a shift register 133 that stores the coefficients and, after each read-out, sequentially moves each stored element to the next location in the register; a multiplier 2 for multiplying data input {Xj} (j = 1 to N) by weight coefficient mij; an adder 3 and an accumulator 4 for multiply-accumulate processing of the multiplication products; a nonlinear function generator 135 for applying nonlinear function f( ) to the multiply-accumulate product (accumulation result) 14; and an I/O multiplexor 134 for outputting the finished multiply-accumulate output of N data inputs to output terminal Qi.
The systolic array single layer neural net illustrated in FIG. 28 is configured by connecting M systolic processor elements 136 to an input data {Xj} feedback line 138 which transfers data from the leftmost processor element SPE-1 to the rightmost element SPE-M. Input data 17 is supplied to this feedback line 138 and output data 18 is output from the same feedback line 138. Thus, since the only connections other than the feedback line 138 are those between adjacent processor elements, the number of links is very low and the signal transfer rate is correspondingly high.

An outline of the operation of the above circuit in the case where the number of neurons M of the single layer neural net is equal to the number of data inputs N (in the neural nets of FIG. 26, the number of input terminals) will now be described with reference to FIG. 30.
Between times T0 and T1, data {Xj} is sequentially input from the left at each system clock. The system clock cycle unit is T (seconds). At each clock unit, data {Xj}, which has entered each processor element 136 from the right, is transferred to the next processor element to the left. Thus, N * T (seconds) after T0, data has been passed through all of the processor elements and arithmetic operations commence in each element. When there is a means of placing data {Xj} in parallel into the processor elements, this data transfer procedure (between T0 and T1) is unnecessary. However, parallel input would require that there be as many links as there are input data elements N, thereby negating the feature of the systolic array, i.e., the low number of links.
Simultaneously with the arithmetic processing, at each clock pulse the input data is transferred to the processor element immediately to the left. From the leftmost processor SPE-1, the data is transferred to the rightmost element SPE-M via the feedback line 138.
As data are input to element SPE-1 in the order X1, X2, X3, . . . , coefficients mij are read out of the coefficient shift register 133 in the order m11, m12, m13, . . . . However, as data are input to the second element SPE-2 in the order X2, X3, X4, . . . , X1, its coefficients mij are read out in the corresponding order m22, m23, m24, . . . , m21. The shift register 133 is used to facilitate this coefficient read-out sequence. However, this requires that the sequence of coefficient read-out, which differs from one SPE to the next, be taken into account at the time the coefficients are being set, making the setting procedure troublesome.
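The circulating data flow and the per-element coefficient read-out order can be modelled in software. The sketch below is a simplified illustration for the case M = N only, not the circuit of FIG. 28. It uses 0-based indices, assumes each element already holds its first datum (the loading phase between T0 and T1 is omitted), and leaves out the nonlinear function and the output transfer. The names `rotate_row` and `systolic_ring` are hypothetical.

```python
def rotate_row(row, i):
    """Read-out order for element i: m[i][i], m[i][i+1], ..., wrapping to m[i][i-1]."""
    return row[i:] + row[:i]


def systolic_ring(weights, inputs):
    """Accumulate sum_j m_ij * X_j in each element using a circulating data ring."""
    n = len(inputs)
    assert len(weights) == n and all(len(r) == n for r in weights)

    # Coefficient shift registers, stored pre-rotated for each element.
    coeff_regs = [rotate_row(weights[i], i) for i in range(n)]
    data = list(inputs)          # element i initially holds X_i
    acc = [0.0] * n              # one accumulator per element

    for t in range(n):           # one pass of this loop models one clock
        for i in range(n):       # all elements work in parallel in hardware
            acc[i] += coeff_regs[i][t] * data[i]
        # Shift data one element to the left; the leftmost datum returns
        # to the rightmost element via the feedback line.
        data = data[1:] + data[:1]
    return acc


# Hypothetical 3 x 3 example.
w = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
x = [1, 0, -1]
result = systolic_ring(w, x)
direct = [float(sum(m * v for m, v in zip(row, x))) for row in w]
print(result, direct)
```

The pre-rotation in `rotate_row` is the software counterpart of the setting difficulty noted above: each element needs its row of coefficients stored in a different order.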
With respect to processing speed, for each clock M weight calculations can be performed in parallel. That is, M links can be calculated per clock. With chips fabricated using a 1-micrometer-feature CMOS process, the processors can be operated at a clock speed of about 20 MHz. Some 100 of the processors could be integrated on a chip 10 mm square. A neural net comprised of such chips would be capable of processing at a rate of 2,000 megalinks per second, which is fast enough to process images and other such high-speed signals on a real-time basis.
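The quoted throughput can be checked with a short calculation. This sketch assumes every processor completes exactly one weight calculation per clock. The function name `megalinks_per_second` is hypothetical.

```python
def megalinks_per_second(n_processors, clock_hz):
    """M processors each compute one link per clock, so rate = M * clock."""
    return n_processors * clock_hz / 1e6


# Figures from the text: about 100 processors per chip at about 20 MHz.
rate = megalinks_per_second(100, 20e6)
print(rate, "megalinks/sec")
```

The result of 2,000 megalinks/sec matches the rate stated above.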
FIG. 29 illustrates an example of processor operation when the number of data inputs N exceeds the number of processors M. With this arrangement, processing can start at time T1, but if this coincides with an attempt to feed back the data from left to right via the feedback line 138, then, as there is data XM+1, . . . , XN yet to come, there is a data conflict between T1 and T4, which prevents the circuit from operating normally. (There would be no problem if parallel input of {Xj} to all elements were possible, but this would require an impracticably large number of links.) Moreover (and this also applies to the above arrangement in which M = N), as output data {Yi} uses the same signal line 138, the next data {Xj} cannot be input while {Yi} is being transferred (between T2 and T3).

To summarize, then, with the known systolic array arrangements, high-speed processing of images in real time is possible using parallel processing, and complex neural nets can be configured by linking adjacent processor elements. On the other hand, there is the problem of data conflicts caused by the same signal line being used for input data {Xj} and output data {Yi}. Another problem is that, as the read-out sequence of coefficients mij differs from processor to processor, special techniques are needed when setting coefficient values. In addition, the prior art examples do not mention extending the application to include ordinary neural nets, but refer only to Hopfield and back-propagation models.