1. Field of the Invention
The present invention relates to a method and apparatus for speech recognition using a neural network. More particularly, the present invention relates to a neural network, a learning method of the neural network, and a phoneme recognition apparatus using the neural network.
2. Background Information
Neural networks are a recent technology that mimics the information processing of human cerebral nerves, and they have attracted much attention. A neural network is constituted by a neuron device network having a plurality of neuron devices for transmitting data, and a learning controller for controlling learning of the plurality of neuron devices. The neuron device network is generally made up of an input layer to which data are input, an output layer from which data are output based on the input data, and one or more hidden layers provided between the input and output layers. Each of the neuron devices provided within the respective layers of the neuron device network is connected to another neuron device with a predetermined strength (connection weight), and an output signal varies in accordance with the values of the connection weights between the devices.
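By way of illustration only, the layered structure described above may be sketched as follows. The sigmoid activation, the layer sizes, the weight values, and the names `layer_output` and `forward` are assumptions made for this sketch; the description above does not specify them.

```python
import math

def sigmoid(x):
    # Assumed activation function; the description does not name one.
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights):
    # weights[j][i] is the connection weight from input i to neuron device j.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, layers):
    # Pass the data through each layer in turn: hidden layers, then output layer.
    signal = inputs
    for weights in layers:
        signal = layer_output(signal, weights)
    return signal

# Hypothetical network: 3 inputs -> 2 hidden neuron devices -> 3 outputs.
hidden = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
output = [[1.0, -1.0], [0.2, 0.4], [-0.6, 0.9]]
p = forward([1.0, 0.0, 1.0], [hidden, output])
```

Changing any connection weight in `hidden` or `output` changes the values in `p`, which is the dependence of the output signal on the connection weights noted above.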
In the conventional neural network having the above-described hierarchic structure, a process called "learning" is carried out, in which the learning controller changes the connection weights between the respective neuron devices.
Learning is performed by supplying analog or binary data (patterns) which correspond to the number of inputs and outputs of the input and output layers. Suppose, for example, that data g1 to g6 are supplied. When g1 to g3 are received at the input layer as learning patterns, output signals p1 to p3 are output from the output layer. The signals g4 to g6, which give the correct answers for the output signals, are generally referred to as instructor signals. Learning is then performed by executing a correction process on the connection weights of the respective neuron devices for a plurality of learning patterns, in order to minimize the error of the output signals p1 to p3 relative to the instructor signals g4 to g6, or until these two types of signals coincide with each other.
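The quantity that learning seeks to minimize may be illustrated as follows. The sketch assumes a sum-of-squares error measure, which the description above does not specify, and uses made-up values for the output signals p1 to p3 and the instructor signals g4 to g6.

```python
def squared_error(outputs, instructor):
    # Half the sum of squared differences between output and instructor
    # signals (an assumed error measure).
    return 0.5 * sum((g - p) ** 2 for p, g in zip(outputs, instructor))

p = [0.8, 0.1, 0.3]   # hypothetical output signals p1..p3
g = [1.0, 0.0, 0.0]   # hypothetical instructor signals g4..g6
err = squared_error(p, g)
```

The error is zero only when the output signals coincide with the instructor signals.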
Specifically, error back-propagation (often referred to as BP) has conventionally been used as the process for correcting the connection weights between the respective neuron devices in the neuron device network so that the output signals coincide with the instructor signals.
In order to minimize the error of the output values relative to the instructor values in the output layer, error back-propagation corrects the connection weights of the respective neuron devices between all of the layers constituting the neural network. That is, the error in the output layer is regarded as resulting from the individual errors generated by the neuron devices in the respective hidden layers, and the connection weights are corrected so that not only the error from the output layer, but also the errors of the neuron devices in the respective hidden layers, which are a cause of the error from the output layer, are minimized. Thus, an error is computed for each neuron device in both the output layer and the respective hidden layers.
According to error back-propagation processing, the individual error values of the neuron devices in the output layer are given as initial conditions, and the processing is executed in reverse order: the first target of computation is the error value of each neuron device in the nth hidden layer, the second target is the error value of each neuron device in the (n-1)th hidden layer, and the last target is the error value of each neuron device in the first hidden layer. A correction value is then calculated for each neuron device based on the error value thus obtained and the current connection weight.
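The reverse-order processing described above may be sketched as follows. This is a minimal illustration and not a definitive implementation: the sigmoid activation, the sum-of-squares error, the learning rate `rate`, the example weights, and the names `forward_all` and `backprop_step` are all assumptions.

```python
import math

def sigmoid(x):
    # Assumed activation function.
    return 1.0 / (1.0 + math.exp(-x))

def forward_all(inputs, layers):
    # Returns the signals of every layer, starting with the input itself.
    acts = [inputs]
    for weights in layers:
        acts.append([sigmoid(sum(w * a for w, a in zip(row, acts[-1])))
                     for row in weights])
    return acts

def backprop_step(inputs, instructor, layers, rate=0.5):
    acts = forward_all(inputs, layers)
    # Initial condition: error value of each neuron device in the output layer.
    deltas = [(p - g) * p * (1.0 - p) for p, g in zip(acts[-1], instructor)]
    # Work backwards: output layer, nth hidden layer, ..., first hidden layer.
    for k in range(len(layers) - 1, -1, -1):
        weights, below = layers[k], acts[k]
        below_deltas = []
        if k > 0:
            # Error values of the layer below, from the current (uncorrected) weights.
            below_deltas = [
                sum(weights[j][i] * deltas[j] for j in range(len(weights)))
                * below[i] * (1.0 - below[i])
                for i in range(len(below))
            ]
        # Correct each weight using the error value and the incoming signal.
        for j, row in enumerate(weights):
            for i in range(len(row)):
                row[i] -= rate * deltas[j] * below[i]
        deltas = below_deltas
    return acts[-1]  # output signals computed before this correction

# Hypothetical 3-2-3 network and a single learning pattern.
layers = [[[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
          [[1.0, -1.0], [0.2, 0.4], [-0.6, 0.9]]]
before = backprop_step([1.0, 0.0, 1.0], [1.0, 0.0, 0.0], layers)
```

Note that the error values for a lower layer are computed before the weights above it are corrected, so that every error value in one pass is derived from the same set of connection weights.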
Learning is completed by repeating the above-described learning processing with respect to all of the learning patterns a predetermined number of times, or until the error of the output signal relative to the instructor signal falls below a predetermined value.
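The two completion conditions may be sketched as a simple loop. For brevity, the example substitutes a single linear neuron device with a simple error-correction rule for the full back-propagation processing; that substitution, the function names, and the values of `max_epochs` and `tolerance` are assumptions made for illustration.

```python
def train(patterns, update, error, max_epochs=1000, tolerance=0.01):
    # patterns: list of (input, instructor) pairs.
    # update(x, g): performs one weight correction (hypothetical callback).
    # error(x, g): returns the current error magnitude for one pattern.
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        for x, g in patterns:
            update(x, g)
        # Stop early once every pattern is within the predetermined value.
        if max(error(x, g) for x, g in patterns) < tolerance:
            break
    return epoch

# Stand-in learner: one linear neuron device with two connection weights.
w = [0.0, 0.0]

def predict(x):
    return sum(wi * xi for wi, xi in zip(w, x))

def update(x, g):
    e = g - predict(x)
    for i in range(len(w)):
        w[i] += 0.1 * e * x[i]

def error(x, g):
    return abs(g - predict(x))

patterns = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
epochs = train(patterns, update, error)
```

Here `max_epochs` plays the role of the predetermined number of repetitions and `tolerance` the predetermined error value; whichever condition is met first ends the learning.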
Typically, neural networks have been used in systems for recognizing patterns such as characters or graphics in various data, in processes for analyzing or synthesizing voices, and in predicting the occurrence of time-series patterns of movement.
In the conventional neural network, however, the layers of the neuron device network have not been implemented in such a manner that learning can be performed effectively when carrying out speech recognition, character recognition or form recognition. Thus, in the case where the conventional neural network is used in, e.g., a speech recognition apparatus, an input spectrum is segmented to coincide with the size of the neural network. It is therefore difficult to apply the neural network to the recognition of a continuous stream of speech, because the speaking rate and the length of each phoneme may vary greatly. At present, speech recognition is performed at the phoneme level after each phoneme is subjected to segment processing to match the size of the neural network.
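The difficulty may be illustrated with a sketch in which a phoneme of arbitrary length must be fitted to a fixed-size input layer. The truncation and zero-padding shown here, the function name `fit_to_network`, and the frame values are assumptions chosen for illustration; they are not taken from any particular conventional apparatus.

```python
def fit_to_network(frames, network_frames):
    # frames: spectral frames (each a list of band energies) for one phoneme.
    # network_frames: number of frames the fixed-size input layer accepts (assumed).
    bands = len(frames[0])
    fitted = frames[:network_frames]        # long phoneme: extra frames are cut off
    while len(fitted) < network_frames:     # short phoneme: padded with zero frames
        fitted = fitted + [[0.0] * bands]
    # Flatten into a single vector for the input layer.
    return [value for frame in fitted for value in frame]

short_phoneme = [[0.2, 0.7], [0.3, 0.6]]    # 2 frames, 2 bands (made-up values)
long_phoneme = [[0.1, 0.9]] * 6             # 6 frames, 2 bands (made-up values)
a = fit_to_network(short_phoneme, 4)
b = fit_to_network(long_phoneme, 4)
```

Both results have the same length, but the long phoneme loses its last two frames and half of the short phoneme's input is padding, which shows why variation in speaking rate and phoneme length is problematic for a fixed-size network.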
In addition, the input spectrum must be adapted to coincide with an initial position of the speech recognition neural network. Therefore, it is impossible to perform the recognition of a continuous stream of speech when the start time of a phoneme is unpredictable.
Further, in the conventional neural network, the spectrum of each phoneme is processed individually during speech recognition. In continuous speech, however, the state of the current phoneme is affected by the state of the phoneme which immediately precedes it. Because the conventional neural network processes each phoneme individually, the information of the previous phoneme cannot be used in the recognition of the current phoneme, and the conventional neural network is therefore not suitable for continuous speech recognition.