Recently, it has been proposed to use an RNN as an acoustic model for speech recognition. An RNN is a neural network that incorporates time-sequence information.
FIG. 1 schematically shows a principle of a common neural network. A neural network 30 includes: an input layer 40 receiving an input vector 46; a hidden layer 42 connected to input layer 40 to receive outputs from input layer 40; and an output layer 44 connected to receive outputs of hidden layer 42 and outputting an output vector 48. Though FIG. 1 shows an example having only one hidden layer for simplicity of drawing, the number of hidden layers is not limited to one.
In such a neural network, data flows in one direction from input layer 40 to hidden layer 42 and from hidden layer 42 to output layer 44. Therefore, this type of neural network is referred to as a feed-forward neural network (FFNN). Each connection from one node to another is sometimes weighted or biased, and the values of such weights and biases are determined through training. At the time of training, training data is given as input vector 46 to input layer 40, and output vector 48 is obtained from output layer 44. The error between output vector 48 and the correct data is propagated from the side of output layer 44 back to each node of hidden layer 42 and input layer 40, and the values of the weights and biases are optimized so that the error of neural network 30 is minimized.
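The forward data flow and error-driven training described above can be illustrated with a minimal sketch. It assumes a single hidden layer with tanh activation, a linear output layer, a squared-error measure, and plain gradient descent; none of these choices, nor the names `forward`, `train_step`, `W1`, `b1`, `W2` and `b2`, come from the description, and they are used here only for illustration.

```python
import math
import random

def forward(x, W1, b1, W2, b2):
    """Pass input vector x through the hidden layer to the output layer."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W2, b2)]
    return h, y

def train_step(x, target, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent update; returns the squared error before the update."""
    h, y = forward(x, W1, b1, W2, b2)
    # Error at the output layer (assumed squared-error measure).
    dy = [yi - ti for yi, ti in zip(y, target)]
    # Error passed back to each hidden node, through the tanh derivative.
    dh = [(1.0 - h[j] ** 2) * sum(dy[k] * W2[k][j] for k in range(len(dy)))
          for j in range(len(h))]
    # Update output-layer weights and biases.
    for k in range(len(dy)):
        for j in range(len(h)):
            W2[k][j] -= lr * dy[k] * h[j]
        b2[k] -= lr * dy[k]
    # Update hidden-layer weights and biases.
    for j in range(len(h)):
        for i in range(len(x)):
            W1[j][i] -= lr * dh[j] * x[i]
        b1[j] -= lr * dh[j]
    return 0.5 * sum(d * d for d in dy)

# Toy sizes and data, chosen arbitrarily for the illustration.
random.seed(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [0.0] * n_out
x, target = [0.2, -0.1, 0.4], [0.5, -0.5]
errors = [train_step(x, target, W1, b1, W2, b2) for _ in range(200)]
```

In this sketch, `errors` records the error before each update, so comparing its first and last entries shows whether the optimization reduced the error on the toy example.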
Unlike an FFNN, in which nodes are connected in one direction only, an RNN includes node connections in the opposite direction, connections between nodes in the same layer, and self-loops of individual nodes. FIG. 2 schematically shows the node connections of an example hidden layer in an RNN. Referring to FIG. 2, this hidden layer 70 includes, for example, three nodes. Each of these three nodes has connections for receiving data from a lower layer (closer to the input layer), connections for passing data to an upper layer (closer to the output layer), connections for passing data to nodes of a lower layer, connections with nodes in the same hidden layer 70, and a self-loop. Each of these connections is weighted, that is, has a parameter allocated to it as a weight. The number of such parameters can range from millions to tens of millions. For application as an acoustic model for speech recognition, these parameters must be learned automatically from a speech corpus (pairs of speech data and texts).
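As a rough illustration of how the same-layer connections and self-loops carry time-sequence information, the following sketch computes the state of a three-node hidden layer one time step at a time. It is a simplification under stated assumptions: only the connections from the lower layer (`W_in`), the same-layer connections (off-diagonal entries of `W_rec`) and the self-loops (diagonal entries of `W_rec`) are modeled, the connections back to a lower layer are omitted, and tanh is assumed as the activation. All names and weight values are hypothetical.

```python
import math

def rnn_hidden_step(x_t, h_prev, W_in, W_rec, b):
    """New hidden state from the current input and the previous hidden state."""
    h_new = []
    for j in range(len(b)):
        # Contribution arriving from the lower layer.
        from_below = sum(W_in[j][i] * x_t[i] for i in range(len(x_t)))
        # Contribution from the same layer at the previous time step;
        # W_rec[j][j] is the self-loop of node j.
        from_layer = sum(W_rec[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(from_below + from_layer + b[j]))
    return h_new

def run_sequence(xs, W_in, W_rec, b):
    """Feed a sequence of input vectors and collect the hidden state at each step."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_hidden_step(x_t, h, W_in, W_rec, b)
        states.append(h)
    return states

# Three hidden nodes and two inputs; the weight values are arbitrary.
W_in = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_rec = [[0.2, 0.1, 0.0], [-0.4, 0.3, 0.5], [0.0, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

# The two sequences end with the same input but differ at the first step.
seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 0.0], [0.0, 1.0]]
final_a = run_sequence(seq_a, W_in, W_rec, b)[-1]
final_b = run_sequence(seq_b, W_in, W_rec, b)[-1]
```

Although both sequences end with the same input vector, `final_a` and `final_b` differ, because the recurrent connections make the hidden state depend on earlier inputs as well.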
The back-propagation through time method (hereinafter referred to as “BPTT”) and its modification, the truncated back-propagation through time method (hereinafter referred to as “Truncated BPTT”), are known as methods of RNN training.
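The difference between the two training methods can be sketched on a toy example. The code below is a generic illustration and not any particular published or claimed implementation: it assumes a scalar RNN with tanh activation, a squared-error measure, and one common form of Truncated BPTT in which the sequence is split into consecutive chunks of `k` steps, the hidden state is carried forward across chunks, and the error is not propagated back past a chunk boundary. The names `bptt_grads`, `truncated_bptt_epoch` and the parameter dictionary `p` are hypothetical.

```python
import math

def forward(xs, p, h0):
    """Run the scalar RNN over xs; return hidden states (h0 first) and outputs."""
    hs, ys = [h0], []
    for x in xs:
        hs.append(math.tanh(p["wx"] * x + p["wh"] * hs[-1] + p["b"]))
        ys.append(p["wy"] * hs[-1])
    return hs, ys

def loss(xs, ts, p, h0=0.0):
    """Squared error summed over all time steps."""
    _, ys = forward(xs, p, h0)
    return 0.5 * sum((y - t) ** 2 for y, t in zip(ys, ts))

def bptt_grads(xs, ts, p, h0):
    """BPTT: propagate the error back through every time step of xs."""
    hs, ys = forward(xs, p, h0)
    g = {name: 0.0 for name in p}
    dh_next = 0.0  # error arriving from the following time step
    for t in reversed(range(len(xs))):
        dy = ys[t] - ts[t]
        g["wy"] += dy * hs[t + 1]
        dh = dy * p["wy"] + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2)  # through the tanh derivative
        g["wx"] += da * xs[t]
        g["wh"] += da * hs[t]             # hs[t] is the previous hidden state
        g["b"] += da
        dh_next = da * p["wh"]
    return g, hs[-1]

def truncated_bptt_epoch(xs, ts, p, k, lr=0.05):
    """Truncated BPTT: back-propagate within chunks of k steps only."""
    h = 0.0
    for s in range(0, len(xs), k):
        g, h = bptt_grads(xs[s:s + k], ts[s:s + k], p, h)
        for name in p:
            p[name] -= lr * g[name]

# Toy data: targets are produced by a "teacher" RNN with fixed parameters.
xs = [0.5, -0.2, 0.1, 0.7, -0.4, 0.3, 0.0, -0.6]
teacher = {"wx": 0.8, "wh": -0.5, "wy": 1.2, "b": 0.1}
ts = forward(xs, teacher, 0.0)[1]

p = {"wx": 0.3, "wh": 0.2, "wy": 0.5, "b": 0.0}
loss_before = loss(xs, ts, p)
for _ in range(300):
    truncated_bptt_epoch(xs, ts, p, k=4)
loss_after = loss(xs, ts, p)
```

Calling `bptt_grads` on the whole sequence corresponds to full BPTT, while `truncated_bptt_epoch` with `k` smaller than the sequence length limits how far back the error travels, trading gradient accuracy for lower cost per update.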