1. Field
The following description relates to speech recognition technology, and to a method and an apparatus for performing incremental speech recognition using a deep neural network.
2. Description of Related Art
A speech recognition engine generally includes a decoder, an acoustic model, and a language model. The decoder uses the acoustic model and the language model to decode an input audio signal. For instance, in response to receiving an input audio signal, the speech recognition engine may use the acoustic model to calculate pronunciation probabilities for each frame of the input audio signal, and the language model may provide information on how frequently specific words or sentences are used. Based on the information provided by the acoustic model and the language model, the decoder calculates and outputs similarities of the input audio signal to words or sentences in order to convert the input audio signal into a word or a sequence of words. A Gaussian mixture model is often used as an acoustic model; however, a deep neural network (DNN)-based acoustic model has recently been introduced and has shown potential for significantly improved speech recognition performance. A bidirectional recurrent deep neural network (BRDNN), for instance, is suitable for modeling data that changes with time, such as speech.
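The interaction between the decoder, the acoustic model, and the language model described above can be illustrated with a minimal sketch. Everything in it is assumed for illustration only: the per-frame probabilities, the three-word lexicon, the word frequencies, and the one-phoneme-per-frame alignment are hypothetical, and a practical decoder would search over many alignments rather than use a fixed one.

```python
import math

# Hypothetical acoustic model output: pronunciation probabilities per frame.
frame_probs = [
    {"k": 0.7, "g": 0.3},
    {"ae": 0.6, "ow": 0.4},
    {"t": 0.8, "d": 0.2},
]

# Hypothetical lexicon, simplified to exactly one phoneme per frame.
lexicon = {
    "cat": ["k", "ae", "t"],
    "coat": ["k", "ow", "t"],
    "goad": ["g", "ow", "d"],
}

# Hypothetical language model: relative frequency of use of each word.
language_model = {"cat": 0.5, "coat": 0.3, "goad": 0.2}


def score(word, lm_weight=1.0):
    """Combine acoustic and language model evidence in the log domain."""
    acoustic = sum(
        math.log(frame_probs[t][phoneme])
        for t, phoneme in enumerate(lexicon[word])
    )
    return acoustic + lm_weight * math.log(language_model[word])


def decode():
    """Return the candidate word with the highest combined score."""
    return max(lexicon, key=score)


best = decode()
```

In this sketch the acoustic score is the sum of the log probabilities of a word's phonemes over the frames, and the language model score is added with a tunable weight (`lm_weight`, also an assumption), so that a frequently used word can win over an acoustically similar but rarer one.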
However, a BRDNN calculates the pronunciation probabilities of each frame of an audio signal by considering bidirectional information, i.e., information on both previous and subsequent frames. Thus, when a BRDNN is used in speech recognition, the entire speech must be provided as the input audio signal. Accordingly, a BRDNN is not suitable for incremental decoding, in which a speech recognition result is output incrementally while the user is still speaking.
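The dependence on subsequent frames can be made concrete with a small sketch. It is not a BRDNN; it is a toy scalar recurrence with hypothetical weights (`w_in`, `w_rec`) and made-up frame values, intended only to show that a bidirectional output at the first frame changes when the last frame changes, whereas a forward-only output does not.

```python
import math


def forward_states(frames, w_in=0.5, w_rec=0.8):
    """Left-to-right recurrence: the state at t depends on frames[0..t]."""
    states, h = [], 0.0
    for x in frames:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def backward_states(frames, w_in=0.5, w_rec=0.8):
    """Right-to-left recurrence: the state at t depends on frames[t..end]."""
    return forward_states(frames[::-1], w_in, w_rec)[::-1]


def bidirectional_outputs(frames):
    """Per-frame output that combines the forward and backward states."""
    fwd = forward_states(frames)
    bwd = backward_states(frames)
    return [f + b for f, b in zip(fwd, bwd)]


utterance = [0.2, -0.1, 0.7, 0.4]      # hypothetical frame features
modified = utterance[:-1] + [-0.9]     # only the final frame differs

# Forward-only: the first frame's state ignores what comes later.
fwd_unchanged = forward_states(utterance)[0] == forward_states(modified)[0]

# Bidirectional: the first frame's output shifts when the last frame changes.
bi_unchanged = (
    bidirectional_outputs(utterance)[0] == bidirectional_outputs(modified)[0]
)
```

Here `fwd_unchanged` is true and `bi_unchanged` is false, which is the property that prevents a bidirectional model from producing a final output for the first frame before the last frame has arrived.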