1. Field
The following description generally relates to speech recognition technology with acoustic modeling, and more particularly to a speech recognition apparatus and method with acoustic modeling.
2. Description of Related Art
A speech recognition engine is generally a hardware device that implements an acoustic model, a language model, and a decoder. The acoustic model may calculate pronunciation probabilities for each frame of an input audio signal, and the language model may provide information on the frequency of use of, or connectivity between, specific words, phrases, or sentences. The decoder may calculate and output similarities of the input audio signal to specific words or sentences based on consideration of the respective information provided by the acoustic model and the language model. Here, because such automated speech recognition is implemented through computer or processor technologies, corresponding problems specifically arise in such computer or processor technologies. The technology behind such automated speech recognition is challenging both because of the varying degrees of freedom exercised by speakers in their utterances, phrasings, dialects, languages, or idiolects, and because of technical failings of the underlying hardware and hardware capabilities, such as the technological problem of recognizing speech with sufficient correctness and speed without potentially failing to recognize the corresponding speech altogether.
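As a minimal illustrative sketch of how a decoder might weigh the two sources of information described above, the following Python example adds per-frame acoustic log-probabilities to a language model log-probability for each candidate word. The phone labels, probability values, word list, frame-aligned pronunciations, and the `lm_weight` parameter are all hypothetical assumptions made for illustration; they are not taken from this description, and a practical decoder would search over alignments and word sequences rather than score a fixed list.

```python
import math

# Hypothetical acoustic model output: one dict per frame, mapping a
# phone label to its pronunciation probability for that frame.
frame_probs = [
    {"k": 0.7, "b": 0.2, "ae": 0.05, "t": 0.05},
    {"k": 0.1, "b": 0.1, "ae": 0.7, "t": 0.1},
    {"k": 0.1, "b": 0.05, "ae": 0.05, "t": 0.8},
]

# Hypothetical language model: relative frequency of use of each word.
word_priors = {"cat": 0.6, "bat": 0.4}

# Hypothetical pronunciations, assumed to be aligned one phone per frame.
pronunciations = {"cat": ["k", "ae", "t"], "bat": ["b", "ae", "t"]}


def score_word(word, lm_weight=1.0):
    """Combine acoustic and language model evidence in the log domain."""
    acoustic = sum(
        math.log(frame_probs[i][phone])
        for i, phone in enumerate(pronunciations[word])
    )
    language = math.log(word_priors[word])
    return acoustic + lm_weight * language


def decode(lm_weight=1.0):
    """Return the best-scoring candidate word and all candidate scores."""
    scores = {word: score_word(word, lm_weight) for word in pronunciations}
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `decode()` returns the candidate whose combined score is highest, together with the score of every candidate. Working in the log domain turns products of probabilities into sums, which avoids numerical underflow over many frames.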
A Gaussian Mixture Model (GMM) approach has generally been used to implement such probability determinations in acoustic models, but recently a Deep Neural Network (DNN) approach has been implemented to calculate the probability determinations in acoustic models, which has significantly improved speech recognition performance over that of acoustic modeling implementing the GMM approach.
Still further, a Bidirectional Recurrent Deep Neural Network (BRDNN) approach has also been used for modeling data, such as speech, which changes with time. For example, the BRDNN approach may improve accuracy in calculating pronunciation probabilities of each frame of an audio signal by considering bidirectional information, i.e., information on previous and subsequent frames.
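The bidirectional idea can be illustrated with a toy sketch, assuming a single tanh recurrent unit per direction and a two-class softmax output. The function names, weights, and frame values below are hypothetical and chosen only for illustration; this is not the network of any particular BRDNN implementation. One pass reads the frames from first to last, a second pass reads them from last to first, and each frame's probabilities are computed from both states.

```python
import math


def recurrent_pass(frames, w_in, w_rec):
    """Run a one-unit tanh recurrence over frames; return one state per frame."""
    state, states = 0.0, []
    for x in frames:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_frame_probs(frames, w_in=0.8, w_rec=0.5,
                              out_weights=((1.0, -1.0), (-1.0, 1.0))):
    """Per-frame class probabilities using previous and subsequent frames.

    All weights are hypothetical illustration values, not trained ones.
    """
    # Forward states summarize the current and previous frames.
    forward = recurrent_pass(frames, w_in, w_rec)
    # Backward states summarize the current and subsequent frames.
    backward = recurrent_pass(frames[::-1], w_in, w_rec)[::-1]
    probs = []
    for f, b in zip(forward, backward):
        logits = [wf * f + wb * b for wf, wb in out_weights]
        probs.append(softmax(logits))
    return probs
```

For example, `bidirectional_frame_probs([0.2, -0.4, 0.9])` returns one probability pair per frame. Changing only the last frame value also changes the probabilities of the first frame, because the backward state at the first frame depends on every subsequent frame. The same dependency means the backward pass cannot begin until the final frame of the input is available.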
However, because of the extra frame information provided to the DNN, as well as the temporal considerations made by the DNN, the time required for calculating pronunciation probabilities corresponding to respective speech units may increase, especially as the lengths of such speech units increase. Thus, there are technological problems in automated speech recognition systems.