A speech signal varies with non-linguistic factors such as the speaker's age and sex, the microphone, and background noise. Speech recognition is therefore required to be robust to such non-linguistic factors. In recent years, acoustic invariant structure has been proposed as one technique for realizing this type of speech recognition (Non-Patent Literature 1). According to this method, in contrast to traditional speech processing, the absolute features of speech are entirely discarded, and f-divergence is used to model the relative relationships between phonemes. Isolated word recognition (Non-Patent Literatures 2, 3, and 4), foreign language pronunciation evaluation (Non-Patent Literature 5), and the like using acoustic invariant structure have been proposed heretofore, and acoustic invariant structure has been shown to provide robustness and good performance.
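As an illustration of modeling only the relative relationships between phonemes, the following is a minimal sketch and not the method of the cited literature. It assumes that each phoneme is represented by a Gaussian distribution and that the Bhattacharyya distance is used as the measure; the text above specifies only that an f-divergence is used. The Bhattacharyya distance is a monotone function of the squared Hellinger distance, which is an f-divergence. The names `bhattacharyya_distance` and `structure_matrix` are hypothetical.

```python
import numpy as np


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians (assumed phoneme models)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mahalanobis-like term using the averaged covariance.
    maha = diff @ np.linalg.solve(cov, diff)
    # Log-determinants computed with slogdet for numerical stability.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.125 * maha + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))


def structure_matrix(means, covs):
    """Symmetric matrix of pairwise distances between phoneme distributions.

    Only these relative distances are kept; the absolute means and
    covariances are not part of the resulting representation.
    """
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya_distance(means[i], covs[i], means[j], covs[j])
            dist[i, j] = dist[j, i] = d
    return dist
```

Under these assumptions, applying the same invertible affine transformation to every distribution (mean `A @ mu + b`, covariance `A @ cov @ A.T`) leaves the matrix unchanged, which is the sense in which such a structure does not depend on the absolute features.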
However, according to the aforementioned literature, acoustic invariant structure has not been used for continuous speech recognition. This is due to the lack of a decoding algorithm suitable for use with acoustic invariant structure. Although a decoding algorithm aligns the feature vector sequence hypothesis by hypothesis, alignment of phonemes becomes necessary in order to use acoustic invariant structure. Research has also attempted to solve this problem by using bottom-up clustering of short time intervals of the feature vector sequence together with the Hidden Structure Model (HSM); however, this approach was applied only to an artificial task and was not shown to be effective for actual tasks (Non-Patent Literature 6).
Under these circumstances, a method was newly proposed for realizing continuous speech recognition by using acoustic invariant structure within an N-best ranking framework (Non-Patent Literature 7). According to this method, firstly an N-best list is acquired together with speech recognition scores by traditional hidden Markov model (HMM) based speech recognition processing. Thereafter, acoustic invariant structure is extracted from the phoneme alignment of each N-best hypothesis, and the appropriateness of the hypothesis from the standpoint of this invariant structure is acquired as a structure score. Finally, the N-best hypotheses are ranked according to the sums of their speech recognition scores and structure scores.
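The final ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: both scores are taken to be already computed and on a scale where a larger value indicates a better hypothesis (for example, log-likelihoods), and the `Hypothesis` class and `rerank_nbest` function are hypothetical names that do not appear in the cited literature. How the structure score itself is computed is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str               # recognized word sequence
    asr_score: float        # speech recognition score from the HMM-based pass
    structure_score: float  # score from the invariant structure (assumed given)


def rerank_nbest(nbest: List[Hypothesis]) -> List[Hypothesis]:
    """Rank hypotheses by the sum of the two scores, best first.

    Assumes larger scores are better; sorted() is stable, so hypotheses
    with equal sums keep their original N-best order.
    """
    return sorted(
        nbest,
        key=lambda h: h.asr_score + h.structure_score,
        reverse=True,
    )
```

With this plain sum, a hypothesis that is not first by speech recognition score alone can move to the top when its structure score is sufficiently higher than those of its competitors.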