1. Technical Field
The present invention relates to recognizing the natural speech of a person and converting the speech into text, and more particularly, to automatically removing meaningless words called disfluencies from the derived text.
2. Description of the Related Art
A statistical method for performing speech recognition using an acoustic model and a language model is disclosed in, for example, “A Maximum Likelihood Approach to Continuous Speech Recognition”, L. R. Bahl et al., IEEE Trans., Vol. PAMI-5, No. 2, March 1983; or in “Word-based Approach To Large-vocabulary Continuous Speech Recognition For Japanese”, Nishimura et al., Information Processing Thesis, Vol. 40, No. 4, April 1999. Further, the N-gram estimate, a common language model technique, is disclosed on page 15 of IBM ViaVoice98 Application Edition (Info-creates Publication Department, issued on Sep. 30, 1998).
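By way of illustration only, an N-gram estimate of the kind referred to above can be sketched for the simplest case of N = 2 (a bigram). The sketch below is not taken from any of the cited references; the function names `train_bigram` and `bigram_prob`, the sentence boundary markers, and the two-sentence corpus are assumptions introduced for this example. It uses an unsmoothed maximum likelihood estimate, whereas a practical recognizer would normally apply smoothing for unseen word sequences.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and word pairs over tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        # Hypothetical boundary markers so the first and last words have context.
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that acts as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent (previous, current) pairs
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Unsmoothed estimate P(word | prev) = count(prev, word) / count(prev)."""
    context_count = unigrams[prev]
    if context_count == 0:
        return 0.0  # unseen context; a real system would back off or smooth
    return bigrams[(prev, word)] / context_count

# Toy corpus (an assumption for illustration); "eh" is a disfluency.
corpus = [
    ["eh", "the", "meeting", "starts", "now"],
    ["the", "meeting", "ends", "now"],
]
unigrams, bigrams = train_bigram(corpus)
p_starts = bigram_prob(unigrams, bigrams, "meeting", "starts")
```

In this toy corpus "meeting" occurs twice as a context and is followed by "starts" once, so the estimated probability of "starts" given "meeting" is one half. The disfluency "eh" is counted like any other word, which is the starting point for the handling of disfluencies within an N-gram model.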
Disfluencies, such as “eh”, frequently appear during the recognition of natural speech, and how they are handled is important for applications. “Statistical Language Modeling for Speech Disfluencies”, by A. Stolcke and E. Shriberg, Proc. of ICASSP96, discloses a method for handling such disfluencies in the N-gram model and automatically removing them from the recognition result. With this method, however, it is difficult to avoid cases in which a word that the speaker used as a valid word is not recognized as such, is instead determined to be a disfluency, and is removed. Further, the kinds and frequencies of disfluencies vary depending on the speaker and the speaking environment (e.g., speaking with or without a prepared draft, and in a formal or an informal setting), making it difficult to use an average model for the prediction of disfluencies.