1. Field of the Invention
The present invention is related to speech recognition and more particularly to speech recognition on multiple computer systems connected together over a network.
2. Background Description
Automatic speech recognition (ASR) systems for voice dictation and the like use any of several well known approaches for word recognition.
For example, L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. Picheny, "Robust Methods for Using Context-dependent Features and Models in Continuous Speech Recognizer," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. I, pp. 533-36, Adelaide, 1994, describe an acoustic ranking method useful for speech recognition. Acoustic decision trees, also useful for speech recognition, are described by L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. Picheny, in "Decision Trees for Phonological Rules in Continuous Speech," Proceedings of the 1991 International Conference on Acoustic, Speech, and Signal Processing, Toronto, Canada, May 1991. Frederick Jelinek, in Statistical Methods for Speech Recognition, The MIT Press, Cambridge, January 1999, describes identifying parameters that control the decoding process.
While generally recognizing spoken words with a relatively high degree of accuracy, especially in a single user system, these prior speech recognition systems still frequently make recognition errors. Generally, for single user systems, these errors can be reduced with additional user specific training. However, the additional training time and the increased volume of data that must be handled during training are undesirable. So, for expediency, recognition accuracy is traded off to minimize training time and data.
Speaker independent automatic speech recognition systems, such as what are normally referred to as interactive voice response systems, have a different set of problems, because they are intended to recognize speech from a wide variety of individual speakers. Typically, the approach with speaker independent ASR systems is to improve recognition accuracy by assigning individual speakers or recognition system users to user clusters. User clusters are groups of users with similar speech characteristics or patterns. As each speaker or user uses the system, the speaker is identified as belonging to one cluster. For each user cluster, acoustic prototypes are developed and are used for speech decoding.
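The assignment of a speaker to a user cluster described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each speaker is summarized by a fixed-length feature vector and that each cluster is represented by a centroid; the function name, the cluster names and the Euclidean distance measure are hypothetical and are not taken from any of the cited systems.

```python
import math

def assign_cluster(speaker_features, cluster_centroids):
    """Return the name of the cluster whose centroid is closest
    (Euclidean distance) to the speaker's feature vector."""
    return min(
        cluster_centroids,
        key=lambda name: math.dist(speaker_features, cluster_centroids[name]),
    )

# Hypothetical two-dimensional centroids for two user clusters.
centroids = {
    "cluster_a": [1.0, 2.0],
    "cluster_b": [5.0, 6.0],
}

# A new speaker is placed in exactly one cluster.
chosen = assign_cluster([1.2, 2.1], centroids)
```

Once a speaker is assigned, the acoustic prototypes developed for that cluster would be the ones used for decoding that speaker's input.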
For example, speakers may be clustered according to language or accent. Various techniques for language identification are taught by D. Matrouf, M. Adda-Decker, L. Lamel and J. Gauvain, in "Language Identification Incorporating Lexical Information," in Proceedings of the 1998 International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, December 1998. A well known method of determining an accent from acoustic features is taught by M. Lincoln, S. Cox and S. Ringland, in "A Comparison of Two Unsupervised Approaches to Accent Identification," Proceedings of the 1998 International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, December 1998. However, with the approach of Lincoln et al., if there is very large speaker variability, as is normally the case, that variability may not be accounted for in training. Accordingly, speaker clusters that are accumulated in a normal ASR training period generally do not provide for all potential ASR users.
Consequently, to provide some improvement over speaker dependent methods, ASR decoding approaches are used that are based on various adaptation schemes for acoustic models. These recognition adaptation schemes use additional data that the ASR system gathers after training, every time a user dictates to the system. The speaker or user usually corrects any errors in the recognition result interactively, and those corrected scripts are used for what is normally referred to as supervised adaptation.
See, for example, Jerome R. Bellegarda, "Context-dependent Vector Clustering for Speech Recognition," in Automatic Speech and Speaker Recognition, edited by Chin-Hui Lee, Frank K. Song, 1996, Kluwer Academic Publishers, Boston, pp. 133-153, which teaches an adaptation of acoustic prototypes in response to subsequent speech data collected from other users. Also, M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Computer Speech and Language (1996) 10, 249-264, teach incremental adaptation of HMM parameters derived from speech data from additional subsequent users.
The drawback with the above approaches of Bellegarda or Gales et al. is that during typical dictation sessions the user uses a relatively small number of phrases. So, it may take several user sessions to gather sufficient acoustic data to show any significant recognition accuracy improvement using such a supervised adaptation procedure. As might be expected, in the initial sessions the decoding accuracy may be very low, requiring significant interactive error correction.
Further, similar or even worse problems arise in unsupervised ASR applications, where users do not correct ASR output. For example, unsupervised ASR is used in voice response systems wherein each user calls in to a service that uses ASR to process user voice input. C. H. Lee and J. L. Gauvain, "Bayesian adaptive Learning and MAP Estimation of HMM," in Automatic Speech and Speaker Recognition, edited by Chin-Hui Lee, Frank K. Song, 1996, Kluwer Academic Publishers, Boston, pp. 109-132, describe supervised and unsupervised acoustic model adaptation methods. While it is still possible to adapt speech recognition for any new users using unsupervised adaptation, sufficient data must be collected prior to unsupervised use to ensure adequate decoding accuracy for every new user.
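The dependence of adaptation on the amount of collected data can be illustrated with a simplified, textbook-style MAP update of a single Gaussian mean. This is only a sketch under stated assumptions: the function name, the scalar feature and the prior weight tau are hypothetical, and the code is not presented as the method of any reference cited above.

```python
def map_adapt_mean(prior_mean, observations, tau):
    """Simplified MAP estimate of a Gaussian mean.

    prior_mean   -- mean from the original (pre-adaptation) model
    observations -- new scalar feature values collected from the user
    tau          -- assumed prior weight; larger values trust the prior more
    """
    n = len(observations)
    return (tau * prior_mean + sum(observations)) / (tau + n)

# With only a few observations the estimate stays close to the prior mean ...
few = map_adapt_mean(0.0, [2.0] * 2, tau=8.0)

# ... and it moves toward the user's data only as more data accumulates.
many = map_adapt_mean(0.0, [2.0] * 72, tau=8.0)
```

In this sketch the adapted mean approaches the user's own data only after many observations, which is consistent with the need, noted above, to collect sufficient data before adaptation yields adequate accuracy.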
Thus, there is a need for increasing the amount of usable acoustic data that are available for speech recognition of individual speakers in supervised and unsupervised speech recognition sessions.
It is a purpose of the invention to improve speech recognition by computers;
It is yet another purpose of the invention to expand the data available for speech recognition.
The present invention is a speech recognition system, method and program product for recognizing speech input from computer users connected together over a network of computers, each computer including at least one user based acoustic model trained for a particular user. Computer users on the network are clustered into classes of similar users according to their similarities, including characteristics such as nationality, profession, sex and age. Characteristics of users are collected from databases over the network and from users using the speech recognition system, and are distributed over the network during or after user activities. As recognition progresses, similar language models among similar users are identified on the network. The acoustic models include an acoustic model domain, with similar acoustic models being clustered according to an identified domain. Existing acoustic models are modified in response to user production activities. Update information, including information about user activities and user acoustic model data, is transmitted over the network. Acoustic models improve for users that are connected over the network as similar users use their respective voice recognition systems.
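The grouping of similar users and the distribution of update information described above might be sketched as follows. Everything in this example is an illustrative assumption rather than the disclosed implementation: the User class, the trait-matching similarity measure, the threshold, the interpolation weight, and the use of a plain list of numbers as a stand-in for acoustic model parameters are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    traits: dict   # e.g. nationality, profession, sex, age group
    model: list    # stand-in for acoustic model parameters

def similarity(a, b):
    """Fraction of trait keys on which two users agree."""
    keys = set(a.traits) | set(b.traits)
    if not keys:
        return 0.0
    same = sum(
        1 for k in keys
        if k in a.traits and k in b.traits and a.traits[k] == b.traits[k]
    )
    return same / len(keys)

def propagate_update(source, users, weight=0.25, threshold=0.5):
    """Nudge the model of every sufficiently similar user toward the
    source user's model; return the names of the users updated."""
    updated = []
    for u in users:
        if u is source or similarity(source, u) < threshold:
            continue
        u.model = [(1 - weight) * m + weight * s
                   for m, s in zip(u.model, source.model)]
        updated.append(u.name)
    return updated

# Hypothetical users on the network.
alice = User("alice", {"nationality": "US", "profession": "doctor"}, [1.0, 1.0])
bob = User("bob", {"nationality": "US", "profession": "doctor"}, [0.0, 0.0])
carol = User("carol", {"nationality": "FR", "profession": "lawyer"}, [0.0, 0.0])

# Alice's adapted model is shared only with users in her class.
updated_names = propagate_update(alice, [alice, bob, carol])
```

In this sketch only users who share enough characteristics with the source user receive the update, so each user's model can benefit from data produced by similar users without being affected by dissimilar ones.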