In communication, data processing and similar systems, a user interface using audio facilities is often advantageous, especially when it is anticipated that the user would be physically engaged in an activity (e.g., driving a car) while operating such a system. Techniques for recognizing human speech in such systems to perform certain tasks have been developed.
In particular, techniques for recognizing a sequence of spoken digits to perform voice dialing in such equipment as mobile phones, cellular terminals and computer-telephony integrated (CTI) systems are well known. One such technique is a connected digit recognition technique involving use of Hidden Markov Models (HMMs). For details on this technique, one may refer to: L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-285.
In accordance with the connected digit recognition technique, individual digits are each characterized by an HMM, and a Viterbi algorithm is used to identify an optimum sequence of HMMs which best matches, in a maximum likelihood sense, the unknown, spoken connected digit sequence. The Viterbi algorithm forms a plurality of sequences of tentative decisions as to what the spoken digits were. These sequences of tentative decisions define the so-called "survival paths." The theory of the Viterbi algorithm predicts that, when traced back in time, these survival paths merge into the "maximum-likelihood path." See G. D. Forney, "The Viterbi Algorithm," Proceedings of the IEEE, Vol. 61, No. 3, March 1973, pp. 268-278. In this instance, such a maximum-likelihood path corresponds to a particular digit sequence which maximizes a cumulative conditional probability that it matches the unknown, spoken sequence given the acoustic input thereof. This particular digit sequence is referred to as the "maximum-likelihood digit sequence."
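By way of illustration only, the survival-path bookkeeping described above may be sketched as follows. This is a generic state-level Viterbi search over a small discrete HMM, not the connected digit recognizer itself, which would chain one HMM per digit. The function name viterbi, the two-state model and all of its probabilities are assumptions introduced here for the example.

```python
import math

def viterbi(log_init, log_trans, log_emit, observations):
    """Return (best_log_score, best_state_path) for a discrete HMM.

    log_init[s]     : log-probability of starting in state s
    log_trans[r][s] : log-probability of moving from state r to state s
    log_emit[s][o]  : log-probability of state s emitting symbol o
    """
    n_states = len(log_init)
    # score[s] is the cumulative log-probability of the surviving path
    # that ends in state s after the observations seen so far.
    score = [log_init[s] + log_emit[s][observations[0]] for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Tentative decision: keep only the best predecessor of state s.
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[s][obs])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the maximum-likelihood path.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-state, two-symbol model (all values are illustrative).
log = math.log
log_init = [log(0.6), log(0.4)]
log_trans = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
log_emit = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
best_score, best_path = viterbi(log_init, log_trans, log_emit, [0, 0, 1])
```

Working with logarithms turns the product of probabilities along a path into a sum, which avoids numerical underflow on long observation sequences.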
In practice, the sequences of tentative decisions each have a probability score (normally expressed as a logarithm) associated therewith, which is updated to reflect the cumulative conditional probability as the tentative decisions are made along the sequence. Based on the respective probability scores of the sequences, the maximum-likelihood digit sequence is identified in accordance with a dynamic programming approach. For details on this approach, one may refer to: H. Ney, "The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 2, April 1984, pp. 263-271. In addition, a number of the next best sequences, relative to the maximum-likelihood sequence, may further be identified based on their probability scores, and can be of different lengths (i.e., different numbers of digits in the sequences). All of these sequences, including the maximum-likelihood digit sequence, are then subjected to further processing including validity tests (e.g., one based on duration) to eliminate unreasonable candidates. The most likely and reasonable digit sequence is thus identified and presumed to be the spoken sequence.
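The post-processing described above, in which scored candidate sequences are screened by a validity test before the best remaining one is chosen, may be sketched as follows. This is a minimal sketch under stated assumptions: the Candidate record, the per-digit frame counts, the duration limits and the example scores are all hypothetical, and the duration test shown is only one plausible form such a test could take.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    digits: str              # hypothesized digit sequence
    log_score: float         # cumulative log-probability score
    digit_frames: List[int]  # number of acoustic frames aligned to each digit

def passes_duration_test(c: Candidate, min_frames: int = 5, max_frames: int = 80) -> bool:
    """Reject a candidate if any digit is implausibly short or long (assumed limits)."""
    return all(min_frames <= f <= max_frames for f in c.digit_frames)

def pick_sequence(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate that survives the validity test."""
    valid = [c for c in candidates if passes_duration_test(c)]
    if not valid:
        return None
    return max(valid, key=lambda c: c.log_score)

# Hypothetical N-best list; candidates may differ in length.
n_best = [
    Candidate("55512344", -410.2, [22, 20, 21, 25, 19, 23, 18, 2]),  # last digit only 2 frames
    Candidate("5551234", -412.7, [22, 20, 21, 25, 19, 23, 20]),
    Candidate("551234", -431.9, [42, 21, 25, 19, 23, 20]),
]
chosen = pick_sequence(n_best)
```

In this example the top-scoring candidate is discarded because its final digit spans too few frames, so the second-best sequence is presumed to be the spoken one.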
The recognition accuracy of a prior art connected digit recognizer, and thus the accuracy of voice dialing, invariably decreases as the digit sequence length increases. In a noisy environment, the recognition accuracy of a long digit sequence, e.g., 7 to 10 digits long, which is typical of a phone number, is undesirably low. The recognizer's uncertainty as to the actual number of digits in the sequence also contributes to the inaccuracy of voice dialing.
Accordingly, there exists a need for a speech recognition system capable of accurately recognizing connected digits to perform voice dialing effectively.