The development of robust, speaker-independent speech recognition systems that perform well over dialed-up telephone lines has been a topic of interest for over a decade. Initially, speech recognition systems could recognize a small number of vocabulary items spoken in isolation; more recently, systems have been disclosed that can recognize medium-size vocabulary sets spoken fluently, as set out in U.S. Pat. No. 4,783,804 assigned to B-H. Juang et al., issued Nov. 8, 1988. A basic assumption for most speech recognition systems is that the input to be recognized consists solely of words from the recognition vocabulary and background silence. However, recent studies on the recognition of a limited set of isolated command phrases for making "operator assisted calls" have shown that it is extremely difficult, if not impossible, to always get real-world subscribers to such a service to speak only the allowed input words. In a large-scale trial of speaker-independent, isolated-word speech recognition technology, carried out at an AT&T central office in Hayward, Calif. (in the San Francisco Bay area), live telephone customer traffic was used to evaluate the call handling procedures being developed for a new generation of telephone switching equipment. Customers making operator-assisted calls were requested to verbally identify the type of call they wished to make (i.e., collect, calling-card, person-to-person, bill-to-third, and operator). Each caller was requested to speak one of five orally prompted commands in an isolated fashion. While 82% of the users actually spoke one of the command words, only 79% of these inputs were spoken in isolation (i.e., only 65% of all the callers followed the protocol). Monitoring the customers' spoken responses showed that 17% of all responses contained a valid vocabulary item along with extraneous speech input. Examples included the following:
· <silence> collect call please <silence>
· Um? Gee, ok I'd like to place a calling-card call
· Collect from Tom <silence>
· I want a person call
· <silence> Please give me the operator
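The percentages quoted for the Hayward trial are related by simple arithmetic, which the following short sketch reproduces. The variable names are illustrative only; the two input figures (82% and 79%) are the ones stated above, and the derived values are rounded in the text to 65% and 17%.

```python
# Figures quoted in the text, expressed as fractions.
spoke_command = 0.82           # callers who spoke one of the five command words
isolated_given_command = 0.79  # of those inputs, the share spoken in isolation

# Callers who followed the protocol: a command word, spoken in isolation.
followed_protocol = spoke_command * isolated_given_command

# Callers who spoke a command word embedded in extraneous speech.
command_plus_extraneous = spoke_command - followed_protocol

print(f"followed protocol: {followed_protocol:.1%}")
print(f"command word plus extraneous speech: {command_plus_extraneous:.1%}")
```

Under these figures, roughly one caller in six spoke a valid command word that an isolated-word recognizer would not be designed to handle.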
Most conventional isolated word recognition algorithms have not been designed to recognize vocabulary items embedded in carrier sentences. As such, modifications to the algorithms have to be made to allow for the recognition of the defined vocabulary words embedded in extraneous speech, i.e. to spot keywords.
While much research has been performed on the general wordspotting problem, most of it has not been published. The published wordspotting techniques are primarily template-based, dynamic time-warping approaches. For example, in the article "Detecting and Locating Key Words in Continuous Speech Using Linear Predictive Coding", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-25, No. 5, pp. 362-367, October, 1977, Christiansen and Rushforth describe a speaker-trained keyword spotting system which uses an LPC representation of the speech signal without any syntactic or semantic information about the task. Using this approach they achieved good wordspotting accuracy on a vocabulary set of four keywords and ten digits.
Higgins and Wohlford in "Keyword Recognition Using Template Concatenation", Conf. Rec. IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 1233-1236, Tampa, Fla., March, 1985, proposed a dynamic-time-warping-based system for keyword spotting. In their system, knowledge about the vocabulary and syntax of the input speech was used. A set of keyword templates and non-keyword templates was created and compared against several pooled filler templates in order to detect keywords in fluent speech. These filler templates were generated (1) using data from six "function" words, or (2) by clustering non-vocabulary words into segments roughly equal to syllables using hand-marked data. Their results indicated that while explicit knowledge of the vocabulary may not be that important, the use of filler templates may be important. However, they found that the number of such filler templates greatly influenced the performance of the keyword spotter. Additionally, they determined that the durations of the filler templates controlled the accuracy of their system. As the number of templates increased and the duration of the average filler template shortened, the system accuracy improved. Duration constraints are a major problem in any dynamic-time-warping-based template matching recognition system, since each template has a physical duration and the algorithms are forced to adhere to some local time duration constraints.
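The general idea of scoring speech against both keyword and filler templates can be illustrated with a deliberately simplified sketch. It is not the template-concatenation procedure of Higgins and Wohlford: it assumes one-dimensional feature sequences, a pre-cut input segment, and hypothetical template values, and it simply accepts a keyword only when a keyword template matches better than every filler template. The step pattern inside `dtw_distance` is one example of the local time constraints mentioned above.

```python
def dtw_distance(a, b):
    """Length-normalized dynamic-time-warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Local constraint: advance in a, in b, or in both by one frame.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def label_segment(segment, keyword_templates, filler_templates):
    """Return the best keyword name, or None if a filler template matches better."""
    best_name, best_dist = min(
        ((name, dtw_distance(segment, tpl))
         for name, tpl in keyword_templates.items()),
        key=lambda pair: pair[1],
    )
    filler_dist = min(dtw_distance(segment, tpl) for tpl in filler_templates)
    return best_name if best_dist < filler_dist else None


# Hypothetical toy templates (assumed values, for illustration only).
keywords = {"collect": [1, 3, 5, 3, 1], "operator": [5, 5, 1, 1, 5]}
fillers = [[2, 2, 2], [0, 0, 0]]

print(label_segment([1, 1, 3, 5, 5, 3, 1], keywords, fillers))
print(label_segment([2, 2, 2, 2], keywords, fillers))
```

In this toy setting the first segment is a time-stretched copy of the "collect" template and is accepted, while the second is closest to a filler template and is rejected.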
Similarly, in the prior patent of one of us, Chin-Hui Lee with John W. Klovstad and Kalyan Ganesan, U.S. Pat. No. 4,713,777, issued Dec. 15, 1987, a Hidden Markov Model (HMM) was used to model silence. Fixed score thresholds were used to eliminate false alarms.
In the article "Application of Hidden Markov Models to Automatic Speech Endpoint Detection", Computer Speech and Language, Vol. 2, No. 3/4, pp. 321-341, December, 1987, two of us, Wilpon and Rabiner, presented a statistically-based recognition algorithm in which explicit endpoint detection of speech was removed entirely from the recognition system while maintaining high recognition accuracy. To achieve this, the recognition system modeled the incoming signal as a sequence of background signal and vocabulary words. However, this work was limited in that the vocabulary words had to be spoken in isolation, i.e., with no extraneous verbal input.
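The notion of modeling the input as background, then a vocabulary word, then background, so that endpoints fall out of the decoding rather than from a separate detector, can be sketched with a toy three-state left-to-right model. This is an illustrative assumption, not the algorithm of the cited article: the single log-energy feature, the Gaussian parameters, and the self-transition probability `stay` are all hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_segment(energies, bg=(0.0, 1.0), word=(5.0, 1.0), stay=0.9):
    """Label frames as 0 (leading background), 1 (word), 2 (trailing background).

    Assumes at least three frames, so that every state can be visited.
    The path is forced to start in state 0 and end in state 2.
    """
    emit = [bg, word, bg]  # (mean, variance) per state; assumed values
    log_stay, log_move = math.log(stay), math.log(1.0 - stay)
    neg_inf = float("-inf")
    # score[s] = best log-probability of any path ending in state s
    score = [log_gauss(energies[0], *emit[0]), neg_inf, neg_inf]
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for x in energies[1:]:
        new_score, pointers = [], []
        for s in range(3):
            stay_score = score[s] + log_stay
            move_score = score[s - 1] + log_move if s > 0 else neg_inf
            if stay_score >= move_score:
                best, prev = stay_score, s
            else:
                best, prev = move_score, s - 1
            new_score.append(best + log_gauss(x, *emit[s]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Trace back from the final background state.
    state = 2
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log-energy track: quiet, a burst of speech, quiet again.
frames = [0.1, -0.2, 4.8, 5.3, 5.1, 0.2, 0.0]
path = viterbi_segment(frames)
word_frames = [t for t, s in enumerate(path) if s == 1]
print(path, word_frames)
```

The word boundaries are read off the decoded state sequence; no separate endpoint detector is run. As in the work described above, the sketch still assumes a single word surrounded only by background.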