This invention relates to speech recognition and, more particularly, to an improved word boundary detector for an "isolated" word speech recognition system.
There have been previously developed various equipments that attempt to recognize limited vocabularies of spoken words by analysis of acoustic events. Typically, such equipments are envisioned as being useful in "voice command" applications wherein, upon recognizing particular words, the equipment produces electrical signals which control the operation of a companion system. For example, a voice command could be used to cause a conveyor belt to move in a specified manner or to cause a computer to perform specified calculations.
Previous efforts to develop automatic methods of speech recognition have had limited success and have led to the realization of the exceedingly complex nature of speech communication. Normal speech has a high information content with considerable variability from speaker to speaker and some variability even in the same word when spoken by the same individual. Therefore, a "perfect" recognition scheme is unattainable since the nature of the speech signal to be recognized cannot be precisely defined. As a result, the preferred schemes have been empirical approaches which have yielded at least a reasonable level of confidence, from a statistical standpoint, that a particular spoken word corresponded to a selected one of a limited machine vocabulary. The desirability of such schemes is thus not determinable by theoretical examination, but rather by a straightforward measure of recognition accuracy over an extended period of operation.
In the copending application Ser. No. 531,543, filed Dec. 11, 1974, and assigned to the same assignee as the present application, there is disclosed an apparatus which receives spoken input "training" words and a subsequent spoken input "command" word and generates a correlation function that is indicative of the resemblance of the command word to each training word. A feature extraction means processes received input words and generates digital feature output signals on particular ones of a number of feature output lines, the particular ones depending on the characteristic features of the word being spoken. The status of the feature signals which occur during each training word is stored as a normalized time dependent matrix. Subsequently, the status of the feature signals which occur during a command word is also stored as a normalized time dependent matrix. The command word matrix is compared, member by member, with each training word matrix and a correlation figure is generated for each comparison. If a sufficiently high correlation is found between the command word matrix and a particular training word matrix, the command word is deemed to correspond to the particular training word. This type of system has found important application where command words are spoken in "isolation"; i.e., where there are discernible pauses between words, the pauses defining the word boundaries. (As used herein, reference to a word spoken in isolation is intended to include a short phrase meant to be uttered without a substantial pause.) In general terms, apparatus of this type includes circuitry which senses the onset of speech-like sounds and then senses the next substantial absence of speech-like sounds. These occurrences are considered the boundaries of a word and the speech feature events which occur between these boundaries are used to form the matrix referred to above.
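By way of illustration only, the matrix formation and member-by-member comparison described above might be sketched as follows. The sketch assumes binary feature signals, a rule that a feature is present in a time slot if it was active in any frame mapped to that slot, a correlation figure equal to the fraction of matching matrix members, and a threshold of 0.8. None of these specifics is taken from the referenced application; the names `time_normalize`, `correlation`, and `recognize` are hypothetical.

```python
from typing import Dict, List, Optional

Matrix = List[List[int]]  # rows = features, columns = normalized time slots


def time_normalize(frames: List[List[int]], slots: int) -> Matrix:
    """Map a word of any length onto a fixed number of time slots.

    frames[t][f] is 1 if feature f was active in frame t, else 0.
    Assumption: a feature is marked present in a slot if it was
    active in any frame that falls within that slot.
    """
    n_frames = len(frames)
    n_features = len(frames[0])
    matrix = [[0] * slots for _ in range(n_features)]
    for s in range(slots):
        start = s * n_frames // slots
        # Each slot covers at least one frame, so short words are stretched.
        end = max(start + 1, (s + 1) * n_frames // slots)
        for t in range(start, end):
            for f in range(n_features):
                if frames[t][f]:
                    matrix[f][s] = 1
    return matrix


def correlation(a: Matrix, b: Matrix) -> float:
    """Assumed correlation figure: fraction of members that agree."""
    total = sum(len(row) for row in a)
    matches = sum(
        x == y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    )
    return matches / total


def recognize(
    command: Matrix, training: Dict[str, Matrix], threshold: float = 0.8
) -> Optional[str]:
    """Return the best-matching training word, or None if no score
    reaches the (assumed) threshold."""
    best_word, best_score = None, 0.0
    for word, matrix in training.items():
        score = correlation(command, matrix)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else None


# Two training words with two features and two time slots each.
training = {
    "start": time_normalize([[1, 0], [1, 0], [0, 1], [0, 1]], slots=2),
    "stop": time_normalize([[0, 1], [0, 1], [1, 0], [1, 0]], slots=2),
}
# A slower utterance (six frames) of the same feature sequence as "start".
command = time_normalize([[1, 0]] * 3 + [[0, 1]] * 3, slots=2)
print(recognize(command, training))
```

Because both utterances are reduced to the same number of time slots, a word spoken more slowly or more quickly can still be compared member by member with its training matrix, provided the word boundaries were located correctly.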
Since the matrix is correlated, member by member, with a time dependent training word matrix, it will be apparent that the accuracy of the word boundary determination is critical if accurate speech recognition is to be attained. For example, even in cases where the command word matrix has a feature pattern that corresponds closely with a certain training word matrix feature pattern, the correlation process may not reveal the true level of coincidence if the command word matrix includes extraneous "features" in its initial or terminal columns due to incorrect word boundary determination. Also, incorrect time normalization of the command word matrix can be another unfortunate consequence of incorrect boundary determination.
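The effect of a misplaced boundary on time normalization can be seen in a small worked example. The sketch below is a simplified illustration under stated assumptions, not the method of the referenced application: it tracks a single binary feature, normalizes it to four time slots, and scores agreement as the fraction of matching slots. The names `normalize` and `agreement` are hypothetical.

```python
from typing import List


def normalize(track: List[int], slots: int) -> List[int]:
    """Map a per-frame 0/1 feature track onto a fixed number of slots.

    Assumption: a slot is 1 if the feature was active in any frame
    that falls within that slot.
    """
    n = len(track)
    out = []
    for s in range(slots):
        start = s * n // slots
        end = max(start + 1, (s + 1) * n // slots)
        out.append(1 if any(track[start:end]) else 0)
    return out


def agreement(a: List[int], b: List[int]) -> float:
    """Assumed correlation figure: fraction of slots that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# The feature is active only during the first half of the word.
training_word = [1, 1, 1, 1, 0, 0, 0, 0]

# Same utterance, boundaries found correctly.
clean_command = [1, 1, 1, 1, 0, 0, 0, 0]

# Same utterance, but the start boundary was declared four frames early,
# so four frames of non-speech are included in the word.
early_command = [0, 0, 0, 0] + clean_command

reference = normalize(training_word, slots=4)
print(agreement(reference, normalize(clean_command, slots=4)))  # 1.0
print(agreement(reference, normalize(early_command, slots=4)))  # 0.5
```

In this example the four extra frames do more than add an extraneous initial column: they stretch the apparent word length from eight frames to twelve, so every later feature lands in a different slot and the agreement falls from 1.0 to 0.5 even though the spoken word is unchanged.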
The type of system described in the above-referenced application has been employed with success in various commercial applications, but problems with word boundary determination have been a limiting factor on recognition accuracy. It is found that under continuous and long working conditions operators have difficulty uttering command words in true isolation, so the pause between adjacent words shrinks and renders word boundary determination difficult from the outset. Adding to the problem is the presence of interfering acoustical sounds and background noise in the user environment. If the word recognition equipment employs a high quality wide-ranging microphone as its input, the microphone will naturally pick up extraneous sounds and other background noise from within the immediate vicinity of the user. One solution to this problem might be to reduce interfering sounds by placing the operator/user in an acoustically shielded environment. However, the restrictions resulting from an acoustic enclosure are generally such that the mobility of the individual user is reduced, thereby restricting his ability to perform other functions. Since practical speech recognition equipments are largely justifiable on the basis of their allowing users to perform multiple functions (e.g., by replacing necessary push-button or writing inputs with voice command inputs), the restriction of the individual's mobility can tend to defeat the purpose of the equipment.
A more viable method of reducing interfering sounds is to eliminate noise at the microphone itself by utilizing a close-talking noise-canceling microphone as the equipment input. Thus, in practical applications close-talking noise-canceling microphones are typically worn on a lightweight headband and reasonably good results are obtained. However, for reasons heretofore unclear, it has been found that the use of a close-talking noise-canceling microphone aggravates the word boundary determination problem.
It is an object of this invention to provide solutions to the prior art problems as set forth.