This invention relates to automatic speech recognition.
The object of automatic speech recognition is to capture an acoustic signal representative of speech and to determine, by pattern matching, the words that were spoken. Speech recognizers typically have a set of stored acoustic and language models, represented as patterns in a computer database, which are the result of training and of stored rules for interpreting the language. These models are then compared to the captured signals. The contents of the computer database, how it is trained, and the techniques used to determine the best match are distinguishing features of different types of speech recognition systems. In small vocabulary applications (fewer than 50 words), a model can be generated for each of the words in the recognition vocabulary. Above this vocabulary size, training the models and running the recognition algorithms require impractically large computations. Therefore, large vocabulary systems (greater than about 1000 words) train models for a smaller number of sub-word speech segments, e.g., phonemes. These phoneme models can then be concatenated to produce a model of one or more words.
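As a minimal sketch of the concatenation step described above, the following Python fragment builds a word model by joining sub-word models in pronunciation order. The names `PHONEME_MODELS`, `LEXICON`, and `build_word_model` are hypothetical, and each "model" is reduced to a list of state names purely for illustration; a real recognizer would store trained acoustic parameters instead. The phoneme labels follow the "cat" example used later in this description.

```python
# Hypothetical sub-word inventory: each phoneme "model" is a list of state names.
PHONEME_MODELS = {
    "k": ["k_1", "k_2", "k_3"],
    "aa": ["aa_1", "aa_2", "aa_3"],
    "t": ["t_1", "t_2", "t_3"],
}

# Hypothetical pronunciation lexicon mapping words to phoneme sequences.
LEXICON = {
    "cat": ["k", "aa", "t"],
}


def build_word_model(word):
    """Concatenate the phoneme models for `word` into one word model."""
    states = []
    for phoneme in LEXICON[word]:
        states.extend(PHONEME_MODELS[phoneme])
    return states


if __name__ == "__main__":
    print(build_word_model("cat"))
```

Under these assumptions, adding a new word requires only a new lexicon entry, not a newly trained whole-word model, which is the economy that motivates sub-word units in large vocabulary systems.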
Various speech recognition schemes are known. In a segmental model approach, it is assumed that there are distinct phonetic units (e.g., phonemes) in spoken language that can be characterized by a set of properties (features) in the speech signal over time. Input speech signals are segmented into discrete sections in which the acoustic properties represent one or more phonetic units, and labels are attached to these sections according to these properties. A valid vocabulary word, consistent with the constraints of the speech recognition task, is then determined from the sequence of assigned phonetic labels.
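The last step of the segmental approach, going from assigned phonetic labels to a vocabulary word, can be sketched as follows. This is an illustrative simplification, not a description of any particular recognizer: it assumes that labels have already been assigned frame by frame, that consecutive identical labels form one segment, and that the vocabulary is a small lookup table. The names `VOCABULARY` and `decode_labels` are hypothetical.

```python
from itertools import groupby

# Hypothetical vocabulary keyed by phonetic label sequence.
VOCABULARY = {
    ("k", "aa", "t"): "cat",
    ("t", "aa", "k"): "tack",
}


def decode_labels(frame_labels):
    """Collapse runs of identical frame labels into segments, then look up the word.

    Returns None when the label sequence matches no vocabulary entry.
    """
    segments = tuple(label for label, _run in groupby(frame_labels))
    return VOCABULARY.get(segments)


if __name__ == "__main__":
    print(decode_labels(["k", "k", "aa", "aa", "aa", "t"]))
```

A practical system would tolerate labeling errors by scoring many candidate words rather than requiring an exact match; the exact lookup here is only meant to show the flow from segments to labels to a word.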
Template-based approaches use the speech patterns directly without explicit feature determination and without segmentation. A template-based recognition system is initially trained using known speech patterns. During recognition, unknown speech signals are compared with each possible pattern learned in the training phase and classified according to how well the unknown patterns match the known patterns.
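The following sketch illustrates template-based classification. The text above does not name a particular matching measure, so dynamic time warping (DTW) is used here as an assumption, chosen because it compares patterns of different durations without segmenting them. Each pattern is reduced to a one-dimensional sequence of numbers for brevity, and the names `dtw_distance` and `classify` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(unknown, templates):
    """Return the label of the stored template closest to `unknown`."""
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))


if __name__ == "__main__":
    # Hypothetical templates learned in a training phase.
    templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1]}
    # A slower rendition of "yes" still aligns with its template.
    print(classify([1, 1, 3, 5, 5, 3, 1], templates))
```

Because every unknown pattern is compared against every stored template, the cost of this scheme grows with the vocabulary, which is consistent with the computational limits noted earlier for whole-word models.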
Recently, hybrid approaches to speech recognition have become popular. Hybrid approaches combine certain features of the above-mentioned segmental model and template-based approaches. In certain systems, an expert system has been used for segmentation so that more than just acoustic information is used in the recognition process. Neural networks have also been used for speech recognition. For example, D. W. Tank and J. J. Hopfield (in "Neural Computation by Concentrating Information in Time," Proc. Nat. Academy Sciences, 84: 1896-1900, April 1987) describe a pattern recognition scheme in which conventional neural network architectures are used to estimate the acoustic features of speech. A pattern classifier detects the acoustic feature vectors, convolves them with filters matched to the acoustic features, and sums the results over time.
One of the major complicating factors in the recognition of continuous speech is that word boundaries are often difficult to distinguish from other spectral information (e.g., interfering background signals). The difficulty of classifying boundaries between words or sub-word units (e.g., phonemes) in continuous speech recognizers has added complexity to training and recognition algorithms. This directly affects the cost and the performance of such systems. As used herein, a boundary frame is defined as a frame that is between two acoustic events (e.g., the frame that is between the phoneme "k" and the phoneme "aa" in the utterance "cat"). An interior frame is defined as a frame that is within an acoustic event (e.g., a frame that is within the phoneme "aa").
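The distinction between boundary frames and interior frames can be made concrete with a short sketch. It assumes, for illustration only, that the times at which one acoustic event gives way to the next are known (as they would be in hand-labeled training data), and that a frame counts as a boundary frame when such a transition falls within its time span. The function name `classify_frames` and the example timings are hypothetical.

```python
def classify_frames(num_frames, frame_ms, transition_times_ms):
    """Label each frame as "boundary" or "interior".

    A frame is labeled "boundary" when a transition between two acoustic
    events falls inside its time span (an assumption of this sketch).
    """
    boundary_indices = {int(t // frame_ms) for t in transition_times_ms}
    return [
        "boundary" if index in boundary_indices else "interior"
        for index in range(num_frames)
    ]


if __name__ == "__main__":
    # Hypothetical utterance: "k" gives way to "aa" at 25 ms, with 10 ms frames.
    print(classify_frames(5, 10, [25]))
```

With these example values the third frame, spanning 20 ms to 30 ms, is the boundary frame, and the frames on either side of it are interior frames.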
Known continuous speech recognition systems based on segmental models hypothesize a boundary at a fixed rate that is selected to guarantee that boundaries are not missed (e.g., at 100 Hz, or every 10 milliseconds). However, this scheme is computationally intensive and requires further classification processing downstream to compensate for artificially generated segments (i.e., processing to delete incorrectly assumed boundaries, which can be measured as a "boundary deletion rate").
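The cost of fixed-rate boundary hypothesization can be illustrated with simple arithmetic. The sketch below counts the boundaries hypothesized at a fixed period and the share that downstream processing would have to delete. The text above names a "boundary deletion rate" but gives no formula, so the ratio computed here (spurious hypotheses divided by total hypotheses) is an assumption, as are the function name `fixed_rate_hypotheses` and the example utterance.

```python
def fixed_rate_hypotheses(duration_ms, period_ms, true_boundaries):
    """Count fixed-rate boundary hypotheses and the fraction that are spurious.

    Returns (hypothesized, spurious, spurious_fraction). The fraction is an
    illustrative stand-in for a boundary deletion rate, not a defined metric.
    """
    hypothesized = duration_ms // period_ms
    spurious = max(hypothesized - true_boundaries, 0)
    fraction = spurious / hypothesized if hypothesized else 0.0
    return hypothesized, spurious, fraction


if __name__ == "__main__":
    # Hypothetical 500 ms utterance of "cat" with two phoneme boundaries,
    # hypothesized every 10 ms (100 Hz).
    print(fixed_rate_hypotheses(500, 10, 2))
```

Under these assumptions, 50 boundaries are hypothesized for only two actual transitions, so 48 of them must be removed by later classification stages.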