1. The Field of the Invention
The present invention relates generally to speech recognition and more specifically to a system and method of enabling speech pattern recognition in high-dimensional space.
2. Description of Related Art
Speech recognition techniques continually advance but have yet to achieve an acceptable word error rate. Many factors besides the text of the spoken message influence the acoustic characteristics of speech signals. Large acoustic variability exists among male speakers, female speakers, and different dialects, and this variability poses the greatest obstacle to achieving high accuracy in automatic speech recognition (ASR) systems. ASR technology presently delivers a reasonable performance level of around 90% correct word recognition for carefully prepared “clean” speech. However, performance degrades for unprepared, spontaneous, real speech.
Since speech signals vary widely from word to word, and also within individual words, ASR systems analyze speech using smaller units of sound referred to as phonemes. The English language comprises approximately 40 phonemes, with average durations of approximately 125 msec. The duration of a phoneme can vary considerably from one phoneme to another and from one word to another. Other languages may have as many as 45 or as few as 13 phonemes. Strings of phonemes make up words, which form the building blocks for sentences, paragraphs and language. Although the number of phonemes used in the English language is not very large, the number of acoustic patterns corresponding to these phonemes can be extremely large. For example, people using different dialects across the United States may use the same 40 phonemes, but pronounce them differently, thus introducing challenges to ASR systems. A speech recognizer must be able to accurately map different acoustic realizations (dialects) of the same phoneme to a single pattern.
The process of speech recognition involves first storing a series of voice patterns. A variety of speech recognition databases have previously been tested and stored. One such database is the TIMIT database (speech recorded at TI and transcribed at MIT). The TIMIT corpus of read speech was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. The TIMIT database contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. The database is divided into two parts: “train”, consisting of 462 speakers, is used for training a speech recognizer, and “test”, consisting of 168 speakers, is used for testing the speech recognizer. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. The corpus design was a joint effort between the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
The 630 individuals were tested and their voice signals were labeled into 51 phonemes and silence, from which all words and sentences in the TIMIT database are composed. The 8 dialects are further divided into male and female speakers. “Labeling” is the process of cataloging and organizing the 51 phonemes and silence into dialects and male/female voices.
Once the phonemes have been recorded and labeled, the ASR process involves receiving the speech signal of a speaking person, dividing the speech signal into segments associated with individual phonemes, and comparing each such segment to each stored phoneme to determine what the individual is saying. All speech recognition methods must recognize patterns by comparing an unknown pattern with a known pattern in memory. The system makes a judgment call as to which stored phoneme pattern relates most closely to the received phoneme pattern. The general scenario requires that a number of patterns have already been stored. The system must determine which one of the stored patterns relates to the received pattern. Comparing in this sense means computing some distance, scoring function, or other index of similarity between the stored value and the received value. That measure decides which of the stored patterns is closest to the received pattern. If the received pattern is close to a certain stored pattern, then the system returns that stored pattern as the one recognized as associated with the received pattern.
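The comparison step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pattern is a short list of numeric parameters and that the index of similarity is Euclidean distance; the names `euclidean`, `recognize` and `stored_patterns`, and the numeric values, are hypothetical and do not come from any particular recognizer.

```python
import math


def euclidean(a, b):
    """Distance between two patterns given as equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(received, stored):
    """Return the label of the stored pattern closest to the received pattern.

    `stored` maps a phoneme label to its stored pattern (an assumption made
    for this sketch; real systems store many patterns per phoneme).
    """
    return min(stored, key=lambda label: euclidean(received, stored[label]))


# Hypothetical stored patterns: one two-parameter pattern per phoneme.
stored_patterns = {"aa": [0.9, 0.2], "s": [0.1, 0.8]}

print(recognize([0.8, 0.3], stored_patterns))
```

Any other distance or scoring function could be substituted for `euclidean` without changing the structure of the comparison.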
The success rate of many speech recognition systems in recognizing phonemes is around 75%. The trend in speech recognition technologies has been to utilize low-dimensional space in providing a framework to compare a received phoneme with a stored phoneme to attempt to recognize the received phoneme. For example, see S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences”, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP 28 No. 4 pp. 357-366, August, 1980; U.S. Pat. No. 4,956,865 to Lennig, et al. There are difficulties in using low-dimensional space for speech recognition. Each phoneme can be represented as a point in a multi-dimensional space. As is known in the art, each phoneme has an associated set of acoustic parameters, such as, for example, the power spectrum and/or cepstrum. Other parameters may be used to characterize the phonemes. Once the appropriate parameters are assigned, a scattered cloud of points in a multi-dimensional space represents the phonemes.
FIG. 1 represents a scatter plot 10 of the phoneme /aa/ and phoneme /s/. The scatter plot 10 is in a two-dimensional space of energy in two frequency bands. The horizontal axis 12 represents the energy in the frequency band between 0 and 1 kHz within each phoneme and the vertical axis 14 represents the energy of the phonemes between 2 and 3 kHz. In order for a speech recognizer to discriminate one phoneme from another, the respective clouds must not overlap. Although there is a heavy concentration of points in the main body of the clouds, significant scatter exists at the edges, creating confusion between the two phonemes. Such confusion could be avoided if the boundaries of these clouds were distinct and had sharp edges.
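A two-band energy representation of the kind plotted in FIG. 1 can be sketched as follows. This is an illustrative sketch only: the 16 kHz sampling rate, the 256-sample frame, the direct DFT, and the names `band_energy`, `two_band_features` and `tone` are assumptions introduced here, not details taken from the figure.

```python
import cmath
import math


def band_energy(frame, sample_rate, f_lo, f_hi):
    """Sum of squared DFT magnitudes for bins with f_lo <= frequency < f_hi."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            # Direct DFT of bin k; adequate for a short illustrative frame.
            x_k = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                      for t, s in enumerate(frame))
            energy += abs(x_k) ** 2
    return energy


def two_band_features(frame, sample_rate=16000):
    """Return (energy in 0-1 kHz, energy in 2-3 kHz) for one frame."""
    return (band_energy(frame, sample_rate, 0, 1000),
            band_energy(frame, sample_rate, 2000, 3000))


def tone(freq, n=256, sample_rate=16000):
    """Synthetic sine frame, used here only as a stand-in for a speech segment."""
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]


low, high = two_band_features(tone(500))
print(low > high)
```

Each speech segment processed this way yields one point in the two-dimensional space; many segments of the same phoneme yield a cloud of points.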
The dominant technology used in ASR is called the “Hidden Markov Model”, or HMM. This technology recognizes speech by estimating the likelihood of each phoneme at contiguous, small regions (frames) of the speech signal. Each word in a vocabulary list is specified in terms of its component phonemes. A search procedure, called Viterbi search, is used to determine the sequence of phonemes with the highest likelihood. This search is constrained to only look for phoneme sequences that correspond to words in the vocabulary list, and the phoneme sequence with the highest total likelihood is identified with the word that was spoken. In standard HMMs, the likelihoods are computed using a Gaussian Mixture Model. See Ronald A. Cole, et al., “Survey of the State of the Art in Human Language Technology,” National Science Foundation, Directorate XIII-E of the Commission of the European Communities, Center for Spoken Language Understanding, Oregon Graduate Institute, Nov. 21, 1995 (http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html).
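The Viterbi search described above can be sketched in a few lines. The sketch below is a generic log-domain Viterbi over phoneme states and omits the vocabulary constraint; the per-frame likelihoods, transition probabilities and the names `viterbi` and `to_log` are hypothetical values and helpers chosen for illustration, not output of a Gaussian Mixture Model.

```python
import math


def to_log(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


def viterbi(frame_loglik, log_trans, log_init):
    """Return the most likely state sequence for a list of per-frame
    log-likelihood dicts (state -> log-likelihood)."""
    states = list(log_init)
    best = {s: log_init[s] + frame_loglik[0][s] for s in states}
    back = []  # one dict of back-pointers per frame after the first
    for obs in frame_loglik[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + obs[s]
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-phoneme model and four frames of likelihoods.
init = to_log({"s": 0.5, "aa": 0.5})
trans = {"s": to_log({"s": 0.8, "aa": 0.2}),
         "aa": to_log({"s": 0.2, "aa": 0.8})}
frames = [to_log(f) for f in ({"s": 0.9, "aa": 0.1}, {"s": 0.8, "aa": 0.2},
                              {"s": 0.2, "aa": 0.8}, {"s": 0.1, "aa": 0.9})]

print(viterbi(frames, trans, init))
```

Restricting the search to vocabulary words, as the text describes, would amount to allowing only those transitions that follow the phoneme spelling of some word in the list.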
However, statistical pattern recognition by itself cannot provide accurate discrimination between patterns unless the likelihood for the correct pattern is always greater than that of the incorrect pattern. FIG. 1 illustrates the difficulty in using the statistical models. It is difficult to ensure that the likelihoods computed for the correct and incorrect patterns do not overlap.
The “holy grail” of ASR research is to allow a computer to recognize with 100% accuracy all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics and accent, or channel conditions. Despite several decades of research in this area, high word accuracy (greater than 90%) is only attained when the task is constrained in some way. Depending on how the task is constrained, different levels of performance can be attained. If the system is trained to learn an individual speaker's voice, then much larger vocabularies are possible, although accuracy drops to somewhere between 90% and 95% for commercially-available systems.