Speech recognition is a complex field that draws on many otherwise unrelated disciplines, including signal processing, pattern recognition, artificial intelligence, statistics, information theory, probability theory, computer algorithms, psychology, linguistics, and biology. Even drawing upon this vast body of knowledge, researchers in speech recognition still struggle with how best to incorporate these areas of knowledge into recognition systems, how to select units of speech that are context insensitive, and how to distinguish between multiple speakers. As a result, speech recognition systems place constraints on the speech data to be processed, such as speaker dependence, isolated word phrasing, a small vocabulary, and constrained grammar.
Speech uses voiced and unvoiced audio signals. Voiced speech consists of sounds like vowels, while unvoiced speech includes whispers or sounds like the letter "S". Voiced speech begins with the larynx, which generates a signal similar to a pulse train by forcing air between the vocal cords, which in turn slap together and produce a pulsed sound. Speech is generated by moving the mouth and tongue to change the timbre of the voiced sound. The rate at which the vocal cords slap together determines the pitch of the speech. Pitch is naturally higher for children and female speakers. But even for a single speaker, pitch changes from word to word, and even within words. Speech also includes unvoiced components that use an unmodulated air stream passing through the separated vocal cords of the relaxed larynx.
The position of the tongue and jaw (and to some extent the lips) determines the resonant frequencies of the vocal tract, referred to as formants. The pitch (rich in harmonics) is filtered or modulated by the speaker's vocal tract. Like pitch, formants vary from speaker to speaker. Formants generally occur at higher frequencies than the base frequency of the pitch (F0). The first formant (F1) is the lowest frequency formant, or major peak in the spectrum envelope, once pitch (F0) is removed. Formants move constantly in speech, and formant transitions make classifying sounds more difficult.
Other variables also complicate speech recognition. For example, noise is virtually always present in real-world speech. Only some noise may be filtered prior to digitization. Even the process of training a neural network to recognize speech itself adds Gaussian noise. Moreover, because most speech recognition systems are digital, the analog speech signal must first be converted into a digital format before it can be processed and recognized using a computer-based system. Unfortunately, in digitizing a signal, some of the signal context is lost.
Since speech signals vary widely from word to word, and also within individual words, speech is analyzed using smaller units of sound generally referred to as phonemes. Different sounds are enumerated by phonetic alphabets, and words can be phonetically spelled using these alphabets. A phonetic alphabet describes how each word is to be spoken aloud. More formally, however, a phoneme is the smallest unit of sound in a given language that changes the meaning of a word. English has about 31 to 38 phonemes. Some languages have as many as 45 or as few as 13 (Hawaiian). In general, neural networks for speech recognition encode speech as a sequence of phonemes.
While the present invention uses the term phoneme for purposes of explanation, units of speech other than the phoneme could of course be used. For example, if a large vocabulary system is being developed, it may be important to take co-articulatory effects into account, i.e. the way adjacent phonemes change one another when they occur in the same word.
Because of the large number of variables involved in the speech recognition decision making process (a few of which were just described), several signal analysis techniques for pattern recognition are based on nonanalytical methods which use "training" to arrive at the system parameters later used to perform that pattern analysis. Training is a method wherein a system such as a neural network is presented with examples of the pattern(s) to be recognized, the system's response is measured, and the system parameters are modified to reduce the error of that response. This iterative process ultimately improves system performance. In addition, such a trained system can be developed without expert knowledge of the pattern(s) to be recognized.
This is the foundation of neural networks: internal network parameters, e.g., neural network weights, that allow the network to recognize particular phoneme patterns are determined using training examples repeatedly presented to the network. A training algorithm uses the network's response/performance during training to modify/correct the network parameters. Clearly then, optimum performance of the neural network depends upon the quality of the training examples presented to it.
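The present/measure/modify cycle described above can be illustrated with the simplest trainable unit, a single perceptron. This is a minimal sketch for explanation only, not the network used by the invention; the function name `train_perceptron`, the learning rate, and the toy examples (the logical AND pattern) are all assumptions chosen for illustration.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs exceeds zero, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, epochs=20, lr=1.0):
    """Repeatedly present examples and nudge the weights to reduce error.

    examples: list of (input_vector, target) pairs with target 0 or 1.
    Returns the learned weights, the bias, and the error count per epoch.
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    history = []
    for _ in range(epochs):
        errors = 0
        for x, target in examples:
            delta = target - predict(weights, bias, x)  # measured error
            if delta != 0:
                errors += 1
                # Modify the parameters in the direction that reduces the error.
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        history.append(errors)
    return weights, bias, history

# Hypothetical toy training set: output 1 only when both inputs are 1.
examples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
weights, bias, history = train_perceptron(examples)
```

The per-epoch error count in `history` starts above zero and falls to zero once the unit classifies every example correctly, which mirrors the iterative improvement described in the text.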
The present invention provides a tool for developing a training set for training a neural network for phoneme recognition and uses this training data set to develop the neural network for phoneme recognition itself. First, an input speech signal is digitized and segmented. A segment is a sequence of speech samples that occur sequentially in time, where a sample is a digitized audio amplitude value of speech at a moment in time. The segmentation of speech is based upon visually discernible features or patterns of the speech. Next, the segments are transformed from the time domain into another domain, e.g., the frequency domain, where it is easier to analyze component parts (sounds) unique to that speech signal. Transformed segments of speech are represented mathematically as sets of one or more vectors, each vector having multiple dimensions, e.g., a 5-dimensional vector is defined by five elements or variables. By transforming the speech into a series of multidimensional vectors, similar or substantially similar vectors representing essentially the same phonemes may be grouped together to represent a segment of speech signals. The neural network generally classifies or "codes" each vector set corresponding to a segment as being one of a predetermined set of phonemes.
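As a rough illustration of how one segment becomes a set of multidimensional vectors, the sketch below cuts a segment into short overlapping frames and takes the magnitude of a discrete Fourier transform of each frame. The frame length, the hop size, the naive O(N^2) transform, and the function names are assumptions made for this example; the text does not specify them.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def dft_magnitudes(frame):
    """Naive DFT; returns the magnitudes of bins 0 .. N/2 as one vector."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def segment_to_vectors(segment, frame_len=16, hop=8):
    """Transform one time-domain segment into a set of frequency vectors."""
    return [dft_magnitudes(f) for f in frame_signal(segment, frame_len, hop)]

# Hypothetical segment: 64 samples of a sinusoid with 2 cycles per frame.
segment = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
vectors = segment_to_vectors(segment)
peak_bins = [v.index(max(v)) for v in vectors]
```

With these settings the 64-sample segment yields 7 vectors of 9 dimensions each, and every vector peaks at bin 2, so vectors from the same steady sound cluster together in vector space.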
In testing the trained neural network, another uncoded speech signal is digitized and automatically divided into individual segments, with each segment being transformed into a set of one or more vectors, each vector having plural dimensions. In contrast to development, during testing an operator does not assign a phoneme code to these segments. Instead, the trained neural network processes the vector sets to recognize phonemes and automatically assign a phoneme code to each of the vectors, each code corresponding to a recognized phoneme. The phoneme code most frequently assigned in a particular vector set is selected for the vector set and assigned to the corresponding speech segment.
The present invention enables a relatively unskilled operator to "train" a neural network coding scheme to recognize phonemes and to educate the operator regarding a large number of techniques that may be employed in that training. The training of a neural network is the foundation upon which the neural network recognizes speech sounds, and the recognition is only as accurate as the initial known phonemes in the training set. Thus, the preparation and testing of a suitable and accurate training set is a very significant factor in determining how well the neural network will ultimately recognize speech.
After each segment is transformed into a vector set, a reverse transformation mechanism allows the user qualitative audio verification of the information content of the transformation vector sets corresponding to a phoneme segment. Since the input vectors and their phoneme codes are known, the parameters of the neural network are iteratively modified to output the appropriate phoneme code. A reverse coding mechanism (decode) allows qualitative audio verification of the content of the coded segments by audibly reproducing sounds corresponding to phoneme codes selected by the trained neural network.
The present invention permits both visual and audible evaluation of the performance of the developed speech recognition system. For example, corresponding segments, vectors, assigned phoneme codes, and representations of the internal state of the neural network (e.g., a centroid indicating the neural network's exemplary or "best" vector corresponding to a phoneme) may be simultaneously displayed on a single display screen for comparison by an operator.
Still further, the digitized signals and one or more of the segments may also be audibly reproduced to determine qualitatively the acceptability of the digitization and segmentation techniques previously used. If the produced results are unacceptable, the digitization and/or the segmentation may be modified by an operator. To ensure that the transformation procedure does not distort the phoneme training set, the present invention provides for reverse transformation of one or more of the vector sets and an audible reproduction of those reverse transformed vectors to permit an operator to confirm audibly the acceptability of the transformation process. In other words, if the operator can recognize one or more of the training set of phonemes when the reverse transformed vectors are audibly reproduced, then the transformation process (and hence the training data) is acceptable for training a neural network. Otherwise, if the operator cannot audibly discern the original set of training phonemes, the training set is likely to be unacceptable and some parameter must be changed.
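The reverse transformation check can be sketched numerically. In the invention the operator judges the result by ear; in the example below an RMS error stands in for that listening test, which is an assumption of this sketch. The choice of a Fourier transform, the helper names, and the idea of discarding high-frequency bins to simulate a lossy transformation are likewise illustrative.

```python
import cmath
import math

def dft(x):
    """Forward discrete Fourier transform (naive O(N^2) form)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Reverse transform back to a real time-domain signal."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def keep_low_bins(spectrum, keep):
    """Zero every bin except the `keep` lowest frequencies and their mirrors."""
    n = len(spectrum)
    return [spectrum[k] if min(k, n - k) < keep else 0j for k in range(n)]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
spectrum = dft(signal)
lossless = idft(spectrum)                       # full information retained
lossy = idft(keep_low_bins(spectrum, keep=5))   # component at bin 10 discarded
```

Here `rms_error(signal, lossless)` is essentially zero, while `rms_error(signal, lossy)` is clearly nonzero because the discarded component cannot be recovered. A transformation that loses this much would be the kind an operator might reject after listening.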
A neural network assigned phoneme code can be automatically decoded to generate an estimate of an exemplary vector representing a corresponding phoneme. The estimate exemplary vector may be for example (depending upon the neural network used) a centroid of the set of vectors in vector space used to train the neural network to recognize a particular phoneme having plural dimensions corresponding to the code generated while training the neural network. For other neural networks, the exemplary vector may be a best guess of such a centroid. The centroid reflects ultimately the internal structure and parameters of the trained neural network which evolved during training to recognize vectors corresponding to a particular phoneme. The estimate exemplary vector is then reverse transformed into a time varying signal. The time varying signal is audibly reproduced to evaluate the performance of the speech recognition system.
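For the case in which the exemplary vector is a centroid of the training vectors for a phoneme, decoding a code amounts to a table lookup. The sketch below computes such centroids directly from labeled vectors; this is an assumption made for illustration, since in the invention the centroid reflects the trained network's internal parameters. The function name and the toy data are hypothetical.

```python
def compute_centroids(vectors, codes):
    """Average the vectors that share a phoneme code, one centroid per code."""
    sums = {}
    counts = {}
    for vec, code in zip(vectors, codes):
        if code not in sums:
            sums[code] = [0.0] * len(vec)
            counts[code] = 0
        sums[code] = [s + v for s, v in zip(sums[code], vec)]
        counts[code] += 1
    return {code: [s / counts[code] for s in total]
            for code, total in sums.items()}

# Hypothetical 2-dimensional training vectors and their phoneme codes.
training_vectors = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
training_codes = ["OO", "OO", "AA"]

centroids = compute_centroids(training_vectors, training_codes)
estimate = centroids["OO"]  # decode the code "OO" into an exemplary vector
```

The decoded `estimate` would then be passed through the reverse transformation to obtain a time varying signal for audible playback.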
In other words, in response to a testing input signal, the neural network assigned phoneme codes are the neural network's speech recognition result. By decoding those codes and generating an estimate vector, the estimate vector can further be reverse transformed into a signal that can be audibly reproduced to permit an operator to audibly determine how well the neural network did in recognizing the speech. For example, if the neural network was supposed to recognize a phoneme "OO" but assigned a code corresponding to an "AA" phoneme, the audibly reproduced signal would immediately enable an operator to determine that an error had been made. Upon detection of such errors, modifications can be made to retrain the neural network.
The present invention permits the dividing of the digitized audio signals into segments to be accomplished manually or automatically. Segments are intended to represent a single unit of speech, a phoneme, where each segment contains a set of one or more vectors. When developing a training set for a supervised neural network, the user will examine the segments, both visually and audibly, and assign a correct phoneme code to each. When the segment is added to the training data set, each transform vector in the vector set corresponding to the segment is identified as being an example of this phoneme. In this way, the user does not need to assign a phoneme to every transform vector individually. This is both a time saving measure, and one that permits the user to work with longer stretches of speech data, when visually and audibly confirming their phoneme content.
Once the neural network is trained and operating, the neural network assigns a phoneme code to each segment (phoneme codes are sometimes simply referred to as codes). The neural network examines each vector in the vector set corresponding to the segment and assigns to it a phoneme code. The entire segment is then assigned the "most popular" (i.e., most frequently assigned) phoneme code in the vector set.
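The "most popular" rule is a majority vote over the per-vector codes. A minimal sketch follows, assuming the per-vector codes are already available as a list; the function name `segment_code` is hypothetical, and the tie-breaking behavior (the code encountered first wins) is an arbitrary choice of this sketch rather than something the text specifies.

```python
from collections import Counter

def segment_code(vector_codes):
    """Return the phoneme code assigned most often within one vector set."""
    return Counter(vector_codes).most_common(1)[0][0]

# Hypothetical codes assigned by the network to the five vectors of a segment.
codes_for_segment = ["AA", "OO", "OO", "AA", "OO"]
winner = segment_code(codes_for_segment)
```

In this example three of the five vectors were coded "OO", so the whole segment is assigned "OO" even though two vectors were coded "AA".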
In many cases, continuous speech can be so fast that it is impossible to assign a segment small enough, or positioned with enough accuracy, to contain a single unambiguous phoneme. In this case the invention offers the option of assigning two possible phoneme codes corresponding to two possible phonemes that might be contained in the segment. This dual phoneme assignment is referred to as the "pairs method," because a segment is assigned a pair of possible phoneme codes. For example, assume four pseudo-words are each spoken quickly: "boo", "goo", "ba", "ga". Each pseudo-word is assigned a single segment, where each of the four segments encompasses an entire pseudo-word and contains some number of transform vectors. The user adds these segments to the training set, assigning to the segments (and therefore to each transform vector contained in the vector set corresponding to the segment) the phoneme pairs, respectively: "b or oo", "g or oo", "b or ah", "g or ah". The neural network trained with this data set can then determine, when presented with a new test pseudo-word ending in "oo", whether the vowel was preceded by a "b" or a "g", etc. This decision occurs despite the absence of a segment in the training set containing only the sound "g" or only the sound "b". The method whereby the supervised neural network is trained with the pairs method differs slightly for each different type of neural network. However, the common aspect of these training methods is that when a transform vector is presented during training, the neural network will not be penalized (i.e., error will be deemed low) for identifying the vector as being either of the pair of phonemes. Without the pairs method, all but one correct response of the neural network during training would be penalized (i.e., would result in a high error).
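One way to picture the "not penalized for either phoneme" behavior is an error measure that looks only at the total output the network places on the allowed codes. The sketch below assumes the network emits a probability for each phoneme and uses the negative log of the summed probability as the error; both are assumptions for illustration, since the text states that the actual method differs by network type. The function name `pair_error` and the probability values are hypothetical.

```python
import math

def pair_error(probs, allowed):
    """Negative log of the total probability placed on any allowed phoneme."""
    return -math.log(sum(probs[p] for p in allowed))

# Hypothetical network outputs for two vectors from the segment "boo".
output_early = {"b": 0.90, "g": 0.05, "oo": 0.05}  # vector near the consonant
output_late = {"b": 0.05, "g": 0.05, "oo": 0.90}   # vector near the vowel

pair = ("b", "oo")
err_early_pair = pair_error(output_early, pair)   # low: "b" is allowed
err_late_pair = pair_error(output_late, pair)     # low: "oo" is allowed
err_late_single = pair_error(output_late, ("b",)) # high: only "b" allowed
```

With the pair label, both vectors produce a low error. With a single label "b", the vowel-like vector is heavily penalized even though it was identified sensibly, which is the situation the pairs method is meant to avoid.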
The present invention provides a user interface which permits an operator to select and modify various inputs while developing the neural network. For example, the operator can select with a mouse click whether to segment speech manually or automatically. Speech can be segmented manually using the mouse pointing device, with each manually selected segment being highlighted on the display screen. The operator selects one of plural transformation algorithms, including a frequency based transformation such as a fast Fourier transform (FFT) and a nonfrequency based transformation such as linear predictive coding (LPC). Both transformation algorithms convert each segment into a set of one or more multiple dimension vectors. A display of these multidimensional vectors is provided on a two dimensional display screen. Moreover, an operator can select between a number of different types of neural networks including, for example, supervised, unsupervised, auto associative, and back propagation types of neural networks.
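To show what an LPC-style vector might look like, the sketch below uses the textbook autocorrelation method with the Levinson-Durbin recursion, where each sample is predicted as a weighted sum of the preceding samples and the weights form the vector. Whether the invention uses this particular formulation is not stated in the text, so the algorithm choice, the function name `lpc`, and the test signal are assumptions.

```python
def lpc(signal, order):
    """Estimate predictor coefficients a[1..order] with Levinson-Durbin.

    The model is x[t] ~ a[1]*x[t-1] + ... + a[order]*x[t-order].
    Returns the coefficient list and the final prediction error energy.
    """
    # Autocorrelation of the signal at lags 0 .. order.
    r = [sum(signal[t] * signal[t + lag] for t in range(len(signal) - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err

# Hypothetical test signal: each sample is 0.9 times the previous one.
signal = [0.9 ** t for t in range(200)]
coeffs, residual = lpc(signal, order=2)
```

For this signal the first coefficient comes out very close to 0.9 and the second very close to zero, since one previous sample is enough to predict the next. For real speech a higher order would be used, and the resulting coefficient list plays the role of one multidimensional vector.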
Ultimately, the input audio signal (i.e., a digitized version thereof), the phoneme segments, the corresponding transform vector sets, phoneme codes, and estimate vectors (e.g., centroids) are all displayed on a single screen. An operator may then visually confirm the acceptability of any one of these parameters based on a degree of visual similarity between the corresponding portions of the displayed audio signal segments, transform vector sets, phoneme codes, and/or centroids. Each of the above displays corresponding to the same phoneme should have a similar visual appearance. A dissimilar visual appearance is an indication to the operator that adjustment needs to be made.
These features as well as other features and advantages will be described more fully below in conjunction with the figures in the detailed description of the invention.