Speech recognition can be generally defined as the ability of a computer or machine to identify and respond to the sounds produced in human speech. Speech recognition processes are often referred to generally as “automatic speech recognition” (“ASR”), “computer speech recognition”, and/or “speech to text.” Voice recognition is a related process that generally refers to finding the identity of the person who is speaking, in contrast to determining what the speaker is saying.
Speech recognition systems can be broadly categorized as isolated-word recognition systems and continuous speech recognition systems. Isolated-word recognition systems handle speech with short pauses between spoken words, typically involve a restricted vocabulary that they must recognize, and are often employed in command/control type applications. Continuous speech recognition systems involve the recognition and transcription of naturally spoken speech (often performed in real time), and thus require a more universal vocabulary and the ability to discriminate words that, when spoken naturally, often run together with the words spoken immediately before and after.
Examples of isolated-word recognition systems include machines deployed in call centers that initiate and receive calls and navigate humans through menu options to avoid or minimize human interaction. Cell phones employ such systems to perform functions such as name-dialing, answering calls, Internet navigation, and other simple menu options. Voice-control of menu options also finds application in, for example, computers, televisions and vehicles. Continuous speech recognition systems are typically employed in applications such as voice to text, speaker recognition and natural language translation.
A typical speech recognition system consists of: a) a front-end section that extracts a set of spectral-temporal speech features from a temporal sample of the time-domain speech signal from which speech is to be recognized; b) an intermediate section consisting of statistical acoustic speech models that represent the distribution of the speech features that occur for each of a set of speech sounds when uttered; and c) a speech decoder that uses various language rules and word models to determine, from the combination of detected sub-phonemes and phonemes, what words are being spoken. The speech sounds modeled by the intermediate section are referred to as phonemes, which can be defined as the smallest units of speech that can be used to make one word different from another; such models can also be used to represent sub-phonemes. Often the decoder's prediction can be enhanced by considering the typical order in which various words are used in the language in which the speech is uttered. The intermediate and decoder sections are often lumped together and referred to as a speech recognition engine.
While there have been many advances in ASR in recent years, accurate generalized speech recognition remains a very difficult problem to solve. Enabling a computer to do what we as humans take for granted is no easy task. The most basic task in any automatic speech recognition system is to use extracted features to predict which phoneme (or sub-phoneme) is most likely being uttered during each temporal sample (typically referred to as a window or frame of data) based on the features captured for that window. The models against which these features are compared are “pre-trained” statistical models of the distributions of speech features typically found when sounds are uttered. The reason that these models are “pre-trained” is that they must take into account the vast statistical variation of the features that are extracted for any given speaker. Put another way, no two people say the same thing in exactly the same way, and thus the features extracted from different speakers saying the exact same thing vary commensurately.
Thus, the most basic task in speech recognition is also arguably the most difficult one. There are a large number of variables that contribute to the variations in speech from one speaker to another. They include, for example, the time duration of the spoken word. Not only does this vary from person to person, it even varies for the same person each time the same word is spoken. To make things more complicated, the variation in the duration of a word is not even uniform over the various sounds (i.e. phonemes and sub-phonemes) that form the word.
Another form of speaker variability lies in the fact that the spectral content of one's speech is highly dependent upon a person's anatomical proportions and functionality. As is well known in the art, there are numerous resonances in the human body that contribute to the human voice, and these resonances are directly related to the speaker's anatomy. Gender is a very obvious manifestation of these factors, as the fundamental frequency of speech uttered by men is typically much lower overall when compared to the fundamental frequency of speech uttered by women. In addition, the emotional state and overall health of a speaker will also cause variations on top of the anatomical ones.
Speakers also develop accents, which can have a major effect on speech characteristics and on speech recognition performance. These accents range from national to regional accents and can include very different pronunciations of certain words. Because of the mobility of the general population, these accents are often melded together.
Further complicating the task, particularly with regard to continuous speech recognition, is that the characteristic of a phoneme or sub-phoneme can be greatly affected by the acoustic and phonetic context of those phonemes or sub-phonemes preceding or succeeding it. A similar issue, called co-articulation, refers generally to a situation in which a conceptually isolated speech sound is influenced by, and becomes more like, a preceding or following speech sound.
There are numerous techniques by which known speech recognition systems deal with these problems in speech variability. As previously discussed, one way is to limit the vocabulary that the system is required to understand (which limits the number of models and permits them to be more specialized), as well as to simplify the speech into single words or very short phrases to minimize issues such as context and co-articulation.
Another technique is to use individualized training, where the statistical distributions of the models are tailored (through a learning process) to a particular user's voice characteristics to aid in recognizing what that person is saying. Such systems are referred to as speaker dependent systems. Of course, it is far more desirable to render systems that are speaker independent, which require more generalized statistical models of speech that do not depend on or otherwise employ individualized training for a particular speaker. Many developers of such speaker independent systems gather vast amounts of speech from as many speakers as possible to create a massive corpus, with the goal of creating models that statistically represent the distributions of these many variables over virtually entire populations for all possible sounds. One of the downsides of this approach is clearly the vast amount of data that must be gathered and maintained. Another is the question of whether models that have become so generalized as to represent every speaker in a given population lose their ability to even distinguish speech.
A general methodology commonly employed by known speech recognition systems as discussed above can be illustrated by the simple and high-level representation of a known speech recognition system 100 as is illustrated in FIG. 1. Speech is captured with a transducer (e.g. a microphone) at block 104 in the form of a time domain analog audio signal 101, and is partitioned for analysis using a continuous series of overlapping windows of short time duration (i.e. they are each advanced in time by less than the duration of each window). The portion of the audio signal 101 falling within each window is sampled using an analog to digital converter (ADC) that samples the analog signal at a predetermined sampling rate over each window, and therefore converts the analog time domain signal into a digital time domain audio signal.
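The partitioning into overlapping windows described above can be sketched as follows. This is a minimal illustration, not the implementation of system 100: the function name frame_signal, the 16 kHz sampling rate, the 25 ms window and the 10 ms advance are all assumptions chosen for the example.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sampled signal into frames of frame_len samples.

    Each frame starts hop samples after the previous one, so choosing
    hop < frame_len produces overlapping windows.
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must be between 1 and frame_len")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Hypothetical numbers: 16 kHz sampling, 25 ms windows advanced by 10 ms.
SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE * 25 // 1000   # 400 samples per window
HOP = SAMPLE_RATE * 10 // 1000         # 160 samples between window starts

signal = [0.0] * SAMPLE_RATE           # one second of placeholder samples
frames = frame_signal(signal, FRAME_LEN, HOP)
```

Because the advance is shorter than the window, each sample away from the edges of the signal falls into more than one frame.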
At 106, the digital audio signal is then converted, on a frame by frame basis, into a frequency domain representation of the portion of the time domain signal that falls within each window using any of a number of transforms such as the Fast Fourier Transform (FFT), the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) or possibly other related transforms. The use of one or more of these transforms serves to represent and permit identification of the spectral constituents of the speech signal. As discussed above, these features can provide clues as to what sounds are being uttered over the course of each frame.
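The per-frame conversion to the frequency domain can be illustrated with a direct DFT. The sketch below is written for clarity rather than speed (a practical system would use an FFT), and the Hann taper applied before the transform is an assumption of this example, not something stated above.

```python
import cmath
import math


def hann(n_points):
    """Hann taper (an assumed choice of analysis window for this sketch)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 for one real-valued frame (naive O(N^2) DFT)."""
    n_points = len(frame)
    tapered = [s * w for s, w in zip(frame, hann(n_points))]
    mags = []
    for k in range(n_points // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n, x in enumerate(tapered))
        mags.append(abs(acc))
    return mags


# A 64-sample frame holding exactly 8 cycles of a sinusoid.
N = 64
frame = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]
spectrum = dft_magnitudes(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Here the largest magnitude falls in bin 8, the bin whose frequency matches the sinusoid; the spectral constituents of a speech frame appear in the same way, as energy concentrated in particular bins.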
These features, as extracted from each window, are then typically formed into a frame of data referred to as a feature vector, and the feature vectors can be stored at 108. The foregoing process is often referred to as the front-end 102 of system 100, and the features extracted thereby can then form the input to a speech recognition engine 110. Speech recognition engine 110 can compare the feature vectors on a frame by frame basis to the statistical models that represent the typical distribution of such features for phonemes and sub-phonemes. Because of an overlap in the statistical distributions of the models, this comparison process typically leads to a statistical prediction of the likelihood that the feature vectors represent the spectral constituents of any one or more of the phonemes or sub-phonemes. Thus, there may be a number of possible matches for each feature vector, and each of those possible matches can be ranked using a probability score.
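The comparison of a feature vector against statistical models, and the ranking of candidate matches by probability, can be sketched as below. Everything specific here is hypothetical: the diagonal-covariance Gaussian form of each model, the equal prior for every phoneme, the phoneme labels, and the two-dimensional means and variances are illustrative assumptions only.

```python
import math


def log_gaussian(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def rank_phonemes(feature, models):
    """Return (phoneme, probability) pairs, most likely first.

    Assumes every phoneme is equally likely a priori.
    """
    log_liks = {p: log_gaussian(feature, m["mean"], m["var"])
                for p, m in models.items()}
    top = max(log_liks.values())
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    weights = {p: math.exp(ll - top) for p, ll in log_liks.items()}
    total = sum(weights.values())
    scored = [(p, w / total) for p, w in weights.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional models; real feature vectors have many more dimensions.
MODELS = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 0.5], "var": [1.0, 1.0]},
    "s": {"mean": [0.0, 3.0], "var": [0.5, 0.5]},
}

ranking = rank_phonemes([0.8, 0.2], MODELS)
```

Because the model distributions overlap, every phoneme receives a nonzero probability for the same feature vector, which is why the result is a ranked list rather than a single answer.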
Ultimately, the probabilities and perhaps even groupings of the extracted feature vectors are fed to a back-end portion of the speech recognition engine 110 of the speech recognition system 100, where they are further processed to predict through statistical probabilities what words and phrases are being uttered over the course of several consecutive overlapping windows. From there, the engine 110 outputs its best guess of what the speech is, and that output 112 can be used for any purpose that suits the application. For example, the output 112 can be transcribed text, or control outputs based on recognized menu commands as discussed above.
One of the most commonly used forms of feature data extracted from speech at the front end of the speech recognition process is known as cepstral coefficients. Cepstral coefficients are derived from an inverse discrete Fourier transform (IDFT) of the logarithm of the short-term power spectrum of a speech segment defined by a window. Put another way, cepstral coefficients encode the shape of the log-spectrum of the signal segment. A widely used form of cepstral coefficients is the Mel Frequency Cepstral Coefficients (MFCC). To obtain MFCC features, the spectral magnitudes of FFT frequency bins are averaged within frequency bands spaced according to the Mel scale, which is based on a model of human auditory perception. The scale is approximately linear up to about 1000 Hz, approximately logarithmic above that, and approximates the sensitivity of the human ear.
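The definition of cepstral coefficients given above (an IDFT of the logarithm of the short-term power spectrum) can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the small floor guards against taking the logarithm of zero, and hz_to_mel uses one commonly quoted approximation of the Mel scale, which is an assumption since several variants exist.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    """One common Mel-scale approximation (assumed here; other variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def dft(values):
    """Naive O(N^2) discrete Fourier transform."""
    n_points = len(values)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                for n, x in enumerate(values))
            for k in range(n_points)]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log power spectrum of one real-valued frame."""
    n_points = len(frame)
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in dft(frame)]
    # The log power spectrum of a real frame is symmetric, so the inverse
    # transform is real up to rounding error; keep only the real part.
    return [sum(lp * cmath.exp(2j * math.pi * k * n / n_points)
                for k, lp in enumerate(log_power)).real / n_points
            for n in range(n_points)]


# Placeholder frame: two sinusoidal components, 32 samples.
frame = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
cepstrum = real_cepstrum(frame)

# Equal 1000 Hz steps shrink on the Mel scale as frequency rises.
mel_steps = [hz_to_mel(f + 1000.0) - hz_to_mel(f) for f in (0.0, 1000.0, 2000.0, 3000.0)]
```

The low-order entries of cepstrum describe the broad shape of the log-spectrum, while mel_steps shows the compression of higher frequencies that motivates spacing the MFCC bands on the Mel scale.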
Because cepstral coefficients are primarily concerned with capturing and encoding the power distribution of the speech signal over a range of frequencies, statistical models must be used to account for the variability between speakers who are uttering the same sounds (e.g. words, phonemes, phrases or utterances). Put another way, these variations in speaker characteristics make it very difficult to discriminate between speech phonemes uttered by different individuals based on spectral power alone, because those varying characteristics (such as the fundamental frequency of a speaker and the duration of that speaker's speech) are not directly reflected in the spectral power. One of the few variables that may be renormalized out (i.e. made constant for all speakers) for the MFCCs is the volume of the speech.
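One way to see why volume can be normalized out is that a uniform gain adds the same constant to every log band energy, and a constant affects only the zeroth coefficient of the final transform. The sketch below assumes the common MFCC arrangement in which a DCT-II is applied to log band energies; the band energies and the gain are hypothetical values.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II, assumed here as the final MFCC transform."""
    n_points = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / n_points)
                for n, v in enumerate(values))
            for k in range(n_points)]


def cepstra_from_band_energies(band_energies):
    """DCT-II of the log of each band energy."""
    return dct_ii([math.log(e) for e in band_energies])


# Hypothetical energies in six frequency bands for one frame.
quiet = [0.8, 2.5, 1.2, 0.4, 0.1, 0.05]
gain = 4.0                               # same frame at four times the amplitude
loud = [e * gain ** 2 for e in quiet]    # energy scales with amplitude squared

c_quiet = cepstra_from_band_energies(quiet)
c_loud = cepstra_from_band_energies(loud)
```

Only the zeroth coefficient differs between c_quiet and c_loud, so discarding or normalizing that coefficient removes the effect of volume. No comparably simple operation removes differences in fundamental frequency or duration.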
Another known type of feature data is in the form of oscillator peaks. Oscillator peaks are derived to represent the presence, for example, of short-term stable sinusoidal components in each frame of the audio signal. Recent innovations regarding the identification and analysis of such oscillator peaks have made them a more practical means by which to encode the spectral constituents of an audio signal of interest. For example, in the publication by Kevin M. Short and Ricardo A. Garcia entitled “Signal Analysis Using the Complex Spectral Phase Evolution (CSPE) Method,” AES 120th Convention, Paris, France, May 20-23, 2006, a method of attaining super-resolution of the frequencies of such short-term stable oscillators is presented by examining the evolution of the phase of the complex signal spectrum over time-shifted windows of the audio signal being analyzed. This publication is incorporated herein in its entirety by this reference.
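The basic idea of examining phase evolution over time-shifted windows can be sketched as follows. This is a simplified illustration under stated assumptions, not the method as published: it uses a single one-sample shift, a Hann taper, a naive DFT, and a test signal containing one stable sinusoid, and the function names are hypothetical.

```python
import cmath
import math


def dft_bin(values, k):
    """Value of DFT bin k for the given samples (naive summation)."""
    n_points = len(values)
    return sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
               for n, x in enumerate(values))


def cspe_frequency(samples, n_points):
    """Estimate the strongest component's frequency in fractional DFT bins.

    Compares the spectrum of samples[0:n_points] with that of the window
    shifted by one sample, so at least n_points + 1 samples are required.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points)
              for n in range(n_points)]
    frame0 = [samples[n] * window[n] for n in range(n_points)]
    frame1 = [samples[n + 1] * window[n] for n in range(n_points)]
    spec0 = [dft_bin(frame0, k) for k in range(n_points // 2 + 1)]
    peak = max(range(len(spec0)), key=lambda k: abs(spec0[k]))
    # Phase advance of the peak bin caused by the one-sample shift.
    phase_advance = cmath.phase(dft_bin(frame1, peak) * spec0[peak].conjugate())
    return n_points * phase_advance / (2 * math.pi)


# A sinusoid completing 10.3 cycles per 128-sample window, i.e. between bins 10 and 11.
N = 128
TRUE_BINS = 10.3
signal = [math.cos(2 * math.pi * TRUE_BINS * n / N + 0.4) for n in range(N + 1)]
estimate = cspe_frequency(signal, N)
```

The plain transform can only report that the peak lies in bin 10, whereas the phase advance between the two windows recovers a value close to the true 10.3 bins.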
In the U.S. patent application Ser. No. 13/886,902 entitled “Systems & Methods for Source Signal Separation,” several additional improvements are disclosed that further enhance the CSPE method discussed above, leading to even greater resolution of the properties of the oscillator peaks. One of these techniques includes the ability to establish oscillator peaks even when the audio is frequency modulated such that no short-term stabilized oscillators otherwise exist in the signal. Another improvement eliminates smearing of the oscillator peaks that is caused by transient or amplitude modulation effects. The application of these techniques has markedly improved the ability to distinguish and to thereby identify individual sources contributing to a signal being analyzed. The above-noted application is hereby incorporated herein in its entirety by this reference.
The foregoing improvements permit the underlying signal elements to be represented as essentially delta functions with only a few parameters, and these parameters are determined at a super-resolution that is much finer than the transform resolution of a typical and previously known approach to such analysis. Consequently, one can, for example, look at frequencies of the oscillator peaks on a resolution that is on a fractional period basis, whereas the original transform analysis results in only integer period output. This improved resolution allows for the examination of single excitation periods of an audio signal as it would be produced by the vocal tract, and then one can examine how the effects of the vocal tract (or other environmental conditions) will alter the single excitation period over time.
While such highly accurate oscillator peaks can potentially provide effective feature information for applications such as speech recognition, to be used as direct input to a speech recognition engine the vectors must still be placed in a format that permits effective comparison to speech that has been similarly encoded, so that the phoneme and sub-phoneme sounds present in the speech signal of interest can be accurately predicted notwithstanding the wide variation in speaker characteristics.