Speech recognition is an automatic process that converts the voice signal of speech into text in three steps. The first step, acoustic processing, reduces the speech signal to a parametric representation. The second step finds the most probable sequences of phonemes from that parametric representation. The third step finds the most probable sequence of words from the candidate phoneme sequences and a language model. The present invention relates to a new type of parametric representation of the speech signal and to the process of converting the speech signal into that parametric representation.
In current commercial speech recognition systems, the speech signal is first multiplied by a shifting processing window, typically a Hamming window with a duration of about 25 msec and a shift of about 10 msec, to form a frame; see FIG. 2(A). A set of parameters is produced from each windowed speech signal. Therefore, every 10 msec, a set of parameters representing the speech signal within the 25 msec window is produced. The most widely used parametric representations are linear prediction coefficients (LPC) and mel-frequency cepstral coefficients (MFCC). Such a method has flaws. First, the positions of the processing windows are unrelated to the pitch periods. Therefore, pitch information and spectral information cannot be cleanly separated. Second, because the window duration is typically 2.5 times the shift time, a phoneme boundary is always crossed by two or three consecutive windows. In other words, a large number of frames cross phoneme boundaries; see FIG. 2(A).
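The fixed-window framing described above can be illustrated with a minimal sketch. It is not taken from any commercial system. The function names `hamming`, `fixed_frames`, and `frames_crossing`, the 16 kHz sampling rate, the 100 Hz test tone, and the phoneme boundary at sample 8050 are all assumptions chosen for illustration.

```python
import math


def hamming(n):
    """Return an n-point Hamming window as a list of floats."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def fixed_frames(signal, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Cut `signal` into overlapping Hamming-windowed frames.

    Returns (win, frames): the window length in samples and a list of
    (start_index, windowed_samples) pairs.
    """
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    w = hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        chunk = signal[start:start + win]
        frames.append((start, [s * c for s, c in zip(chunk, w)]))
    return win, frames


def frames_crossing(frames, win, boundary):
    """Start indices of frames that hold samples on both sides of `boundary`."""
    return [start for start, _ in frames if start < boundary < start + win]


# Hypothetical input: 1 s of a 100 Hz tone sampled at 16 kHz, with an
# assumed phoneme boundary at sample 8050 (not data from the source).
sample_rate = 16000
signal = [math.sin(2.0 * math.pi * 100.0 * t / sample_rate) for t in range(sample_rate)]
win, frames = fixed_frames(signal, sample_rate)
crossing = frames_crossing(frames, win, 8050)
```

Under these assumed values the window is 400 samples and the shift is 160 samples, so three consecutive frames straddle the boundary at sample 8050, while a boundary at sample 8000 is straddled by two. Because the frame positions are fixed by the shift alone, no placement of the boundary avoids the overlap.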
A better way of parameterizing the speech signal is to first segment it into frames that are synchronous with the pitch periods; see FIG. 2(B). For voiced sections of the speech signal, 211, each frame is a single pitch period, 213. For unvoiced sections, 212, the frames 214 are segmented for convenience, typically into frames approximately equal in length to the average pitch period of the voiced sections. The advantages of pitch-synchronous parameterization are as follows. First, the speech signal in a single frame represents only the spectrum, or timbre, of the speech, decoupled from pitch. Therefore, timbre information is cleanly separated from pitch information. Second, because a phoneme boundary must be either a boundary between a voiced section and an unvoiced section or a pitch-period boundary, each frame has a unique phoneme identity. Therefore, each parameter set has a unique phoneme identity, and the accuracy of speech recognition can be improved. (See Part E of Springer Handbook of Speech Processing, Springer Verlag, 2008.)
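The pitch-synchronous segmentation can be sketched in the same spirit. The sketch assumes the pitch marks of each voiced section are already available as sample indices; how they are detected is outside the scope of this passage. The function name `pitch_synchronous_frames`, the input layout, and the example marks are hypothetical, and the even splitting of unvoiced gaps is only one possible reading of "segmented for convenience".

```python
def pitch_synchronous_frames(n_samples, voiced_sections):
    """Segment [0, n_samples) into contiguous (start, end, voiced) frames.

    voiced_sections: sorted, non-overlapping lists of pitch-mark sample
    indices, one list per voiced section (an assumed input format).
    Consecutive marks delimit one pitch period, which becomes one frame.
    """
    periods = [b - a for marks in voiced_sections for a, b in zip(marks, marks[1:])]
    if not periods:
        raise ValueError("need at least one complete pitch period")
    avg = sum(periods) / len(periods)  # average pitch period in samples

    frames = []

    def fill_unvoiced(start, end):
        # Split an unvoiced gap evenly into frames of roughly `avg` samples.
        gap = end - start
        if gap <= 0:
            return
        k = max(1, round(gap / avg))
        edges = [start + round(i * gap / k) for i in range(k + 1)]
        frames.extend((a, b, False) for a, b in zip(edges, edges[1:]))

    cursor = 0
    for marks in voiced_sections:
        fill_unvoiced(cursor, marks[0])
        # One frame per pitch period inside the voiced section.
        frames.extend((a, b, True) for a, b in zip(marks, marks[1:]))
        cursor = marks[-1]
    fill_unvoiced(cursor, n_samples)
    return frames


# Hypothetical example: one voiced section with pitch marks at samples
# 200, 300, 410 and 500 inside a 1000-sample signal.
ps_frames = pitch_synchronous_frames(1000, [[200, 300, 410, 500]])
```

In this example the voiced/unvoiced transitions at samples 200 and 500 coincide with frame edges, so no frame contains samples from both sides of either transition. The same holds for any boundary that falls on a pitch mark.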