The present invention relates generally to speech processing systems, and in particular to apparatus and methods for determining pitch synchronous frames.
Pitch detection is an important component of a variety of speech processing systems such as automatic speech recognition (ASR) systems, speech reconstruction systems for the hearing-impaired, and speech analysis-synthesis systems like vocoders. Speech recognition and synthesis involve a complicated process of extracting and identifying the individual sounds that make up human speech. Wide variations between speakers' dialect, intonation, rhythm, stress, volume, and pitch, coupled with extraneous background noises, make speech processing difficult.
Many conventional speech processing systems divide an audio speech signal into signal segments and extract speech characteristics, or features, from each segment. Vocoders, for example, analyze speech by first segmenting the speech and then determining excitation parameters, the voiced/unvoiced decision, and the pitch period for each segment. These features are then used to reconstruct and synthesize speech. Because pitch is an important speech parameter, inaccurate estimation of the pitch will often result in poor quality synthesized speech. In an ASR system, features are used to estimate the probabilities of phonemes, the speech units that form words and sentences. Accurate estimation of the pitch period decreases the amount of noise in the extracted features, and therefore increases the probability of selecting the right phoneme.
The accuracy of speech processing systems therefore depends in large part on the accuracy of the pitch measurement and the pitch period. Pitch period is defined as the elapsed time between two successive laryngeal pulses. There are several difficulties in accurately determining pitch and pitch period. First, a speech waveform is not perfectly periodic, varying both in period and in the detailed structure of the waveform within a period, making exact periods difficult to detect. Second, fundamental frequencies vary widely not just between speakers but for a single speaker due to vocal tract abnormalities that produce irregular glottal excitations, making pitch difficult to determine. Third, there is no uniformity in the way the beginnings and endings of pitch periods are determined. The measurement of the period may begin and end at any arbitrary point within the glottal cycle; however, an easily recognizable extreme, such as a zero crossing or wave peak, is frequently chosen. Wave peaks, however, may be altered by formant structure, and zero crossings are sensitive to formants and noise. Fourth, the transitions between low-level voiced speech and unvoiced speech segments may be very subtle and difficult to detect.
Despite these difficulties, several pitch detection algorithms (PDAs) have been developed. A PDA is a method or device that identifies voiced and unvoiced areas of a speech signal and provides a measurement of the pitch period. Examples of conventional PDAs include the cepstrum method (CEP), modified autocorrelation using clipping (AUTOC), simplified inverse filtering technique (SIFT), data reduction method (DARD), parallel processing time-domain method (PPROC), spectral flattening, linear predictive coding (LPC), and average magnitude difference function (AMDF). Although many of the conventional PDAs work well under ideal conditions, none of them works perfectly for every voice and set of environmental conditions. PDAs are described and compared in Rabiner et al., "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 5, October 1976, pp. 399-418.
PDAs may be grouped into categories depending on whether they rely principally on time-domain (long-term) properties of the speech signal, frequency-domain (short-term) properties, or a hybrid of both. Time-domain PDAs may be further categorized into fundamental-harmonic extraction and temporal structure analysis PDAs. Fundamental-harmonic extraction PDAs preprocess the signal, for example by low-pass filtering, to attenuate frequencies above approximately 300 Hz. After preprocessing, pitch periods are extracted either between zero crossings, at threshold crossings, or with reference to multiple thresholds.
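By way of illustration only, fundamental-harmonic extraction of the kind just described may be sketched as follows. The sketch assumes a first-order recursive low-pass filter and measures periods between successive rising zero crossings; the filter type, the function names, and the synthetic test signal are assumptions made for this example and are not taken from any particular PDA.

```python
import math

def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """First-order recursive low-pass; a stand-in for any suitable filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered, state = [], 0.0
    for sample in signal:
        state += alpha * (sample - state)
        filtered.append(state)
    return filtered

def rising_zero_crossings(signal):
    """Fractional sample positions where the signal goes from negative to non-negative."""
    positions = []
    for i in range(1, len(signal)):
        previous, current = signal[i - 1], signal[i]
        if previous < 0.0 <= current:
            # Linear interpolation between the two samples around the crossing.
            positions.append(i - 1 + previous / (previous - current))
    return positions

def pitch_periods(signal, sample_rate, cutoff_hz=300.0):
    """Period estimates in seconds, one per pair of successive rising crossings."""
    filtered = one_pole_lowpass(signal, cutoff_hz, sample_rate)
    marks = rising_zero_crossings(filtered)
    return [(later - earlier) / sample_rate for earlier, later in zip(marks, marks[1:])]

# Synthetic input (an assumption): 100 Hz fundamental plus a weaker 1500 Hz component.
sample_rate = 8000
signal = [
    math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
    + 0.2 * math.sin(2.0 * math.pi * 1500.0 * n / sample_rate)
    for n in range(1600)
]
periods = pitch_periods(signal, sample_rate)
```

For this perfectly periodic input the estimated periods should all be close to 10 ms. Real speech is not perfectly periodic, so successive estimates would be expected to vary.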
In temporal structure analysis, the signal envelope is modeled and searched for discontinuities which mark the beginning of individual periods ("envelope analysis"), or features are extracted from the signal to define anchor points from which periodicity is derived by an iterative process of selection and elimination ("signal analysis").
Short-term analysis PDAs, such as the autocorrelation, cepstral, harmonic compression, and maximum likelihood algorithms, perform a sequence of events similar to that of time-domain PDAs. An optional pre-processing step such as moderate low-pass filtering, approximation of the inverse filter, or adaptive center clipping is performed. The signal is then divided into short segments which include several pitch periods. A short-term transform is performed on each segment. Using autocorrelation, the input signal is compared with itself at some delay, or lag. If the signal is periodic, there will be a high degree of correlation when the lag equals one period or a multiple of the period. The cepstrum PDA transforms the log spectrum back into the time domain, generating a large peak at the pitch period duration T0. Using harmonic compression, the log power spectrum is compressed along the frequency axis by integer factors. Adding the compressed spectra causes the harmonics to contribute coherently to a distinct peak at the fundamental frequency. The maximum likelihood procedure is used to find the periodic estimate, for a given trial period, which is most likely to represent the original periodic component. In each of these cases, peaks generated by the transform are detected and labeled as a pitch estimate for each segment. These PDAs are unable to track instantaneous periods since the phase relationship with the original signal is lost through short-term transformation.
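The autocorrelation behavior described above may be illustrated with the following minimal sketch, which scores each candidate lag by a normalized autocorrelation and takes the best-scoring lag as the period. Because multiples of the period also correlate strongly, the sketch additionally checks submultiples of the best lag. The search range, the tolerance, the submultiple check, and all function names are assumptions chosen for this example and do not reproduce any specific published PDA.

```python
import math

def normalized_autocorrelation(segment, lag):
    """Correlation between the segment and a copy of itself delayed by `lag` samples."""
    head = segment[: len(segment) - lag]
    tail = segment[lag:]
    numerator = sum(x * y for x, y in zip(head, tail))
    denominator = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
    return numerator / denominator if denominator > 0.0 else 0.0

def estimate_period(segment, sample_rate, min_hz=50.0, max_hz=400.0, tolerance=0.05):
    """Return a pitch period estimate in seconds for one short segment."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(segment) - 1)
    scores = {
        lag: normalized_autocorrelation(segment, lag)
        for lag in range(min_lag, max_lag + 1)
    }
    best_lag = max(scores, key=scores.get)
    # A multiple of the true period scores nearly as high as the period itself,
    # so prefer a shorter submultiple when it scores almost as well.
    for divisor in (4, 3, 2):
        candidate = round(best_lag / divisor)
        if candidate >= min_lag and scores[candidate] >= scores[best_lag] - tolerance:
            return candidate / sample_rate
    return best_lag / sample_rate

# Synthetic segment (an assumption): 100 Hz fundamental plus its second harmonic.
sample_rate = 8000
segment = [
    math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
    + 0.5 * math.sin(2.0 * math.pi * 200.0 * n / sample_rate)
    for n in range(480)
]
period = estimate_period(segment, sample_rate)
score_at_period = normalized_autocorrelation(segment, 80)
score_at_double = normalized_autocorrelation(segment, 160)
score_at_half = normalized_autocorrelation(segment, 40)
```

At a sample rate of 8000 Hz the 100 Hz fundamental corresponds to a lag of 80 samples. The scores at lags 80 and 160 are both near one, while the score at lag 40 is negative, which is why a lag search alone can confuse a period with its multiples.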
The present invention discloses methods and apparatus for dividing a speech signal into frames in synchrony with the pitch of the speech signal. In a method consistent with the present invention, an optimal filter cutoff frequency is determined and the speech signal is filtered based on the optimal filter cutoff frequency to obtain a filtered signal that approximates a fundamental frequency. The filtered speech signal is segmented and voiced periods are determined. The speech signal is divided into frames based on the voiced periods. Also consistent with the present invention is a computer-readable medium containing instructions for controlling a computer system to perform a method for dividing a speech signal into frames in synchrony with the pitch of the speech signal as disclosed herein.
Consistent with the present invention, a speech processing apparatus for dividing a speech signal into frames comprises means for determining an optimal filter cutoff frequency; and means for filtering the speech signal based on the optimal filter cutoff frequency to obtain a filtered signal that approximates a fundamental frequency. The apparatus also comprises means for segmenting the filtered speech signal into a plurality of speech segments; and means for determining which speech segments are voiced periods. The speech signal is divided into frames based on the voiced periods.
A speech recognition system consistent with the present invention comprises an input device for receiving a speech signal; a first processor for determining an optimal filter cutoff frequency; a filter for filtering the speech signal based on the optimal filter cutoff frequency to obtain a filtered signal that approximates a fundamental frequency; a segmentation module for segmenting the filtered speech signal into a plurality of speech segments; a second processor for determining which speech segments are voiced periods; and a third processor for dividing the speech signal into frames based on the voiced periods.
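For orientation only, the sequence of steps summarized above (filter at a chosen cutoff, segment the filtered signal, identify voiced periods, and frame the speech signal accordingly) may be sketched as follows. This summary does not specify how the optimal cutoff is determined, how segments are delimited, or how voicing is decided, so the sketch takes the cutoff as a given parameter, segments at rising zero crossings, and treats a segment as voiced when its duration and energy fall within fixed limits. All of these choices, together with every name and threshold below, are assumptions made for this example and should not be read as the disclosed method.

```python
import math

def lowpass(signal, cutoff_hz, sample_rate):
    """First-order recursive low-pass (assumed filter type)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered, state = [], 0.0
    for sample in signal:
        state += alpha * (sample - state)
        filtered.append(state)
    return filtered

def rising_crossing_indices(signal):
    """Sample indices at which the signal goes from negative to non-negative."""
    return [i for i in range(1, len(signal)) if signal[i - 1] < 0.0 <= signal[i]]

def is_voiced(segment, sample_rate, min_hz=50.0, max_hz=400.0, min_rms=0.05):
    """Assumed voicing test: plausible pitch-period duration and enough energy."""
    duration = len(segment) / sample_rate
    rms = math.sqrt(sum(x * x for x in segment) / len(segment))
    return 1.0 / max_hz <= duration <= 1.0 / min_hz and rms >= min_rms

def pitch_synchronous_frames(signal, sample_rate, cutoff_hz):
    """Return (start, end) sample indices of frames aligned to voiced periods."""
    filtered = lowpass(signal, cutoff_hz, sample_rate)
    marks = rising_crossing_indices(filtered)
    return [
        (start, end)
        for start, end in zip(marks, marks[1:])
        if is_voiced(filtered[start:end], sample_rate)
    ]

# Synthetic input (an assumption): 0.2 s of silence followed by 0.2 s of a 125 Hz tone.
sample_rate = 8000
silence = [0.0] * 1600
tone = [math.sin(2.0 * math.pi * 125.0 * n / sample_rate) for n in range(1600)]
signal = silence + tone
frames = pitch_synchronous_frames(signal, sample_rate, cutoff_hz=300.0)
```

With this input no frames are produced during the silence, and each frame in the tone spans one 8 ms cycle (64 samples at 8000 Hz). The returned index pairs would then be used to slice the original, unfiltered speech signal.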