The present invention relates to the field of speech recognition. More specifically, the present invention relates to a method and system for Chinese speech pitch extraction in speech recognition using local optimized dynamic programming pitch path-tracking.
Pitch extraction is an essential component in a variety of speech processing systems. Besides providing valuable insights into the nature of the excitation source for speech production, the pitch contour of an utterance is useful for recognizing a speaker, and is required in almost all speech analysis-synthesis systems. Because of the importance of pitch extraction, a wide variety of methods and systems for pitch extraction have been proposed in the speech recognition field.
Basically, a method or system for pitch extraction makes a voiced/unvoiced decision and, during periods of voiced speech, provides a measurement of the pitch period. Methods and systems for pitch extraction can be roughly divided into the following three broad categories:
1. A group which utilizes principally the time-domain properties of speech signals.
2. A group which utilizes principally the frequency-domain properties of speech signals.
3. A group which utilizes both the time and frequency domain properties of speech signals.
Time-domain pitch extractors operate directly on the speech waveform to estimate the pitch period. For these pitch extractors, the measurements most often made are peak and valley measurements, zero-crossing measurements, and autocorrelation measurements. The basic assumption made in all these cases is that if a quasi-periodic signal has been suitably processed to minimize the effect of the formant structure, then simple time-domain measurements will provide good estimates of the period.
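As an illustration of the time-domain approach, the following minimal sketch estimates the pitch period of a single frame from the strongest autocorrelation peak. The function names, the 75 Hz to 500 Hz search range, and the synthetic test signal are assumptions made for this example and are not part of the disclosure.

```python
import math

def autocorrelation(frame, lag):
    """Raw autocorrelation of `frame` at a non-negative `lag`."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

def estimate_pitch_period(frame, sample_rate, f_min=75.0, f_max=500.0):
    """Return (period_in_samples, frequency_in_hz) for the lag with the
    largest autocorrelation inside the allowed pitch range."""
    lag_min = int(sample_rate / f_max)   # shortest period considered
    lag_max = int(sample_rate / f_min)   # longest period considered
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return best_lag, sample_rate / best_lag

sample_rate = 8000
# Synthetic quasi-periodic signal: 200 Hz fundamental plus its second harmonic.
frame = [math.sin(2 * math.pi * 200 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
         for n in range(400)]
period, f0 = estimate_pitch_period(frame, sample_rate)
print(period, f0)
```

On this synthetic frame the selected lag is 40 samples, which corresponds to 200 Hz at the assumed 8 kHz sampling rate. Real speech would normally require the pre-processing mentioned above before such a simple measurement becomes reliable.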
The class of frequency-domain pitch extractors uses the property that if the signal is periodic in the time domain, then the frequency spectrum of the signal will consist of a series of impulses at the fundamental frequency and its harmonics. Thus, simple measurements can be made on the frequency spectrum of the signal to estimate the period of the signal.
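A simple frequency-domain measurement of this kind might be sketched as follows. The naive discrete Fourier transform, the function names, and the rule of taking the lowest spectral line that reaches half of the largest magnitude are all illustrative assumptions, not a method taken from the disclosure.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. len(frame)//2 - 1."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def lowest_strong_peak_hz(frame, sample_rate, relative_threshold=0.5):
    """Frequency of the lowest non-DC bin whose magnitude reaches
    `relative_threshold` times the largest magnitude (hypothetical rule)."""
    spectrum = magnitude_spectrum(frame)
    floor = relative_threshold * max(spectrum[1:])
    if floor == 0.0:
        return None                       # silent frame: no spectral lines
    for k in range(1, len(spectrum)):
        if spectrum[k] >= floor:
            return k * sample_rate / len(frame)
    return None

sample_rate = 8000
# Periodic test signal whose second harmonic is stronger than the fundamental.
frame = [0.6 * math.sin(2 * math.pi * 200 * n / sample_rate)
         + 1.0 * math.sin(2 * math.pi * 400 * n / sample_rate)
         + 0.4 * math.sin(2 * math.pi * 600 * n / sample_rate)
         for n in range(400)]
f0_hz = lowest_strong_peak_hz(frame, sample_rate)
print(f0_hz)
```

The test signal is chosen so that the strongest spectral line is the second harmonic, which is why the sketch looks for the lowest strong line rather than the largest one.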
The class of hybrid pitch extractors incorporates features of both the time-domain and the frequency-domain approaches to pitch extraction. For example, a hybrid extractor might use frequency-domain techniques to provide a spectrally flattened time waveform, and then use autocorrelation measurements to estimate the pitch period.
Though the above conventional methods and systems for pitch extraction are accurate and reliable, they are only suitable for feature analysis, and not for speech recognition in real time. In addition, due to the differences between most European languages and the Chinese language, there are some special aspects to be taken into account for Chinese speech pitch extraction.
In contrast to most European languages, Mandarin Chinese uses tones for lexical distinction. A tone occurs over the duration of a syllable. There exist five lexical tones that play very important roles in meaning disambiguation. The direct acoustic representation of these tones is the pitch contour variation pattern illustrated in FIG. 1, and the most direct acoustic manifestation of tone is the fundamental frequency. Thus, for Chinese speech pitch extraction, the effect of the fundamental frequency must be taken into account.
Paul Boersma's article entitled "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," IFA Proceedings 17, 1993, pp. 97-110, gives a detailed and advanced pitch extraction method based on the processing of fundamental frequency. The main concept of Paul Boersma's article includes the anti-bias auto-correlation and Viterbi algorithm (Dynamic Programming) technology, which integrates the voiced/unvoiced decision, pitch candidate estimator, and best path finding into one pass and can efficiently improve the extraction accuracy.
However, the global optimized dynamic programming pitch path-tracking of Paul Boersma is not suitable for practical applications because of its time delay. The time delay of pitch extraction depends on two factors: one is the CPU computing power and the other is the structure of the algorithm. When, as in the algorithm of Paul Boersma, pitch extraction in the current window (frame) depends on later windows (frames), the system has a structural response delay regardless of the CPU speed. For example, in the algorithm of Paul Boersma, if the speech length is L seconds, then the structural delay time is L seconds. Such a delay is often unacceptable for a real-time speech recognition application. Therefore, it is apparent to one with ordinary skill in the art that an improved method and system is needed.
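The structural delay can be illustrated with a small calculation. The 10 ms frame shift and the five-frame look-ahead used below are hypothetical values chosen only for illustration; the function name is likewise an assumption.

```python
def structural_delay_seconds(utterance_seconds, frame_shift_seconds,
                             lookahead_frames=None):
    """Time that must elapse before the first frame's pitch can be output.

    lookahead_frames=None models global path-tracking, where the decision
    for every frame waits for the last frame of the utterance.
    """
    if lookahead_frames is None:
        return utterance_seconds
    return min(utterance_seconds, lookahead_frames * frame_shift_seconds)

# A 3-second utterance analysed with an assumed 10 ms frame shift.
global_delay = structural_delay_seconds(3.0, 0.010)
local_delay = structural_delay_seconds(3.0, 0.010, lookahead_frames=5)
print(global_delay, local_delay)
```

Under these assumptions the global tracker cannot respond until the full 3 seconds have been received, whereas a tracker that needs only five later frames responds after about 50 ms, independent of the utterance length.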
The present invention discloses methods and apparatuses for Chinese speech pitch extraction using local optimized dynamic programming pitch path-tracking to meet the low time-delay requirements for a real-time speech recognition application.
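One way to picture path-tracking with a bounded delay is the sketch below. It is not the claimed algorithm: the cost terms, the constants 0.35 and 0.14, the beam width, and the look-ahead of two frames are assumptions chosen for illustration. Each frame supplies a list of (frequency, strength) candidates, with None standing for the unvoiced candidate, and every frame is assumed to contain at least that candidate.

```python
import math

def transition_cost(prev_f0, f0, octave_jump_cost=0.35, voicing_switch_cost=0.14):
    """Cost of moving between candidates in adjacent frames (None = unvoiced)."""
    if prev_f0 is None and f0 is None:
        return 0.0
    if prev_f0 is None or f0 is None:
        return voicing_switch_cost
    return octave_jump_cost * abs(math.log2(f0 / prev_f0))

def track_pitch(frames, lookahead=2, beam_width=4):
    """Yield one pitch decision per frame, each at most `lookahead` frames late."""
    paths = [(0.0, [])]            # (accumulated cost, chosen f0 per frame)
    emitted = 0
    for candidates in frames:
        extended = []
        for cost, history in paths:
            for f0, strength in candidates:
                step = -strength   # stronger candidates cost less
                if history:
                    step += transition_cost(history[-1], f0)
                extended.append((cost + step, history + [f0]))
        extended.sort(key=lambda path: path[0])
        paths = extended[:beam_width]          # keep only the least-cost paths
        best_history = paths[0][1]
        while len(best_history) - emitted > lookahead:
            yield best_history[emitted]        # commit a frame that is old enough
            emitted += 1
    for f0 in paths[0][1][emitted:]:           # flush the tail of the utterance
        yield f0

frames = [
    [(None, 0.6), (100.0, 0.2)],
    [(None, 0.1), (200.0, 0.9), (100.0, 0.8)],
    [(None, 0.1), (100.0, 0.9), (200.0, 0.88)],  # octave-error peak is strongest
    [(None, 0.1), (200.0, 0.9), (400.0, 0.5)],
    [(None, 0.7), (200.0, 0.2)],
]
contour = list(track_pitch(frames))
print(contour)
```

In the third frame the 100 Hz candidate is slightly stronger than the 200 Hz candidate, yet the tracked contour stays at 200 Hz because the octave jump is penalized. Each decision is committed after only two further frames have been seen rather than at the end of the utterance.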
In one aspect of the invention, an exemplary method includes:
pre-computing an anti-bias auto-correlation of a Hamming window function; for at least one frame, saving a first candidate as an unvoiced candidate, and detecting other voiced candidates from the anti-bias auto-correlation function; calculating a cost value for a pitch path according to a voiced/unvoiced intensity function based on the unvoiced and voiced candidates; saving a predetermined number of least-cost paths; and outputting at least a portion of contiguous frames with low time delay.
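The candidate-detection steps can be sketched as follows, under the assumption that "anti-bias" means dividing the normalized autocorrelation of the windowed frame by the normalized autocorrelation of the window itself, as in the approach described in Boersma's article. The function names, the fixed unvoiced strength of 0.3, and the limit of three voiced candidates are hypothetical.

```python
import math

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def normalized_autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag, scaled so that lag 0 is 1."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
            for lag in range(max_lag + 1)]

def frame_candidates(frame, window, window_acf, sample_rate, lag_min, lag_max,
                     unvoiced_strength=0.3, max_voiced=3):
    """Return [(None, strength)] followed by (f0_hz, strength) voiced candidates."""
    windowed = [s * w for s, w in zip(frame, window)]
    acf = normalized_autocorr(windowed, lag_max + 1)
    # Anti-bias step: divide out the taper that the window itself imposes.
    corrected = [a / w for a, w in zip(acf, window_acf)]
    peaks = [(corrected[lag], lag) for lag in range(lag_min, lag_max + 1)
             if corrected[lag - 1] < corrected[lag] >= corrected[lag + 1]]
    peaks.sort(reverse=True)                     # strongest peaks first
    candidates = [(None, unvoiced_strength)]     # first candidate: unvoiced
    candidates += [(sample_rate / lag, strength)
                   for strength, lag in peaks[:max_voiced]]
    return candidates

sample_rate, frame_len = 8000, 320
lag_min, lag_max = 16, 106                       # roughly 500 Hz down to 75 Hz
window = hamming(frame_len)
window_acf = normalized_autocorr(window, lag_max + 1)   # pre-computed once
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(frame_len)]
candidates = frame_candidates(frame, window, window_acf,
                              sample_rate, lag_min, lag_max)
print(candidates)
```

For the 200 Hz test tone, the voiced candidates fall at 200 Hz and at its sub-harmonic 100 Hz, both with strengths close to 1. Choosing between such candidates is left to the path cost in the subsequent step.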
In one particular embodiment, the method includes removing global and local DC components from the speech signal. In another embodiment, the method includes segmenting the speech signal into a plurality of frames, and for each frame, calculating spectrum, power spectrum, and auto-correlation. In a further embodiment, the method includes performing an MFCC extraction.
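The DC removal and segmentation steps might look like the sketch below. It assumes that the global DC component is the mean of the whole signal and that the local DC component is the mean of each frame; the function names, frame length, and frame shift are illustrative and are not specified in the text.

```python
def remove_dc(samples):
    """Subtract the mean value (DC component) from a sequence of samples."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def split_into_frames(samples, frame_len, frame_shift):
    """Cut the signal into overlapping frames of equal length."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, frame_shift)]

signal = [5, 6, 7, 8, 9, 10, 11, 12]             # toy signal with a large offset
centered = remove_dc(signal)                      # global DC removal
frames = [remove_dc(f)                            # local DC removal per frame
          for f in split_into_frames(centered, frame_len=4, frame_shift=2)]
print(frames)
```

After both steps each frame of the toy ramp is centered on zero, so the subsequent spectrum and auto-correlation calculations are not dominated by a constant offset.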
The present invention includes apparatuses which perform these methods, and machine-readable media which, when executed on a data processing system, cause the system to perform these methods. Other features of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.