This invention relates to a voiced/unvoiced speech classifier, which can be used in, for example, speech recognition systems and/or speech coding systems.
A voiced sound is one generated by the vocal cords opening and closing at a constant rate, giving off pulses of air. The distance between the peaks of the pulses is known as the pitch period. An example of a voiced sound is the “i” sound as found in the word “pill”. An unvoiced sound is one generated by a single rush of air which results in turbulent air flow. Unvoiced sounds have no defined pitch. An example of an unvoiced sound is the “p” sound in the word “pill”. A combination of voiced and unvoiced sounds can thus be found in the word “pill”, as the “p” requires the single rush of air and the “ill” requires a series of air pulses.
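The difference between a periodic (voiced-like) and an aperiodic (unvoiced-like) frame can be illustrated with a short Python sketch. This is a generic illustration only, not the classifier described in this document: the synthetic sine wave and Gaussian noise stand in for real speech, and the sample rate, frame length, lag range and function names are all assumptions made for the example.

```python
import math
import random


def normalized_autocorrelation(frame, lag):
    """Autocorrelation of `frame` at `lag`, divided by the zero-lag value."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    overlap = len(frame) - lag
    return sum(frame[n] * frame[n + lag] for n in range(overlap)) / energy


def strongest_lag(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: normalized_autocorrelation(frame, lag))


# Assumed values: 8 kHz sampling, 400-sample frame, 100 Hz pitch (80 samples).
PITCH_PERIOD = 80
FRAME_LENGTH = 400
MIN_LAG, MAX_LAG = 40, 160  # hypothetical search range for the pitch period

# Synthetic stand-in for a voiced frame: a pure tone at the pitch frequency.
voiced_like = [math.sin(2.0 * math.pi * n / PITCH_PERIOD)
               for n in range(FRAME_LENGTH)]

# Synthetic stand-in for an unvoiced frame: seeded Gaussian noise.
rng = random.Random(0)
unvoiced_like = [rng.gauss(0.0, 1.0) for _ in range(FRAME_LENGTH)]

voiced_lag = strongest_lag(voiced_like, MIN_LAG, MAX_LAG)
voiced_peak = normalized_autocorrelation(voiced_like, voiced_lag)
unvoiced_lag = strongest_lag(unvoiced_like, MIN_LAG, MAX_LAG)
unvoiced_peak = normalized_autocorrelation(unvoiced_like, unvoiced_lag)

print("voiced-like frame: lag", voiced_lag, "peak", round(voiced_peak, 3))
print("unvoiced-like frame: lag", unvoiced_lag, "peak", round(unvoiced_peak, 3))
```

In this sketch the periodic frame shows a strong autocorrelation peak at a lag equal to its pitch period, whereas the noise frame shows no comparable peak, which is consistent with unvoiced sounds having no defined pitch.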
Although essentially all languages use voiced and unvoiced sounds, in tonal languages, the tone occurs only in the voiced segments of the words.
Speech recognition techniques are well known for recognising words spoken in English or other non-tonal languages. These known speech recognition techniques basically perform transformations on segments (frames) of speech, each segment having a plurality of speech samples, into sets of parameters sometimes called “feature vectors”. Each set of parameters is then passed through a set of models, which has been previously trained, to determine the probability that the set of parameters represents a particular known word or part-word, known as a phoneme, the most likely phoneme being output as the recognised speech. However, when these known techniques are applied to tonal languages, they generally fail to deal adequately with the tone-confusable words and phonemes that occur. Many Asian languages fall in this category of tonal languages. Unlike English, a tonal language is one in which tones have lexical meanings and have to be considered during recognition.
It is therefore important to be able to distinguish between the voiced and unvoiced speech segments to facilitate both speech recognition, especially of tonal languages, and speech coding, since the recognition and coding techniques can be substantially different for voiced and unvoiced speech segments and more efficient systems can be designed to deal with the two types in different ways.
The present invention therefore seeks to provide a voiced/unvoiced speech classifier, especially one that can be used in speech recognition systems or in speech coding systems.
Accordingly, in a first aspect, the invention provides a voiced/unvoiced speech classifier comprising an input terminal for receiving a digitized speech signal, a feature extractor having an input coupled to the input terminal and an output providing feature vectors of the input speech signal, a correlator having an input coupled to the output of the feature extractor and an output providing an indication of the degree of autocorrelation of the feature vectors of the input speech signal, and a decision maker having a first input coupled to the output of the correlator, a second input for receiving a threshold value and an output providing a signal indicative of whether a measure of the input speech signal at least partly based on the degree of autocorrelation of the feature vectors of the input speech signal is above or below the threshold value.
In a preferred embodiment, the voiced/unvoiced speech classifier further comprises a Signal to Noise Ratio (SNR) calculator having an input coupled to the input terminal and an output providing a SNR signal, and a threshold value adjuster having an input coupled to the output of the SNR calculator and an output coupled to the second input of the comparator to provide thereto the threshold value adjusted according to the SNR signal.
Preferably, the measure of the input speech signal is based at least partly on the degree of autocorrelation of the input speech signal and on the energy of the input speech signal.
The voiced/unvoiced speech classifier preferably further comprises a signal energy calculator having an input coupled to the input terminal and an output providing an indication of the energy of the input speech signal, and a combiner having a first input coupled to the output of the correlator, an output coupled to the first input of the comparator and a second input coupled to the output of the signal energy calculator providing the measure of the input speech signal.
The measure (M) of the input speech signal is preferably provided by:
M = α₁E + α₂A,
where α₁ and α₂ are predetermined constants, E is the energy of the input speech signal and A is the degree of autocorrelation of the feature vectors of the input speech signal. α₁ preferably has a value between 0.1 and 0.5, most preferably 0.3, and α₂ preferably has a value between 0.5 and 0.9, most preferably 0.7.
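By way of illustration only, the weighted measure and its comparison against a threshold might be sketched in Python as follows. The function names, the assumption that E and A have been scaled to the range 0 to 1, the example frame values and the threshold of 0.5 are all hypothetical and are not taken from the description above; only the default weights of 0.3 and 0.7 come from the text.

```python
def combined_measure(energy, autocorrelation, alpha1=0.3, alpha2=0.7):
    """Return M = alpha1 * E + alpha2 * A.

    The defaults are the most preferred weights stated in the text.
    """
    return alpha1 * energy + alpha2 * autocorrelation


def is_above_threshold(measure, threshold):
    """Report whether the measure M lies above the threshold value."""
    return measure > threshold


# Hypothetical frame values, assumed to be scaled to the range 0..1.
strong_frame = combined_measure(energy=0.8, autocorrelation=0.9)
weak_frame = combined_measure(energy=0.2, autocorrelation=0.1)

THRESHOLD = 0.5  # assumed value, chosen only for this example

print(is_above_threshold(strong_frame, THRESHOLD))
print(is_above_threshold(weak_frame, THRESHOLD))
```

Because the autocorrelation weight is the larger of the two, a frame with high periodicity dominates the measure even when its energy is moderate.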
According to a second aspect, the invention provides a voiced/unvoiced speech classifier comprising an input terminal for receiving a digitized speech signal, a speech segmentor having an input coupled to the input terminal for segmenting the input digitized speech waveform into frames of speech provided at an output of the speech segmentor, a band-pass filter having an input coupled to the output of the speech segmentor for filtering the frames of speech and an output for providing filtered frames of speech, a relative energy generator having an input coupled to the output of the band-pass filter for generating a relative energy value for each filtered frame of speech and an output, a decision parameter generator comprising an autocorrelation calculator having an input coupled to the output of the band-pass filter for generating a decision parameter at an output of the decision parameter generator based on an autocorrelation function for the filtered frames of speech, and a comparator having a first input coupled to the output of the relative energy generator, a second input coupled to the output of the decision parameter generator and an output providing a signal indicative of whether a frame of speech is voiced speech or unvoiced speech depending on a comparison of the decision parameter and the relative energy value for each filtered frame of speech.
Preferably, the band-pass filter has a bandwidth covering a majority of pitch frequencies of a human voice.
In a preferred embodiment, the relative energy generator comprises a first energy calculator having an input coupled to the band-pass filter and an output for providing an energy value for each filtered frame of speech, a second energy calculator having an input coupled to the speech segmentor and an output for providing an energy value for each unfiltered frame of speech, and a relative energy value calculator having a first input coupled to the output of the first energy calculator, a second input coupled to the output of the second energy calculator, and an output for providing a relative energy value for each frame of speech based on the energy values for the filtered and unfiltered frame of speech.
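One possible reading of the relative energy generator is sketched below in Python. The text states only that the relative energy value is based on the energy values of the filtered and unfiltered frames; taking it as the ratio of the two, computing energy as a sum of squared samples, and the function names are assumptions made for this sketch. The band-pass filtering itself is assumed to have been performed elsewhere.

```python
def frame_energy(frame):
    """Energy of one frame, taken here as the sum of squared samples."""
    return sum(s * s for s in frame)


def relative_energy(filtered_frame, unfiltered_frame, eps=1e-12):
    """Ratio of filtered-frame energy to unfiltered-frame energy.

    The ratio form is an assumption. `eps` avoids division by zero
    for silent frames.
    """
    return frame_energy(filtered_frame) / (frame_energy(unfiltered_frame) + eps)


# Hypothetical frames: the filtered frame retains part of the original energy.
unfiltered = [2.0, -2.0, 2.0, -2.0]
filtered = [1.0, -1.0, 1.0, -1.0]

ratio = relative_energy(filtered, unfiltered)
print(ratio)
```

Under this reading, a frame whose energy lies mostly within the pass band yields a ratio close to 1, and a frame whose energy lies mostly outside it yields a ratio close to 0.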
The voiced/unvoiced speech classifier preferably further comprises a threshold generator having an input coupled to the output of the relative energy generator for providing an adjusted threshold at an output of the threshold generator. The threshold generator preferably comprises a threshold calculation unit having an input coupled to the output of the relative energy generator for calculating an initial threshold from the average relative energy value of a first section of input speech including a plurality of frames of speech. Preferably, the threshold generator further comprises a normalized relative energy calculator having a first input coupled to the output of the relative energy generator, a second input coupled to an output of the threshold calculation unit, and an output coupled to the comparator for providing a normalized relative energy value.
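The threshold generator described above might be sketched in Python as follows. Only the use of the average relative energy over a first section of frames comes from the text; normalizing by dividing each relative energy value by the initial threshold, the number of initial frames, the example values and the function names are assumptions made for illustration.

```python
def initial_threshold(relative_energies, n_initial_frames):
    """Average relative energy over the first `n_initial_frames` frames."""
    first_section = relative_energies[:n_initial_frames]
    if not first_section:
        raise ValueError("at least one initial frame is required")
    return sum(first_section) / len(first_section)


def normalized_relative_energy(relative_energy_value, threshold, eps=1e-12):
    """Relative energy divided by the initial threshold (assumed form)."""
    return relative_energy_value / (threshold + eps)


# Hypothetical per-frame relative energy values.
relative_energies = [0.2, 0.4, 0.9, 0.1]

threshold = initial_threshold(relative_energies, n_initial_frames=2)
normalized = [normalized_relative_energy(r, threshold)
              for r in relative_energies]

print(threshold)
print(normalized)
```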
In one preferred embodiment, the decision parameter generator further comprises a pitch frequency estimator having an input coupled to the output of the band-pass filter and an output for providing an estimated pitch frequency index, and a decision parameter calculation unit having a first input coupled to an output of the autocorrelation calculator, a second input coupled to the input of the pitch frequency estimator, and an output for providing the decision parameter based on the autocorrelation function and the estimated pitch frequency index.
According to a third aspect, the invention provides a speech classifier comprising an input terminal for receiving input speech samples, an energy calculator having an input coupled to the input terminal for calculating the energy of a frame of speech samples to provide an energy value for each frame of speech samples at an output thereof, an autocorrelator having an input coupled to the output of the energy calculator for correlating the energy value of a frame of speech samples to provide correlation values indicating a periodicity of the speech samples at an output thereof, a parameter generator having a first input coupled to the output of the energy calculator, a second input coupled to the output of the autocorrelator, and an output for providing at least one parameter based on the energy value and the correlation values indicative of the periodicity and the energy of a frame of speech samples, and a comparator having an input coupled to the output of the parameter generator for comparing the parameter with at least one threshold value to provide an indication, at an output of the classifier, of whether each frame of speech samples is voiced speech or not.
Preferably, the speech classifier further comprises a threshold adjuster having an input coupled to the output of the energy calculator and an output for providing the at least one threshold value adjusted according to a measure of ambient noise level in the frame of speech samples.