1. Technical Field
A preferred embodiment of the present invention generally relates to speech processing methods and systems (i.e., systems that accept human voice as input). More specifically, the invention is directed to speech processing to be performed in the context of speech or speaker recognition.
2. Description of Related Art
Almost every speech processing system uses some form of frame-based processing, in which speech signals are divided according to intervals of time called frames. This includes speech recognition systems (which are used to identify spoken words in an audio signal), speaker recognition systems (which are used to ascertain the identity of a speaker), and other systems that use speech as input, such as speech-to-speech translators, stress detectors, etc. All of the above systems typically employ digitally-sampled speech signals divided into frames having a fixed frame size. By fixed frame size, it is meant that each frame contains a fixed number of digital samples of the input speech (obtained from an audio signal via an analog-to-digital converter, for example).
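By way of illustration only, fixed-size framing as described above may be sketched as follows. The function name split_into_frames, the 16 kHz sampling rate, and the 10 ms frame duration are assumptions chosen for this example and are not taken from any particular system; the sketch also assumes non-overlapping frames and discards trailing samples that do not fill a whole frame.

```python
def split_into_frames(samples, frame_size):
    """Split a sequence of digital samples into consecutive fixed-size frames.

    Assumption for this sketch: frames do not overlap, and trailing samples
    that do not fill a whole frame are dropped.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


# Hypothetical parameters: 16 kHz sampling and 10 ms frames.
sample_rate_hz = 16000
frame_size = sample_rate_hz * 10 // 1000  # number of samples per frame

signal = [0] * 1000  # placeholder for digitized speech samples
frames = split_into_frames(signal, frame_size)
```

Every frame produced this way contains the same number of samples regardless of what the speaker's voice is doing within that interval.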
Dividing speech into frames allows the speech signal to be analyzed frame-by-frame in order to match a particular frame with the phoneme or portion of a phoneme contained within the frame. Although such a frame-by-frame analysis does reduce the otherwise overwhelming computational complexity of the analysis, in some ways the frame-based approach oversimplifies the analysis, at least with respect to real human speakers.
Voiced speech is speech in which the vocal cords vibrate. One of ordinary skill in the art will recognize that some speech sounds constitute voiced speech (like the sound of the letter “v” in English or any vowel sound), while others (such as the letter “s” in English) are unvoiced (i.e., are emitted without vocal cord vibration). The human voice, just like a musical instrument, emits tones by generating periodic vibrations that have a fundamental frequency or pitch. In voiced human speech, this frequency varies according to the speaker, context, emotion, and other factors. In these periodic tones, a single period of vocal cord vibration is called a “pitch cycle.”
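The relationship between fundamental frequency and pitch cycle may be illustrated with a minimal sketch, which is not the method of any particular system. It assumes a synthetic, perfectly periodic tone in place of real voiced speech, an 8 kHz sampling rate, and a simple autocorrelation search over a hypothetical range of 80 Hz to 400 Hz; the function name estimate_pitch_period is likewise hypothetical.

```python
import math


def estimate_pitch_period(samples, min_lag, max_lag):
    """Return the lag (in samples) within [min_lag, max_lag] that has the
    highest unnormalized autocorrelation. For a periodic signal this lag
    approximates the length of one pitch cycle."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Assumed values for illustration only.
sample_rate_hz = 8000
true_f0_hz = 100.0

# Synthetic stand-in for voiced speech: 0.1 s of a pure 100 Hz tone.
tone = [math.sin(2 * math.pi * true_f0_hz * n / sample_rate_hz)
        for n in range(800)]

# Search lags corresponding to 400 Hz (short cycle) down to 80 Hz (long cycle).
min_lag = sample_rate_hz // 400
max_lag = sample_rate_hz // 80

period_samples = estimate_pitch_period(tone, min_lag, max_lag)
estimated_f0_hz = sample_rate_hz / period_samples
```

Because the length of a pitch cycle in samples equals the sampling rate divided by the fundamental frequency, a change in the speaker's pitch changes the number of samples that one cycle occupies.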
Existing speech and speaker recognition systems generally do not take into account the speaker's actual current fundamental frequency. It would be advantageous if there were a technique that would allow speech recognition systems to account for variations in the speaker's pitch without requiring a burdensome amount of computational overhead.