Speech is an acoustic signal produced by the human vocal apparatus. Physically, speech is a longitudinal sound pressure wave. A microphone converts the sound pressure wave into an electrical signal. The electrical signal can be converted from the analog domain to the digital domain by sampling at discrete time intervals. Such a digitized speech signal can be stored in digital format.
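As a minimal illustration of the sampling step described above, the following Python sketch samples an idealized tone at discrete time instants and quantizes the result to 16-bit integers. The 200 Hz tone, the 16 kHz sampling rate, the 16-bit resolution, and the function names `sample_sine` and `quantize_16bit` are assumptions chosen for illustration only; a real speech signal would come from a microphone and an analog-to-digital converter.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Sample an ideal sine tone at the discrete instants n / sample_rate_hz."""
    num_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]


def quantize_16bit(samples):
    """Map samples in [-1.0, 1.0] to signed 16-bit integers, clipping if needed."""
    return [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]


# Assumed example values: 10 ms of a 200 Hz tone sampled at 16 kHz.
samples = sample_sine(freq_hz=200.0, sample_rate_hz=16000.0, duration_s=0.01)
pcm = quantize_16bit(samples)
print(len(pcm))  # 160 samples cover 10 ms at 16 kHz
```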
A central problem in digital speech processing is the segmentation of the sampled waveform of a speech utterance into units describing some specific form of content of the utterance. Such contents used in segmentation can be:
1. Words
2. Phones
3. Phonetic features
4. Pitch periods
Word segmentation aligns each separate word, or a sequence of words, of a sentence with the start and end points of that word or sequence in the speech waveform.
Phone segmentation aligns each phone of an utterance with the corresponding start and end points of the phone in the speech waveform. Examples of such phone segmentation systems are described in (H. Romsdorfer and B. Pfister. Phonetic labeling and segmentation of mixed-lingual prosody databases. Proceedings of Interspeech 2005, pages 3281-3284, Lisbon, Portugal, 2005) and (J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 2008). These segmentation systems achieve phone segment boundary accuracies of about 1 ms for the majority of segments, cf. (H. Romsdorfer. Polyglot Text-to-Speech Synthesis. Text Analysis and Prosody Control. PhD thesis, No. 18210, Computer Engineering and Networks Laboratory, ETH Zurich (TIK-Schriftenreihe Nr. 101), January 2009) or (J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 2008).
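To make the notion of segment boundary accuracy concrete, the following Python sketch represents a segmentation as a list of labeled time intervals and measures how far automatically placed start boundaries deviate from a reference. The `Segment` type, the function `boundary_deviations_ms`, and all labels and time values are hypothetical and serve only as an illustration; they are not taken from the cited systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled unit (word or phone) with its boundaries in seconds."""
    label: str
    start_s: float
    end_s: float


def boundary_deviations_ms(reference, automatic):
    """Absolute deviation of each start boundary in milliseconds.

    Both segmentations must contain the same label sequence.
    """
    if [s.label for s in reference] != [s.label for s in automatic]:
        raise ValueError("label sequences differ")
    return [
        abs(ref.start_s - auto.start_s) * 1000.0
        for ref, auto in zip(reference, automatic)
    ]


# Hypothetical reference and automatic phone segmentations of one utterance.
reference = [Segment("h", 0.100, 0.160), Segment("a", 0.160, 0.250)]
automatic = [Segment("h", 0.101, 0.162), Segment("a", 0.162, 0.250)]

deviations = boundary_deviations_ms(reference, automatic)
print([round(d, 3) for d in deviations])  # deviations of about 1 ms and 2 ms
```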
Phonetic features describe certain phonetic properties of the speech signal, such as voicing information. The voicing information of a speech segment describes whether this segment was uttered with vibrating vocal cords (voiced segment) or without (unvoiced or voiceless segment). (S. Ahmadi and A. S. Spanias. Cepstrum-based pitch detection using a new statistical v/uv classification algorithm. IEEE Transactions on Speech and Audio Processing, 7(3), May 1999) describes an algorithm for voiced/unvoiced classification. The frequency of the vocal cord vibration is often termed the fundamental frequency or the pitch of the speech segment. Fundamental frequency detection algorithms are described in, e.g., (S. Ahmadi and A. S. Spanias. Cepstrum-based pitch detection using a new statistical v/uv classification algorithm. IEEE Transactions on Speech and Audio Processing, 7(3), May 1999) or in (A. de Cheveigne and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4):1917-1930, April 2002). If nothing is uttered, the segment is referred to as being silent. Boundaries of phonetic feature segments do not necessarily coincide with phone segment boundaries. Phonetic feature segments may even span several phone segments, as shown in FIG. 1.
Pitch period segmentation must be highly accurate, as pitch period lengths Tp typically lie between 2 ms and 20 ms. The pitch period is the inverse of the fundamental frequency F0, cf. Eq. 1, which typically ranges between 50 and 180 Hz for male voices and between 100 and 500 Hz for female voices. FIG. 2 shows some pitch periods of a voiced speech segment having a fundamental frequency of approximately 200 Hz.
Tp = 1/F0   (Eq. 1)
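The relation in Eq. 1 can be checked numerically. The following Python sketch converts a fundamental frequency into a pitch period and into a number of samples per period. The function names and the 16 kHz sampling rate are assumptions for illustration; the frequency values are those quoted in the text.

```python
def pitch_period_ms(f0_hz):
    """Pitch period Tp in milliseconds for a fundamental frequency F0 (Eq. 1)."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    return 1000.0 / f0_hz


def samples_per_period(f0_hz, sample_rate_hz):
    """Number of samples spanned by one pitch period at a given sampling rate."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    return sample_rate_hz / f0_hz


print(pitch_period_ms(50.0))    # 20.0 ms, lowest F0 quoted for male voices
print(pitch_period_ms(200.0))   # 5.0 ms, the F0 of the segment in FIG. 2
print(pitch_period_ms(500.0))   # 2.0 ms, highest F0 quoted for female voices
print(samples_per_period(200.0, 16000.0))  # 80.0 samples at an assumed 16 kHz
```

Under the assumed 16 kHz sampling rate, the shortest pitch period of 2 ms spans only 32 samples, so a boundary error of a few samples is already a sizable fraction of one period. This illustrates why pitch period segmentation needs to be highly accurate.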
Segmentation of speech waveforms can be done manually. However, this is very time-consuming, and the manual placement of segment boundaries is not consistent. Automatic segmentation of speech waveforms drastically improves segmentation speed and places segment boundaries consistently. This sometimes comes at the cost of decreased segmentation accuracy. While automatic segmentation procedures with the necessary accuracy exist for words, phones, and several phonetic features, see for example (J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 2008) for very accurate phone segmentation, no automatic segmentation algorithm for pitch periods is known.
It is an object of the invention to enable segmentation of pitch periods of speech waveforms.