In human-machine communication scenarios, it is necessary to automatically extract information other than text-based information (hereinafter referred to as “paralinguistic information”) from speech. Conventionally, prosodic features such as pitch, power and duration have been used as acoustic features for extracting paralinguistic information. Recent studies, however, have reported that voice quality information due to modality in the laryngeal voice source, for example, breathiness, creakiness and harshness, also plays an important role in the perception of paralinguistic information.
Vocal fry (hereinafter referred to as “VF”), creak, creaky voice, glottal fry, pulse register and laryngealization are terms conventionally found in the literature for a voice quality characterized by a train of relatively discrete laryngeal (or glottal) excitations (or pulses of brief duration), with almost complete damping of the vocal tract between successive glottal pulses, usually accompanied by extremely low fundamental frequencies and irregular durations of glottal cycles. Auditorily, VF is perceived as a “rapid series of taps like a stick being run along a railing,” as the “imitated sound of motor boat engine,” or as similar to “food cooking in a hot frying pan.”
VF carries important linguistic and paralinguistic information, depending on the language. In German, VF often occurs near morpheme boundaries. In Japanese, besides appearing in low-tension voices, VF also appears as a pressed voice in expressive, emphatic utterances. Such a pressed voice carries paralinguistic information primarily associated with feelings or attitudes of surprise, admiration and suffering. VF utterance portions (hereinafter referred to as “VF segments”) in such pressed voices are often observed to have very low fundamental frequencies.
Further, VF segments have characteristic irregularities that can cause severe errors in pitch determination algorithms, which are important for prosodic information extraction. Thus, knowledge of the location of VF segments could be useful both for extracting paralinguistic information and for improving pitch determination performance.
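As a rough illustration of why irregular glottal cycles can mislead a pitch determination algorithm, the following toy sketch compares the normalized autocorrelation peak of a regular impulse train with that of an irregular, low-rate impulse train. It is not the method of any cited document. The 8 kHz sampling rate, the interval values, the lag search range and the function names `pulse_train` and `normalized_autocorrelation_peak` are all assumptions chosen for illustration, and unit impulses stand in for real glottal pulses.

```python
# Toy illustration only: unit impulses stand in for glottal pulses.
# All numeric values below are hypothetical choices, not measured data.
SAMPLE_RATE_HZ = 8000  # assumed sampling rate


def pulse_train(intervals, length):
    """Return `length` samples with unit impulses spaced by `intervals` (in samples)."""
    signal = [0.0] * length
    position = 0
    for interval in intervals:
        if position >= length:
            break
        signal[position] = 1.0
        position += interval
    return signal


def normalized_autocorrelation_peak(signal, min_lag, max_lag):
    """Return (lag, value) of the largest autocorrelation in [min_lag, max_lag],
    normalized by the zero-lag energy of the signal."""
    energy = sum(x * x for x in signal)
    best_lag, best_value = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[i] * signal[i + lag]
                    for i in range(len(signal) - lag)) / energy
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag, best_value


LENGTH = 1500
MIN_LAG, MAX_LAG = 20, 400  # search range of 20 Hz to 400 Hz at 8 kHz

# Regular voicing: one pulse every 80 samples (100 Hz at the assumed rate).
regular = pulse_train([80] * 19, LENGTH)

# VF-like excitation: long, irregular glottal cycles (roughly 25 Hz to 43 Hz).
irregular = pulse_train([205, 265, 185, 320, 225, 295], LENGTH)

regular_lag, regular_peak = normalized_autocorrelation_peak(regular, MIN_LAG, MAX_LAG)
irregular_lag, irregular_peak = normalized_autocorrelation_peak(irregular, MIN_LAG, MAX_LAG)

print("regular:   lag", regular_lag, "peak", round(regular_peak, 3))
print("irregular: lag", irregular_lag, "peak", round(irregular_peak, 3))
```

In this toy case the regular train gives a strong peak at its true period, whereas the irregular train gives only weak peaks, none of which represents a stable period. A pitch estimator that relies on a strong autocorrelation peak would therefore tend to reject the segment as unvoiced or report an arbitrary lag.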
Many studies in several research areas report physiological, perceptual and acoustic properties of VF. Many of them give qualitative or descriptive analyses of acoustic features related to different voice qualities. However, only a few evaluate the performance of these features for automatic detection purposes.
Non-Patent Document 1: Ishi, C. T., “Analysis of Autocorrelation-based parameters for Creaky Voice Detection,” Proc. of The 2nd International Conference on Speech Prosody: 643-646, 2004.