First, the terms used in this section are defined.
“Pressed sound” refers to a sound produced with the glottis closed tight, so that air does not flow smoothly through the glottis and the acceleration of the airflow passing through the glottis becomes large. Here, the glottal flow waveform is strongly deformed from a sine curve, and the gradient of its differential waveform becomes locally large. Speech having such characteristics will be referred to as “pressed” speech.
“Breathy sound” refers to a sound produced with the glottis open and not tight, so that air flows smoothly and, as a result, the glottal flow waveform becomes closer to a sine curve. Here, the gradient of the differential waveform of the glottal flow waveform does not become locally large. Speech having such characteristics will be referred to as “breathy” speech.
“Modal” refers to a sound between the pressed and breathy sounds.
“AQ (Amplitude Quotient)” is the peak-to-peak amplitude of the glottal flow waveform divided by the amplitude of the minimum of the flow derivative.
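The AQ definition above can be illustrated with a minimal Python sketch. It is not the method of this specification or of the cited references: the function names `amplitude_quotient` and `triangular_pulse`, the 16 kHz sampling rate, the first-difference approximation of the flow derivative, and the synthetic triangular pulses are all assumptions made for illustration. In practice the glottal flow waveform would first have to be estimated from recorded speech (for example by inverse filtering), which is not shown here.

```python
def amplitude_quotient(flow, sample_rate):
    """Return AQ (in seconds) for a sampled glottal flow waveform.

    AQ = (peak-to-peak flow amplitude) / |minimum of the flow derivative|.
    The derivative is approximated by a first difference scaled by the
    sampling rate (an assumption of this sketch).
    """
    if len(flow) < 2:
        raise ValueError("need at least two samples")
    peak_to_peak = max(flow) - min(flow)
    derivative = [(b - a) * sample_rate for a, b in zip(flow, flow[1:])]
    d_min = abs(min(derivative))  # amplitude of the negative derivative peak
    if d_min == 0:
        raise ValueError("flow never decreases; AQ is undefined")
    return peak_to_peak / d_min


def triangular_pulse(rise, fall, closed):
    """Hypothetical one-cycle flow pulse: linear rise, linear fall, closed phase.

    All three arguments are lengths in samples.
    """
    opening = [i / rise for i in range(rise)]
    closing = [1.0 - j / fall for j in range(fall)]
    return opening + closing + [0.0] * (closed + 1)


SAMPLE_RATE = 16000  # Hz, assumed

# Abrupt glottal closure (steep fall), standing in for a pressed-like pulse.
aq_pressed = amplitude_quotient(triangular_pulse(60, 10, 90), SAMPLE_RATE)
# Gradual, symmetric pulse with no closed phase, standing in for a breathy-like pulse.
aq_breathy = amplitude_quotient(triangular_pulse(80, 80, 0), SAMPLE_RATE)

print(f"pressed-like AQ: {aq_pressed * 1000:.3f} ms")
print(f"breathy-like AQ: {aq_breathy * 1000:.3f} ms")
```

Because the numerator is a flow amplitude and the denominator is a flow amplitude per unit time, AQ has the dimension of time. In this sketch the pulse with the steeper closing slope yields the smaller AQ, since a sharper closure increases the magnitude of the derivative minimum without changing the peak-to-peak amplitude.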
Speech synthesis is as important a field of phonetic study as speech recognition. Recent developments in signal processing technology have promoted the use of speech synthesis in many fields. Conventional speech synthesis, however, simply produces speech from text information, and the subtle emotional expression observed in human conversation cannot be expected of it.
By way of example, human conversation conveys information such as anger, joy and sadness through vocal sound and the like, in addition to the content of the speech itself. Information other than the language that accompanies speech will be referred to as paralinguistic information. Such information cannot be represented by text information alone, and it has therefore been difficult to convey paralinguistic information in conventional speech synthesis. For a more efficient man-machine interface, it is desirable to convey not only the text information but also the paralinguistic information at the time of speech synthesis.
As a solution to this problem, continuous speech synthesis in various utterance styles has been proposed. A specific approach is as follows. Speech is recorded and converted to a data-processable form to prepare a database, and speech units in the database that are considered to express desired features (such as anger, joy, and sadness) are labeled correspondingly. At the time of speech synthesis, speech having a label corresponding to the desired paralinguistic information is used.
However, the preparation of a database with sufficient coverage of speaking-styles necessarily implies processing of huge amounts of recorded speech. Therefore, automatic feature extraction and labeling without operator supervision must be ensured.
Examples of paralinguistic information are as follows. One aspect of speaking style is the distinction between pressed sound and breathy sound. Pressed sound is produced rather strongly, because the glottis is tight. Breathy sound is not perceived as strong, because the glottal flow waveform is close to a sine curve. Accordingly, the distinction between pressed sound and breathy sound is a significant feature of speaking style, and if its degree can be represented as a numerical value, that value may be usable as paralinguistic information.
A great deal of research has been reported on the acoustic cues that differentiate breathy from pressed voice quality. See, for example, ‘The science of the singing voice,’ Sundberg, J., Northern Illinois University Press, DeKalb, Ill., (1987)(hereafter ‘Sundberg’). The majority of such studies, however, have been limited to speech (or singing) data recorded during sustained phonation of steady-state vowels. Quantifying the degree of pressedness or breathiness with high reliability from acoustic measurements on large amounts of recorded speech data remains a challenge, and a solution would be very helpful.
While various measures have been proposed that approximate properties of the voice source in the spectral domain, the most direct estimates are obtained from a combination of the glottal flow waveform and its derivative. An example of such an estimate is AQ, proposed in Reference 2 listed in the last part of the specification.
One advantage of AQ is explained in ‘Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering’, Alku, P. & Vilkman, E., Speech Comm., 18(2), 131-138, (1996)(hereafter ‘Alku’). According to Alku, AQ is relatively independent of the sound pressure level (SPL) and relies primarily on phonatory quality. Another possible advantage is that AQ is a purely amplitude-domain parameter and should therefore be relatively immune to the sources of error in measuring time-domain features of the estimated glottal waveform. Alku found that, for all four male and four female speakers producing the sustained vowel “a” with a range of phonation types, the value of AQ decreased monotonically when phonation was changed from breathy to pressed (see Alku, p. 136). AQ therefore seems promising for solving the problem discussed in the foregoing. Note, however, that the following conditions must be satisfied for AQ to be applied effectively:
1) AQs can be measured robustly and reliably in recorded natural speech; and
2) Perceptual salience of the parameter as measured under such conditions can be validated.
To satisfy these conditions, it is important to reliably extract, from speech waveforms representing physical quantities such as naturally produced voices, parameters that represent features of those waveforms. In particular, when utterances are not fully and closely controlled by the speaker, or when various speakers give utterances in various styles, speech may contain both portions from which parameters can be extracted reliably and portions from which they cannot. It is therefore important to choose which portion of the speech waveform is to be processed. To this end, where the syllable serves as the unit of sound production, as in Japanese, the central portion of each syllable (tentatively referred to as the “syllabic nucleus”) must be extracted correctly.