1. Technical Field
This invention relates to the field of speech synthesis and more particularly to a method and apparatus for synthesizing vowels in a speech synthesizer.
2. Description of the Related Art
Phonetics is the scientific study of all aspects of speech. Phonetics can be divided into acoustic phonetics and articulatory phonetics. Acoustic phonetics is concerned with the structures and patterns of acoustic signals. Articulatory phonetics is concerned with the ways sounds are produced, for example by describing speech sounds in terms of the positions of the vocal organs when producing any given sound. By comparison, speech synthesis is the process of producing audibly recognizable speech output in a computing system. Speech synthesizers, for example Text-to-Speech (TTS) Engines, can process computer-readable text into synthesized speech by applying the principles of acoustic and articulatory phonetics to the structure and composition of the computer-readable text in order to computationally produce speech.
Speech sounds, both in the study of phonetics and in the synthesis of speech, are conventionally divided into vowels and consonants. Consonants can be characterized by the manner in which they are formed. Specifically, to form a consonant, the airstream through the human vocal tract typically is obstructed in some manner. As such, consonants are classified according to this obstruction, for instance, by the place of articulation, the manner of articulation and the presence or absence of voicing. Vowels, unlike consonants, exhibit a great deal of dialectal variation. This variation can depend on factors such as geographical region, age and gender. Vowels can be differentiated from consonants by the relatively wide opening of the human mouth as air passes from the lungs out of the body. Accordingly, there is very little obstruction of the airstream in comparison to consonants. Typically, vowels can be described in terms of tongue position and lip shaping.
Notably, vowel sounds produced by speech synthesizers can have a buzzing quality which can prove undesirable to the user of a TTS Engine. It has been shown, however, that the application of non-stationary additive noise (NSAN) to synthesized vowels can mask this buzzing quality. Furthermore, it has been shown experimentally that the application of NSAN to synthesized vowels can improve the perceived naturalness of the vowel sounds. Accordingly, it can be preferable to apply NSAN to synthesized vowel sounds in a TTS Engine.
A method for generating non-stationary additive noise (NSAN) for addition to synthesized speech can include selecting a group of pitch pulses in a recorded sample of a spoken vowel; computing a frequency spectrum for the selected group of pitch pulses; identifying formant values in the computed frequency spectrum; creating an all-zero filter based upon the identified formant values; populating a zero-padded matrix with the selected group of pitch pulses; and, applying the all-zero filter to the matrix. The application of the all-zero filter to the matrix can produce NSAN vectors, each NSAN vector corresponding to a pitch pulse in the group of pitch pulses.
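The last two steps of this method, populating a zero-padded matrix and applying the all-zero filter to it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: one pitch pulse per row, zero-padding at the end of each row, and a hypothetical two-tap filter standing in for a filter derived from real formant values are all choices made here for illustration.

```python
import numpy as np

def build_padded_matrix(pulses):
    """Stack variable-length pitch pulses as rows, zero-padded to the longest.

    Assumption: one pulse per row, padding appended at the end of the row.
    """
    width = max(len(p) for p in pulses)
    matrix = np.zeros((len(pulses), width))
    for row, pulse in enumerate(pulses):
        matrix[row, :len(pulse)] = pulse
    return matrix

def apply_all_zero_filter(matrix, fir_taps):
    """Filter each row with an FIR (all-zero) filter, keeping the row length."""
    width = matrix.shape[1]
    return np.array([np.convolve(row, fir_taps)[:width] for row in matrix])

# Hypothetical data: two short "pitch pulses" and an illustrative
# first-difference filter (not a filter derived from real formants).
pulses = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
fir_taps = np.array([1.0, -1.0])

matrix = build_padded_matrix(pulses)
nsan_vectors = apply_all_zero_filter(matrix, fir_taps)  # one row per pulse
```

Each row of `nsan_vectors` then corresponds to the pitch pulse stored in the same row of `matrix`.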
In one aspect of the invention, the step of selecting a group of pitch pulses can include selecting twenty pitch pulses in the recorded sample of speech. Additionally, the twenty pitch pulses can be positioned in the center of the recorded sample. In another aspect of the invention, the identifying step can include identifying the first three formant values in the computed frequency spectrum. In yet another aspect of the invention, the step of computing a frequency spectrum can include applying a linear predictive coding (LPC) process to the selected group of pitch pulses. Notably, the LPC process can extract predictive coefficients from the selected group of pitch pulses. As a result, the step of creating an all-zero filter can further include configuring the all-zero filter with the extracted predictive coefficients.
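The pulse selection and LPC steps described in these aspects can be sketched as follows. The autocorrelation method of LPC is assumed here because the text does not name a particular LPC variant; the function names, the synthetic second-order test signal and the sign convention A(z) = 1 - sum of a_k z^-k are likewise illustrative assumptions rather than details taken from the invention.

```python
import numpy as np

def select_center_pulses(pulses, count=20):
    """Return `count` consecutive pulses taken from the middle of the list."""
    start = max((len(pulses) - count) // 2, 0)
    return pulses[start:start + count]

def lpc_coefficients(signal, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(signal[:n - k], signal[k:]) for k in range(order + 1)])
    # Symmetric Toeplitz matrix built from lags 0..order-1.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def all_zero_filter_from_lpc(a):
    """FIR taps of the inverse filter A(z) = 1 - sum_k a_k z^-k."""
    return np.concatenate(([1.0], -np.asarray(a, dtype=float)))

# Hypothetical test signal: a known second-order autoregressive process,
# x[n] = 1.3 x[n-1] - 0.4 x[n-2] + noise, so the result can be checked.
rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
signal = np.zeros(5000)
for n in range(2, 5000):
    signal[n] = 1.3 * signal[n - 1] - 0.4 * signal[n - 2] + noise[n]

a = lpc_coefficients(signal, order=2)   # close to [1.3, -0.4]
fir_taps = all_zero_filter_from_lpc(a)  # taps of the all-zero filter
```

In practice the input would be the concatenated group of selected pitch pulses rather than a synthetic signal, and the prediction order would be chosen high enough to resolve at least the first three formants.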
The method of the invention also can include low-pass filtering the recorded sample and selecting a group of filtered pitch pulses in the filtered sample, wherein each filtered pitch pulse in the selected group of the filtered sample corresponds to a pitch pulse in the selected group of the recorded sample. Subsequently, each NSAN vector can be added to a corresponding filtered pitch pulse in the selected group of the filtered sample. That is, each NSAN vector can be added to the filtered pitch pulse whose counterpart in the recorded sample produced that NSAN vector.
Notably, the step of low-pass filtering can include determining a fundamental frequency for the recorded sample; and, passing the recorded sample through a low-pass cut-off filter configured with cut-off frequencies corresponding to the first formant and the fundamental frequency. Furthermore, the step of passing can include passing the recorded sample through the low-pass cut-off filter both forwards and backwards.
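Determining a fundamental frequency and passing a signal through a low-pass filter both forwards and backwards can be sketched as follows. Several pieces are assumptions made for illustration: the autocorrelation-peak pitch estimator, the one-pole smoother standing in for the low-pass cut-off filter, the `cutoff_hz` parameter and the common approximation used to map it to a smoothing coefficient. Library routines such as SciPy's `filtfilt` offer the same forward-backward idea with higher-order filters.

```python
import numpy as np

def estimate_f0(signal, fs, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    lo, hi = int(fs / fmax), int(fs / fmin)  # candidate lags in samples
    n = len(signal)
    r = np.array([np.dot(signal[:n - k], signal[k:]) for k in range(lo, hi + 1)])
    return fs / (lo + int(np.argmax(r)))

def one_pole_lowpass(x, alpha):
    """Causal one-pole low-pass: y[n] = alpha*x[n] + (1 - alpha)*y[n-1]."""
    y = np.empty(len(x))
    state = 0.0
    for n, sample in enumerate(x):
        state = alpha * sample + (1.0 - alpha) * state
        y[n] = state
    return y

def zero_phase_lowpass(x, cutoff_hz, fs):
    """Run the filter forwards, then backwards, cancelling the phase delay."""
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)  # approximate mapping
    forward = one_pole_lowpass(np.asarray(x, dtype=float), alpha)
    return one_pole_lowpass(forward[::-1], alpha)[::-1]

# Hypothetical inputs: a 100 Hz tone for the pitch estimate and a unit
# impulse to show that the two-pass response is symmetric in time.
fs = 8000
t = np.arange(4000) / fs
tone = np.sin(2 * np.pi * 100.0 * t)
f0 = estimate_f0(tone, fs)

impulse = np.zeros(1001)
impulse[500] = 1.0
response = zero_phase_lowpass(impulse, cutoff_hz=500.0, fs=fs)
```

Filtering in both directions keeps the filtered pitch pulses time-aligned with the pitch pulses of the recorded sample, which matters when NSAN vectors are later added pulse by pulse.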
By comparison, a method for producing vowel sounds in a waveform generator using NSAN can include computing a frequency spectrum for a selected group of pitch pulses in a recorded sample of a spoken vowel; identifying a set of formant values in the computed frequency spectrum and creating an all-zero filter for the set of identified formant values; populating a zero-padded matrix with the selected group of pitch pulses and applying the all-zero filter to the matrix, the application of the filter producing a set of NSAN vectors; synthesizing a vowel sound in the waveform generator, the synthesis producing a further group of pitch pulses; and, adding the NSAN vectors to the further group of pitch pulses.
The step of computing a frequency spectrum can include applying a linear predictive coding (LPC) process to the selected group of pitch pulses. Notably, the LPC process can extract predictive coefficients from the selected group of pitch pulses. As a result, the step of creating an all-zero filter can further include configuring the all-zero filter with the extracted predictive coefficients.
The identifying step can include identifying the first three formant values in the computed frequency spectrum. Finally, the adding step can include sampling the synthesized vowel sound and selecting a group of pitch pulses in the sampled vowel sound; and, for each pitch pulse in the sample, re-sampling a corresponding NSAN vector to the length of the pitch pulse, multiplying the re-sampled NSAN vector by a scaling factor and adding the scaled, re-sampled NSAN vector to the pitch pulse.
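The adding step can be sketched as follows. Linear interpolation is assumed as the re-sampling method because the text does not specify one, and the function names, the toy pulses and the scaling factor of 0.5 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def resample_to_length(vector, length):
    """Linearly interpolate `vector` onto `length` evenly spaced points."""
    vector = np.asarray(vector, dtype=float)
    old_positions = np.linspace(0.0, 1.0, num=len(vector))
    new_positions = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_positions, old_positions, vector)

def add_nsan(pitch_pulses, nsan_vectors, scale):
    """Re-sample, scale and add one NSAN vector to each pitch pulse."""
    noisy = []
    for pulse, nsan in zip(pitch_pulses, nsan_vectors):
        pulse = np.asarray(pulse, dtype=float)
        noisy.append(pulse + scale * resample_to_length(nsan, len(pulse)))
    return noisy

# Hypothetical synthesized pitch pulses and NSAN vectors of differing lengths.
pitch_pulses = [np.zeros(5), np.ones(3)]
nsan_vectors = [np.array([0.0, 2.0, 4.0]), np.array([1.0, 1.0, 1.0, 1.0])]

noisy_pulses = add_nsan(pitch_pulses, nsan_vectors, scale=0.5)
```

Each output pulse keeps the length of the synthesized pitch pulse it was built from, so the pulses can be concatenated back into the vowel waveform without changing its pitch contour.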