1. Field of the Invention
The present invention relates generally to improvements in synthetic voice systems and, more particularly, pertains to a new and improved voice source for synthetic speech systems.
2. Description of the Prior Art
An increasing amount of research and development work is being done in text-to-speech systems. These are systems which can take someone's typing or a computer file and turn it into the spoken word. Such a system is very different from the system used in, for example, automobiles that warn that a door is open. A text-to-speech system is not limited to a few "canned" expressions. The commercially available systems are being put to such uses as reading machines for the blind and computer-based telephone information services.
The presently available systems are reasonably understandable. However, they still produce voices which are noticeably nonhuman. In other words, it is obvious that they are produced by a machine. This characteristic limits their range of application. Many people are reluctant to accept conversation from something that sounds like a machine.
One of the most important problems in producing natural-sounding synthetic speech occurs at the voice source. In a human being, the vocal cords produce a sound source which is modified by the varying shape of the vocal tract to produce the different sounds. The prior art has had considerable success in computationally mimicking the effects of the vocal tract. Mimicking the effects of the vocal cords, however, has proved much more difficult. Accordingly, the research in text-to-speech in the last few years has been largely dedicated to producing a more human-like sound.
The essential scheme of a typical text-to-speech system is illustrated in FIG. 1. The text input 11 comes from a keyboard or a computer file or port. This input is filtered by a preprocessor 15 and passed to a language processing component which attempts a syntactic and lexical analysis. The preprocessor stage 15 must deal with unrestricted text and convert it into words that can be spoken. The text-to-speech system of FIG. 1, for example, may be called upon to act as a computer monitor, and must express abbreviations, mathematical symbols and, possibly, computer escape sequences, as word strings. An erroneous input such as a binary file can also come in, and must be filtered out.
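As a rough illustration of the preprocessing stage described above, the following Python sketch expands a few abbreviations and symbols into word strings and rejects input that looks binary. The expansion table, the function names, and the printable-character threshold are assumptions made for illustration only and are not taken from the system of FIG. 1.

```python
import string

# Hypothetical expansion table; a practical preprocessor would be far larger.
EXPANSIONS = {
    "Dr.": "doctor",
    "etc.": "et cetera",
    "+": "plus",
    "=": "equals",
    "%": "percent",
}

def looks_binary(text, threshold=0.9):
    """Return True when too few characters are printable (assumed heuristic)."""
    if not text:
        return False
    printable = sum(ch in string.printable for ch in text)
    return printable / len(text) < threshold

def preprocess(text):
    """Convert raw text to a list of speakable words, or None for binary input."""
    if looks_binary(text):
        return None
    # Replace each token that has a table entry; pass other tokens through.
    return [EXPANSIONS.get(token, token) for token in text.split()]
```

In this sketch, `preprocess("2 + 2 = 4")` yields the word string `["2", "plus", "2", "equals", "4"]`, while a run of control bytes is filtered out by returning `None`.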
The output from the preprocessor 15 is supplied to the language processor 17, which performs an analysis of the words that come in. In English text-to-speech systems, it is common to include a small "exceptions" dictionary for words that violate the normal correspondences between spelling and pronunciation. The lexicon entries are not only used for pronunciation. The system extracts syntactic information as well, which can be used by the parser. Therefore, for each word, there are entries for parts of speech, verb type, verb singular or plural, etc. Words that have no lexicon entry pass through a set of letter-to-sound rules which govern, for example, how to pronounce the sequence. The letter-to-sound rules thus provide phoneme strings that are later passed on to the acoustic processing section 19. The parser has an important but narrowly-defined task. It provides such syntactic, semantic, and pragmatic information as is relevant for pronunciation.
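The lookup order described above, exceptions dictionary first and letter-to-sound rules as the fallback, can be sketched as follows. The dictionary entries, the phoneme symbols, and the toy rule set are hypothetical; an actual rule set for English would be much larger and context-sensitive.

```python
# Hypothetical exceptions dictionary: word -> (phoneme string, part of speech).
EXCEPTIONS = {
    "of": (["AH", "V"], "preposition"),
    "one": (["W", "AH", "N"], "numeral"),
}

# Toy letter-to-sound rules; multi-letter patterns are listed first so that
# they are tried before single letters.
LETTER_RULES = [
    ("sh", ["SH"]),
    ("ch", ["CH"]),
    ("th", ["TH"]),
    ("a", ["AE"]),
    ("e", ["EH"]),
    ("i", ["IH"]),
    ("o", ["AA"]),
    ("u", ["AH"]),
]

def letter_to_sound(word):
    """Scan the word left to right, applying the first rule that matches."""
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, sounds in LETTER_RULES:
            if word.startswith(pattern, i):
                phonemes.extend(sounds)
                i += len(pattern)
                break
        else:
            # No rule matched: use the letter itself as a placeholder symbol.
            phonemes.append(word[i].upper())
            i += 1
    return phonemes

def pronounce(word):
    """Return (phonemes, part_of_speech); part_of_speech is None off-lexicon."""
    entry = EXCEPTIONS.get(word.lower())
    if entry is not None:
        return entry
    return letter_to_sound(word.lower()), None
```

Only lexicon entries carry syntactic information in this sketch, which mirrors the point that the lexicon serves the parser as well as pronunciation.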
All this information is passed on to the acoustic processing component 19, which modifies the phoneme strings by the applicable rules and generates time varying acoustic parameters. One of the parameters that this component has to set is the duration of each segment, which is affected by a number of different conditions. A variety of factors affect the duration of vowels, such as the intrinsic duration of the vowels, the type of following consonant, the stress (accent) on a syllable, the location of the word in a sentence, speech rate, dialect, speaker, and random variations.
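One simple way to combine several of the duration factors listed above is to scale an intrinsic duration by a factor per condition. The sketch below assumes such a multiplicative form; the intrinsic durations, the scale factors, and the function name are hypothetical values chosen for illustration and are not taken from the system described.

```python
# Hypothetical intrinsic vowel durations in milliseconds.
INTRINSIC_MS = {"AE": 230.0, "IH": 135.0}

def vowel_duration_ms(vowel, stressed, before_voiceless, phrase_final, rate=1.0):
    """Scale the intrinsic duration by one assumed factor per condition."""
    duration = INTRINSIC_MS[vowel]
    if not stressed:
        duration *= 0.7   # assumed shortening of unstressed vowels
    if before_voiceless:
        duration *= 0.8   # assumed shortening before a voiceless consonant
    if phrase_final:
        duration *= 1.4   # assumed lengthening at the end of a phrase
    return duration / rate  # a faster speech rate shortens every segment
```

Dialect, speaker, and random variation are omitted from the sketch; they could be added as further factors in the same manner.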
A major part of the acoustic processing component consists of converting the phoneme strings to a parameter array. An array of target parameters for each phoneme is used to create some initial values. These values are modified as a result of the surrounding phonemes, the duration of the phoneme, the stress or accent value of the phoneme, etc. Finally, the acoustic parameters are converted to coefficients which are passed on to the formant synthesizer 21. The cascade/parallel formant synthesizer 21 is preferably common across all languages.
Working within source-and-filter theory, most of the work on the acoustic and synthesizer portions of text-to-speech systems in the past years has been devoted to improving filter characteristics; that is, the formant frequencies and bandwidths. The emphasis has now turned to improving the characteristics of the voice source; that is, the signal which, in humans, is created by the vocal folds.
In earlier work toward this end, conducted almost entirely on male speech, a reasonable approximation of the voice source was obtained by filtering a pulse string to achieve an approximately 6 dB-per-octave rolloff. Now that the filter characteristics have been improved, attention has turned to improving the voice source itself.
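The pulse-string approach can be sketched in a few lines. The text does not specify the filter, so the sketch assumes a single-pole low-pass filter, whose response falls off at roughly 6 dB per octave above its corner frequency; the coefficient value and the function names are illustrative assumptions.

```python
def pulse_train(period, length):
    """Unit impulses every `period` samples, zeros elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def one_pole_lowpass(x, a=0.9):
    """y[n] = (1 - a) * x[n] + a * y[n - 1]; `a` sets the corner frequency."""
    y = []
    prev = 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y

# One filtered impulse per pitch period serves as the crude voice source.
source = one_pole_lowpass(pulse_train(period=80, length=400))
```

Each impulse becomes a geometrically decaying pulse, which is the entire extent of the source shaping in this scheme.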
Moreover, the interest in female speech has also made work on the voice source important. A female voice source cannot be adequately synthesized using a simple pulse train and filter.
This work is quite difficult. Data on a human voice source is difficult to obtain. The source from the vocal folds is filtered by the vocal tract, greatly modifying its spectrum and time waveform. Although this is a linear process which can be reversed by electronic or digital inverse filtering, it is difficult and time consuming to determine the time varying transfer function with sufficient precision to accurately set the inverse filters. However, researchers have undertaken voice source research despite these inherent difficulties.
FIGS. 2, 3, and 4 illustrate time domain waveforms 23, 25, and 27. These waveforms illustrate the output of inverse filtering for the purpose of recovering a glottal waveform. FIG. 2 shows the original time waveform 23 for the vowel "a." FIG. 3 shows the waveform 25 from which the formants have been filtered. Waveform 25 still shows the effect of lip radiation, which emphasizes high frequencies with a slope of about 6 dB per octave. Integration of waveform 25 produces waveform 27 (FIG. 4), which is the waveform produced after the lip radiation effect is removed.
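The final integration step can be illustrated with a discrete-time sketch. Here lip radiation is approximated by a first difference, a common simple model of its high-frequency emphasis, and a running sum serves as the integrator; both choices, and the optional `leak` parameter, are assumptions for illustration rather than the procedure used to produce FIGS. 2 through 4.

```python
def first_difference(x):
    """Crude lip radiation model: y[n] = x[n] - x[n - 1]."""
    out = []
    prev = 0.0
    for sample in x:
        out.append(sample - prev)
        prev = sample
    return out

def integrate(x, leak=1.0):
    """Running sum; a `leak` slightly below 1.0 limits low-frequency drift."""
    out = []
    acc = 0.0
    for sample in x:
        acc = leak * acc + sample
        out.append(acc)
    return out

# Round trip: a toy glottal waveform, radiated at the lips, then recovered.
glottal = [0.0, 0.5, 1.0, 0.25, 0.0, 0.0]
recovered = integrate(first_difference(glottal))
```

With `leak=1.0` the running sum exactly undoes the first difference, so `recovered` matches `glottal` sample for sample.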
A text-to-speech system must have a synthetic voice source. In order to produce a synthetic source, it has been suggested to synthesize the glottal source as the concatenation of a polynomial and an exponential decay, as shown by waveform 29 in FIG. 5. The waveform is specified by four parameters, T0, AV, OQ, and CRF. T0 is the period, which is the inverse of the frequency F0, expressed in sample points. AV is the amplitude of voicing. OQ is the open quotient; that is, the percentage of the period during which the glottis is open. These first three parameters uniquely determine the polynomial portion of the curve. To simulate the closing of the glottis, an exponential decay is used, which has a time constant CRF (corner rounding factor). A larger CRF has the effect of softening the sharpness of an otherwise abrupt simulated glottal closure.
Control of the glottal pulse is designed to minimize the number of required input parameters. T0 is, of course, necessary, and is supplied to the acoustic processing component. Target values for AV and for initial values of OQ are maintained in table entries for all phonemes. A set of rules governs the interpolation between the points where OQ and AV are specified.
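A minimal sketch of such a four-parameter pulse follows. The text does not give the polynomial or state the domain in which the curve is drawn, so the sketch makes several assumptions: the open phase of the glottal flow is the cubic a*n^2 - b*n^3 scaled so that its peak equals AV, the generated waveform is the derivative of that flow, and CRF is a time constant in samples. The function and argument names are hypothetical.

```python
import math

def glottal_pulse(period, av, oq, crf):
    """One period of an assumed glottal flow derivative.

    period: pitch period in samples
    av:     amplitude of voicing (peak of the assumed flow cubic)
    oq:     open quotient in percent of the period
    crf:    decay time constant in samples (must be > 0)
    """
    n_open = max(1, round(period * oq / 100.0))
    a = 27.0 * av / (4.0 * n_open ** 2)   # makes the flow peak equal av
    b = a / n_open                        # makes the flow zero at closure
    closure = -a * n_open                 # derivative value at closure
    pulse = []
    for n in range(period):
        if n < n_open:
            # Polynomial portion: derivative of a*n^2 - b*n^3.
            pulse.append(2.0 * a * n - 3.0 * b * n * n)
        else:
            # Exponential return toward zero; a larger crf rounds the corner.
            pulse.append(closure * math.exp(-(n - n_open) / crf))
    return pulse
```

Under these assumptions the period, AV, and OQ fix the polynomial portion completely, and CRF alone controls how abruptly the waveform returns to zero after closure.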
Voiceless sounds have an AV value of zero. Although the OQ value is meaningless during a voiceless sound, voiceless sounds are nevertheless stored with varying OQ values so that the interpolation rules provide the proper OQ for voiced sounds in the vicinity of voiceless sounds. CRF is strongly correlated to the other parameters in natural speech. For example, high pitch is correlated with a relatively high CRF. A higher voice pitch is associated with smoother voice quality (low spectral tilt). Higher amplitude correlates with a harsher voice quality (high spectral tilt). A higher open quotient is correlated with a breathy voice, which has a very high CRF.
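The role of the stored OQ values for voiceless sounds can be seen in a small interpolation sketch. The text says only that a set of rules governs the interpolation, so straight-line interpolation is assumed here, and the phoneme table, its values, and the function names are hypothetical.

```python
# Hypothetical targets: phoneme -> (AV, OQ in percent).
# The voiceless "S" has AV of zero but still carries an OQ target.
TARGETS = {
    "AA": (60.0, 50.0),
    "S": (0.0, 70.0),
    "IY": (58.0, 45.0),
}

def interpolate(start, end, steps):
    """Straight-line values from start to end over `steps` frames (steps >= 2)."""
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]

def transition(ph_from, ph_to, steps):
    """Per-frame (AV, OQ) pairs between two phoneme targets."""
    av0, oq0 = TARGETS[ph_from]
    av1, oq1 = TARGETS[ph_to]
    return list(zip(interpolate(av0, av1, steps), interpolate(oq0, oq1, steps)))
```

In a transition from "AA" to "S", the OQ of the vowel frames nearest the consonant is pulled toward the consonant's stored value even though that value is never voiced itself.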
One of the most important elements in producing natural sounding synthetic speech concerns voice quality, or the "timbre" of the voice. This characteristic is largely determined at the voice source. In a human being, the vocal cords produce the sound source which is modified by the varying shape of the vocal tract to produce different sounds. All prior art techniques have been directed to computationally mimicking the effects of the vocal tract. There has been considerable success in this endeavor. However, computationally mimicking the effects of the vocal cords has proved quite difficult. The prior art approach to this problem has been to use the well-established research technique of taking the recorded speech of a human speaker and removing the effects of the mouth, leaving only the voice source. As discussed above, the voice source was then utilized by extracting parameters, and then using these parameters for synthetic voice generation. The present invention approaches the problem from a completely different direction in that it uses the time waveform of the voice source itself. This idea was explored by John N. Holmes in his paper, The Influence of Glottal Waveforms on the Naturalness of Speech from a Parallel Formant Synthesizer, in the IEEE Transactions on Audio and Electroacoustics, Vol. AU-21, No. 3, June 1973.
Nevertheless, the objective of providing a source signal which is capable of quickly and reliably producing voice quality that is indistinguishable from a human voice had not been attained until the present invention.