1. Field of the Invention
The present invention relates to translating text in a computer system to synthesized speech; and more particularly to techniques used in such systems for control of intonation in synthesized speech.
2. Description of the Related Art
In text-to-speech systems, stored text in a computer is translated to synthesized speech. As can be appreciated, this kind of system would have wide spread application if it were of reasonable cost. For instance, a text-to-speech system could be used for reviewing electronic mail remotely across a telephone line, by causing the computer storing the electronic mail to synthesize speech representing the electronic mail. Also, such systems could be used for reading to people who are visually impaired. In the word processing context, text-to-speech systems might be used to assist in proofreading a large document.
However in prior art systems which have reasonable cost, the quality of the speech has been relatively poor making it uncomfortable to use or difficult to understand. In order to achieve good quality speech, prior art speech synthesis systems need specialized hardware which is very expensive, and/or a large amount of memory space in the computer system generating the sound.
Prior art systems which have addressed this problem are described in part in U.S. Pat. No. 8,452,168, entitled COMPRESSION OF STORED WAVE FORMS FOR ARTIFICIAL SPEECH, invented by Sprague; and U.S. Pat. No. 4,692,941, entitled REAL-TIME TEXT-TO-SPEECH CONVERSION SYSTEM, invented by Jacks, et al. Further background concerning speech synthesis may be found in U.S. Pat. No. 4,384,169, entitled METHOD AND APPARATUS FOR SPEECH SYNTHESIZING, invented by Mozer, et al.
In text-to-speech systems, an algorithm reviews an input text string, and translates the words in the text string into a sequence of diphones which must be translated into synthesized speech. Also, text-to-speech systems analyze the text based on word type and context to generate intonation control used for adjusting the duration of the sounds and the pitch of the sounds involved in the speech.
Diphones consist of a unit of speech composed of the transition between one sound, or phoneme, and an adjacent sound, or phoneme. Diphones typically are encoded as a sequence of frames of sound data starting at the center of one phoneme and ending at the center of a neighboring phoneme. This preserves the transition between the sounds relatively well. The encoded diphones have a nominal pitch determined by the length of a pitch period in the encoded speech and a nominal duration determined by the number of pitch periods corresponding to a particular encoded sound. These nominal values must be adjusted to synthesize natural sounding speech.
Intonation control in such systems involves lengthening or shortening particular frames, or pitch periods, of speech data for pitch control, and inserting or deleting frames associated with particular sounds for duration control. Prior art systems have accomplished these modifications by relatively crude clipping and extrapolation on pitch period boundaries that introduce discontinuities in output speech data sequences. In some cases, these discontinuities may introduce audible clicks or other noise.
Notwithstanding the prior work in this area, the use of text-to-speech systems has not gained widespread acceptance. It is desireable therefore to provide a software only text-to-speech system which is portable to a wide variety of microcomputer platforms, and conserves memory space in such platforms for other uses, and performs intonation control with high quality.