1. Field of the Invention
The present invention relates generally to the generation of synthetic speech, and more specifically, to the generation of synthetic speech at remote client devices.
2. Description of Related Art
Speech synthesis, which refers to the artificial generation of speech from written text, is becoming an increasingly important technology for accessing information. Two areas in which speech synthesis looks particularly promising are increasing the availability of information to sight-impaired individuals and enriching the information content of web-based devices that have minimal or no viewing screens.
FIG. 1 is a diagram illustrating a conventional web-based speech synthesis system. Synthesizing text 101 into a digital waveform file 110 is performed by the three sequential steps of text analysis 102, prosodic analysis 103, and speech waveform generation 104.
In text analysis 102, text 101 is analyzed into some form of linguistic representation. The analyzed text is next decomposed into sounds, more generally described as acoustic units. Most of the acoustic units for languages like English are obtained from a pronunciation dictionary. Other acoustic units, corresponding to words not in the dictionary, are generated by letter-to-sound rules for each language. The symbols representing acoustic units produced by the dictionary and letter-to-sound rules typically correspond to phonemes or syllables in a particular language.
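The dictionary-first lookup with a rule-based fallback described above can be illustrated with a minimal Python sketch. The dictionary entries, the phoneme symbols, and the one-letter-per-sound rules below are hypothetical placeholders chosen only for illustration; an actual system would rely on a full pronunciation dictionary and language-specific letter-to-sound rules.

```python
# Hypothetical toy pronunciation dictionary (word -> acoustic unit symbols).
PRONUNCIATION_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Hypothetical, greatly simplified letter-to-sound rules: one letter, one symbol.
LETTER_TO_SOUND = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K", "y": "Y", "z": "Z",
}


def to_acoustic_units(word):
    """Return acoustic unit symbols for one word.

    The dictionary is consulted first; words it does not contain fall back
    to the letter-to-sound rules.
    """
    key = word.lower()
    if key in PRONUNCIATION_DICT:
        return list(PRONUNCIATION_DICT[key])
    return [LETTER_TO_SOUND[ch] for ch in key if ch in LETTER_TO_SOUND]


def text_to_units(text):
    """Decompose whitespace-separated text into one flat list of symbols."""
    units = []
    for word in text.split():
        units.extend(to_acoustic_units(word.strip(".,;:!?")))
    return units
```

In this sketch, "speech" is resolved from the dictionary, while a word such as "bad" is absent from it and is therefore spelled out by the fallback rules.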
Prosodic analysis 103 includes the identification of points within sentences that require changes in the intonation or pitch contour (up, down, flattening) and the defining of durations for certain syllables. The pitch contour may be further refined by segmenting the current sentence into intonational phrases. Intonational phrases are sections of speech characterized by a distinctive pitch contour, which usually declines at the end of each phrase.
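As a rough illustration of segmenting a sentence into intonational phrases and assigning each a declining pitch contour, consider the following Python sketch. Splitting on commas, the linear decline, and the 180 Hz and 120 Hz endpoints are all assumptions made for this example; they are not taken from any particular prosodic model.

```python
def split_into_phrases(sentence):
    """Split on commas as a crude stand-in for intonational phrase boundaries."""
    return [p.strip() for p in sentence.rstrip(".!?").split(",") if p.strip()]


def declining_contour(phrase, start_hz=180.0, end_hz=120.0):
    """Assign each word a target pitch that falls linearly across the phrase.

    start_hz and end_hz are hypothetical values chosen for illustration.
    """
    words = phrase.split()
    if len(words) == 1:
        return [(words[0], start_hz)]
    step = (start_hz - end_hz) / (len(words) - 1)
    return [(word, start_hz - i * step) for i, word in enumerate(words)]


def sentence_contours(sentence):
    """Return one declining contour per intonational phrase."""
    return [declining_contour(p) for p in split_into_phrases(sentence)]
```

Each phrase restarts at the higher pitch and falls toward the lower one, which mirrors the declining contour described above in a deliberately simplified way.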
The speech waveform generation section 104 receives the acoustic sequence specification of the original sentence from the prosodic analysis section 103, and generates a human-sounding digital audio waveform (waveform file 110). The speech waveform generation section 104 may generate an audible signal by employing a model of the vocal tract to produce a base waveform that is modulated according to the acoustic sequence specification to produce a digital audio waveform file. Another known method of generating an audible signal is through the concatenation of small portions of pre-recorded digital audio. These digital audio units are typically obtained by recording utterances from a human speaker. The series of concatenated units is then modulated according to the parameters of the acoustic sequence specification to produce an output digital audio waveform file. In most cases, the concatenated digital audio units will have a one-to-one correspondence to the acoustic units in the acoustic sequence specification. The resulting digital audio waveform file 110 may be rendered into audio by converting it into an analog signal, and then transmitting the analog signal to a speaker.
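The concatenative approach can be sketched in Python as follows. The unit inventory, its sample values, and the single gain parameter are hypothetical; the gain merely stands in for the richer modulation (pitch, duration, and so on) that an acoustic sequence specification would drive in a real system.

```python
# Hypothetical inventory: acoustic unit symbol -> pre-recorded audio samples.
UNIT_INVENTORY = {
    "HH": [0, 3, 5],
    "AY": [7, 9, 7, 4],
}


def concatenate_units(symbols, gain=1.0):
    """Join the recorded unit for each symbol, then apply a simple gain.

    One recorded unit is used per acoustic unit symbol, reflecting the
    one-to-one correspondence noted above.
    """
    waveform = []
    for symbol in symbols:
        waveform.extend(UNIT_INVENTORY[symbol])
    return [int(round(sample * gain)) for sample in waveform]
```

For example, `concatenate_units(["HH", "AY"])` joins the two recorded units end to end into a single sample sequence.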
In the context of a web-based application, text 101 may be specifically designated by a web-page designer as text that viewers of the web site can hear as well as read. There are several methods that may be utilized to prepare a portion of web text for rendering into speech in the form of a digital audio waveform. A human speaker may read the text aloud into a collection of digital audio recordings. A remote client can then download and listen to the digital audio files corresponding to selected portions of the text. In another approach, a web-page author may elect to perform the steps of text analysis 102, prosodic analysis 103, and speech waveform generation 104 for each portion of text, producing a collection of digital audio files that could be stored on the web server and then transferred on request to the remote client.
An advantage of the above techniques is that rendering the binary speech waveform file 110 into audio at the client is a simple process that requires very few client resources. The digital audio files can be rendered into audio on web-access devices possessing minimal amounts of computer memory and little if any computational power. A disadvantage, however, is that digital audio files corresponding to speech waveforms 110 tend to be large files that require considerable network bandwidth. This can be particularly problematic for clients connected to network 115 using a relatively slow connection such as a dial-up modem or a wireless cell-modem connection.
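The bandwidth concern can be made concrete with a back-of-the-envelope calculation, sketched here in Python. The figures are assumptions chosen for illustration only: uncompressed 16-bit mono audio sampled at 8 kHz and a nominal 56 kbit/s dial-up link, with protocol overhead ignored.

```python
def pcm_size_bytes(duration_s, sample_rate_hz=8000, bytes_per_sample=2):
    """Size of uncompressed mono audio (assumed format) for a given duration."""
    return int(duration_s * sample_rate_hz * bytes_per_sample)


def transfer_time_s(size_bytes, link_bits_per_s=56_000):
    """Idealized transfer time over a link of the assumed nominal speed."""
    return size_bytes * 8 / link_bits_per_s


audio_bytes = pcm_size_bytes(10)             # ten seconds of speech
audio_seconds = transfer_time_s(audio_bytes)
```

Under these assumptions, ten seconds of speech occupies 160,000 bytes and takes roughly 23 seconds to transfer, so the audio arrives more slowly than it plays back. The text from which it was synthesized would typically amount to a few hundred bytes at most.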
Another conventional speech synthesis technique for generating synthesized speech at a client computer is implemented using a process similar to that shown in FIG. 1, with the exception that text analysis section 102, prosodic analysis section 103, and speech waveform generation section 104 are all located locally at the client. In operation, text 101 is transmitted over the network to the client, and all the speech synthesis steps are then performed locally. A problem associated with this method of speech synthesis is that it can be computationally burdensome to the client. Additionally, programs for performing textual analysis, prosodic analysis, and speech waveform generation may be large programs containing extensive look-up dictionaries. Such programs are not suitable for web-terminals or for small portable browsers such as those incorporated into cellular phones or personal digital assistant (PDA) devices.
Accordingly, there is a need in the art to be able to efficiently deliver and synthesize speech at client devices, especially when the client devices have limited processing ability and low bandwidth connections.