This invention relates generally to text-to-speech synthesis. More particularly, it relates to a client/server architecture for very high quality and efficient text-to-speech synthesis.
Text-to-speech (TTS) synthesis systems are useful in a wide variety of applications such as automated information services, auto-attendants, avatars, computer-based instruction, and computer systems for the vision impaired. An ideal system converts a piece of text into high-quality, natural-sounding speech in near real time. Producing high-quality speech requires a large number of potential acoustic units and complex rules and exceptions for combining the units, i.e., large storage capability and high computational power. A prior art text-to-speech system 10 is shown schematically in FIG. 1. An original piece of text is converted to speech by a number of processing modules. The input text specification usually contains punctuation, abbreviations, acronyms, and non-word symbols. A text normalization unit 12 converts the input text to a normalized text containing a sequence of non-abbreviated words only. Most punctuation is useful in suggesting appropriate prosody, and so the text normalization unit 12 filters out punctuation to be used as input to a prosody generation unit 16. Other punctuation is extraneous and filtered out completely. Abbreviations and acronyms are converted to their equivalent word sequences, which may or may not depend on context. The most complex task of the text normalization unit 12 is to convert symbols to word sequences. For example, numbers, currency amounts, dates, times, and email addresses are detected, classified, and then converted to text that depends on the symbol""s position in the sentence. The normalized text is sent to a pronunciation unit 14 that first analyzes each word to determine its simplest morphological representation. This is trivial in English, but in a language in which words are strung together (e.g., German), words must be divided into base words and prefixes and suffixes. The resulting words are then converted to a phoneme sequence or its pronunciation. The pronunciation may depend on a word""s position in a sentence or its context (i.e., the surrounding words). Three resources are used by the pronunciation unit 14 to perform conversion: letter-to-sound rules; statistical representations that convert letter sequences to most probable phoneme sequences based on language statistics; and dictionaries, which are simple word/pronunciation pairs. Conversion can be performed without statistical representations, but all three resources are preferably used. Rules can distinguish between different pronunciations of the same word depending on its context. Other rules are used to predict pronunciations of unseen letter combinations based on human knowledge. Dictionaries contain exceptions that cannot be generated from rules or statistical methods. The collection of rules, statistical models, and dictionary forms the database needed for the pronunciation unit 14. This database is usually quite large in size, particularly for high-quality text-to-speech conversion.
The resulting phonemes are sent to the prosody generation unit 16, along with punctuation extracted from the text normalization unit 12. The prosody generation unit 16 produces the timing and pitch information needed for speech synthesis from sentence structure, punctuation, specific words, and surrounding sentences of the text. In the simplest case, pitch begins at one level and decreases toward the end of a sentence. The pitch contour can also be varied around this mean trajectory. Dates, times, and currencies are examples of parts of a sentence that are identified as special pieces; the pitch of each is determined from a rule set or statistical model that is crafted for that type of information. For example, the final number in a number sequence is almost always at a lower pitch than the preceding numbers. The rhythms, or phoneme durations, of a date and a phone number are typically different from each other. Usually a rule set or statistical model determines the phoneme durations based on the actual word, its part of the sentence, and the surrounding sentences. These rule sets or statistical models form the database needed for this module; for the more natural sounding synthesizers, this database is also quite large.
The final unit, an acoustic signal synthesis unit 18, combines the pitch, duration and phoneme information from the pronunciation unit 14 and the prosody generation unit 16 to produce the actual acoustic signal. There are two dominant methods in state of the art speech synthesizers. The first is formant synthesis, in which a human vocal track is modeled and phonemes are synthesized by producing the necessary formants. Formant synthesizers are very small, but the acoustic quality is insufficient for most applications. The more widely used high-quality synthesis technique is concatenative synthesis, in which a voice artist is recorded to produce a database of sub-phonetic, phonetic, and larger multi-phonetic units. Concatenative synthesis is a two-step process: deciding which sequence of units to use, and concatenating them in such a way that duration and pitch are modified to obtain the desired prosody. The quality of such a system is usually proportional to the size of the phonetic unit database.
A high quality text-to-speech synthesis system thus requires large pronunciation, prosody, and phonetic unit databases. While it is certainly possible to create and efficiently search such large databases, it is much less feasible for a single user to own and maintain such databases. One solution is to provide a text-to-speech system at a server machine and available to a number of client machines over a computer network. For example, the clients provide the system with a piece of text, and the server transmits the converted speech signal to the user. Standard speech coders can be used to decrease the amount of data transmitted to the client.
One problem with such a system is that the quality of speech eventually produced at the client depends on the amount of data transmitted from the server. Unless an unusually high bandwidth connection is available between the server and the client, the connection is such that an unacceptably long delay is required to receive data producing high quality sound at the client. For typical client applications, the amount of data transmitted must be reduced so that the communication traffic is at an acceptable level. This data reduction is necessarily accompanied by approximations and loss of speech quality. The client/server connection is therefore the limiting factor in determining speech quality, and the high-quality speech synthesis at the server is not fully exploited.
U.S. Pat. No. 5,940,796, issued to Matsumoto, provides a speech synthesis client/server system. A voice synthesizing server generates a voice waveform based on data sent from the client, encodes the waveform, and sends it to the client. The client then receives the encoded waveform, decodes it, and outputs it as voice. There are a number of problems with the Matsumoto system. First, it uses signal synthesis methods such as formant synthesis, in which a human vocal track is modeled according to particular parameters. The acoustic quality of formant synthesizers is insufficient for most applications. Second, the Matsumoto system uses standard speech compression algorithms for compressing the generated waveforms. While these algorithms do reduce the data rate, they still suffer the quality/speed tradeoff mentioned above for standard speech coders. Generic speech coders are designed for the transmission of unknown speech, resulting in adequate acoustic quality and graceful degradation in the presence of transmission noise. The design criteria are somewhat different for a text-to-speech system in which a pleasant sounding voice (i.e., higher than adequate acoustic quality) is desired, the speech is known beforehand, and there is sufficient time to retransmit data to correct for transmission errors. In addition, a client with sufficient CPU resources is capable of implementing a more demanding decompression scheme. Given these different criteria, a more optimal and higher compression methodology is possible.
There is still a need, therefore, for a client/server architecture for text-to-speech synthesis that outputs high-quality, natural-sounding speech at the client.
Accordingly, the present invention provides a client/server system and method for high-quality text-to-speech synthesis. The method is divided between the client and server such that the server performs steps requiring large amounts of storage, while the client performs the more computationally intensive steps. The data transmitted between client and server consist of acoustic units that are highly compressed using an optimized compression method.
A text-to-speech synthesis method of the invention includes the steps of obtaining a normalized text, selecting acoustic units corresponding to the text from a database that stores a predetermined number of possible acoustic units, transmitting compressed acoustic units from a server machine to a client machine, and, in the client, decompressing and concatenating the units. The selected units are compressed before being transmitted to the client using a compression method that depends on the predetermined number of possible acoustic units. This results in minimal degradation between the original and received acoustic units. For example, parameters of the compression method can be selected to minimize the amount of data transmitted between the server and client while maintaining a desired quality level. The acoustic units stored in the database are preferably compressed acoustic units.
Preferably, the client machine concatenates the acoustic units in dependence on prosody data that the server generates and transmits to the client. The client can also store cached acoustic units that it concatenates with the acoustic units received from the server. Preferably, transmission of acoustic units and concatenation by the client occur simultaneously for sequential acoustic units. The server can also normalize a standard text to obtain the normalized text.
The invention also provides a text-to-speech synthesis method performed in a server machine. The method includes obtaining a normalized text, selecting acoustic units corresponding to the normalized text from a database that stores a predetermined number of possible acoustic units, and transmitting compressed acoustic units to a client machine. The selected acoustic units are compressed using a compression method that depends on the specific predetermined number of possible acoustic units. Specifically, the compression method minimizes the amount of data transmitted by the server while maintaining a desired quality level. Preferably, the acoustic units in the database are compressed acoustic units. The server may also generate prosody data corresponding to the text and transmit the prosody data to the client. It may also normalize a standard text to obtain the normalized text.
Also provided is a text-to-speech synthesis method performed in a client machine. The client receives compressed acoustic units corresponding to a normalized text from a server machine, decompresses the units, and concatenates them, preferably in dependence on prosody data also received from the server. Preferably, the method steps are performed concurrently. The units are selected from a predetermined number of possible units and compressed according to a compression method that depends on the predetermined number of possible units. For example, parameters of the compression method can be selected to minimize the amount of data transmitted to the client machine. The client can also store at least one cached acoustic unit that it concatenates with the received acoustic units. The client can also transmit a standard text to be converted to speech to the server, or it can normalize a standard text and send the normalized text to the server.
The present invention also provides a text-to-speech synthesis system containing a database of predetermined acoustic units, a server machine communicating with the database, and a client machine communicating with the server machine. The server machine selects acoustic units and generates prosody data corresponding to a normalized text and then transmits compressed units and the prosody data to the client. The client machine concatenates the received acoustic units in dependence on the prosody data. The compressed acoustic units are obtained by compressing the selected acoustic units using a compression method that depends on the predetermined acoustic units. Preferably, parameters of the compression method are selected to minimize the data transmitted from the server to the client. The database preferably stores compressed acoustic units. Either the client or server can also normalize a standard text to obtain the normalized text. Preferably, the client stores cached acoustic units that are concatenated with the received acoustic units.
The present invention also provides a program storage device accessible by a server machine and tangibly embodying a program of instructions executable by the server machine to perform method steps for the above-described text-to-speech synthesis method.