The present invention relates generally to the field of text-to-speech conversion systems and in particular to a method and apparatus for performing text-to-speech conversion in a client/server environment such as, for example, across a wireless network from a base station (a server) to a mobile unit such as a cell phone (a client).
Text-to-speech systems in which input text is converted into audible human-like speech sounds have become commonly employed tools in a variety of fields such as automated telecommunications systems, navigation systems, and even in children's toys. Although such systems have existed for quite some time, over the past several years the quality of these systems has improved dramatically, thereby allowing applications which employ text-to-speech functionality to be far more than mere novelties. In fact, state-of-the-art text-to-speech systems can now automatically synthesize speech which sounds quite close to a human voice, and can do so from essentially arbitrary input text.
One well-known use of text-to-speech systems is in the synthesis of speech in telecommunications applications. For example, many automated telephone response systems respond to a caller with synthesized speech automatically generated "on the fly" from a set of contemporaneously derived text. As is well recognized by businesses and consumers alike, the purpose of these systems is typically to provide a customer with the assistance he or she desires, but to do so without incurring the enormous cost associated with a large staff of human operators.
When telecommunications applications involving text-to-speech conversion are used in wireless (e.g., cellular phone) environments, the approach invariably employed is that the text-to-speech system resides at some non-mobile location where the input text is converted to a synthesized speech signal, and then the resultant speech signal is transmitted to the cell phone in a conventional manner (i.e., as any human speech would be transmitted to the cell phone). The central location may, for example, be a cellular base station, or it may be even further "back" in the telecommunications "chain", such as at a central location which is independent from the particular base station with which the cell phone is communicating. The conventional means of transmitting the synthesized speech to the cell phone typically involves the process of encoding the speech signal with a conventional audio coder (fully familiar to those skilled in the art), transmitting the coded speech signal, and then decoding the received signal at the cell phone.
This conventional approach, however, often leads to unsatisfactory sound quality. Speech data requires a great deal of bandwidth, and the information is subject to data loss in the wireless transmission process. Moreover, since in speech synthesis the parameters are decoded to produce a speech signal and in wireless transmission the speech is encoded and subsequently decoded for efficient transmission, there may be an incompatibility between the coding for synthesis and the coding for transmission that may introduce further degradation in the synthesized speech signal.
One theoretical alternative to the above approach might be to place the text-to-speech system on the cell phone itself, thereby requiring only the text which is to be converted to be transmitted across the wireless channel. Obviously, such text could be transmitted quite easily with minimal bandwidth requirements. Unfortunately, a high quality text-to-speech system is quite algorithmically complex and therefore requires significant processing power, which may not be available on a hand-held device such as a cell phone. More importantly, a high quality text-to-speech system requires a relatively substantial amount of memory to store tables of data which are needed by the conversion process. In particular, present text-to-speech systems usually require between five and eighty megabytes of storage, an amount of memory which is obviously impractical to include on a hand-held device such as a cell phone, even with today's state-of-the-art memory technology. Therefore, another, more practical approach is needed to improve the quality of text-to-speech in wireless applications.
In accordance with the principles of the present invention, a method and apparatus for performing text-to-speech conversion in a client/server environment advantageously partitions an otherwise conventional text-to-speech conversion algorithm into two portions: a first "text analysis" portion, which generates from an original input text an intermediate representation thereof, and a second "speech synthesis" portion, which synthesizes speech waveforms from the intermediate representation generated by the first portion (i.e., the text analysis portion). Moreover, in accordance with the principles of the present invention, the text analysis portion of the algorithm is executed exclusively on a server while the speech synthesis portion is executed exclusively on a client which may be associated therewith. In accordance with certain illustrative embodiments of the present invention, the client may comprise a hand-held device such as, for example, a cell phone.
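By way of illustration only, the partition described above may be sketched as two independent functions, one for the server and one for the client. The names `analyze_text`, `synthesize`, and `TOY_LEXICON`, the toy pronunciation table, and the sine-burst "synthesis" are all assumptions made for this sketch; they stand in for a real text analysis stage and a real waveform synthesizer and are not taken from the specification.

```python
import math

# Hypothetical toy lexicon. A real text analysis portion would rely on
# large pronunciation tables held at the server.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def analyze_text(text):
    """Server side: convert input text to a phoneme sequence
    (the intermediate representation)."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # "?" marks a word missing from the toy lexicon.
        phonemes.extend(TOY_LEXICON.get(word, ["?"]))
    return phonemes


def synthesize(phonemes, sample_rate=8000, duration_ms=50):
    """Client side: produce waveform samples from the phoneme sequence.

    Placeholder synthesis: one fixed-length sine burst per phoneme, with a
    frequency derived arbitrarily from the phoneme symbol.
    """
    samples_per_phoneme = sample_rate * duration_ms // 1000
    samples = []
    for symbol in phonemes:
        freq = 200 + 20 * (sum(map(ord, symbol)) % 10)
        for t in range(samples_per_phoneme):
            samples.append(math.sin(2 * math.pi * freq * t / sample_rate))
    return samples


# Only the phoneme sequence crosses the (wireless) channel.
intermediate = analyze_text("Hello, world!")
waveform = synthesize(intermediate)
```

In this sketch the server never produces audio and the client never sees the original text; the only data exchanged is the short list of phoneme symbols.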
In accordance with various illustrative embodiments of the present invention, the intermediate representation of the input text advantageously comprises at least a sequence of phonemes representative of the input text. In addition, phoneme duration information and/or phoneme pitch information for the speech to be synthesized may be advantageously determined either at the server (i.e., as part of the text analysis portion of the partitioned text-to-speech system) or at the client (i.e., as part of the speech synthesis portion of the partitioned text-to-speech system). Similarly, other prosodic information which may be employed by the speech synthesis process may be alternatively determined by either of these two partitions.
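One possible shape for such an intermediate representation is sketched below, purely as an illustration. The `PhonemeUnit` type, the JSON wire format, and the default duration and pitch values are assumptions of this sketch rather than details given in the specification. Duration and pitch are optional fields: when the server has determined them they travel with the phoneme, and when it has not, the client fills them in.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class PhonemeUnit:
    """One element of the hypothetical intermediate representation."""
    symbol: str
    duration_ms: Optional[int] = None   # set by the server, or left to the client
    pitch_hz: Optional[float] = None    # set by the server, or left to the client


def encode(units: List[PhonemeUnit]) -> bytes:
    """Server side: serialize, omitting fields the server did not determine."""
    compact = [
        {k: v for k, v in asdict(u).items() if v is not None} for u in units
    ]
    return json.dumps(compact, separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> List[PhonemeUnit]:
    """Client side: rebuild the phoneme sequence from the received bytes."""
    return [PhonemeUnit(**item) for item in json.loads(payload.decode("utf-8"))]


def fill_prosody(units, default_duration_ms=80, default_pitch_hz=120.0):
    """Client side: supply any duration or pitch the server left unspecified.

    The constant defaults are placeholders for a real client-side prosody model.
    """
    return [
        PhonemeUnit(
            u.symbol,
            u.duration_ms if u.duration_ms is not None else default_duration_ms,
            u.pitch_hz if u.pitch_hz is not None else default_pitch_hz,
        )
        for u in units
    ]


# The server determined duration for the first phoneme only.
sent = [PhonemeUnit("HH", duration_ms=60), PhonemeUnit("AH")]
received = fill_prosody(decode(encode(sent)))
```

Keeping the prosodic fields optional lets the same message format serve both arrangements described above, with the prosody computed at either partition.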
Furthermore, in accordance with one illustrative embodiment of the present invention, certain audio segment information which is to be used by the speech synthesis portion of the text-to-speech process may be advantageously transmitted by the server to the client, and a cache of such audio segments may then be advantageously maintained at the client (e.g., in the cell phone) for use by the speech synthesis process in order to obtain improved quality of the synthesized speech. The server may also advantageously maintain a model of said client cache in order to keep track of its contents over time.
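The following is a minimal sketch of how a client cache and a server-side model of that cache might be kept in step. It assumes, as an illustration only, a least-recently-used eviction policy applied identically on both sides; the specification does not state which policy is used. The names `SegmentCache`, `server_plan`, and `client_apply`, the message tuples, and the segment identifiers are likewise hypothetical.

```python
from collections import OrderedDict


class SegmentCache:
    """LRU store. The client holds audio data in it; the server runs the same
    class with ids only, as its model of the client cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def touch(self, seg_id, data=None):
        """Record a use of seg_id and return True if it was already cached."""
        hit = seg_id in self.entries
        if hit:
            self.entries.move_to_end(seg_id)
        else:
            self.entries[seg_id] = data
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return hit


def server_plan(model, needed_ids, segment_db):
    """Server side: send audio only for segments the model says are missing."""
    messages = []
    for seg_id in needed_ids:
        if model.touch(seg_id):
            messages.append(("ref", seg_id, None))
        else:
            messages.append(("data", seg_id, segment_db[seg_id]))
    return messages


def client_apply(cache, messages):
    """Client side: update the cache and return segments in playback order."""
    audio = []
    for kind, seg_id, data in messages:
        if kind == "data":
            cache.touch(seg_id, data)
        else:
            cache.touch(seg_id)
            data = cache.entries[seg_id]
        audio.append(data)
    return audio


segment_db = {"a-b": b"\x01", "b-c": b"\x02", "c-d": b"\x03"}
model, cache = SegmentCache(2), SegmentCache(2)
first = server_plan(model, ["a-b", "b-c", "a-b"], segment_db)
first_audio = client_apply(cache, first)
second = server_plan(model, ["c-d", "b-c"], segment_db)
second_audio = client_apply(cache, second)
```

Because both sides apply the same deterministic updates to the same sequence of segment identifiers, the server's model stays consistent with the client's actual cache without the client having to report its contents.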