There are many systems in use for transmitting voice messages from one place to another. While public and private telephone networks are the most common example, voice or audio messages are also transmitted via computer networks, including the Internet and the part of the Internet known as the World Wide Web. In a relatively small number of telephone systems, and in most computer contexts, voice messages are transmitted in a digital, compressed, encoded form. Most often, various forms of linear predictive coding (LPC) and adaptive LPC are used to compress voice signals from a raw data rate of 8 to 10 kilobytes per second to data rates in the range of 1 to 3 kilobytes per second. Voice quality is usually rather poor for voice signals compressed using LPC techniques down to data rates under 1.5 kilobytes per second.
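As a rough illustration of the figures quoted above, the following sketch computes the range of compression ratios implied by a raw rate of 8 to 10 kilobytes per second and a compressed rate of 1 to 3 kilobytes per second. The function and constant names are hypothetical and chosen only for this example; the rates themselves are the ones stated in the text.

```python
def compression_ratio(raw_rate: float, compressed_rate: float) -> float:
    """Return how many times smaller the compressed stream is than the raw one."""
    if raw_rate <= 0 or compressed_rate <= 0:
        raise ValueError("data rates must be positive")
    return raw_rate / compressed_rate


# Data rates quoted in the text, in kilobytes per second.
RAW_RATES = (8.0, 10.0)
COMPRESSED_RATES = (1.0, 3.0)


def ratio_range(raw_rates, compressed_rates):
    """Return the (smallest, largest) compression ratio over the given ranges."""
    # Smallest ratio: lowest raw rate against the highest compressed rate.
    smallest = compression_ratio(min(raw_rates), max(compressed_rates))
    # Largest ratio: highest raw rate against the lowest compressed rate.
    largest = compression_ratio(max(raw_rates), min(compressed_rates))
    return smallest, largest


if __name__ == "__main__":
    low, high = ratio_range(RAW_RATES, COMPRESSED_RATES)
    print(f"compression ratio between {low:.1f}:1 and {high:.1f}:1")
```

Under these figures the compressed stream is roughly 2.7 to 10 times smaller than the raw stream.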
Messages are also commonly transmitted via telephone and computer networks in text form. Text is enormously more efficient in its use of bandwidth than voice, at least in terms of the amount of data required to transmit a given amount of information. While text transmission (including the transmission of various binary document files) is fine for recipients who have the facilities and inclination to read the transmitted text, there are many contexts in which it is either essential or desirable for recipients to have information communicated to them orally. In such contexts, the transmission of text to the recipient is feasible only if the receiving system includes text to speech conversion apparatus or software.
Text to speech conversion is the process by which raw text, such as the words in a memorandum or other document or file, is converted into audio signals. There are a number of competing approaches for text to speech conversion. The text to speech conversion methodology used by the present invention is described in some detail in U.S. Pat. No. 4,979,216.
In addition to the efficient transmission of voice messages, the present invention addresses another problem associated with real time distribution of digitized voice messages via computer network connections. In particular, it is very common for data transmissions between a network server, such as a World Wide Web (hereinafter Web) server, and a client computer to experience periods during which the rate of transmission is highly variable, often including periods of one or more seconds in which the data rate is zero. This produces unsettling results when the receiving client computer is playing the received data stream as an audio signal in real time, because speech can stop and restart mid-word or mid-phrase, with silent periods of unpredictable length.
Yet another problem with existing speech message transmission systems is that there is very little the receiving system can do with the received message other than "play it" as an audio signal. That is, the receiving system generally cannot determine what is being said, cannot modify the voice characteristics of received signals except in very primitive ways (e.g., with a graphic band equalizer), and cannot perform any actions, such as generating a corresponding animation of a speaking person, that would require information about the words or phonemes being spoken.
It is therefore an object of the present invention to provide a speech signal distribution system that efficiently transmits data representing speech signals and that gives receiving systems a high degree of control over the use of that data.
It is another object of the present invention to use text to speech conversion to convert text into a data stream of parameters suitable for driving an audio signal generator that converts the stream of parameters into an audio speech signal in accordance with a vocal tract model, and to transmit the data stream to receiving systems having such audio signal generators.
Another object of the present invention is to transmit a high quality speech signal to receiving systems using a bandwidth of less than 1.5 kilobytes per second.
Another object of the present invention is to transmit a speech signal to receiving systems with sentence boundary data embedded in the speech signal so as to enable the receiving systems to present audio speech signals as full, uninterrupted sentences, despite any interruptions in the transmission of said speech signal.
Yet another object of the present invention is to transmit a speech signal to receiving systems with lip position data embedded in the speech signal so as to enable the receiving systems to generate an animated mouth-like image that moves in accordance with the lip position data in the received data stream.
Still another object of the present invention is to transmit a speech signal to receiving systems with voice setting data (e.g., indicating special effects to be applied to the speech signal) embedded in the speech signal so as to enable the receiving systems to control the generation of audio speech signals in accordance with the voice setting data in the received data stream.
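The object concerning sentence boundary data can be pictured with a minimal sketch of a receiver that releases speech for playback only once a full sentence has arrived, so that a stall in transmission falls between sentences rather than mid-word. Everything here is an assumption made for illustration: the `Frame` type, its `sentence_end` flag, and the `SentenceBuffer` class are hypothetical and do not represent the actual data format or apparatus of the invention.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Frame:
    """Hypothetical unit of the received stream (not the invention's actual format)."""
    payload: bytes              # parameters for a short span of speech
    sentence_end: bool = False  # assumed marker: True on the last frame of a sentence


class SentenceBuffer:
    """Holds received frames until a complete sentence is available."""

    def __init__(self) -> None:
        self._pending: List[Frame] = []

    def push(self, frame: Frame) -> Optional[List[Frame]]:
        """Add a frame; return the whole sentence if this frame completes it."""
        self._pending.append(frame)
        if frame.sentence_end:
            sentence, self._pending = self._pending, []
            return sentence
        return None


def complete_sentences(frames: Iterable[Frame]) -> Iterator[List[Frame]]:
    """Yield only fully received sentences, in arrival order."""
    buffer = SentenceBuffer()
    for frame in frames:
        sentence = buffer.push(frame)
        if sentence is not None:
            yield sentence


if __name__ == "__main__":
    received = [
        Frame(b"hel"), Frame(b"lo", sentence_end=True),
        Frame(b"par"),  # transmission stalls here; this fragment is held back
    ]
    for sentence in complete_sentences(received):
        print(b"".join(f.payload for f in sentence))
```

In this sketch the trailing fragment is never handed to the audio generator until its closing frame arrives, which is the behavior the sentence boundary data is meant to enable.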