The present invention is generally related to communication methods and systems employing text-to-speech engines and, more particularly, to a method and system for delivering text-to-speech in a real time telephony environment.
Text-to-speech (TTS) engines are computing devices which convert written text into audible computer generated speech. Telephony based applications require TTS engines to convert email, news, stock quotes, sports scores, and many other types of textual data into speech for delivery to telephony users. In these types of telephony applications, a speech version of a text document is demanded in real time by telephony users. Because the text which is requested by telephony users is not known beforehand, the text must be converted in real time and delivered without delay to the telephony users.
Performing high quality text-to-speech conversion or synthesis is resource intensive. For example, given 4,000 bytes of textual data, a typical TTS engine produces an audio or speech file having three million bytes to play for the telephony user. This is an expansion ratio of roughly 750 to one and presents a serious bottleneck for the synthesis of large textual documents. As a result, the telephony user will likely not wait the several minutes it may take to convert the entire textual document into speech before any speech is provided. Synthesizing the text into speech before the telephony user requests the text is not a viable option, as it is generally not known what the telephony user will request. Additionally, the physical storage requirements for a large number of pre-synthesized audio files are prohibitive in many environments.
Accordingly, it is an object of the present invention to provide a method and system for delivering text-to-speech (TTS) in a real time telephony environment in which text documents of any size are efficiently converted into speech which is provided immediately to a telephony user.
It is another object of the present invention to provide a method and system for delivering TTS in a real time telephony environment in which a first part of a text is converted into a first speech segment and the first speech segment is delivered to a telephony user while a second part of the text is being converted into a second speech segment for delivery to the telephony user after the first speech segment has been delivered to the telephony user.
It is a further object of the present invention to provide a method and system for delivering TTS in a real time telephony environment in which a text is divided into text segments for conversion by a farm of TTS engines into speech segments which are then reassembled in the proper order and delivered to a telephony user.
It is still another object of the present invention to provide a method and system for delivering TTS in a real time telephony environment which employ a streaming buffer of speech converted from text for delivery to a telephony user in which the streaming buffer adapts to the bandwidth of the network delivering the speech to the telephony user.
It is still a further object of the present invention to provide a method and system for delivering TTS in a real time telephony environment which employ a streaming buffer for storing speech converted from text such that a first speech segment corresponding to a first text segment is delivered to the telephony user from the streaming buffer while a second speech segment corresponding to a second text segment is being delivered to the streaming buffer for future delivery to the telephony user.
In carrying out the above objects and other objects, the present invention provides a communication system for communicating information to a telephony user in response to a request for the information from the telephony user. The system includes a text data source having a plurality of text documents. A voice application is operable with the telephony user for receiving a request from the telephony user for information. The voice application is operable with the text data source for retrieving a text document related to the information requested by the telephony user. A text-to-speech (TTS) resource manager is operable for dividing the text document into text document segments and associating a sequence number with each text document segment. The TTS resource manager places the text document segments and the corresponding sequence numbers in a sequential order within a queue. A TTS engine farm has a plurality of TTS engines which are operable for receiving text document segments and the corresponding sequence numbers from the queue of the TTS resource manager in the sequential order for converting the text document segments into speech segments. Each text document segment is converted into a speech segment by one TTS engine. A buffer receives the speech segments and the corresponding sequence numbers from the TTS engines. The buffer uses the corresponding sequence numbers to reassemble the speech segments in the proper order and then delivers the speech segments in the proper order to the telephony user via the voice application in order to satisfy the request for information from the telephony user.
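By way of illustration only, the divide, queue, convert, and reassemble flow described above might be sketched in Python as follows. The names split_into_segments, fake_tts, and convert_document are hypothetical, the thread pool merely stands in for the TTS engine farm, and fake_tts is a placeholder that returns bytes rather than real audio. Splitting on word boundaries with a fixed character limit is an assumption; the description does not prescribe a splitting rule.

```python
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_segments(text, max_chars=40):
    """Divide text into (sequence_number, segment) pairs on word boundaries."""
    segments, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = word
        else:
            current = candidate
    if current:
        segments.append(current)
    return list(enumerate(segments))


def fake_tts(segment):
    """Placeholder for one TTS engine: returns bytes standing in for audio."""
    return segment.upper().encode("utf-8")


def convert_document(text, engines=3, max_chars=40):
    """Queue numbered segments, convert them in parallel, reassemble in order."""
    work = queue.Queue()
    for item in split_into_segments(text, max_chars):
        work.put(item)  # segments enter the queue in sequential order

    def convert(item):
        seq, segment = item
        return seq, fake_tts(segment)  # one engine handles one segment

    with ThreadPoolExecutor(max_workers=engines) as farm:
        futures = []
        while not work.empty():
            futures.append(farm.submit(convert, work.get()))
        # Results may complete in any order; the sequence number restores it.
        results = [f.result() for f in as_completed(futures)]
    return [audio for _, audio in sorted(results, key=lambda r: r[0])]
```

Calling convert_document on a short text returns the placeholder audio segments in their original order regardless of which worker finished first.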
The TTS resource manager is operable to determine the rate at which speech segments are delivered to the telephony user from the buffer. The TTS resource manager divides the text document as a function of the rate at which speech segments are delivered to the telephony user such that the speech segments are delivered from the TTS engines to the buffer and from the buffer to the telephony user continuously.
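One possible way to divide a text document as a function of the delivery rate is sketched below. This is a simplified model under stated assumptions, not the method of the invention: the function plan_segment_sizes and all of its parameters are hypothetical, the figure of 750 audio bytes per text byte follows from the 4,000 byte example given earlier, and the delivery rate of 8,000 bytes per second and the synthesis speed of 20 characters per second are assumed values. The model keeps each segment small enough that it can be synthesized in no more time than the preceding segment takes to play.

```python
def plan_segment_sizes(total_chars, first_chars, audio_bytes_per_char,
                       delivery_bytes_per_sec, synth_chars_per_sec,
                       max_chars=2000):
    """Return segment sizes (in characters) that keep playback continuous.

    Assumed model: a segment of n characters plays for
    n * audio_bytes_per_char / delivery_bytes_per_sec seconds, and an engine
    converts synth_chars_per_sec characters each second.
    """
    sizes = []
    remaining = total_chars
    size = first_chars  # a small first segment lets playback start quickly
    while remaining > 0:
        size = max(1, min(int(size), max_chars, remaining))
        sizes.append(size)
        remaining -= size
        # The next segment may be as large as can be synthesized while
        # the segment just planned is being played to the user.
        play_seconds = size * audio_bytes_per_char / delivery_bytes_per_sec
        size = play_seconds * synth_chars_per_sec
    return sizes
```

Under these assumed numbers the segments grow from one to the next, so the first audio is available quickly while later conversions stay ahead of playback.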
The TTS resource manager is further operable to determine the load of each of the TTS engines. The TTS resource manager delivers the text document segments to the TTS engines as a function of the load of the TTS engines.
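A minimal sketch of load-based delivery follows. The description does not define how load is measured, so this example assumes that load is the number of characters already assigned to an engine; the function assign_by_load and the engine names are hypothetical.

```python
import heapq


def assign_by_load(segments, engine_names):
    """Give each (seq, text) segment to the engine with the least pending work."""
    # Heap entries are (pending_chars, engine_name); load is an assumed metric.
    heap = [(0, name) for name in engine_names]
    heapq.heapify(heap)
    assignment = []
    for seq, text in segments:
        load, name = heapq.heappop(heap)  # least loaded engine
        assignment.append((seq, name))
        heapq.heappush(heap, (load + len(text), name))
    return assignment
```

For example, with two engines and segments of 4, 2, 1, and 4 characters, the first segment goes to the first engine and the remaining three go to the second, because the second engine's pending load stays lower at each step.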
In operation, the buffer delivers a first speech segment to the telephony user via the voice application after the buffer has received a second speech segment from a TTS engine and while the buffer is receiving a third speech segment from a TTS engine such that the speech segments are delivered to the telephony user continuously. The buffer delivers the first speech segment to the telephony user via the voice application while a TTS engine is converting a fourth text document segment into a fourth speech segment.
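The buffering behavior described above, in which a segment is released only once the following segment has already arrived, could be sketched as shown below. The class StreamingBuffer, its lookahead parameter, and the flush method are hypothetical names introduced for illustration; a lookahead of one segment is assumed to match the first and second speech segment example.

```python
class StreamingBuffer:
    """Holds out-of-order speech segments and releases them in sequence order."""

    def __init__(self, lookahead=1):
        self._pending = {}          # sequence number -> audio
        self._next_seq = 0          # next sequence number to deliver
        self._lookahead = lookahead

    def _ready(self):
        # The next segment is released only if it and the following
        # `lookahead` segments have all been received.
        return all(self._next_seq + k in self._pending
                   for k in range(self._lookahead + 1))

    def put(self, seq, audio):
        """Store a segment and return the segments now ready for delivery."""
        self._pending[seq] = audio
        released = []
        while self._ready():
            released.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return released

    def flush(self):
        """Release whatever remains, in order, once conversion has finished."""
        released = [self._pending[s] for s in sorted(self._pending)]
        self._pending.clear()
        return released
```

With a lookahead of one, putting segment 0 releases nothing, and segment 0 is released only when segment 1 is also present, so the user is never handed a segment without the next one already waiting.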
The request from the telephony user may be an audio request. The voice application is operable for converting the audio request into a text request in order to retrieve a text document related to the information requested by the telephony user. Similarly, the request from the telephony user may be a dual tone multi-frequency request. The voice application is operable for converting the dual tone multi-frequency request into a text request in order to retrieve a text document related to the information requested by the telephony user.
Further, in carrying out the above objects and other objects, the present invention provides a communication method for communicating information from a text data source having a plurality of text documents to a telephony user in response to a request for the information from the telephony user. The method includes receiving a request from the telephony user for information. A text document related to the information requested by the telephony user is then retrieved. The text document is then divided into text document segments and a sequence number is associated with each text document segment. The text document segments and the corresponding sequence numbers are then placed in a sequential order within a queue. Respective text document segments and the corresponding sequence numbers are then transferred from the queue in the sequential order to respective TTS engines. Respective text document segments are then converted into speech segments using one TTS engine for each respective text document segment. The speech segments and the corresponding sequence numbers from the TTS engines are then stored in a buffer. The stored speech segments are then reassembled in the proper order in the buffer using the corresponding sequence numbers. The speech segments are then delivered in the proper order from the buffer to the telephony user in order to satisfy the request for information from the telephony user.
The method may further include determining the rate at which speech segments are delivered to the telephony user from the buffer. The text document is divided into text document segments as a function of the rate at which speech segments are delivered to the telephony user such that the speech segments are delivered from the buffer to the telephony user continuously.
The method may also include determining the load of each of the TTS engines, wherein transferring includes transferring the respective text document segments to the respective TTS engines as a function of the load of the TTS engines.
The advantages of the present invention are numerous. The present invention efficiently processes text documents of any size and begins playing the synthesized speech to the telephony user immediately. The present invention provides an immediate response to the telephony user and, in cases where the telephony user terminates the session by skipping to another text document request or by hanging up the telephone in the middle of a TTS conversion, the present invention intelligently terminates the conversion process of the TTS engines, thus conserving processing resources that would otherwise be wasted. The present invention also provides an efficient means by which audio buffers are given to the telephony user at a rate that allows continuous playing of an audio stream without overloading the voice application with unnecessary buffers which the voice application would need to manage, or would never use if the telephony user terminates the session.
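The early termination advantage can be illustrated with a simplified, single-threaded sketch. The function stream_speech, the synthesize callable, and the hung_up flag are hypothetical, and a real engine farm would need to signal each engine separately; the point shown is only that no further segments are converted once the session has ended.

```python
import threading


def stream_speech(segments, synthesize, hung_up):
    """Yield (seq, audio) per segment, stopping once the caller has hung up."""
    for seq, text in segments:
        if hung_up.is_set():
            return  # skip the remaining conversions entirely
        yield seq, synthesize(text)
```

A voice application would set the hung_up event when the telephony user hangs up or skips to another document, after which the generator stops without converting the segments that remain.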
The above objects and other objects, features, and advantages of the present invention are readily apparent from the following detailed description of the best mode for carrying out the present invention when taken in connection with the accompanying drawings.