1. Field of the Invention
This invention relates to a method in which a user's voice data is stored on a server and used by Instant Messaging clients to read text messages aloud using text-to-speech synthesis.
2. Background of the Invention
Text-to-Speech Synthesis.
Traditional text-to-speech (“TTS”) synthesis can be divided into two main phases: high-level and low-level synthesis. High-level synthesis takes into account words and the grammatical usage of those words (e.g. beginnings or endings of phrases, punctuation such as periods or question marks, etc.). Typically, text analysis is performed so the input text can be transcribed into a phonetic or some other linguistic representation, and this phonetic information is then used to generate the speech waveforms.
During high-level TTS processing, a text string to be spoken is analyzed to break it into words. The words are then broken into smaller units of spoken sound referred to as “phonemes”. Generally speaking, a phoneme is a basic, theoretical unit of sound that can distinguish words. Words are then defined or configured as collections of phonemes. Then, during low-level TTS, data is generated (or retrieved) for each phoneme, words are assembled, and phrases are completed.
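The high-level steps just described (text string to words, words to phonemes) can be illustrated with a minimal sketch. The lexicon contents, the phoneme symbols, and the function names below are hypothetical and chosen only for illustration; a practical system would rely on a full pronunciation dictionary and letter-to-sound rules.

```python
import re

# Hypothetical toy pronunciation lexicon (assumption, not from the text above).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def tokenize(text):
    """Break a text string into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def to_phonemes(text):
    """Pair each word with its phoneme list; unknown words are paired with None."""
    return [(word, LEXICON.get(word)) for word in tokenize(text)]
```

In this sketch a word missing from the lexicon is returned with None rather than guessed at, so the low-level stage can decide how to handle it.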
Low-level synthesis actually generates data which can be converted into analog form using appropriate circuitry (e.g. sound card, D/A converter, etc.) to audible speech. There are three general methods for low-level TTS synthesis: (a) formant, (b) concatenative, and (c) articulatory synthesis.
Formant synthesis, also known as terminal analogy, models only the sound source and the formant frequencies. It does not use any human speech sample, but instead employs an acoustic model to create the synthesized speech output. Voicing, noise levels, and fundamental frequency are some of the parameters varied over time to create a waveform of artificial speech.
Because formant synthesis generates more robotic-sounding speech, it does not have the naturalness of a real human's speech. One of the advantages of formant-synthesized speech is its intelligibility. It can avoid the acoustic glitches that often hinder concatenative systems, even at high speeds. In addition, because formant-based systems have total control over their output speech, they can generate a variety of simulated emotions and voice tones.
Formant TTS synthesizing programs are smaller in size than concatenative systems, because they do not require a database of speech samples. Therefore, they can be used in situations where processor power and memory space are scarce.
The articulatory TTS synthesis approach models the human speech production directly, but without use of any actual recorded voice samples. Articulatory synthesis attempts to mathematically model the human vocal tract, and the articulation process occurring there. For these reasons, articulatory synthesis is often viewed as a more complex version of formant TTS synthesis.
Concatenative synthesis involves combining or “concatenating” a series of short, pre-recorded human voice samples to reproduce words, phrases and sentences with more human-like qualities. This method yields the most natural sounding synthesized speech. However, because of natural variation among the recorded samples, audible glitches (e.g. clicks, pops, etc.) sometimes plague the output waveforms, which reduces naturalness. To speak a large vocabulary or dictionary, a concatenative TTS system also must have considerable data storage in order to hold all of the human voice samples. There are three subtypes of concatenative synthesis: unit selection, diphone, and domain-specific synthesis. All subtypes use pre-recorded speech units to create complete utterances, each according to its own methodology.
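The concatenation step can be sketched as follows, assuming a hypothetical database that maps unit names to short lists of sample amplitudes. The optional linear cross-fade at each joint is one common way to soften clicks between units; it is an illustrative assumption and is not a technique taken from the description above.

```python
# Hypothetical sample database: unit name -> recorded amplitudes (assumption).
SAMPLES = {
    "HH": [0.0, 0.2, 0.4],
    "AH": [0.5, 0.3, 0.1],
}

def concatenate(units, database, overlap=0):
    """Join the recorded clip for each unit, cross-fading up to
    `overlap` points where consecutive clips meet."""
    out = []
    for unit in units:
        clip = database[unit]
        # Never overlap more points than either side actually has.
        n = min(overlap, len(out), len(clip))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming clip
            out[start + i] = (1 - w) * out[start + i] + w * clip[i]
        out.extend(clip[n:])
    return out
```

With `overlap=0` the clips are simply placed end to end. Every unit to be spoken must already be present in the database, which is why storage grows with vocabulary.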
To summarize, formant or articulatory TTS systems require less software and storage space, but do not yield a human-like voice having the character of any particular, real person. Concatenative TTS systems yield a voice sounding somewhat like the person from whom the voice samples were taken, but these systems require considerably more storage space for the sample databases.
Text-Based Instant Messaging.
As technology advances, more people are using real-time messaging systems, such as America Online's (“AOL”) Instant Messenger (“AIM”)™, or International Business Machines' (“IBM”) SameTime™, as a way to communicate via their computer with one or more parties in a near real-time manner.
Both e-mail and instant messaging (“IM”) are generally text-based. In other words, they usually are used to send text-only messages, as their handling of graphics, movies, sound, etc., is either limited, inefficient, or unavailable, depending on the service or network being used.
Real-time messaging systems differ from electronic mail (“e-mail”) systems in that the messages are delivered immediately to the recipient, and if the recipient is not currently online, the message is not stored or queued for later delivery. With instant messaging, both (or all) users who are subscribers to the same service must be online at the same time in order to communicate, and the recipient(s) must also be willing to accept instant messages from the sender. An attempt to send a message to someone who is not online, or who is not willing to accept messages from a specific sender, will result in notification that the transmission cannot be completed.
Thus, even though IM is generally text-based like e-mail, its communication mechanism works more like a two-way radio or telephone than an e-mail system.
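The delivery behavior described above can be modeled with a minimal sketch. The class name, method names, and status strings are hypothetical and do not correspond to any particular IM service; the point is only that an undeliverable message produces a notification and is never queued.

```python
class InstantMessagingService:
    """Toy model of IM delivery: immediate delivery or refusal, no queuing."""

    def __init__(self):
        self.online = set()   # users currently logged in
        self.blocked = {}     # recipient -> set of senders they refuse
        self.delivered = []   # (sender, recipient, text) tuples

    def send(self, sender, recipient, text):
        if recipient not in self.online:
            # Unlike e-mail, nothing is stored for later delivery.
            return "cannot complete: recipient offline"
        if sender in self.blocked.get(recipient, set()):
            return "cannot complete: recipient not accepting messages"
        self.delivered.append((sender, recipient, text))
        return "delivered"
```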
There are very few provisions in IM to assist users who are visually impaired. Text size, color and background can be adjusted to some degree. Additionally, some IM clients running on specific platforms, such as an IBM-compatible personal computer running Windows, can activate a text-to-speech function which “speaks” text on the computer screen using a computer-like synthesized voice. This computer-like synthesized voice can be difficult to understand. Additionally, as the synthesized voice has the same tone and character for all text it reads, regardless of message author, the recipient of a message may find it difficult to determine who is sending IM messages to them.
Some new products have been introduced to enable sight-impaired people to communicate more effectively via IM. One such method is a completely client-based arrangement where the software allows the user to choose from several “stock” pre-recorded voices. The received text messages are audibly “read” to the receiver using one of these voices. The user hears the messages in the same voice and tone regardless of who originally sent the text messages. For example, if a user selects a male voice, that male voice will be used to read all messages, regardless of who authored the message, even if the author was female. Additionally, this type of sample-based (concatenative) TTS system requires storage space on the client device to hold the phoneme samples, which makes this system unattractive for low-cost, pervasive computing device use, such as personal digital assistants (“PDA”), smart phones, and the like.
Another approach offered currently in the market place is to couple a voice messaging system with an instant messaging system. If a message sender discovers that the intended recipient is not currently online, and thus cannot receive an IM message, the sender is given an opportunity to record a message in a voice mail system. The recorded voice message is then held for later retrieval by the intended recipient. This approach, however, doubles the effort required of the sender—first the sender must type a text message, then the sender must record a voice message. Additionally, this approach requires the intended recipient to use an interface besides the IM client—the recipient must somehow log into and retrieve a voice mail message.
Yet another attempt to address these issues has been to provide the client device of the IM message recipient with a capability to synthesize speech from IM message text with a user choice of assigning a particular “tone” of voice in the synthesizer based on the author of the message. This “tone” is not the tone or characteristic sound of the author, but instead is a computer-synthesized tone which can be used by the recipient to help differentiate between different authors of messages he or she receives.
Thus, current instant text messaging technology lacks the intelligibility needed to enable more effective communication for sight-impaired users. None of these methods truly solves the instant text messaging problem for the sight-impaired. Each of them exhibits one or more of the problems of requiring large amounts of code on the client device, requiring large amounts of sample storage on the client device, or failing to create speech which is similar in character and nature to that of a message sender or author.