The present invention relates generally to text-to-speech synthesis and more particularly to intelligent text-to-speech synthesis.
We receive a lot of information through hearing, especially when our visual attention is needed for other tasks, such as driving. Radio is a good source of audible documents, and some of us become quite dependent on it. According to one study, the average family in the United States has five radios. Though radio might have become indispensable, the programs offered by radio stations are not necessarily what we are currently interested in.
Read-out documents or audio-documents, for example, novels, are available on the market. However, such recordings seem to be available only for a specific market sector. For example, there do not seem to be audio-documents for information with a short lifetime, such as news, weather forecasts or results of sports events. Some information, e.g. stock quotes, is only valuable for a very short period of time, and it would make no sense to produce such audio-documents.
A large number of audio-documents can be produced by automatically translating text into speech output. General discussions of such text-to-speech synthesis systems can be found, for example, in the following publications:
1. Multilingual Text-to-Speech Synthesis, The Bell Labs Approach, written by Richard Sproat, and published by Kluwer Academic Publishers, in 1998.
2. IBM ViaVoice.
Such systems typically perform direct word-to-sound transformation. The speech output is usually not very natural, and the systems tend to make mistakes. This might be because such systems are not “aware” of what they are reading.
The way we read takes into account what we are reading. For example, if we are reading the topic sentence of a news report, we typically put some emphasis on it. But, since existing systems do not seem to have any clue as to the meaning of the text they are transforming, they tend to transform input texts at the same speed, tone and volume. That is one of the reasons why the speech outputs of existing systems are typically monotonous and boring.
The way we read should also take into account our listener. If our listener is visually impaired and we are describing an object, we should include more details about the object. Moreover, the way we speak should also consider the hardware a listener employs to hear. For example, if our message is heard in a noisy room, we should probably speak louder.
It should be apparent from the foregoing that there is still a need for an intelligent text-to-speech synthesizer that is, for example, sensitive to the content of the text, sensitive to the person hearing the text, or adaptive to the hardware the listener employs to hear the text.
The present invention provides methods and apparatus to synthesize speech from text intelligently. The present invention takes into account several important, but previously ignored, factors to improve the speech generated. The invented speech synthesizer can take into account the semantics of the input text. For example, if it is a man who should be speaking, a male voice will be used. The synthesizer can take into account the user profile of the person hearing the input text. The synthesizer can also be sensitive to the hardware the user employs to listen to the input text. Thus, the text-to-speech synthesizer is much more intelligent than those on the market.
There are a number of ways to implement the invention. In one embodiment, the synthesizer includes a transformer, a modifier, a text-to-speech software engine and speech hardware. The transformer analyzes the input text and transforms it into a formatted text. The modifier then modifies this formatted text to fit the requirements of the text-to-speech software engine, whose outputs are fed to the speech hardware to generate the output speech.
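By way of illustration only, the four-stage arrangement described above might be wired together as in the following minimal sketch. The class name `Synthesizer`, the method `speak` and the stand-in stage functions are assumptions made for this example and do not appear in the specification; in particular, the engine stage here merely encodes text as bytes and performs no actual synthesis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Synthesizer:
    """Hypothetical wiring of the four stages named in the text."""
    transformer: Callable[[str], str]   # input text -> formatted text
    modifier: Callable[[str], str]      # formatted text -> engine-specific text
    engine: Callable[[str], bytes]      # engine-specific text -> audio data
    hardware: Callable[[bytes], None]   # audio data -> audible speech

    def speak(self, input_text: str) -> None:
        formatted = self.transformer(input_text)
        engine_text = self.modifier(formatted)
        audio = self.engine(engine_text)
        self.hardware(audio)


# Stand-in stages (assumptions): each one is a placeholder for a real component.
played = []  # records what the "hardware" was asked to play
synth = Synthesizer(
    transformer=lambda text: f"<speech>{text}</speech>",
    modifier=lambda fmt: fmt.replace("<speech>", "").replace("</speech>", ""),
    engine=lambda text: text.encode("utf-8"),  # placeholder, not real synthesis
    hardware=played.append,
)
synth.speak("Hello")
```

Keeping each stage behind a plain callable reflects the separation described in the text: the formatted text produced by the transformer need not depend on which engine or hardware is plugged in downstream.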
The input text has a number of characteristics. It belongs to a class that has at least one specific pattern. For example, the pattern may be that the most important paragraphs of some types of articles are the first and the last, as in a newspaper.
The formatted text also has a number of characteristics. It can be independent of the text-to-speech software engine; for example, it can be written in Extensible Markup Language (XML).
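As one possible illustration, the newspaper pattern could be applied to produce an engine-independent XML formatted text as sketched below. The element name `paragraph`, the attribute `emphasis` and its values are hypothetical choices made for this example; the specification does not define a particular schema.

```python
import xml.etree.ElementTree as ET


def to_formatted_text(paragraphs):
    """Mark the first and last paragraphs as emphasized (newspaper pattern).

    The <document>/<paragraph> schema is an assumption for illustration.
    """
    root = ET.Element("document")
    last = len(paragraphs) - 1
    for index, text in enumerate(paragraphs):
        para = ET.SubElement(root, "paragraph")
        para.set("emphasis", "strong" if index in (0, last) else "none")
        para.text = text
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


formatted = to_formatted_text(
    ["Lead paragraph.", "Supporting detail.", "Closing paragraph."]
)
```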
In one embodiment, the generation of the formatted text is based on the semantics of at least one word of the text. The semantics can be determined by an author, that is, a human being. In another approach, the semantics is generated through mapping the words to a database. For example, if the word is the name of a company, then the database can bring in additional information about the company, such as its stock price at a specific time. In another approach, the semantics is generated through an inference machine. For example, if the words are “Mr. Clinton,” the inference machine, based on some pre-stored rules, will assume that the words refer to a male person. A male voice might then be used for the corresponding speech.
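The database and inference approaches might be sketched as follows. This is a simplified illustration under stated assumptions: the rule table `TITLE_RULES`, the in-memory `COMPANY_DB`, the company name and the stock price are all invented for the example, and a practical inference machine would likely use far richer rules.

```python
import re

# Hypothetical pre-stored rules: a title before a capitalized name implies gender.
TITLE_RULES = {"Mr.": "male", "Mrs.": "female", "Ms.": "female"}

# Hypothetical stand-in for the database; the entry is illustrative, not real data.
COMPANY_DB = {"Acme Corp": {"stock_price": "12.50"}}


def infer_gender(text):
    """Return the gender implied by the first matching title rule, or None."""
    for title, gender in TITLE_RULES.items():
        if re.search(r"\b" + re.escape(title) + r"\s+[A-Z]", text):
            return gender
    return None


def lookup_companies(text):
    """Return the database entries for every known company named in the text."""
    return {name: info for name, info in COMPANY_DB.items() if name in text}
```

For example, `infer_gender("Mr. Clinton spoke today")` yields `"male"`, which a transformer could record in the formatted text so that a male voice is selected.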
In another embodiment, the transformation to generate the formatted text is based on at least one characteristic of the user listening to the synthesized speech. In yet another embodiment, the transformation to generate the formatted text depends on at least one characteristic of the hardware the user employs to listen to the synthesized speech. The above embodiments can be mixed and matched. For example, the transformation can be based on semantics of at least one word of the text and one characteristic of the user listening to the synthesized speech.
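A minimal sketch of how characteristics of the user and of the listening hardware might both influence the transformation is given below. The classes `Listener` and `Hardware`, their fields and the returned setting names are assumptions for illustration, loosely following the visually impaired listener and noisy room examples given earlier.

```python
from dataclasses import dataclass


@dataclass
class Listener:
    """Hypothetical user profile."""
    visually_impaired: bool = False


@dataclass
class Hardware:
    """Hypothetical description of the listening setup."""
    noisy_environment: bool = False


def choose_settings(listener, hardware):
    """Combine user and hardware characteristics into transformation settings."""
    return {
        # More descriptive detail for a listener who cannot see the object.
        "detail": "high" if listener.visually_impaired else "normal",
        # Louder output when the speech is heard in a noisy setting.
        "volume": "loud" if hardware.noisy_environment else "normal",
    }


settings = choose_settings(Listener(visually_impaired=True), Hardware())
```

Because the two inputs are consulted independently, the embodiments can be mixed and matched simply by supplying either profile, both, or neither.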
Based on the above approaches, a number of characteristics of the speech output can be determined. These can include the volume, the pitch, the gender of the voice, the tone, the wait period between one word and the next, and other special emphasis on a word. This special emphasis can be some type of sound that is based on the semantic, but not the syntactic, meaning of the word. Examples of the sound made can be a deep sigh, a grunt or a gasp. These sound-based expressions can convey a lot of meaning. Just as a picture is worth a thousand words, appropriate sound or emphasis provides additional meaning that can be very fruitful in any communication process.
The formatted text can be further modified to fit the requirements of the text-to-speech software engine. In one embodiment, the modification is through tagging, where a tag is a command interpreted by the engine rather than a word pronounced by the engine. The modified text is then fed to the engine, whose output drives the speech hardware to generate the speech output.
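The tagging step might look like the following sketch, which converts an XML formatted text into text carrying engine command tags. The tag strings `[[emph on]]` and `[[emph off]]` are invented for an imaginary engine, and the `paragraph` element and `emphasis` attribute are likewise assumptions; a real engine defines its own command syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical command tags for an imaginary engine (not a real engine's syntax).
EMPHASIS_ON = "[[emph on]]"
EMPHASIS_OFF = "[[emph off]]"


def modify_for_engine(formatted_xml):
    """Replace engine-independent markup with engine-specific command tags."""
    root = ET.fromstring(formatted_xml)
    parts = []
    for para in root.iter("paragraph"):
        text = (para.text or "").strip()
        if para.get("emphasis") == "strong":
            # The engine interprets these tags as commands; it does not read them aloud.
            parts.append(f"{EMPHASIS_ON} {text} {EMPHASIS_OFF}")
        else:
            parts.append(text)
    return " ".join(parts)


sample = (
    '<document><paragraph emphasis="strong">Hi.</paragraph>'
    "<paragraph>Bye.</paragraph></document>"
)
engine_text = modify_for_engine(sample)
```

Only this modifier would need to change if a different engine were substituted, since the formatted text itself carries no engine-specific commands.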
Note that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Also, the features and advantages described in the specification are not all-inclusive. Other aspects and advantages of the present invention will become apparent to one of ordinary skill in the art, in view of the specification, which illustrates by way of example the principles of the invention.