1. Field of the Invention
The present invention relates to information processing apparatuses and text-to-speech methods, and in particular, relates to a technique for generating a text to be read aloud by an information processing apparatus including a function of reading text information aloud (a text-to-speech (TTS) engine).
2. Description of the Related Art
Recently, systems including a function of reading an input text aloud (a TTS engine) have been developed and are widely used in, for example, telephone answering services for cellular phones. For example, in personal mobile services, such systems are used as a voice service that reads aloud information such as electronic mail, news, and market trends in response to a phone call made by a user, even when no mobile terminal, computer, or the like is near the user.
Meanwhile, it is common to connect a device storing audio information to audio equipment and reproduce music on the basis of the audio information. Such audio information includes tune (song) data. Tune data includes, together with the digital data of a tune, tag data in which information such as the title and artist of the tune is described. On the basis of the tag data, it has become possible, for example, to display the title or the like of a tune that is being reproduced on a display screen, or to read the title or the like aloud using a TTS engine.
When text is read aloud using a TTS engine, the text input into the TTS engine is converted to speech signals exactly as written. Thus, it is necessary to input correct text into the TTS engine. Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2007-509377 discloses a technique for inputting correct text into a TTS engine, e.g., checking the spelling of a text to be input into the TTS engine and converting an ambiguous text to a correct text by asking a user about the ambiguous text.
The oral reading of tag information of digital audio such as a tune name and an artist name (tune information) can be heard using a text-to-speech conversion function, as described above. Regarding such tune information, a text to be converted to speech may be generated by replacing predetermined replacement symbols in a text template that is prepared in advance with the characters of tune information acquired from digital audio equipment.
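The replacement step described above can be illustrated with a minimal Python sketch. It assumes replacement symbols written in angle brackets, as in the example given later in this section; the function name `fill_template`, the dictionary keys, and the sample artist name are hypothetical and are not taken from any particular apparatus.

```python
import re


def fill_template(template: str, tune_info: dict) -> str:
    """Replace each <Name> symbol with the matching tune information.

    Assumption: symbols look like <Song> or <Artist>. A symbol with no
    matching entry in tune_info is replaced with an empty string.
    """
    def substitute(match):
        return tune_info.get(match.group(1), "")

    return re.sub(r"<(\w+)>", substitute, template)


# Hypothetical tune information acquired from digital audio equipment.
tune_info = {"Song": "Happy Song", "Artist": "Some Artist"}
print(fill_template("It is <Song> by <Artist>", tune_info))
# It is Happy Song by Some Artist
```

The resulting string is what would be passed to the TTS engine as the text to be read aloud.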
When all of the tune information used in a text template can be read aloud, the generated text includes no grammatical error, and thus an appropriate speech response is returned to a question.
However, tag information may include characters that cannot be handled by a TTS engine, for example, Chinese characters in a case where the TTS engine supports American English. Because such characters cannot be read aloud, the corresponding portion is generally set as a blank portion, and no speech is output for it. Even in this case, the portions of the text template other than the blank portion are converted to speech. As a result, an unnatural text is read aloud. For example, assume that the text template of a response to the question “What song is this?” is “It is <Song> by <Artist>”, where <Song> is replaced with a tune name and <Artist> is replaced with an artist name. In this case, when the tune name is “Happy Song” and the artist name does not exist, the response text is “It is Happy Song by,” which results in an unnatural speech output.
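The failure described above can be reproduced with a short Python sketch. It rests on stated assumptions: the engine is modeled as able to read only ASCII characters (a rough stand-in for an American English engine), unreadable characters are simply dropped, and the helper names `blank_unsupported` and `build_response` are hypothetical. The artist string 歌手 is a placeholder, not a real artist name.

```python
def blank_unsupported(text: str) -> str:
    """Drop characters the modeled engine cannot read.

    Assumption: only ASCII characters are readable.
    """
    return "".join(ch for ch in text if ch.isascii()).strip()


def build_response(song: str, artist: str) -> str:
    """Fill the fixed response template with the readable parts only."""
    template = "It is <Song> by <Artist>"
    filled = template.replace("<Song>", blank_unsupported(song))
    filled = filled.replace("<Artist>", blank_unsupported(artist))
    return filled.strip()


# The artist name consists only of unreadable characters, so it becomes
# blank while the fixed word "by" in the template is still kept.
print(build_response("Happy Song", "歌手"))
# It is Happy Song by
```

The same dangling "by" appears when the artist name is an empty string, since the fixed words of the template are kept regardless of whether the tune information is available.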