1. Field of the Invention
The present invention relates to a speech translation apparatus for receiving a spoken original language and outputting a spoken target language equivalent in meaning to the original language, and a speech translation method and program for use in the apparatus.
2. Description of the Related Art
In recent years, research into elemental technologies such as speech recognition, machine translation and speech synthesis has progressed, and speech translation systems, which combine these technologies to output speech in a target language upon receiving speech in an original language, are now being put into practical use.
In most speech translation systems, an original-language text obtained by applying speech recognition to input speech in the original language is converted into a target-language text equivalent in meaning, and speech in the target language is output using speech synthesis.
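As a rough illustration of this cascaded pipeline, the flow can be sketched as follows. The component functions here are hypothetical stubs, not part of any cited system; real implementations would call speech recognition, machine translation and speech synthesis engines.

```python
# Sketch of a cascaded speech translation pipeline
# (hypothetical stub components).

def recognize(audio):
    # Speech recognition: audio -> original-language text.
    # Stub: assume the audio object already carries its transcript.
    return audio["transcript"]

def translate(text):
    # Machine translation: original-language text -> target-language text.
    # Stub: word-by-word lookup in a toy dictionary.
    toy_dict = {"hello": "konnichiwa", "world": "sekai"}
    return " ".join(toy_dict.get(w, w) for w in text.lower().split())

def synthesize(text):
    # Speech synthesis: target-language text -> audio
    # (stubbed as a tagged string).
    return f"<audio:{text}>"

def speech_translate(audio):
    # The cascade: recognition, then translation, then synthesis.
    return synthesize(translate(recognize(audio)))
```

For example, `speech_translate({"transcript": "hello world"})` passes the transcript through the toy dictionary and returns a synthesized-audio placeholder for the translated text.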
In the above speech recognition, a text as a recognition result is generated mainly from the features of the phonemes contained in the input speech. However, speech also contains prosody information, such as accents and intonation, which not only imposes constraints on language information concerning accentuation and structure, but also expresses information other than language (para-language or phatic-language information), such as the feelings or intent of the speaker. Para-language information enables enriched communication between speakers, although it does not appear in the text as the recognition result.
To realize more natural communication via speech translation systems, schemes have been proposed in which para-language information expressed by prosody is reflected in the output speech as the translation result. For instance, a scheme has been proposed in which the machine translation unit and the speech synthesis unit request the speech recognition unit to supply prosody information when necessary (see, for example, JP-2001-117922 (KOKAI)).
Suppose here that the English speech “Taro stopped smoking <emph>surely</emph>” (the portion between the tags <emph> and </emph> is emphasized) is input to, for example, an English/Japanese speech translation system, with “surely” emphasized, i.e., pronounced with a greater volume or more slowly. In this case, the above-mentioned existing schemes enable the English/Japanese speech translation system to output a Japanese translation result with the Japanese word group corresponding to “surely” emphasized, pronounced, for example, with a greater volume.
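The handling of such emphasis tags can be sketched as follows. This is a minimal sketch assuming a hypothetical word-alignment table that maps each source word to target-word indices; real machine translation systems would derive such an alignment internally.

```python
# Sketch: carrying an <emph> mark from source text to target text
# via a word alignment table (alignment is a hypothetical input).

import re

def extract_emphasis(tagged_text):
    # Split a tagged source sentence into (plain_text, emphasized_words).
    emphasized = re.findall(r"<emph>(.*?)</emph>", tagged_text)
    plain = re.sub(r"</?emph>", "", tagged_text)
    return plain, emphasized

def mark_target(target_words, alignment, emphasized_source_words):
    # alignment: source word -> list of target-word indices.
    marked = set()
    for src in emphasized_source_words:
        marked.update(alignment.get(src, []))
    # Re-wrap the aligned target words in <emph> tags.
    return ["<emph>%s</emph>" % w if i in marked else w
            for i, w in enumerate(target_words)]
```

For example, `extract_emphasis("Taro stopped smoking <emph>surely</emph>")` yields the plain sentence together with `["surely"]`, and `mark_target` then tags whichever target words the alignment associates with “surely.”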
However, when a conventional speech synthesis scheme is used, natural and appropriate emphasis of a to-be-emphasized portion cannot always be realized. For instance, in the synthesis-target Japanese sentence, the Japanese word pronounced “pittari” has its accent nucleus on “pi,” and hence it is natural to speak this word with a higher pitch. Since this word is already spoken with a higher pitch in natural speech, even if the adjacent Japanese word to be emphasized is spoken with a higher pitch, it will not be conspicuous. Conversely, if the volume or pitch of the to-be-emphasized word is changed greatly, natural speech cannot be realized.
Namely, the prosody of a sentence is produced based on both accents and intonation, and the prosody pattern to be produced for an emphasized portion is modified by the prosody patterns of the words around it.
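The interaction described above can be illustrated with a toy numeric sketch. The pitch values and the conspicuousness criterion here are hypothetical simplifications, intended only to show that a fixed pitch boost fails when a neighboring word already carries a high pitch accent.

```python
# Toy illustration: a fixed pitch boost on an emphasized word may not
# make it stand out if a neighbor is already high-pitched.
# Values are hypothetical relative pitch (F0) levels, one per word.

def boost(pitches, index, amount=1.0):
    # Apply a fixed pitch boost to the word at the given index.
    out = list(pitches)
    out[index] += amount
    return out

def is_conspicuous(pitches, index):
    # Simplified criterion: the word stands out only if its pitch
    # exceeds that of every other word in the sentence.
    others = pitches[:index] + pitches[index + 1:]
    return pitches[index] > max(others)
```

With a flat context `[2.0, 2.0, 2.0]`, boosting the last word makes it conspicuous; with a high-pitched neighbor, as in `[2.0, 3.0, 2.0]`, the same boost merely matches the neighbor and the emphasized word does not stand out.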
Further, JP-2001-117922 (KOKAI) mentioned above discloses examples of translation rules annotated with prosody information, so as to make prosody information on the original language correspond to prosody information on the target language. As described above, to always produce a translation that enables the speech synthesis unit to produce appropriate and natural prosody, it is necessary to consider the influence of, for example, the surrounding words or the syntax structure. However, it is difficult to write translation rules covering all these factors. Further, writers of translation rules must be familiar with the prosody production patterns employed in the speech synthesis unit.
In summary, the above-described conventional schemes have the following problems:
1. For some texts, even known prosody production schemes that consider to-be-emphasized portions cannot produce a translation in which only the to-be-emphasized portions are emphasized appropriately and naturally.
2. In machine translation, it is difficult to establish translation rules for outputting translation results that enable natural prosody to be produced by the subsequent prosody production process.
3. In machine translation, if the target-language text as a translation result is converted into an emphasizing syntactic construction using para-language information on the original language, the emphasized portion can be conveyed. With this method, however, the equivalence in meaning between the original language and the target language may well be degraded. Accordingly, it is more natural that emphasis information contained in the prosody of the input speech be expressed in the prosody of the target-language speech.