A speech signal carries a wealth of paralinguistic information in addition to the linguistic message. Such information may include, for example, the gender and age of the speaker, the dialect or accent, emotions related to the spoken utterance or conversation, and intonation, which may indicate intent, such as a question, command, statement, or request for confirmation. Moreover, the linguistic message itself carries information beyond the meaning of the words it contains. For example, a sequence of words may reflect the educational background of the speaker. In some situations, the words can reveal whether a speaker is cooperative on a certain subject. In human-to-human communication, this information is used to augment the linguistic message and guide the course of the conversation toward a certain goal, which may not be reachable by relying on the words alone. In addition to the speech signal, human-to-human communication is often guided by visual information, such as facial expressions and other simple visual cues.
Modern speech-to-speech translation systems aim to break down the language barriers between people. Ultimately, these systems should allow two people who do not speak a common language to converse in the same manner as people who do share a language.
In some languages, statements and questions differ only in intonation, and not in the choice of words. When translating such sentences into these languages, it is important to notify the user as to whether these sentences are questions or statements. Current systems are not able to provide this function, so users can only make a best guess, which can lead to gross miscommunication.
In many cultures, spoken expressions are heavily influenced by the identities of the speaker and listener and the relationship between them. For example, gender plays a large role in the choice of words in many languages, and ignoring gender differences in speech-to-speech translation can result in awkward consequences. Furthermore, in many cultures, whether one is speaking to a teacher, an elder, or a close friend can greatly influence the manner of speech, and thus whether the translation should be in a respectful or familiar form.
However, state-of-the-art implementations of speech-to-speech translation systems do not use the paralinguistic information in the speech signal. This serious limitation may cause misunderstanding in many situations. In addition, it can degrade the performance of the system, which must model a large space of possible translations irrespective of the appropriate context. Paralinguistic information can be used to provide an appropriate context for the conversation and hence to improve system performance by focusing on the relevant parts of a potentially huge search space.