This invention relates to speech or voice translation systems. More particularly, this invention relates to style control in the natural language generation of a spoken language translation system.
Speech is the predominant mode of human communication because it is very efficient and convenient. Certainly, written language is very important, and much of the knowledge that is passed from generation to generation is in written form, but speech is a preferred mode for everyday interaction. Consequently, spoken language is typically the most natural, most efficient, and most expressive means of communicating information, intentions, and wishes. Speakers of different languages, however, face a formidable problem in that they cannot effectively communicate in the face of their language barrier. This poses a real problem in today's world because of the ease and frequency of travel between countries. Furthermore, the global economy brings together business people of all nationalities in the execution of multinational business dealings, a forum requiring efficient and accurate communication. As a result, a need has developed for a machine-aided interpersonal communication system that accepts natural fluent speech input in one language and provides an accurate near real-time output comprising natural fluent speech in another language. This system would relieve users of the need to possess specialized linguistic or translational knowledge. Furthermore, there is a need for the machine-aided interpersonal communication system to be portable so that the user can easily transport it.
A typical language translation system functions by using natural language processing. Natural language processing is generally concerned with the attempt to recognize a large pattern or sentence by decomposing it into small subpatterns according to linguistic rules. Until recently, however, natural language processing systems have not been accurate or fast enough to support useful applications in the field of language translation, particularly in the field of spoken language translation.
While the same basic techniques for parsing, semantic interpretation, and contextual interpretation may be used for spoken or written language, there are some significant differences that affect system design. For instance, with spoken input the system has to deal with uncertainty. In written language the system knows exactly what words are to be processed. With spoken language it only has a guess at what was said. In addition, spoken language is structurally quite different from written language. In fact, sometimes a transcript of perfectly understandable speech is not comprehensible when read. Spoken language occurs a phrase at a time, and contains considerable intonational information that is not captured in written form. It also contains many repairs, in which the speaker corrects or rephrases something that was just said. In addition, spoken dialogue has a rich interaction of acknowledgment and confirmation that maintains the conversation, which does not appear in written forms.
The basic architecture of a typical spoken language translation or natural language processing system processes sounds produced by a speaker by converting them into digital form using an analog-to-digital converter. This signal is then processed to extract various features, such as the intensity of sound at different frequencies and the change in intensity over time. These features serve as the input to a speech recognition system, which generally uses Hidden Markov Model (HMM) techniques to identify the most likely sequence of words that could have produced the speech signal. The speech recognizer then outputs the most likely sequence of words to serve as input to a natural language processing system. When the natural language processing system needs to generate an utterance, it passes a sentence to a module that translates the words into phonemic sequence and determines an intonational contour, and then passes this information on to a speech synthesis system, which produces the spoken output.
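The sequence of stages described above can be pictured as a simple chain of functions. The following Python sketch is purely illustrative: the class name, the stage names, and the toy stand-in stages are hypothetical and are not part of any described system. In particular, the recognizer stub does not implement Hidden Markov Model decoding; it only marks where such a component would sit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpokenLanguagePipeline:
    """Hypothetical container for the stages named in the text."""
    digitize: Callable[[List[float]], List[int]]          # analog-to-digital conversion
    extract_features: Callable[[List[int]], List[float]]  # e.g. intensity per sample
    recognize: Callable[[List[float]], List[str]]         # most likely word sequence
    process: Callable[[List[str]], str]                   # natural language processing
    synthesize: Callable[[str], str]                      # phonemes and intonation to speech

    def run(self, analog_signal: List[float]) -> str:
        # Each stage consumes the output of the previous one, in order.
        samples = self.digitize(analog_signal)
        features = self.extract_features(samples)
        words = self.recognize(features)
        sentence = self.process(words)
        return self.synthesize(sentence)


def build_toy_pipeline() -> SpokenLanguagePipeline:
    """Wire up trivial stand-in stages so the data flow can be exercised."""
    return SpokenLanguagePipeline(
        digitize=lambda signal: [round(x * 100) for x in signal],
        extract_features=lambda samples: [float(abs(s)) for s in samples],
        # Stand-in only: a real recognizer would search for the most likely words.
        recognize=lambda feats: ["hello"] if sum(feats) > 0 else [],
        process=lambda words: " ".join(words).capitalize() + ".",
        synthesize=lambda sentence: f"<audio:{sentence}>",
    )
```

Calling `build_toy_pipeline().run([0.1, -0.2])` passes a two-sample signal through all five stages. The value of the sketch is only in showing that each stage has a single input and a single output, so any stage could in principle be replaced without changing the others.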
A natural language processing system uses considerable knowledge about the structure of the language, including what the words are, how words combine to form sentences, what the words mean, and how word meanings contribute to sentence meanings. However, linguistic behavior cannot be completely accounted for without also taking into account another aspect of what makes humans intelligent: their general world knowledge and their reasoning abilities.
For example, to answer questions or to participate in a conversation, a person not only must have knowledge about the structure of the language being used, but also must know about the world in general and the conversational setting in particular.
The different forms of knowledge relevant for natural language processing comprise phonetic and phonological knowledge, morphological knowledge, syntactic knowledge, semantic knowledge, and pragmatic knowledge. Phonetic and phonological knowledge concerns how words are related to the sounds that realize them. Morphological knowledge concerns how words are constructed from more basic units called morphemes. Syntactic knowledge concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Semantic knowledge concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used. Pragmatic knowledge concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
The typical natural language processor, however, has realized only limited success because it operates only within a narrow framework. A natural language processor receives an input sentence, lexically separates the words in the sentence, syntactically determines the types of words, semantically understands the words, pragmatically determines the type of response to generate, and generates the response. The natural language processor employs many types of knowledge and stores different types of knowledge in different knowledge structures that separate the knowledge into organized types. A typical natural language processor also uses very complex capabilities. The knowledge and capabilities of the typical natural language processor must be reduced in complexity and refined to make the natural language processor manageable and useful, because a natural language processor must produce more than a reasonably correct response to an input sentence.
Identified problems with previous approaches to natural language processing are numerous and involve many components of the typical speech translation system. Regarding the spoken language translation system, stylistic variations are an important part of the message that people convey to each other in verbal communication. It is thus crucial for the spoken language translation system to be able to recognize the mode of the input and to be able to transfer into the appropriate style or mode in the target language in order to achieve a high-quality translation as a communication aid. Since the way in which stylistic variations are encoded varies significantly from one language to another, it is very important to have a systematic way to encode such characteristics in order to generate high-quality output.
On the other hand, the main attraction of natural language interface or dialogue systems derives from the natural and flexible way in which they allow people to interact with machines. As the system becomes more powerful and sophisticated, the user naturally expects to see more character and style from the "agent" with which they are talking. Typical natural language interface or dialogue systems use "canned" expressions with some ability to substitute noun phrases in their output. For those systems that use rule-based generation components, the focus has typically been to generate grammatically correct structure with all the necessary information. Consequently, little attention has been paid to generating stylistically different natural language expressions.
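To make the limitation concrete, the "canned" approach described above can be sketched in a few lines of Python. The template table and the function name here are hypothetical and chosen only for illustration; they do not reproduce any particular prior system.

```python
# Hypothetical table of fixed ("canned") output expressions with noun-phrase slots.
CANNED_EXPRESSIONS = {
    "confirm_booking": "Your reservation for {item} is confirmed.",
    "ask_destination": "Where would you like to go from {place}?",
}


def canned_response(key: str, **noun_phrases: str) -> str:
    """Fill the noun-phrase slots of a fixed template.

    Only the slot contents vary; the wording, and therefore the style,
    of the surrounding sentence is always the same.
    """
    return CANNED_EXPRESSIONS[key].format(**noun_phrases)
```

For example, `canned_response("confirm_booking", item="two tickets")` always yields the same sentence frame regardless of who is being addressed, which is the lack of stylistic variation the passage points out.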
A method and an apparatus for style control in natural language recognition and generation are provided. An acoustic input is received comprising at least one source language. The acoustic input comprises words, sentences, and phrases in a natural spoken language. Source expressions are recognized in the source language. Style parameters are determined for the source expression. The style parameters may be extracted from the source expression, set by the user, or randomly selected by the natural language system. A recognized source expression is selected and confirmed by a user through a user interface. The recognized source expressions are translated from the source language to a target language. An acoustic output is generated from the translated target language source expressions using the style parameters. The style parameters comprise variations selected from a group comprising formality, local dialect, gender, and age variations.
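As a rough illustration of how style parameters of this kind might be represented and applied, consider the following Python sketch. All names, the set of formality values, and the precedence among the three sources of style parameters (user setting first, then values extracted from the source expression, then random selection) are assumptions made for the example; the text above states only that the parameters may come from any of these three sources.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical value set for the formality dimension.
FORMALITY_LEVELS = ("casual", "polite", "formal")


@dataclass(frozen=True)
class StyleParameters:
    """Hypothetical record of the four style dimensions named in the text."""
    formality: str
    dialect: str = "standard"
    gender: str = "unspecified"
    age: str = "unspecified"


def determine_style(
    extracted: Optional[StyleParameters] = None,
    user_set: Optional[StyleParameters] = None,
    rng: Optional[random.Random] = None,
) -> StyleParameters:
    """Choose style parameters from one of the three sources.

    The precedence order used here is an assumption of this sketch.
    """
    if user_set is not None:
        return user_set
    if extracted is not None:
        return extracted
    rng = rng or random.Random()
    return StyleParameters(formality=rng.choice(FORMALITY_LEVELS))


def render(variants: Dict[str, str], style: StyleParameters) -> str:
    """Pick the target-language variant that matches the formality level."""
    return variants[style.formality]


# Hypothetical target-language variants of a single translated expression.
THANKS_VARIANTS = {
    "casual": "Thanks.",
    "polite": "Thank you.",
    "formal": "I am very grateful to you.",
}
```

With these definitions, `render(THANKS_VARIANTS, determine_style(user_set=StyleParameters("formal")))` selects the formal variant. The sketch varies only formality; dialect, gender, and age would need analogous variant tables or generation rules.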
These and other features, aspects, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description and appended claims which follow.