Field of Use
The present invention relates generally to the field of natural languages and, more particularly, to the processing of language materials for use in automated systems, such as machine translation systems, speech recognition systems, and the like.
Applications for automated recognition and/or translation of natural languages abound. Well-known examples include speech or voice recognition and machine translation. Speech recognition, for example, is the automated processing of verbal input. This permits a person to converse with a machine (e.g., a computer system), thereby forgoing the need for laborious input devices such as keyboards.
Of particular interest to the present invention is machine translation (MT), which is the automated process of translating one natural language (source language) into another (target language), such as English translated into Chinese. Machine translation quite literally has global application. To conduct worldwide business, for example, international companies must acquire and process vast amounts of data which are often written in a foreign language. In turn, these companies must also communicate with overseas concerns, including foreign companies, governments, customers, and the like. Since translation performed by human translators is a time-consuming and expensive task, any automation of the process is highly desirable.
When translating natural languages, a system must process information which is not only vast, but often ambiguous or uncertain. A word in a given passage will often have a meaning which can only be discerned from its context. Consider, for example, the word "flies" in the phrases "fruit flies like a banana" and "time flies like an arrow." In the former, the word is a noun; in the latter, it is a verb. Thus the word "flies," when examined in isolation, is ambiguous since its meaning cannot be clearly discerned. Consequently, a system translating a passage of text must "disambiguate" each word, i.e., determine the best analysis for a word from a number of possible analyses, by examining the context in which the word appears. For a text passage of even modest size, this process demands substantial processing time and expense.
To improve the speed and accuracy of machine translation systems, several approaches have been generally adopted. For example, machine translation systems are routinely implemented in high-speed digital computers, which not only provide the necessary computational power for machine translation but also are widely available. Other efforts have focused on the actual methods employed in the translation process.
Natural language processing (NLP), for example, is an artificial intelligence method widely used in language processing systems, including MT systems, to improve the quality of the translation. The general methodology of natural language processing, which includes the steps of input, analysis, parsing, translation, and displaying/output, is set forth in further detail hereinbelow.
Fast and accurate parsing is crucial to the overall performance and quality of any MT system. In general, parsing is the stage where each sentence of the materials to be translated is parsed or broken down into simpler linguistic elements. In a simple parsing scheme, for example, the grammatical relevancy of every word within a given sentence is discerned by creating a tree-like diagram or syntax tree; every word is positioned according to its part of speech and its relationship to every other word in the sentence. To achieve a fast and accurate parse (and hence translation), the best (in a probability sense) syntax tree or output analyses having the best semantic interpretation should be rapidly attained.
It is known to improve the speed and quality of MT systems by refining the parsing process. Of particular interest to the present invention is "scored truncation" parsing, which employs score values for truncating or cropping unlikely paths on the basis of statistical probabilities. Since the size of the "search space" is decreased, parsing time is also reduced. The performance of a scored-truncation system, however, is closely tied to the scoring mechanism: what types of scores are generated and what method or search strategy is employed to generate them.
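The pruning idea can be illustrated with a minimal sketch. It assumes that each word has a small set of candidate analyses with context-independent probabilities, which is a simplification; the function name, the probability values, and the fixed truncation width are hypothetical and are not taken from any of the systems discussed here.

```python
import heapq

def scored_truncation_search(candidate_steps, beam_width):
    """Keep only the `beam_width` best-scoring partial paths after each word.

    candidate_steps: one list per word, each holding (label, probability) pairs.
    Returns the surviving (score, labels) paths, best first.
    """
    paths = [(1.0, ())]  # start with a single empty path of score 1.0
    for candidates in candidate_steps:
        # Extend every surviving path with every candidate analysis.
        extended = [
            (score * probability, labels + (label,))
            for score, labels in paths
            for label, probability in candidates
        ]
        # Truncate: discard all but the highest-scoring paths.
        paths = heapq.nlargest(beam_width, extended, key=lambda path: path[0])
    return paths

# Hypothetical candidate analyses for "time flies like" (values are made up).
candidate_steps = [
    [("noun", 0.9), ("verb", 0.1)],   # time
    [("verb", 0.6), ("noun", 0.4)],   # flies
    [("prep", 0.7), ("verb", 0.3)],   # like
]

best_paths = scored_truncation_search(candidate_steps, beam_width=2)
```

With a width of 2, at most four extended paths are scored per word, whereas an exhaustive search over these three words would score all eight complete paths.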
A score may be based on the frequency of rule usage in a set of grammar rules, i.e., a set of principles specifying the formation of syntactic (or semantic) constructions. Specifically, the statistical frequency of selected rules is employed during the analysis stage for determining the best syntactic output. Several sets of grammar rules are known, including PS Grammar, Generalized Phrase Structure Grammar (GPSG), and Lexical-Functional Grammar (LFG); see Sells, P., Lectures on Contemporary Syntactic Theories: An Introduction to Government-Binding Theory, Generalized Phrase Structure Grammar and Lexical-Functional Grammar, 1985.
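As a rough illustration of a rule-usage frequency score, the following sketch estimates a probability for each rule as its relative frequency among rules sharing the same left-hand side. That normalization is an assumption made for this example, as are the function name and the toy rule observations.

```python
from collections import Counter

def rule_probabilities(observed_rules):
    """Estimate each rule's probability as count(rule) / count(rules with same left-hand side)."""
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }

# Hypothetical rule usages collected from previously analyzed sentences.
observed_rules = [
    ("NP", ("DET", "N")),
    ("NP", ("DET", "N")),
    ("NP", ("DET", "ADJ", "N")),
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
]

probabilities = rule_probabilities(observed_rules)
```

Here the rule NP -> DET N accounts for two of the four observed NP expansions, so it receives a probability of 0.5.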
The rule-usage frequency method has distinct disadvantages, however. Since it is concerned only with local phenomena (i.e., without reference to context, style, mood, or the like), when applied in a global environment it often leads to inappropriate or even incorrect language processing (e.g., translation). Another problem with the rule-usage frequency method (and other scoring mechanisms based on rules) is the difficulty of normalizing syntactic structures of different topology. For example, a sentence may be analyzed into two syntax trees, one with more tree nodes than the other. Since a node corresponds to a phrase structure rule and is associated with a probability, whose value is always less than or equal to 1.0, the tree with more nodes will, in general, be associated with a lower probability (or score). In this case, the tree with more nodes will be disfavored simply because it has more nodes, not for any grammatical reason.
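The bias toward smaller trees can be seen in a short numerical sketch. It assumes that a tree's score is the product of the probabilities attached to its nodes; the function name and the probability values are hypothetical.

```python
import math

def tree_score(node_probabilities):
    """Score a syntax tree as the product of its per-node rule probabilities."""
    return math.prod(node_probabilities)

# Hypothetical trees for the same sentence.
small_tree = [0.8, 0.8, 0.8]      # 3 nodes, each rule fairly likely
large_tree = [0.85] * 5           # 5 nodes, each rule even more likely

small_score = tree_score(small_tree)   # 0.8 ** 3, about 0.512
large_score = tree_score(large_tree)   # 0.85 ** 5, about 0.444
```

Although every rule in the larger tree is individually more probable, the larger tree receives the lower score because it multiplies together more factors that are each below 1.0.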
Other methods and apparatus for translating natural languages are known. U.S. Pat. No. 4,502,128, for example, describes a handheld translation device. A method of operation of the device includes sectioning an input sentence into individual words; retrieving from a lexical word storage parts of speech corresponding to the individual words, thus describing the input sentence by a corresponding string of the parts of speech retrieved; transforming the string of the parts of speech of the input sentence into a corresponding string of the parts of speech for the second (target) natural language by using a translation pattern table (previously defined); and sequencing target words in accordance with sequential order of the parts of speech of the string pattern obtained after the transformation.
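The four steps described for this device can be sketched as follows. The lexicon, the translation pattern table, the English-to-Spanish example, and the rule for pairing words with reordered parts of speech are all hypothetical and chosen only for illustration; they are not drawn from the cited patent.

```python
from collections import defaultdict, deque

# Hypothetical lexical word storage: source word -> (part of speech, target word).
LEXICON = {
    "the": ("DET", "el"),
    "red": ("ADJ", "rojo"),
    "car": ("NOUN", "coche"),
}

# Hypothetical translation pattern table: source POS string -> target POS string.
PATTERN_TABLE = {
    ("DET", "ADJ", "NOUN"): ("DET", "NOUN", "ADJ"),
}

def translate(sentence):
    # Step 1: section the input sentence into individual words.
    words = sentence.lower().split()
    # Step 2: retrieve the part of speech (and target word) for each word.
    tagged = [LEXICON[word] for word in words]
    source_pattern = tuple(pos for pos, _ in tagged)
    # Step 3: transform the source POS string into the target POS string.
    target_pattern = PATTERN_TABLE[source_pattern]
    # Step 4: sequence target words in the order of the target POS string,
    # taking words with the same part of speech in their original order.
    by_pos = defaultdict(deque)
    for pos, target_word in tagged:
        by_pos[pos].append(target_word)
    return " ".join(by_pos[pos].popleft() for pos in target_pattern)

translated = translate("The red car")
```

Because the approach reorders words purely by part-of-speech pattern, a sentence whose pattern is absent from the table cannot be handled; in this sketch such a sentence raises a KeyError.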
U.S. Pat. No. 4,586,160 describes an improved method of syntax analysis for performing the syntactic analysis of an input sentence, even when the input sentence includes words unregistered in a dictionary. When a dictionary word corresponding to an input word is registered in advance in a dictionary section, a syntactic category belonging to the dictionary word is applied to the input word. When words unregistered in the dictionary section are included in the input sentence, the application of the syntactic category based on dictionary consultation is not possible. In this case, a syntactic category is assumed for the unregistered word from category data prepared in advance (i.e., assumptively applied), and the sentence is analyzed on that basis.
U.S. Pat. No. 4,635,199 describes a machine translation system where word units of a source language are translated into word units of a target language through "pivot words" of a pivot language. A pragmatic table stores pairs of pivot words and pragmatic data for each pivot word pair which defines a semantic relationship between the pivot words of the pivot word pair in different languages. During analysis, the pragmatic table is referenced to link the pivot words in pairs by relation symbols in compliance with the dominant and dependent pairs and source surface data.
U.S. Pat. No. 4,641,264 describes an automatic translation method which includes assigning parts of speech to words of an input text sentence by looking up a lexicon storage; segmenting the input text sentence, which is in the form of a string of parts of speech, into phrasal elements as minimum units having linguistic meaning to assigned parts of speech; converting the sequence of phrasal parts of speech into strings of syntactic roles to the respective phrasal elements and words; detecting patterns representing a sentence (or clause) from the sequence of syntactic roles, thereby transforming the input text sentence to a skeleton pattern represented by a combination of those patterns; and transforming the sequence of the simple sentence (or clause) to the output language.
U.S. Pat. No. 4,661,924 describes a translation system including multiple parts of speech disambiguating methods. A table containing parts of speech disambiguating rules for disambiguating a part of speech of a word which includes an array of parts of speech of successive words is included. On the basis of data read from the table, the parts of speech of words capable of functioning as multiple parts of speech are determined.
U.S. Pat. No. 4,706,212 describes a method for translation which includes scanning and comparing source words with dictionaries of source language words such that when a match is found the word under examination is stored with coded information derived from the dictionary used, where the coded information includes memory offset address linkages to a memory in the computer system where grammar and target language translations for the word are stored.
U.S. Pat. No. 4,787,038 describes a machine translation system which uses different display attributes to highlight the existence of multiple possible translations. Classes of other possibilities are displayed in accordance with a predetermined priority order.
U.S. Pat. No. 4,791,587 describes a machine translation system which displays selected senses first when the same word/phrase appears a second time; the system stores the selected senses.
U.S. Pat. No. 4,800,522 describes an interactive machine translation system where correct senses are selected and stored; the selected senses are given higher priority when they are translated a second time.
U.S. Pat. No. 4,864,501 describes a system for annotating digitally encoded text which includes lexical tagging for syntactic and inflectional features. For those words not retrieved from a dictionary, a morphological analyzer assigns tags. The analyzer then recognizes words formed by prefixation and suffixation, as well as proper nouns, ordinals, idiomatic expressions, and certain classes of character strings. The tagged words of a sentence are then processed to parse the sentence.
U.S. Pat. No. 4,864,502 describes a sentence analyzer for digitally encoded text material which annotates each word of the text with a tag and processes the annotated text to identify basic syntactic units such as noun phrase and verb phrase groups to identify nominal and predicate structures. From these, clause boundaries and clause types are identified. Heuristic "weights" are employed for disambiguation.
The disclosures of each of the foregoing references are hereby incorporated by reference.
While prior art methods, particularly those employing scoring techniques, have increased the performance of machine translation systems, all have distinct disadvantages which limit their usefulness. Particularly, prior art systems have ignored certain semantic, syntactic, and lexical information which allows a system to generate a fast, accurate translation. For example, prior art systems are often based on empirical or ad hoc assumptions, and are thus unable to be generalized.
What is needed are apparatus and methods which perform systematically and consistently across varying domains or fields. Such a system should be widely applicable to any language processing environment. Furthermore, the system should eliminate or truncate undesirable analyses as early as possible, thus decreasing the search space in which the system must seek a correct solution. In such a system, only the best one or two output analyses (or a few high-score candidates) are translated and passed to a post-editor for review. Thus, by eliminating ambiguous constructions that will eventually be discarded, the goals of rapid processing time and high quality output may be realized. The present invention fulfills this and other needs.