Conventionally, in a machine translation system, the original text to be translated is analyzed sentence by sentence by the analyzers of the system. The analysis steps, namely morphological analysis, syntax analysis (parsing) and semantic analysis, are executed sequentially. First, a sentence is divided into words through the morphological analysis, and the syntax analysis then determines how these words are related and arranged grammatically. At the step of syntax analysis, for example, using a top-down vertical search algorithm, the sentence is analyzed according to a context-free grammar and branched into a root, nodes and leaves until the terminals, which are the minimum units for parsing, are reached. As a result, a tree structure is derived for the sentence as a whole. In the semantic analysis, for example, the meaning of the subject is determined by referring to and collating the semantic attributes of nouns described in the dictionaries of the machine translation system, and the semantic attribute and structure of the subject are further determined by referring to and collating the information regarding the sentence structures that a detected predicate can form.
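As a minimal illustrative sketch of the top-down analysis described above, the following Python fragment expands a sentence according to a small context-free grammar until the words are reached and returns the resulting tree. The grammar, the lexicon and all function names are hypothetical and are not taken from any particular machine translation system; depth-first expansion with backtracking is assumed here as one reading of the vertical search.

```python
# Hypothetical toy grammar and lexicon, for illustration only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "DET", "countries": "N", "networks": "N", "link": "V"}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover tokens from pos."""
    if symbol not in GRAMMAR:
        # Pre-terminal: it must match the category of the current word.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    # Non-terminal: try each rule in order (depth-first, with backtracking).
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, pos, [])


def expand(symbol, rhs, tokens, pos, children):
    """Expand the remaining right-hand-side symbols from left to right."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])


def parse_sentence(sentence):
    """Return the first tree that covers the whole sentence, or None."""
    tokens = sentence.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("The countries link networks"))
```

The returned value is a nested tuple in which each element records only its position under a parent node, which is the vertical relation discussed in this section.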
Such a parsing algorithm is implemented as parsing rules based on the aforementioned analysis tree structure. In the tree structure, individual sentences are related to one another only vertically within the text, and individual morphemes are related to one another only vertically within the sentence. All of the relationships among these elements must be reduced to positions in a hierarchy. No relations other than the vertical relations can be extracted from the tree structure. As to the relationships among words and phrases in the sentence, the only information that can be extracted is that a conjunction positioned at a node indicates an anaphoric relationship with the previous or subsequent clause, or the positional relation among adverbs and adjectives in the phrase structure.
The text is a unity of syntax and meaning, and a stream of sentences and words. From the viewpoint of information theory, the text is a randomly varying, continuous information source. Nevertheless, in the aforementioned analysis algorithm the text is parsed into discrete and unrelated symbols, and thus the relationships in the text are output as fragmented information in the form of strings of symbols. In other words, the analysis algorithm processes the text as a discrete information source. Even if the aforementioned strings correspond to a Markov information source, the information is still discrete. Therefore, the relationship information is lost.
As mentioned above, the relationships within a text cannot be analyzed through the syntax analysis using the tree structure in the conventional machine translation system, and therefore no semantic or syntactic relationship information is available. In the conventional machine translation system, the context and the syntax are analyzed insufficiently, which decreases the precision of the translation.
For example, when an English sentence is analyzed in the conventional machine translation system, neither the information regarding the connection among clauses nor the information regarding correlative and subordinate conjunctions can be analyzed or extracted. Correlative conjunctions are used in pairs and define the structure and meaning of the connection. The correlative information is provided by a correlative pair of a preceding adverb and a correlative conjunction, e.g., such . . . that, so . . . that, so . . . as or the like, and by a pair of subordinating conjunctions, e.g., partly because . . . , partly because. Such correlative information cannot be analyzed through the conventional morphological analysis or through the syntax analysis. In the conventional machine translation system, the analysis of a sentence structure including such correlative words fails.
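To make concrete what is meant by correlative information, the following sketch scans a sentence for the correlative pairs named above and reports each pair whose second member follows the first. It is only a naive surface match written for illustration: the function name and the matching strategy are assumptions, it is not the analysis method of any system described here, and it would report false positives on real text.

```python
import re

# Pairs taken from the examples in the text; the matching strategy is assumed.
CORRELATIVE_PAIRS = [
    ("such", "that"),
    ("so", "that"),
    ("so", "as"),
    ("partly because", "partly because"),
]


def find_correlatives(sentence):
    """Return the (first, second) pairs whose members occur in that order."""
    text = sentence.lower()
    found = []
    for first, second in CORRELATIVE_PAIRS:
        first_match = re.search(r"\b" + re.escape(first) + r"\b", text)
        if first_match is None:
            continue
        # The second member must appear after the end of the first member.
        second_pattern = re.compile(r"\b" + re.escape(second) + r"\b")
        if second_pattern.search(text, first_match.end()):
            found.append((first, second))
    return found


print(find_correlatives("It was so cold that the lake froze."))
```

The point of the sketch is that each pair is a relation between two separated positions in the sentence, which a purely hierarchical representation does not record.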
The analysis using the tree structure cannot capture the fact that even the same word provides different information according to its position or priority in the word order of a sentence. For example, there is a difference in connotation between the word "however" positioned at the beginning of a sentence and the same word positioned immediately after the subject of the sentence. The adverb "however" positioned immediately after the subject indicates that the content of the sentence forms a contrast to the content of the previous sentence. In the conventional machine translation system, such correlative information, word position information or other relationship information cannot be extracted from the original sentence or reflected in the translated sentence.
The shortcomings of the conventional machine translation system are now explained with reference to the parsing tree of a sentence shown in FIGS. 1A and 1B. The parsing tree is composed of a dependent clause and a main clause. In FIGS. 1A and 1B, S denotes a sentence, ADP denotes an adverbial phrase, AD denotes an adverb, NP denotes a noun phrase, N denotes a noun, VP denotes a verbal phrase, V denotes a verb, PP denotes a prepositional phrase, P denotes a preposition, IA denotes a definite article, DA denotes a demonstrative adjective, CON denotes a conjunction, AJ denotes an adjective, and AUX denotes an auxiliary verb. When the sentence "The more all countries link their networks and develop their information infrastructure, the more we all will reap in terms of economic, educational, health care, and environmental benefits." is analyzed using the tree structure as shown in FIGS. 1A and 1B, the sentence is parsed into two clauses at the node of the first comma. Subsequently, each clause is divided into phrases, and the phrases are branched into individual discrete morphemes. In the process, the information on the relation between the clause starting with "The more" and the clause starting with "the more" is lost. Although the repetition of the comparative expresses the concurrence and synergistic effect of two affairs or situations, this meaning of the clauses is also lost. The meaning and syntax represented by the inseparable clauses are thus lost from the original sentence. For example, the sentence shown in FIGS. 1A and 1B is translated or transformed through the conventional machine translation system into "All countries more link their networks and develop their information infrastructure, and we all will reap more in terms of economic, educational, health care, and environmental benefits." In the translated sentence, no information regarding the correlation between the former clause and the latter clause is represented.
The translated sentence does not indicate that the development of the condition mentioned in the former clause increases the result mentioned in the latter clause. In the sentence resulting from the conventional machine translation system, "The" positioned at the beginning of the dependent clause and "the" positioned at the beginning of the main clause are both analyzed as definite articles, although the former "The" is actually a relative adverb meaning "by how much" and the latter "the" is actually a demonstrative adverb meaning "by so much". Such a mistake in analysis is made because no correlation information can be extracted.
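The correlation lost in the example of FIGS. 1A and 1B can be illustrated with a short sketch that treats the two clauses of a "the more . . . , the more" sentence as one linked unit rather than as two independent clauses. The regular expression and the function name are assumptions made for illustration; this is a naive surface pattern that would also match words such as "other", not the analysis method of any system described here.

```python
import re

# Assumed surface form of a comparative: "more", "less", or a word ending in "er".
COMPARATIVE = r"(?:more|less|\w+er)"
PATTERN = re.compile(
    rf"^\s*the\s+{COMPARATIVE}\b(.*?),\s*the\s+{COMPARATIVE}\b(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def link_comparative_clauses(sentence):
    """Return the two clauses as one linked record, or None if there is no match."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    return {
        "relation": "proportional",  # the result grows as the condition grows
        "condition": match.group(1).strip(),
        "result": match.group(2).strip(),
    }


SENTENCE = (
    "The more all countries link their networks and develop their "
    "information infrastructure, the more we all will reap in terms of "
    "economic, educational, health care, and environmental benefits."
)
print(link_comparative_clauses(SENTENCE))
```

In this record the leading "The" and "the" are consumed as parts of the correlative construction, so they are not available to be tagged as definite articles of the following noun phrases.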
Also in the conventional machine translation system, the relations among the individual words or morphemes within a sentence are analyzed by semantically analyzing the deep structure of the sentence. For example, the government-binding theory for analyzing such relationships, the text grammar for analyzing anaphora and cataphora, and the like have been proposed. Syntax analysis algorithms for complementing the tree structure analysis have also been proposed for use in various machine translation systems, for example, a bottom-up method, a bi-directional method, the LR method, the LL method, the Tomita method and others.
Since the sentence is analyzed as a tree structure in the aforementioned parsing algorithms, the information in the original text can be extracted only partially. Although the words forming a sentence have organic relations within the sentence, the abstract meaning is synthesized from the universal grammar through the semantic synthesis in the deep structure, irrespective of the concrete semantic relevance in the original text. The word "organic" in this specification means that the parts of a text work in collaboration and in coordination with each other, just as the organs of a living body do.
It has been suggested that, to process natural language mechanically, it must be considered that the syntax itself has its own meaning; in other words, that expression with language is a unity of syntax and meaning. The paper of the Information Processing Society of Japan titled "the Speaker's Cognition in Expression with Language and Multi-step Translation Method", authored by Messrs. Ikehara, Miyazaki, Shirai and Hayashi, Volume 28, No. 12, published in December 1987, describes that it is difficult to prevent the meaning from being lost in the element synthesis method, in which the entire meaning is synthesized from the partial meanings without considering the meaning of the syntax. Also, the book titled "Computational Linguistics: An Introduction", authored by Ralph Grishman and published by Cambridge University Press in 1986, states: "Yet the information conveyed by a text is clearly more than the sum of its parts--more than meanings of its individual sentences" (Chapter 4).
Furthermore, in the conventional machine translation system using the aforementioned analysis methods, the depth of analysis and the number of backtracking operations increase excessively, which decreases the speed of the syntax analysis. The calculation time increases exponentially with the length of the sentence. Disadvantageously, even though the calculation time is extended, no relationship information can be extracted and the analysis precision cannot be enhanced.
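The exponential growth mentioned above can be illustrated with a small sketch that counts the work done by an exhaustive backtracking search. It assumes a hypothetical, maximally ambiguous grammar of the form S -> S S | word, in which every split point of the sentence must be tried; the grammar and the function names are illustrative assumptions and do not describe any specific conventional system.

```python
def count_analyses(n, calls):
    """Count the binary trees over n words by exhaustive search.

    calls[0] tallies how many times the search is entered, as a stand-in
    for the calculation time of a backtracking parser.
    """
    calls[0] += 1
    if n == 1:
        return 1
    total = 0
    # Try every split point; each side is re-analyzed from scratch.
    for k in range(1, n):
        total += count_analyses(k, calls) * count_analyses(n - k, calls)
    return total


def measure(n):
    """Return (number of analyses, number of search calls) for n words."""
    calls = [0]
    analyses = count_analyses(n, calls)
    return analyses, calls[0]


for length in (2, 4, 8):
    print(length, measure(length))
```

Under these assumptions, measure(4) returns (5, 27) and measure(8) returns (429, 2187): the number of search calls triples with each additional word, while none of the resulting trees carries any relation other than the hierarchical one.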
In particular, the English specifications of patent applications contain long and complicated sentences, which cannot be syntax-analyzed in the conventional machine translation system. In preparation for machine translation, such sentences need to be manually divided and rewritten so that they are adapted for the machine translation. The text must be divided and edited to a level at which the syntax analysis through the machine translation system is feasible. Much labor and time are required for such preparation work, which hinders smooth, quick and high-volume translation work. Recently, even the preparation work has been mechanized in the machine translation system. However, in the mechanical preparation work, the relationship information of the original text can be extracted only insufficiently, which decreases the translation precision.