1. Field of the Invention
The present invention generally relates to methods and systems for natural language understanding (NLU), and more particularly to a method and system for relationship analysis for semantic disambiguation of natural language. The present invention can include one or more technologies described and referenced throughout the present specification in brackets and listed in the LIST OF REFERENCES, the entire disclosures of which are incorporated by reference herein.
2. Discussion of the Background
Current approaches to natural language understanding (NLU) involve statistical analyses to select meanings from a “world knowledge” database, to interpret the contextual meaning of messages. Natural language understanding can include the analysis of communication between a speaker and a listener, whether those individuals are communicating via literature, voice or another medium. The listener interprets the intentions of the speaker, picking the one meaning for each of the words and/or phrases that best matches the overall meaning of the message. Since people do this with apparent ease, the approach to computerizing NLU typically has been to mimic the human communications environment. Such an environment has been assumed to be based on the world knowledge of the listener, gleaned from a lifetime of experiences.
For more than 10 years, a major research effort has been undertaken to collect, categorize, and store this massive amount of often contradictory world knowledge information. However, the best analyses seem to rely on statistical methods, and nearly all NLU research in recent years has been to find the most successful statistical approach. But what if human communication does not rely on the analysis by the listener of the intentions of the speaker at all? What if the speaker, in the construction of the message, has (invisibly) inserted the actual single meaning of each word and/or phrase within the message? Then, the listener can extract such natural intelligence from the message and recognize the overall meaning from the collection of individual meanings. In other words, for sensible sentences the listener does not need world knowledge, and computers do not need such knowledge either.
Consider the following sentence:
They met at the bank.
Such a sentence is ambiguous and therefore cannot be understood. For NLU (either human or machine) to be successful, the sentence must be further explained, such as:
They met at the bank to withdraw money.
They met at the bank where the fishing was best.
They met at the bank of spotlights.
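The effect of the added words in the three sentences above can be sketched in code. The following is only a toy illustration, not the method of the present invention: the sense labels, the cue-word sets, and the function name `disambiguate` are all hypothetical, chosen to show how words already present in the message can narrow "bank" to a single meaning.

```python
# Toy sense inventory (hypothetical): each sense of "bank" is paired with
# cue words that, if present in the same message, point to that sense.
SENSES = {
    "bank": {
        "financial institution": {"withdraw", "money", "deposit", "loan"},
        "river edge": {"fishing", "river", "shore", "water"},
        "row of objects": {"spotlights", "switches", "lights"},
    }
}


def disambiguate(word, sentence):
    """Return the one sense supported by the sentence, or None if ambiguous."""
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    # Keep every sense that shares at least one cue word with the sentence.
    matches = [sense for sense, cues in SENSES[word].items() if tokens & cues]
    # Exactly one supported sense means the message is unambiguous.
    return matches[0] if len(matches) == 1 else None


for text in (
    "They met at the bank.",
    "They met at the bank to withdraw money.",
    "They met at the bank where the fishing was best.",
    "They met at the bank of spotlights.",
):
    print(text, "->", disambiguate("bank", text))
```

With this toy inventory, the bare sentence yields no sense at all, while each extended sentence selects exactly one.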
As these examples illustrate, a message is ambiguous unless each of the words and/or phrases has a single distinct meaning. That meaning, that natural intelligence, makes sense if the speaker constructed the message with distinct meanings for each word and/or phrase; otherwise the message would be ambiguous and neither human nor computer could understand it. The following Table lists the major approaches to NLU for which projects have been attempted and results have been published in peer-reviewed journals.
TABLE
Major Approaches to natural language understanding (NLU).

Approach                    Description
Case Grammar                Syntactical relationship of a noun, pronoun, or adjective to other words
Conceptual Dependency       Concept relationships between words
Dependency Analysis         Relationships making some words dependent on the meaning of others
Fulcrum Analysis            Recognition of grammatical patterns
Heuristic Parsing           Based on punctuation, prepositions, and conjunctions
Lexical Analysis            Based on the words or phrases and definable items in a vocabulary, irrespective of grammar
Logical Analysis            Uses common sense inference rules to understand what is meant
Morphological Analysis      Based on the smallest meaningful unit of a language
Number Language Processing  Transforms words into numeric strings for computer processing
Philosophical Analysis      Considers the thoughts behind the meaning, rather than the words themselves
Pivot Language              Creation of an artificial language in place of a natural language
Predictive Syntax           Makes predictions about the category of a word from earlier words
Preference Semantics        Includes procedures for natural language understanding
Principle-Based             Grammar is viewed as principles rather than rules
Semantic Analysis           Based on the meaning of words or phrases
Semantic Grammar            Groups of semantic factors are used to indicate syntactic elements
Statistical Analysis        Based on probabilistic analysis of relationships between words or phrases
Syntactic Analysis          Based on the grammatical relationships of words and phrases
Text Prediction             Anticipates what the following words mean based on past words
Transfer                    Systems using an intermediate language to describe the source language before final translation
Word Expert                 Each word is understood in context with the others
The topics most relevant to the present invention include conceptual dependency, dependency analysis, lexical analysis, number language, semantic analysis, semantic grammar, statistical analysis, transfer systems, and word experts. Many of these studies are combinations of the linguistic approaches shown in the above Table.
To appreciate the historical perspective of useful NLU theories and suggestions that were abandoned because it was too much trouble to convert them to the latest computers, this review describes projects in a time ordered sequence within the general linguistic approach. The linguistic approaches are semantics-based systems, category-based systems, interlingual systems, artificial intelligence systems, and statistical systems. The systems were all developed for machine translation (MT) because that was (and still is) the area of natural language understanding of interest to funding organizations.
Research using syntactic, semantic, and morphological rules known at the time was done at Georgetown University from 1952-1963. This project found that:
1. Pre- and post-editing were not necessary;
2. The main problem was linguistic analysis;
3. Semantic feature codes were needed in dictionaries;
4. Intermediate languages for multilingual systems seemed feasible.
These investigations resulted in the Georgetown Automatic Translation (GAT) system, capable of limited translation of French, Russian, and Chinese to English (Dostert, 1955 [8]; Zarechnak & Brown, 1961 [55]).
A replacement for the GAT system was developed by Latsec, Inc., in the mid-1960s. This bilingual system (Russian/English, French/English, Italian/English), named Systran, was used by the National Aeronautics and Space Administration (NASA) during the Apollo-Soyuz mission with Russia. The main components of Systran were two bilingual dictionaries (single word and multi-word) containing grammatical and semantic information. In addition, a high frequency dictionary, a limited semantics dictionary, a conditional limited semantics dictionary, and a main dictionary were referenced. Syntactic analysis required seven passes through the source language, as follows:
1. Resolution of homographs;
2. Identification of compound nouns;
3. Identification of phrase groups;
4. Recognition of primary syntactic relations;
5. Identification of coordinate structures within phrases;
6. Identification of subjects and predicates;
7. Recognition of prepositional structures.
Organization seemed to be the main problem with Systran. Information about either the source or target language lexicons or grammar was included in any mixture that seemed convenient. As a consequence, there was no uniformity, the methods were inconsistent, coverage and quality were uneven, and modifications of one section of the dictionary often had unexpected consequences in other parts of the system (Pigott, 1983 [35]). Also, the raw output of English to French was considered inadequate to provide detailed information to a French reader (Arthem, 1979 [3]). Nonetheless, Systran produced (and continues to produce) successful limited translations, and the system is still under development. For example, Xerox Corporation is using Systran to improve the clarity of their manuals (Hutchins, 1991 [17]), and the European Union is using it to aid their translation of documents (Reid, 2002 [37]).
Other work based on GAT was done at the Pan American Health Organization in the mid-1970s. Two working systems, ENGSPAN for translating English to Spanish, and SPANAM for translating Spanish to English, were developed. These systems used separate source language and target language dictionaries linked by lexical numbers and semantic markers (human, mass, etc.), and the mainframe computer system was integrated with a word processor. These systems do not deal with disambiguation beyond syntactic homographs, and post editing is essential (Vasconcellos, 1985 [47]).
Dependency analysis for Russian-to-English translations was investigated at the RAND Corporation from 1957-1960 (Hays, 1967 [15]). In this system, the relationships between words were determined by a series of iterations through the text. For example, in “He ate the green pepper” the relationship of “pepper” to “ate” was established after “green” to “pepper.” This project was limited by lack of a computational linguistic theory and never resulted in an operational system.
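The ordering in the "green pepper" example can be sketched as two passes over a tagged sentence. This is a minimal illustration only, not the RAND algorithm: the part-of-speech tags are assigned by hand, and the function name `attach` and its two attachment rules are assumptions made for the example.

```python
# Hand-assigned part-of-speech tags (an assumption for this sketch);
# a real system would obtain them from a dictionary lookup.
TAGGED = [
    ("He", "PRON"),
    ("ate", "VERB"),
    ("the", "DET"),
    ("green", "ADJ"),
    ("pepper", "NOUN"),
]


def attach(tagged):
    """Return (dependent, head) links in the order they are established."""
    links = []
    verb = next(word for word, tag in tagged if tag == "VERB")
    # Pass 1: determiners and adjectives attach to the next noun on their right.
    for i, (word, tag) in enumerate(tagged):
        if tag in ("DET", "ADJ"):
            head = next(w for w, t in tagged[i + 1:] if t == "NOUN")
            links.append((word, head))
    # Pass 2: nouns and pronouns attach to the verb.
    for word, tag in tagged:
        if tag in ("NOUN", "PRON"):
            links.append((word, verb))
    return links


links = attach(TAGGED)
print(links)
```

Because modifiers are resolved in the first pass, the link from "green" to "pepper" is recorded before the link from "pepper" to "ate", matching the order described above.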
Investigations into common syntagmatic structures (possession, etc.) for Hungarian-to-Russian translations were carried out at the Institute of Linguistics, Moscow from 1955-1967. This program chose Hungarian as the translating language due to the special difficulties it shared with English, German, and the Turkic languages. Algorithms for morphological analysis, dictionary searching, homograph resolution, and recognition of sentence structure were given special emphasis. An attempt was made to produce theoretical sets of interlingual semantic features (Papp, 1966 [34]).
The fulcrum method of analysis for Russian-to-English translation was investigated at the Ramo-Wooldridge Corporation from 1958-1967. This project was designed for linguistic studies emphasizing semantics. Researchers tried to solve problems that occur frequently, not those of greatest intellectual difficulty. The system started with a crude word-for-word translation, followed by a syntactic analysis from grammatical information provided by the dictionary. Multiple meanings of words were eliminated, and idioms were recognized. The approach was problem-solving rather than theory-oriented. The fulcrum parser, essentially a linguistic pattern recognition algorithm to identify grammatical patterns, required strict sequencing for the searches and was language-specific (Garvin, 1968 [11], 1980 [12]; Mersel, 1961 [29]).
An ambitious German-to-English (later interlingual) project was begun at the University of Texas Linguistics Research Center (LRC) in 1958. This project was designed as a typical transfer system, in which the source language is analyzed, transferred to an intermediate form, and synthesized into the target language. The LRC model emphasized semantic translation, establishing bi-directional phrase-structured analyses of the lexical senses of the source and target languages.
A fully automatic interlingual system was attempted after 1970, based on the universal base hypothesis in which the surface structure of any language can be related to a universal base. The ultimate goal was a system which could recognize and produce synonymous sentences by deriving canonical form (i.e. semantic interlingual) representations from sentences and generating all surface realizations of such representations. The LRC could not overcome differences they found in world views reflected in the vocabularies and semantic relationships of languages.
After 1978 the various research projects were collected into the Mechanical Translation and Analysis of Languages (METAL) system for translating telecommunication and data processing texts. METAL is not fully automatic (post-editing is required) and is bilingual rather than interlingual. The LRC concluded that an MT interlingua or pivot language is probably impossible (Hutchins, 1986 [16]; White, 1985 [48]).
A German-to-English MT system with interlingual intentions was investigated at the Forschungsgruppe Linguistik und Maschinelle Sprachverarbeitung (LIMAS) in Bonn, West Germany from 1964-1976. The basic premise was that computer natural language processing, including MT, must be based on a language-independent semantic syntax, a communicative grammar, expressing content elements and their relations. A classification of content elements or semantic factors was developed (about 80 factors). Translation involved the comparison and matching of matrices of coded factors both between and within languages. In reality, the research became bogged down in the laborious establishment of a lexicon of semantic factors and the construction of factor matrices for English and German vocabulary (Lehmann & Stachowitz, 1972 [26]).
A rather successful transfer-based approach was pursued at the University of Montreal, Traduction Automatique de l'Universite de Montreal (TAUM) in the 1970s. Two systems were developed. TAUM-METEO was limited to English-to-French translations of public weather forecasts; TAUM-AVIATION was an English-to-French translation of aircraft maintenance manuals. METEO implemented a semantic grammar in which rules operated on semantic categories. The system was limited in scope and used a very restricted language subset, failing to translate only 20% of unedited reports, mostly because of human typing errors. Failures from non-recognition of syntactic patterns were very rare. METEO was the only MT system regularly producing translations that were not edited before being made available to the public. AVIATION was much more ambitious, with a larger range of language. Although initial results were promising, the project ran out of development time and was canceled (Isabelle & Bourbeau, 1984 [18]; Thouin, 1982 [44]).
Studies on the effects of context versus definition for vocabulary retention of English for speakers of other languages (ESL) students were conducted by Markham (1989 [28]). He found that context-embedded vocabulary exercises facilitate better long-term retention of words; however, knowledge of the definition of a word is important in the initial phase of vocabulary development. Contextual meaning is also the natural word-building method observed in reading.
Research using the predictive syntactic analyzer approach was done for a Russian-to-English translation system at the National Bureau of Standards (NBS) from 1959-1963 (Rhodes, 1961 [39]). In this project, grammatical, lexical, and physical predictions were made for words further in a sentence by using categories and the success of earlier predictions. Only some syntactic problems were studied; semantic difficulties were considered to be beyond MT. The investigators concluded that only sentence parsing is mechanizable.
A theoretical investigation of interlingual semantic analysis using a thesaurus approach was pursued at Cambridge University, England from 1956-1967. With the goal of producing good-quality fully automatic idiomatic translations, the Cambridge Language Research Unit (CLRU) developed a structured conceptual classification of vocabulary as the basis for an interlingua. Words were separated into lexical items (stems) and grammatical operators (e.g., endings or function words). The lexical items were accessed in the dictionary. The researchers concluded that a phrase-by-phrase translation might be a more natural approach than traditional sentence-by-sentence translation. This research was hampered by the lack of access to a computer and the difficulty with synonymy, polysemy, and the establishment of proper interlingual semantic components (Needham & Joyce, 1958 [31]). Tosh (1969 [45]) suggested using the categories already contained in Roget's Thesaurus to overcome these difficulties. Begun in 1805, Roget's Thesaurus has classified English words semantically into general classes and associated categories. Tosh pointed out that the various meanings of a word are given distinct numerical identifiers in the thesaurus that include considerably more detail than might be assumed at first glance. Using a thesaurus as a basis for MT, however, was never pursued.
The U.S.S.R. has pursued theoretical interlingual investigations at the First Moscow State Pedagogical Institute of Foreign Languages since 1957 and at the Leningrad University Experimental Laboratory of Machine Translation (ELMP) since 1958. The Moscow program involved semantic analyses in which relationships were devised from dictionary entries of words formed as combinations of elementary semantic factors and relations. The emphasis was on problems of synonymy and paraphrase rather than homonymy, and on subtle semantic differences rather than crude lexical transfer (Zholkovskii, Leont'eva, & Martem'yanov, 1961 [56]).
The Leningrad program proposed an interlingua that was a complete artificial language, including morphology and syntax. Decisions about the inclusion of particular features were to be based on the averaging of phenomena of various languages with preference given to the major languages manifesting those features. Although the synthesis was for Russian only, theoretical studies were done for Russian, Chinese, Czech, German, Rumanian, Vietnamese, Serbo-Croatian, English, French, Spanish, Norwegian, Arabic, Hindustani, Japanese, Indonesian, Burmese, Turkish, and Swahili (Andreev, 1967 [2]; Papp, 1966 [34]). The strategy of using an artificial language to model natural language was bound to fail because natural languages, in contrast to artificial languages, are nondeterministic, ambiguous, and largely unrestricted (Su & Chang, 1990 [43]).
The Cambridge Language Research Unit studies on phrase-structured semantic grammars were continued at Stanford University from 1970-1974. In an artificial intelligence (AI)-oriented interlingual MT system, semantic frame templates based on triples of semantic features were used. The approach was purely semantic using common-sense inference rules. No syntactic structures (not even the boundaries of sentences) were considered. As a result, discourse analysis across sentence boundaries was a natural feature of the system (Wilks, 1972 [49], 1973 [50], 1975 [51]).
A more recent phrase-structured semantic MT system is under development for the Indian languages in the Dravidian family group (Tamil, Telugu, Kannada, Malayalam) and the Indo-European family group (Hindi, Punjabi, Gujarathi, and Bengali). This system is centered on using verbs to delimit sentence phrases and to build the representational structure. The meaning of the verbs is determined by using a frame template analysis but, unlike the CLRU system, syntactic analysis is also included (Raman & Alwar, 1990 [36]). As in other frame-based systems, until a large number of real world descriptions have been included in the knowledge base, the vocabulary is severely restricted.
The only MT project using a full-fledged interlingua as an intermediate language has been pursued at the Buro Voor Systeemontwikkeling, Utrecht, Holland since 1979. The intermediate language is Esperanto. MT processing involves a direct translation of the source language to Esperanto and a transfer from Esperanto to the target language. The system emphasizes technical material translations using artificial intelligence in a word processing environment with personal computers, but the lack of technical vocabulary in Esperanto has been a problem. A working system, named DLT (Distributed Language Translation), was developed using the computer language Prolog and tested for English-to-French translations. The long-term aim is a multilingual system for translation between European languages (French, German, English, Italian), with eventual extensions to other languages (Japanese, Chinese, Arabic) (Papegaaij, Sadler, & Witkam, 1986 [33]; Witkam, 1984 [52]).
Another interlingual MT system based on a word processing environment has been under development at Logos Corporation since 1982. A working product, the Logos Intelligent Translation System using a proprietary Semantic Abstraction Language, has been shown to translate over 20,000 words in 24 hours. Dynamic dictionary software asks questions concerning the syntactic and semantic properties of unknown words and ensures compatibility with the rest of the dictionary. Semantic information is categorized and put in a hierarchical tree structure, with source language and target language data separation. This system works best with highly specialized texts, generating less clear translations for general correspondence material (Hawes, 1985 [14]; Tschira, 1985 [46]).
The Commission of the European Communities also started an interlingual project, named EUROTRA, in 1982. It was conceived as a distributed system, with researchers in each of the member countries responsible for translating from their own languages into a common linguistic representation. A modest transfer component for each language pair was intended, but never realized. This project coordinates the work of about 150 researchers in 12 countries, and progress has been disappointingly slow. Work continues, however, as a basis for continued research and because no other project seems to be better (Hutchins, 1991 [17]).
An interlingual approach based on philosophical, rather than linguistic, foundations was considered at the University of Milan from 1959-1966. In this approach, the contents of thought were regarded as activities and not, as in traditional philosophy, as objects. Four fundamental operations were identified: differentiation, figuration, categorization, and correlation. The researchers contended that since traditional linguistics could not deal with discontinuous structures or with homography and polysemy, additional linguistic theory was needed before machine translation was possible. The new theory that resulted was an early version of conceptual dependency networks (Shank, 1975 [40]), in which correlation conditions and classifications were proposed. Unfortunately, nearly all correlations were open for certain words. The system only translated three small examples of Russian-to-English sentences. The philosophical foundations proposed could be interpreted as grammatical categories and classifications, and the correlational grammar was effectively just another version of phrase structure grammar (Albani, Ceccato, & Maretti, 1961 [1]; Ceccato, 1966 [6], 1967 [7]).
Artificial intelligence interlingual approaches have been investigated since 1973 at the Institut fur Angewandte Sprachwissenschaft, University of Heidelberg, West Germany; since 1975 at Kyoto University in Japan; and from 1984-1987 at the Centre for Computational Linguistics at the University of Manchester Institute of Science and Technology (UMIST), and at the University of Sheffield, England. Results from these efforts have formed the basis for many of the current commercial R&D projects.
In the Heidelberg project, the interlingual features are restricted to syntactic (based on logico-semantic foundations) and structural relations. A working system named SALAT (System for Automatic Language Analysis and Translation) has been developed using knowledge database and inference rule aspects of AI. The clear objective is to devise logical formulae both for the deep structure component of transformational grammar and for knowledge base representations, both using an interlingua (Hauenschild, Huckert, & Maier, 1979 [13]).
The relevant Kyoto research includes two different approaches. The first project is an experimental interactive English/Japanese system, written in LISP, using a logico-semantic interlingua based on Montague's semantic theory (Montague, 1974 [30]). The second approach is a learning MT system, with the system developing its own analysis based on the sentences presented to it (Hutchins, 1986 [16]).
The British effort intended to use the computer to help in translation, rather than as an independent translator. Based on the transfer approach with an interlingua for future expansion and written in Prolog, this system required human support for resolving ambiguities. Pre-editing, post-editing, and interactive assistance were used, in which the computer displayed alternative parses and requested the user to select the correct one (Wood, 1991 [53]).
The classical conceptual dependency theory was developed at Yale University from 1978-1982 as the foundation for an interlingual semantic-based artificial intelligence MT system. This theory asserts that human language understanding represents meaning in primitive semantic relationships (conceptual dependencies), expressing both explicit information and implied/inferred information. These relationships may be described with language-independent scripts which produce retellings rather than translations. In this theory, it is more important to convey the sense unambiguously than to preserve the structure and style of the original. A working system, named MOPTRANS (Memory Organization Packets-based Translator), was developed (Carbonell, Cullingford, & Gershman, 1981 [4]; Shank, 1975 [40]).
Conceptual dependency research was continued at the Georgia Institute of Technology beginning in 1982 under Richard Cullingford. Designed as an interlingual system, the first application of this work was for Ukrainian-to-English translations. The approach uses lexical entries containing information on case, gender, number, and semantic knowledge to predict and build representations. This system uses AI techniques with a refinement of case-frame parsers without the syntactic information, and is closely related to the word expert systems (Small, 1983 [41]).
Ishikawa, Izumida, Yoshio, Hoshiai, & Makinouchi (1987 [19]) are using a domain model, linguistic knowledge, and a database mapping scheme (collectively called a knowledge base) to semantically interpret queries. By continuously culling the possible areas of search, they try to avoid combinatorial explosion (a rapidly increasing number of possible combinations), the most common problem in semantic processors. An eventual goal for semantic processing systems is to make expert systems easier to use.
A more substantial knowledge-based effort to understand and translate sentences has been started by the Defense Advanced Research Projects Agency (DARPA) and involves the complementary expertise of three universities. New Mexico State University has two tasks, building vocabularies and parsing sentences; Carnegie-Mellon University is concentrating on the concept lexicons; and the University of Southern California is developing routines to translate an interlingua into a target language (Stone, 1991 [42]). The system, called Pangloss, is intended to produce flawless translations of documents as complex as newspaper articles from Spanish, German, and Japanese into English.
A practical attempt at word-for-word translation was pursued at the IBM Thomas J. Watson Research Center from 1958-1966. English was the target language, with Russian, French, and Chinese as source languages. The method of best equivalents based on probabilistic criteria with some backtracking was used to try to produce something readable. No attempt was made to attack hard linguistic problems. The system had difficulty with syntactic parsing and encountered considerable problems in semantics. The result was a translation with poor clarity that required extensive post-editing (Kay, 1973 [21]).
Attempts to predict the meaning of future words based on selected meanings of past words are current research efforts at the University of Montreal (Langlais, et al., 2000 [24]; Foster et al., 2002 [10]). Intended as a tool to speed translation by humans, the prototype system seems to have had the opposite effect. This is probably because system selections sometimes do not correspond with translator expectations, requiring additional work by them. Efforts at improving the statistical model continue. Additional work at the University of Illinois (Even-Zohar & Roth, 2000 [9]) has tried to provide a focus of attention mechanism to help the statistical prediction.
Statistical studies with parallel English and French texts are currently being undertaken by the Thomas J. Watson Research Center at IBM. Hundreds of millions of words from the Canadian Parliament's English and French proceedings are being placed in a computer database to find statistical relationships between words. New texts refer to this statistical knowledge to yield the most probable translation. This system uses no linguistic theory, but is reputed to be quite good within its domain (Hutchins, 1991 [17]; Stone, 1991 [42]).
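The idea of deriving word relationships from parallel text can be sketched with a very small example. The text above does not describe the statistical model IBM actually used, so the following is an assumption-laden toy: the three sentence pairs are invented, and the Dice association score and the function names `train` and `best_match` are choices made only for illustration.

```python
from collections import Counter

# Invented three-pair parallel corpus (English, French), for illustration only.
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("blue house", "maison bleue"),
]


def train(pairs):
    """Count how often each word occurs and how often word pairs co-occur."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Every source word co-occurs with every target word of the same pair.
        cooc.update((s, t) for s in s_words for t in t_words)
    return src_count, tgt_count, cooc


def best_match(word, model):
    """Pick the target word with the highest Dice association to `word`."""
    src_count, tgt_count, cooc = model

    def dice(t):
        return 2 * cooc[(word, t)] / (src_count[word] + tgt_count[t])

    return max(tgt_count, key=dice)


model = train(PARALLEL)
for w in ("house", "car", "blue", "the"):
    print(w, "->", best_match(w, model))
```

No grammar or dictionary is consulted: the pairing of "house" with "maison" emerges only from the counts, which is the sense in which such a system uses no linguistic theory.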
With similar intentions, DARPA established a Linguistic Data Consortium (now funded by the National Science Foundation) to collect raw text (naturally occurring text from a wide range of sources, 5 to 10 billion words), annotated text (syntactic and semantic labeling of some parts of raw text, upwards of 20 million words), raw speech (spontaneous speech from a variety of interactive tasks, 400 hours, 2000 speakers), read speech (1,000 hours, 10,000 speakers), annotated speech (phonetic and prosodic labeling, 20 hours), a lexicon (a computational dictionary of 200,000 entries plus a term bank containing geographical, individual, and organizational names, 200,000 to 300,000 entries), and a broad coverage computational grammar. All of these sources will be statistically analyzed for both natural language processing and MT (Joshi, 1991 [20]).
A study of how a statistical system performs when translating text far different from the sources used to collect vocabulary and to train it found a significant drop in performance due to unknown words (Langlais, 2002 [25]). The researchers plan to overcome this problem with non-statistical resources.
Classification systems have also been investigated to try to determine text content and to limit the statistical analyses (Even-Zohar & Roth, 2000 [9]; Rennie, 2003 [38]), but the classifications noted have focused on example problems. No comprehensive classification has been proposed.
Theoretical research into the stratification of grammar for Russian-to-English MT (and later Russian-to-Spanish and Chinese-to-English) was conducted at the University of California, Berkeley from 1958-1964. The theory posited a series of levels within which and between which linguistic units were related. The levels identified were phonemic, morphemic, lexemic, and sememic. Machine translation was visualized as a system of decoding and encoding through the levels (Lamb, 1961 [22]). The project concentrated on the lexical and semantic aspects of translation, the development of research tools, and maximally efficient routines for dictionary lookup. The major problem seemed to be the resolution of lexical ambiguities (Lamb, 1965 [23]).
Pivot language research projects have dominated efforts at the University of Grenoble since 1961. Artificial pivot languages were developed to avoid the morphological and syntactic problems of natural languages. The Centre d'Etudes pour la Traduction Automatique (CETA) system conjoined the lexical units of whichever two languages were being processed, with as many pivot languages as there were source/target language pairs. The main features were a transfer lexicon between languages, semantic analysis of dependency relations, and an interlingual syntax. The analysis methods were rather rigid, with only 42% of the sentences correctly translated and only 61% comprehensible (Hutchins, 1986 [16]).
The CETA system evolved into the Groupe d'Etudes pour la Traduction Automatique (GETA), a multilingual system in which linguistic data were separated from programming procedures to allow linguists to work with linguistic concepts rather than programming concepts. GETA was particularly strong in morphological and syntactic analysis and transformation, with good quality translations. Major weaknesses were the lack of semantic processing and the non-portable nature of the assembly language in which GETA was written.
Investigations into the minimum amount of subject matter understanding necessary to translate a text from Russian to Bulgarian were conducted at the Bulgarian Academy of Sciences from 1964-1976. The premise was that knowledge of how to select the appropriate target language expressions for a given source language text was sufficient. A large part of the research program was devoted to quantitative and statistical studies of Bulgarian, from which the necessary translation information consisting of the basic lexical information and additional contextual information necessary for interpretation was proposed (Ljudskanov, 1968 [27]).
A different approach has been pursued towards a bilingual English/Japanese system at Hitachi in Japan since 1975. Called the Heuristic Parsing Model, the method is based on a non-standard grammar in which detailed parsing is avoided in favor of elementary grammatical knowledge of language learners. English sentences are segmented into phrasal elements and clausal elements on the basis of punctuation, prepositions, and conjunctions. Syntactic pattern matching is used with little consideration for semantic issues. The Hitachi theory is that syntax-directed parsers are best for English, but semantics-based approaches are better for Japanese. A working system, named ATHENE (Automatic Translation of Hitachi from English into Nihongo with Editing Support), has been developed. Ambiguous English constructions and multiple meanings of words are not included, and the system requires post-editing (Nitta, 1982 [32]).
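The segmentation step described above, splitting on punctuation, prepositions, and conjunctions, can be sketched as follows. This is a rough illustration and not the Hitachi implementation: the word lists are short, hypothetical samples, and the function name `segment` and its splitting rule are assumptions.

```python
import re

# Short illustrative boundary lists (hypothetical); the lists actually used
# by the Heuristic Parsing Model are not given in the text.
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "for", "from", "by"}
CONJUNCTIONS = {"and", "or", "but", "because", "while"}
BOUNDARY_WORDS = PREPOSITIONS | CONJUNCTIONS


def segment(sentence):
    """Split a sentence at punctuation and before prepositions/conjunctions."""
    segments, current = [], []
    for token in re.findall(r"[A-Za-z']+|[,;:.]", sentence):
        if token in {",", ";", ":", "."}:
            # Punctuation closes the current segment and is dropped.
            if current:
                segments.append(" ".join(current))
            current = []
        elif token.lower() in BOUNDARY_WORDS and current:
            # A preposition or conjunction starts a new segment.
            segments.append(" ".join(current))
            current = [token]
        else:
            current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


print(segment("They met at the bank, and she waited for him."))
```

No word meanings are consulted at any point, which reflects the description of the method as syntactic pattern matching with little consideration for semantic issues.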
A recent focus has been to identify the correct meaning of specialty terms in languages (Zanger & Stertzbach, 1991). For example, the word chip generally refers to a piece of something, but when used as a chip shot in golf it conveys an entirely different meaning. A computerized dictionary for lexically ambiguous sport terms is under development at Bowling Green State University. While useful to explain the meanings of these words to foreign language learners, this dictionary would not be needed for a machine translation system based on synonym comparisons.
An effort to incorporate advances in speech recognition with MT has resulted in a continuous-speech translation system named Janus for English, German, and Japanese speakers. A collaboration between Siemens A. G., ATR (Kyoto, Japan), the University of Karlsruhe, and Carnegie Mellon's Center for Machine Translation has demonstrated a system with a 400-word vocabulary that helped speakers register for a 1991 conference. Operating on a standard workstation with a relatively slow 7-30 second response time, Janus is based on a neural network and is accurate even when the meaning and sounds of a sentence are not clear (Carlson, 1992 [5]).
However, the various approaches to natural language understanding, as described above, suffer from a range of problems and can involve complex analysis, often based on complex statistical models and relationships, which may be the reason why many such systems have yet to be commercially exploited.