I. Introduction
For several decades, researchers in various areas of computer science have attempted to develop methods to enable machines to understand the natural language spoken and written by human beings (e.g., English, Chinese, Arabic) in a scalable, automated fashion. While computers can perform specific tasks for which they've been programmed, the state of the art does not provide a method or system for automated general understanding of the meaning of words and phrases in context.
Many applications, including machine translation (or MT) of human languages, voice recognition technology, search, retrieval and text mining systems, and artificial intelligence applications, require automated understanding of natural language in order to be fully effective. The obvious benefits of such applications, if broadly enabled, have motivated universities, governments and corporations to invest many decades of time and collectively billions of dollars of capital looking for a method that would enable computers to process and understand written or spoken natural language. Given the significant effort in these fields without a breakthrough, many in the scientific community question whether true machine understanding of natural language is possible. Even most advocates of the idea that computers will one day be capable of wide-ranging human-type understanding see that time as still decades away.
II. State of the Art of Machine Translation
Most language translation to date is performed by skilled and expensive human translators. Automating the language translation process would have major economic benefits ranging from significant cost reduction of translation to enabling new time-sensitive translation applications like on-the-fly cross-language text or voice communications and multilingual daily news publications.
Machine translation devices and methods for automatically translating documents from one language to another are known in the art. However, these devices and methods often fail to translate sentences accurately and therefore require human beings to substantially edit their many errors before output translations can be used for most applications. Current state-of-the-art systems accurately resolve 60% to 80% of the words they translate among the Latin-based languages, but the percentage of publishable-quality sentences translated by these systems in a broad domain is typically less than 40%. The accuracy of existing machine translation systems for non-Latin-based languages is even lower. The only exceptions are narrowly customized special-purpose machine translation systems that do not generalize across application domains. Moreover, most commercially deployed machine translation systems require man-decades of development for each direction of each language pair.
Achieving accurate machine translation is more complicated than providing devices and methods that make word-for-word translations of documents. Because each word's meaning is highly dependent on the context it is found in, simple word-for-word translation of sentences results in wrong word choices, incorrect word order, and incoherent grammatical units.
To overcome these deficiencies, known translation devices have been designed to choose word translations within the context of a sentence based on a set of lexical, morphological, syntactic and semantic rules. These systems, which have been in development for over 40 years and are known in the art as “Rule-Based” machine translation (Rule-Based MT) systems, are flawed because there are so many exceptions to the rules that they cannot provide consistently accurate translation. The most prominent company providing machine translation based primarily on the Rule-Based method is Systran, which began the development of its machine translation engines in the 1960s. Rule sets are laboriously handcrafted and always incomplete, as it is extremely difficult if not impossible for human developers to encompass all the nuances of language in a finite set of rules.
In addition to Rule-Based MT, in the last two decades a new method for machine translation known as “Example-Based” machine translation (EBMT) has been developed. EBMT makes use of sentences (or possibly portions of sentences) stored in two different languages in a cross-language database. When a translation query in the Source Language matches a sentence in the database, the translation of the sentence in the Target Language is produced by the database, providing an accurate translation in the Target Language. If a portion of a translation query in the Source Language matches a portion of a sentence in the database, these devices attempt to accurately determine which portion of the Target sentence (that is mapped to the Source Language sentence) is the translation of the query. “Source” refers to the content in one language or state that is being translated into another language or state; “Target” refers to content in a language or state that the Source is being translated into.
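The exact-match and partial-match cases described above can be illustrated with a minimal sketch. It is not any particular EBMT system: the three-entry English-to-Spanish database, the function names, and the rule that a partial match means at least one shared pair of adjacent words are all assumptions made for illustration.

```python
# Hypothetical toy cross-language database (Source: English, Target: Spanish).
# Entries and function names are illustrative assumptions, not a real system.
TRANSLATION_DB = {
    "good night": "buenas noches",
    "thank you very much": "muchas gracias",
    "the house is big": "la casa es grande",
}


def normalize(sentence):
    """Lowercase, drop surrounding punctuation, and collapse whitespace."""
    return " ".join(sentence.lower().strip(" .?!").split())


def exact_match(source_sentence):
    """Return the stored Target sentence for an exact Source match, else None."""
    return TRANSLATION_DB.get(normalize(source_sentence))


def adjacent_pairs(sentence):
    """Return the set of adjacent word pairs in a normalized sentence."""
    words = sentence.split()
    return set(zip(words, words[1:]))


def partial_matches(source_sentence):
    """Return (Source, Target) entries sharing at least one adjacent word pair.

    This only finds candidate sentences. Deciding which portion of the
    Target sentence translates the shared portion is the hard step that
    the text identifies as unreliable, and it is not attempted here.
    """
    query = normalize(source_sentence)
    query_pairs = adjacent_pairs(query)
    return [
        (src, tgt)
        for src, tgt in TRANSLATION_DB.items()
        if src != query and query_pairs & adjacent_pairs(src)
    ]
```

A query such as "Good night!" is answered directly from the database, while "The house is old." has no exact entry and yields only the candidate pair for "the house is big", leaving the alignment of the matching portion unresolved.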
EBMT systems known in the art cannot provide accurate translation of a language broadly because the number of possible cross-language sentences is potentially infinite, so their databases will always be predominantly “incomplete.” And since EBMT systems do not reliably translate partial matches and sometimes incorrectly combine correctly translated portions, the accuracy of these systems is roughly comparable to that of the Rule-Based engines.
Another machine translation approach that is often used independently, as well as in conjunction with EBMT, is Statistical Machine Translation (SMT). SMT systems attempt to automate the translation process using pairs of translated documents in combination with a large corpus of documents in just the Target Language. Compared to Rule-Based MT, both EBMT and SMT significantly reduce the time to develop a translation engine for a pair of languages. The accuracy of SMT systems is comparable to Rule-Based MT and EBMT systems and is, therefore, not adequate for the production of translated documents in a broad domain.
SMT systems use what is known in the art as an “n-gram model” and are based on Shannon's “noisy channel model” for information transfer. These methods assume translation to be imperfect and, by design, produce the translation that the training corpora indicate has the highest probability of being correct. They take a “best guess” at the translation of each word based on the two, or at most three, other adjacent words in the Source and Target Languages. These methods gain diminishing marginal benefit from increases in the size of the cross-language and Target Language training corpora, and have continued to make only incremental improvements over the last several years. For example, researchers at the University of Southern California recently published the results of a test of their SMT system, one of the higher-quality SMT systems developed in recent years. After training on a domain-specific corpus (the Canadian Legislature proceedings), their system translated 40% of the test sentences correctly (AMTA 2002 Proceedings, October 2002).
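For reference, the noisy channel formulation mentioned above is commonly written as follows. The symbols are assumptions introduced here for illustration: s is a Source sentence, t is a candidate Target sentence with words t_1, ..., t_m, and n is the n-gram order (n = 3 for a trigram model).

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid s) \;=\; \arg\max_{t} \, P(s \mid t)\, P(t),
\qquad
P(t) \;\approx\; \prod_{i=1}^{m} P\!\left(t_i \mid t_{i-n+1}, \ldots, t_{i-1}\right)
```

In this reading, P(s | t) is estimated from the pairs of translated documents and P(t) from the Target Language corpus alone. The n-gram approximation is what restricts each word's context to a few adjacent words.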
Some translation devices combine Rule-Based MT, SMT and/or EBMT engines (called Multi-Engine Machine Translation or MEMT). Although these hybrid approaches may yield a higher rate of accuracy than any system alone, the results remain inadequate for use without significant human intervention and editing.
III. State of the Art of Statistical Natural Language Processing for Semantic Acquisition
The field of statistical natural language processing (NLP) includes the research and development of automated machine learning from text for various applications. One application of NLP is SMT for machine translation, as discussed above. Although various NLP methods attempt to extract the meaning from natural language, as a leading textbook on the subject makes clear, the state of the art is far from a solution: “The holy grail of lexical acquisition is the acquisition of meaning. There are many tasks (like text understanding and information retrieval) for which Statistical NLP could make a big difference if we could automatically acquire meaning. Unfortunately, how to represent meaning in a way that can be operationally used by an automatic system is a largely unsolved problem.” (Manning and Schutze, Foundations of Statistical Natural Language Processing, 5th printing, 2002, p 312).
There is a great need for organizations to better manage the knowledge they've captured in unstructured text such as word-processed documents, PDF files, email messages and the like. Although information previously assembled in databases can be searched and retrieved effectively, a practice referred to in the art as data mining, the broad mining of unstructured text (representing 80% or more of the world's data) to look for ideas and concepts is not currently possible using state-of-the-art systems. While Boolean and other keyword search methods find information using the words contained in the user's query, most ideas and concepts can be expressed in a large number of different ways, many of which will not exactly or even approximately contain a particular keyword or other search term. This means that many relevant documents that would be identified by a “concept-based” search (which is not limited to the query words the user provides) are missed when a keyword search is conducted.
For instance, if the word string “terms and conditions” is submitted in quotes (indicating the exact string) as part of a keyword search, the system will find references to “terms and conditions” but will not identify other words and word strings (a word string is two or more adjacent words in a specific order) or other abbreviations or representations expressing the same idea that may be of interest to the user, such as “conditions of use”, “restrictions”, “tos”, “terms of service”, and “rules and regulations”. The ability of a system to add close semantic equivalents to the search query when looking for relevant information would enhance the quality and efficiency of search in a variety of ways. Moreover, there are no comprehensive phrasal-level synonymy or near-synonymy dictionaries. They do not exist because there are too many two- and three-word terms to manually create synonym lists for each, let alone for all the terms that are longer than three words. Existing methods to automatically generate thesauri using patterns in text have had limited success in the broad semantic acquisition of natural language. The state-of-the-art methods for concept extraction using patterns of words that occur in text include similarity assessment methods such as vector space models using various measures. Some of these methods attempt to find synonymous or related words by identifying individual words as points of context.
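The difference between an exact-string keyword search and a search expanded with semantic equivalents can be sketched as follows. This is an illustration only: the four documents, the hand-written equivalents table, and the function names are hypothetical, and the table stands in for exactly the kind of phrasal synonymy resource that the text notes does not exist in comprehensive form.

```python
import re

# Hypothetical hand-written equivalents for one query (illustrative only).
EQUIVALENTS = {
    "terms and conditions": [
        "conditions of use",
        "restrictions",
        "tos",
        "terms of service",
        "rules and regulations",
    ],
}

# Hypothetical document collection.
DOCUMENTS = {
    "doc1": "Please read our terms and conditions before ordering.",
    "doc2": "By signing in you accept the Terms of Service.",
    "doc3": "Shipping is free on orders over fifty dollars.",
    "doc4": "The conditions of use are posted at the entrance.",
}


def contains_phrase(text, phrase):
    """Case-insensitive match of an exact word string on word boundaries."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def keyword_search(phrase):
    """Return ids of documents containing the exact query string."""
    return sorted(d for d, text in DOCUMENTS.items() if contains_phrase(text, phrase))


def expanded_search(phrase):
    """Return ids of documents containing the query or any listed equivalent."""
    phrases = [phrase] + EQUIVALENTS.get(phrase.lower(), [])
    return sorted(
        d for d, text in DOCUMENTS.items()
        if any(contains_phrase(text, p) for p in phrases)
    )
```

On this toy collection the exact-string search for "terms and conditions" finds only doc1, while the expanded search also retrieves doc2 and doc4, which express the same idea in different words.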
Some methods consider words that are at different distances from a query and focus on the proximity and frequency of co-occurrence of individual words in relation to the query. These methods include n-gram based methods (Martin, Ney: Algorithms for Bigram and Trigram Word Clustering, Speech Communication 24, pp. 19-37, 1998; Brown et al: Class-Based N-gram Models of Natural Language, Computational Linguistics, 18(4), pp. 467-479, 1992) and the Window-based Method (Brown et al). Other related work in this area includes Finch & Chater (1992, Bootstrapping Syntactic Categories Using Statistical Methods) and Schutze & Pedersen (1997, A Co-Occurrence-Based Thesaurus and Two Applications to Information Retrieval), among many others. While the contextual information has provided some results, the breadth and accuracy of the results achieved using these methods have been limited and, therefore, they've had limited practical application in commercial products for search and retrieval, content management, and knowledge management.
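A window-based co-occurrence count of the general kind described above can be sketched as follows. This is a generic illustration, not a reproduction of any of the cited methods: the window size, the toy sentence, and the use of cosine similarity as the vector space measure are assumptions.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, query, window=2):
    """Count words appearing within `window` positions of each occurrence of `query`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != query:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: words used in similar contexts get similar vectors.
TOKENS = "the cat sat on the mat and the dog sat on the rug".split()
cat_context = cooccurrence_counts(TOKENS, "cat")
dog_context = cooccurrence_counts(TOKENS, "dog")
similarity = cosine_similarity(cat_context, dog_context)
```

Here "cat" and "dog" receive a high similarity score because both occur near "the", "sat", and "on". The same example shows the limitation noted in the text: the context consists of individual words, so the score reflects shared neighbors rather than equivalence of meaning.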
Most advanced search and text mining applications use manually assembled linguistic rules, semantic knowledge, and ontologies and taxonomies. These methods and systems can be used to provide semantic clues for meta-tagging data by category as well as other purposes. In addition, some systems incorporate various supervised and unsupervised statistical learning and extraction methods including Bayesian methods assessing relevance probabilities to add to the analysis for search and/or categorization. These systems do not effectively mine text because the methods do not yield consistently accurate (i.e., relevant) search results. Additionally, because meta-tagging involves the pre-defining of information into categories to be used as part of enhanced search, the category determination requires that static labels be put on multi-dimensional ideas (that may also evolve or change categories over time). None of these systems are designed to mine information to find other words and phrases of equivalent meaning to query terms.
The ability of a system to identify semantically equivalent alternative representations of a word or word string within a language has many applications. The ability to generate synonymous expressions for any expression, in addition to text mining, is also a very effective component of any corpus-based machine translation system. In addition, the ability to identify expressions of equivalent meaning is machine understanding of natural language, and this ability could provide the foundation for artificial intelligence (AI) applications.
IV. State of the Art of Artificial Intelligence
The most ambitious goal of machine understanding of human language is for use in a system that achieves full-scale human quality intelligence, i.e., a system that is capable of reasoning rationally and exhibiting human-type common sense. This field of computing, referred to as “Strong AI,” has as its ultimate goal to enable computers to understand natural language, interact with people or other computers using natural language, learn concepts, make insights, and perform cognitive tasks. While a machine translation system has the task of understanding information only to the level necessary for the purpose of converting the information into another form, Strong AI applications need the capability to not only understand information and its other forms and states, but also to manipulate that information in a way that triggers the system to learn to answer questions and perform other cognitive tasks, such as draw conclusions from premises, discover relations from observations, and set sub-goals to pursue further knowledge gathering in anticipation of expected future needs.
The mathematician Alan Turing proposed the Turing Test in 1950 as a conceptual design for testing whether a machine has achieved human-quality intelligence. Although a machine that passed the Turing Test would not necessarily fulfill all the ambitions of Strong AI, even the most optimistic proponents of Strong AI feel that a computer will not convincingly pass the Turing Test for decades.
AI methods known in the art vary in approach. The vast majority of commercial AI applications address far more narrow tasks than the goals of Strong AI. These applications are sometimes referred to as “Weak AI” and produce at best “idiot-savant”-type systems capable only in the confines of a narrow task such as playing master-level chess. Various methods used to produce these systems include manually encoding knowledge and rules, and systems that can learn how to generalize certain encoded knowledge to perform narrowly defined tasks. Other methods like neural nets have been developed to train systems to learn, again in very narrowly defined domains. In the absence of a true breakthrough that enables broad machine understanding of natural human languages, the focus on narrow problems enables practical applications for specific tasks.
There have been relatively few Strong AI software initiatives. Typically, Strong AI systems known in the art manually encode knowledge using a specific computer language designed for that purpose and then employ a system to manipulate that knowledge in the aggregate to attempt to answer questions or perform tasks. The most prominent example of a Strong AI system using a manually created ontology of encoded knowledge is the Cyc system developed at Cycorp by computer scientist Doug Lenat. The Cyc system requires human beings to manually encode a vast amount of common sense knowledge as well as domain-specific knowledge (and understand the different representations of that knowledge), which serve as “rules” for the system to follow. An example of a hand-encoded rule or piece of knowledge for Cyc might be “once people die they stop buying things” or “trees are usually outside.” Cyc has been in development since 1984 without producing a system with wide-ranging human intelligence. To date, its developers have hand-encoded fewer than 2 million of these very specific rules.
An enabling breakthrough in Strong AI would have far reaching implications. The evolution of technological advancement would increase dramatically as scalable computer processing and memory, armed with human quality intellect, is focused on the issues and problems we all face. A fundamental breakthrough in Strong AI could literally change the world as we know it.