The linguistic knowledge needed to process complex language comprises several categories, three of which are: 1) morphology, the study of the meaningful components of words; 2) syntax, the study of structural relationships between words; and 3) semantics, the study of the meanings or senses of words. In automatic natural language processing systems, human languages are parsed by computer programs. However, human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language. Therefore, natural language processors use the categories of linguistic knowledge to resolve ambiguity at one of these levels. A word, phrase, or sentence can be considered ambiguous if it can be interpreted in more than one way, i.e., if more than one linguistic structure can be associated with it. For example, syntactic ambiguity arises when a sentence can be parsed in more than one way. Lexical ambiguity arises when context is insufficient to determine the sense of a single word that has more than one meaning. Semantic ambiguity arises when a word or concept has an inherently diffuse meaning based on widespread or informal usage.
Because different natural languages are structured differently, there are different approaches to natural language processing for different types of languages. For example, processing Latin or Anglo-Saxon languages requires a different approach than processing Arabic or Asian languages. However, regardless of the language being processed, the models and algorithms comprising natural language processors use the categories of linguistic knowledge to resolve these ambiguities. In the evolution of automatic natural language processing, different combinations of these linguistic knowledge categories have been used to varying degrees.
The first type of linguistic processor developed commercially utilized a morphological and part-of-speech tagging approach. The morphological approach uses parsing algorithms that attempt to recognize different words having the same root. For example, if the word “work” is a root, the words “working,” “worked,” and “works” share that same root word. Thus, the first type of linguistic technology focuses on morphological recognition of a word, and is the starting point for any linguistic technology.
Any linguistic processor that performs morphological recognition has two requirements. One is the use of a stored dictionary of words. The dictionary stores not only a list of the words comprising a particular language, but also information pertaining to each of those words. Some basic information stored for the words includes a morphological listing of root words.
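As an illustration only, the dictionary-based root recognition described above might be sketched as follows. The dictionary contents, the suffix list, and the function name `find_root` are assumptions made for this example and do not come from any particular system; a practical processor would store a full lexicon with far richer morphological information.

```python
# Hypothetical stored dictionary of root words (a real lexicon would be much larger).
ROOT_DICTIONARY = {"work", "parse", "tag"}

# Assumed inflectional suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def find_root(word):
    """Return the dictionary root of `word`, or None if no root is found."""
    candidate = word.lower()
    if candidate in ROOT_DICTIONARY:
        return candidate
    for suffix in SUFFIXES:
        if candidate.endswith(suffix):
            stem = candidate[: -len(suffix)]
            # Try the bare stem, then the stem with a restored final "e"
            # (so that "parsing" -> "pars" -> "parse").
            for form in (stem, stem + "e"):
                if form in ROOT_DICTIONARY:
                    return form
    return None


if __name__ == "__main__":
    for w in ("working", "worked", "works", "parsing"):
        print(w, "->", find_root(w))
```

This simple suffix stripping deliberately ignores cases such as consonant doubling and irregular forms, which is one reason a stored dictionary carrying per-word information is needed in practice.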
The second requirement for any linguistic processor is part-of-speech tagging, also called grammatical tagging. In this process, the words in the text are marked as corresponding to a particular part of speech, based on both their dictionary definitions and their context, i.e., their relationship with adjacent and related words in a phrase, sentence, or paragraph. When attempting to recognize words in the text, the linguistic processor utilizes grammatical rules to define which words are nouns, verbs, adjectives, adverbs, and so on. There are many approaches for performing text analysis, such as Lexical Functional Grammar (LFG).
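A minimal sketch of tagging that combines dictionary information with context could look like the following. The lexicon, the tag names, the two context rules, and the default tag for unknown words are all assumptions chosen for illustration; they are not the rules of LFG or of any specific tagger.

```python
# Hypothetical lexicon: each word maps to the parts of speech it may take.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "to": ["PART"],
    "they": ["PRON"],
    "work": ["NOUN", "VERB"],
    "is": ["VERB"],
}


def tag(tokens):
    """Assign one part-of-speech tag per token using the lexicon and the previous tag."""
    tags = []
    for token in tokens:
        # Unknown words default to NOUN (an assumption made for this sketch).
        options = LEXICON.get(token.lower(), ["NOUN"])
        previous = tags[-1] if tags else None
        choice = options[0]
        if len(options) > 1:
            # Context rule 1: after a determiner, prefer the noun reading.
            if previous == "DET" and "NOUN" in options:
                choice = "NOUN"
            # Context rule 2: after a pronoun or "to", prefer the verb reading.
            elif previous in ("PRON", "PART") and "VERB" in options:
                choice = "VERB"
        tags.append(choice)
    return list(zip(tokens, tags))


if __name__ == "__main__":
    print(tag(["the", "work"]))
    print(tag(["they", "work"]))
```

Here the word "work" receives a different tag depending on the word that precedes it, which is the role context plays in the tagging process described above.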
The morphological and grammatical analyses described above are the two basic elements required for any linguistic processor that performs natural language processing. Processing text using only morphological analysis and grammatical analysis is referred to as shallow linguistic processing or tagging.
The next step of linguistic analysis beyond the shallow approach is deep linguistic processing. Deep linguistic processing involves sentence analysis and uses the outcome of shallow linguistic processing to determine sentence structures (e.g., identifying the subject, the object, the verb, the direct object, and so on). Sentence analysis is much more language specific than the other steps in the process because there are significant sentence structure differences between languages. Sentence analysis has been introduced into commercial systems, but has been used sparingly because it requires significantly more processing time than shallow approaches. It is believed that between 70 and 80% of today's commercially available natural language linguistic processors perform shallow linguistic processing.
Existing algorithms for sentence analysis are mainly based on proximity and group recognition. In this approach, the verb is used as a starting point for finding a relevant set of text, and then other elements are recognized, such as the subject, object, and complements. The majority of published algorithms use a proximity concept that relies on heuristics such as "the subject is before the verb," "the adjective follows the transitive verb," and so on. Semantic information is also used during this analysis.
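The proximity idea can be illustrated with a short sketch, assuming the input has already been tagged by a shallow processing step. The tag names and the function `analyze_sentence` are hypothetical, and the two heuristics shown (nearest noun or pronoun before the verb is the subject, nearest one after it is the object) are a simplification for illustration rather than any published algorithm.

```python
def analyze_sentence(tagged):
    """Find verb, subject, and object in a list of (word, tag) pairs by proximity."""
    # Use the first verb as the starting point of the analysis.
    verb_index = next(
        (i for i, (_, pos) in enumerate(tagged) if pos == "VERB"), None
    )
    if verb_index is None:
        return None

    nominal = ("NOUN", "PRON")
    # Heuristic: the subject is the nearest noun or pronoun before the verb.
    subject = next(
        (word for word, pos in reversed(tagged[:verb_index]) if pos in nominal),
        None,
    )
    # Heuristic: the object is the nearest noun or pronoun after the verb.
    obj = next(
        (word for word, pos in tagged[verb_index + 1:] if pos in nominal),
        None,
    )
    return {"verb": tagged[verb_index][0], "subject": subject, "object": obj}


if __name__ == "__main__":
    sentence = [
        ("the", "DET"), ("parser", "NOUN"), ("reads", "VERB"),
        ("the", "DET"), ("text", "NOUN"),
    ]
    print(analyze_sentence(sentence))
```

Because such heuristics depend on word order, they carry over poorly between languages with different sentence structures, consistent with the language-specific nature of sentence analysis.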
The final step of deep linguistic processing is semantic disambiguation. In existing systems, semantic disambiguation is mainly implemented in restrictive domains, meaning that to have a semantic understanding of the text, the system first requires an understanding of the contextual reference or main concept inside the text.
Despite the approaches described above, the field of automatic natural language processing has not yet reached the status of a mainstream technology, nor has it had related commercial success. This may be due to at least the following disadvantages. One disadvantage is that nearly all conventional approaches to semantic disambiguation use statistical approaches to address complex issues, such as word sense disambiguation. In order to reduce the complexity of the problem, several approaches treat the use of statistics, in one way or another, as the key to enabling natural language processing. For example, Pat. US 2005/0049852A1 describes a system based on probabilistic grammar analysis and statistical training to improve the results of each step of the analysis. However, this approach fails to provide the level of precision and quality that is required to ensure that the automatic management of unstructured information is a viable alternative, especially in complex scenarios. This quality level can be achieved only if the complexity is considered and faced in its entirety.
Another disadvantage is the reliance on genetic algorithms. The genetic approach is usually used in conjunction with the statistical and narrow macro approach to improve the quality of the results of the semantic disambiguation. The genetic approach attempts to make better use of the information extracted from the statistical approach. While the genetic approach may slightly improve the quality of the results of semantic disambiguation, no system for real-world use has yet been demonstrated that provides sufficient quality in the semantic understanding of the text. This attempt to simplify the processing by reducing the possible combinations to be analyzed (e.g., Patent WO 2005/033909 A2) limits the ability to maximize the precision of the analysis, again resulting in lower precision in the disambiguation and in the extraction of links between the concepts identified in the text.
A further disadvantage is that both statistical and narrow macro approaches require that the natural language processing system first be trained to perform semantic disambiguation using a set of examples that contain the type of information the system is trying to understand. The system learns the content of the training examples and creates mechanical rules for performing the semantic disambiguation later. Requiring that the system be trained prior to use can be inefficient and time-consuming.
Accordingly, there is a need for an improved computer-implemented method for automatically extracting relations between concepts included in text.