Computers allow the storage and retrieval of an increasing amount of information. This information is in turn made available to users through local computer networks as well as global computer networks such as the Internet. This information is generally in the form of narrative texts. As the amount of information available increases, so does the difficulty in locating, extracting and making sense of relevant information. Additionally, the information contained in narrative texts is not organized in a form that can be easily processed and manipulated by a computing apparatus to extract desired information.
Natural language information extraction systems are used to extract information from written texts so as to facilitate its processing by a computing apparatus. Commonly, the process of natural language understanding is divided into three distinct sub-processes, namely morphological analysis, syntactic processing and semantic analysis.
The role of semantic analysis is to generate a logical form that describes the meaning of a sentence rather than just the syntactic relationships between its words.
Morphological analysis is the process of assigning to each word the most likely part-of-speech (POS) or morphological tag. Words can have different forms (for example, “work”, “works”, “working” and “worked” are different forms of the word “work”) and can have different roles (for example, “work” can be a noun as in “difficult work”, or a verb as in “I work hard”). Those roles are commonly referred to as part-of-speech or morphological tags. Morphological analysis can be divided into two separate stages, namely morphological tagging and morphological disambiguation. Morphological tagging is the process of determining the set of possible morphological tags for a word, and is relatively well understood in the art. Morphological disambiguation is the process of determining the actual or most likely morphological tag of a word in a sentence.
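The tagging stage described above can be illustrated with a minimal sketch. The lexicon, the tag names and the function names below are assumptions made for illustration only and are not taken from any particular system; the sketch simply returns, for each word, the set of tags it could carry, leaving the choice among them to the disambiguation stage.

```python
# Hypothetical toy lexicon: each word form maps to its set of possible tags.
LEXICON = {
    "i": {"PRON"},
    "work": {"NOUN", "VERB"},
    "works": {"NOUN", "VERB"},
    "hard": {"ADJ", "ADV"},
    "difficult": {"ADJ"},
}


def possible_tags(word):
    """Morphological tagging: return every tag the lexicon allows for a word."""
    # Words missing from the toy lexicon get a placeholder tag.
    return set(LEXICON.get(word.lower(), {"UNKNOWN"}))


def tag_sentence(words):
    """Pair each word with its set of candidate tags (no disambiguation yet)."""
    return [(word, possible_tags(word)) for word in words]


tagged = tag_sentence(["I", "work", "hard"])
```

In this sketch “work” keeps both the noun and the verb reading; selecting one of them for a given sentence is the separate task of morphological disambiguation.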
There are two well-known classes of methods for morphological disambiguation: probabilistic and rule-based. Probabilistic methods make use of statistical measurements derived from a plurality of training sentences, which are generally tagged by hand. Using well-known statistical methods, a general disambiguation process is obtained by training a computer program on these hand-tagged training sentences. This method is described in Elworthy, D., “Part-of-speech tagging and phrasal tagging”, Technical report, University of Cambridge Computer Laboratory, Cambridge, England, 1993, whose contents are hereby incorporated by reference. In addition to incurring the cost of manually tagging the training sentences, probabilistic methods perform only as well as the training sentences used to derive the statistical measurements allow. Rule-based methods use linguistic rules written by people. Examples of rule-based methods are described in Voutilainen, Atro, “Morphological disambiguation”, 1995, whose contents are hereby incorporated by reference. These linguistic rules examine the context in which a word appears and either assign a definite morphological tag or remove an unlikely possibility from the set of possible morphological tags. A deficiency in known rule-based approaches is that they make use of general linguistic rules that fail to address particular cases, so that many ambiguities are left unresolved.
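The rule-based approach can be sketched as follows. The two context rules shown are hypothetical and deliberately simplistic; they are not taken from the cited references. Each rule removes an unlikely tag from a word's candidate set based on the tag of the preceding word, and a word whose context matches no rule stays ambiguous.

```python
def apply_rules(tagged):
    """Prune candidate tags using two hypothetical context rules.

    tagged: list of (word, set_of_candidate_tags) pairs.
    Returns a new list; the input is left unchanged.
    """
    result = [(word, set(tags)) for word, tags in tagged]
    for i in range(1, len(result)):
        prev_tags = result[i - 1][1]
        word, tags = result[i]
        if len(tags) < 2:
            continue  # already unambiguous
        if prev_tags == {"PRON"}:
            pruned = tags - {"NOUN"}  # assumed rule: no noun right after a pronoun
        elif prev_tags == {"ADJ"}:
            pruned = tags - {"VERB"}  # assumed rule: no verb right after an adjective
        else:
            pruned = tags  # no rule applies: the ambiguity remains
        if pruned:  # never remove the last remaining candidate
            result[i] = (word, pruned)
    return result


sentence = [("I", {"PRON"}), ("work", {"NOUN", "VERB"}), ("hard", {"ADJ", "ADV"})]
resolved = apply_rules(sentence)
```

Here “work” is reduced to a verb, while “hard” keeps two candidate tags because neither rule covers its context, which illustrates how general rules can leave ambiguities unresolved.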
Syntactic processing uses the information provided by the morphological analysis and attempts to identify the relationships between words (e.g., subject, object, complement, etc.). There are two common methods for representing the syntax of a sentence: constituency and dependency.
The overwhelming majority of parsers use constituency syntax. In constituency syntax, a sentence is depicted as a tree where each node is labeled with the type of constituent (e.g., noun phrase, verb phrase, etc.) and the leaves store the individual words of the sentence. The tree itself is ordered and arcs on the tree are not labeled.
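A constituency tree of this kind can be represented, as a minimal sketch, with nested tuples. The labels and the particular bracketing of “I work hard” are assumptions chosen for illustration: each tuple is a node whose first element is the constituent label and whose remaining elements are its ordered children, and plain strings are the leaves.

```python
# Each node is (label, child, child, ...); a plain string is a leaf (a word).
# The arcs carry no labels; only the nodes do, and child order is significant.
tree = (
    "S",
    ("NP", ("PRON", "I")),
    ("VP", ("VERB", "work"), ("ADVP", ("ADV", "hard"))),
)


def leaves(node):
    """Return the words of the sentence by reading the leaves left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


sentence_words = leaves(tree)
```

Because the tree is ordered, reading its leaves from left to right recovers the original word sequence.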
A few parsers use dependency syntax, where a sentence is depicted as a tree in which all nodes and leaves are associated with words in the sentence and arcs in the tree are associated with data elements indicative of relationships between words (e.g., subject, object, etc.). Dependency syntax is described in Tesnière, Lucien, “Éléments de syntaxe structurale”, Éditions Klincksieck, Paris, 1959; Mel'cuk, Igor A., “Dependency Syntax: Theory and Practice”, State University of New York Press, Albany, 1987. The contents of these documents are hereby incorporated by reference.
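By way of contrast with constituency syntax, a dependency tree can be sketched as follows. The `Node` class, the relation names and the analysis of “I work hard” are assumptions made for illustration. Every node holds a word, and each arc from a word to one of its dependents carries a relation label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A word in the sentence together with its labeled outgoing arcs."""
    word: str
    children: list = field(default_factory=list)  # list of (relation, Node)

    def attach(self, relation, child):
        """Add a labeled arc from this word to a dependent word."""
        self.children.append((relation, child))
        return child


def arcs(node):
    """List every (head, relation, dependent) triple in the tree, depth first."""
    triples = []
    for relation, child in node.children:
        triples.append((node.word, relation, child.word))
        triples.extend(arcs(child))
    return triples


# Assumed analysis: the verb is the root and governs both other words.
root = Node("work")
root.attach("subject", Node("I"))
root.attach("modifier", Node("hard"))
dependency_arcs = arcs(root)
```

Unlike the constituency representation, there are no phrase-level nodes here: the structure consists only of words and the labeled relations between them.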
A deficiency in prior art syntactic processors is that they do not provide practical domain-independent parsing capabilities. More specifically, to provide a suitable level of performance, syntactic analysis should work on complete sentences. However, complex sentences are often very long and contain various punctuation marks and symbols. Prior art parsers often have difficulty returning a complete parse for sentences beyond a certain level of complexity and for this reason perform poorly when parsing complex texts such as those found in newspapers, journals and the like.
Another deficiency in prior art syntactic processors is that they provide no practical way of deferring syntactic disambiguation to a later stage of analysis while preserving a plurality of syntactic possibilities. For example, different constituents and words can be attached at different places (e.g., “receive a flu shot in a leg” compared to “receive a flu shot in a clinic”). Prior art syntactic processors attempt this disambiguation on a syntactic basis alone, or combined with simplified semantic tagging, and provide correct results in only a low percentage of cases.
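The idea of preserving several syntactic possibilities and deferring the choice can be sketched as follows. The semantic classes, the preferred attachment for each class and all function names are assumptions introduced purely for illustration; they do not describe any prior art system. The syntactic stage keeps both candidate attachment sites for the prepositional phrase, and a later stage filters them only when semantic information is available.

```python
def candidate_attachments(pp_noun):
    """Syntactic stage: keep both attachment sites for 'in a <pp_noun>'."""
    return [("receive", "in", pp_noun), ("shot", "in", pp_noun)]


# Hypothetical semantic knowledge, assumed only for this illustration.
SEMANTIC_CLASS = {"leg": "body_part", "clinic": "place"}
PREFERRED_HEAD = {"body_part": "shot", "place": "receive"}


def resolve(candidates):
    """Later stage: filter the preserved candidates using semantic classes."""
    pp_noun = candidates[0][2]
    preferred = PREFERRED_HEAD.get(SEMANTIC_CLASS.get(pp_noun))
    kept = [c for c in candidates if c[0] == preferred]
    # With no semantic information, every candidate is preserved.
    return kept or candidates


leg_reading = resolve(candidate_attachments("leg"))
clinic_reading = resolve(candidate_attachments("clinic"))
```

Under these assumptions the two example phrases, which are identical at the syntactic level, receive different attachments only once the semantic stage is reached, and a phrase with an unknown noun keeps both possibilities.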
Thus, there exists a need in the industry to refine the process of natural language understanding so as to obtain an improved natural language information extraction system.