1. Field
This application relates to systems and methods for automatically recognizing and extracting knowledge from documents in electronic or digital form, where that knowledge reflects regularities of the outside world in the form of cause-effect relations between facts.
2. Description of Related Art
The following U.S. Patent documents provide descriptions of art related to the present application: U.S. Pat. No. 5,418,889, issued May 1995 to Ito (hereinafter Ito); U.S. Pat. No. 6,185,592, issued Feb. 2001 to Boguraev et al. (hereinafter Boguraev 1); U.S. Pat. No. 6,212,494, issued Apr. 2001 to Boguraev (hereinafter Boguraev 2); U.S. Pat. No. 6,263,335, issued Jul. 2001 to Paik et al. (hereinafter Paik); U.S. Pat. No. 6,754,654, issued Jun. 2004 to Kim et al. (hereinafter Kim); U.S. Pat. No. 6,823,325, issued Nov. 2004 to Davies et al. (hereinafter Davies); and U.S. Pat. No. 6,871,199, issued Mar. 2005 to Binnig et al. (hereinafter Binnig).
Knowledge engineering is the major tool for intellectualization of modern information technologies. Knowledge engineering was traditionally based on generalization of information obtained from experts in different knowledge domains. However, analysis shows that this approach cannot be utilized for creating adequate real-life (industrial) applications. Two questions arise: first, what can be the most reliable and effective source of such knowledge; and second, how can this knowledge be recognized, extracted and later formalized. Analysis shows that at the present time, the time of global computerization, the most reliable source of knowledge is text in the broad sense of the word, that is, text as a set of documents in natural language (books, articles, patents, reports, etc.). Thus, the basic premises of knowledge engineering in the light of the second question are as follows:
1. text is the ideal natural and intellectual model of knowledge representation
2. one can find everything in the text
The second premise may seem excessively categorical, but as the range of available text continues to grow, it is more and more the case.
What types of knowledge can be obtained from text and with what automatic means? Some existing methods are aimed at databases that have a strict structure and are manually compiled, or at texts with strictly defined fields. Usually only a shallow linguistic analysis of the text is performed. Kim describes processing text with a rigid structure (primarily emails). Kim's process extracts corresponding information from previously known fields of source documents and places it in predefined fields of a database (DB) that reflects the structure of the organization (such a DB has, for example, fields for names and titles of individuals within an organization). The linguistic processing described in Kim is utilized only for the extraction of key terms from documents according to so-called filters.
Davies describes performing lexical and grammatical analysis of text in order to differentiate nouns from verbs and thereby to perform a strictly defined search in a predefined and structured database according to “how”, “why”, “what” and “what is” relations.
Binnig also describes the use of a pre-structured database (i.e., a Knowledge Database) in the form of a fractal hierarchical network, which reflects the knowledge of the outside world (knowledge domain) in order to automatically expand information from an input string. Initially the input string (for example, part of a sentence, or the whole sentence, etc.) is treated with a semantic processor that performs syntactic and grammatical parsing and transformation to build an input network. This network is then “immersed” into the Knowledge Database to expand the input information. In other words, the input information is recorded and later expanded by means of a model of the outside world that covers objects, their relations and attributes.
Boguraev 1 describes performing a deep text analysis in which, for text segments, the most significant noun groups are marked on the basis of their usage frequency in weighted semantic roles.
All of the above-mentioned cases concern particular knowledge about concepts. This is the entry level of knowledge that can be extracted from text.
Boguraev 2 describes the use of computer-mediated linguistic analysis to create a catalog of key terms in technical fields and to also determine doers (solvers) of technical functions (verb-object).
Ito describes the use of a Knowledge Base including a Causal Model Base and a Device Model Base. The Device Model Base has sets of device knowledge describing the hierarchy of devices of the target machine. The Causal Model Base is formed on the basis of the Device Model Base and has sets of causal relations of fault events in the target machine. Thus, the possible cause of failure in each element of the device is guessed on the basis of information about its structural connections with other elements of the device. Usually, it is the most “connected” elements that are determined to be the cause.
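By way of illustration only, the connectivity-based guess described above can be sketched as follows. This is a minimal sketch and not the method disclosed in Ito: the function name `rank_candidate_causes`, the representation of the device as a list of connected element pairs, and the sample element names are all assumptions introduced here.

```python
from collections import defaultdict


def rank_candidate_causes(connections, faulty):
    """Rank the elements connected to `faulty`, most connected first.

    `connections` is a list of (element, element) pairs (hypothetical
    representation of structural connections within a device).
    """
    neighbours = defaultdict(set)
    for a, b in connections:
        neighbours[a].add(b)
        neighbours[b].add(a)
    candidates = neighbours[faulty]
    # Sort by descending number of connections; break ties by name.
    return sorted(candidates, key=lambda e: (-len(neighbours[e]), e))


# Invented sample device, for illustration only.
connections = [
    ("power_supply", "controller"),
    ("power_supply", "motor"),
    ("power_supply", "sensor"),
    ("controller", "motor"),
]
ranking = rank_candidate_causes(connections, "motor")
print(ranking)
```

In this sketch the guess depends only on structural connectivity, not on any relation between facts stated in text.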
Paik describes a system that is domain-independent and automatically builds its own subject knowledge base. The system recognizes concepts (any named entity or idea, such as a person, place, thing or organization) and relations between them. These relations allow the creation of concept-relation-concept triples. The knowledge recognized in Paik is therefore close to the next important knowledge level, facts (subject-action-object), although these triples are not yet facts. Paik also mentions “cause” relations between concepts (in the context of concept-relation-concept triples). However, these are not yet Cause-Effect relations between facts, which form the next very important level of knowledge, because it is this knowledge that reflects the regularities of the outside world (or of the knowledge domain).
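The distinction between the three knowledge levels discussed above (a concept-relation-concept triple, a subject-action-object fact, and a cause-effect relation between two facts) can be illustrated with a minimal sketch. The class names, field names and sample values below are assumptions made for illustration; they are not taken from Paik or from any other cited document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptTriple:
    """Concept-relation-concept: a relation between two concepts."""
    concept_a: str
    relation: str
    concept_b: str


@dataclass(frozen=True)
class Fact:
    """Subject-action-object: a complete statement about the world."""
    subject: str
    action: str
    obj: str


@dataclass(frozen=True)
class CauseEffect:
    """A relation whose two ends are whole facts, not bare concepts."""
    cause: Fact
    effect: Fact


# Invented sample values, for illustration only.
triple = ConceptTriple("friction", "cause", "heat")
cause = Fact("friction", "heats", "bearing")
effect = Fact("bearing", "loses", "lubricant")
link = CauseEffect(cause, effect)
print(link)
```

In this sketch a “cause” relation in a triple links two concepts, whereas `CauseEffect` links two complete facts.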