The present invention relates generally to the field of computer-based information retrieval, and more specifically to the application of natural language processing (NLP) techniques to the interpretation and representation of computer text files, and to the matching of natural language queries to documents.
As the amount of electronic information continues to increase, the demand for sophisticated information access systems also grows. Over the years, new types of information access systems such as data mining systems have become commercially available; however, until the present invention, domain-independent question-answering systems have existed only as experimental prototypes.
Both types of systems require a pre-constructed repository of information in which to find answers to users' questions. Data mining systems commonly use statistical procedures to detect patterns in data; users are expected to interpret the patterns to find the answers. The current interest in data mining systems, and their commercial success, stem from their being designed to use the same data that legacy database management systems already use.
In comparison, question-answering systems are designed to provide answers directly to users, as if the users were engaged in question-answering sessions with other people. This requires the systems to perform complex inferencing to draw answers from organized knowledge bases. Over the years, there has been significant progress in the problem-solving aspect of AI research; however, the only practical AI applications are those used in a few narrowly defined domains. This is due to the lack of practical knowledge bases. Research has demonstrated that building the requisite knowledge bases manually is extremely time consuming and expensive.
For a number of years, both manual and automatic approaches to constructing knowledge bases have been studied and implemented; however, manual construction of knowledge bases has been too expensive to be practical, as was discovered in the CYC project (Lenat et al., 1989), and automatic approaches have not yet produced domain-independent and usable knowledge bases. The CYC project was an attempt to build a common-sense knowledge base containing all the information necessary for a person to understand a one-volume desk encyclopedia and a newspaper. The project began in 1984, with specially trained knowledge editors manually entering knowledge in the CYC database. The knowledge base is still incomplete. In recent years, there has been increased interest in textual information extraction research using natural language processing techniques. The most common medium for storing knowledge is text, and textual information extraction automatically extracts and organizes knowledge from texts.
Research efforts in this field have been reported at the Message Understanding Conferences (MUCs). The task set at the MUCs was to automatically extract information from news texts to populate structured databases. Participants were asked to extract information about clearly defined event types (or domains) such as "terrorism in South America." For each event type, participants were given pre-determined categories of information that their systems were required to extract. The goal of the MUCs is to evaluate information extraction systems applied to a common task. The MUCs have been funded by the Advanced Research Projects Agency (ARPA) to measure and foster progress in information extraction. Their focus has been the single task of analyzing free text, identifying events of a specified type, and filling a database template with information about each event (MUC-6).
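The template-filling task described above can be pictured with a minimal Python sketch. The slot names (`perpetrator`, `target`, `location`) and the helper `fill_template` are hypothetical and chosen only for illustration; they are not the actual MUC template definitions.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class EventTemplate:
    """Pre-determined slots for one event type (illustrative slot names only)."""
    event_type: str
    perpetrator: Optional[str] = None
    target: Optional[str] = None
    location: Optional[str] = None


def fill_template(event_type: str, extracted: Dict[str, str]) -> EventTemplate:
    """Copy extracted values into the template, ignoring undefined categories."""
    slot_names = {f.name for f in fields(EventTemplate)} - {"event_type"}
    known = {name: value for name, value in extracted.items() if name in slot_names}
    return EventTemplate(event_type=event_type, **known)


# "weapon" is not a slot in this hypothetical template, so it is dropped.
template = fill_template(
    "terrorism",
    {"perpetrator": "Shining Path", "weapon": "bomb"},
)
print(template.perpetrator)  # Shining Path
print(template.target)       # None
```

Because the slots are fixed in advance for each event type, anything a system extracts outside those categories has nowhere to go in the template.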
In the MUC tradition, there are two fundamental modes of information extraction: atomic and molecular. The atomic approach relies on the strong typing of entities to match them to roles in events; the molecular approach relies much more on the placement of the entity description within syntactic patterns.
For example, within the "atomic" framework of information extraction, a terrorist organization such as "Shining Path" is identified as the perpetrator in a message that has been categorized as a terrorist story. This is possible because all appropriate elements of an event, and the type of each element, are pre-determined. Specifically, an entity of the terrorist organization type is assumed to take the role of the perpetrator of a terrorist activity in a terrorist story.
In comparison, in the "molecular" approach to extracting information, if the name of an organization occupies the subject position of a verb that describes the terrorist activity, such as "bomb" or "kill," the organization is identified as the perpetrator.
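The contrast between the two modes can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of any MUC system: the lookup tables, the `Clause` type, and the function names are assumptions, and the clause is assumed to have been parsed into subject, verb, and object beforehand.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical hand-built knowledge for a single event type ("terrorism").
ENTITY_TYPES = {"Shining Path": "terrorist_organization"}
ROLE_BY_ENTITY_TYPE = {"terrorist_organization": "perpetrator"}
TERRORISM_VERBS = {"bomb", "bombed", "kill", "killed"}


@dataclass
class Clause:
    """A clause already parsed into subject, verb, and object (assumed input)."""
    subject: str
    verb: str
    obj: str


def atomic_role(entity: str) -> Optional[str]:
    """Atomic mode: the entity's type alone determines its role in the event."""
    entity_type = ENTITY_TYPES.get(entity)
    if entity_type is None:
        return None
    return ROLE_BY_ENTITY_TYPE.get(entity_type)


def molecular_perpetrator(clause: Clause) -> Optional[str]:
    """Molecular mode: the subject of a terrorism verb is taken as the perpetrator."""
    if clause.verb.lower() in TERRORISM_VERBS:
        return clause.subject
    return None


clause = Clause(subject="Shining Path", verb="bombed", obj="the embassy")
print(atomic_role("Shining Path"))    # perpetrator
print(molecular_perpetrator(clause))  # Shining Path
print(atomic_role("Acme Corp"))       # None: entity is not in the typed list
```

In this sketch both functions depend entirely on tables written for one event type: the atomic lookup fails for any entity that has not been listed and typed, and the molecular rule fails for any verb that has not been enumerated.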
The limitation of both these approaches is that they are domain-dependent. Changing domains requires a lengthy process of preparing a new knowledge base for the new subject, one that lists the relevant entities and events exhaustively. Both approaches depend on careful analysis of the terminology commonly used in each event type. Thus, every participating system has to be reworked in one of two ways: either to capture the typical roles of an exhaustive list of entities that may occur in the designated event (for example, the names of all terrorist groups in South America or the names of bombs), or to identify all possible verbs that can be used to describe the event, together with the roles associated with the syntactic arguments of those verbs. These processes can take long periods of time, varying from a few weeks to several months.
While many participating systems in MUC have been successful in extracting relevant information, given that there are an almost infinite number of event types or subject domains, it does not seem feasible to build a domain-independent textual information extraction system by following MUC's one-domain-at-a-time approach.