This invention relates generally to computer software and, more specifically, to a system and method for information extraction.
Personal computers or workstations may be linked in a computer network to facilitate the sharing of data, applications, files, and other resources. One common type of computer network is a client/server network, where some computers act as servers and others as clients. In a client/server network, the sharing of resources is accomplished through the use of one or more servers. Each server includes a processing unit that is dedicated to managing centralized resources and to sharing these resources with other servers and/or various personal computers and workstations, which are known as the "clients" of the server.
Computers often need to retrieve information requested by a user. The information may be available locally or may be available on another computer, such as a server, through a network. Retrieving information is relatively simple when the user wishes to retrieve specific information which the user knows to exist and when the user knows relevant parameters about the information to be retrieved such as a document name, an author, or a directory name. However, when the user wishes to retrieve information and has no knowledge of where it might be located or in what document it might be contained, more sophisticated information retrieval ("IR") techniques are necessary.
IR systems use a search query, input by a user, to locate information which satisfies the query and then return the information to the user. Simple IR systems may use the original query, while more advanced systems may modify the query by adding parameters or changing its format. IR systems may be limited to searching a specific database accessible to the system or they may be enabled to search any available information, such as that located on the Internet. Successfully searching unstructured information such as that available on the Internet generally demands a more flexible IR system, since users have no knowledge of how the information for which they are looking might be indexed and stored.
However, flexible IR systems are difficult to develop. Part of this difficulty stems from the inherent complexity of natural languages, which operate on several different levels of meaning simultaneously. Five of the levels of meaning are the morphological, syntactic, semantic, discourse, and pragmatic levels.
The morphological level focuses on a morpheme, which is the smallest piece of a word that has meaning. Morphemes include word stems, prefixes and suffixes. For example, "child" is the word stem for "childish" and "childlike."
The syntactic level focuses on the structure of a sentence and the role each word plays in the structure. This level includes the relationship that each word has to the other words in the sentence. For example, the position of a word in a sentence can give valuable insight as to whether the word is the subject of the sentence or an action.
The semantic level focuses not only on the dictionary meaning of each individual word, but also on the more subtle meaning that is derived from the context of the sentence. For instance, the meaning of the word "draw" can change depending on the context in which it is used. To "draw a picture" and to "draw a sword" both use the action "draw," but in very different ways which are made clear by examining the context provided by the related words.
The discourse level examines a document's structure as a whole and derives further meaning from that structure. For example, technical documents usually begin with an abstract, while newspaper articles generally contain important "who, what, where, when" information in the first paragraph. This structure helps identify the type of document being examined, which in turn aids in determining where certain information in the document might be located and how the information might be organized.
The pragmatic level focuses on a body of knowledge that exists outside the document itself and is not actually reflected in the document. For instance, attempting to discover the current status of the European Currency Unit in different countries assumes knowledge of which countries in Europe are taking part in the implementation process, even if those countries are not specifically named in a document.
The levels of meaning operate simultaneously to provide the natural language environment in which communication occurs. Attempts at implementing the different levels of meaning for IR purposes have resulted in three basic types of systems, which may be generally categorized as boolean, statistical/probabilistic, and natural language processing ("NLP"). Many IR systems use a combination of these three basic types.
Boolean systems use basic boolean operators such as "AND" and "OR," which are implemented mathematically to obtain search results. An example of this is a boolean search for "information AND retrieval," which will return documents which contain both "information" and "retrieval." Documents which do not contain both words are ignored by the system. In contrast, a search for "information OR retrieval" will return documents which contain either or both of the words "information" and "retrieval," and so is a less restrictive search than one utilizing the "AND" operator.
Statistical/probabilistic systems use statistical and probabilistic analysis to aid a user in a search by first returning results that seem to be a better answer to the query. "Better" may mean that the words in the query occur more frequently, are closer together, or match some other criterion that the system classifies as superior.
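The boolean behavior described above can be illustrated with a minimal sketch. The function names, the tokenization rule, and the sample documents are assumptions introduced only for illustration and are not taken from any particular IR system.

```python
import re


def index_terms(text):
    """Return the set of lowercase alphabetic words in a document (assumed tokenization)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def boolean_search(docs, left, op, right):
    """Return the sorted ids of documents that satisfy `left op right`."""
    hits = []
    for doc_id, text in docs.items():
        terms = index_terms(text)
        has_left = left.lower() in terms
        has_right = right.lower() in terms
        if op == "AND":
            matched = has_left and has_right
        elif op == "OR":
            matched = has_left or has_right
        else:
            raise ValueError("unsupported operator: " + op)
        if matched:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample collection.
DOCS = {
    "d1": "Information retrieval systems",
    "d2": "Retrieval of lost luggage",
    "d3": "Weather report",
}
```

With this collection, the "AND" query matches only the first document, while the "OR" query also matches the second, which reflects the less restrictive nature of the "OR" operator.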
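One of the criteria mentioned above, the frequency with which query words occur, can be sketched as follows. The scoring function is an assumed, deliberately simple stand-in for the statistical analysis a real system might perform, and all names and sample documents are hypothetical.

```python
import re
from collections import Counter


def score(text, query_terms):
    """Count how many times any query term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(counts[term.lower()] for term in query_terms)


def rank(docs, query_terms):
    """Return ids of matching documents, highest score first (ties broken by id)."""
    scored = [(score(text, query_terms), doc_id) for doc_id, text in docs.items()]
    matching = [(s, doc_id) for s, doc_id in scored if s > 0]
    return [doc_id for s, doc_id in sorted(matching, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical sample collection.
SAMPLE = {
    "a": "currency currency Japan",
    "b": "Japan",
    "c": "weather",
}
```

Under this assumed scoring, document "a" is returned ahead of document "b" for the query terms "japan" and "currency" because the terms occur in it more often, and document "c" is not returned at all.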
NLP systems attempt to treat a natural language query as a complete question and use the words, sentence structure, etc., to locate and retrieve suitable documents. However, the different levels of meaning in natural languages discussed previously make NLP systems extremely difficult to design and implement.
Current IR systems, which are generally a combination of the three systems described above, have yet to successfully overcome many of the obstacles presented by natural language queries. For example, natural language information retrieval should deal not only with synonyms in a single language, but also across regions and countries. A "truck" in the United States is often a "lorry" elsewhere. An additional problem is posed by words having multiple meanings, which often require interpretation through context. For instance, the word "charge" may refer to a military charge, an electrical charge, a credit card debit, or many other actions, each one of which should be known to the IR system.
The inability to specify important but vague concepts presents a further problem to IR systems. For example, formulating a question to identify the likelihood of political instability in a country necessarily involves abstract ideas. False drops are yet another problem in current IR systems. False drops are documents which match the query but are actually irrelevant. An example of this is a simple query for "Japan AND currency," which is intended to find articles on the topic of Japan's currency. However, a document which discusses Japan's housing problems in the first paragraph and the current currency situation in Canada in the third paragraph may be returned because it contains the requested terms.
Indexing inconsistencies also present problems for IR systems. Unless documents are indexed using the same consistent standards, document categories and organization tend to become blurred. A further difficulty to be overcome by IR systems is presented by spelling variations and errors. As with synonyms, spelling variations often occur when dealing with an international audience. Common variations such as "grey"/"gray" or "theater"/"theatre" should be identified by an IR system. In addition, misspellings might cause an IR system to miss a highly relevant document because it fails to recognize misspelled words.
Therefore, what is needed is an information extraction system that utilizes improved natural language processing techniques to process a user's natural language question, locate the information requested in the question, and return the information to the user.
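The false drop in the "Japan AND currency" example can be reproduced with a short sketch. The paragraph-level check shown for contrast is only an assumed illustration of why the document-level match is misleading; it is not presented as a remedy used by any particular system, and the sample text is hypothetical.

```python
import re


def terms(text):
    """Return the set of lowercase alphabetic words in the text (assumed tokenization)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_document(text, required):
    """True if every required term appears somewhere in the document."""
    found = terms(text)
    return all(term in found for term in required)


def matches_paragraph(text, required):
    """True if every required term appears within a single paragraph."""
    paragraphs = text.split("\n\n")
    return any(all(term in terms(p) for term in required) for p in paragraphs)


# Hypothetical document: Japan in the first paragraph, currency in the third.
FALSE_DROP = (
    "Japan faces a housing shortage.\n\n"
    "Rents are rising.\n\n"
    "Canada's currency weakened."
)
```

The document-level test accepts this text even though no single paragraph discusses both terms, which is the sense in which the document is a false drop.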
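Handling of spelling variations and misspellings can be sketched as a normalization step applied before matching. The variant table, the vocabulary, and the similarity cutoff below are assumptions chosen for illustration; the fuzzy match relies on `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Assumed table mapping regional spellings to one canonical form.
VARIANTS = {"grey": "gray", "theatre": "theater"}

# Assumed vocabulary of index terms known to the system.
VOCABULARY = ["gray", "theater", "information", "retrieval"]


def normalize(word):
    """Map a word to a known index term where possible, else return it lowercased."""
    word = word.lower()
    word = VARIANTS.get(word, word)
    if word in VOCABULARY:
        return word
    # Fall back to the closest vocabulary entry if it is similar enough.
    close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else word
```

In this sketch a regional variant such as "grey" and a transposition error such as "retreival" both resolve to known index terms, while an unrelated word is left as it was entered.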
In response to these and other problems, an improved system and method are provided for extracting information using a pipe-lined finite-state architecture.
For example, in one implementation, a system and method are provided for processing a document containing multiple words to extract an answer. The method uses a set of signatures and signature based rules. Each rule includes a name formed by concatenating the signatures included in the rule and a concept which is assignable to the combination of signatures included in the rule. The method parses the words and assigns a signature to each of them. It may also assign a window which includes a parameter defining the number of signatures viewable at one time. The window may be moved so that the window reveals signatures not viewable in its previous position. The signatures viewable in the window are read and a rule name is created by concatenating the signatures. The created rule name is compared to the names of the rules and, if the created name matches any of the rule names, the matching rule is applied to create a concept and the concept is assigned back into the window.
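The windowed rule matching summarized above can be sketched as follows. The signature scheme ("CAP", "NUM", "LOW"), the underscore used when concatenating signatures, the rule table, and the choice to leave the window in place after a rule fires are all assumptions made for illustration and are not details given in the summary.

```python
def assign_signature(word):
    """Assign a hypothetical signature to a word."""
    if word.isdigit():
        return "NUM"
    if word[0].isupper():
        return "CAP"
    return "LOW"


# Each rule name is the concatenation of its signatures; the value is the concept.
RULES = {"CAP_CAP": "PERSON", "NUM_LOW": "QUANTITY"}


def extract(words, window_size=2):
    """Slide a window over the signatures and replace matched spans with concepts."""
    if window_size < 2:
        raise ValueError("this sketch assumes a window of at least two signatures")
    signatures = [assign_signature(word) for word in words]
    position = 0
    while position + window_size <= len(signatures):
        # Read the signatures viewable in the window and build a rule name.
        name = "_".join(signatures[position:position + window_size])
        concept = RULES.get(name)
        if concept is not None:
            # Assign the concept back into the window in place of the signatures.
            signatures[position:position + window_size] = [concept]
        else:
            # Move the window to reveal a signature not previously viewable.
            position += 1
    return signatures
```

For the words "John Smith paid 42 dollars", this sketch produces the sequence PERSON, LOW, QUANTITY. Because a created concept is written back into the window, a later rule whose name includes that concept could match it on a subsequent pass.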