The present exemplary embodiment relates generally to document processing. It finds particular application in conjunction with a system and a method for extracting facts from documents, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiment is also amenable to other like applications.
Information retrieval (IR) systems using databases or the Internet are widely used for retrieving documents associated with a query. The extensible markup language (XML) format imposes stringent document formatting requirements and is a widely accepted format for exchange of documents. Some queries are generally expressed by a small set of fixed words or expressions and thus are easy to detect. For example, they may be detected automatically, using a simple keyword search or a set of regular expressions. Keyword searching alone, however, is less effective for concepts conveyed by a wide range of linguistic expressions. Additionally, while conventional IR-based search engines help to locate documents which might contain the information needed, they do not provide a way to extract this information in order to process it further or to answer user queries directly.
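The following minimal sketch illustrates detection with a set of regular expressions, and why such detection misses a concept that is worded differently. The concept ("paper jam"), the pattern, and the function name are hypothetical and chosen only for illustration; they are not part of any described system.

```python
import re

# Hypothetical fixed expressions for a single concept ("paper jam").
PAPER_JAM = re.compile(r"\bpaper\s+jams?\b|\bjammed\s+paper\b", re.IGNORECASE)


def mentions_paper_jam(text: str) -> bool:
    """Return True if the text contains one of the fixed expressions."""
    return PAPER_JAM.search(text) is not None


documents = [
    "Clear the paper jam before printing.",
    # Same concept, different wording: the fixed expressions do not match.
    "Remove the sheet that is stuck in the rear tray.",
]

hits = [doc for doc in documents if mentions_paper_jam(doc)]
```

Only the first document is retrieved, although the second arguably conveys the same concept. This is the limitation of keyword searching noted above.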
The ability to perform information retrieval has recently been incorporated into portable devices. Mobile users often seek precise information quickly. While it is often acceptable to refine Internet queries or fill in user profiles when operating a PC with a keyboard, such lengthy procedures are undesirable when using a mobile device.
Information Extraction (IE) processes seek to extract and store information in a formal representation (e.g., in the form of relations in databases, such as in relational databases or XML databases) in order to allow efficient querying and easy processing of the extracted data. Information stored and queried in a canonical way can be processed and interpreted by a computer without human interaction. It can also be used to build ontologies, create knowledge bases, and perform data analysis. The area of IE comprises techniques, algorithms, and methods for performing two tasks: grasping the desired data, and storing it in an appropriate form for future use.
Fact extraction can be regarded as a subset of IE, focusing on technologies for extracting facts from documents rather than on storing this information in databases. The goals of fact extraction, however, are typically more specific: fact extraction is defined as the transformation of facts expressed in natural language into a given, formal, properly defined target structure. In classical information extraction, by contrast, the emphasis is placed mainly on the text processing stage, and the target representation is only a secondary element.
Although fact extraction has been widely discussed, few concrete solutions have been proposed. Existing solutions rely on simple pattern matching techniques, sometimes enriched with taxonomies. Commercial fact extraction systems extract a limited number of entity types (usually person names, locations, organizations, and dates) and identify simple links between these entities. For example, such systems can extract from a sentence the facts that answer the following question: Who led X Corporation and when? The general method consists first of matching person names to “who” and time expressions to “when,” and then of writing a regular-expression-like pattern which states that, in order to contain this information (who led X and when?), a sentence should contain a person name and a time expression, in this order, separated by “lead” or one of its synonyms.
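The pattern-matching method just described can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation of any commercial system: the person-name and time-expression patterns are deliberately crude stand-ins for real entity recognizers, and the synonym list and function name are hypothetical.

```python
import re
from typing import Optional, Tuple

# Toy stand-ins for entity recognizers (assumptions for illustration only).
PERSON = r"(?P<who>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"      # e.g. "Jane Doe"
LEAD = r"(?:led|headed|ran|managed)"                    # "lead" and synonyms
WHEN = r"(?P<when>from \d{4} to \d{4}|in \d{4}|since \d{4})"


def extract_leadership(sentence: str, company: str) -> Optional[Tuple[str, str]]:
    """Return (who, when) if the sentence has the form
    <person> <lead-synonym> <company> <time expression>, else None."""
    pattern = re.compile(
        rf"{PERSON}\s+{LEAD}\s+{re.escape(company)}\s+{WHEN}"
    )
    match = pattern.search(sentence)
    if match is None:
        return None
    return match.group("who"), match.group("when")


found = extract_leadership(
    "Jane Doe led X Corporation from 1998 to 2004.", "X Corporation"
)
missed = extract_leadership(
    "X Corporation flourished under Jane Doe between 1998 and 2004.",
    "X Corporation",
)
```

Here `found` holds the person name and the time expression, while `missed` is `None` even though the second sentence states essentially the same fact. The pattern captures only the surface order of the entities and none of the semantics of the link between them.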
A variety of methods and tools are available for information and fact extraction. Statistical and linguistic techniques are used, as well as those from the field of artificial intelligence. For example, statistical approaches such as the naive Bayes approach, Hidden Markov Models, and machine learning techniques in general have been proposed. The types of facts such systems extract are rather limited. Additionally, they often miss facts because they do not use sophisticated linguistic tools and are not able to capture the semantics of the links between entities. For example, “when” questions can often be answered only by a time expression, such as a date. In a technical context, being able to answer a “when” question also requires being able to extract more complex facts, such as events.
As an example, when searching technical printer documentation for when to use the “resume” button, the system should be able to extract the event “after you have corrected an ‘out of paper’ condition or cleared a paper jam” in order to extract the following sentence:
Press the resume button to restart printing after you have corrected an “out of paper” condition or cleared a paper jam.
The variety of expressions which can be used for the same general concept can result in the retrieval of text which is not responsive to the actual query.
The lack of a powerful linguistic engine for performing the fact extraction task is only one part of the problem. It is just as important to provide users with an easy way to express the type of facts for which they are looking. Because users are looking for different facts depending on the task, fact extraction is not a static process. An efficient user interface is thus a complement to the linguistic technology. Most information extraction systems require users to learn a high-level language, such as regular expressions, to describe the type of facts they want to extract from documents. This type of solution, however, works only for specialists and not for the general public. Additionally, for users of mobile devices, such a system may be unwieldy even for trained users.