Computerized document creation systems and the rapid growth of the Internet have led to an explosion in the number of documents of all types (e.g., text files and web pages). Internet search engines, such as Google™, have responded to the need to search through immense document sets by offering basic search tools for finding topically focused sets of documents. It is possible to create and refine searches using, for example, Boolean combinations of keywords, that is, keywords joined by Boolean operators such as “AND,” “OR,” and “NOT” that specify relationships between the keywords. Advanced approaches for refining searches include, for example, whole text matching or user profiling to tailor results to the kinds of documents the user has sought before.
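As an illustration of the Boolean keyword combinations described above, the following is a minimal sketch in Python. It is not the method of any particular search engine; the function names (`tokenize`, `matches`, `boolean_search`), the parameter names, and the sample documents are all hypothetical and chosen only for this example.

```python
import re

def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if the text satisfies the AND / OR / NOT keyword constraints."""
    words = tokenize(text)
    # AND: every keyword in all_of must be present.
    if not all(k.lower() in words for k in all_of):
        return False
    # OR: if any_of is given, at least one of its keywords must be present.
    if any_of and not any(k.lower() in words for k in any_of):
        return False
    # NOT: no keyword in none_of may be present.
    if any(k.lower() in words for k in none_of):
        return False
    return True

def boolean_search(docs, **query):
    """Return the names of the documents in docs that match the query."""
    return [name for name, text in docs.items() if matches(text, **query)]

# Hypothetical sample documents, for illustration only.
sample_docs = {
    "a": "Anthrax vaccine clinical trial results",
    "b": "Anthrax outbreak in cattle",
    "c": "Clinical trial of a flu vaccine",
}
hits = boolean_search(sample_docs, all_of=["anthrax", "trial"])
```

Even in this toy form, narrowing a result set means adding more keywords to the query, which is the source of the long search strings discussed later in this section.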
Regardless of search sophistication, users often must wade through an unmanageable number of documents and examine the documents one by one to determine, for example, the most relevant documents. Furthermore, the ongoing, enormous growth in the number of available documents seems to ensure that, even with future advances in search capabilities, users will continue to receive large result sets of relevant documents.
There is currently no intuitive, easy-to-use tool that helps an ordinary user do all of the following tasks on a topically focused set of documents: (1) analyze the entire set for its informational content; (2) using these analyses and the user's own domain knowledge, build an intuitive, visual model of the concepts in the document set; (3) use the model to drive extraction and location of those concepts in the documents; (4) aggregate and process the extracted information; (5) export the model, the data, and reports conveniently, for sharing with other interested parties, who can load the model and data on their own computers; and (6) iterate easily and intuitively over all of these steps.
Researchers in technical fields may have access to hundreds of thousands of electronic versions of research papers, making research increasingly complex and fast-paced. For example, the National Library of Medicine provides access to more than 14 million citations in the field of biomedical research. Frequently, a researcher faced with a large set of documents or search results needs to refine the search to retrieve a smaller set of more relevant information. However, especially for complex research projects, these types of searches are difficult to create and manipulate because of the length of the search text required. Furthermore, iterative searching of this nature can be quite time-consuming. Additionally, information retrieved from these searches is not easily viewed, saved, or shared among multiple users.
For example, a researcher performing a PubMed® search for articles related to a clinical trial for anthrax might enter the following search terms into the search engine: “clinical trial AND anthrax AND test.” This search might return more than 100,000 documents, typically displayed as textual fragments with links to the actual documents spread over thousands of web pages. The researcher will have great difficulty navigating through the thousands of web pages to find a smaller number of documents, and will have even greater difficulty reading each document one by one to extract information. If the researcher tries to refine the search to retrieve a smaller, more relevant set of documents, the researcher must return to the original search and modify the terms used. Ultimately, the researcher may end up with an unmanageable search string containing twenty or more words.
Having received a list of documents that result from a search, most researchers are left with the tedious task of scanning through the list to see if any of the documents are really relevant to their needs. Those documents that look relevant must be opened and scanned to see what is in them. Further, it is difficult to share the results of an iterative search with others, because the researcher cannot easily save a copy of each set of the search terms or a copy of the extracted information using conventional search tools. Moreover, a document set may contain aggregate information that is not contained completely in any single document, so that a user may not want to reduce the document set to a size small enough to read in full. Accordingly, there exists a need for a tool to create persistent models of information that may be easily manipulated, refined, saved, and shared, where these models provide an intuitive, visual aid to help the user define the concepts of interest, define extractors associated with the concepts, launch extraction of those concepts, and analyze, aggregate, and output the extracted information.
To extract information is to remove it from its original, natural language format. Currently available desktop applications for extraction perform single-purpose tasks, such as excerption or summarization, but are limited in their usefulness and do not provide a user with much flexibility in configuring them. Typical heavyweight or enterprise-scale extraction systems allow an expert to design customized functions for excerpting, summarizing, and presenting information from a class of documents. Trained experts may, for example, build extractors that arrange extracted text fragments in a table format for viewing, or fill templates that represent various multi-component concepts requested by an ordinary user of the system. Currently available tools may require a specially prepared set of training documents to define a concept taxonomy that can be used to categorize large sets of documents similar to the training documents. Current tools may also locate and highlight entities that belong to predefined categories (e.g., personal names, company names, geographical names), and allow experts to define extractors to identify specific text patterns.
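To make the notion of an extractor that identifies a specific text pattern and arranges the extracted fragments in a table more concrete, the following is a minimal sketch in Python. It does not reproduce any existing extraction system; the function names (`extract_fragments`, `format_table`), the `context` parameter, the dosage pattern, and the sample documents are hypothetical and serve only as an illustration.

```python
import re

def extract_fragments(docs, pattern, context=20):
    """Find every match of a text pattern and return one table row per match.

    Each row is (document name, matched text, surrounding context).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    rows = []
    for name, text in docs.items():
        for m in regex.finditer(text):
            # Keep a few characters on each side of the match as context.
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            rows.append((name, m.group(0), text[start:end].strip()))
    return rows

def format_table(rows):
    """Render the extracted rows as a tab-separated table with a header line."""
    lines = ["document\tmatch\tcontext"]
    lines += ["\t".join(row) for row in rows]
    return "\n".join(lines)

# Hypothetical sample documents and pattern (a dose expressed in milligrams).
sample_docs = {
    "d1": "Patients received 500 mg daily.",
    "d2": "No dose reported.",
}
table = format_table(extract_fragments(sample_docs, r"\d+\s*mg"))
```

In a sketch like this, the pattern must be written by someone comfortable with regular expressions, which reflects the dependence on trained experts noted above.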
One disadvantage of current enterprise-scale extraction systems, such as InXight's FactFinder™ editor (www.inxight.com), is that they do not allow an ordinary user, i.e., someone not specially trained to customize the system, to create a persistent or portable model of information that mirrors that individual's mental model of a subject. Another disadvantage of some commercial tools is that, although they may locate specific information in texts and highlight it, the highlighted information is often presented in an unmanageable format. For example, if a user starts with 6,000 documents, the extraction tool may present all 6,000 documents with the requested concepts highlighted or colorized. Even though the concepts may be highlighted in the texts, the sheer number of documents is still unmanageable for a typical user. Yet another disadvantage of current enterprise-scale systems is that they are costly to purchase and manage because they require trained experts to run them. Because they are so expensive, such extraction systems are only justified for large groups of similar users who are interested in the same kinds of information (e.g., a group of intelligence analysts).
Accordingly, there is a need for a lightweight tool that enables a user to model, extract, and aggregate information contained in any topically focused document set, such as a document set that results from an Internet search using specific keywords. Since no two persons have the same mental model of a subject area, a tool is needed that allows a user to design an individual model of information and to iteratively extract information from the documents, analyze it, and present the extracted information in ways that reflect the user's own conceptualization and organization of the information.