Large organizations such as pharmaceutical companies and healthcare organizations have a massive amount of information available to them. This may include, for example, ongoing and historical clinical trials and studies, treatment guidelines, patient information, patents, research documents, external research literature, news articles, as well as information on the web. Most of this information is in the form of unstructured or semi-structured text (e.g. XML). The sheer volume makes it impractical to read all of this text, even with the help of a search engine to narrow down the set of relevant documents.
Conventional systems do not provide results directly from the semi-structured or unstructured text in a format that can be used directly for decision making. Search engines do not provide any structure other than the structure in the original document. Information extraction systems do not use an index, so they cannot provide fast interactive querying, nor do they allow a flexible mix of constraints based on linguistic constructions and the structure of the document.
Text mining converts unstructured or semi-structured text within a set of documents into structured facts and relationships. Some examples of text mining techniques are described in commonly-assigned U.S. patent application Ser. No. 12/133,205, entitled “Extracting and Displaying Compact and Sorted Results from Queries over Unstructured or Semi-Structured Text,” filed on Jun. 4, 2008, now U.S. Publication Number US-2008-0301129-A1, which is hereby incorporated by reference herein in its entirety. As described in this patent application, text mining techniques may generate and utilize a text mining index that provides an efficient representation of the content of a set of documents. For each source document in the set, a text mining index may, for example, encode the regions (e.g., Abstract, Acknowledgements, Authors, Body, Figures, Figure Text, Paragraphs, Tables, Table Row, References, Keywords, Title, etc.) of the source document; the text of the source document; and the start and end positions of linguistic units, such as sentences, noun groups, verb groups, etc. The index may also identify the concepts (e.g. breast cancer) that are present in a document, and their positions within the document, regardless of whether these concepts are referred to by the standard or preferred name (e.g. breast cancer) or by a synonym (e.g. breast carcinoma, breast neoplasm, cancer of the breast, etc.). In addition, the index may also identify broader classes, such as, e.g., people, companies, amounts, temporal expressions, etc., that are present in the document and their positions within the document.
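By way of illustration only, the kind of per-document information such an index may encode can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the index format of the referenced application: the names `Span`, `DocumentIndex`, `SYNONYMS`, and `index_concepts` are hypothetical, positions are assumed to be character offsets, and concepts are located by simple case-insensitive substring matching rather than by any particular linguistic processing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """A position in the document text (assumed: character offsets)."""
    start: int  # inclusive
    end: int    # exclusive


@dataclass
class DocumentIndex:
    """Hypothetical per-document index entry."""
    text: str
    # Region name (e.g. "Title", "Abstract") -> positions of that region.
    regions: Dict[str, List[Span]] = field(default_factory=dict)
    # Linguistic unit type (e.g. "sentence", "noun group") -> positions.
    units: Dict[str, List[Span]] = field(default_factory=dict)
    # Preferred concept name -> positions of every mention, whether the
    # mention uses the preferred name or a synonym.
    concepts: Dict[str, List[Span]] = field(default_factory=dict)


# Hypothetical synonym table: preferred name -> all names that refer to it.
SYNONYMS = {
    "breast cancer": [
        "breast cancer",
        "breast carcinoma",
        "breast neoplasm",
        "cancer of the breast",
    ],
}


def index_concepts(doc: DocumentIndex, synonyms: Dict[str, List[str]]) -> None:
    """Record where each concept occurs, keyed by its preferred name.

    Assumes lowercasing does not change the text length (true for ASCII).
    """
    lowered = doc.text.lower()
    for preferred, names in synonyms.items():
        spans = []
        for name in names:
            pos = lowered.find(name)
            while pos != -1:
                spans.append(Span(pos, pos + len(name)))
                pos = lowered.find(name, pos + 1)
        doc.concepts[preferred] = sorted(spans, key=lambda s: s.start)


if __name__ == "__main__":
    text = "Breast carcinoma is common. Cancer of the breast was studied."
    doc = DocumentIndex(text=text)
    doc.units["sentence"] = [Span(0, 27), Span(28, len(text))]
    index_concepts(doc, SYNONYMS)
    for span in doc.concepts["breast cancer"]:
        print(span, repr(text[span.start:span.end]))
```

Keying every mention by the preferred name means that a query for a concept can retrieve all of its mentions in one lookup, without the query itself having to enumerate the synonyms.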
In some examples of text mining, such as those described in the above patent application, a graphical user interface (“GUI”) allows users to create and run their own text-mining queries against a text mining index. When these queries are run on small document sets (i.e., against a small text mining index), this gives the user an interactive experience, similar to a search engine, with users getting all results back within a few seconds in a compact and sorted manner. However, text mining is a more computationally expensive process than search and, when querying large document sets, the user may need to wait some time to get all results back. The user may not need to see all results in order to judge their value, just as a search engine typically delivers results one page at a time. Furthermore, text mining querying is typically an iterative process, with users creating a query, seeing results, then refining the query to get better quality answers. Any waiting during this process increases the cycle time for each iteration, and this can have a large effect on the time that elapses between forming an initial query intention and getting back high quality answers to that query.
The foregoing examples of some existing problems with text mining are intended to be illustrative and not exclusive. Other limitations will become apparent to those of skill in the art upon a reading of the Detailed Description below.