1. Field of the Invention
The present invention generally relates to a method and system for analyzing text documents and, more specifically, to a method and system for automatically identifying relationships between text documents and structured variables pertaining to those text documents.
2. Description of the Related Art
Unstructured free-form text documents are commonly analyzed to discover interesting correlations between structured variables (e.g., a time interval) and categories of text documents (e.g., text documents in which a particular keyword occurs). For instance, if the text documents include “problem tickets” in a helpdesk log from a computer support center, the text might be analyzed to discover correlations between a particular month and all text documents containing the keyword “computer model XXX”.
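By way of illustration only, the helpdesk example above can be sketched as a count of keyword occurrences per month. The ticket data and the function name keyword_counts_by_month are hypothetical and are not part of any method described herein.

```python
from collections import Counter

# Hypothetical problem tickets as (month, free-form text); invented for illustration.
tickets = [
    ("2001-01", "Computer model XXX will not boot"),
    ("2001-01", "Password reset request"),
    ("2001-02", "Computer model XXX overheats"),
    ("2001-02", "Screen flicker on computer model XXX"),
    ("2001-02", "Printer out of toner"),
]


def keyword_counts_by_month(tickets, keyword):
    """Return {month: (documents containing keyword, total documents)}."""
    totals = Counter()
    hits = Counter()
    needle = keyword.lower()
    for month, text in tickets:
        totals[month] += 1
        if needle in text.lower():  # case-insensitive substring match
            hits[month] += 1
    return {month: (hits[month], totals[month]) for month in totals}


counts = keyword_counts_by_month(tickets, "computer model XXX")
```

Such a table pairs a structured variable (the month) with a category of text documents (those containing the keyword), and is the kind of raw material from which a correlation might be sought.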
However, conventional methods of analyzing text documents typically do not automatically identify interesting relationships between the text documents and structured variables. Instead, words and phrases which frequently occur in the text documents are plotted on a graph and users are required to determine for themselves whether an interesting relationship exists. Of course, this is a labor-intensive and time-consuming process.
One conventional method for analyzing text documents is disclosed in U.S. Pat. No. 5,371,673 to Fan, incorporated herein by reference. The Fan method is intended to sort and score text in order to determine public opinion for specified positions on a specified issue based on information available to the public. The method requires a computer, printer and modem and uses information in the Associated Press (AP) wire service to determine expected public opinion. The method first gathers relevant AP stories. The issue (e.g., “should defense spending be increased, kept the same or decreased?”) and positions (e.g., “it should be increased”) are defined. The user then enters a search command (e.g., DEFENSE or MILITARY or ARMS) to cause the computer to use the modem to search remote databases (e.g., Nexis®) for stories relevant to the issue, retrieve the stories and store them on disk. The computer then edits extraneous characters out of the text.
A set of numerical scores is then generated. The text is “filtered” in a series of steps to remove irrelevant text and “scored” using a text analysis dictionary, a set of text transformation rules and text scoring rules.
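A greatly simplified sketch of dictionary-based filtering and scoring is given below for illustration. It is not the Fan implementation: the word lists TOPIC_WORDS and POSITION_WORDS, the paragraph-level filter and the function score_story are all assumptions made for this example, and the transformation rules of the Fan method are not modeled.

```python
import re

# Hypothetical text analysis dictionary; not taken from the Fan patent.
TOPIC_WORDS = {"defense", "military", "arms"}
POSITION_WORDS = {
    "increase": "more",
    "boost": "more",
    "cut": "less",
    "reduce": "less",
}


def score_story(text):
    """Drop off-topic paragraphs, then count position words per position."""
    scores = {"more": 0, "less": 0}
    for paragraph in text.split("\n\n"):
        words = re.findall(r"[a-z]+", paragraph.lower())
        if not TOPIC_WORDS.intersection(words):
            continue  # filtering step: paragraph never mentions the issue
        for word in words:
            position = POSITION_WORDS.get(word)
            if position is not None:
                scores[position] += 1  # scoring step
    return scores


story = (
    "Senators urged Congress to increase defense spending.\n\n"
    "Separately, the city will cut bus service.\n\n"
    "Critics want to reduce the military budget."
)
scores = score_story(story)
```

In this sketch the second paragraph is discarded by the filter, so its word “cut” does not contribute to any score.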
Lastly, public opinion is computed. Here, the data is stored in an array which is chronologically sorted by the computer, from the earliest story to the latest. The user then enters results of actual public opinion polls which are stored in an “opinion array” which has as its elements the time of the poll and the subpopulation holding a certain position (e.g., “defense spending should be increased”). The computer then refines this data and applies a set of population conversion rules to calculate public opinion as a time trend.
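The two data structures described above may be pictured, purely as an assumed layout, as lists of dated records. The dates, scores and poll fractions below are invented, and the refinement and population conversion rules of the Fan method are not reproduced.

```python
from datetime import date

# Hypothetical scored stories: (publication date, score for "should be increased").
scored_stories = [
    (date(1990, 3, 1), 2),
    (date(1990, 1, 15), 5),
    (date(1990, 2, 10), 1),
]

# Hypothetical "opinion array": (time of poll, fraction holding the position).
opinion_array = [
    (date(1990, 2, 1), 0.40),
    (date(1990, 1, 1), 0.35),
]

# Chronological sort, from the earliest record to the latest.
stories_sorted = sorted(scored_stories, key=lambda record: record[0])
polls_sorted = sorted(opinion_array, key=lambda record: record[0])
```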
Thus, the Fan method generates structured data (e.g., public opinion) from unstructured data (e.g., AP text stories). However, the method does not automatically determine interesting relationships (for example, statistically significant relationships) between the AP text stories and public opinion. In other words, the Fan method, like other conventional methods, does not automatically identify interesting relationships between text documents and structured variables pertaining to the text documents.
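To make the notion of a statistically significant relationship concrete, one familiar possibility is a Pearson chi-square test on a 2x2 table of document counts. The choice of test, the function chi_square_2x2 and the counts used are assumptions for illustration only; the text above does not specify any particular statistic.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one document")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts. Rows: documents in a given month / in all other months.
# Columns: keyword present / keyword absent.
statistic, p_value = chi_square_2x2(30, 70, 20, 180)
```

A small p-value would suggest that the keyword occurs in the given month more often than chance alone would explain, which is the kind of relationship a user of a conventional method must otherwise find by inspection.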