There exists an astronomical amount of useful and valuable information and knowledge embodied in the text of documents. The documents can exist on numerous media and can take a variety of forms. Documents can take the form of books, articles, academic papers, professional journals, editorial writings, news stories, legal briefs, manuals, and so on. These documents exist as hard paper copies, computer files, and in electronic form on the internet. The documents are typically written in unstructured text, i.e. free-form format of complete grammatical sentences, written in any language, but which are not classified in any systematic way.
The sheer number of documents in the prior art, and corresponding body of knowledge contained therein, make it necessary to use computer-based analysis tools to search, analyze, understand, and make use of the material contained in the documents. With the primary concern being the study of human communications, and how people exchange information and assign meaning to that information, the tools are useful in areas such as searching, knowledge management, competitive intelligence, data mining, data mapping, data management, information retrieval, and intellectual asset management. For example, a number of documents can be stored on a computer system, or a plurality of computer systems interconnected by a communication network such as the internet. With a key word search term or query, a document search tool can examine the documents on the computer system(s) looking for matches or hits to the key word search term. The documents matching the key word search term are identified and retrieved for further study. The search term can include several key words interrelated with Boolean logic to find the relevant documents and retrieve the desired information.
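The key word search described above can be illustrated with a minimal sketch. It assumes the documents are held in memory as a mapping from an identifier to its text; the function name `boolean_search` and its parameters `all_of`, `any_of`, and `none_of` are hypothetical names chosen for this illustration and do not refer to any particular search tool.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(documents, all_of=(), any_of=(), none_of=()):
    """Return ids of documents matching a simple Boolean key word query.

    all_of  -> every term must appear (AND)
    any_of  -> at least one term must appear (OR); ignored when empty
    none_of -> no term may appear (NOT)
    """
    all_of = [t.lower() for t in all_of]
    any_of = [t.lower() for t in any_of]
    none_of = [t.lower() for t in none_of]
    hits = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        if not all(term in words for term in all_of):
            continue
        if any_of and not any(term in words for term in any_of):
            continue
        if any(term in words for term in none_of):
            continue
        hits.append(doc_id)
    return hits
```

A query such as `boolean_search(docs, all_of=["search"], none_of=["clustering"])` would return the identifiers of documents that contain "search" but not "clustering". A production tool would typically use an inverted index rather than scanning every document.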
In the prior art, there are other tools that can be used depending on the analysis to be performed on the documents. There may be a need to analyze a group of documents to understand how each relates to the others, with no specific key word or query in mind, or the analysis may involve evaluating how historical documents change over time. The analysis centers on how the text of one document relates to the text of other document(s).
One prior art document analysis tool uses a vector space model. Vector space analysis involves forming a first vector for a word or group of words in a first document, and a second vector for a word or group of words in a second document, as defined by their index terms. The index terms can be indices such as frequency of occurrence or binary variables. The first and second vectors can be weighted according to the relative importance of the words or groups of words, e.g. with term frequency/inverse document frequency factor (TF/IDF) weighting. A vector correlation is computed as the cosine of the angle between the first and second vectors to determine a distance or proximity between the vectors. Vector space analysis provides a measure of the closeness or importance between two or more documents. In other words, the greater the vector correlation over a large number of vector sets in high-dimensional space, the more similarities exist between the documents.
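The vector space comparison can be sketched as follows. This is a minimal illustration under stated assumptions: documents are split on whitespace, each vector is a sparse mapping from term to weight, and the IDF factor uses a smoothed form, log((1 + N) / (1 + df)) + 1, which is only one of several weighting variants in use. The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF/IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Assumed smoothed IDF; other weighting schemes exist.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With these weights, two documents that share no terms have a cosine of zero, and the cosine approaches one as their weighted term profiles converge.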
Another document analysis tool uses a classification methodology known as clustering. In clustering, words are grouped together or clustered into an inclusion hierarchy based on similar use within the documents. A distance is calculated, e.g. squared Euclidean distance, between clusters in the document. The distance between the clusters provides a measure of the closeness or similarity between the documents. The clustering methodology has a number of well-known variations including agglomerative clustering, divisive clustering, and partitioning of clusters.
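The agglomerative variation can be sketched as follows. The sketch assumes each word is represented by a vector of its counts across the documents, and that clusters are compared by the squared Euclidean distance between their centroids. Both choices are assumptions made for illustration; other representations and linkage rules are equally possible. The function names are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain.

    points maps a word to its usage vector. Returns the final clusters
    and the merge history, which records the inclusion hierarchy.
    """
    clusters = [[label] for label in points]
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([points[w] for w in clusters[i]])
                cj = centroid([points[w] for w in clusters[j]])
                dist = squared_euclidean(ci, cj)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), dist))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges
```

Reading the merge history from first to last gives the hierarchy from the bottom up: the earliest merges join the words with the most similar usage.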
Yet another document analysis tool uses a network model that takes into account connections between words. A window model is one type of network model that assumes that the author puts related words in close proximity to one another within the text. The text is preprocessed for stemming, i.e. reducing words to their base or root term by removing prefixes and suffixes, and to remove stop words, i.e. words that are used too frequently to provide any meaningful distinction or discrimination. A moving window of size S words, e.g. S=3 to 5 words, is slid across the text in discrete steps, one or more words at a time. As the window moves across the text, a link is established each time the same word is found in other window(s). The process involves the steps of moving the window, checking for links, moving the window, checking for links, and so on. A network of nodes and links is created to represent the text. The importance or degree of any one node is determined by the number of links coming into that node.
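The window model can be sketched as follows. Several details here are assumptions made for illustration: the stop word list is a tiny hypothetical sample, the stemmer only strips a few English suffixes, and a link is taken to join two distinct words whenever they fall within the same window, which is one common reading of the window model and may differ from the exact link rule of any given tool. The degree of a node is computed as the total weight of the links touching it.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately small stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}

def naive_stem(word):
    """Crude suffix stripping; a real stemmer would be more careful."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

def window_network(tokens, size=3, step=1):
    """Slide a window over the tokens and count links between word pairs."""
    links = Counter()
    last_start = max(len(tokens) - size + 1, 1)
    for start in range(0, last_start, step):
        window = set(tokens[start:start + size])
        for a, b in combinations(sorted(window), 2):
            links[(a, b)] += 1
    return links

def node_degrees(links):
    """Sum the link weights touching each node."""
    degree = Counter()
    for (a, b), weight in links.items():
        degree[a] += weight
        degree[b] += weight
    return degree
```

Calling `node_degrees(window_network(preprocess(text), size=3))` yields a ranking of words by how heavily they are linked, with the window size and step playing the roles of S and the discrete step described above.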
An important limitation of existing document analysis tools is that each fails to take into account the entire structure of the document or the word sense. Most are based on relative distance comparisons, counts, or weighted counts of words that appear in the text. There is no single objective way to determine the cutoff at which a measure of vector proximity indicates relevance or closeness between documents. Finally, existing document analysis tools do not model the intent of the author to convey a specific, coherent meaning or message. Failing to consider the underlying intent of the document forgoes valuable information embedded within the text.