Information retrieval methods exist for analyzing documents in a document repository and retrieving specific documents that are responsive to a query. For example, a simple information retrieval technique is word or term searching. Term searching involves a user querying a corpus of documents containing information for a specific term or word. A resultant solution set of documents is then identified as containing the search term. Single term searching is extremely fast and efficient because little computational effort is involved in the query process, but it can result in a relatively large solution set of unranked documents being returned. Alternatively, the results may be ranked based on a weighted sum of matching words. Many of the documents may not be relevant to the user because, although the term occurs in each solution set document, the term is out of context with the user's intended meaning. Further, even if all of the resultant documents use the search term in the proper context, the user has no way of judging which documents in the solution set are more relevant than others. Thus, the user often must perform a secondary search for a more relevant solution set of documents. Additionally, a user must have some knowledge of the subject matter of the search topic for the search to be efficient (e.g., a less relevant word in a query could produce a solution set of less relevant documents and cause more relevant documents in the solution set to be ranked lower than less relevant documents).
A logical improvement to single term searching is multiple, simultaneous term searching using “Boolean term operators,” or simply, Boolean operators. A Boolean retrieval query passes through a corpus of documents by linking search terms together with Boolean operators such as AND, OR and NOT. The solution set of documents is generally smaller than the set that would result from single term searches, and all returned documents are typically ranked equally with respect to relevance. The Boolean method of term searching is a widely used information retrieval process, and it is often used in Internet search engines because it is fast, uncomplicated and easy to implement in a remote online environment. However, the Boolean search method carries with it many of the shortcomings of term searching. The user must have some knowledge of the search topic for the search to be efficient; otherwise, relevant documents may be treated as non-relevant and vice versa. Furthermore, since the returned documents are not ranked, the user may be tempted to reduce the size of the solution set by including more Boolean linked search terms. However, increasing the number of search terms in the query narrows the scope of the search and thereby increases the risk that a relevant document is missed, and the documents that remain in the solution set are still ranked equally.
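By way of illustration only, Boolean term searching can be sketched with an inverted index that maps each term to the set of documents containing it. The function names, the whitespace tokenization, and the three sample service-log strings below are assumptions made for this sketch and are not taken from any particular system.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the ids of documents that contain every listed term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Return the ids of documents that contain at least one listed term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def boolean_not(index, all_ids, term):
    """Return the ids of documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical corpus used only for this sketch.
docs = {
    "log1": "printer spooler crash after driver update",
    "log2": "spooler restarted without error",
    "log3": "system crash during boot",
}
index = build_index(docs)

both = boolean_and(index, "crash", "spooler")      # crash AND spooler
either = boolean_or(index, "crash", "spooler")     # crash OR spooler
crash_not_spooler = boolean_and(index, "crash") & boolean_not(index, docs, "spooler")
```

Each result is an unordered set, which reflects the limitation noted above: every document in the solution set is treated as equally relevant.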
Other information retrieval processes have extended and refined the Boolean term searching method to attempt to rank the resultant documents. Such processes do not necessarily reduce the size of the solution set. One such method of ranking the documents is by term weighting the query terms and/or term weighting the occurrence of terms in the solution set of documents by frequency. Expanded term weighting operations make ranking of documents possible by assigning weights to search terms used in the query. Documents that are returned with higher ranking search terms are themselves ranked higher in relevance. However, a more useful variation of term weighting is by occurrence frequency in a resultant document database, thereby allowing the documents in the solution set to be ranked. A higher term occurrence frequency in a resultant document is indicative of relevance. Boolean searching variants that allow for term frequency based document ranking have a number of drawbacks, one of which is that longer documents have a higher probability of being ranked higher in the solution set without a corresponding increase in relevance. Additionally, because the occurrence of the term is not context related, higher ranking documents are not necessarily more relevant to the user. Moreover, if the user does not have an understanding of the subject matter being searched, the combination of Boolean logic and term weighting may exclude the most relevant documents from the solution set and under-rank the documents that are most relevant to the subject matter.
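The document-length drawback described above can be seen in a minimal sketch of ranking by raw term occurrence frequency. The scoring function, the tie-breaking rule, and the two sample documents are hypothetical and chosen only to make the effect visible.

```python
from collections import Counter

def tf_score(query_terms, text):
    """Sum the raw occurrence counts of the query terms in one document."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query_terms)

def rank_by_tf(query_terms, docs):
    """Rank matching documents by descending score, breaking ties by id."""
    scored = [(tf_score(query_terms, text), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

# Hypothetical documents: a short, on-topic log and a long log that
# repeats one query term in passing.
short_doc = "spooler crash on startup"
long_doc = " ".join(["unrelated maintenance note mentions crash"] * 3)
docs = {"short": short_doc, "long": long_doc}

ranking = rank_by_tf(["crash", "spooler"], docs)
```

In this sketch the long document outranks the short one solely because it repeats "crash" more often, even though only the short document mentions both query terms.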
Information retrieval methods have been devised that combine Boolean logic with other techniques such as content-based navigation, where shared terms from previously obtained documents are used to refine and expand the query. While each of the above described improvements provides some benefit to the user, the solution set of documents does not optimally convey the right information to a user if the user does not have an understanding of the subject matter being searched. Further, other ranking tools, such as word stemming and thesaurus techniques, are language dependent and thus have limited applicability.
An information retrieval method with rising popularity is the use of the vector space model, where documents and queries are represented in a high-dimensional space in which each dimension of the space corresponds to a word in the document collection. The most relevant documents for the query are expected to be those represented by the vectors closest to the query. Closeness may be calculated by looking at angles and choosing documents that enclose the smallest angle with the query vector. For example, the graph shown in FIG. 1 depicts a vector space with two dimensions corresponding to the words “crash” and “spooler.” Represented in the space are a query vector q 10, and three document vectors d1 12, d2 14, and d3 16 with the following coordinates. The coordinates, or term weights, of each vector are derived from occurrence counts within the documents. For example, crash may have only a passing reference in d1, but there may be several occurrences of spooler. Thus, in d1, spooler receives a higher weight than crash. Since document d2 has the smallest angle with q, it is the top-ranked document in response to the query crash spooler. Both terms (crash and spooler) are salient in d2 and therefore have high weights, and although the other two documents also mention both terms, in each of them one of the terms is not a centrally important term in the overall document.
Documents are ranked according to similarity with the query as measured by the cosine measurement, or the normalized correlation coefficient. This calculation computes how well the occurrence of a term correlates to both the query and the document. For example, if the cosine of a query vector q and a first document vector d1 is greater than the cosine of the query vector q and a second document vector d2, then the first document may be ranked higher than the second document because d1 is closer to the query vector than d2. For additional background relating to vector space modeling, see, e.g., Understanding Search Engines: Mathematical Modeling and Text Retrieval, Michael W. Berry, Murray Browne, Society for Industrial and Applied Mathematics (1999).
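As a rough sketch of cosine ranking in a two-dimensional space whose axes correspond to "crash" and "spooler," the example below ranks three document vectors against a query vector. The numeric weights are illustrative values chosen for this sketch; they are not the coordinates of FIG. 1, and the names cosine, query, and documents are likewise assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Axis order is (crash, spooler). All weights are hypothetical.
query = (0.71, 0.71)
documents = {
    "d1": (0.13, 0.99),  # spooler is salient, crash is a passing reference
    "d2": (0.80, 0.60),  # both terms are salient
    "d3": (0.99, 0.13),  # crash is salient, spooler is a passing reference
}

# A larger cosine means a smaller angle with the query, hence a higher rank.
ranked = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)
```

With these assumed weights, d2 encloses the smallest angle with the query and is ranked first, consistent with the case in which both query terms are salient in the same document.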
Latent Semantic Analysis (LSA) is a technique that helps to identify the contextual meaning of words or phrases in a document by using vector analysis to project queries and documents into a space which helps to infer semantic relations between terms. With LSA, a database is searched using a query, which produces result items. The result items are clustered into logical categories, and the result items within each category are ranked based on the frequency of the occurrence of relevant words in each of the result items. The frequency of the occurrence of words for a given result item may correspond to the number of unique terms that are contained in the given result item. Co-occurring terms are projected onto the same dimensions, while non-co-occurring terms are projected onto different dimensions. Thus, the product of a query vector and a document vector may be relatively high (in terms of cosine similarity) even if they do not share any terms, as long as their terms are semantically similar according to the co-occurrence analysis. Accordingly, LSA permits dimensionality reduction, thereby taking a set of objects that would exist in a high-dimensional space if traditional vector analysis were used, and representing them in a low-dimensional space.
LSA applies a well-known mathematical technique called Singular Value Decomposition (SVD) to a word-by-document matrix. SVD is a form of factor analysis or, more properly, the mathematical generalization of which factor analysis is a special case. In SVD, a rectangular matrix is expressed as the product of a number, typically three, of other matrices. The rectangular matrix is a two-dimensional matrix in which rows correspond to search terms, columns correspond to documents, and values correspond to numbers of occurrences of the terms within the documents. The component matrices may be multiplied together, for example after the number of retained dimensions has been reduced, in order to form a reconstructed matrix that predicts or identifies the expected number of terms or related terms in each document and thereby measures the contextual relation of search terms. See An Introduction to Latent Semantic Analysis, Thomas K. Landauer, Peter W. Foltz, Darrell Laham, published in Discourse Processes, 25, 259-284 (1998).
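For reference, the decomposition may be written in standard notation as follows. The symbols X, U, Σ, V, t, d, r and k are chosen here for illustration and do not appear in the cited reference in this exact form: X is the t-by-d word-by-document matrix, r is its rank, and k < r is the number of dimensions retained.

```latex
% Full decomposition of the t x d word-by-document matrix X of rank r
X = U \Sigma V^{\mathsf{T}}, \qquad
U \in \mathbb{R}^{t \times r},\;
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\;
V \in \mathbb{R}^{d \times r},\;
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Reduced (rank-k) reconstruction that keeps only the k largest singular values
X_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k < r
```

Under these assumptions, the columns of U and V are orthonormal, and X_k is the reconstructed matrix in which terms that co-occur across documents share dimensions.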
Thus, LSA provides a means for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. Information content is defined by more than just its mere terms but is also characterized by the context in which a specific term is given. There is an underlying latent semantic structure in word usage that is partially hidden or obscured by the variability of word choice. An important concept is that the similarity of meaning of terms to each other is based on the context in which that term appears and the context in which that term does not appear. LSA determines similarity of meaning of terms by context in which the terms are used and represents the terms and passages by statistical analysis of a large information corpus.
LSA represents the terms taken from an original corpus of information or any subset of terms contained in the original corpus, as individual points in a high dimensional semantic space. A semantic space is a mathematical representation of a large body of terms, and every term, combination of terms or sub-combinations of a term can be represented as a unique high-dimensional vector at a point in semantic space. The similarity of two terms can be understood by comparing the terms' vectors. Quantitative similarity values are defined by the cosine of the angle between the term vectors representing the terms in the semantic space. More importantly, this comparison is valid only in the defined semantic space for the corpus of information. LSA utilizes concepts rather than terms contained in the documents for analyzing the latent semantic structure of documents. Thus, LSA allows for information retrieval on the basis of concepts contained in the document with respect to other documents in the corpus rather than terms in the documents of the corpus.
Latent Semantic Indexing (LSI) is the application of LSA principles in a statistical information retrieval methodology. LSI focuses upon the relationship between documents and terms to register recurring patterns of semantic structure. Co-occurrences of terms are given higher weights than singular occurrences because the object of semantic indexing is to recognize which terms reappear most often in which parts of a document's structure. It has been demonstrated that there are a few constantly recurring structural patterns within sentences and these patterns occur at the level in which terms and their articles interrelate. The recurrence of these patterns is “latent” because they are inherent in grammatical structure and “semantic” because they define linguistic contextual meaning. In other words, through the pattern of co-occurrences of words, LSI is able to infer the structure of relationships between articles and words that define context and meaning by the relationship between structure and concept. For more information about LSI, see e.g., Foundations of Statistical Natural Language Processing, Christopher D. Manning, Hinrich Schutze, The MIT Press, 1999.
One problem with current methods of information retrieval as described above is that, even with traditional LSA or LSI methods, the searches often retrieve semantically uninteresting words. A need exists for a method to compare documents not simply based on word occurrences, but rather on a higher level of concept occurrence and co-occurrence, where a concept is defined by two or more correlated terms. It would then be advantageous to cluster together documents having more than one concept and compare and rank the documents according to the frequency of occurrences of particular terms within a concept.
An important application of information retrieval is to extract information from service logs written by product support staff. In an enterprise that manufactures complex and expensive machines and systems, service contracts are offered to customers. When machines or systems encounter problems, customers notify customer support staff who enter pertinent information about the problem into a service log. Field service engineers read these logs and update them with technical observations and remedies. These logs provide a source of information for discovering and tracking systemic product problems. Information retrieval methods are used because the information is in textual form. However, to date, methods of analysis of service logs for discovering product problems have been limited to manual analysis, which is time-consuming, expensive and prone to error.
The present disclosure describes attempts to solve one or more of the above-listed problems.