This invention pertains to text retrieval systems. More particularly, this invention relates to data mining systems and methods for discovering context groups and document categories by mining usage logs.
Text retrieval systems are concerned with locating documents relevant to a user's information needs from a collection of documents. The user describes these information needs using a query consisting of a number of words. Text retrieval systems compare the query with the documents in the collection and return the documents that are likely to satisfy the information needs. The intelligence of a text retrieval engine relates to its ability to determine how relevant a document is to a given query. Computation of the relevance largely depends on the words and operators used in a given query. Generally, the chances of retrieving a document are high if queries contain words that occur in the documents the user wishes to retrieve. However, the words users use to describe concepts in their queries are often different from the words authors use to describe the same concepts in the documents. It has been found through experiments that two people use the same term to describe an object less than 20% of the time. The problem that results from this inconsistent word usage is referred to as the word mismatch problem. The problem is more severe for short queries than for long, elaborate queries because, as queries get longer, there is more chance that some important words will occur in both the query and the relevant documents.
The word mismatch problem is often addressed by computing relationships between words in a given domain. In response to a query, the text retrieval system returns not only those documents that contain the words in the query, but also those documents that contain related words as given by the word relationships. Word relationships are usually computed by some form of corpus analysis and by using domain knowledge. For example, stemming, a popular method of computing word relationships, reduces words to common roots. As a result, documents containing morphological variants of the query words are also included in the response. Another popular approach to computing word relationships is to use manually or automatically generated thesauri. Manual thesauri are built using human experience and expertise and, hence, are costly to build. Previously, corpus analysis has been used to generate a thesaurus automatically from the content of documents. Automatic construction of thesauri is based on a well-known hypothesis called the "association hypothesis," which states that related terms in the corpus tend to co-occur in documents. A thesaurus built on the "association hypothesis" can capture word relationships that are specific to the corpus. However, generating a thesaurus automatically from the entire corpus is very expensive if the corpus contains millions of documents. Such a thesaurus is also hard to maintain if documents are frequently added to the corpus.
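The association hypothesis described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a simple document-level co-occurrence count; the function name `build_cooccurrence_thesaurus`, the `min_count` threshold, and the toy corpus are hypothetical and are not taken from any prior art system.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_thesaurus(documents, min_count=2):
    """Relate each term to the terms it shares at least `min_count` documents with."""
    pair_counts = Counter()
    for text in documents:
        # Count each term once per document, so a pair scores one per shared document.
        terms = sorted(set(text.lower().split()))
        pair_counts.update(combinations(terms, 2))

    thesaurus = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            thesaurus[a].add(b)
            thesaurus[b].add(a)
    return dict(thesaurus)


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "car automobile engine",
    "car automobile dealer",
    "engine repair manual",
    "flower garden",
]
thesaurus = build_cooccurrence_thesaurus(toy_corpus)
```

In this toy corpus only "car" and "automobile" share two documents, so they are the only pair related by the sketch. Because every document contributes all of its term pairs, the cost grows quickly with corpus size, which is consistent with the expense noted above.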
Recently, it has been shown that word relationships computed from the top n documents relevant to a query outperformed the thesaurus generated from the entire corpus. Relevance of a document to a given query is computed by the text retrieval system, solely based on the words that appear in the query and the document, and may not conform to the user's opinion of relevance.
Capturing relationships between keywords by automatic thesaurus construction has previously been studied extensively. Some of the earliest work in automatic thesaurus construction was based on term clustering. According to this work, related terms are grouped into clusters based upon co-occurrence of the terms in the corpus. Terms that co-occur with one or more query terms are used in query expansion. Previously, it has been proposed that expansion terms must co-occur with all query terms in order to achieve higher retrieval effectiveness. Techniques have been proposed that do not require analyzing all documents in the corpus for computing the thesaurus. Expansion terms are obtained by co-occurrence analysis of all query terms in a query Q and other terms from top-ranked documents (or all documents) in the result of Q.
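The query expansion technique described above, in which expansion terms must co-occur with all query terms in the top-ranked documents, might be sketched as follows. This is an illustration under stated assumptions: the function name `expand_query`, the parameters `top_n` and `max_terms`, the whitespace tokenization, and the sample result list are all hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_documents, top_n=3, max_terms=2):
    """Pick expansion terms that co-occur with ALL query terms in top-ranked documents."""
    query = set(term.lower() for term in query_terms)
    counts = Counter()
    for text in ranked_documents[:top_n]:
        terms = set(text.lower().split())
        # Only documents containing every query term contribute candidate terms.
        if query <= terms:
            counts.update(terms - query)
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]


# Hypothetical ranked result list for the query "jaguar car".
results = [
    "jaguar car dealer price",
    "jaguar car engine price",
    "jaguar cat habitat",
    "car rental price",
]
expansion = expand_query(["jaguar", "car"], results)
```

Here only the first two of the three top-ranked documents contain both query terms, so "price" (two occurrences) and "dealer" (one occurrence, first alphabetically) are selected. Only the top-ranked documents are scanned, so the full corpus never needs to be analyzed.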
Therefore, there exists a need to improve the completeness of information that is mined and a need to address the word mismatch problems incurred with prior art techniques by using historical information mined from usage logs. Usage logs record data about the queries submitted by users and the documents retrieved by users.
An apparatus and method are provided for discovering context groups and document categories by mining usage logs of users.
According to one aspect, an apparatus is provided for relating user queries and documents. The apparatus includes a client, a server, a communications pathway, and a database. The client is configured to enable a user to submit user queries to locate documents. The server has a data mining mechanism configured to receive the user queries and generate information retrieval sessions. The communications pathway extends between the client and the server. The database is provided in communication with the client and the server. The database stores data in the form of usage logs generated from the information retrieval sessions generated by a user at the client. The data mining mechanism includes a clustering algorithm operative to identify context groups and usage categories. The data mining mechanism is operative to identify query contexts associated with individual queries from the usage logs, partition the queries into context groups having similar contexts, and compute multiple context groups associated with specific query keywords from the usage logs.
According to another aspect, a method is provided for relating user queries and documents using usage logs from retrieval sessions from a text retrieval system. The method includes: identifying contexts associated with a user query comprising at least one specific query keyword; identifying user queries having similar query contexts; partitioning user queries into groups based upon similarity of the query contexts; merging the groups to compute multiple contexts associated with specific query keywords; and applying a clustering algorithm to identify similar query contexts based upon the query keywords to generate context groups that associate keywords with documents accessed by users.
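The method steps above can be sketched in simplified form. This sketch makes several assumptions that are not stated in the summary: a query context is taken to be the set of documents accessed during the retrieval session, similarity is measured with the Jaccard coefficient, and a greedy single-pass grouping stands in for the clustering algorithm. The names `group_sessions`, `contexts_by_keyword`, and `threshold`, and the sample usage log, are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_sessions(sessions, threshold=0.3):
    """Greedily partition (keywords, accessed_docs) sessions by context similarity."""
    groups = []  # each group holds the pooled keywords and documents of its sessions
    for keywords, docs in sessions:
        for group in groups:
            if jaccard(set(docs), group["docs"]) >= threshold:
                group["keywords"] |= set(keywords)
                group["docs"] |= set(docs)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append({"keywords": set(keywords), "docs": set(docs)})
    return groups


def contexts_by_keyword(groups):
    """Merge step: list, for each keyword, the indices of the groups it appears in."""
    index = defaultdict(list)
    for i, group in enumerate(groups):
        for keyword in sorted(group["keywords"]):
            index[keyword].append(i)
    return dict(index)


# Hypothetical usage log: (query keywords, documents accessed in the session).
usage_log = [
    (["jaguar", "car"], ["d1", "d2"]),
    (["jaguar", "dealer"], ["d2", "d3"]),
    (["jaguar", "cat"], ["d7", "d8"]),
    (["wildlife", "cat"], ["d8", "d9"]),
]
groups = group_sessions(usage_log)
keyword_contexts = contexts_by_keyword(groups)
```

With this sample log the sessions fall into two groups, and the keyword "jaguar" is associated with both of them, which illustrates how a single query keyword can carry multiple contexts, each tied to a different set of accessed documents.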