1. Field of the Invention
The present invention relates to the field of information retrieval (IR) systems and, more particularly, to an IR system, method, and computer program product that provide a kernel function capable of utilizing both language modeling approaches and vector space modeling (VSM) to optimize information sorting and retrieval.
2. Description of Related Art
Information retrieval (IR) has changed considerably in the past decades with the expansion of the World Wide Web (the Web) and the advent of modern and inexpensive graphical user interfaces and mass storage devices. The IR area is no longer limited to traditional applications such as indexing text and searching for useful documents in a collection. Rather, research in IR now includes modeling, document classification and categorization, systems architecture, user interfaces, data visualization, filtering, and languages, among other topics. As a result of such changes, traditional IR methods and models face increasing challenges, such as how to modify and improve the existing IR models to dynamically meet various user information needs, and how to fully utilize the currently available IR approaches in different stages of the IR process to provide the most effective and efficient retrieval performance.
A typical IR process starts with a document indexing step, at which each document or crawled web page in a collection is transformed into an instance of a certain type of document representation and stored in an indexed document database. Separately, a user information need is formulated as a query to be submitted to and parsed by an IR system (i.e., a search engine). In response to the query, a document retrieval or ranking step is triggered to evaluate the relevance between the query representation and each of the document representations stored in the document database, and to rank all the documents based on their respective relevance values. Typically, the top n ranked documents are presented as the initial retrieval results to invite user relevance feedback, i.e., the user can specify which documents are relevant and which are non-relevant. Based upon the user feedback, the IR system may run a machine learning algorithm to determine a boundary that separates the relevant results from the non-relevant ones. Using the learned boundary, the IR system can either refine the query representation or re-measure the relevance values, and thereby present better retrieval results to the user.
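The feedback step described above, in which the query representation is refined from documents the user marks as relevant or non-relevant, can be illustrated with a minimal sketch. The sketch uses a classical Rocchio-style update, which is a swapped-in illustration and not a method named in this description. The function name `rocchio_update`, the sparse dictionary representation, and the weights `alpha`, `beta`, and `gamma` are all assumptions chosen for the example.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a refined query vector (dict: term -> weight).

    Hypothetical helper: a Rocchio-style update, used here only to
    illustrate query refinement from relevance feedback.
    """
    def centroid(vectors, term):
        # Mean weight of `term` over the judged documents (0.0 if none were judged).
        if not vectors:
            return 0.0
        return sum(v.get(term, 0.0) for v in vectors) / len(vectors)

    # Collect every term that appears in the query or in any judged document.
    terms = set(query)
    for vec in relevant + non_relevant:
        terms.update(vec)

    refined = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(non_relevant, term))
        if weight > 0.0:  # terms pushed to zero or below are dropped (an assumed convention)
            refined[term] = weight
    return refined


# Toy feedback round: the user wants the car, not the animal.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 1.0}]
refined = rocchio_update(query, relevant, non_relevant)
```

In this toy round the refined query gains weight on "car" and excludes "cat", so a second ranking pass would favor documents resembling the positive example.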
As a traditional information retrieval method, the Vector Space Model (VSM) has been the most widely utilized computational model of document retrieval or ranking since it was proposed in 1975. Today, most web search engines adopt strategies derived from the VSM. The VSM is built upon the assumption that all documents and queries can be properly represented as vectors in a vector space. By providing a way to measure similarity between any two document vectors, or between a document vector and a query vector, the VSM allows documents to be ranked according to their respective similarity values. The documents ranked by the VSM, coupled with user relevance feedback, enable different machine learning algorithms to draw different optimal decision boundaries between relevant (positive) and non-relevant (negative) results. Among the various learning machines, the Support Vector Machine (SVM) is a highly effective one that generates the optimal decision boundary with the maximal margin of separation between the positive examples and the negative examples. Despite the wide use of the VSM, one problem in applying this model is that the model itself does not specify how to determine a vector space or how to represent documents and queries as vectors, so supplementary methods are required to resolve those issues. The existing methods, however, offer only heuristic rather than systematic ways to construct a vector space and to represent document or query vectors. In addition, the measured similarity values between documents should vary as user information needs change. In other words, the vector space in which documents are represented as vectors is expected to be dynamically determined from different user information needs. How to dynamically determine an optimal vector space, however, remains unexplored.
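As a concrete illustration of ranking by vector similarity, the following minimal sketch builds TFIDF-weighted vectors and ranks documents by cosine similarity to a query. It shows one common heuristic instantiation of the VSM, not a construction prescribed by this description. The helper names `tfidf_vectors` and `cosine`, the unsmoothed logarithmic IDF, whitespace tokenization, and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to sparse TFIDF vectors (dict: term -> weight).

    Hypothetical helper; uses raw term frequency and idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "kernel methods for text retrieval".split(),
    "language models for retrieval".split(),
    "cooking pasta at home".split(),
]
vectors, idf = tfidf_vectors(docs)

# Represent the query in the same space; terms unseen in the collection get weight 0.
query = "kernel retrieval".split()
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
```

The weighting scheme and the similarity measure are both fixed by hand here, which is the heuristic character of vector space construction that the paragraph above points out.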
Proposed more recently as an alternative to traditional IR methods, the language-modeling approach integrates document indexing and document retrieval into a single model. This approach infers a language model for each document, estimates the probability of generating the query according to each of these models, and then ranks the documents according to these probabilities. A language model is built from collection statistics such as term frequency in a document, document length, and term frequency in the collection of documents. With the ability to utilize those statistics in a well-interpreted, systematic way, the language-modeling approach outperforms the basic vector space model with the TFIDF (term frequency-inverse document frequency) indexing scheme on several known document collections (such as the TREC collections, for example). However, the language-modeling approach does not provide an explicit model for relevance, which makes it conceptually difficult to incorporate any relevance feedback mechanism for improving retrieval results. In order to overcome this obstacle, some additional IR systems provide a model-based feedback mechanism that estimates a query model (i.e., the term distribution from which the query is generated) from the positive feedback (relevant documents), and then ranks the documents based on the divergence between the query model and each document model. Through such model-based feedback mechanisms, the language-modeling approach gains some limited learning ability. However, the model-based feedback mechanism is unable to utilize statistics from negative feedback (i.e., the selection of non-relevant documents). Therefore, further enhancement of the language-modeling technique is needed in order to fully incorporate the advantages brought by machine learning algorithms, such as the SVM.
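The query-likelihood ranking described above can be sketched as follows. Each document's language model is estimated from the three statistics mentioned (term frequency in the document, document length, and term frequency in the collection), and documents are ranked by the log-probability of generating the query. The use of Dirichlet-prior smoothing, the function name `query_log_likelihood`, the value of `mu`, and the toy documents are assumptions for illustration only; the description does not specify a smoothing method.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_tf, collection_len, mu=10.0):
    """Log-probability of generating `query` from the language model of `doc`.

    Hypothetical helper; assumes a unigram model with Dirichlet-prior smoothing:
    p(t | d) = (tf(t, d) + mu * p(t | collection)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf.get(term, 0) / collection_len
        p_term = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        if p_term == 0.0:
            # Term absent from the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p_term)
    return score


docs = [
    "kernel methods for text retrieval".split(),
    "language models for retrieval".split(),
    "cooking pasta at home".split(),
]
collection_tf = Counter(term for doc in docs for term in doc)
collection_len = sum(len(doc) for doc in docs)

query = "language retrieval".split()
scores = [query_log_likelihood(query, d, collection_tf, collection_len) for d in docs]

# Document indices ordered from most to least likely to have generated the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Note that the score depends only on the query and the document statistics. Nothing in it represents relevance judgments, which is why feedback is conceptually difficult to incorporate in this approach.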
In light of the above, a need exists for an integrated information retrieval framework that can incorporate advantages provided by both the VSM and the language model, such as systematically representing documents as vectors, dynamically determining an optimal vector space based on user information needs, utilizing document statistics, collection statistics, and relevance statistics in a systematic rather than heuristic way, and utilizing both positive and negative feedback to interface with a machine learning algorithm (such as the SVM, for example).