1. Technical Field
The present invention relates to information retrieval technology and more particularly to systems and methods for text document matching and ranking.
2. Description of the Related Art
Ranking text documents given a text-based query is one of the key tasks in information retrieval. Classical vector space models use weighted word counts and cosine similarity. This type of model often performs remarkably well, but suffers from the fact that only exact matches of words between query and target texts contribute to the similarity score. It also lacks the ability to adapt to specific datasets since no learning is involved.
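By way of illustration only, the classical vector space approach described above can be sketched as follows. This is a minimal sketch and not part of any cited system: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`), the whitespace tokenizer, and the smoothed inverse document frequency weighting are all assumptions chosen for brevity, and many other weighting schemes exist.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (list of sparse TF-IDF vectors as dicts, idf table)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf (an assumed weighting; other variants are common).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by descending score."""
    vecs, idf = tfidf_vectors(docs)
    # Query terms absent from the collection cannot match and are dropped.
    q = {t: tf * idf[t]
         for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vecs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "the cat sat on the mat",
    "a feline rested on a rug",
    "stock prices fell sharply",
]
ranked = rank("cat mat", docs)
```

In this toy collection the first document receives a positive score for the query "cat mat", while the second, which is semantically related but shares no words with the query, scores zero. This is the exact-match limitation noted above. The weights are also fixed by the counting formula rather than learned from data.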
Latent Semantic Indexing (LSI) and related methods such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) choose a low-dimensional feature representation of “latent concepts”, so that words are no longer treated as independent. A support vector machine with hand-coded features based on the title, body, search engine rankings and the URL has also been implemented, as have neural network methods trained on a similar set of features. Other methods learned the weights of orthogonal vector space models on Wikipedia links and showed improvements over the OKAPI method. The same authors also used a class of models for matching images to text. Several authors have proposed interesting nonlinear versions of (unsupervised) LSI using neural networks and showed that they outperform LSI or pLSA. However, these methods are rather slow, which limits the dictionary size.