Computer-aided document searching typically involves the use of one or more computer programs to analyse a document corpus and then to search through the analysed corpus. Analysis of a document corpus may involve organising the documents into a plurality of document clusters in order to facilitate the searching process. Typically, this involves the use of one or more computer programs that implement a clustering algorithm. Searching through a document corpus is typically performed by a computer program commonly known as a search engine.
A feature that has a significant impact on the architectural design of a search engine is the size of the document corpus. Another important consideration is whether maintenance of the document corpus (adding and deleting documents) is open to all users (an uncontrolled corpus, such as the Internet) or whether maintenance is controlled, for example by an administrator (a controlled corpus, such as an Intranet). More generally, a controlled corpus comprises a dataset that is controlled by an administrator or a dataset that is wholly accessible.
Conventional search algorithms return, as a search result, a ranked list of documents, each of which should contain all or some of the keywords presented in a user query. Such systems determine document relevancy based on the frequency of keyword occurrence or by making use of references and links between documents. Often many search results are returned and the user cannot easily determine which results are relevant to their needs. Therefore, although recall may be high, the large number of documents returned to achieve it results in low precision and a laborious search for the user to find the most relevant documents.
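The trade-off between recall and precision described above can be illustrated with a minimal sketch. The document identifiers and the set of relevant documents are hypothetical and chosen only for illustration; they are not taken from any particular search engine.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a list of returned document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 100 documents are returned, but only 5 documents
# in the whole corpus are relevant, and 4 of those are in the result list.
returned = [f"doc{i}" for i in range(1, 101)]
relevant = ["doc3", "doc17", "doc42", "doc77", "doc250"]

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this illustrative case most of the relevant documents are found, yet they make up only a small fraction of what the user has to read through.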
Additionally, a conventional search engine returns a flat ranked list of documents. If the query topic is relatively broad then this list can contain documents belonging to many narrow subtopics.
In order to obtain the best results from conventional search algorithms, which are based on word statistics, a user needs to have statistical knowledge about the document corpus before forming a query. This knowledge is never available a priori, and as such the user rarely forms good queries. With a thematic search, knowledge about cluster descriptions can be provided to the user, enabling them to improve and intelligently refine their queries interactively.
Conventional search engines often use additional information such as links between web pages or references between documents to improve the search result.
The concept of document-clustering-based searching or browsing is known (for example, the Scatter-Gather browsing tool [4]). The main problems with this type of approach are its applicability to real-life applications, and the efficiency and effectiveness of the clustering algorithms. Unsupervised clustering algorithms fall into hierarchical or partitional paradigms. In general, similarities between all pairs of documents must be determined, which makes these approaches unscalable. Supervised approaches require a training data set, which may not always be readily available, can add to the cost of a project and can take a long time to prepare.
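The scalability concern can be made concrete with a minimal sketch that computes the similarity between every pair of documents. The use of raw term counts and cosine similarity is an assumption made for illustration, and the function names are hypothetical; the cited approaches may use other representations and measures. The point is that the number of comparisons grows as n(n-1)/2 for n documents.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs):
    """Compare every pair of documents: n * (n - 1) / 2 comparisons."""
    vectors = [Counter(d.lower().split()) for d in docs]
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Hypothetical toy corpus, for illustration only.
docs = [
    "search engine ranks documents",
    "search engine ranks documents",
    "clustering groups similar documents",
    "protein folding in living cells",
]
sims = pairwise_similarities(docs)
print(len(sims), "comparisons for", len(docs), "documents")
```

Doubling the corpus size roughly quadruples the number of comparisons, which is why this step dominates the cost for large corpora.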
A different approach to the problem of thematic-focusing retrieval is considered in [5]. This system uses a set of agents to retrieve from the Internet, or filter from a newsgroup, documents relevant to a specific topic. Topics are described manually in text form. Additionally, a set of rules is generated manually in a special rule language to describe how to compare a document with a topic, i.e. which words from the topic description should be used and how these words influence the category weights. The resulting document category is determined using calculated category weights and fuzzy logic. The main disadvantage of this approach is that all topic descriptions and rules are defined manually. It is impossible to predict in advance which topic descriptions and corresponding rule sets will be sufficient to retrieve the relevant documents with high precision and recall. Therefore, a large amount of manual work and research is required to generate effective topic descriptions and rules. As such, this approach cannot be considered scalable.
Automatic topic discovery through the generation of document clusters could be based on such techniques as Probabilistic Latent Semantic Indexing [6]. Probabilistic Latent Semantic Indexing uses a probabilistic model, the parameters of which are estimated using the Expectation Maximization algorithm. This is seen as a limitation of the approach: for example, the number of clusters must be set in advance, which reduces its flexibility.
Another example of a search engine, based on information-theoretic approaches to discovering information about topics presented in the document corpus, is outlined in [7]. The main idea is to generate a set of so-called topic threads and use them to represent the topic of every document in the corpus. A topic thread is a sequence of words from a fixed system of word classes. These classes are formed as a result of an analysis of a representative set of randomly selected documents from the document corpus (a training set). Words from different classes differ in their probabilities of occurrence in the training set and hence represent topics at different levels of abstraction. A thread is a sequence of these words in which each successive word belongs to a narrower class, and neighbouring words in the sequence should occur in the same document with a sufficiently high probability. Every document from the document corpus is assigned one of the possible topic threads. Cross-entropy is used as the measure for selecting the topic thread that is most relevant to the topic of the document. This topic thread is stored in the index and is used at the search stage instead of the document itself. The main disadvantage of this approach is that only a relatively small part of the information about a document is stored in the index and used during search. In addition, these topic threads cannot be used to cluster documents into thematic clusters, and hence information about the topic structure of the document corpus is hidden from the user.
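The selection of a topic thread by cross-entropy can be sketched as follows. This is not the method of [7], whose details are not reproduced here; it is a minimal illustration under stated assumptions. Each candidate thread is modelled as a smoothed word distribution that places most of its mass on the thread's own words, and the thread giving the lowest cross-entropy against the document's word distribution is chosen. The function names, the smoothing constant and the example threads are all hypothetical.

```python
import math
from collections import Counter


def cross_entropy(doc_tokens, thread, vocab_size, smoothing=0.01):
    """Cross-entropy H(p, q) of document distribution p against thread model q.

    Assumption: q(w) is proportional to (1 + smoothing) for thread words
    and to smoothing for all other words in the vocabulary.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    thread_words = set(thread)
    denom = len(thread_words) + smoothing * vocab_size  # normaliser for q
    h = 0.0
    for word, count in counts.items():
        q = ((1.0 if word in thread_words else 0.0) + smoothing) / denom
        h -= (count / total) * math.log(q)
    return h


def select_thread(doc_tokens, threads, vocab_size):
    """Pick the candidate thread with the lowest cross-entropy."""
    return min(threads, key=lambda t: cross_entropy(doc_tokens, t, vocab_size))


# Hypothetical threads, ordered from broader to narrower words.
threads = [
    ("computing", "search", "index"),
    ("biology", "cell", "protein"),
]
doc_tokens = "search engine index query ranking index".split()
vocab = set(doc_tokens).union(*threads)

best = select_thread(doc_tokens, threads, len(vocab))
print(best)
```

Only the selected thread would be kept for the document in such a scheme, which illustrates why most of the information about the document is unavailable at the search stage.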
It would be desirable to mitigate the problems outlined above.