The invention relates generally to computerized information management technologies and, more particularly but not by way of limitation, to the generation of relevant domain-specific topics for a corpus of data to facilitate subsequent search and retrieval operations for the data.
It is generally recognized that much of the world economic order is shifting from one based on manufacturing to one based on the generation, organization and use of information. To successfully manage this transition, organizations must collect and classify vast amounts of data so that it may be searched and retrieved in a meaningful manner. Traditional techniques to classify data may be divided into four approaches: (1) manual; (2) unsupervised learning; (3) supervised learning; and (4) hybrid approaches.
Manual classification relies on individuals reviewing and indexing data against a predetermined list of categories. For example, the National Library of Medicine's MEDLINE® (Medical Literature Analysis and Retrieval System Online) database of journal articles uses this approach. While manual approaches benefit from the ability of humans to determine what concepts the data represents, they also suffer from the drawbacks of high cost, human error and a relatively low rate of processing.
Unsupervised classification techniques rely on computer software to examine the content of data and make initial judgments as to what classification the data belongs to. Many unsupervised classification technologies rely on Bayesian clustering algorithms. While reducing the cost of analyzing large data collections, unsupervised learning techniques often return classifications that have no obvious basis in the underlying business or technical aspects of the data. This disconnect between the data's business or technical framework and the derived classifications makes it difficult for users to effectively query the resulting classifications.
Supervised classification techniques attempt to overcome this drawback by relying on individuals to “train” the classification engines so that derived classifications more closely reflect what a human would produce. Illustrative supervised classification technologies include semantic networks and neural networks. While supervised systems generally derive classifications more attuned to what a human would generate, they often require substantial training and tuning by expert operators and, in addition, often rely for their results on data that is more consistent or homogeneous than is often possible to obtain in practice.
Hybrid systems attempt to fuse the benefits of manual classification methods with the speed and processing capabilities employed by unsupervised and supervised systems. In known hybrid systems, human operators are used to derive “rules of thumb” which drive the underlying classification engines.
No known data classification approach provides a fast, low-cost and substantially automated means to classify large amounts of data in a manner that is consistent with the semantic content of the data itself. Thus, it would be beneficial to provide a mechanism to determine a collection of topics that are explicitly related to both the domain of interest and the data corpus analyzed.