The following relates to the document indexing and archiving arts, document retrieval arts, data mining arts, and so forth.
As used herein, the term “data mining” encompasses processing of a set of documents to extract useful data. As such, data mining as used herein encompasses document categorization, grouping, or indexing (in order to allow rapid retrieval of documents that pertain to a particular topic or are similar to a representative document), document archiving (e.g., using a document categorization system or index), document and/or information discovery (e.g., discovering patterns in the documents, discovering information contained in the documents, et cetera), and so forth.
In many data mining tasks, a document category (or topic) system must be prepared based on limited knowledge of a small subset of the topics that need to be indexed. For example, consider a company or other entity specializing in photovoltaic energy production that wishes to investigate the literature (i.e., a set of documents) to assess the role of photovoltaic energy in the energy landscape. Such a company is quite capable of formulating definitions for topics closely related to photovoltaic energy, e.g., topics such as “solar cells”, “solar collection efficiency”, and so forth. Company personnel may also have varying levels of knowledge of related areas, such as concentrated solar power techniques, nuclear energy, fossil fuels, and so forth, but usually cannot identify and define specific topics in these areas with sufficient accuracy and detail to enable mining the literature for them.
More generally, a data mining task is often motivated by the desire to locate documents pertaining to one, two, or a few “hot” topic(s) of interest that can be precisely defined. Additional relevant topics are known or suspected to exist and also need to be mined, but insufficient information is available to construct accurate definitions for them.
Existing data mining techniques have some deficiencies in addressing such a task. In classification techniques (also known as supervised learning), a set of documents labeled by topic is provided, and this set of labeled documents is used to train a classifier. This approach can create a very accurate classifier for the pre-defined categories (i.e., topics), but only if those categories are accurately known and defined beforehand, e.g., by manually labeling a sufficiently large set of training documents. Pre-defining the categories in this way entails laborious user interfacing with the system, and the extensive a priori knowledge required may be expensive to obtain in terms of cost, human resources, or both. Supervised learning of a classifier using a set of pre-labeled training documents is also unable to discover new topics not known beforehand.
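By way of illustration only, the limitation of the supervised approach described above can be sketched with a toy nearest-centroid classifier over word counts. The training sentences, the topic labels, and the helper names (`bag_of_words`, `train`, `classify`) are hypothetical and chosen only for this sketch; it is not a description of any particular existing system.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Represent a document as lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_docs):
    """Sum the word counts of the training documents of each topic."""
    centroids = {}
    for text, topic in labeled_docs:
        centroids.setdefault(topic, Counter()).update(bag_of_words(text))
    return centroids

def classify(centroids, text):
    """Return the pre-defined topic whose centroid is most similar."""
    vec = bag_of_words(text)
    return max(sorted(centroids), key=lambda t: cosine(centroids[t], vec))

# Hypothetical hand-labeled training set covering only the "hot" topics.
TRAINING = [
    ("solar cells convert sunlight into electricity", "solar cells"),
    ("thin film solar cells use less silicon", "solar cells"),
    ("collection efficiency depends on panel angle", "solar collection efficiency"),
    ("tracking mounts raise collection efficiency", "solar collection efficiency"),
]

model = train(TRAINING)
known = classify(model, "new silicon solar cells reach the market")
# A document on an unlabeled topic is still forced into a known category.
unknown = classify(model, "reactor fuel rods and nuclear fission")
```

In this sketch the document about nuclear fission shares no vocabulary with the training set, yet `classify` can only return one of the topics that were labeled beforehand; nothing in the procedure can propose a new “nuclear energy” topic.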
Clustering techniques, on the other hand, group documents into clusters based on document similarity without relying on any pre-defined topics (e.g. topic-labeled training documents). Clustering is also known as unsupervised learning, and has the potential to group documents into semantically meaningful topics without a priori knowledge of those topics, thereby enabling topic discovery. In practice, however, the generated clusters may not have semantic significance, or may contain numerous outliers that are only tangentially related to the semantic identification assigned to the cluster. Successful use of clustering may involve numerous repetitions of the clustering algorithm, with each repetition employing different initial conditions, in order to arrive at a usable result, and/or the clustering results may need a substantial amount of manual adjustment in order to be rendered usable.
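The dependence of clustering on initial conditions noted above can be sketched as follows. The example uses a plain k-means procedure as a stand-in for clustering generally; the two-dimensional points (standing in for document feature vectors), the seed values, and the function names are assumptions made only for this illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, seed, iterations=20):
    """Run k-means from initial centers drawn with the given seed."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Inertia: total squared distance of each point to its nearest center.
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia

# Hypothetical feature vectors forming three loose groups.
POINTS = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# Repeat the clustering with different initial conditions, keep the best run.
runs = [kmeans(POINTS, k=3, seed=s) for s in range(10)]
best_centers, best_inertia = min(runs, key=lambda r: r[1])
```

Selecting the run with the lowest inertia is only a numerical criterion. It does not ensure that the resulting clusters are semantically meaningful or that any cluster corresponds to a topic of interest, which is why manual review and adjustment may still be needed.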
Another difficulty with clustering is that there is typically no mechanism to ensure accurate labeling of documents as to “hot” topic(s), as compared with other topics that may be of less importance or interest. Said another way, there is no reason to expect the clustering to generate clusters representing hot topics that are more accurate than clusters representing other topics. Indeed, there is no guarantee that any of the generated clusters will correspond to a given hot topic at all, in which case the clustering must be re-run with different initial conditions in the hope of converging to a semantically meaningful result. There are numerous reasons why enhanced accuracy for certain “hot” topic(s) may be desirable. In the illustrative case of the entity specializing in photovoltaic energy production performing an energy landscape study, inaccuracies in topics relating to photovoltaic energy in results presented to a potential client would be particularly embarrassing given the entity's purported expertise in photovoltaics.
Disclosed in the following are improved data mining techniques that provide various benefits as disclosed herein.