1. Field of the Invention
The present disclosure relates to documents and more specifically to a system and method for filtering and recommending documents that are related to a topic of interest.
2. Description of the Related Art
One method of searching for electronic documents online is to enter one or more keywords into a search engine, such as a search engine webpage on the Internet. In general, the quality of such a search depends on the skill of the user and their ability to craft and submit an appropriate query. Some systems can return results from a keyword search and offer to find more documents based on a given result. A user interested in several different topics may need to spend a significant amount of time searching for and reviewing documents that meet the search criteria but are not of any real interest.
Earlier works by Potok et al. address the need for automated document searching, and the following three references are incorporated by reference as if included here at length: U.S. Pat. No. 7,805,446, “Agent-based Method for Distributed Clustering of Textual Information” to Potok et al.; U.S. Pat. No. 7,693,903, “Method for Gathering and Summarizing Internet Information” to Potok et al.; and U.S. Pat. No. 7,937,389, “Dynamic Reduction of Dimensions of a Document Vector in a Document Search and Retrieval System” to Jiao and Potok.
Document clustering is an enabling technique for many machine learning applications, such as information classification, filtering, routing, topic tracking, and new event detection. Today, dynamic data stream clustering poses significant challenges to traditional methods. Typically, clustering algorithms use the Vector Space Model (VSM) to encode documents. The VSM relates terms to documents, and since different terms have different importance in a given document, a term weight is associated with every term. These term weights are often derived from the frequency of a term within a document or set of documents. Many term weighting schemes have been proposed. Most of these existing methods work under the assumption that the whole data set is available and static. For instance, in order to use the popular Term Frequency-Inverse Document Frequency (TF-IDF) approach and its variants, one needs to know the number of documents in which a term occurs at least once (the document frequency). This requires a priori knowledge of the data, and requires that the data set not change during the calculation of term weights.
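The dependence of TF-IDF on the whole data set can be illustrated with a minimal Python sketch. The weighting used here, raw term count multiplied by log(N / document frequency), is one common TF-IDF variant chosen for illustration; the function name `tf_idf` and the sample documents are hypothetical and do not come from the referenced works.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using count * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical tokenized documents.
docs = [["stream", "cluster"], ["stream", "vector"]]
before = tf_idf(docs)

# A third document arrives: N changes, so existing weights change as well.
after = tf_idf(docs + [["cluster", "agent"]])
```

In this sketch the weight of "vector" in the second document differs between `before` and `after` even though that document itself is unchanged, which is the recalculation burden described above.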
The need for knowledge of the entire data set significantly limits the use of these schemes in applications where continuous data streams must be analyzed in real time. For each new document, the document frequency of many terms must be updated, and therefore all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N²), assuming that the term space M per document is much less than the number of documents. Otherwise, the computational complexity is O(N² M log M), where O(M log M) computations are needed to update a document.
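The quadratic growth can be made concrete by counting document-vector updates. The following Python sketch is an idealized count, assuming that each arriving document triggers a reweighting of every vector generated so far plus the new one; the function name `reweighting_operations` is hypothetical.

```python
def reweighting_operations(n_docs):
    """Count vector updates when each arrival forces a full reweighting."""
    total = 0
    for arrived in range(1, n_docs + 1):
        # The new vector plus the (arrived - 1) previously generated vectors.
        total += arrived
    return total

# Closed form of the same sum: N * (N + 1) / 2, which grows as O(N^2).
updates_for_1000 = reweighting_operations(1000)
```

Under this assumption, 1,000 streaming documents require 500,500 vector updates, whereas a scheme that weights each document once would require 1,000.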
A weighting scheme called Term Frequency-Inverse Corpus Frequency (TF-ICF) addresses the problem of finding and organizing information from dynamic document streams. TF-ICF does not require term frequency information from other documents within the set and can therefore process the document vectors of N streaming documents in linear time.
Widely used term weighting schemes generally require knowledge of the entire document collection. In other words, if a TF-IDF based method is used to generate document representations, a newly arriving document requires the weights of existing document vectors to be recalculated. Consequently, any application that relies on the document vectors is also affected. This can significantly hinder the use of such schemes in applications where dynamic data streams need to be processed in real time. TF-ICF, by contrast, generates each document representation independently, without knowledge of the document stream being examined. Its computational complexity is O(N).
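A minimal Python sketch of this idea follows. The text above does not state a formula, so the sketch makes two assumptions: the inverse corpus frequency is read from a fixed reference corpus whose statistics are computed once, before streaming begins, and the weight takes the form log(1 + count) multiplied by log((C + 1) / (c + 1)), where C is the reference corpus size and c is the number of corpus documents containing the term. The names `CORPUS_SIZE`, `CORPUS_DOC_FREQ`, and `tf_icf`, and all of the counts, are hypothetical.

```python
import math
from collections import Counter

# Hypothetical reference-corpus statistics, fixed before any streaming begins.
CORPUS_SIZE = 1000
CORPUS_DOC_FREQ = {"stream": 120, "cluster": 40, "vector": 15}

def tf_icf(doc):
    """Weight one document from its own term counts and the fixed corpus table."""
    weights = {}
    for term, count in Counter(doc).items():
        corpus_df = CORPUS_DOC_FREQ.get(term, 0)  # terms absent from the corpus count as 0
        icf = math.log((CORPUS_SIZE + 1) / (corpus_df + 1))
        weights[term] = math.log(1 + count) * icf
    return weights

# Each streaming document is weighted once, independently of the others.
first = tf_icf(["stream", "cluster", "stream"])
second = tf_icf(["vector", "agent"])
first_again = tf_icf(["stream", "cluster", "stream"])
```

Because `tf_icf` never inspects other documents in the stream, `first_again` is identical to `first` no matter how many documents arrive in between, and weighting N documents takes N calls.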