There are many situations where it is desirable to organize data into clusters without first knowing anything about the data being organized. For example, e-mails arrive at an e-mail address (or set of addresses) on a continual basis. These e-mails come from different senders and are not necessarily related to each other. It would be desirable to organize this incoming e-mail stream into, for example, categories for storage and eventual retrieval.
Similarly, data pertaining to financial transactions can stream into a business from a plurality of sources, or text data can be streamed into a location, and it is desired to separate the data into categories or partitions for subsequent viewing or use. Sometimes this data arrives in a streaming fashion (such as network traffic), and sometimes it is contained in large batch files, such as archived data files. In either event, it is desired to determine the semantics of this data without first knowing what the data contains and without requiring a static data set.
Previously, for example in a paper titled “Text Modeling for Real-Time Document Categorization,” published in March 2005 in the transactions of the IEEE, which paper is hereby incorporated by reference herein, the idea of performing real-time document categorization is broached, but the actual implementation is left undecided as an item for future research. In the above-referenced paper, while hardware is described which assigns documents to categories in real time, the paper requires a previous “learning” of the data structures.