In many areas there is a desire to characterize objects according to the elements from which they comprise. Examples can be found in various fields such as software event stream analysis, which aims to discover sequences of events that describe different states in a system running complex applications using event log analysis for example. Existing research in the area of automated log analysis focuses on discovery of temporal patterns, or correlation of event statistics, within the events. Such techniques are typically based on knowledge of which event messages can occur, or require access to the source code of software that generates the event messages in order to determine which event messages can occur. In general, the research does not accommodate the complexities of real world systems, in which logs may be generated by various different components in a complex system, leading to, for example, interleaving of sequences of events, asynchronous events and high dimensionality.
An alternative scenario is document characterization, which aims to describe and characterize documents in a corpus according to the concepts they discuss by using the words from which the documents are composed. Following characterization, each document in the corpus, or indeed new documents added thereto, can generally be represented sparsely using these concepts. The representation can be used as an aid in keyword extraction, or concept based retrieval and search for example. Document characterization works can use probabilistic latent semantic indexing for example, to produce models that capture latent concepts in documents using a corpus of training documents and different finite mixture models. In general, existing approaches for characterizing a corpus of documents use a compressed representation of the data which is learned from data through probability distributions over words and concepts.