As the generation of information proliferates, vast quantities of data are created by systems, software, devices, sensors and all manner of other entities. Some data is intended for human review, problem identification or diagnosis, scanning, parsing or mining. As data sets are generated and stored in greater quantities, at greater rates, and with potentially greater levels of complexity and detail, the “big data” problem of storing, handling, processing or using the data arises.
Specifically, it can be problematic to identify meaning within data, or to identify relationships between data items in large or complex data sets. Further, data can be generated in real-time and received by data storage components or data processing components at regular or variable intervals and in predetermined or variable quantities. Some data items are generated over time to indicate, monitor, log or record an entity, occurrence, status, event, happening, change, issue or other thing. Such data items can be collectively referred to as ‘events’. Events include event information as attributes and have an associated temporal marker such as a time and/or date stamp. Accordingly, events are generated in time series. Examples of data sets of events include, inter alia: network access logs; software monitoring logs; processing unit status information events; physical security information such as building access events; data transmission records; access control records for secured resources; indicators of activity of a hardware or software component, a resource or an individual; and profile information for profiling a hardware or software component, a resource or an individual.
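By way of illustration only, an event of the kind described above can be sketched as a record that pairs a temporal marker with a set of attributes. The names `Event`, `as_time_series`, and the building-access attribute keys below are hypothetical and chosen purely for the example; they are not taken from any particular system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class Event:
    """One discrete event: a temporal marker plus free-form attributes (hypothetical layout)."""
    timestamp: datetime
    attributes: Dict[str, Any] = field(default_factory=dict)


def as_time_series(events: List[Event]) -> List[Event]:
    """Return the events ordered by their temporal marker."""
    return sorted(events, key=lambda e: e.timestamp)


# Hypothetical building-access events, received out of order.
log = [
    Event(datetime(2024, 1, 5, 9, 2), {"type": "door_open", "badge": "B17", "door": "lab"}),
    Event(datetime(2024, 1, 5, 8, 58), {"type": "badge_swipe", "badge": "B17", "door": "main"}),
]
series = as_time_series(log)
```

Ordering by the temporal marker yields the time series referred to above; it does not by itself indicate which events are related to one another.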
Events are discrete data items that may or may not be associated, directly or indirectly, with other events. Determining relationships between events requires detailed analysis and comparison of individual events and frequently produces false positive determinations of relationship, leading to inappropriate conclusions. Statistical methods such as time-series analysis and machine learning approaches to the modeling of event information are not ideally suited because in many cases they require numerical features, and because they typically seek to fit data to known distributions. There is evidence that human behavior sequences can differ significantly from such distributions, for example in sequences of asynchronous events such as the sending of emails, exchange of messages, human controlled vehicular traffic, transactions and the like. In the paper “The origin of bursts and heavy tails in human dynamics,” (A. L. Barabasi, Nature, pp. 207-211, 2005), Barabasi showed that many activities do not obey Poisson statistics, and consist instead of short periods of intense activity which may be followed by longer periods in which there is no activity.
A related problem with statistical approaches and machine learning is that such approaches generally require a significant number of examples to form meaningful models. Where a new behavior pattern occurs (for example, in network intrusion events) it may be important to detect it quickly (i.e. before a statistically significant number of incidents have been seen). A malicious agent may even change the pattern before it can be detected.
The identification of sequences of events is a widespread and unsolved problem. For example, internet logs, physical access logs, transaction records, email and phone records all contain multiple overlapping sequences of events related to different users of a system. Information that can be mined from these event sequences is an important resource in understanding current behavior, predicting future behavior and identifying non-standard patterns and possible security breaches.