Large amounts of data are becoming available due to the ease of sharing information over the Internet and the development of a wide variety of sensors that provide information about individuals, society, and technological systems. The importance of using this information to understand the behavior of a wide variety of systems is growing. Among the opportunities are an improved understanding of failures and the risk of failures, of attacks on a system by malicious actors, and, more generally, the characterization of events. Characterizing the events that occur in a system can enable us to respond better to those events and to change the system to make it less vulnerable to adverse events.
Existing methods of analyzing data streams to determine the existence of adverse events and to characterize those events generally depend either on specific, human-identified measures for those events determined by logic, or on the identification of particular types of events obtained from analysis of preexisting instances of those specific events. For example, events associated with security breaches are identified using parts of the code of a particular malware, or the presence of a particular file in the system. The vulnerability of a computing system is determined by comparing the settings of the system with recommended settings. Adverse health-related events are identified by specific indicators of those events.
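By way of illustration only, the conventional checks described above can be sketched as follows. This is a minimal sketch rather than any particular product's method; the file paths, setting names, and function names are hypothetical and are chosen solely for the example.

```python
from typing import Dict, List, Set

# Hypothetical, hand-maintained indicators (assumptions for illustration only).
KNOWN_BAD_FILES: Set[str] = {"/tmp/dropper.bin", "/usr/lib/evil.so"}
RECOMMENDED_SETTINGS: Dict[str, str] = {
    "firewall": "on",
    "auto_update": "on",
    "remote_root_login": "off",
}


def find_known_bad_files(observed_files: Set[str]) -> List[str]:
    """Return the observed files that exactly match a known indicator."""
    return sorted(observed_files & KNOWN_BAD_FILES)


def find_setting_deviations(settings: Dict[str, str]) -> Dict[str, str]:
    """Return each recommended setting whose observed value differs or is missing."""
    return {
        name: settings.get(name, "<missing>")
        for name, recommended in RECOMMENDED_SETTINGS.items()
        if settings.get(name) != recommended
    }
```

Because both checks rely on exact, previously enumerated indicators, a trivially renamed file or a setting that is absent from the recommended list goes undetected.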
A typical data management approach is to organize related data values into a simple data structure, such as a multi-dimensional vector in which each data value is assigned to one of the dimensions of the vector. As a system becomes more and more digitally integrated with other systems, observation devices, and data flows, the amount of data generated by the system increases, and additional data values become available for the purpose of characterizing the system or the data produced by it. In a complex system, vectors quickly become extremely difficult or impossible for human beings to process as the number of dimensions increases: effective visualization of a vector can be achieved only at very low dimensions (i.e., two or three), so conventional solutions that rely on the ability of human observers to infer and characterize the nature of an event are deficient. It is common to attempt to characterize systems using methods that are fully specified algorithmically by an individual who determines the process of signature extraction from data vectors. In such methods, the individual must identify, characterize, and program each signature that is to be used to characterize events. These measures are not robust to the many possible ways in which events, adverse events, vulnerabilities, failures, and security breaches can occur. There is a need for a more general ability to recognize when a system is vulnerable, is failing, or has been compromised. More generally, there is a need for methods to extract various signatures characterizing events from large amounts of data.
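The vector organization and the individually programmed signatures described above can be illustrated with a short sketch. The dimension names, the threshold, and the function names below are hypothetical assumptions introduced only for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical dimensions: each data value is assigned to one position in the vector.
DIMENSIONS: Tuple[str, ...] = ("cpu_load", "failed_logins", "outbound_mb", "open_ports")


def to_vector(observation: Dict[str, float]) -> List[float]:
    """Place each named data value in its assigned dimension (0.0 if absent)."""
    return [float(observation.get(name, 0.0)) for name in DIMENSIONS]


def matches_brute_force_signature(vector: List[float]) -> bool:
    """A single hand-programmed signature: an assumed threshold on failed logins."""
    failed_logins = vector[DIMENSIONS.index("failed_logins")]
    return failed_logins > 50.0
```

Each further type of event would require another individually written function of this kind, and an event that does not resemble any programmed signature, such as an unusually large outbound transfer in this example, is not recognized.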