1. Technical Field
The present disclosure generally relates to methods and systems for analyzing data from networked devices. More particularly, the present disclosure relates to methods and systems for analyzing data to identify whether any analyzed networked device includes information that is anomalous as compared to other analyzed networked devices.
2. Background
Attempts to determine methods for clustering objects into groups have led to the development of numerous clustering algorithms that use a distance or similarity measure to determine the proper clustering of the objects. Such clustering algorithms have been used, for example, in the fields of bioinformatics and language classification.
The efficacy of a clustering algorithm can be determined by evaluating the Kolmogorov complexity for clustered objects. The Kolmogorov complexity is a measure of randomness of a string based on its information content. A string is a finite binary sequence of information. Other finite information sequences can be transformed into a finite binary string prior to determining the Kolmogorov complexity using known methods. The Kolmogorov complexity can be used to quantify the randomness of individual objects in an objective and absolute manner.
The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program required to compute x on a universal computer, such as a Turing machine. As such, K(x) represents the minimal amount of information required to generate x using an algorithm. The conditional Kolmogorov complexity of string x given string y, K(x|y), is similarly defined as the length of a shortest program required to compute string x if string y is provided as an auxiliary input to the program. Similarly, K(xy) denotes the length of a shortest program required to generate string x and string y. Based on the Kolmogorov complexity, the distance between two strings x and y has been defined by the following equation:
$$d_k(x, y) = \frac{K(x \mid y) + K(y \mid x)}{K(xy)}$$
The Kolmogorov complexity represents the ultimate lower bound among all measures of information content. However, it cannot generally be explicitly computed. As such, different techniques have been developed to approximate the Kolmogorov complexity for a text string.
K(x) is essentially the best compression that can be achieved for a text string x. As such, compression algorithms provide an upper bound to the Kolmogorov complexity. For a given data compression algorithm, C(x) can be defined to be the size of string x when compressed using the algorithm. Similarly, C(x|y) can be defined to be the size of string x when compressed after first training the compression algorithm on string y. As such, the Kolmogorov distance equation can be approximated using the following equation for a given text compression algorithm:
$$d_c(x, y) = \frac{C(x \mid y) + C(y \mid x)}{C(xy)}$$
Data compression algorithms for which $d_c$ closely approximates $d_k$ are considered to be superior to algorithms for which $d_c$ does not closely approximate $d_k$.
The measure $d_c$ has been shown to be a similarity metric and has been applied to clustering DNA (see, e.g., Allison et al., “Sequence Complexity for Biological Sequence Analysis,” Computers and Chemistry 24(1), pp. 43-55 (2000)) and classifying languages (see Benedetto et al., “Language Trees and Zipping,” Physical Review Letters 88, 048702 (2002)). However, the computation of $d_c$ requires altering the chosen compression algorithm to obtain C(x|y) and C(y|x), which can require significant computational effort.
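To illustrate what obtaining the conditional terms can involve, the following minimal sketch uses the preset-dictionary facility of Python's zlib module as a rough stand-in for training a compressor on one string before compressing another. This is an assumption made for illustration only and is not the technique used in the cited works; the function names c, c_given and d_c are hypothetical.

```python
import zlib


def c(x: bytes) -> int:
    """C(x): size in bytes of x after zlib compression."""
    return len(zlib.compress(x, 9))


def c_given(x: bytes, y: bytes) -> int:
    """Rough stand-in for C(x|y): compress x with y as a preset dictionary.

    Assumption: priming zlib with y approximates "training" on y. The
    DEFLATE window is at most 32 KiB, so only the tail of a longer y
    can actually be referenced.
    """
    compressor = zlib.compressobj(level=9, zdict=y)
    return len(compressor.compress(x) + compressor.flush())


def d_c(x: bytes, y: bytes) -> float:
    """d_c(x, y) = (C(x|y) + C(y|x)) / C(xy)."""
    return (c_given(x, y) + c_given(y, x)) / c(x + y)
```

Each evaluation of the distance in this sketch requires two primed compressions in addition to compressing the concatenation, and a compressor without a comparable priming facility would have to be modified to supply the conditional terms.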
A simplified distance measure (the compression dissimilarity measure or CDM) can be used to approximate $d_c$, as shown in Keogh et al., “Towards Parameter-Free Data Mining,” in “The Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,” pp. 206-215 (2004). The CDM value for two strings is defined by the following equation:
$$\mathrm{CDM}(x, y) = \frac{C(xy)}{C(x) + C(y)}$$
In other words, determining the CDM value for two strings does not require determining conditional values, but merely the compression of each of the two strings and of their concatenation. As such, the computational effort for determining the CDM value is significantly less than the computational effort for determining $d_c$, the compression-based approximation of the Kolmogorov distance.
If two objects, x and y, are unrelated, CDM(x, y) is close to 1. As the value of CDM(x, y) decreases, x and y are determined to be more closely related. As such, two objects that are substantially similar have a relatively small CDM value. It should be noted that CDM(x, x) does not equal zero.
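The CDM definition can be sketched in a few lines using an off-the-shelf compressor. The example below is a minimal sketch that assumes Python's zlib module as the compression algorithm; the function names and the sample device data are hypothetical and serve only to illustrate the calculation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of data after zlib compression."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """CDM(x, y) = C(xy) / (C(x) + C(y)); smaller means more closely related."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


# Hypothetical configuration dumps from three devices (illustrative data only).
printer_a = b"driver=pcl6\nduplex=on\ntray=2\nqueue=floor3\n" * 20
printer_b = b"driver=pcl6\nduplex=off\ntray=2\nqueue=floor3\n" * 20
other = b"GET /status HTTP/1.1\nhost=10.0.0.7\nbeacon=3600\n" * 20

print("similar devices:   ", cdm(printer_a, printer_b))
print("dissimilar devices:", cdm(printer_a, other))
print("identical input:   ", cdm(printer_a, printer_a))
```

Because C(xx) is never zero, CDM(x, x) is strictly positive. With a compressor that fully exploits the repetition, C(xx) is only slightly larger than C(x), so the value for identical inputs tends toward 1/2 rather than 0.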
Devices, such as computers, printers and other processor-based devices, are commonly connected together via a communications network, such as the Internet, a local area network (LAN) or the like. The use of networks to interconnect devices enables communication and processing operations to be performed among remote devices. For example, information can be passed from a first device to a second device to enable performance of a computing operation. Similarly, information can be distributed among a plurality of devices that are connected to a network to enable distributed processing operations.
A device that is connected to a network, such as a LAN or a wide-area network (WAN), might be configured in a similar manner with other devices on the network. For example, a computer used in a business or scholastic environment could be configured similarly to other computers in the same environment. As such, computers connected to the same LAN tend to have similar software, use the same operating system and/or generally have a similar system configuration. Similarly, other devices that are networked together, such as printers in a print cluster, tend to be similarly configured and perform similar operations. For example, printers connected via a network might have similar print drivers, store similar types of process information and/or the like.
One problem with networked devices is that such devices can be more easily compromised than non-networked devices. For example, an individual could create and distribute a software program, such as a computer virus, worm or other “malware” software program, that is received and stored by a networked device. An exemplary malware program could perform intrusive operations, such as periodically providing system information from an infected device to an unauthorized third party and/or preventing the infected device from performing some or all of its intended operations. Detecting malware can be difficult because malware typically masks itself as a legitimate software application.
Accordingly, systems and methods for clustering devices in a network based on the similarity between such devices and detecting devices in a network having anomalies based on such clusters would be desirable.