Meaningful information may be extracted from large datasets by searching them for patterns. Relationships and interactions between humans or objects are recorded, and collections of such data records are subjected to classification analysis. For example, communication logs may record source hosts, destination hosts, port numbers, and other information about communication events. Bank transaction logs may register source accounts, destination accounts, branch office names, and other details of monetary transactions. In the case of communication logs, the records are analyzed to determine, for example, whether they suggest any illegal or criminal activity. The analysis may discover a particular pattern of events in communication logs collected at the time of distributed denial-of-service (DDoS) attacks, targeted threats, or the like. In the case of bank transaction logs, the records are analyzed to determine whether they suggest money laundering or money-transfer fraud. The analysis may discover a particular pattern of events in transaction logs collected at the time of such crimes.
Support vector machines (SVMs) are one of the techniques used for data classification analysis. An SVM algorithm determines a boundary plane that separates two classes while maximizing the distance from the plane to the closest data records of each class. Classifying data records in this way requires evaluating the similarity between them.
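For reference, the maximum-margin boundary is commonly written as the following optimization problem, shown here for the linearly separable (hard-margin) case. The symbols are assumptions introduced for illustration and do not appear in the text above: x_i is the feature vector of the i-th data record, y_i is its class label (+1 or -1), and w and b are the normal vector and offset of the boundary plane.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the closest records lie at distance 1/‖w‖ from the plane, so minimizing ‖w‖ maximizes that distance.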
As an example of classification analysis on human or object relationships, a computer may calculate the similarity between a first communication log collected in a certain time window and a second communication log collected in another time window. Each communication log includes multiple records, and each record includes a numerical value that indicates the number of communication events performed between a source host and a destination host.
The computer evaluates the overall similarity between two communication logs by associating their individual data records on a one-to-one basis and calculating the difference in the above-noted numerical values between each pair of associated records. How best to associate the data records in two logs is, however, often unknown in the case of interactions between humans or objects. For example, the hosts involved in suspicious communication patterns may differ from log to log. That is, comparing only the records that have identical source hosts and identical destination hosts is not necessarily the best way to determine similarity. Accordingly, the computer determines record-to-record associations so as to maximize the similarity between the datasets of interest. However, if an exhaustive approach were taken in this case, the number of possible association patterns would explode, and the computer would be unable to achieve the goal within a realistic time frame.
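The combinatorial explosion noted above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each log is a dictionary mapping a (source host, destination host) pair to an event count, that both logs involve the same number of hosts, and that an association is a one-to-one renaming of hosts. All function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def distance(log_a, log_b, mapping):
    """Sum of absolute count differences after renaming log_b's hosts via mapping."""
    renamed = {(mapping[s], mapping[d]): n for (s, d), n in log_b.items()}
    keys = set(log_a) | set(renamed)
    return sum(abs(log_a.get(k, 0) - renamed.get(k, 0)) for k in keys)


def best_association(log_a, log_b):
    """Exhaustively try every one-to-one host renaming; return (distance, mapping)."""
    hosts_a = sorted({h for pair in log_a for h in pair})
    hosts_b = sorted({h for pair in log_b for h in pair})
    if len(hosts_a) != len(hosts_b):
        raise ValueError("this sketch assumes equal host counts")
    best = None
    for perm in permutations(hosts_a):  # n! candidate renamings
        mapping = dict(zip(hosts_b, perm))
        d = distance(log_a, log_b, mapping)
        if best is None or d < best[0]:
            best = (d, mapping)
    return best


def association_count(n_hosts):
    """Number of renamings the exhaustive search must examine."""
    return factorial(n_hosts)


# Two logs with the same communication pattern but different host names.
log_a = {("A", "B"): 3, ("B", "C"): 1}
log_b = {("Z", "X"): 3, ("X", "Y"): 1}

# Comparing only identically named hosts finds no overlap at all.
naive = distance(log_a, log_b, {h: h for h in "XYZ"})
best_distance, best_mapping = best_association(log_a, log_b)
print(naive, best_distance, best_mapping)
```

With n hosts per log the loop visits n! renamings, which is already 3,628,800 for n = 10, so this approach is usable only for very small logs.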
Graph kernels (e.g., the random walk kernel and the shortest path kernel) are techniques for efficiently calculating the similarity between pairs of graphs. Recorded relationships between humans or objects may be represented in the form of graph data, which allows the use of a graph kernel to classify them. As one example of graph-based data classification techniques, a graph edit distance kernel has been proposed that improves the accuracy of similarity measurement between two graphs by using the graph mapping distance as an approximation of the graph edit distance. See, for example, the document below:
Eimi Shiotsuki, Akihiro Inokuchi, “Learning for graph classification using Star edit distance”, DEIM Forum 2016, Feb. 29, 2016
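To give a sense of how a graph kernel measures similarity cheaply, the following is a minimal sketch of a simplified random walk kernel. It is not the kernel of the document cited above. It assumes unlabeled graphs given as adjacency matrices (lists of lists), truncates walks at a hypothetical `max_len` steps, and weights walks of length k by a hypothetical `decay` factor raised to the power k. For unlabeled graphs, the number of walks of length k in the direct product graph equals the product of the walk counts in the two graphs, so the product graph never has to be built.

```python
def walk_counts(adj, max_len):
    """Return [number of walks of length k in the graph, for k = 0..max_len]."""
    n = len(adj)
    vec = [1] * n  # vec[i] = number of walks of the current length starting at node i
    counts = [sum(vec)]
    for _ in range(max_len):
        vec = [sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        counts.append(sum(vec))
    return counts


def random_walk_kernel(adj_a, adj_b, max_len=3, decay=0.5):
    """Truncated random walk kernel for two unlabeled graphs (illustrative only)."""
    wa = walk_counts(adj_a, max_len)
    wb = walk_counts(adj_b, max_len)
    return sum((decay ** k) * wa[k] * wb[k] for k in range(max_len + 1))


# A triangle and a three-node path, both undirected.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
similarity = random_walk_kernel(triangle, path, max_len=2, decay=0.5)
print(similarity)
```

The cost grows only polynomially with the number of nodes, in contrast to an exhaustive search over associations, but the kernel sees nothing beyond the pairwise edge structure of each graph.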
While graph kernels make it possible to measure similarity with a small amount of computation, their drawback is that part of the original data may be lost when it is converted to graph form, which degrades the accuracy of similarity determination. For example, in the case of communication log analysis, it is not possible to express the combinations of source host, destination host, and port number in graph form. That is, graph kernels are unable to maintain the information about combinations of three interrelated things. The existing methods are thus unable to provide sufficient accuracy in determining the similarity between data records describing relationships between humans or objects, although they may be able to execute the calculation with a reasonable amount of computation.
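The loss of three-way information can be illustrated with a minimal sketch. It assumes, for illustration only, that each record is a (source host, destination host, port number) triple and that the graph form keeps only pairwise edges between the elements of each record. The host names, port numbers, and function name are hypothetical.

```python
from itertools import combinations


def pairwise_edges(records):
    """Project (source, destination, port) triples onto the set of pairwise edges."""
    edges = set()
    for record in records:
        # Yields (source, destination), (source, port), (destination, port).
        for a, b in combinations(record, 2):
            edges.add((a, b))
    return edges


# Two logs in which no (source, destination, port) combination is shared.
log_a = {("h1", "h3", 80), ("h1", "h4", 443), ("h2", "h3", 443), ("h2", "h4", 80)}
log_b = {("h1", "h3", 443), ("h1", "h4", 80), ("h2", "h3", 80), ("h2", "h4", 443)}

shared_triples = log_a & log_b
same_graph = pairwise_edges(log_a) == pairwise_edges(log_b)
print(len(shared_triples), same_graph)
```

Under this assumption the two logs yield identical edge sets even though they share no record, so any similarity computed from the edges alone cannot tell them apart.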