The present invention is generally directed to graph mining in data processing applications. More specifically, the present invention is directed to calculating direction-aware proximity for graph mining.
Data mining refers to the sorting of large amounts of data to determine relevant and useful information. Graph mining is a type of data mining in which data is organized into a graph in order to extract information regarding the data. Such a graph can represent a so-called “social network” in which nodes represent entities, such as people, and edges connecting the nodes represent some relationship, collaboration, or influence between the entities. Examples of social networks include graphs in which nodes represent scientists and edges connect pairs that have co-authored papers; nodes represent scientific papers and edges represent citations between the papers; nodes represent telephone numbers and edges represent calls between the telephone numbers; nodes represent web sites and edges represent links between the web sites; etc. Accordingly, nodes and edges of a graph can represent any entity and relationship between entities, respectively. In some graphs, the edges between nodes are weighted based on the relationship between the entities represented by the nodes. For example, in a graph representing a telephone network, edges between nodes representing telephone numbers can be weighted by the number of calls between the telephone numbers.
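As an illustration only, the weighted telephone graph described above can be sketched as follows. The call records, the telephone numbers, and the function name build_weighted_graph are hypothetical and are not part of any particular implementation; the sketch assumes each record is a pair of telephone numbers and that the weight of an edge is simply the count of calls between that pair.

```python
from collections import Counter

# Hypothetical call records: each tuple is one call between two telephone numbers.
calls = [
    ("555-0100", "555-0101"),
    ("555-0101", "555-0100"),
    ("555-0100", "555-0102"),
    ("555-0100", "555-0101"),
]


def build_weighted_graph(records):
    """Return edge weights keyed by an unordered pair of nodes."""
    weights = Counter()
    for a, b in records:
        # A frozenset ignores order, so each pair of numbers maps to one edge.
        weights[frozenset((a, b))] += 1
    return weights


graph = build_weighted_graph(calls)
for pair, weight in graph.items():
    print(sorted(pair), weight)
```

In this sketch, the nodes are implied by the keys of the mapping, and each value is the weight of the edge connecting the two nodes in the key.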
An undirected graph refers to a graph in which the edges connecting nodes have no direction. Accordingly, an edge in an undirected graph represents a relationship that exists symmetrically between nodes. A directed graph refers to a graph in which the edges connecting nodes have a direction. Accordingly, an edge in a directed graph represents a relationship that exists from one node to another. For example, in an undirected graph of a telephone network, an edge between a first node and a second node, representing first and second telephone numbers, respectively, represents calls made between the first and second telephone numbers regardless of who initiated the calls. In a directed graph of a telephone network, an edge from the first node to the second node represents calls from the first telephone number to the second telephone number.
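The distinction between the two kinds of graph can be illustrated with a minimal sketch built from a single call log. The log contents, the node labels "A" and "B", and the function names directed_weights and undirected_weights are assumptions made for illustration; each log entry is assumed to be a (caller, callee) pair.

```python
from collections import Counter

# Hypothetical call log: one (caller, callee) tuple per call.
call_log = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")]


def directed_weights(log):
    """Edge (u, v) counts calls placed from u to v."""
    return Counter(log)


def undirected_weights(log):
    """Edge {u, v} counts calls between u and v, regardless of who initiated them."""
    return Counter(frozenset(pair) for pair in log)


directed = directed_weights(call_log)
undirected = undirected_weights(call_log)

print(directed[("A", "B")], directed[("B", "A")])
print(undirected[frozenset(("A", "B"))])
```

In the directed representation, the edge from A to B and the edge from B to A carry separate weights, whereas the undirected representation merges them into a single symmetric weight, so the information about which telephone number initiated the calls is no longer recoverable.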
Measuring proximity between nodes in a graph is a fundamental problem in graph mining. Typically, the proximity between nodes is a measure of similarity or distance between the entities represented by the nodes. However, conventional methods for measuring proximity are designed for undirected graphs. While such conventional proximity measurements can be applied to directed graphs, such proximity measurements do not account for the directional information in directed graphs.