Graph analysis is a subfield of data analysis that encompasses systems and methods for analyzing datasets modelled as graphs. A graph in this context represents an underlying dataset that is organized into a set of data entities and connections. The data entities are referred to as nodes or vertices of the graph, and the connections between data entities are referred to as edges of the graph. Other information in the underlying dataset may be encoded as node or edge properties. Using this model, a graph may capture fine-grained, arbitrary relationships between different data entities within the underlying dataset. Graphs can be used to model a wide variety of systems and relationships including, without limitation, communication networks, linguistic structures, social networks, data hierarchies, and other physical or virtual systems. For instance, a node within a graph may represent a person in the underlying dataset, with node properties representing social security number, name, address, etc. The edges may represent connections between people, with edge properties capturing the strength of the connection, the source of the connection, etc. Other entities and connections may also be represented depending on the particular application. By analyzing relationships captured by a graph, data scientists, applications, or other users can obtain valuable insights about the original dataset.
Computer-implemented processes for performing graph analysis are generally not computation-intensive but may be significantly memory-bound. For example, some operations may be performed by traversing nodes of a graph and performing simple comparisons. Although the processing overhead for these operations may be relatively small, some datasets have a large number of nodes to analyze. In such scenarios, memory can become a chokepoint, slowing down the performance of such graph analysis operations.
Graph processing systems may use different approaches for structuring node and edge properties (collectively referred to herein as “graph properties”). According to one such approach, graph properties are stored in a row-oriented format, where different rows correspond to different nodes or edges, and the attributes for each row represent different graph properties. Within memory, the graph properties that belong to a node or edge are stored contiguously. TABLE 1 below illustrates an example data structure that organizes node properties according to a row-oriented format.
TABLE 1
SAMPLE NODE PROPERTIES IN ROW-ORIENTED FORMAT

    struct node_property {
      long employer_id
      string name
      string address
      int base_salary
      ...
    }
Referring to TABLE 1, the node properties, including “employer_id”, “name”, “address”, and “base_salary”, are stored contiguously in memory for a node. These memory entries may be followed by the node properties for the next node in the graph.
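The row-oriented layout of TABLE 1 can be sketched in C++ as an array of records. This is a minimal sketch, not a definitive implementation: the type name `NodeProperty`, the helper functions, and the use of fixed-width integer types in place of `long` and `int` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row-oriented record modelled on TABLE 1: all properties of
// one node sit next to each other in memory.
struct NodeProperty {
    std::int64_t employer_id;
    std::string name;
    std::string address;
    std::int32_t base_salary;
};

// Reads a single property of a single node.
std::int32_t base_salary_of(const std::vector<NodeProperty>& nodes,
                            std::size_t node_id) {
    assert(node_id < nodes.size());
    return nodes[node_id].base_salary;
}

// Sums one property across all nodes. Each step advances by a whole record,
// so the unrelated fields (name, address) are skipped over in memory even
// though they are never read.
std::int64_t total_base_salary_rows(const std::vector<NodeProperty>& nodes) {
    std::int64_t total = 0;
    for (const NodeProperty& n : nodes) {
        total += n.base_salary;
    }
    return total;
}
```

A scan such as `total_base_salary_rows` reads only one field per record, yet the stride between consecutive reads is the full record size, which is what makes this layout less attractive for single-property scans.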
According to another approach, graph properties are stored in a column-oriented format. In this approach, a set of property vectors are defined, where each property vector contiguously stores values spanning multiple nodes or edges for a respective graph property. TABLE 2 below illustrates an example data structure that organizes node properties in column-oriented format.
TABLE 2
SAMPLE NODE PROPERTIES IN COLUMN-ORIENTED FORMAT

    std::vector<long> node_prop_employer_id
    std::vector<string> node_prop_name
    std::vector<string> node_prop_address
    std::vector<int> base_salary
    ...

Referring to TABLE 2, the property vector “node_prop_employer_id” stores a set of employer_id values from different nodes contiguously in memory. Similarly, the property vectors “node_prop_name”, “node_prop_address”, and “base_salary” store contiguous values for the corresponding node properties.
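A column-oriented layout along the lines of TABLE 2 might look as follows in C++. The wrapper struct `NodePropertyColumns`, its `add_node` helper, and the function `total_base_salary_columns` are hypothetical names introduced here; the sketch only assumes that a node's index is the same in every property vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical column-oriented store: one vector per property, indexed by
// node id, so values of the same property are contiguous in memory.
struct NodePropertyColumns {
    std::vector<std::int64_t> employer_id;
    std::vector<std::string> name;
    std::vector<std::string> address;
    std::vector<std::int32_t> base_salary;

    // Appends one node to every column and returns the new node's index.
    std::size_t add_node(std::int64_t emp, std::string n, std::string a,
                         std::int32_t salary) {
        // All columns must stay the same length so indices line up.
        assert(employer_id.size() == name.size() &&
               name.size() == address.size() &&
               address.size() == base_salary.size());
        employer_id.push_back(emp);
        name.push_back(std::move(n));
        address.push_back(std::move(a));
        base_salary.push_back(salary);
        return base_salary.size() - 1;
    }
};

// Sums one property across all nodes by walking a single contiguous vector.
std::int64_t total_base_salary_columns(const NodePropertyColumns& cols) {
    std::int64_t total = 0;
    for (std::int32_t s : cols.base_salary) {
        total += s;
    }
    return total;
}
```

Here the scan touches only the `base_salary` vector, so consecutive reads fall on consecutive memory locations.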
According to another approach, node properties are structured using a key-value store. In this approach, properties are represented as a general key-value map for each node or edge. TABLE 3 below illustrates an example data structure that organizes node properties using a key-value mapping.
TABLE 3
SAMPLE KEY-VALUE MAPPING FOR NODE PROPERTIES

    struct node_props {
      std::map<string, void*> property_map;
    }
    ...

Referring to TABLE 3, key-value pairings for the node are arbitrarily defined according to “property_map”. The key-value pairings map a property, which acts as the key, to a corresponding property value. The key-value pairings may be stored contiguously in memory.
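The key-value approach of TABLE 3 can be sketched as below. The struct `NodeProps` and the lookup helper `int_property_or` are hypothetical; the sketch keeps the untyped pointer values shown in TABLE 3, which means the caller is assumed to know the type stored under each key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-node property map modelled on TABLE 3. Values are
// untyped pointers, so the store itself imposes no schema on a node.
struct NodeProps {
    std::map<std::string, const void*> property_map;
};

// Looks up `key` and interprets the stored value as an int. Returns
// `fallback` when the node has no such property. The caller is responsible
// for knowing that the value under `key` really is an int.
int int_property_or(const NodeProps& node, const std::string& key,
                    int fallback) {
    auto it = node.property_map.find(key);
    if (it == node.property_map.end()) {
        return fallback;
    }
    assert(it->second != nullptr);
    return *static_cast<const int*>(it->second);
}
```

Each access involves a keyed lookup followed by a pointer dereference, which illustrates how the flexibility of this approach can come at the expense of predictable memory access.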
The approaches described above involve various tradeoffs when applied to graph analysis procedures. If only a single node property is involved in a particular procedure, then the column-oriented approach allows different values of the property to be read from consecutive memory locations. In such scenarios, the column-oriented approach may yield significant improvements in memory bandwidth over the row-oriented and key-value approaches. On the other hand, the row-oriented approach may improve memory access times if a particular procedure accesses all of the properties of a row. The key-value approach allows for greater flexibility in defining the properties of a node or edge, but may suffer in memory performance due to the lack of structure.
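The bandwidth difference for a single-property scan can be estimated with some simple arithmetic. The sketch below is a rough model under stated assumptions, not a measurement: the 64-byte cache line, the function name `cache_lines_touched`, and the record sizes used in the usage note are all illustrative choices.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size; 64 bytes is common, but it is hardware-dependent.
constexpr std::size_t kCacheLineBytes = 64;

// Estimates how many cache lines are touched when one field of `field_bytes`
// is read from each of `count` records laid out `stride_bytes` apart.
// Assumes the first record is cache-line aligned and that no field straddles
// two cache lines.
std::size_t cache_lines_touched(std::size_t count, std::size_t stride_bytes,
                                std::size_t field_bytes) {
    assert(field_bytes > 0 && field_bytes <= stride_bytes);
    if (count == 0) {
        return 0;
    }
    // Records at least one line apart: every field sits on its own line.
    if (stride_bytes >= kCacheLineBytes) {
        return count;
    }
    // Otherwise several fields share a line; count the lines spanned.
    const std::size_t span = (count - 1) * stride_bytes + field_bytes;
    return (span + kCacheLineBytes - 1) / kCacheLineBytes;
}
```

Under this model, scanning a 4-byte property over 1,000,000 nodes touches 62,500 cache lines in a column-oriented layout (stride of 4 bytes), compared with 1,000,000 cache lines in a row-oriented layout whose records are assumed to be 80 bytes each.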
For certain graph analysis procedures, none of the approaches described above yield significant improvements with respect to memory performance. Generally, this scenario occurs when the graph analysis procedure accesses only a subset of the properties that belong to a graph object, and the accesses occur multiple times in a non-sequential manner. TABLE 4 below depicts an example algorithm where multiple, non-sequential property accesses occur.
TABLE 4
SAMPLE ALGORITHM WITH MULTIPLE ACCESSES OF DIFFERENT NODE PROPERTIES

    ...
    while (some_condition) {
      foreach(n: G.nodes)
        foreach(t: n.Nbrs)
          n.foo = t.bar1 + t.bar2;
      ...
    }
    ...

In the example depicted in TABLE 4, the properties are located in different memory locations. Therefore, accessing the values of two properties (even for the same node) results in two reads to two non-consecutive memory addresses. Even with the row-oriented approach, there is no guarantee that the two properties, “bar1” and “bar2”, are consecutive in memory or located in the same cache line. Consequently, a central processing unit executing the above algorithm may require reads of two separate cache lines to access “bar1” and “bar2”. Multiple reads contribute to a memory bottleneck, especially when the expression is executed repeatedly across different nodes or edges.
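One pass of the inner loops of TABLE 4 might be written in C++ as follows, assuming a column-oriented layout and an adjacency-list representation of the graph. The `Graph` struct, its field names, and the function `update_foo` are hypothetical; the outer `while (some_condition)` loop and the elided statements of TABLE 4 are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical graph with adjacency lists and column-oriented properties.
struct Graph {
    std::vector<std::vector<std::size_t>> nbrs;  // nbrs[n] = neighbor ids of n
    std::vector<int> foo;                        // one value per node
    std::vector<int> bar1;                       // one value per node
    std::vector<int> bar2;                       // one value per node
};

// Performs n.foo = t.bar1 + t.bar2 for every neighbor t of every node n.
// As in TABLE 4, each neighbor overwrites n.foo, so the last neighbor wins.
void update_foo(Graph& g) {
    assert(g.foo.size() == g.nbrs.size() &&
           g.bar1.size() == g.nbrs.size() &&
           g.bar2.size() == g.nbrs.size());
    for (std::size_t n = 0; n < g.nbrs.size(); ++n) {
        for (std::size_t t : g.nbrs[n]) {
            // Neighbor ids are generally not sequential, so bar1[t] and
            // bar2[t] are two reads at unrelated addresses in two arrays.
            g.foo[n] = g.bar1[t] + g.bar2[t];
        }
    }
}
```

Because `bar1` and `bar2` live in separate vectors, each evaluation of the expression reads from two distinct regions of memory, and the neighbor index `t` jumps around within each region.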
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.