Field of the Disclosure
This disclosure relates generally to graph analytics, and more particularly to systems and methods for performing graph processing operations using mutable multilevel data structures to represent the target graph structures.
Description of the Related Art
The importance of graph analytics is growing, but so are the size and richness of the data being analyzed. In addition, graph analytics and other graph processing computations are typically highly dynamic processes, in that both the graph structure and its node and/or edge properties can change, sometimes quite rapidly. The Compressed Sparse Row (CSR) representation, which is one of the most popular data structures used today for representing graphs in in-memory analytics applications, is ill-suited to graphs of this dynamic nature.
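By way of illustration only, a CSR layout may be sketched as two arrays: an offsets array indexed by vertex and a targets array holding the concatenated adjacency lists. The function names below (`build_csr`, `neighbors`, `insert_edge`) are hypothetical and do not come from any particular system. The sketch assumes directed edges and integer vertex identifiers, and it shows why a single edge insertion is costly in such a layout.

```python
from typing import List, Tuple


def build_csr(num_vertices: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (offsets, targets) from a list of directed edges."""
    # Count the out-degree of each vertex, shifted by one slot.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # A prefix sum turns the counts into start positions.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Scatter each destination into its source vertex's slice.
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets is not modified
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets: List[int], targets: List[int], v: int) -> List[int]:
    """Return the adjacency list of v as a contiguous slice."""
    return targets[offsets[v]:offsets[v + 1]]


def insert_edge(offsets: List[int], targets: List[int], src: int, dst: int) -> None:
    """Insert one edge in place; cost grows with the size of the whole graph."""
    # Every target after src's slice must shift by one position ...
    targets.insert(offsets[src + 1], dst)
    # ... and every later vertex's offset must be incremented.
    for v in range(src + 1, len(offsets)):
        offsets[v] += 1
```

In this sketch, reading an adjacency list is a single contiguous slice, whereas inserting one edge touches, in the worst case, every entry of both arrays.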
Attempts have been made to create mutable CSR-based representations, such as delta maps and log-structured representations, but each of these has its shortcomings. Delta maps, which store modifications in a write-optimized representation, either slow down a computation (e.g., if the computation accesses the write-optimized representation) or require rebuilding the entire representation (e.g., to obtain up-to-date results without accessing the slow write-optimized representation). Log-structured approaches take up large amounts of memory and require periodic merges to cap memory usage, which forces a space/time trade-off.
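The delta-map trade-off described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any existing system: the class name `DeltaMapGraph` and its methods are hypothetical, the graph is assumed to be a simple directed graph (no parallel edges), and a Python dict and set stand in for the write-optimized representation.

```python
from typing import Dict, List, Set, Tuple


class DeltaMapGraph:
    """Read-optimized CSR base plus a write-optimized delta (hypothetical sketch)."""

    def __init__(self, offsets: List[int], targets: List[int]) -> None:
        self.offsets = offsets                      # CSR offsets of the base graph
        self.targets = targets                      # CSR targets of the base graph
        self.added: Dict[int, List[int]] = {}       # pending edge insertions
        self.removed: Set[Tuple[int, int]] = set()  # pending deletions of base edges

    def _base(self, src: int) -> List[int]:
        return self.targets[self.offsets[src]:self.offsets[src + 1]]

    def add_edge(self, src: int, dst: int) -> None:
        if (src, dst) in self.removed:
            self.removed.discard((src, dst))        # un-delete a base edge
        else:
            self.added.setdefault(src, []).append(dst)

    def remove_edge(self, src: int, dst: int) -> None:
        pending = self.added.get(src, [])
        if dst in pending:
            pending.remove(dst)                     # cancel a pending insertion
        elif dst in self._base(src):
            self.removed.add((src, dst))            # mask a base edge

    def neighbors(self, src: int) -> List[int]:
        # Slow path: every read must consult the delta as well as the base.
        live = [d for d in self._base(src) if (src, d) not in self.removed]
        return live + self.added.get(src, [])

    def rebuild(self) -> None:
        # Fast reads are restored only by rewriting the entire base.
        new_offsets, new_targets = [0], []
        for v in range(len(self.offsets) - 1):
            new_targets.extend(self.neighbors(v))
            new_offsets.append(len(new_targets))
        self.offsets, self.targets = new_offsets, new_targets
        self.added.clear()
        self.removed.clear()
```

In this sketch, `neighbors` illustrates the slowdown incurred when a computation reads through the delta, and `rebuild` illustrates the full reconstruction needed to avoid it.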
Data structures other than CSR-based representations have been used to represent graph structures. For example, systems that prioritize updates and/or that are optimized for online transaction processing (OLTP) workloads or OLTP-like workloads typically store adjacency lists in separate arrays. One existing system stores the adjacency lists in separate objects called “cells,” which the system stores compactly back-to-back in a circular buffer. This system grows the cells by reallocating them at the head of the buffer, and it compacts the buffer by periodically moving objects from the tail to the head. By providing fast random access to the adjacency lists, this system supports low-latency online queries in addition to efficient offline analytics, the latter using a vertex-centric computational model. It optimizes message passing by identifying “hub” vertices, which have a large number of connections.
Another existing system represents graphs using compressed, partitioned bitmaps, which enable the database to efficiently combine multiple adjacency lists using set operations. In this system, the bitmaps are chunked into 32-bit words, and all nonzero chunks are stored in a balanced tree that maps the offset of a chunk to the chunk itself. The result is a compact data structure optimized for out-of-core computation with reasonable in-memory performance, but it is not as fast as dedicated in-memory systems. Dense graphs, in which the number of edges approaches the square of the number of vertices, are often stored using an uncompressed adjacency matrix. However, real-world graphs on which complex analytics are performed are rarely that dense. Yet another existing system enables computation on a distributed, constantly changing graph by providing a snapshotting method based on vector clocks. It enables incremental computation by updating the computation results based on recent changes in the graph as reflected in new snapshots.
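The chunked-bitmap representation described above, in which adjacency lists are combined using set operations, may be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the system in question: the function names are hypothetical, a Python dict keyed by chunk index stands in for the balanced tree, and no compression is applied beyond omitting all-zero chunks.

```python
from typing import Dict, Iterable, List

WORD_BITS = 32  # chunk width, as in the 32-bit words described above


def to_chunks(neighbors: Iterable[int]) -> Dict[int, int]:
    """Encode an adjacency list as a map from chunk index to a 32-bit word."""
    chunks: Dict[int, int] = {}
    for v in neighbors:
        idx, bit = divmod(v, WORD_BITS)
        chunks[idx] = chunks.get(idx, 0) | (1 << bit)
    return chunks  # all-zero chunks are never stored


def intersect(a: Dict[int, int], b: Dict[int, int]) -> Dict[int, int]:
    """Intersect two adjacency lists with one bitwise AND per shared chunk."""
    result: Dict[int, int] = {}
    for idx, word in a.items():
        common = word & b.get(idx, 0)
        if common:  # keep only nonzero chunks
            result[idx] = common
    return result


def to_vertices(chunks: Dict[int, int]) -> List[int]:
    """Decode the chunks back into a sorted list of vertex identifiers."""
    out: List[int] = []
    for idx in sorted(chunks):
        word = chunks[idx]
        for bit in range(WORD_BITS):
            if (word >> bit) & 1:
                out.append(idx * WORD_BITS + bit)
    return out
```

For example, under these assumptions, the common neighbors of two vertices can be obtained as `to_vertices(intersect(to_chunks(list_a), to_chunks(list_b)))`, where `list_a` and `list_b` are the two adjacency lists.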