A logical graph is a powerful data model for data analytics because it allows analysis of the relationships between data entities, which are represented as vertices connected by edges. A graph query is a pattern-based request to find the subset of the graph that satisfies (matches) the query. For example, a graph may be a continental road atlas, and a graph query may be a request to find all driving routes that stretch from the east coast to the west coast of the continent.
In graph theory, a super node is a vertex that has a disproportionately high number of incident edges. In practice, most large graphs have some super nodes. For example, within a social graph (network), an ordinary person may be connected to a few people, whereas a famous or important person may be connected to thousands of people.
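The notion of a disproportionately high edge count can be made concrete with a small sketch. The threshold rule used here (a degree more than `factor` times the mean degree) is an assumption chosen for illustration, as are the function names and the sample data; the text above does not prescribe any particular cutoff.

```python
from collections import Counter


def vertex_degrees(edges):
    """Count incident edges per vertex for an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def find_super_nodes(edges, factor=3.0):
    """Return vertices whose degree exceeds `factor` times the mean degree.

    The threshold rule is an illustrative assumption, not a standard definition.
    """
    degrees = vertex_degrees(edges)
    if not degrees:
        return set()
    mean_degree = sum(degrees.values()) / len(degrees)
    return {v for v, d in degrees.items() if d > factor * mean_degree}


# Hypothetical social graph: "celebrity" is connected to eight fans,
# while each fan is connected to at most one other fan.
fans = [f"fan{i}" for i in range(8)]
edges = [("celebrity", fan) for fan in fans]
edges += [("fan0", "fan1"), ("fan2", "fan3")]

super_nodes = find_super_nodes(edges)
```

In this toy graph only the `celebrity` vertex is flagged. In a real graph the cutoff would likely be tuned to the observed degree distribution.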
Another example is a call graph that represents calls to functions within a software codebase. A call graph often has a very large number of invocations (edges) that connect to functions (vertices) of a standard library or to functions that implement a core functionality of a software application.
When processing graph queries, however, a significant portion of execution time, space (memory), and energy may be spent processing the few super nodes of a graph. These costs may be aggravated because graph query processing may repeatedly visit the same vertices. Each visit to a super node may be very costly because a super node has so many edges and neighboring vertices.
Industry has attempted to solve this problem by creating indices over the graph data that facilitate rapid access to the incident edges that are relevant to a particular query. Typically, such an index groups the incident edges of a vertex by edge label, so that a query that traverses only edges having a specific label need not examine the other edges.
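A minimal sketch of such a label index follows. It assumes edges are given as hypothetical `(source, label, destination)` tuples and uses nested dictionaries. It illustrates the grouping idea only and does not reproduce any particular product's index structure.

```python
from collections import defaultdict


def build_label_index(edges):
    """Map each source vertex to {edge label: [destination vertices]}."""
    index = defaultdict(lambda: defaultdict(list))
    for src, label, dst in edges:
        index[src][label].append(dst)
    return index


def neighbors_by_label(index, vertex, label):
    """Return neighbors reached over edges with `label`, skipping all other edges."""
    return index.get(vertex, {}).get(label, [])


# Hypothetical labeled edges of a small social graph.
labeled_edges = [
    ("alice", "follows", "bob"),
    ("alice", "follows", "carol"),
    ("alice", "blocks", "dave"),
    ("bob", "follows", "alice"),
]

label_index = build_label_index(labeled_edges)
followed = neighbors_by_label(label_index, "alice", "follows")
```

The lookup touches only the edges stored under the requested label, which is the benefit for a super node. The same structure gives no help for a predicate on anything other than the label.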
A problem with such an approach is that it is too inflexible for arbitrary queries, which may have all kinds of predicates over edges and vertices and not just predicates over edge labels. Each predicate requires a separate index. However, the kinds of predicates that a user will specify in a graph query may be unpredictable, such as with ad hoc queries. In theory, indices for all possible predicate types may be exhaustively created. However, this is impractical because indices consume a lot of space (memory or disk) and take time to build.
In a property graph data model, edges and vertices may have labels and an arbitrary number of properties, which may be implemented as name-value pairs. Predicates may involve many properties and labels. Indeed, there may be a combinatorial (polynomial) number of potential predicates for a particular graph. This may thwart scalability and make the index approach infeasible in practice. Another problem with the index approach is that if a graph mutates (changes), then the indices may need updating, which is expensive.
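To illustrate why a label index does not cover such predicates, the sketch below models edges with name-value properties and evaluates an ad hoc predicate by scanning every incident edge. The `Edge` class, the property names `weight` and `since`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """An edge of a property graph: a label plus arbitrary name-value properties."""
    src: str
    dst: str
    label: str
    properties: dict = field(default_factory=dict)


def matching_edges(incident_edges, predicate):
    """Scan every incident edge and keep those that satisfy the predicate."""
    return [e for e in incident_edges if predicate(e)]


# Hypothetical incident edges of a vertex named "hub".
incident = [
    Edge("hub", "a", "calls", {"weight": 5, "since": 2019}),
    Edge("hub", "b", "calls", {"weight": 1, "since": 2021}),
    Edge("hub", "c", "imports", {"weight": 7, "since": 2020}),
]

# An ad hoc predicate over two properties; an index keyed only by label cannot answer it.
heavy_recent = matching_edges(
    incident,
    lambda e: e.properties.get("weight", 0) > 3
    and e.properties.get("since", 0) >= 2020,
)
```

Without an index built for this exact combination of properties, the scan cost grows with the number of incident edges, which is the case that hurts most at a super node.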