Graph analysis is a recently popularized methodology in data analytics. In graph analysis, a dataset is represented as a graph where data entities become vertices, and relationships between them become edges of the graph. Through this graph representation, it may be tractable to analyze fine-grained relationships between data entities.
While graph analysis may need in-memory execution for quick analysis of large data, graph-processing systems often provide a distributed execution mode as well. That is, the large graph data is loaded into the aggregated (though not necessarily shared) memory of multiple computers, with each computer loading only a portion of the data.
A typical workflow for a distributed graph processing system may involve these steps:
1. Load graph data from external sources such as files, databases, network connections, etc.
2. Transform the data into an internal data representation for the distributed graph processing system
3. Perform some form of distributed processing on the data such as algorithmic analysis
4. Transform the output data from the internal data representation to the external data representation and present and/or save the results for the user
A problem may be that steps 1-2 usually take too long for large graphs. Furthermore, because steps 3-4 are possibly executed multiple times, step 2 may be performed in a way that optimizes the internal representation to accelerate steps 3-4. For example, graph data may be distributed to available computers to horizontally balance the workload of steps 3-4.
Note that a graph processing system may have its own internal representation of graph data, which may involve assigning a unique internal identifier to each vertex of the graph. For accelerated access, the graph processing system may encode the location of a vertex's data within the internal identifier of the vertex.
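As an illustration only, the following minimal sketch shows one way a vertex's location could be encoded within its internal identifier. The bit layout, the field width `LOCAL_BITS`, and the function names are assumptions made for this example and are not taken from any particular graph processing system.

```python
from typing import Tuple

# Hypothetical layout: the high bits hold the owning computer, and the low
# LOCAL_BITS bits hold the vertex's slot in that computer's local storage.
LOCAL_BITS = 40  # assumed width of the local index field


def encode_vertex_id(machine: int, local_index: int) -> int:
    """Pack the owning machine and the local slot into one integer."""
    if machine < 0:
        raise ValueError("machine must be non-negative")
    if not 0 <= local_index < (1 << LOCAL_BITS):
        raise ValueError("local_index does not fit in LOCAL_BITS")
    return (machine << LOCAL_BITS) | local_index


def decode_vertex_id(internal_id: int) -> Tuple[int, int]:
    """Recover (machine, local_index) with a shift and a mask, no lookup table."""
    return internal_id >> LOCAL_BITS, internal_id & ((1 << LOCAL_BITS) - 1)
```

With such a layout, any computer holding an internal identifier may determine where the vertex's data resides by decoding the identifier, without consulting a directory.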
However, there may be challenges for building or maintaining the internal identifier of a vertex:
1. Available computers should agree on the internal identifier of the vertex. This may be aggravated if multiple computers simultaneously load graph data.
2. The graph data contains the external (natural or otherwise original) identifier of the vertex rather than the internal identifier.
3. Internal and external identifiers may subsequently be expected in different contexts after loading.
Some of these problems were partially solved by having all available computers load the whole graph. However, that approach may have fundamental problems such as loading latency, memory exhaustion, and coherence.
Another solution was to apply a global hash function to vertices to decide which machine should solely load and exclusively own a vertex. However, this approach may have fundamental problems such as network latency, network congestion, and workload imbalance. The number of vertices per computer is more or less arbitrarily determined by the hash function instead of by requirements of the analysis application or graph system. If the hash function distributes the vertices unevenly, then system throughput may degrade. Furthermore, if each computer only reads its own portion of the graph data, then communication may be irregular, which may aggravate contention. For example, hashed access to an underlying (e.g., original) data store may be random access, which is irregular and may be difficult to perform in bulk. Furthermore, hashing does not simplify generation of internal identifiers.
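The hash-based ownership scheme described above can be sketched as follows. This is a hypothetical illustration: the choice of SHA-256, the use of string external identifiers, and the function names `owner_of` and `vertices_per_machine` are assumptions for the example. It shows that the per-computer vertex counts fall out of the hash function rather than being chosen by the application.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def owner_of(external_id: str, num_machines: int) -> int:
    """Map an external vertex identifier to the one machine that owns it."""
    digest = hashlib.sha256(external_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_machines


def vertices_per_machine(external_ids: Iterable[str], num_machines: int) -> List[int]:
    """Count how many vertices the hash assigns to each machine."""
    counts = Counter(owner_of(v, num_machines) for v in external_ids)
    return [counts.get(m, 0) for m in range(num_machines)]


if __name__ == "__main__":
    ids = [f"vertex-{i}" for i in range(1000)]
    # The resulting counts are whatever the hash produces; nothing here lets
    # the application request a particular balance.
    print(vertices_per_machine(ids, 4))
```

Note also that `owner_of` yields only the owning machine. It does not yield a position within that machine's storage, which is consistent with the observation that hashing does not simplify generation of internal identifiers.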