This disclosure relates to a method for backfilling graph structure and to articles comprising the same. In particular, this disclosure relates to a method for generating new graph-based data structures and adding them to existing graph-based data structures based on queries posed by users of the system over a period of time.
Graph-based data systems provide information in the form of nodes and edges in a wide variety of data systems such as, for example, those used in hospitals, police and detective databases, university systems, employment databases, city service databases, and the like. Graph-based data was once thought of as a fallback option for data that could not be manipulated into a relational data system. However, graph-based data structures and graph-based data systems are now emerging as the preferred storage method, not only for overtly networked systems, such as social networks and citation networks, but also for biological systems, traffic patterns and, more broadly, human knowledge in general.
Graph-based data structures are therefore emerging as an intuitive and flexible means of encoding a wide range of information. From hospital data systems to social networks, notions of interaction, correlation, and influence are increasingly being represented by nodes and edges.
FIG. 1(A) shows how moving along the nodes and edges of a graph-based structure can be used to support a variety of different queries that users may need to execute. The term “graph-based” data structure as used herein refers to a data structure comprised of nodes and edges. Nodes represent entities such as people, businesses, accounts, or any other item one might want to keep track of. Edges are the lines that connect nodes to other nodes or, alternatively, nodes to properties, and they represent the relationship between the two items they connect. Meaningful patterns emerge when examining the connections and interconnections of nodes, properties, and edges.
For example, in the data system for a large hospital, a doctor 102 might want to find out which other doctors his patients are seeing. By first finding himself 102 in the data system (see step (a)), he can then pivot out to all of the patients 104 associated with him (see step (b)), and then pivot back to all of the doctors 106 and 108 associated with those patients (see step (c)). A pivot is the process of selecting an initial set of seed nodes (in this case “the doctor 102”) in the graph, and then swinging out to the neighboring nodes (in this case “all of the patients 104”) that are connected to them. This produces subgraph data consisting of both the seed node (102) and the neighbor nodes (104). The term “pivot” comes from the fact that this operation can be chained, with the neighbor nodes (104) from the previous step serving as the seed nodes in the present step to determine another set of neighbor nodes (106 and 108).
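The chained pivot described above can be illustrated with a minimal sketch. This is not the implementation of the disclosure; it assumes an undirected graph held in an in-memory adjacency map, and the class name `Graph`, the methods `add_edge` and `pivot`, and the node labels (for example `doctor_102`, `patient_104a`) are hypothetical names chosen only to mirror the reference numerals of FIG. 1.

```python
from collections import defaultdict


class Graph:
    """Hypothetical undirected graph stored as an adjacency map."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_edge(self, a, b):
        # An edge records a relationship between two nodes in both directions.
        self._adj[a].add(b)
        self._adj[b].add(a)

    def pivot(self, seeds):
        """Return the nodes adjacent to any seed node, excluding the seeds."""
        seeds = set(seeds)
        neighbors = set()
        for seed in seeds:
            neighbors |= self._adj.get(seed, set())
        return neighbors - seeds


# Assumed example data: doctor 102, two patients 104, other doctors 106 and 108.
g = Graph()
g.add_edge("doctor_102", "patient_104a")   # first set of edges
g.add_edge("doctor_102", "patient_104b")
g.add_edge("patient_104a", "doctor_106")   # second set of edges
g.add_edge("patient_104b", "doctor_108")

# Step (a): locate the seed node. Step (b): pivot out to the patients.
seed = {"doctor_102"}
patients = g.pivot(seed)

# Step (c): pivot again, using the patients as the new seed nodes,
# then drop the original doctor to leave only the other doctors.
other_doctors = g.pivot(patients) - seed
```

In this sketch each call to `pivot` returns only the neighbor nodes; the subgraph data mentioned above would be the union of the seeds and the returned neighbors. Removing the original seed after the second pivot is a design choice made here so that the result answers the doctor's question directly.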
In FIG. 1, the lines (shown in bold) connecting the doctor 102 to his patients 104 are called edges. These bold lines represent a first set of edges 105. The lines connecting the patients 104 to their other doctors 106 and 108 are shown as dotted lines, and these represent a second set of edges 107. While the exemplary graph shown in FIG. 1 helps the doctor determine which other doctors serve some of his patients, not all users of the system will find the graph-based data system easy to access. Even fewer will find the means (e.g., querying the system using keywords approved by the system) to access the structure and to obtain all of the information available to them from the structure.
The overall utility of a graph-based data structure database can depend heavily on how well the data abstraction matches the queries that will ultimately be run against it. Abstraction is defined by the level of complexity at which a person interacts with the structure. A mismatch between the abstraction and the queries often arises because the people who determine the abstractions for the graph-based data structures are technology experts, not experts in the data itself or the people who will be working most directly with the data (domain experts).
The people creating the abstractions might choose an abstraction that does not fit with the tasks and queries that need to be accomplished. Or perhaps those tasks and queries will change over time, and the abstraction simply goes out of date. In either case, changing the underlying data abstraction can involve reloading the entire data structure or executing complex queries that require close coordination between technology experts and the domain experts.
It is therefore desirable to have a graph-based data structure in which abstractions are developed and continuously improved based on the type and population of queries that users submit to the system over time, and not just by the people who create the abstractions.