Relational databases work well for processing transactions, using the query language SQL to pull together data from columns in multiple separate tables full of transaction numbers. But they do not work well for pulling together “nodes” from a network of “edges”, like Facebook's Social Graph. That task requires “many-to-many joins”, which require creating “junction tables”, and it requires writing SQL queries that are recursive across these junction tables, making these queries complex to write and slow to execute.
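As a minimal sketch of the difficulty described above, the following Python snippet uses the standard-library sqlite3 module to store “made with” links in a junction table and to walk them with a recursive query. The table names, column names, and the three sample rows are assumptions chosen for illustration; they are not taken from any particular product or from FIG. 1.

```python
import sqlite3

# Hypothetical schema: one table of nodes and one junction table of links.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE made_with (whole INTEGER, part INTEGER);  -- junction table
    INSERT INTO node VALUES (1, 'Notebook PC'), (2, 'CPU'), (3, 'Chip tester');
    INSERT INTO made_with VALUES (1, 2), (2, 3);
""")

# Finding everything that node 1 is made with, directly or indirectly,
# needs a recursive common table expression over the junction table.
query = """
WITH RECURSIVE parts(id) AS (
    SELECT part FROM made_with WHERE whole = 1
    UNION
    SELECT m.part FROM made_with m JOIN parts p ON m.whole = p.id
)
SELECT node.name FROM node JOIN parts ON node.id = parts.id
ORDER BY node.name;
"""
names = [row[0] for row in conn.execute(query)]
print(names)
```

Even in this three-row example, the traversal requires a join per hop and a recursive definition, which is the complexity that graph query languages aim to avoid.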
A “graph database” is a kind of “NoSQL” database that uses a graph query language, like the World Wide Web Consortium (“W3C”) standard SPARQL (described in Prud'hommeaux, E. & Seaborne, A. (2007). SPARQL Query Language for RDF: W3C Candidate Recommendation 14 Jun. 2007. Available at http://www.w3.org/TR/rdf-sparql-query/) or the proprietary Cypher (described in De Marzi, M. (2012). Cypher Query Language. Chicago Graph Database Meet-Up) or GraphQL (described in He, H., Singh, A. (2008). Graphs-at-a-time: query language and access methods for graph databases. Proceedings of the 2008 ACM SIGMOD international conference on management of data), to traverse edges without junction tables or recursion, enabling queries that are simple and fast for data structured as a “graph,” such as the one depicted in FIG. 1 as a visualization of nodes in black and edges in color.
The graph structure enables queries to traverse multiple edges in a graph to compile groups of related nodes, such as the Transactions whose product is an Electronic device. If a graph is structured more formally as an “ontology,” then it can also enable “logical inference,” wherein relationship assertions, such as “Notebook PC is a PC” and “PC made with CPU,” enable the system to logically infer additional relationships, such as “Notebook PC made with CPU.” Then if a user changes the assertion that a PC is made with a CPU, for example, the system can automatically change the inference that a notebook PC is made with a CPU. This reduces redundant effort, which can be useful for maintaining a graph over time as the relationships between its nodes change.
Graph database adoption accelerated in 2011 as the ecosystem finally became stable and scalable, and as “Big Data” and “NoSQL” became technology industry buzzwords. Graph database producers include Aduna, Clark & Parsia, Cray, Franz, Kobrix, Neo Technology, Objectivity, Ontotext, Oracle, sones, and SYSTAP, among others.
But as these firms' customers migrate their data into a graph structure, they are struggling to explore it and edit it, despite the proliferation of tools to visualize graphs as nodes and edges. FIG. 1's visualization of nodes and edges, for example, makes it impractical to compare attributes of its nodes, such as to compare Transactions' revenue by product, time, buyer, and seller.
It turns out that for viewing comparisons of nodes by multiple attributes, such as by the products, times, and companies described for the “Transactions” in this graph, tables still have advantages over visualizations of nodes and edges even when data is structured as a graph. And it turns out that viewing graph-structured data in tables presents a host of challenges.
To describe these advantages and challenges, we will describe graphs using the W3C standard Resource Description Framework (“RDF”) (described in Carroll, J. & Klyne, G. (2004). Resource Description Framework (RDF), Concepts and Abstract Syntax: W3C Recommendation 10 Feb. 2004. Available at http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-property) terminology, referring to each link in a graph as a “triple” with three parts: a “subject” node, an “object” node, and a “predicate” linking the subject and object. For example, in the link “PC made with CPU” in the graph in FIG. 1, “PC” is the subject, “made with” is the predicate, and “CPU” is the object. Alternative terms for predicate include “arc”, “edge”, “line”, “link”, and others. Alternative terms for subject and object include “node”, “point”, “vertex”, and others.
For each node, we use the term “attribute” for the predicate and object of each triple whose subject is that node, so for a node like “Intel buys $250M of chip testers from Advantest in 2010” in the graph in FIG. 1, we describe its attributes as “is a Transaction”, “time 2010”, “seller Advantest”, “product Chip tester”, “buyer Intel”, and “revenue $250M”. In some contexts, others describe an attribute by using the term “property”.
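The triple and attribute terminology can be sketched in a few lines of Python. The representation below (a namedtuple called Triple and a helper called attributes) is an assumption made for illustration; the triples themselves are the ones named in the text above.

```python
from collections import namedtuple

# A triple links a subject node to an object node through a predicate.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

TXN = "Intel buys $250M of chip testers from Advantest in 2010"

graph = [
    Triple(TXN, "is a", "Transaction"),
    Triple(TXN, "time", "2010"),
    Triple(TXN, "seller", "Advantest"),
    Triple(TXN, "product", "Chip tester"),
    Triple(TXN, "buyer", "Intel"),
    Triple(TXN, "revenue", "$250M"),
    Triple("PC", "made with", "CPU"),
]

def attributes(node, triples):
    """Return the (predicate, object) pair of every triple whose subject is `node`."""
    return [(t.predicate, t.object) for t in triples if t.subject == node]

for predicate, obj in attributes(TXN, graph):
    print(predicate, obj)
```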
But consistent with RDF, we use the term “property” to refer to the kind of relationship represented by each predicate. For example, the subsumption relation in set theory may be represented by a property called “is a”, and the parthood relation in mereology may be represented by a property called “made with”, so in the graph in FIG. 1, the two predicates in the two triples “PC made with CPU” and “Semiconductor made with Chip tester” are both instances of one property called “made with”.
If one supplements a graph with a formal vocabulary, such as supplementing RDF with the W3C standard OWL2 Web Ontology Language (“OWL2”) (described in Motik, B. et al. (2009). OWL2 Web Ontology Language Profiles: W3C Recommendation 27 Oct. 2009. Available at http://www.w3.org/TR/2009/REC-owl2-profiles-20091027/), then the graph becomes a kind of graph termed an “ontology”, and certain properties like the subsumption relation enable one to “infer” additional information without stating it in the graph. For example, in the graph in FIG. 1, one can infer that “Notebook PC” has the attribute “made with CPU” because it has the attribute “is a PC” and “PC” has the attribute “made with CPU”. If one uses the formal vocabulary to describe the “made with” property as “transitive”, then one can traverse the graph, combining successive predicates in that property to infer that “Notebook PC” also has the attribute “made with Chip tester”. And if one describes the “made with” property as “reflexive”, then it will relate every node to itself, and one can infer that “Notebook PC” also has the attribute “made with Notebook PC”.
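The three kinds of inference described above can be sketched as a small fixed-point computation in Python. This is a simplified illustration, not an OWL2 reasoner: the function name infer_made_with and its rules are assumptions, and the triple “CPU is a Semiconductor” is assumed here to connect the assertions named in the text, since FIG. 1 is not reproduced.

```python
GRAPH = [
    ("Notebook PC", "is a", "PC"),
    ("PC", "made with", "CPU"),
    ("CPU", "is a", "Semiconductor"),  # assumed link, see lead-in
    ("Semiconductor", "made with", "Chip tester"),
]

def infer_made_with(triples, transitive=False, reflexive=False):
    """Return all asserted and inferred (subject, object) pairs of "made with"."""
    is_a = {(s, o) for s, p, o in triples if p == "is a"}
    made = {(s, o) for s, p, o in triples if p == "made with"}
    if reflexive:
        # A reflexive property relates every node to itself.
        nodes = {n for s, _, o in triples for n in (s, o)}
        made |= {(n, n) for n in nodes}
    while True:
        # Subsumption: a node has the "made with" attributes of its superclass.
        new = {(x, z) for x, y in is_a for w, z in made if y == w}
        if transitive:
            # Transitivity: combine successive "made with" predicates.
            new |= {(x, z) for x, y in made for w, z in made if y == w}
        if new <= made:  # nothing new was inferred, so stop
            return made
        made |= new

plain = infer_made_with(GRAPH)
trans = infer_made_with(GRAPH, transitive=True)
refl = infer_made_with(GRAPH, transitive=True, reflexive=True)
print(("Notebook PC", "CPU") in plain)
print(("Notebook PC", "Chip tester") in trans)
print(("Notebook PC", "Notebook PC") in refl)
```

Because the inferred pairs are recomputed from the assertions each time, changing or removing “PC made with CPU” in GRAPH changes the inferences without further editing.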
We use the term “property path” for a combination of different properties that may connect nodes in a graph transitively across more than one triple, and we use a colon to separate properties in a property path. For example, in the graph in FIG. 1, if the property “is a” is transitive, then the property path “product:is a” connects all three Transactions as subjects to “Asset” as object. Extensions to the SPARQL graph query language have been proposed to support queries across not only properties but also property paths, and some graph database producers have implemented these extensions (described in Harris, S. & Seaborne, A. (2012). SPARQL 1.1 Query Language: W3C Proposed Recommendation 8 Nov. 2012. Available at http://www.w3.org/TR/sparql11-query/#propertypaths).
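A property path such as “product:is a” can be evaluated by following one property after another. The Python sketch below is an illustration only; the helper names objects_of and follow_path are hypothetical, and the “is a” chain from “Chip tester” up to “Asset” is an assumption about FIG. 1, which is not reproduced here.

```python
TXN = "Intel buys $250M of chip testers from Advantest in 2010"

GRAPH = [
    (TXN, "product", "Chip tester"),
    ("Chip tester", "is a", "Electronic device"),  # assumed class chain
    ("Electronic device", "is a", "Device"),
    ("Device", "is a", "Asset"),
]

def objects_of(graph, node, prop, transitive=False):
    """Objects reached from `node` over `prop`; follows chains if transitive."""
    found, frontier = set(), {node}
    while frontier:
        step = {o for s, p, o in graph if p == prop and s in frontier} - found
        found |= step
        frontier = step if transitive else set()
    return found

def follow_path(graph, node, path, transitive_props=()):
    """Evaluate a colon-separated property path starting at `node`."""
    current = {node}
    for prop in path.split(":"):
        reached = set()
        for n in current:
            reached |= objects_of(graph, n, prop, prop in transitive_props)
        current = reached
    return current

print(follow_path(GRAPH, TXN, "product:is a"))
print(follow_path(GRAPH, TXN, "product:is a", transitive_props={"is a"}))
```

Under these assumptions, “Asset” is reached only when “is a” is treated as transitive.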
We refer to “cardinality” of a property for a given node to describe the number of attributes that node has with that same property. In the graph in FIG. 1, the node “PC” has cardinality 2 in the “made with” property. We refer to cardinality above 1 as “higher cardinality”.
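Cardinality as defined above amounts to counting the distinct objects a node has for one property. In the Python sketch below, the function name cardinality is hypothetical, and “Display” is an invented second component used only so that “PC” has two “made with” attributes; the text does not say what the second object in FIG. 1 is.

```python
GRAPH = [
    ("Notebook PC", "is a", "PC"),
    ("PC", "made with", "CPU"),
    ("PC", "made with", "Display"),  # hypothetical second component
]

def cardinality(graph, node, prop):
    """Number of distinct objects that `node` has for the property `prop`."""
    return len({o for s, p, o in graph if s == node and p == prop})

def has_higher_cardinality(graph, node, prop):
    """True when the cardinality is above 1."""
    return cardinality(graph, node, prop) > 1

print(cardinality(GRAPH, "PC", "made with"))
print(has_higher_cardinality(GRAPH, "Notebook PC", "is a"))
```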
We refer to “arity” to describe the number of different nodes in a relationship. A triple describes a relationship between 2 nodes, so it has arity of 2 and can be described as a “binary relationship.” But one may wish to examine a relationship between more than 2 nodes, such as between the 4 nodes “2010”, “Advantest”, “Intel”, and “$250M” in the graph in FIG. 1, because these nodes are the objects of the attributes of “Intel buys $250M of chip testers from Advantest in 2010”. We refer to relationships between more than 2 nodes as “higher-arity relationships”. These higher-arity relationships can be useful for making comparisons, but they are difficult to manipulate in a visualization of nodes and edges.
Tables can be useful for manipulating higher-arity relationships. For example, some higher-arity relationships in the graph in FIG. 1 can each be represented as a row in a table that uses 1 page header and more than 2 row headers, as shown in FIG. 2.
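Representing each higher-arity relationship as a table row can be sketched as a pivot from triples to tuples. In the Python sketch below, the function name rows_for, the choice of “is a” as the page header, and the order of the row headers are assumptions; the layout of FIG. 2 may differ. The sketch also assumes cardinality 1 for every header property, so a node with two objects for one property would keep only the last one.

```python
TXN = "Intel buys $250M of chip testers from Advantest in 2010"

GRAPH = [
    (TXN, "is a", "Transaction"),
    (TXN, "time", "2010"),
    (TXN, "seller", "Advantest"),
    (TXN, "product", "Chip tester"),
    (TXN, "buyer", "Intel"),
    (TXN, "revenue", "$250M"),
    ("Notebook PC", "is a", "PC"),
]

def rows_for(graph, page_prop, page_value, headers):
    """One row per node whose `page_prop` attribute has the object `page_value`."""
    subjects = [s for s, p, o in graph if p == page_prop and o == page_value]
    rows = []
    for subject in subjects:
        # Collect this node's attributes (assumes cardinality 1 per property).
        attrs = {p: o for s, p, o in graph if s == subject}
        rows.append(tuple(attrs.get(h) for h in headers))
    return rows

HEADERS = ["time", "seller", "buyer", "product", "revenue"]
table = rows_for(GRAPH, "is a", "Transaction", HEADERS)
for row in table:
    print(row)
```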
The same higher-arity relationships can each be represented as a grid cell in a table that uses 1 page header and more than 2 row headers and column headers, as shown in FIG. 3.
The same higher-arity relationships can be summarized further in a table's grid area by performing an OLAP roll-up operation based on the “is a” property, as shown in FIG. 4.
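A roll-up based on the “is a” property can be sketched as summing a measure over every node that falls under a chosen class. In the Python sketch below, the function names and the class hierarchy are assumptions about FIG. 1, and the second sale (100 for “CPU”) is an invented figure included only so that the roll-up has more than one value to combine; only the 250 for “Chip tester” comes from the text.

```python
IS_A = [
    ("Chip tester", "is a", "Electronic device"),  # assumed hierarchy
    ("CPU", "is a", "Electronic device"),
    ("Electronic device", "is a", "Device"),
    ("Device", "is a", "Asset"),
]

# (product, revenue in $M); the CPU figure is hypothetical.
SALES = [("Chip tester", 250), ("CPU", 100)]

def is_a_closure(graph, node):
    """`node` plus every class reachable from it over "is a" triples."""
    seen, frontier = {node}, {node}
    while frontier:
        frontier = {o for s, p, o in graph if p == "is a" and s in frontier} - seen
        seen |= frontier
    return seen

def roll_up(graph, sales, target_class):
    """Total revenue of the sales whose product falls under `target_class`."""
    return sum(rev for product, rev in sales
               if target_class in is_a_closure(graph, product))

print(roll_up(IS_A, SALES, "Electronic device"))
print(roll_up(IS_A, SALES, "Chip tester"))
```

Drilling down corresponds to calling roll_up with a more specific class.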
Prior art for using tables to manipulate higher-arity relationships in graph-structured data includes designs for browsing these relationships by using “faceted navigation” (described in Hearst, M. (2011). UIs for Faceted Navigation: Recent Advances and Remaining Open Problems. School of Information, UC Berkeley) and by linking graphs to OLAP systems (described in Cao, L. et al. (2002). Ontology-based Integration of Business Intelligence. University of Sydney), as well as designs for editing these relationships by linking graphs to spreadsheets where the relationships can be edited in tables.
However, the prior art does not address numerous issues that can arise when using tables to browse and edit relationships with higher arity and cardinality in graphs and ontologies.
Issue 1:
If one starts from the roll-up in FIG. 4 and drills down, then one will need to drill down through multiple layers that are not informative before one begins to distinguish the different nodes that were rolled up, as shown in FIG. 5.
Issue 2:
Even if one could address Issue 1 by eliminating redundant layers such as “Asset” and “Device” in FIG. 5, it can still be confusing to see redundant information to the right of the expanded cells, such as seeing all the companies to the right of “Electronic device” if one expands “Company”, even though one sees them again for the more-specific products that appear when one expands “Electronic device”. FIG. 6 illustrates this additional redundancy.
Issue 3:
When one selects a table's page headers, row headers, and column headers from properties in a graph, one may want to describe a default superclass for the objects of each property in that table so that new nodes created in the table obtain a superclass typical of objects of that property. This happens automatically to data entered into a table in a relational database, but it does not happen automatically to data entered into a table representing relationships in a graph.
For example, assume one enters a new row at the end of the table shown in FIG. 2, as shown in FIG. 7.
This entry can create a new node that has all of the attributes in this new table row, including that it “is a Transaction”. But the attribute “buyer Genentech” has an object that does not exist, so a “Genentech” node must also be created. And that node will not have any of its own attributes. One may want it to have the attribute “is a Company”, but prior art does not offer a way to do that when one enters a new node into a table that represents higher-arity relationships in a graph.
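The gap described in Issue 3 can be reproduced with a naive row-entry routine. The Python sketch below illustrates the problem rather than any solution; the function name add_row and the node label “New transaction” are hypothetical, and the new row is reduced to the two attributes named in the text because FIG. 7 is not reproduced here.

```python
GRAPH = [
    ("Intel", "is a", "Company"),
    ("Advantest", "is a", "Company"),
]

def add_row(graph, subject, row):
    """Naively add one triple per table cell; return the objects that were new."""
    known = {n for s, _, o in graph for n in (s, o)}
    created = []
    for prop, obj in row.items():
        if obj not in known:
            created.append(obj)  # the node comes into existence only as an object
        graph.append((subject, prop, obj))
    return created

new_nodes = add_row(GRAPH, "New transaction",
                    {"is a": "Transaction", "buyer": "Genentech"})

# "Genentech" now exists, but it is the subject of no triple,
# so it has no attributes, not even "is a Company".
genentech_attributes = [(p, o) for s, p, o in GRAPH if s == "Genentech"]
print(new_nodes, genentech_attributes)
```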
Issue 4:
When one selects a table's page headers, row headers, and column headers, one may want to select at least one header from not only properties in a graph but also property paths. For example, one may want to sort the Transactions in the graph in FIG. 1 by what their products are made with, so one may want to select a row header from the property path “product:made with”, where “:” signifies a node that is an object of a triple whose predicate is the “product” property and is a subject of another triple whose predicate is the “made with” property, as shown in FIG. 8.
This introduces additional issues for data entry in the table. Assume that one enters a new row at the bottom of the table shown in FIG. 8, as shown in FIG. 9.
In addition to Issue 3, here we have another issue due to the use of a property path. Entry of this new table row can create a new node that has all of the attributes in this new table row, including that the new node “is a Transaction”, but it is ambiguous what to do with the attribute “product:made with GPU” because its predicate is a property path. This new table row suggests that the new node should have a “product” attribute whose object should in turn have the attribute “made with GPU”. If the property “made with” is transitive and reflexive, then the nodes “CPU”, “Notebook PC”, and “PC” all meet that criterion. Which node should the new Transaction use? Or should we create a new node for the new Transaction to use?
Issue 5:
FIG. 8 also illustrates that even if one solves Issue 1 and Issue 2, one may still be burdened by redundancy when browsing objects of a transitive property like “made with”. One sees the Transactions with “product Notebook PC” three times and the Transactions with “product CPU” two times. One might like the ability to use OLAP roll-up not just with the “is a” property but also with another transitive property like “made with”, so that one can group the objects of the “product” attribute that are “made with” the same node, view the “buyers”, “sellers”, etc. for that group, and then expand that group to see more-specific groups and remove “buyers”, “sellers”, etc. from the more general group that was expanded. Prior art provides no adequate method to perform this roll-up and expansion with a transitive property other than a subsumption property like “is a”.
Issue 6:
In practice, graphs are typically much larger than the example in FIG. 1, so typically one will need to filter a table for only some objects of at least one of the properties. For example, one may want to see only the Transactions in which Intel participated. When one wishes to do this, the graph in FIG. 1 will prove problematic, because one will only be able to filter the table in FIG. 2 for Intel as a “buyer”, as shown in FIG. 10, or as a “seller”, as shown in FIG. 11, but not as either one in the same table. One will be unable to include both of Intel's two Transactions in a single table, so one will be unable to follow Intel's participation in the value chain of products in this graph.
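The limitation in Issue 6 can be seen with a filter that acts on one header at a time. In the Python sketch below, the function name filter_rows is hypothetical, and “ExampleCo” is a placeholder buyer invented for illustration, since the text does not name the counterparty in the Transaction where Intel is the seller.

```python
ROWS = [
    {"seller": "Advantest", "buyer": "Intel", "product": "Chip tester"},
    {"seller": "Intel", "buyer": "ExampleCo", "product": "CPU"},  # hypothetical row
]

def filter_rows(rows, header, value):
    """Keep the rows whose cell under `header` equals `value`."""
    return [r for r in rows if r[header] == value]

as_buyer = filter_rows(ROWS, "buyer", "Intel")
as_seller = filter_rows(ROWS, "seller", "Intel")

# Each filter returns one of Intel's two Transactions, never both.
print(len(as_buyer), len(as_seller))
```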