Semantic data models allow relationships between resources to be modeled as facts. The facts are often represented as triples that have a subject, a predicate, and an object. For example, one triple may have the subject of “John Smith,” the predicate of “ISA,” and the object of “physician,” which may be represented as
<John Smith, ISA, physician>.
This triple represents the fact that John Smith is a physician. Other triples may be
<John Smith, graduate of, University of Washington>
representing the fact that John Smith graduated from the University of Washington and
<John Smith, degree, MD>
representing the fact that John Smith has an MD degree. Semantic data models can be used to model the relationships between any types of resources such as web pages, people, companies, products, meetings, and so on. One semantic data model, referred to as the Resource Description Framework (“RDF”), has been developed by the World Wide Web Consortium (“W3C”) to model web resources but can be used to model any type of resource. The triples of a semantic data model may be stored in a semantic database.
The triples of a semantic database may be viewed as representing a graph with the subjects and objects as nodes and the predicates as links between nodes. FIG. 1 is a graphical representation of an example semantic database. The example semantic database includes the following triples:
<John Smith, ISA, college graduate>
<John Smith, graduate of, University of Washington>
<John Smith, degree, MD>
<John Smith, parent of, Bob Smith>
<John Smith, ISA, male>
<Bob Smith, child of, John Smith>
<Bob Smith, ISA, male>
<MD, ISA, post-graduate degree>
The graph 100 includes nodes 111-117 representing entities (subjects or objects) of John Smith, Bob Smith, male, college graduate, University of Washington, MD, and post-graduate degree. The links between the nodes are labeled with the corresponding predicate. For example, the link between node 111 for John Smith and node 115 for University of Washington is labeled with “graduate of,” representing the following triple:
<John Smith, graduate of, University of Washington>.
The graph thus includes one link for each triple.
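The graph view of the example triples can be sketched in Python as follows. This is a minimal illustration only; the names `TRIPLES` and `build_graph` are hypothetical and do not come from any particular semantic database product.

```python
# Example triples from the text, stored as (subject, predicate, object) tuples.
TRIPLES = [
    ("John Smith", "ISA", "college graduate"),
    ("John Smith", "graduate of", "University of Washington"),
    ("John Smith", "degree", "MD"),
    ("John Smith", "parent of", "Bob Smith"),
    ("John Smith", "ISA", "male"),
    ("Bob Smith", "child of", "John Smith"),
    ("Bob Smith", "ISA", "male"),
    ("MD", "ISA", "post-graduate degree"),
]


def build_graph(triples):
    """Return (nodes, links) for a list of triples.

    nodes is the set of all subjects and objects; links is a list of
    (source, label, sink) tuples, one directed, labeled link per triple.
    """
    nodes = set()
    links = []
    for subject, predicate, obj in triples:
        nodes.add(subject)
        nodes.add(obj)
        links.append((subject, predicate, obj))
    return nodes, links


nodes, links = build_graph(TRIPLES)
print(len(nodes), "nodes,", len(links), "links")  # 7 nodes, 8 links
```

The eight triples yield seven distinct nodes and eight links, matching the one-link-per-triple property described above.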
Many graph algorithms can be employed to identify various characteristics of and perform various processes on graphs. For example, the graph algorithms may identify subgraphs that are not connected to other subgraphs, spanning trees of the graphs, cliques within the graph, and so on. These graph algorithms typically represent a graph as a matrix, often referred to as an adjacency matrix, with a row and a column for each node and with the elements of the matrix representing links between the nodes. An adjacency matrix may represent the presence of a link between nodes with a non-zero value (e.g., 1) in the corresponding element of the matrix and the absence of a link between nodes with a zero value in the corresponding element of the matrix. The non-zero values may represent weights of the links. An adjacency matrix may represent both graphs that are directed (as in FIG. 1) and graphs that are not directed. For a directed graph, the rows represent source nodes and the columns represent sink nodes.
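As a rough sketch of the adjacency-matrix representation, the following Python builds a directed, unweighted matrix for a four-node subset of the example graph. The node ordering and the function name `adjacency_matrix` are assumptions made for illustration, and the sketch drops the predicate labels, keeping only the presence or absence of a link.

```python
def adjacency_matrix(node_order, links):
    """Build a dense adjacency matrix for a directed graph.

    Rows are source nodes and columns are sink nodes; an element is 1
    when a link exists from the row's node to the column's node, else 0.
    """
    index = {node: i for i, node in enumerate(node_order)}
    n = len(node_order)
    matrix = [[0] * n for _ in range(n)]
    for source, sink in links:
        matrix[index[source]][index[sink]] = 1
    return matrix


# Assumed ordering of a subset of the example nodes (illustrative only).
node_order = ["John Smith", "Bob Smith", "male", "MD"]
directed_links = [
    ("John Smith", "Bob Smith"),   # parent of
    ("Bob Smith", "John Smith"),   # child of
    ("John Smith", "male"),        # ISA
    ("Bob Smith", "male"),         # ISA
    ("John Smith", "MD"),          # degree
]

matrix = adjacency_matrix(node_order, directed_links)
for row in matrix:
    print(row)
```

Because the graph is directed, the matrix is not symmetric: the row for "male" is all zeros even though two links point to that node.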
One well-known algorithm that processes a graph is the PageRank algorithm. The PageRank algorithm generates a ranking of the importance of web pages. The PageRank algorithm represents the World Wide Web as a graph with a node for each web page with links between the nodes representing links between web pages. Because the Web includes hundreds of millions of web pages, the amount of storage needed to store an adjacency matrix representing a graph of the Web is very large (e.g., O(n²) where n is the number of web pages). Many other domains (e.g., physics and bioinformatics) also require vast amounts of storage space to store each element of a matrix representing a graph.
In many of these domains, the matrices representing a graph are sparse—that is, the vast majority of the elements contain the same value (e.g., zero), referred to as a distinguished value. For example, a graph representing the Web is very sparse as each web page typically has links to only a very small fraction of a percent of the total number of web pages. Because most of the elements of such a sparse matrix have the distinguished value, the sparse matrix can be stored in a compressed form by explicitly storing only elements with non-distinguished values along with a mapping of rows and columns to those stored elements. When an algorithm is to access an element at a row and a column of such a sparse matrix, the mapping is checked to determine whether that element is stored—that is, whether the element has a non-distinguished value. If so, the non-distinguished value is retrieved and returned as the value of that row and column. Otherwise, the distinguished value is returned as the value of that row and column. One technique for mapping elements to non-distinguished values is referred to as compressed sparse row (“CSR”) as described below in detail. In some cases, these techniques can reduce the storage space needed to represent the sparse matrix by orders of magnitude. Another technique for mapping elements to non-distinguished values is referred to as an edge list. An edge list is a list of pairs of nodes that have links between them and, if the links are weighted, a weight for each link.
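The lookup behavior described above, in which a stored value is returned when present and the distinguished value is returned otherwise, can be sketched as follows. The class `SparseMatrix` and its dictionary-based mapping are assumptions chosen for brevity; this is neither CSR nor any specific library's implementation. The `edge_list` method shows the same stored elements in edge-list form.

```python
class SparseMatrix:
    """Stores only elements whose value differs from the distinguished value."""

    def __init__(self, n_rows, n_cols, distinguished=0):
        self.n_rows = n_rows
        self.n_cols = n_cols
        self.distinguished = distinguished
        self._stored = {}  # maps (row, col) -> non-distinguished value

    def set(self, row, col, value):
        if value == self.distinguished:
            # Distinguished values are never stored explicitly.
            self._stored.pop((row, col), None)
        else:
            self._stored[(row, col)] = value

    def get(self, row, col):
        # Check the mapping; fall back to the distinguished value.
        return self._stored.get((row, col), self.distinguished)

    def stored_count(self):
        return len(self._stored)

    def edge_list(self):
        # (source, sink, weight) for every stored element, in sorted order.
        return [(r, c, v) for (r, c), v in sorted(self._stored.items())]


m = SparseMatrix(1000, 1000)
m.set(2, 1, -4)
m.set(0, 3, 2)
print(m.get(2, 1))        # stored, non-distinguished value
print(m.get(5, 5))        # not stored, so the distinguished value
print(m.stored_count(), "of", m.n_rows * m.n_cols, "elements stored")
print(m.edge_list())
```

In this sketch a 1000-by-1000 matrix with two non-zero elements stores two entries rather than one million, which illustrates where the storage savings come from.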
FIG. 2 illustrates the CSR form of a matrix. The matrix 200 includes five rows and five columns and has 25 elements with 19 elements having a zero value (i.e., distinguished). A CSR data structure 250 represents the CSR form of the matrix 200. The CSR data structure includes a row table 260 and a column table 270. The row table contains an entry for each row of the matrix with a pointer to an entry in the column table. The column table contains an entry for each element with a non-distinguished value. Each entry of the row table points to the start of the sequence of column-table entries for that row, and each column-table entry identifies a column having a non-distinguished value in that row along with the value itself. For example, the element of the matrix 200 in the third row and second column has a value of −4. To represent this element, the third row of the row table points to the fifth entry of the column table that indicates that the second column has a value of −4.
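A row table and column table of this kind might be built and queried as sketched below. The matrix values are hypothetical: they are not taken from FIG. 2 and are chosen only to be consistent with the description (five rows, five columns, six non-zero elements, and −4 in the third row and second column). The function names `to_csr` and `csr_get` are likewise illustrative, and the sketch appends one extra end pointer to the row table, a common convention that the description above does not require.

```python
def to_csr(dense, distinguished=0):
    """Convert a dense matrix (list of rows) to (row_table, column_table).

    row_table[r] is the index in column_table of the first stored entry
    of row r; the extra final entry marks the end of the last row.
    column_table holds (column, value) pairs for non-distinguished elements.
    """
    row_table = [0]
    column_table = []
    for row in dense:
        for col, value in enumerate(row):
            if value != distinguished:
                column_table.append((col, value))
        row_table.append(len(column_table))
    return row_table, column_table


def csr_get(row_table, column_table, row, col, distinguished=0):
    """Return the element at (row, col), using 0-based indices."""
    start, end = row_table[row], row_table[row + 1]
    for c, value in column_table[start:end]:
        if c == col:
            return value
    return distinguished


# Hypothetical 5x5 matrix with 19 zero elements (not the matrix of FIG. 2).
dense = [
    [5, 0, 0, 2, 0],
    [0, 3, 0, 0, 1],
    [0, -4, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 7, 0, 0],
]

row_table, column_table = to_csr(dense)
print(row_table)
print(column_table)
print(csr_get(row_table, column_table, 2, 1))  # third row, second column
```

With 0-based indexing, `row_table[2]` is 4, so the third row points to the fifth entry of the column table, which records column index 1 (the second column) with value −4. An empty row, such as the fourth, has equal start and end pointers.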
Many implementations of graph algorithms assume that the corresponding matrix is represented in a CSR form and access the matrix via a CSR interface. A CSR interface typically provides a function that is passed a row and a column, accesses the CSR data structure, and returns the value of the corresponding element. To operate on a semantic database, rather than on a matrix represented in CSR form, a new implementation of a graph algorithm would need to be developed (or an existing implementation would need to be revised). Such development of a new implementation (or revision of an existing implementation) would be both costly and time-consuming.