1. Field of the Invention
Apparatuses and methods consistent with the present invention relate to databases and large distributed database systems.
2. Description of the Related Art
During the last decade, there has been viral growth in social networks (SN). Facebook, Flickr, Twitter, YouTube and Blogger all implement social networks. Both SN owners and SN users are interested in a variety of queries that involve subgraph matching. For example, consider the small social network 100 shown in FIG. 1. Users of such a network might ask queries such as:
Find all vertices ?v1, ?v2, ?v3, ?p such that ?v1 works at the University of Maryland and ?v1 is a faculty member and ?v2 is an Italian university and ?v3 is a faculty member at ?v2 who is a friend of ?v1 and ?v3 has commented on a posting (or paper) ?p by ?v1.
This query corresponds to a query graph 200 as shown in FIG. 2. It might be used by a University President to find existing interactions between his faculty and those in Italy (e.g., just before he goes for a meeting with the Italian embassy). When this query subgraph 200 is posed against an enormous SN, it is not feasible to match the subgraph against the graph in a naive way; without intelligent processing, the query would simply take too long. In the above subgraph 200 and the SN 100, the nodes are called vertices, and each edge between two vertices specifies a relationship between them.
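To make the cost of naive matching concrete, the following is a minimal sketch, not the method of the present disclosure. It assumes the SN is stored as a set of (source, label, target) edges and the query as a list of edge patterns in which names beginning with "?" are variables. The toy data, the edge labels (works_at, type, faculty_at, friend, commented_on, posted_by) and the function names are hypothetical and are not taken from FIG. 1 or FIG. 2.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical toy network; NOT the contents of FIG. 1.
EDGES: Set[Triple] = {
    ("alice", "works_at", "University of Maryland"),
    ("alice", "type", "faculty"),
    ("unibo", "type", "Italian university"),
    ("bruno", "faculty_at", "unibo"),
    ("bruno", "friend", "alice"),
    ("bruno", "commented_on", "paper1"),
    ("paper1", "posted_by", "alice"),
    ("carla", "faculty_at", "unibo"),
    ("carla", "friend", "alice"),  # carla never commented, so she must not match
}

# The example query written as edge patterns (labels are assumed names).
QUERY: List[Triple] = [
    ("?v1", "works_at", "University of Maryland"),
    ("?v1", "type", "faculty"),
    ("?v2", "type", "Italian university"),
    ("?v3", "faculty_at", "?v2"),
    ("?v3", "friend", "?v1"),
    ("?v3", "commented_on", "?p"),
    ("?p", "posted_by", "?v1"),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def unify(pattern: Triple, edge: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `edge`, or return None."""
    new = dict(binding)
    for p, e in zip(pattern, edge):
        if is_var(p):
            # Bind the variable, or reject if it is already bound elsewhere.
            if new.setdefault(p, e) != e:
                return None
        elif p != e:
            return None
    return new


def naive_match(query: List[Triple], edges: Set[Triple],
                binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Backtracking search that rescans every edge for every pattern."""
    binding = {} if binding is None else binding
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for edge in edges:  # full scan: no index is consulted
        extended = unify(first, edge, binding)
        if extended is not None:
            yield from naive_match(rest, edges, extended)


answers = list(naive_match(QUERY, EDGES))
print(answers)
```

Because every pattern triggers a scan over all edges, the work in this sketch can grow roughly as the number of edges raised to the number of patterns in the worst case, which is why this style of matching does not scale to an enormous SN. The sketch also allows two different variables to bind to the same vertex; whether that is desired depends on the matching semantics chosen.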
Query 200 above contains multiple vertices and different relationships between the vertices, demonstrating the need to execute complex queries over social networks. In addition, answering SPARQL queries in the Semantic Web's RDF framework often involves subgraph matching. A goal of the present disclosure is to show how to answer such queries and more complex ones over large social networks efficiently. A further goal of the present disclosure is to show how to store such large SNs on a plurality of computers (a cloud of computers) and how to answer queries from a client when the SN is stored in this cloud of computers.
Another goal of the present disclosure is to create a graph-based index for a database (such as an RDF database) such that the complete index can reside on a single disk. RDF (Resource Description Framework) is an increasingly important paradigm for the representation of information on the Web. As RDF databases grow to tens of millions of triples, and as sophisticated graph matching queries expressible in languages like SPARQL become increasingly important, scalability becomes an issue. For data sets of this size, secondary memory must be used for storage. There is therefore a growing need for indexes that can operate efficiently when the index itself resides on disk.