1. Field of the Invention
The present invention generally relates to improving efficiency in query searches of RDF and/or other schema-less data in a relational database. More specifically, a hash table is created in which each subject (or object) of the schema-less database is represented by a row, and a hash function is then used to insert the predicate/object data related to that subject, as units, into adjacent pairs of columns of that row, such that the first column of each column pair stores a predicate and the next adjacent column stores the value associated with that predicate.
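By way of illustration only, the row layout described above might be sketched as follows. The number of column pairs, the use of CRC32 as the hash function, and the linear probing on collision are assumptions made for this sketch and are not taken from the description; multi-valued predicates are not handled here.

```python
import zlib

NUM_PAIRS = 4  # hypothetical number of (predicate, value) column pairs per row


def column_pair_index(predicate: str, num_pairs: int = NUM_PAIRS) -> int:
    """Map a predicate to a column-pair slot (CRC32 is an arbitrary stand-in hash)."""
    return zlib.crc32(predicate.encode("utf-8")) % num_pairs


def insert_triple(table: dict, subject: str, predicate: str, value: str) -> int:
    """Store (predicate, value) in adjacent columns of the subject's row.

    Returns the column-pair slot that was used. Collisions are resolved by
    probing the next pair, which is an assumption of this sketch.
    """
    row = table.setdefault(subject, [None] * (2 * NUM_PAIRS))
    start = column_pair_index(predicate)
    for step in range(NUM_PAIRS):
        slot = (start + step) % NUM_PAIRS
        if row[2 * slot] is None or row[2 * slot] == predicate:
            row[2 * slot] = predicate      # first column of the pair
            row[2 * slot + 1] = value      # adjacent column holds the value
            return slot
    raise OverflowError(f"no free column pair left for subject {subject!r}")


table = {}
insert_triple(table, "IBM", "hasLocation", "Hawthorne")
```

In this sketch, all data for one subject sits in a single row, so a lookup of several predicates of that subject touches one row rather than several triples.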
2. Description of the Related Art
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications that was originally designed as a metadata data model and has become the lingua franca both for information extracted from unstructured data, such as with OpenCalais, and for information with a natural graph representation, such as DBpedia and UniProt. RDF has come to be used as a general method for the conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.
RDF provides a way to express linked data: Subject—Property—Object (Value). As an example, “IBM hasLocation Hawthorne” might be expressed in RDF as a triple (IBM, hasLocation, Hawthorne).
FIG. 1 shows, by way of example, how RDF data 100 in a relational store is generally stored as triples 101, each triple 101 having a subject 102, predicate 103, and object (e.g., metadata) 104. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. As an initial matter, it is noted that, in the art as well as in this description, the word “property” is sometimes used instead of “predicate”, and an “object” is also sometimes alternatively referred to as the “data”, “value”, or “metadata” associated with a predicate and/or subject. An RDF database D is a set of triples of the form (subject, predicate, object), where the subject and predicate are drawn from a set R of resources. A resource is any entity that can be denoted by a Uniform Resource Identifier (URI). The object is either a resource or a primitive value such as an integer, string, floating-point number, etc.
It is noted at this point that a more generic schema-less data scheme would use tuples rather than the format based on triples such as demonstrated by the RDF scheme used to describe the method of the present invention. Thus, for example, a tuple will contain a subject that is then interrelated to other components defined in that tuple. Other schema-less data representations include, for example, key/value databases (e.g., CouchDB).
A second triple example, “Hawthorne is locatedIn New York”, represented as triple (Hawthorne, locatedIn, New York), demonstrates how RDF triples can form a natural graph structure that interconnects triples. In this case, the second triple involves and extends the first example, noting that IBM is located in Hawthorne, which is located in New York, a relationship that lends itself to a graph. Thus, RDF is ideally suited for representing information extracted from unstructured sources, such as DBpedia, or graph-based information, such as UniProt.
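The graph structure formed by the two example triples can be illustrated with a minimal sketch. The function names `build_graph` and `reachable` are hypothetical and introduced only for this illustration; plain strings stand in for URIs.

```python
from collections import defaultdict

# The two example triples from the text, as (subject, predicate, object).
triples = {
    ("IBM", "hasLocation", "Hawthorne"),
    ("Hawthorne", "locatedIn", "New York"),
}


def build_graph(db):
    """Index triples by subject: each predicate becomes a labeled edge."""
    graph = defaultdict(list)
    for subject, predicate, obj in db:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


graph = build_graph(triples)
print(reachable(graph, "IBM"))  # contains both "Hawthorne" and "New York"
```

Because the object of the first triple is the subject of the second, following edges from IBM reaches New York through Hawthorne.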
In general, there could be 1-to-1 or 1-to-many relationships for a subject and predicate, so that, in general, not all predicates would be applicable for a subject and the number of predicates for subjects can vary widely. Queries could be expected for a particular subject or predicate or object. Arguably, the most widely supported RDF query language is SPARQL, although there is no single standard query language for RDF databases.
Answering these queries in a conventional manner in an RDF database results in many joins and self-joins and is difficult to optimize. A self-join is a condition that occurs when a table is joined to itself, based on some join condition. These joins allow a user to retrieve related records in one table, but joins and self-joins slow down query processing in conventional RDF database searches. Thus, in FIG. 1, subject “ArticleId” represents the same subject for all five triples shown, so a self-join exists for this subject ArticleId 102 having five predicates 103, if this listing were a table in a database.
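A minimal sketch of such a self-join follows, using Python's built-in `sqlite3` module and a three-column triple table. The table name `triples`, its column names, and the helper `build_triple_store` are assumptions made for illustration; the data are the two example triples given earlier.

```python
import sqlite3


def build_triple_store(rows):
    """Create an in-memory triple table and load the given triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    return conn


# "Where is the location of a given subject itself located?"
# Each triple pattern in the question needs its own alias of the same table,
# so two patterns already require one self-join.
SELF_JOIN_QUERY = """
SELECT t2.object
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.object
WHERE t1.subject = ?
  AND t1.predicate = 'hasLocation'
  AND t2.predicate = 'locatedIn'
"""

conn = build_triple_store([
    ("IBM", "hasLocation", "Hawthorne"),
    ("Hawthorne", "locatedIn", "New York"),
])
print(conn.execute(SELF_JOIN_QUERY, ("IBM",)).fetchall())  # [('New York',)]
```

Every additional triple pattern in a query adds another alias of the same table, which is why queries over a plain triple store tend to accumulate self-joins.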
FIG. 2 demonstrates a summary 200 of an exemplary RDF store (i.e., DBpedia 3.1, an RDF dataset extracted from Wikipedia) analyzed as a testbed, in which there were 136.9 million triples, with the number of predicates associated with any given subject varying from 1 (minimum) to 927 (maximum), and with the number of objects (values) of any given subject/predicate pair varying from a minimum of 1 value to a maximum of 7,176 values. Of significance in this data is that almost 82% of the triples had a single value, and almost 99% of the triples had no more than 128 predicates.
A possible way to reduce self-joins in query processing is to convert a triple store into a single property table, where all the predicates are listed for a subject, such as exemplarily shown in the property table 300 of FIG. 3. However, this is not feasible in practice, since the predicates tend to be sparse and not all subject/predicate relationships are 1-to-1. More importantly, a database engine does not support that many columns.
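The two difficulties with a single property table, sparsity and multi-valued predicates, can be seen in a small sketch. The function `to_property_table` is hypothetical and is not the layout of FIG. 3; it simply gives every distinct predicate its own column.

```python
def to_property_table(triples):
    """Pivot triples into one row per subject and one column per predicate.

    Returns (predicates, rows), where rows maps each subject to a dict of
    predicate -> value, with None standing in for SQL NULL.
    """
    predicates = sorted({p for _, p, _ in triples})
    rows = {}
    for subject, predicate, obj in triples:
        row = rows.setdefault(subject, {p: None for p in predicates})
        if row[predicate] is not None:
            # A single column cannot hold a second value for the same predicate.
            raise ValueError(f"{subject!r} has more than one value for {predicate!r}")
        row[predicate] = obj
    return predicates, rows


predicates, rows = to_property_table([
    ("IBM", "hasLocation", "Hawthorne"),
    ("Hawthorne", "locatedIn", "New York"),
])
null_cells = sum(v is None for row in rows.values() for v in row.values())
print(predicates, null_cells)
```

Even with only the two example triples, half of the cells are empty, and a subject with two values for one predicate cannot be stored in this layout at all.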
For example, in DB2 (a relational model database management system developed by IBM), a table can be defined with at most about 1012 columns for page sizes of 8 K to 32 K. In the above example, which has up to 7,176 values for any given subject/predicate pair, a naive scheme would need at least that many columns, which would not be possible with current database engines. To make it possible, one would have to break the data up into multiple tables, which would greatly complicate processing.
Even if that were possible, most of the values would be null since, in the exemplary dataset described by FIG. 2, although up to 927 predicates exist for any specific subject, most subjects have fewer than 5. So the space consumption would be huge and query processing efficiency would suffer.
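The arithmetic behind these two objections can be checked with the figures quoted above. The helper names below are hypothetical, and the calculation assumes one column per value and one column per predicate, as in the naive scheme.

```python
import math

DB2_MAX_COLUMNS = 1012             # approximate column limit quoted above
MAX_VALUES_PER_PAIR = 7176         # largest value count for one subject/predicate pair
MAX_PREDICATES_PER_SUBJECT = 927   # largest predicate count for one subject
TYPICAL_PREDICATES = 5             # most subjects have fewer than this


def tables_needed(columns_required, max_columns=DB2_MAX_COLUMNS):
    """Number of tables a row must be split across to fit the column limit."""
    return math.ceil(columns_required / max_columns)


def null_fraction(total_columns, populated_columns):
    """Fraction of a row's columns left null."""
    return 1 - populated_columns / total_columns


print(tables_needed(MAX_VALUES_PER_PAIR))
print(round(null_fraction(MAX_PREDICATES_PER_SUBJECT, TYPICAL_PREDICATES), 4))
```

Under these assumptions, the widest subject/predicate pair alone would span 8 tables, and a row for a typical subject in a 927-column table would be more than 99% null.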
Another possibility is the article/metadata property table 400 shown in FIG. 4. However, this second approach leads to multiple property tables, thereby making processing complicated.
As noted, RDF is becoming common for the representation of unstructured data that has been converted into structured tuples. There is a burgeoning amount of RDF data on the web, in the form of extracts of semi-structured data from Wikipedia, such as the DBpedia data described above; extracts from unstructured data, such as OpenCalais, which produces and houses RDF for Reuters newsfeeds (1 million web service requests to RDFified newsfeeds); the growing use of RDFa microformats to embed RDF in HTML; and extracts of relational data made so that the data can be linked to other unstructured data (e.g., U.S. Census data). Other examples of RDF use include Linked Open Data (2 billion RDF triples), Twine from Radar Networks (billions of RDF triples), and Powerset, acquired by Microsoft, which produces RDF triples for Wikipedia.
An efficient RDF store is clearly important for storing and querying this form of schema-less data, and, from the above discussion, a need continues to exist to improve storage of RDF and/or other schema-less data in a relational store and to improve the efficiency of a query search over this data.