Field of the Invention
The techniques disclosed herein relate to representing and manipulating RDF data in a large RDBMS. Particular techniques include efficient bulk loading of RDF data, using hash functions to generate the identifiers for the lexical values of the RDF data in the RDBMS, and techniques for compressing lexical values that are URIs.
Representing Information Using RDF
RDF is a widely-used language that was originally developed for representing information (metadata) about resources in the World Wide Web. It may, however, be used for representing information about absolutely anything. When information has been specified using the generic RDF format, it may be consumed automatically by a diverse set of applications.
There are two standard vocabularies defined on RDF: RDF Schema (RDFS) and the Web Ontology Language (OWL). These vocabularies introduce RDF terms that have special semantics in those vocabularies. For simplicity, in the rest of the document, our use of the term RDF will also implicitly include RDFS and OWL. For more information and for a specification of RDF, see
    RDF Vocabulary Description Language 1.0: RDF Schema, available at www.w3.org/TR/rdf-schema/
    OWL Web Ontology Language Overview, available at www.w3.org/TR/owl-features/
    Frank Manola and Eric Miller, RDF Primer, published by W3C and available in September, 2004 at www.w3.org/TR/rdf-primer/.
The RDF Vocabulary Description Language 1.0: RDF Schema, OWL Web Ontology Language Overview, and RDF Primer are hereby incorporated by reference into the present patent application.
Representation of Facts as RDF Triples
FIG. 1 and FIG. 2 provide an overview of RDF. Facts in RDF are represented by RDF triples. Each RDF triple represents a fact and is made up of three parts, a subject, a predicate (sometimes termed a property), and an object. For example, the fact represented by the English sentence “John is 24 years old” can be represented in RDF by the subject, predicate, object triple <‘John’, ‘age’, ‘24’>, with ‘John’ being the subject, ‘age’ being the predicate, and ‘24’ being the object. In the following discussion, the values in RDF triples are termed lexical values. In current RDF, the values of predicates must ultimately resolve to lexical values termed uniform resource identifiers (URIs), and the values of subjects must ultimately resolve to lexical values termed URIs and blank nodes.
The following are examples of URIs:                http://www.w3.org/2000/01/rdf-schema#Resource        http://www.w3.org/1999/02/22-rdf-syntax-ns#type        http://example.org/family/Female        http://www.w3.org/2000/01/rdf-schema#subClassOf        http://example.org/family/Person        
A URI is a standardized format for representing resources on the Internet, as described in RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, www.ietf.org/rfc/rfc2396.txt. RFC 2396 is hereby incorporated by reference into the present patent application. In the triples, the lexical values for the object parts may be literal values. In RDF, literal values are strings of characters, and can be either plain literals (such as “Immune Disorder”) or typed literals (such as “2.4”^^xsd:decimal). The interpretations given to the lexical values in the members of the triple are determined by the application that is consuming it. For a complete description of RDF, see Frank Manola and Eric Miller, RDF Primer, published by W3C and available in September 2004 at www.w3.org/TR/rdf-primer/. The RDF Primer is hereby incorporated by reference into the present patent application.
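The triple structure and the two kinds of literal described above can be illustrated with a minimal Python sketch. The names Triple and split_literal are hypothetical and chosen only for this illustration; the parsing handles just the simple quoted forms shown in the text, not the full RDF literal syntax.

```python
from typing import NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """One RDF fact: subject, predicate, object (all lexical values)."""
    subject: str
    predicate: str
    obj: str


def split_literal(lexical: str) -> Tuple[str, Optional[str]]:
    """Split a literal into (text, datatype).

    The datatype is None for a plain literal. Only the simple
    "text"^^type form is handled in this sketch.
    """
    if "^^" in lexical:
        text, datatype = lexical.rsplit("^^", 1)
        return text.strip('"'), datatype
    return lexical.strip('"'), None


# The fact "John is 24 years old" as a triple.
fact = Triple("John", "age", "24")

# A typed literal and a plain literal.
typed = split_literal('"2.4"^^xsd:decimal')
plain = split_literal('"Immune Disorder"')
```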
Representing the RDF Triple as a Graph
RDF triples may be represented as a graph as shown at 109 in FIG. 1. The subject is represented by a node 103, the object by another node 107, and the predicate by arrow 104 connecting the subject node to the object node. A subject may of course be related to more than one object, as shown with regard to age 103. An example triple is shown at 117. In the following general discussion of RDF, lexical values that are URIs will be replaced by the names of the entities the URIs represent.
Note that for clarity in the various figures and descriptions, URIs such as :Reviewer and :Person are shown in a simplified format in which default namespaces are omitted: thus :Reviewer is shown simply as Reviewer, with the initial colon designating the default namespace omitted.
Making RDF Models Using RDF Graphs
An RDF representation of a set of facts is termed in the following an RDF model. A simple RDF model Reviewers is shown at 101 in FIG. 1. The model has two parts: RDF data 113 and RDF schema 111. RDF schema 111 is made up of RDF triples that provide the definitions needed to interpret the triples of RDF data 113. Schema triples define classes of entities and predicates that relate classes of entities. A definition for the predicate age is shown at 112. As shown there, a predicate definition consists of two RDF triples for which the predicate is the subject. One of the triples, which has the built-in rdfs:domain predicate, indicates what kind of entities must be subjects for the predicate. Here, it is entities belonging to the class Person. The other triple, which has the built-in rdfs:range predicate, indicates what kinds of entities must be objects of the predicate; here, it is values of a numeric type called xsd:decimal. Schema 111 uses the rdfs:subClassOf predicate 110 to define a number of subclasses of entities belonging to the class Person. Also defined are Conference and University classes of entities, together with predicates that relate these entities to each other. Thus, an entity of class Person may be a chairperson of a conference and an entity of class Reviewer may be a reviewer for a conference. Also belonging to Schema 111 but not shown there is the built-in RDF predicate rdf:type. This predicate defines the subject of a triple that includes the rdf:type predicate as an instance of the class indicated by the object. As will be explained in more detail, RDF rules determine logical relationships between classes. For example, a built-in RDF rule states that the rdfs:subClassOf relationship is transitive: if A is a subclass of B and B a subclass of C, then A is a subclass of C. Thus, the class Faculty is a subclass of Person.
The data triples to which schema 111 applies are shown at 113; they have the general pattern <individual entity>, <predicate>, <object characterizing the individual entity>. Thus, triple 115 indicates that ICDE 2005 is an entity characterized as belonging to the class Conference and triple 117 indicates that John is characterized by having the age 24. Thus, RDF data 113 contains the following triples about John:
    John has an age of 24;
    John belongs to the subclass Ph.D.student;
    John is a ReviewerOf ICDE 2005.
An RDF model is a set of assertions. Hence, as a set, it should not contain duplicate assertions, that is, all <subject, predicate, object> data triples should be unique, and not be repeated within a model. However, two distinct RDF models may contain some data triples that are the same in the two models. The requirement that data triples not be duplicated or repeated in an RDF model is referred to as the set property.
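The set property can be sketched in a few lines of Python. The RdfModel class below is a hypothetical illustration, not part of any system described here: each model keeps its triples in a set, so a repeated assertion within one model is ignored, while the same triple may still appear in a different model.

```python
class RdfModel:
    """A named RDF model that enforces the set property."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> bool:
        """Add a triple; return False if it was already asserted."""
        triple = (subject, predicate, obj)
        if triple in self.triples:
            return False
        self.triples.add(triple)
        return True


reviewers = RdfModel("Reviewers")
farmers = RdfModel("Farmers")

added_first = reviewers.add("John", "age", "24")   # new assertion
added_again = reviewers.add("John", "age", "24")   # duplicate in same model
added_other = farmers.add("John", "age", "24")     # same triple, other model
```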
Inferencing in RDF Models
As is well known, an inferencing operation in RDF derives additional triples by applying RDF rules to the existing triples. These rules specify one or more triple patterns to be matched. If the patterns of the rule are matched, the output is a number of new triples. The rules used for inferencing may either be built in to RDF or particular to a model. In the latter case, the rules are specified with the model. The built-in RDF rule that the rdfs:subClassOf predicate is transitive is an example of how an RDF rule can be used to infer new triples.
In FIG. 1, none of these triples states that John is a Person; however, the fact that he is a Person and a Reviewer is inferred from the fact that he is stated to be a Ph.D.student, which is defined in schema 111 as a subclass of both Person and Reviewer. Because the rdfs:subClassOf predicate is transitive (by virtue of the built-in rule to that effect), the fact that John is a Ph.D.student means that he is a potential subject of the age and ReviewerOf properties.
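As a rough illustration of rule-based inferencing, the following Python sketch applies two rules repeatedly until no new triple appears: transitivity of rdfs:subClassOf, and propagation of rdf:type up the subclass hierarchy. The function name infer and the three input triples are assumptions made for this example (in particular, the Reviewer-to-Person subclass link is assumed here only to exercise transitivity); a production rule engine would be considerably more elaborate.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Return the triples derivable from `triples` but not stated in it.

    Rule 1: (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    Rule 2: (x type B),       (B subClassOf C) => (x type C)
    The rules are applied until a fixed point is reached.
    """
    closure = set(triples)
    while True:
        new = set()
        for (a, p1, b) in closure:
            for (c, p2, d) in closure:
                if b != c or p2 != SUBCLASS:
                    continue
                if p1 == SUBCLASS:
                    new.add((a, SUBCLASS, d))
                elif p1 == TYPE:
                    new.add((a, TYPE, d))
        if new <= closure:
            return closure - set(triples)
        closure |= new


stated = [
    ("PhDStudent", SUBCLASS, "Reviewer"),
    ("Reviewer", SUBCLASS, "Person"),   # assumed for illustration
    ("John", TYPE, "PhDStudent"),
]
inferred = infer(stated)
```

With these inputs, the inferred set contains the statements that John is a Reviewer and a Person, neither of which was asserted directly.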
Using RDF Patterns to Query RDF Models
RDF models are queried by applying to an RDF model a set of RDF triples in which one or more subjects and objects are replaced by variables. Such an RDF triple is termed an RDF pattern. As is well known, an RDF query (such as may be done using an RDF query language such as SPARQL) applies this set of query triples to the RDF model and returns the subgraphs that satisfy the query as a result. For a description of SPARQL, see SPARQL Query Language for RDF, W3C Working Draft, 12 Oct. 2004, found at http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/.
For purposes of the present discussion RDF models are best represented as lists of RDF triples instead of graphs. FIG. 2 shows a table of triples 201 that lists triples making up schema 111 and a table of triples 203 that lists triples making up RDF data 113. At the bottom of FIG. 2 is an RDF pattern 205. There are many different ways of expressing RDF patterns; what follows is a typical example. When RDF pattern 205 is applied to RDF model 101, it will return a subgraph of RDF model 101 that includes all of the reviewers of conference papers who are PhD students. The pattern is made up of one or more patterns 207 for RDF triples followed by an optional filter that further restricts the RDF triples identified by the pattern. The identifiers beginning with the character “?” are variables that represent values in the triples belonging to the subgraph specified by the RDF pattern. Thus, the first pattern 207(1) specifies every Reviewer for every Conference indicated in the RDF data 203; the second pattern 207(2) specifies every Reviewer who belongs to the subclass Ph.D.student, and the third pattern 207(3) specifies every Person for which an age is specified. The result of the application of these three patterns to RDF data 203 is the intersection of the sets of persons specified by each of the patterns, that is, the intersection of the set of reviewers and the set of PhD Students of any age. The intersection is John, Tom, Gary, and Bob, who are indicated by the triples in data 203 as being both PhD students and reviewers.
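The way a set of triple patterns selects the intersection of several conditions can be sketched with a small backtracking matcher. This is an illustrative sketch only: the function match, the sample data, and the individual Mary are hypothetical, and the optional filter clause is omitted.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns.

    A pattern is a 3-tuple whose terms are either constants or
    variables (strings beginning with "?").
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


data = [
    ("John", "ReviewerOf", "ICDE2005"),
    ("John", "rdf:type", "PhDStudent"),
    ("John", "age", "24"),
    ("Mary", "ReviewerOf", "ICDE2005"),   # hypothetical individual
    ("Mary", "rdf:type", "Faculty"),
    ("Mary", "age", "40"),
]

query = [
    ("?r", "ReviewerOf", "?c"),
    ("?r", "rdf:type", "PhDStudent"),
    ("?r", "age", "?a"),
]

student_reviewers = sorted({b["?r"] for b in match(query, data)})
```

Because the variable ?r is shared by all three patterns, only individuals satisfying every pattern survive, which is the intersection behavior described above.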
Implementations of Systems for Querying RDF Models
A number of query languages have been developed for querying RDF models. Among them are:
    RDQL, see RDQL - A Query Language for RDF, W3C Member Submission 9 Jan. 2004, http://www.w3.org/Submission/2004/SUBM-RDQL-20040109;
    RDFQL, see RDFQL, Database Command Reference, http://www.intellidimension.com/default.rsp?topic=/pages/rdfgateway/reference/db/default.rsp;
    RQL, see G. Karvounarakis, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl. RQL: A Declarative Query Language for RDF, WWW2002, May 7-11, 2002, Honolulu, Hi., USA;
    SPARQL, see SPARQL Query Language for RDF, W3C Working Draft, 12 Oct. 2004, http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/;
    SquishQL, see RDF Primer. W3C Recommendation, 10 Feb. 2004, http://www.w3.org/TR/rdf-primer.
The query languages described in the above references are declarative query languages with quite a few similarities to SQL, which is the query language used in standard relational database management systems. Indeed, systems using these query languages are typically implemented on top of relational database systems. However, because these systems are not standard relational database systems, they cannot take advantage of the decades of engineering that have been invested and continue to be invested in the standard relational database systems. Examples of the fruits of this engineering that are available in standard relational database systems are automatic optimization, powerful indexing mechanisms, facilities for the creation and automatic maintenance of materialized views and of indexes, and the automatic use of available materialized views and indexes by the optimizer.
US Published Patent Application 2006/0235823 A1 describes how an RDF querying system may be integrated into an RDBMS: for convenience, this will be referred to as the 2006/0235823 reference. An additional relevant prior art reference is the Oracle Database release 10G: for convenience, this will be referred to as the Oracle 10G reference (see: Oracle Database Documentation Library, http://download.oracle.com/docs/cd/B14117_01/index.htm).
Overview of an RDBMS into which RDF has been Integrated
The systems of the 2006/0235823 and Oracle 10G prior art references, and the system of this invention, are implemented in an RDBMS. FIG. 4 is a functional block diagram of a relational database management system 401 into which RDF has been integrated. RDBMS systems are characterized by the fact that the information they contain is organized into tables having rows and named columns. A row of data establishes a relationship between the items of data in the row and the SQL query language uses the relationships thus established to locate information in the tables. RDBMS system 401 may be any RDBMS in which RDF queries have been integrated into the SQL used in the RDBMS. In RDBMS 401, a built-in table function has been used to integrate the RDF queries into the SQL.
The main components of RDBMS system 401 are a processor 421, memory 403, which contains data and programs accessible to the processor, and persistent storage 423, which contains the information organized by system 401. Processor 421 further can provide information to and receive information from display and input devices 422, can provide information to and receive information from networks 424, and can provide information to and receive information from file system 426. Processor 421 creates RDBMS system 401 as the processor 421 executes programs in memory 403 using data contained in memory. The programs typically include an operating system 407, which manages the resources used by RDBMS 401, relational database program 409, which interprets the SQL language, and application programs 411, which provide queries to RDB program 409. Data used by these programs includes operating system data 419, used by the operating system, RDBMS data 417, used by RDB program 409, and application program data 415, used by application programs 411.
The information that RDB program 409 maintains in persistent storage 423 is stored as objects that RDBMS system 401 is able to manipulate. Among the objects are fields, rows, and columns in the tables, the tables themselves, indexes to the tables, and functions written in the SQL language. The objects fall into two broad classes: user-defined objects 441, which are defined by users of the RDBMS, and system-defined objects 425, which are defined by the system. RDBMS 401 maintains definitions of all of the objects in the database system in data dictionary 427, which is part of DB system objects 425. For the present discussion, the most important definitions in data dictionary 427 are table definitions 429, which include definitions 431 of RDF tables 443, table function definitions 433, which define table functions including RDF_MATCH table function 435, which permits use of RDF patterns to query RDF models in RDBMS 401, and SQL function definitions 437, which includes RDF_GENMODEL function 439, which takes RDF triples and makes them into RDF tables 443.
The tables of interest in user objects 441 are RDF tables 443, which are tables in RDBMS 401 that are typically made from character-string representations of RDF models and their triples. The character-string representations are typically contained in files. Tables 443 fall into three groups: RDF triple tables 445, which represent the triples making up an RDF model 101, RDF rule tables 449, which contain the rule bases belonging to RDF information 313, and RDF optimization objects 447, which are tables and other objects which are used to speed up queries on the RDF models represented by RDF triple tables 445 and the RDF rules in rules tables 449. All of these tables and objects will be explained in more detail below.
Representations of RDF Triples
The 2006/0235823 reference discloses a normalized representation for RDF triples. The tables used to represent RDF triples are shown in detail in FIG. 6. There are two main tables: IdTriples 601, which is a list of models and their RDF triples, as represented by internal identifiers for lexical values of the triple, and UriMap 613, which maps each distinct lexical value to a distinct internal identifier and thus permits conversions between the URIs and literals and the internal identifiers. The internal identifiers are typically integers or other values having datatypes native to the database management system.
The relationship established between each distinct value in one set and a distinct value in another set, as just described between the lexical values and unique internal identifiers (termed in the following UIDs), is referred to as a mapping between the first set and the second set. The mapping between the lexical values and the UIDs is also one-to-one: for each distinct lexical value, there is one particular UID, and for each UID, there is one distinct lexical value. Further, the mapping is bi-directional: for any given UID, it is always possible to know what the corresponding lexical value is, and vice versa. These properties of the mapping allow the UID to be used to “stand in” for the lexical value; this substitution is an important technique used in RDBMS systems.
The process of mapping lexical values to UIDs is termed normalization, and a table like the IdTriples table 601 containing UIDs for the lexical values of RDF triples, with a second table like the UriMap table 613 mapping the UID values to the lexical values, is a normalized representation of a set of RDF models. Each distinct lexical value in the RDF triples belonging to the set of models in IdTriples table 601 must have a UID mapped to it in RDBMS 401. A table such as IdTriples table 601, in which the lexical values in the triples are represented by their UIDs, is said to contain normalized forms of the RDF triples. Advantages of normalization include the following:
    1. URIs of RDF data tend to be large and are usually repeated many times in the data of an RDF model. Storing them as they are (typically as strings) would be wasteful in storage, thereby making the table and dependent indices unnecessarily large and hence resulting in lower performance. Integer UID values generally require substantially less storage space than do strings: use of integer UID values instead of the original strings in the IdTriples table thus saves substantially on storage space. The reduction in storage space for the table and dependent indices further leads to performance improvements, such as by allowing more of the table and dependent indices to fit into available main memory for processing.
    2. String comparisons are further much less efficient than integer comparisons. For this reason, operations such as tests for equality (sameness) of one triple to another, or queries to locate triples that have a particular value in them, execute more quickly if they are performed using the UIDs that represent the lexical values in the triples rather than the lexical values themselves.
In the prior art, the UIDs used for normalization are typically generated by the RDBMS. Generally, the RDBMS produces sequential values for the UIDs: 1, 2, 3, 4, etc., and maps these sequential values to the distinct lexical values. Because each value in the sequence of generated values is different from the others, a distinct UID represents each distinct lexical value. One limitation of such a scheme is that UID values must be generated serially, and must be generated by a single register, which precludes the possibility of generating UID values concurrently or on multiple systems for improved performance. The mechanism for getting the UID value for a particular lexical value is to store each pair of lexical value and UID value in a table as each UID value is generated and related to the lexical value, and then to look up the lexical value in the table when the UID is needed. One limitation of this technique is the time required to look up the UIDs, especially when the resulting table becomes large.
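The sequential scheme can be sketched as follows. The class SequentialNormalizer is a hypothetical in-memory stand-in for the mapping table; it is meant only to show why every new lexical value must pass through one shared counter and why each conversion requires a table lookup.

```python
import itertools


class SequentialNormalizer:
    """Maps lexical values to sequential UIDs (1, 2, 3, ...) and back."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)   # the single serial generator
        self.value_to_uid = {}
        self.uid_to_value = {}

    def uid_for(self, lexical: str) -> int:
        """Look up the UID for a value, generating one if it is new."""
        uid = self.value_to_uid.get(lexical)
        if uid is None:
            uid = next(self._counter)
            self.value_to_uid[lexical] = uid
            self.uid_to_value[uid] = lexical
        return uid

    def normalize(self, triple):
        """Replace each lexical value of a triple by its UID."""
        return tuple(self.uid_for(value) for value in triple)

    def denormalize(self, ids):
        """Recover the lexical values from a normalized triple."""
        return tuple(self.uid_to_value[uid] for uid in ids)


normalizer = SequentialNormalizer()
john = normalizer.normalize(("John", "age", "24"))
tom = normalizer.normalize(("Tom", "age", "24"))
```

Repeated values such as age and 24 reuse their existing UIDs, so only the new subject Tom consumes a fresh identifier.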
RDBMS JOIN Operations and their Use with Mapping Tables:
An important functionality in RDBMS systems is the JOIN operation. The JOIN operation is used as an optimization and programming convenience, to combine two tables into a temporary or simulated table, when the two tables both contain common columns that refer to the same values. JOIN operations are frequently used to combine a mapping table, such as the UriMap table 613, with a table that has been created to take advantage of the mapping, such as the IdTriples table 601. The common columns in this case are the InternalID column 615 of the UriMap table 613, and the SubjectID 605, PropertyID 607, and ObjectID 609 columns of the IdTriples table 601. A JOIN operation performs the necessary lookup operations to combine the two tables. For example, a JOIN operation could be performed on the UriMap 613 mapping table, for each of the three columns in the IdTriples table 601, to produce a temporary or virtual table that appeared to have the full strings for Subject, Object, and Predicate, rather than the UID values of the IdTriples table 601.
JOIN operations in an RDBMS simplify the design and programming of many applications, and generally result in less storage space being used, because it is not necessary to repeat data in several different permanent tables. JOIN operations are also often more efficient than creating an actual table like the temporary or virtual table of the JOIN. JOIN operations are also a convenient way to establish and exploit relationships among several tables.
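The three-way JOIN against the mapping table can be tried out with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in for the RDBMS; the table and column names follow the description above, while the sample rows and the aliases s, p, and o are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UriMap (
        InternalID INTEGER PRIMARY KEY,
        RDFVal     TEXT UNIQUE NOT NULL
    );
    CREATE TABLE IdTriples (
        ModelID    INTEGER,
        SubjectID  INTEGER,
        PropertyID INTEGER,
        ObjectID   INTEGER
    );
    INSERT INTO UriMap VALUES (1, 'John'), (2, 'age'), (3, '24');
    INSERT INTO IdTriples VALUES (1, 1, 2, 3);
""")

# Join UriMap once per UID column to recover the lexical values.
rows = conn.execute("""
    SELECT s.RDFVal, p.RDFVal, o.RDFVal
    FROM IdTriples t
    JOIN UriMap s ON s.InternalID = t.SubjectID
    JOIN UriMap p ON p.InternalID = t.PropertyID
    JOIN UriMap o ON o.InternalID = t.ObjectID
""").fetchall()
```

The result behaves like a virtual table of full strings, although only integer UIDs are stored in IdTriples.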
For further information about JOIN operations, see
    Oracle® Database SQL Language Reference, 11g Release 1 (11.1), Joins, download.oracle.com/docs/cd/B28359_01/server.111/b28286/queries006.htm
    Join (SQL), en.wikipedia.org/wiki/Join_(SQL)
Using Hashing to Generate UIDs
Some systems for storing RDF data in an RDBMS use the technique of assigning a UID that is mathematically derived from the input data value alone.
The most common form of this technique is to derive the UID values for normalization mathematically using a hashing function, also referred to just as a hash function. For the purposes of this presentation, a hash function is a function or operation that
    takes a value as an input, and generates another value as an output;
    always produces an output value for every valid input value; and
    for a given input value, always generates the same output value,
and thus maps its input values to its output values. Such a mapping operation with a hash function is also referred to as hashing.
Hash functions as described here are widely used to speed up table lookup in data processing systems. The data value is referred to as the hashed value (the input to the hashing function), and the output of the function for a particular hashed value is referred to as the hash value. Many well-known hash functions produce output values that are integers, or are a set of bits of a particular length such as 128 bits, or a set of bytes of a particular length such as two bytes or characters.
However, hash functions generally do not always generate unique values: a case where two different values that are hashed by the hash function result in the same hash value is known as a hash collision. The technique of computing a different hash value by re-doing the hash with modified input data or an equivalent operation, for one of the hashed values in a collision, so that the resulting hash values no longer collide, is generally referred to as hash collision resolution.
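The general idea of a hash collision and its resolution by re-hashing modified input can be sketched as follows. This is a generic illustration, not the method of any particular system: weak_hash is a deliberately poor hash function (a byte sum modulo 256) chosen so that a collision is easy to produce, and the class name, the "#n" suffix used to modify the input, and the retry limit are all assumptions of this example.

```python
def weak_hash(text: str) -> int:
    """A deliberately weak hash: sum of the UTF-8 bytes, modulo 256."""
    return sum(text.encode("utf-8")) % 256


class HashedUidTable:
    """Assigns hash-derived UIDs, re-hashing when two values collide."""

    MAX_ATTEMPTS = 1000   # guard against looping when the table fills up

    def __init__(self) -> None:
        self.uid_to_value = {}

    def uid_for(self, value: str) -> int:
        for attempt in range(self.MAX_ATTEMPTS):
            # Attempt 0 hashes the value itself; later attempts hash
            # modified input so that a different hash value results.
            data = value if attempt == 0 else f"{value}#{attempt}"
            uid = weak_hash(data)
            owner = self.uid_to_value.get(uid)
            if owner is None:
                self.uid_to_value[uid] = value   # free UID: claim it
                return uid
            if owner == value:
                return uid                       # value already mapped
            # Otherwise another value owns this UID: collision, retry.
        raise RuntimeError("could not resolve hash collision")


table = HashedUidTable()
uid_ab = table.uid_for("ab")   # "ab" and "ba" have the same byte sum
uid_ba = table.uid_for("ba")   # collides with "ab", so it is re-hashed
```

Because the same sequence of attempts is replayed on every lookup, a value that was re-hashed when first stored is found again under the same UID.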
Prior art systems that use hashing functions to generate UIDs for normalization from lexical values in RDF triples are unable to resolve hash collisions. When a collision occurs, these prior art systems do one or more of the following:
    Reject the data that resulted in a hash collision: in this case, the system cannot handle all input data.
    Require that the input data be modified so that no collision occurs: in this case, the system is no longer a system that answers queries or does processing about the actual input data.
    Disregard the probability that hash collisions may occur: in this case, the system fails to operate correctly when a hash collision does occur.
None of these alternatives is acceptable in a production system for manipulating RDF models. An example of a prior art system which uses hashing for normalization but does not resolve collisions is 3store (see “3store: Efficient Bulk RDF Storage”, 1st International Workshop on Practical and Scalable Semantic Systems, Oct. 3, 2003, Sanibel Island, Fla., km.aifb.uni-karlsruhe.de/ws/psss03/proceedings/harris-et-al.pdf).
Details of IdTriples Table 601
Continuing in detail with IdTriples table 601, this table has a row 611 for every RDF triple in the RDF models that have been loaded into the RDBMS. The table has four columns:
    ModelID 603, which contains the internal identifier of the model to which the RDF triple belongs;
    SubjectID 605, which contains the UID for the RDF triple's subject;
    PropertyID 607, which contains the UID for the RDF triple's predicate; and
    ObjectID 609, which contains the UID of the RDF triple's object.
As shown in FIG. 6, IdTriples table 601 shows the rows for the first four data triples of data triples 203. It would of course contain a row for every schema triple in table 201 and every data triple in table 203.
The IdTriples table is partitioned in the RDBMS on ModelId with each partition holding a separate RDF graph or model. This maintains locality of each model within the table. The rows for the model Reviewers are illustrated at 631. Further, the rows for a separate model Farmers are illustrated at 633.
In a typical RDBMS, when a table is a partitioned table, the different partitions of a table may be indexed, modified, and updated separately from each other. An operation of particular interest in the system of the Oracle 10G reference, is the EXCHANGE PARTITION operation that allows an entire partition of a table to be updated in a “zero cost” operation—that is, an operation in the RDBMS which does not involve moving or copying significant amounts of data. The operation changes the internal definition of the table so that a particular partition of the table now refers to a separate part of the RDBMS storage that is already prepared with appropriate data. Depending on the implementation in the particular DBMS, the different partitions of the table may be stored in different groups of blocks on disk, in separate files, in separate directories of a filesystem, or on physically separate filesystems or data servers. Techniques for supporting partitioned tables within an RDBMS are well known in the art.
As an example, a possible partitioning of a database table involving ZIP codes would be to partition the data into two separate sections, one named ZIPEAST for rows for ZIP codes less than 50000, and another ZIPWEST for rows for ZIP codes greater than or equal to 50000.
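The ZIP code example amounts to a rule that routes each row to exactly one named partition. A minimal sketch, assuming in-memory lists in place of real table partitions and hypothetical sample rows:

```python
from collections import defaultdict


def partition_for(zip_code: int) -> str:
    """Return the partition name for a row with the given ZIP code."""
    return "ZIPEAST" if zip_code < 50000 else "ZIPWEST"


# Each partition is stored separately and could be indexed or
# replaced without touching the other.
partitions = defaultdict(list)
sample_rows = [(10001, "row A"), (94105, "row B"), (50000, "row C")]
for row in sample_rows:
    partitions[partition_for(row[0])].append(row)
```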
Partitioning the IdTriples table 601 in the RDBMS brings advantages such as the following:                A given RDF model may be updated, have its index rebuilt or disabled, or modified in other ways without affecting the data of other models.        Indices can be defined for a table but categorized as local, and thus maintained separately for each partition, resulting in more efficient performance and smaller indices in each partition. Further, compression features of the RDBMS allow the model column to be substantially compressed and use less storage space.        Inserting a row in one RDF model does not affect or involve the storage or indices of other models.        It is easier to set and enforce access control on a per-model basis.        
A uniqueness constraint in the RDBMS is defined on the (SubjectID, PropertyID, ObjectID, and ModelID) columns in the IdTriples table to ensure that no duplicate triples can be inserted into a model in error. Because the table is partitioned on ModelID, this constraint is enforced by an index categorized as local, which results in separate index storage for each partition. The separate index storage on the model/partition ReviewersId 631 is shown at 635, applying to the three columns SubjectID 605, PropertyID 607, and ObjectID 609. The separate index storage on the model/partition FarmersId 633 is shown at 637, and applies to the same columns, but within the FarmersId model/partition only.
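The effect of such a uniqueness constraint can be demonstrated with Python's built-in sqlite3 module. SQLite has no table partitioning, so this sketch approximates the per-model behavior with a single unique index that includes the model column; the index name and the helper insert_triple are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IdTriples ("
    "ModelID INTEGER, SubjectID INTEGER, "
    "PropertyID INTEGER, ObjectID INTEGER)"
)
conn.execute(
    "CREATE UNIQUE INDEX idtriples_uniq "
    "ON IdTriples (ModelID, SubjectID, PropertyID, ObjectID)"
)


def insert_triple(model_id: int, s: int, p: int, o: int) -> bool:
    """Insert a normalized triple; return False if it is a duplicate."""
    try:
        conn.execute(
            "INSERT INTO IdTriples VALUES (?, ?, ?, ?)",
            (model_id, s, p, o),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_triple(1, 1, 2, 3)        # new triple in model 1
duplicate = insert_triple(1, 1, 2, 3)    # rejected by the unique index
other_model = insert_triple(2, 1, 2, 3)  # same triple, different model
```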
Uniqueness constraints are defined on a table in an RDBMS such as Oracle by the CREATE UNIQUE INDEX operation. For further information on indices and constraints, see
    Oracle® Database SQL Reference 10g Release 1 (10.1), download.oracle.com/docs/cd/B14117_01/server.101/b10759.pdf.
UriMap table 613 has a single row 619 for every UID that appears in IdTriples table 601. There are four columns that are of interest in the present context:
    InternalID 615, which contains the UID;
    RDFVal 617, which specifies a lexical value corresponding to the UID;
    a flag that indicates, for an RDFVal 617 that is a literal value, whether the value is in the canonical form; and
    the type of RDFVal 617.
Uniqueness constraints in the RDBMS are defined on the InternalID 615 and RDFVal 617 columns respectively in the UriMap table, to ensure that all InternalID values and also all RDFVal values are distinct. The uniqueness index and constraint on InternalID 615 is shown at 641. The uniqueness index and constraint on RDFVal 617 is shown at 642.
The canonical form for a literal value is a standard form for writing the value. For example, the numeric value 24 may be written as 024, 24.00, 2.4×10^1, and so on. Depending on the application, any of these may be used as the canonical form, or a different form may be used as the canonical form. Canonicalization is the technique of translating different formats for the same information value to the standard form. In the 2006/0235823 reference, the form used for the value when the first entry is made for the value in UriMap 613 is treated as the canonical value. There is further an index, idx_num 627, which indexes a given numerical value to a row in UriMap 613 that contains the canonical representation.
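A first-seen canonicalization policy for numeric literals can be sketched with Python's decimal module. The class CanonicalForms is hypothetical; its dictionary keyed by numeric value plays a role loosely analogous to an index from a numerical value to the canonical entry, and scientific notation is written in the E form that Decimal accepts.

```python
from decimal import Decimal


class CanonicalForms:
    """Treats the first lexical form seen for a number as canonical."""

    def __init__(self) -> None:
        self._first_seen = {}   # numeric value -> canonical lexical form

    def canonical(self, lexical: str) -> str:
        """Return the canonical form for a numeric literal."""
        # Decimal values that are numerically equal compare and hash
        # equal, so 024, 24.00 and 2.4E1 share one dictionary key.
        return self._first_seen.setdefault(Decimal(lexical), lexical)


forms = CanonicalForms()
c1 = forms.canonical("024")     # first form seen becomes canonical
c2 = forms.canonical("24.00")
c3 = forms.canonical("2.4E1")
c4 = forms.canonical("25")      # a different value, its own canonical form
```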
Ancillary Application Tables
The Oracle 10G reference also describes ancillary application tables. These ancillary application tables are per-model: a particular such application table only contains information relevant to a particular RDF model.
An application may involve additional information about RDF triples from a particular model that are not part of the RDF model. Depending on the application, this additional information may be included in the input data for the RDF model, or it may be input separately. For example, there may be information associated with the RDF triple giving the provenance of that triple, such as which organization or individual input that particular triple into the dataset. FIG. 8 illustrates an exemplary application table 801.
The application table 801 contains three initial columns: column ID, shown at 812, holds an internal sequentially-generated ID value for each row of the application table 801; virtual column SDO_RDF_TRIPLE_S, at 813, holds a data object which contains five values. The first four of these values, model_id, a, b, and c, consist of the ModelId identifier and the SubjectId, PropertyId, and ObjectId UID values corresponding to the model, subject, predicate, and original object strings of the RDF triple. The additional link_id value is an identifier for the row in the model's partition in the IdTriples table 601 that holds the normalized form of the RDF triple; this normalized triple contains the UID of the canonical form of the original object of the triple, and not the UID of the original object string. Together, the model_id and link_id values could be used for the functionality of a foreign key from the application table into the IdTriples table.
Columns 814 source_db and further columns 815, etc. contain the additional information associated with that original RDF triple. The row at 811 shows example values for a triple (model-id, a,b,c, link_id) showing that this particular RDF triple in this particular model came from a source identified as SourceDB23.
Fidelity
An additional requirement of RDF databases is that the implementation not only translate input values to a canonical form, but also that it maintain fidelity, or data integrity, to the original form of the values. Fidelity is the ability to reproduce the original lexical values or data exactly as they were in the original data. Fidelity can be required for a number of reasons in different applications: for example, it may be necessary to be able to verify the information in the RDF triples by comparing it exactly with the original data, or it may be necessary to produce the original value in order to export data back to the original source.
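The interplay between canonicalization and fidelity can be illustrated with a minimal Python sketch. It assumes the first-seen-form convention described above for the 2006/0235823 reference; the class name LiteralMap, its attribute names, and the use of Python's Decimal type to compare numeric values are illustrative assumptions and are not taken from any of the referenced systems.

```python
from decimal import Decimal


class LiteralMap:
    """Hypothetical in-memory stand-in for a lexical-value map with a numeric index."""

    def __init__(self):
        self.uid_by_text = {}     # original string -> UID (keeps every original form)
        self.text_by_uid = {}     # UID -> original string (supports fidelity)
        self.canon_by_value = {}  # numeric value -> UID of canonical form (role of idx_num)

    def add(self, text):
        """Return (UID of the original string, UID of its canonical form)."""
        uid = self.uid_by_text.get(text)
        if uid is None:
            uid = len(self.uid_by_text) + 1
            self.uid_by_text[text] = uid
            self.text_by_uid[uid] = text
        # Assumption: the first form entered for a numeric value becomes canonical.
        canon_uid = self.canon_by_value.setdefault(Decimal(text), uid)
        return uid, canon_uid
```

In this sketch, the strings 024 and 24.00 receive different UIDs, so either original form can be reproduced exactly, while both resolve to the same canonical UID.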
Limitations of Prior Systems in which RDF is Integrated into an RDBMS
Experience with prior-art systems such as system 401 of FIG. 4 has shown that improvements are needed in dealing with collisions when UIDs are produced by hashing, in bulk loading of RDF data into the RDBMS, and in the compression of URIs.
Generation and Use of UIDs
Real-world RDF datasets tend to be quite large. For example, the UniProt RDF model is a well-known RDF representation of the data from the Universal Protein Resource (UniProt) database about biological proteins and related annotation data (see UniProt Database, www.isb-sib.ch/~ejain/rdf). This model currently (2008) contains about 207 million triples, referencing some 33 million or more lexical values, constituting approximately 12 Gigabytes of data in the character string format used to distribute the model. Systems for manipulating and querying large real-world RDF datasets need to be able to operate on datasets of a billion (1,000,000,000) triples and more.
FIG. 5 gives a summary of prior art normalization for RDF Triples (subject, predicate, object). Normalized triples are stored in two tables, a LexValues (lexval, id) table 521 and an IdTriples (subj-id, pred-id, obj-id) table 501.
For the purposes of this presentation, IdTriples table 501 is equivalent to IdTriples table 601, and the LexValues table 521 is equivalent to UriMap table 613.
The LexValues table 521 has two columns, lexval 533 for the lexical value, and id 538 for the normalized UID to which that lexical value has been mapped. The row at 531 shows that the lexical value string “John” will be represented by the UID value 100. The IdTriples table 501 has three columns, subj-id 505, pred-id 507, and obj-id 509, for the normalized UIDs for the subject, predicate, and object parts respectively of the RDF triples. The row at 511 shows a normalized triple to represent the RDF triple (“John”, “managerOf”, “Mary”) with the three UID values 100, 300, and 200 respectively.
The conversion of a set of RDF triples to an IdTriples table such as table 501 requires that first, the LexValues table be constructed to establish the relationship of each distinct lexical value lexval 533 to a distinct id value 538. Further, the RDF triples are processed to translate each of the three lexical strings in each triple to the corresponding UID value. This involves three separate lookups of values in the LexValues table (once for each string in the triple), times the number of triples to be processed.
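The two-step conversion just described can be sketched in Python as follows. This is a minimal illustration only: the function name normalize, the use of in-memory dictionaries in place of RDBMS tables, and the sequential numbering of UIDs are assumptions made for the example, and the UID values it assigns are arbitrary rather than those shown in FIG. 5.

```python
def normalize(triples):
    """Return (lex_values, id_triples, lookups) for a list of (s, p, o) strings."""
    # Step 1: build the LexValues mapping, one distinct id per distinct lexical value.
    lex_values = {}
    for triple in triples:
        for lexval in triple:
            if lexval not in lex_values:
                lex_values[lexval] = len(lex_values) + 1

    # Step 2: translate every triple, counting one lookup per string.
    id_triples = []
    lookups = 0
    for subj, pred, obj in triples:
        id_triples.append((lex_values[subj], lex_values[pred], lex_values[obj]))
        lookups += 3
    return lex_values, id_triples, lookups
```

The lookup count returned is always three times the number of triples, which is the source of the figure of about 621 million look-ups for a model of about 207 million triples.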
There are two factors that make this unwieldy and slow for large RDF models. In the example of the UniProt RDF model, each look-up is into a table with over 33 million entries: performing this look-up is a substantial burden. RDBMS systems provide special indexing means for speeding up look-up operations into tables, but with very large tables this is still a time-consuming operation. Further, there will be about 621 million such look-ups. RDBMS systems provide special capabilities that are useful in doing the type of look-ups used in this normalization process, such as doing multiple RDBMS JOINs of an unnormalized RDF triples table with three copies of the LexValues table. However, at the size of real-world RDF models, such as the single UniProt model, even the use of joins results in very slow processing. Thus, the prior art does not scale to the sizes required for very large real-world RDF models. For a system that is intended to support multiple RDF models, it is even more the case that the prior art does not scale.
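The join-based form of this normalization can be sketched using Python's built-in sqlite3 module as a stand-in for a full RDBMS. The table name RawTriples, the column names, and the function name normalize_with_joins are hypothetical, and SQLite is used only so that the example is self-contained; the sketch shows the shape of the three-way join against LexValues, not its performance at scale.

```python
import sqlite3


def normalize_with_joins(triples):
    """Return (lexval -> id mapping, list of id triples) using three joins."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE RawTriples (subj TEXT, pred TEXT, obj TEXT)")
    con.execute("CREATE TABLE LexValues (id INTEGER PRIMARY KEY, lexval TEXT UNIQUE)")
    con.executemany("INSERT INTO RawTriples VALUES (?, ?, ?)", triples)

    # Build LexValues from every distinct string in any position (UNION removes duplicates).
    con.execute("""
        INSERT INTO LexValues (lexval)
        SELECT subj FROM RawTriples
        UNION SELECT pred FROM RawTriples
        UNION SELECT obj FROM RawTriples
    """)

    # Join the unnormalized table with three copies of LexValues.
    rows = con.execute("""
        SELECT s.id, p.id, o.id
        FROM RawTriples t
        JOIN LexValues s ON s.lexval = t.subj
        JOIN LexValues p ON p.lexval = t.pred
        JOIN LexValues o ON o.lexval = t.obj
    """).fetchall()
    lex = dict(con.execute("SELECT lexval, id FROM LexValues"))
    con.close()
    return lex, rows
```

Each of the three joins probes the whole LexValues table, which is why the cost grows with both the number of triples and the number of distinct lexical values.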
A further prior art technique for dealing with this problem is the use of HASH JOIN operations. However, HASH JOIN operations only perform well when the join table fits completely into available main memory. Given the immense size of real-world RDF models, this means that RDBMS systems with the amount of main memory required for satisfactory performance with real-world RDF data systems will be both rare and expensive. Thus, this prior art also does not scale for very large real-world RDF models.
Bulk Loading
Bulk loading is a well-understood functionality that is provided by database management systems for loading large amounts or batches of data into RDBMS tables from external files. Support for bulk loading of DBMS data is included in almost all commercial DBMS systems.
Bulk loading consists generally of dropping indices on the tables to which data will be added, and importing the additional data directly into the relevant tables with minimal processing on the data, followed by re-indexing the data. Bulk loading as just described does not, however, work well for bulk loading of RDF data, as it does not deal with the need to transform the RDF triples by normalizing lexical values, compressing URIs, and generating canonical forms for literal values.
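The generic drop-load-reindex pattern can be sketched with Python's built-in sqlite3 module. The table name IdTriples is borrowed from the discussion above, while the index name idx_subj, the column names, and both function names are hypothetical. The sketch covers only the generic pattern and, as just noted, performs none of the RDF-specific transformations.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory table with one index (illustrative schema only)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE IdTriples (subj_id INTEGER, pred_id INTEGER, obj_id INTEGER)"
    )
    con.execute("CREATE INDEX idx_subj ON IdTriples (subj_id)")
    return con


def bulk_load(con, rows):
    """Drop the index, import rows with minimal processing, then re-index."""
    # Step 1: drop the index so it is not maintained row by row during the load.
    con.execute("DROP INDEX IF EXISTS idx_subj")
    # Step 2: import the whole batch in a single transaction.
    with con:
        con.executemany("INSERT INTO IdTriples VALUES (?, ?, ?)", rows)
    # Step 3: rebuild the index once over the loaded data.
    con.execute("CREATE INDEX idx_subj ON IdTriples (subj_id)")
    con.commit()
```

Note that the rows passed to bulk_load are assumed to be already normalized; producing them is exactly the part that generic bulk loading does not address.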
In the preferred embodiment, RDF data to be bulk-loaded is contained in files. In these files, the RDF data may be represented in a number of standard formats. One of these is the N-Triple format. FIG. 19 shows examples of the N-Triple format for RDF data. In this format, each element of this triple is enclosed in angle brackets and the elements have the order subject, predicate, and object. For further information on the N-Triple format, see N-Triples, www.w3.org/TR/rdf-testcases/#ntriples.
In N-Triple format, URIs and typed literals may employ delimiters within the value string: a delimiter is a character or specific sequence of characters that appears between two parts of the string, and thus delimits, or divides, the string into a first part and a second part. Unless specified otherwise, a delimiter can appear anywhere in the string: a delimiter which is at the start of a string, for example, would “divide” the string into a second part, which is the rest of the string, and a first part, which would be nothing, also called an empty string. An internal delimiter is a delimiter which is not the first character or characters in the string, and also not the last character or characters of the string. The rearmost or final internal delimiter in a string would be the last such delimiter in the string, except for a delimiter which was at the end of the string. Another term for the first part of a string divided into two parts is the prefix; similarly, a term for the second part of a divided string is the suffix.
1904 shows a single triple representing that a Female is a subclass of the type Person. Each element of this triple is enclosed in angle brackets and the elements have the order subject 1942, predicate 1943, and object 1944. The object part of the triple is an example of an object value that is a URI, as shown at 1941.
1903 shows a triple representing that Tom was born at 8:10:56 P.M. on Dec. 10, 2004 (Greenwich Mean Time). The triple consists of the subject 1952, predicate 1953, and object 1954. The object part of the triple, shown at 1931 and 1932, is an example of a literal value in the typed literal format (see W3C RDF/XML Syntax Specification (Revised), www.w3.org/TR/rdf-syntax-grammar/): the value part of the typed literal string is at 1931, an internal delimiter consisting of two carets is at 1932, and the part of the string which states its type, including the strict syntax of the value part, is at 1933.
1902 shows a short excerpt of another RDF file format based on XML: this example is excerpted from the UniProt database (see www.uniprot.org). The example 1902 states information about the location of a gene related to an organelle (a structure inside a cell) known as a chromatophore.
1901 shows three further examples of possible URI values, such as might be used in a triple in N-Triple format.
The need to make normalized and canonicalized forms of the RDF triples being loaded in bulk makes the problem of bulk-load for RDF data challenging. The challenges include:
All input triples must be transformed and checked for conformance to the syntax and semantics of the standardized RDF format and data models.
All lexical values must be normalized by mapping them to UIDs. This requires that the bulk load process further determine which lexical values being loaded already have been mapped to UIDs.
Different representations in literal values of the same value in the input data must be translated reliably to a standardized and predictable canonical form, and further a UID must be assigned to the canonical form. This canonicalization requirement, however, is in tension with the need to preserve the original literal value, as required by the property of fidelity.
Duplicate triples within a model must be eliminated, in order to ensure the set property of the RDF data for the model. The set property requires that no triple exists more than once in the model.
There are further needs for reducing the storage required to hold the RDF data, as the datasets are quite large, and increased storage requirements result both in greater expense, and in reduced query performance.
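The delimiter terminology defined above (internal delimiter, final internal delimiter, prefix, and suffix) can be made concrete with a minimal Python sketch. The function name, the choice of “/” and “#” as URI delimiters, and the example strings in the usage note are assumptions made for illustration; they are not the values shown in FIG. 19.

```python
def split_at_final_internal_delimiter(value, delimiters):
    """Return (prefix, delimiter, suffix), or None if no internal delimiter exists.

    An internal delimiter may not begin at the first character and may not
    end at the last character of the string.
    """
    best = None
    for delim in delimiters:
        # Search window [1, len - 1): the match cannot start at index 0 and
        # cannot include the last character.
        pos = value.rfind(delim, 1, len(value) - 1)
        if pos != -1 and (best is None or pos > best[0]):
            best = (pos, delim)
    if best is None:
        return None
    pos, delim = best
    return value[:pos], delim, value[pos + len(delim):]
```

For a hypothetical URI such as http://www.example.org/family/Female, calling the function with the delimiters “/” and “#” yields the prefix http://www.example.org/family and the suffix Female. For a typed literal, calling it with the two-caret delimiter separates the value part from the type part.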
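Two of the challenges listed above, reusing UIDs for lexical values that have already been mapped and preserving the set property, can be sketched in Python as follows. The function name load_batch and the in-memory dictionary and set standing in for the RDBMS tables are hypothetical; the sketch deliberately omits syntax checking, canonicalization, and storage reduction.

```python
def load_batch(uid_by_lexval, id_triples, batch):
    """Add a batch of (s, p, o) string triples to an existing model.

    uid_by_lexval: dict of lexical value -> UID, updated in place.
    id_triples:    set of (s_id, p_id, o_id) tuples, updated in place.
    Returns the number of triples actually added.
    """
    next_uid = max(uid_by_lexval.values(), default=0) + 1
    added = 0
    for triple in batch:
        ids = []
        for lexval in triple:
            # Reuse the UID if this lexical value has already been mapped.
            if lexval not in uid_by_lexval:
                uid_by_lexval[lexval] = next_uid
                next_uid += 1
            ids.append(uid_by_lexval[lexval])
        key = tuple(ids)
        # Set property: a triple is stored at most once per model.
        if key not in id_triples:
            id_triples.add(key)
            added += 1
    return added
```

A set lookup per triple is cheap in memory; the difficulty described in this section arises when the same checks must be made against tables holding hundreds of millions of rows.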