This specification relates to conversion of data storage formats for large collections of data.
Large collections of data are often stored in databases. A database can be understood as a self-describing collection of integrated records. Different types of databases exist, but a commonly used type is the relational database. A relational database is a database that organizes data using common attributes found in the data set. The resulting “relations” or “tables” of organized data may then be systematically queried and manipulated using operators from a formal mathematical system such as a relational algebra or calculus.
Each table in a relational database typically includes many “rows” of data, each of which is an individual record. A row consists of a collection of field values that occupy specific columns of the table. As an example, a database table for MP4 video attributes might contain the columns (Video_Name, Video_Framerate, Movie_Script_Name) and contain a row with the column field values (“Roger Rabbit MP4 video.mp4”, “25 FPS”, “Who Shot Roger Rabbit.doc”). A relational database table will typically include a primary key (PK), which is a column (or combination of columns) within the table whose values together form the “principal unique identifier” for each specific row within that table. A table's PK serves to uniquely identify a row in that table, e.g., the name of the video in the above example.
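As an illustration only, the video attributes table described above can be sketched with Python's built-in sqlite3 module. The table name video_attributes and the second, conflicting row are assumptions introduced for this sketch; the column names and the first row come from the example above, with Video_Name acting as the primary key.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table name; columns follow the example in the text.
conn.execute(
    """
    CREATE TABLE video_attributes (
        Video_Name        TEXT PRIMARY KEY,
        Video_Framerate   TEXT,
        Movie_Script_Name TEXT
    )
    """
)

row = ("Roger Rabbit MP4 video.mp4", "25 FPS", "Who Shot Roger Rabbit.doc")
conn.execute("INSERT INTO video_attributes VALUES (?, ?, ?)", row)

# A second row reusing the same primary key value is rejected,
# because the PK must uniquely identify each row.
try:
    conn.execute(
        "INSERT INTO video_attributes VALUES (?, ?, ?)",
        ("Roger Rabbit MP4 video.mp4", "30 FPS", "Other Script.doc"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Look the row up by its primary key.
stored = conn.execute(
    "SELECT * FROM video_attributes WHERE Video_Name = ?", (row[0],)
).fetchall()
```

Here the uniqueness check is performed by the database engine itself, so the application does not need to scan the table for duplicates before inserting.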
A foreign key (FK) is a column (or combination of columns) whose values match those of a primary key in the same or a different relational database table. A given foreign key value represents a reference from the row(s) containing the foreign key to the specific row containing the matching primary key value. The problem of ensuring that every foreign key value does in fact match a value of the corresponding primary key is known as the referential integrity problem.
Referential integrity is the consistency of references between coupled tables in a relational database. Referential integrity is usually enforced by the combination of a PK (or candidate key/alternate key) and a FK. For referential integrity to hold, any field(s) in one or more rows of a table that are declared to be a foreign key must contain only values from a parent table's primary key or candidate key. For example, deleting a record that contains a value referred to by a FK in another table would break referential integrity. A relational database management system (RDBMS) enforces referential integrity, normally either by also deleting the rows that contain the FK, or by returning an error and not performing the deletion command. The method used can be determined by the referential integrity constraint, as defined in the data dictionary.
Most commonly, referential integrity is used to ensure that a row cannot exist in one table without the existence of a related row in another (or the same) table. For example, an order cannot exist in an Orders table without a corresponding customer record row in a Customers table. It also ensures that data cannot be altered in a way that would defeat the established relationships. For example, a record row in the Customers table could not be deleted without also deleting all of the related rows in the Orders table, as well as the related rows in any other dependent table.
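The two enforcement behaviors described above (deleting the dependent rows, or refusing the deletion) can be sketched with Python's sqlite3 module. This is a minimal sketch under stated assumptions: the column names customer_id, name and order_id and the sample values are hypothetical, and SQLite is used only as a convenient stand-in for an RDBMS. SQLite enforces foreign keys only when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3


def make_db(on_delete):
    """Build a Customers/Orders pair with the given ON DELETE action."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.execute(
        "CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE Orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL"
        f" REFERENCES Customers(customer_id) ON DELETE {on_delete})"
    )
    conn.execute("INSERT INTO Customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO Orders VALUES (100, 1)")
    return conn


def rejected(conn, sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


def order_count(conn):
    return conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]


# Behavior 1: return an error and do not perform the deletion.
restrict_db = make_db("RESTRICT")
orphan_rejected = rejected(restrict_db, "INSERT INTO Orders VALUES (101, 999)")
delete_rejected = rejected(restrict_db, "DELETE FROM Customers WHERE customer_id = 1")
orders_after_restrict = order_count(restrict_db)

# Behavior 2: delete the rows holding the FK along with the parent row.
cascade_db = make_db("CASCADE")
cascade_db.execute("DELETE FROM Customers WHERE customer_id = 1")
orders_after_cascade = order_count(cascade_db)
```

In this sketch the choice between the two behaviors is made once, in the table definition, which corresponds to the referential integrity constraint recorded in the data dictionary.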
With the development of the semantic Web, a new kind of storage technology, called a triplestore, is being used more often. In a triplestore, one stores triples. A triple is an expression or mathematical n-tuple consisting of three elements: a subject, a predicate, and an object, typically written as (S, P, O). For example, the triple (“Roger Rabbit MP4 video.mp4”, “has_script”, “Who Shot Roger Rabbit.doc”) and the triple (“Roger Rabbit MP4 video.mp4”, “has framerate”, “25 FPS”) indicate that the video content file for Roger Rabbit has associated with it a movie script named “Who Shot Roger Rabbit.doc” and a framerate attribute with a value of 25 frames per second.
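The two example triples carry the same information as the relational row given earlier for the video attributes table. One possible way to derive them from that row is sketched below; the function row_to_triples and the column-to-predicate mapping are hypothetical names introduced for illustration, and the predicate labels are copied from the example triples above.

```python
def row_to_triples(row, subject_column, predicates):
    """Turn one relational row into (S, P, O) triples.

    row            -- mapping of column name to field value
    subject_column -- column whose value becomes the subject (here, the PK)
    predicates     -- mapping of column name to predicate label
    """
    subject = row[subject_column]
    return [
        (subject, predicate, row[column])
        for column, predicate in predicates.items()
    ]


row = {
    "Video_Name": "Roger Rabbit MP4 video.mp4",
    "Video_Framerate": "25 FPS",
    "Movie_Script_Name": "Who Shot Roger Rabbit.doc",
}

# Assumed mapping from non-key columns to predicate labels.
predicates = {
    "Movie_Script_Name": "has_script",
    "Video_Framerate": "has framerate",
}

triples = row_to_triples(row, "Video_Name", predicates)
```

Using the primary key value as the subject means each non-key column of the row yields one triple.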
Such triple statements are essentially assertions that may or may not be true, but as one learns information about the correctness of the assertions, one can begin to do inferencing or reasoning about the information in a triplestore. A triplestore is a special-purpose database for the storage and retrieval of triples. Specifically, an RDF triplestore uses the W3C (World Wide Web Consortium) nomenclature and semantics defined by the Resource Description Framework (RDF) for metadata. Similar to relational databases, metadata saved in the triplestore is retrieved via a query language. Unlike relational databases, triplestores are optimized for the storage and retrieval of many short statements or triples. As such, a triplestore may contain billions of such triples, and standard methods exist to measure the performance of a particular triplestore, such as the Lehigh University Benchmark (LUBM).
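Retrieval from a triplestore is commonly expressed as a pattern in which some of the three positions are fixed and the others are left open. The sketch below illustrates that idea with a plain Python list; it is not an RDF triplestore and does not implement any particular query language, and the function name match and its parameters are assumptions made for this example.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Toy store holding the two example triples from the text.
store = [
    ("Roger Rabbit MP4 video.mp4", "has_script", "Who Shot Roger Rabbit.doc"),
    ("Roger Rabbit MP4 video.mp4", "has framerate", "25 FPS"),
]

# Which script is associated with the video?
scripts = [o for (_, _, o) in match(store, s="Roger Rabbit MP4 video.mp4", p="has_script")]

# Everything asserted about the video.
about_video = match(store, s="Roger Rabbit MP4 video.mp4")

# A pattern with no matching triple returns an empty list.
missing = match(store, p="has_director")
```

A linear scan like this is only workable for a handful of triples; a store holding billions of triples would rely on indexing, which is outside the scope of this sketch.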