Analyzing large data sets has become an increasingly critical task as the amount of digital data increases at extreme rates. The term “big data” refers to data sets that are so large and complex that traditional data processing methods are rendered impractical. New developments in the field of machine processing include the Semantic Web, which is a structure of linked data that provides a common framework to allow data to be shared and reused across application and enterprise boundaries, and that facilitates intelligent machine processing of the data. The Semantic Web framework refers to or includes certain formats and technologies that enable the collection, structuring, and processing of linked data. These include the Resource Description Framework (RDF), which is a simple language for expressing data models; RDF Schema (RDFS), which is a vocabulary for describing properties and classes of RDF-based resources; Web Ontology Language (OWL), which is a language for defining ontologies, that is, classes, properties, and the relationships among them; SPARQL, which is an RDF query language; N-Triples, which is a format for storing and transmitting data; Rule Interchange Format (RIF), which is a framework of web rule language dialects supporting rule interchange on the Web; and other technologies.
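By way of illustration only, the triple-based data model underlying RDF, and the kind of pattern matching that a query language such as SPARQL performs over it, may be sketched as follows. The sketch assumes a simplified in-memory representation; the `Triple` type, the `match` function, and the `ex:` identifiers are hypothetical and are not part of any of the standards named above.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical simplified form)."""
    subject: str
    predicate: str
    obj: str


# Made-up example data; the "ex:" names are placeholders, not real IRIs.
TRIPLES = [
    Triple("ex:alice", "rdf:type", "ex:Person"),
    Triple("ex:alice", "ex:knows", "ex:bob"),
    Triple("ex:bob", "rdf:type", "ex:Person"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]


# Roughly analogous to the pattern "?who rdf:type ex:Person".
people = [t.subject for t in match(TRIPLES, p="rdf:type", o="ex:Person")]
```

In this sketch a query is a triple pattern in which any position may be left unbound, which is the basic building block from which more complex queries are composed.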
As the amount and variety of web data explode, software agents used by data processing engines need a query capability that supports a combination of description logic, geospatial and temporal reasoning, and social network knowledge. Depending on the data application, vendors may use large data warehouses with disparate RDF-based triple stores that describe various events, objects, or data elements. Such data may be stored in or across a vast array of disks or other memory storage devices, such that traditional storage techniques and query tools must search over a large number of disks to return a result. Clearly, this can lead to excessively long searches in the case of large data sets. What is needed, therefore, is a method and system to partition data in a way that optimizes data queries and takes full advantage of the data linkage mechanisms of the Semantic Web. What is further needed is an efficient way to join data elements from one data set with data elements in another data set so that a single query can be performed across both data sets at the same time. In general, in a parallel query the same query is sent to different self-contained databases and the results are collected. In such a parallel system, the data is partitioned, each data partition is self-contained, and the same query is performed against each data partition. In federation, data elements in one database are joined or connected with data elements in another database. In a federated query, the query is sent to one of the databases, and a data connection routes the low-level parts of the query through the other data partitions.
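The distinction between a parallel query and a federated query may be illustrated by the following minimal sketch. It assumes that each partition is a simple in-memory list of (subject, predicate, object) tuples; the function names `match`, `parallel_query`, and `federated_join`, and the `ex:` identifiers, are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical partitions of (subject, predicate, object) tuples.
PART_A = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:globex"),
]
PART_B = [
    ("ex:acme", "ex:locatedIn", "ex:paris"),
    ("ex:carol", "ex:worksFor", "ex:acme"),
]


def match(partition, s=None, p=None, o=None):
    """Low-level lookup in one partition; None acts as a wildcard."""
    return [
        t for t in partition
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def parallel_query(partitions, s=None, p=None, o=None):
    """Send the same pattern to every self-contained partition and collect."""
    with ThreadPoolExecutor() as pool:
        per_partition = list(pool.map(lambda part: match(part, s, p, o),
                                      partitions))
    return [t for result in per_partition for t in result]


def federated_join(entry, others, p1, p2):
    """Find (a, c) with (a p1 b) and (b p2 c), where the two triples may
    reside in different partitions. The query enters at `entry`, and each
    low-level lookup is routed through the other partitions as well."""
    everything = [entry] + list(others)
    pairs = []
    for a, _, b in parallel_query(everything, p=p1):
        for _, _, c in parallel_query(everything, s=b, p=p2):
            pairs.append((a, c))
    return pairs


employees = parallel_query([PART_A, PART_B], p="ex:worksFor")
workplaces = federated_join(PART_A, [PART_B], "ex:worksFor", "ex:locatedIn")
```

In this sketch the parallel query returns every matching triple from both partitions, while the join succeeds only because lookups are routed across partitions: the `ex:worksFor` triple for `ex:alice` is held in one partition and the `ex:locatedIn` triple for `ex:acme` in the other, so neither partition could answer the joined query on its own.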
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.