1. Technical Field
The present invention generally relates to resource description framework data and, more particularly, to creating benchmark graph data.
2. Description of the Related Art
The RDF (Resource Description Framework) is quickly becoming the de facto standard for the representation and exchange of information. This is nowhere more evident than in the recent Linked Open Data (LOD) initiative, where data from varying domains such as geographic locations, people, companies, books, films, scientific data (genes, proteins, drugs), statistical data, and the like, are interlinked to provide one large data cloud. As of October 2010, this cloud consists of around 200 data sources contributing a total of 25 billion RDF triples. The acceptance of RDF is not limited, however, to open data that are available on the web. Governments are also adopting RDF. Many large companies and organizations are using RDF as their business data representation format, whether for semantic data integration, search engine optimization and better product search, or for representation of data from information extraction. Indeed, with search engines such as GOOGLE and YAHOO promoting the use of RDF for search engine optimization, there is a clear incentive for its growth on the web.
One of the main reasons for the widespread acceptance of RDF is its inherent flexibility: a diverse set of data, ranging from structured data (e.g., DBLP) to unstructured data (e.g., WIKIPEDIA/DBpedia, where WIKIPEDIA is an example of an online encyclopedia and DBpedia is a project for extracting structured data from information created as part of WIKIPEDIA), can all be represented in RDF. Traditionally, the structuredness of a dataset, which is defined herein to refer to the amount of structure, if any, in the dataset, is one of the key considerations when deciding on an appropriate data representation format (e.g., relational for structured and XML for semi-structured data). The choice, in turn, largely determines how we organize the data (e.g., dependency theory and normal forms for the relational model and for XML). Structuredness is also of central importance when deciding how to index the data (e.g., B+-tree indexes for relational and numbering scheme-based indexes for XML). It further influences how we query the data (e.g., using SQL for relational data and XPath/XQuery for XML). In other words, data structuredness permeates every aspect of data management, and accordingly the performance of data management systems is commonly measured against data with the expected level of structuredness (e.g., the TPC-H benchmark for relational and the XMark benchmark for XML data). The main strength of RDF is precisely that it can be used to represent data across the full spectrum of structuredness, from unstructured to structured. This flexibility of RDF, however, comes at a cost. By blurring the structuredness lines, the management of RDF data becomes a challenge, since no assumptions can be made a priori by an RDF DBMS as to what type(s) of data it is going to manage. Unlike in the relational and XML cases, an RDF DBMS has the onerous requirement that its performance should be tested against very diverse data sets (in terms of structuredness).
A number of RDF data management systems (a.k.a. RDF stores) are currently available. There are also research prototypes supporting the storage of RDF over relational (column) stores. To test the performance of these RDF stores, a number of RDF benchmarks have also been developed. For the same purpose of testing RDF stores, the use of certain real datasets has been popularized. While the focus of existing benchmarks is mainly on the performance of the RDF stores in terms of scalability (i.e., the number of triples in the tested RDF data), a natural question to ask is which types of RDF data these RDF stores are actually tested against. That is, we want to investigate: (a) whether existing performance tests are limited to certain areas of the structuredness spectrum; and (b) what these tested areas in the spectrum are. To that end, we show in particular that (i) the structuredness of each benchmark dataset is practically fixed; and (ii) even if a store is tested against the full set of available benchmark data, these tests cover only a small portion of the structuredness spectrum. However, we also show that many real RDF datasets lie in currently untested parts of the spectrum.