Graph and document datasets, e.g., Resource Description Framework (RDF) and JavaScript Object Notation (JSON) datasets, are in widespread use. One advantage of the corresponding data models, and a reason for the proliferation of such data, is their schema flexibility. For example, users of an RDF dataset can add new predicates to a particular entity without concern for “breaking or violating the schema”, and different instances of the same type might have completely different predicates. Although this provides great flexibility and value in a number of situations, e.g., during the exchange or integration of multiple datasets, knowledge of the actual schema to which the underlying data conform is desired for purposes of data analysis, data understanding, data querying and data storage.
The challenge lies in finding the appropriate schema and ensuring that all the data contained within a given dataset conform to it. Studies have shown that even if semi-structured datasets come with a schema, there is seldom any relationship between what the schema claims about the structure of the data and the actual structure of the data. Although the existing schema can be used as a guide, further work is required to identify the actual schema, i.e., structure or organization, of a given dataset associated with that existing schema. Identifying the actual schema of the data provides for better querying and storage of the data as well as better understanding of the data for purposes of data analytics.
This mismatch between a given schema and the data that are supposed to conform to it exists for various types of datasets, including RDF graphs. A prominent example is DBpedia, which is RDF data extracted from Wikipedia, a publicly editable source of information. Consider the popular DBpedia sort of persons, which includes all the people having an entry in Wikipedia. According to the sort definition, each person in DBpedia has 8 properties: a name, a given name, a surname, a birth date, a birth place, a death date, a death place, and a description. There are currently 790,703 people, and while a large portion of them are likely alive, i.e., they do not have a death date or death place, one would expect data to exist regarding when and where these people were born. The statistics for these people, however, are very revealing. Only 420,242 people have a birth date and only 323,368 have a birth place. For only 241,156 are both the birth date and the birth place known. There are approximately 40,000 people who do not even have an associated last name. When it comes to death places and death dates, these data are known for only 90,246 and 173,507 people, respectively.
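To put these counts in proportion, the following minimal Python sketch turns the figures quoted above into coverage fractions and uses inclusion-exclusion to derive how many persons have neither a birth date nor a birth place. The names `TOTAL_PERSONS`, `known_counts`, `coverage` and `count_with_neither` are hypothetical and chosen only for illustration. The sketch works on the quoted totals alone and does not query DBpedia.

```python
# Counts quoted in the text for the DBpedia persons sort.
TOTAL_PERSONS = 790_703

# Number of persons for which each property (or combination) is known.
# Labels are illustrative, not actual DBpedia predicate names.
known_counts = {
    "birth date": 420_242,
    "birth place": 323_368,
    "birth date and birth place": 241_156,
    "death place": 90_246,
    "death date": 173_507,
}


def coverage(count: int, total: int = TOTAL_PERSONS) -> float:
    """Return the fraction of persons for which a property is present."""
    return count / total


def count_with_neither(a: int, b: int, both: int,
                       total: int = TOTAL_PERSONS) -> int:
    """Inclusion-exclusion: persons having neither of two properties."""
    return total - (a + b - both)


# Print one row per property: label, absolute count, coverage percentage.
for label, count in known_counts.items():
    print(f"{label:<28} {count:>8,}  {coverage(count):6.1%}")

neither_birth = count_with_neither(
    known_counts["birth date"],
    known_counts["birth place"],
    known_counts["birth date and birth place"],
)
print(f"neither birth date nor birth place: {neither_birth:,}")
```

On these figures, roughly 53% of persons have a birth date, about 30% have both a birth date and a birth place, and 288,249 persons have neither.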
There is actually nothing wrong with the DBpedia person data. The data reflect the simple fact that the available information about any domain of discourse, e.g., people, is inherently incomplete. While this is the nature of information in practice, data sorts go against this trend and favor uniformity, i.e., sorts require that the data tightly conform to the provided sorts. In the example of DBpedia persons, the expectation is that every person has all 8 properties. Therefore, there is a need to bridge the gap between these two worlds, the sorts and the respective data.