Software developers may use data schemas to exchange data between data systems or to store and organize data in storage systems. Data schemas, which are frameworks that help organize and interpret information, usually describe data objects and data object interrelationships. For example, a schema, such as the one illustrated in FIG. 1a, may represent a person and may consist of fields such as: formatted name (fn), family name, given name, social security number, email, telephone number, and address. The address field may link to another schema, illustrated in FIG. 1b, that defines an address with fields such as: post office box, extended address, street address, address locality, address region, postal code, and address country.
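As a rough illustration only, the person and address schemas described above might be encoded as follows. The dictionary layout, the "ref" linking convention, the snake_case field names, the "string" type labels, and the helper `linked_schemas` are all assumptions made for this sketch; the figures do not prescribe any particular encoding.

```python
# Hypothetical dict-based encoding of the schemas described for FIG. 1a and FIG. 1b.
# Field names follow the text; the types and the {"ref": ...} convention are assumed.
ADDRESS_SCHEMA = {
    "name": "address",
    "fields": {
        "post_office_box": "string",
        "extended_address": "string",
        "street_address": "string",
        "address_locality": "string",
        "address_region": "string",
        "postal_code": "string",
        "address_country": "string",
    },
}

PERSON_SCHEMA = {
    "name": "person",
    "fields": {
        "fn": "string",  # formatted name
        "family_name": "string",
        "given_name": "string",
        "social_security_number": "string",
        "email": "string",
        "telephone_number": "string",
        "address": {"ref": "address"},  # link to the address schema
    },
}


def linked_schemas(schema):
    """Return the names of the schemas that this schema links to directly."""
    return [
        field_type["ref"]
        for field_type in schema["fields"].values()
        if isinstance(field_type, dict) and "ref" in field_type
    ]


print(linked_schemas(PERSON_SCHEMA))
print(linked_schemas(ADDRESS_SCHEMA))
```

In this sketch the person schema reports a single link, to the address schema, while the address schema links to nothing.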
In the last few years, Internet data has grown rapidly in amount and complexity. Additionally, data systems now have the ability to aggregate large amounts of data from many different sources. Data schemas therefore have become significantly larger and more intricate in order to handle the vast amounts of complex data from various places. Data schemas have also become more interdependent since, for storage reasons, different parts of the same data are often divided among several different schemas and combined when the data needs to be manipulated or displayed. The scale, complexity, and dependencies of data schemas have become major challenges to managing data warehouses and large-scale data storage systems. In particular, problems with individual data schemas, or consistency issues between data schemas, may make it hard to provide users with meaningful information about data and correlations among different parts of the data.
Data schemas can have many different problems, including problems with quality, consistency, and interoperability. Specifically, an individual schema may not conform to schema quality requirements for the dataset that the schema represents. These requirements may be articulated by a producer, consumer, or developer, or may be dictated by the technical capabilities of the storage system. For example, the storage system's consistency model may require sequential consistency or the storage system may not support transactions. Schemas should follow the requirements exactly for effective storage, integration, communication, and retrieval.
Correctness, compliance, and consistency issues arise during schema development. For example, there may be a requirement for a schema that represents a person as illustrated in FIG. 1a to contain a telephone number as an object that contains two fields: a type of telephone number as a string and a value for the telephone number also as a string. However, the schema may be implemented with the value of the telephone number as a number instead of a string. This incorrect schema implementation could cause problems with data entry and integration as well as data retrieval. Another correctness problem is that a schema may contain redundant information. For example, data schemas may contain duplicate fields or fields that can be derived from other fields. A schema may contain a field for age and a field for date of birth. Because age may be derived from the date of birth field, the age field is considered redundant information.
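The two correctness problems described above can be sketched in a few lines. This is an illustrative sketch only: the names `REQUIRED_TELEPHONE`, `DERIVABLE`, `type_mismatches`, and `redundant_fields` are hypothetical, and the rule table listing which fields are derivable is assumed to be supplied by whoever defines the quality requirements.

```python
# Requirement from the text: a telephone number is an object with two string fields.
REQUIRED_TELEPHONE = {"type": "string", "value": "string"}

# Hypothetical rule table: each key can be derived from the field it maps to.
DERIVABLE = {"age": "date_of_birth"}


def type_mismatches(schema_fields, required_fields):
    """Return (field, expected, actual) for each field that breaks the requirement."""
    problems = []
    for name, expected in required_fields.items():
        actual = schema_fields.get(name)  # None when the field is missing entirely
        if actual != expected:
            problems.append((name, expected, actual))
    return problems


def redundant_fields(field_names, derivable=DERIVABLE):
    """Return fields that are present together with the field they derive from."""
    names = set(field_names)
    return sorted(
        field for field, source in derivable.items()
        if field in names and source in names
    )


# The faulty implementation from the text: the value is declared as a number.
implemented_telephone = {"type": "string", "value": "number"}
print(type_mismatches(implemented_telephone, REQUIRED_TELEPHONE))
print(redundant_fields(["given_name", "date_of_birth", "age"]))
```

The first check flags the `value` field because it is declared as a number rather than a string, and the second flags `age` because `date_of_birth` is also present.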
In addition to correctness problems, a schema may also not comply with policy guidelines. There may be, for example, a requirement that a data schema representing a person, such as the one of FIG. 2a, not include both a person's social security number and credit card number since these two pieces of identifying information, if stolen, may make it easy for an identity thief to commit identity fraud.
Although it may be easy to manually identify schemas that directly include incorrect fields, such as two first name fields for a person, or non-compliant fields, such as a co-located social security number and a credit card number in the example above, there may be certain instances in which schemas indirectly link to other schemas. This indirect linkage can cause a schema to have incorrect fields or be non-compliant. For example, there may be a policy, as discussed above, in which a schema representing a person cannot contain, either directly or indirectly, a social security number and a credit card number. A person data schema, such as the one illustrated in FIG. 2a, would violate that policy because the schema contains a social security number and links to a billing schema, illustrated in FIG. 2b, that contains a credit card number. While the person schema does not directly contain both a social security number and a credit card number, a person is indirectly linked to a credit card number via the billing schema. This type of indirect non-compliance is hard to detect manually, especially in large-scale systems where the data schemas are highly complex and interdependent.
Data schemas are also not always consistent, which makes data integration and the communication and transfer of data among systems difficult. Software developers and/or data analysts usually expect data in a certain format from a data source in order to write code that uses the data. If the data is not received in the anticipated format, the software developer's code may not be compatible with the data format. Additionally, software developers may find certain schema structures difficult to understand or to use to perform statistical analysis on data.
Techniques have been developed to address some schema issues, namely data quality and schema quality. Most schema formats allow for specifying constraints on data in an effort to control data quality. These constraints can be specified on columns of tables in relational databases. For example, using the person schema example from above, there may be a person table. Each field from the person schema may be represented in a column of the person table. The columns for last name and first name may have constraints on them requiring them not to be empty when a new person is added. Each person inserted into the table may also be required to have a unique key to differentiate one person from another. Additionally, the address column may be a foreign key reference to another table, an address table, that holds all address information. The address column in the person table may have a foreign key constraint to reference a valid primary key of the address table. A column may have a constraint that it is unique for every row inserted into the table or may have a check constraint which is a conditional expression that further limits the domain of the column. However, these column constraints do not address the quality of data schemas themselves.
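The column constraints described above can be illustrated with Python's built-in `sqlite3` module. This is a sketch only: the table and column names and the helpers `build_db` and `try_insert` are assumptions chosen for the example, and SQLite is used simply because it ships with Python.

```python
import sqlite3


def build_db():
    """Create in-memory person and address tables with typical column constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript(
        """
        CREATE TABLE address (
            address_id     INTEGER PRIMARY KEY,
            street_address TEXT,
            postal_code    TEXT
        );
        CREATE TABLE person (
            person_id   INTEGER PRIMARY KEY,  -- unique key per person
            family_name TEXT NOT NULL CHECK (length(family_name) > 0),
            given_name  TEXT NOT NULL,
            email       TEXT UNIQUE,
            address_id  INTEGER REFERENCES address(address_id)
        );
        """
    )
    return conn


def try_insert(conn, sql, params):
    """Run one insert; return 'ok', or 'rejected' if a constraint fails."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


conn = build_db()
insert_person = "INSERT INTO person (family_name, given_name, address_id) VALUES (?, ?, ?)"
print(try_insert(conn, "INSERT INTO address (address_id, street_address) VALUES (?, ?)", (1, "1 Main St")))
print(try_insert(conn, insert_person, ("Doe", "Jane", 1)))   # satisfies every constraint
print(try_insert(conn, insert_person, ("Doe", None, 1)))     # NOT NULL violation
print(try_insert(conn, insert_person, ("Doe", "John", 99)))  # no such address row
```

Note that every check in this sketch applies to rows being inserted. None of them says anything about whether the table design itself is correct, compliant, or free of redundancy.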
There are a few customary automated and manual techniques for improving schema quality. One automated technique is to validate a schema based on declarative predicates, which are assertions of certain properties of the schema, during database view updates. This technique only works on deductive databases where schemas are a set of deduction rules and constraints. Furthermore, this technique is limited to schema field checking.
A second automated technique allows validation of a few pre-defined schema properties. Using this technique, schema validation is performed by an assertion-based set of tests which only allows checking a fixed set of properties on data schemas.
Manual techniques for improving schema quality usually involve data engineers or users reviewing proposed data schemas and their relations. This process can be laborious, error-prone, and costly. The process can also be inconsistent because it requires aggregating human opinions. It is also infeasible to scale manual processes to handle large and fast-growing or frequently changing schemas.