Application development requires assumptions about the structure of the data that an application uses. Once the structure of the data is known, that structure may be assumed and the application developed accordingly. An application can run correctly only on data that conforms to the assumed structure. Hence, conformance of the data used by an application to a schema is important to the usability of the data by the application.
In application development, the relational database model has been a dominant data model. A relational database model is schema based, which means that writing data in a relational database requires that the data conform to a schema explicitly defined for the relational database (“explicit schema”). Data in a relational database is very usable because, among other reasons, the data conforms to a known schema defined for the relational database.
The relational database model requires that a schema be developed and implemented within a relational database before data is stored in the database. This requirement may hinder iterative development of applications, an important ability for many software development endeavors. Under iterative development, changes are made to applications in smaller increments but over a greater number of iterations. As an application changes between iterations, new or modified schemas with new or modified fields must be defined for the relational database, possibly requiring downtime and database migration.
Schema-less data models facilitate iterative development of applications. Under a schema-less data model, data may conform to an “implicit schema”, and applications may be developed according to the implicit schema. However, the data does not have to conform to an explicit schema defined for a database before the data is stored in the database. This capability makes it easy to make significant application changes rapidly, without first having to change the schema of a database and possibly migrate the database to the new schema.
Relational databases are managed by relational database management systems (RDBMSs). An RDBMS provides powerful querying capabilities that make data in a relational database very usable, such as the capability to query data using a query language such as SQL and to present the data in relational form, as rows with columns. These powerful query capabilities are being extended to cover schema-less data. Thus, RDBMSs are enabled to store not only schema-based data but also schema-less data, and to provide powerful query capabilities for schema-less data.
Realization of the most powerful query capabilities of an RDBMS depends on an explicit schema, for both schema-based and schema-less data. However, unlike for schema-based data, an explicit schema for schema-less data may be, and often is, developed after the schema-less data is added to a database.
Defining an explicit schema for schema-less data entails a complex, time-consuming, and error-prone manual task. The schema-less data is examined to discover its structure. Statements describing the structure, and relational views for accessing the schema-less data, are submitted to the RDBMS. Because schema-less data is often hierarchically marked up, such statements involve writing complicated path expressions. As schema-less data is added, it is examined to discover new structures, and new statements are submitted to the RDBMS to reflect the changes. Because these tasks are time-consuming, development of explicit schemas for schema-less data is delayed, thereby delaying the ability to query schema-less data using the powerful querying capabilities of an RDBMS.
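As an illustration of the examination step described above, the following minimal sketch assumes that the schema-less data consists of JSON documents and uses a simplified path notation (`$` for the document root, `[*]` for array elements). The function name `collect_paths` and the sample documents are hypothetical and are shown only to make the discussion concrete. The sketch lists each path found in the documents together with the value types observed at that path, which is the kind of information a person would otherwise gather by hand before writing path expressions.

```python
import json
from collections import defaultdict


def collect_paths(value, path="$", found=None):
    """Record every path in a parsed JSON value and the type names seen there."""
    if found is None:
        found = defaultdict(set)
    if isinstance(value, dict):
        found[path].add("object")
        for key, child in value.items():
            collect_paths(child, f"{path}.{key}", found)
    elif isinstance(value, list):
        found[path].add("array")
        for child in value:
            collect_paths(child, f"{path}[*]", found)
    else:
        # Scalar leaf: record the Python type name (str, int, float, bool, NoneType).
        found[path].add(type(value).__name__)
    return found


# Hypothetical documents; the second introduces structure absent from the first.
documents = [
    '{"id": 1, "customer": {"name": "Ann", "address": {"city": "Oslo"}}}',
    '{"id": 2, "customer": {"name": "Bo"}, "items": [{"sku": "A1", "qty": 2}]}',
]

found = defaultdict(set)
for text in documents:
    collect_paths(json.loads(text), "$", found)

for path in sorted(found):
    print(path, sorted(found[path]))
```

Even in this small example, the second document adds paths under `$.items[*]` that the first lacks, so any statements written against the first document alone would need to be revised.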
Some aspects of developing explicit schemas for schema-less data may be automated using schema-discovery utilities, which generate schemas for a body of schema-less data. When the schema-discovery utilities are run, the entire body of schema-less data is processed, which may entail significant expenditure of time and computing resources. Schema-discovery utilities are often run during off-hours to minimize impact on computing resources. Capturing schema changes to a body of schema-less data entails re-running the schema-discovery utilities against the whole body of schema-less data. The schemas generated by schema-discovery utilities are often manually examined before actual implementation in an RDBMS, to ensure that the schemas are feasible. While schema-discovery utilities may alleviate the delay attendant to manual development of explicit schemas for schema-less data, the delay is not eliminated and may be significant.
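The cost of re-running discovery against the whole body of data can be illustrated with a minimal sketch. It assumes JSON documents and considers only top-level field names; the function `discover_schema` is hypothetical and does not represent any particular utility. The point shown is that detecting a single new field after one document is added still requires scanning every document again.

```python
import json


def discover_schema(documents):
    """Scan every document; return (top-level field names, documents scanned)."""
    fields = set()
    scanned = 0
    for text in documents:
        fields.update(json.loads(text).keys())
        scanned += 1
    return fields, scanned


# Hypothetical body of schema-less data.
body = ['{"id": 1, "name": "Ann"}', '{"id": 2, "name": "Bo"}']
first_fields, first_scanned = discover_schema(body)

# One document with a new field is added; discovery is re-run from scratch.
body.append('{"id": 3, "name": "Cy", "email": "cy@example.com"}')
second_fields, second_scanned = discover_schema(body)

print(sorted(second_fields - first_fields), second_scanned)
```

In this sketch the work done by each run grows with the total number of documents stored, not with the number of documents added since the previous run.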
Based on the foregoing, an approach for automatically defining explicit schemas on schema-less data that is faster and consumes fewer computer resources is desirable.