A schema of a database refers to the organization of data that defines how the database is constructed. For example, in a relational database, the schema specifies how the data are stored and managed by defining the tables, fields, relationships, views, indexes, functions, queues, triggers, types, and other aspects. Managing data with a flexible schema is desirable for semi-structured data, multi-tenant databases, and many other scenarios.
At present, various applications with flexible schemas face the problem that their schemas are often not well defined and may evolve over time. Conventional databases, such as structured query language (SQL) databases, have difficulty supporting efficient schema evolution. As is known, schema evolution may require re-normalization and table repartitioning, for example, when a one-to-one relationship evolves into a many-to-one relationship. Table repartitioning is time-consuming because data must be migrated from the old tables to the new tables. Additionally, this process often needs to block transactions because it holds exclusive locks on the old and new tables. To avoid these problems, applications usually do not allow online re-normalization and table repartitioning, thereby introducing considerable downtime in the presence of schema evolution.
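By way of illustration only, the repartitioning step described above may be sketched as follows. The sketch assumes an in-memory SQLite database and hypothetical `employee` and `phone` tables, none of which appear in the original text. It shows a one-to-one relationship (one phone number per employee, stored as a column) evolving into a many-to-one relationship (many phone numbers per employee, stored in a separate table), which requires copying every existing row into new tables.

```python
import sqlite3

# Hypothetical starting schema: each employee has exactly one phone number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
INSERT INTO employee VALUES (1, 'Ann', '555-0100'), (2, 'Bob', '555-0101');
""")


def repartition(conn):
    """Split employee.phone into its own table so an employee may have many phones."""
    # New tables matching the evolved schema.
    conn.execute("CREATE TABLE employee_new (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE phone (employee_id INTEGER, number TEXT)")
    # Data migration: every existing row is read and rewritten.
    conn.execute("INSERT INTO employee_new SELECT id, name FROM employee")
    conn.execute(
        "INSERT INTO phone SELECT id, phone FROM employee WHERE phone IS NOT NULL"
    )
    # Swap the old table for the new one.
    conn.execute("DROP TABLE employee")
    conn.execute("ALTER TABLE employee_new RENAME TO employee")
    conn.commit()


def phones_of(conn, employee_id):
    """Return all phone numbers recorded for one employee, sorted."""
    rows = conn.execute(
        "SELECT number FROM phone WHERE employee_id = ? ORDER BY number",
        (employee_id,),
    )
    return [row[0] for row in rows]


repartition(conn)
# Only after the migration can an employee hold a second phone number.
conn.execute("INSERT INTO phone VALUES (1, '555-0199')")
conn.commit()
columns = [row[1] for row in conn.execute("PRAGMA table_info(employee)")]
```

In this sketch the cost of the migration grows with the number of rows in the old table, and in a multi-user database the old and new tables would typically be locked while the copy and swap take place, which is the source of the downtime noted above.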
Traditional solutions for flexible schemas often work on the basis of key-value pairs to avoid re-normalization and table repartitioning upon schema evolution. For example, associative arrays have been proposed for flexible schemas. An associative array is a collection of key-value pairs that represents atomic fields of the entities as well as internal/external relationships that are later used to reconstruct data. Associative arrays may relieve data engines from re-normalization and table repartitioning. However, without data normalization, connecting any two entities or attributes requires a join operation. As a result, the use of associative arrays introduces excessive join operations and therefore decreases the query performance of the databases. Specifically, it has been found that query performance decreases dramatically when queries go beyond simple predicate selections and their execution plans involve a large number of joins over key-value stores. Furthermore, key-value pairs do not work for those applications where data normalization cannot be omitted.
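By way of illustration only, the join overhead of a key-value representation may be sketched as follows. The sketch assumes a single hypothetical `kv` table of (entity, key, value) rows held in an in-memory SQLite database. The table, entity identifiers, and keys are assumptions made for this example and do not appear in the original text. Even a simple question, namely which employees work in a department located in a given city, requires the table to be joined with itself once for each attribute or relationship that is traversed.

```python
import sqlite3

# Hypothetical key-value store: every field and relationship is one row.
store = sqlite3.connect(":memory:")
store.executescript("""
CREATE TABLE kv (entity TEXT, key TEXT, value TEXT);
INSERT INTO kv VALUES
  ('emp1',  'name', 'Ann'),
  ('emp1',  'dept', 'dept1'),
  ('emp2',  'name', 'Bob'),
  ('emp2',  'dept', 'dept2'),
  ('dept1', 'city', 'Seattle'),
  ('dept2', 'city', 'Boston');
""")

# Two self-joins are needed: name -> dept (same entity), dept -> city (linked entity).
QUERY = """
SELECT n.value
FROM kv AS n
JOIN kv AS d ON d.entity = n.entity AND d.key = 'dept'
JOIN kv AS c ON c.entity = d.value  AND c.key = 'city'
WHERE n.key = 'name' AND c.value = ?
ORDER BY n.value
"""


def employees_in_city(conn, city):
    """Return the names of employees whose department is located in the given city."""
    return [row[0] for row in conn.execute(QUERY, (city,))]


seattle_staff = employees_in_city(store, "Seattle")
```

Adding a new attribute in this sketch needs no schema change, only new rows, but each additional attribute or relationship touched by a query adds another self-join, which is consistent with the performance degradation described above.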