Apache Hadoop project (hereinafter “Hadoop”) is an open-source software framework for developing software for reliable, scalable and distributed processing of large data sets across clusters of commodity machines. Hadoop includes a distributed file system, known as Hadoop Distributed File System (HDFS). HDFS links together the file systems on local nodes to form a unified file system that spans an entire Hadoop cluster. Hadoop also includes Hadoop YARN that provides a framework for job scheduling and cluster resource management that is utilized by a programming framework known as MapReduce. Hadoop is also supplemented by other Apache projects including Apache Hive (hereinafter “Hive”) and Apache HBase (hereinafter “HBase”). Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. HBase is a scalable, distributed NoSQL (No Structured Query Language) database or data store that supports structured data storage for large tables.
Hadoop currently does not support a relational database management system (RDBMS). For a relational database, a schema—the organization of data into tables having specific columns together with a set of integrity constraints ensuring compatibility between the columns of the tables—can be defined. A typical RDBMS implements a schema-on-write model, where a schema is enforced on data as the data is written into the database. Specifically, the data is reorganized and filtered using the integrity constraints before the data is stored into the database. A schema-on-write model works well for answering known questions. If a previously unknown question needs to be answered, new data may need to be captured. However, the RDBMS cannot accept new data that does not match the schema. To accommodate the new data, typically old data needs to be deleted from the database, the schema needs to be modified, and new data needs to be parsed and loaded into the database. In addition, data architects typically need to ensure that all the systems connected to the RDBMS work with the updated schema. This process of accommodating the new data can take a long time. Until then, the new data cannot be captured to answer the previously unknown question.
On the other hand, Hadoop currently follows a schema-on-read model, where a schema is not enforced on data until the data is read from the database. In this case, a schema generally specifies an organization of data in terms of file formats. As a result, the processing of the data can be decoupled from the storage of the data. Specifically, the underlying storage system in Hadoop can take files in their original format (e.g., tab-delimited text files, CSV, XML, JSON, images, etc.), while allowing an associated schema to be designed later and stored separately. In response to a query, the stored data is then transformed in-memory according to the separately stored schema. By virtue of the schema-on-read model, input data can be quickly updated in the database, which encourages users to experiment with different schemas.
The schema-on-read model and the schema-on-write model have their distinct merits. It would be useful for Hadoop to offer users the flexibility of using either one or a combination of them depending on the specific needs and requirements.