In today's complex Information Technology (IT) environments, the plethora of heterogeneous data sources poses a challenge for information architects, data scientists, and analysts. Some of the data is relational in nature and stored in a Relational DataBase Management System (RDBMS), while some of the data is contained in files in a distributed file system and processed by massively scalable big data systems, such as an Apache® Hadoop® Distributed File System (HDFS™). (Apache, Hadoop, and HDFS are trademarks or registered trademarks of the Apache Software Foundation in the United States and/or other countries.)
Relational databases are organized into tables that consist of rows (also referred to as tuples or records) and columns (also referred to as fields or attributes) of data. A table in a database can be accessed using an index. An index is an ordered set of references (e.g., pointers) to the records in the table. The index is used to access each record in the table using a key (i.e., one of the fields or attributes of the record, which corresponds to a column). A query of the relational database may be described as a request for information based on specific conditions. A query typically includes one or more predicates. A predicate may be described as an element of a search condition that expresses or implies a comparison operation (e.g., A=3).
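As an illustration only, the relationship between a table, an index, and a predicate can be sketched with Python's built-in sqlite3 module. The table name employee, the index name idx_employee_dept, and the sample rows are assumptions made for this sketch and do not come from any particular system; the predicate dept = 3 mirrors the A=3 example above.

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ada", 3), (2, "Grace", 5), (3, "Edgar", 3)],
)

# An index is an ordered set of references to the records, keyed on a column.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

# A query with a single predicate (dept = 3), analogous to A=3.
query = "SELECT id, name FROM employee WHERE dept = 3"
rows = conn.execute(query + " ORDER BY id").fetchall()

# Ask the engine how it would evaluate the predicate; the last column of
# each plan row is a human-readable description of the access path.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(rows)
for step in plan:
    print(step[-1])
```

In this sketch the query plan is expected to mention idx_employee_dept, which suggests the records are located through the index rather than by scanning every row, although the exact plan wording varies by SQLite version.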
On the other hand, data in distributed file systems, such as an Apache® Hadoop® Distributed File System (HDFS™), is accessed in the form of directories and files.
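To make the contrast concrete, the following minimal sketch reads records by walking directories and files rather than by looking up a key in an index. It uses a temporary local directory as a stand-in for a distributed file system and does not use HDFS or any Hadoop API. The directory layout (sales/2023, sales/2024), the file name part-0000.csv, and the helper read_records are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_records(root: Path, pattern: str = "*.csv") -> list[dict]:
    """Collect every row from every matching file under root (hypothetical helper)."""
    records = []
    for path in sorted(root.rglob(pattern)):
        with path.open(newline="") as handle:
            records.extend(csv.DictReader(handle))
    return records


# A local temporary directory stands in for the distributed file system.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "sales"
    (root / "2023").mkdir(parents=True)
    (root / "2024").mkdir(parents=True)
    (root / "2023" / "part-0000.csv").write_text("id,amount\n1,10\n2,20\n")
    (root / "2024" / "part-0000.csv").write_text("id,amount\n3,30\n")

    # There is no index to consult: the files are located by path, read in
    # full, and only then filtered against the condition.
    records = read_records(root)
    matching = [r for r in records if r["amount"] == "20"]

print(matching)
```

Here the condition is applied only after every file has been read, which is the key difference from the indexed lookup that a relational table supports.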