MapReduce is a programming model for parallelizing large-scale data processing, and requires the writing of map and reduce functions. By way of example, Hadoop® is an open-source implementation of a MapReduce framework that manages communication across various nodes. Further, the Hadoop® Distributed File System (HDFS) is a storage system used by Hadoop® applications, wherein disk space is shared across all machines on a Hadoop® cluster and a file can be distributed across multiple machines.
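By way of illustration only, the map and reduce functions described above can be sketched as a single-process simulation. The names map_fn, reduce_fn, and run_mapreduce are hypothetical and are not part of Hadoop® or any other framework; an actual framework would distribute the map tasks, the shuffle, and the reduce tasks across cluster nodes.

```python
from collections import defaultdict


def map_fn(line):
    # Map task: emit one (word, 1) key-value pair per word in the input line.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce task: combine all values that share a key into a single output.
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical in-memory driver; a real cluster would perform this
    # grouping (the shuffle) by sending key-value pairs between nodes.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # One reduce call per distinct key, visited in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


counts = run_mapreduce(["rain fell", "Rain stopped"], map_fn, reduce_fn)
```

In this sketch, the grouping step stands in for the shuffle and sort operations, which in a distributed setting are a principal source of communication cost.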
Executing a MapReduce program incurs costs, however. Such costs commonly include, for example, disk input/output (I/O) costs, communication costs, and processing costs. For instance, disk I/O costs can include reading and/or parsing large amounts of data, and writing such data to an HDFS. Communication costs can include, for example, the communication of key-value pairs among cluster nodes, and the cost of shuffle and/or sort operations. Additionally, processing costs can include computations carried out to generate key-value pairs by map tasks, as well as computations carried out to generate outputs by reduce tasks.
A particularly challenging area is the processing of multi-way theta join queries involving arithmetic operators on MapReduce. Join queries are an important class of queries that arise in various analytics scenarios. Join predicates may be equality predicates or inequality predicates. An equality predicate checks two attributes for equality, while an inequality predicate (also referred to as a theta join predicate) takes a form in which the difference between two attribute values is less than a given threshold. A two-way join query involves only two relations, while a multi-way join query involves multiple relations (and hence multiple theta join predicates). Existing query processing approaches include processing two-way inequality join queries, processing two-way and multi-way equality joins, and processing multi-way inequality join queries using a sequence of multiple chain joins.
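By way of illustration only, the theta join predicate and a multi-way join query described above can be sketched as follows. The function names, the use of an absolute difference, and the sample relations R, S, and T are assumptions made for this sketch. The exhaustive enumeration shown is a naive single-machine baseline intended only to make the semantics concrete; it is not one of the existing approaches noted above.

```python
from itertools import product


def theta(a, b, threshold):
    # Theta (inequality) predicate: the difference between two attribute
    # values is less than a threshold. Absolute difference is an assumption.
    return abs(a - b) < threshold


def multiway_theta_join(relations, predicates):
    # relations: list of relations, each modeled as a list of real values.
    # predicates: list of (i, j, threshold) tuples, each constraining the
    # values drawn from relations i and j.
    results = []
    for combo in product(*relations):  # every combination of one value per relation
        if all(theta(combo[i], combo[j], t) for i, j, t in predicates):
            results.append(combo)
    return results


# Hypothetical sample data: three relations of real-valued attributes.
R = [1.0, 5.0]
S = [1.5, 9.0]
T = [2.0, 5.2]

# Multi-way query: |R - S| < 1.0 and |S - T| < 1.0.
joined = multiway_theta_join([R, S, T], [(0, 1, 1.0), (1, 2, 1.0)])
```

The enumeration examines every combination of tuples across the relations, so its cost grows with the product of the relation sizes.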
Also, existing query processing approaches include processing interval joins, which involve correlating intervals belonging to two or more relations. An interval has a starting point and an ending point. For example, consider the observation that it rained between 7:00 PM and 8:00 PM. Here, [between 7:00 PM and 8:00 PM] constitutes an interval. An interval predicate may check whether two intervals overlap, whether one interval is contained within another interval, whether one interval ends before a second interval starts, etc. Interval join queries can also be processed much more easily than theta join queries on real-valued data.
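By way of illustration only, the interval predicates described above can be sketched as follows. The Interval type, the function names, the treatment of intervals as closed (endpoints included), and the encoding of times as hours on a 24-hour clock are assumptions made for this sketch.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # starting point
    end: float    # ending point


def overlaps(a, b):
    # True when the two closed intervals share at least one point.
    return a.start <= b.end and b.start <= a.end


def contains(outer, inner):
    # True when inner lies entirely within outer.
    return outer.start <= inner.start and inner.end <= outer.end


def before(a, b):
    # True when a ends before b starts.
    return a.end < b.start


# The rain observation from the text: 7:00 PM to 8:00 PM.
rain = Interval(19.0, 20.0)
# Hypothetical second observations used to exercise the predicates.
wind = Interval(19.5, 20.5)
fog = Interval(21.0, 22.0)

rain_overlaps_wind = overlaps(rain, wind)
rain_before_fog = before(rain, fog)
```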
However, efficient techniques for handling such multi-way theta join queries on real-valued data are not encompassed by the existing approaches. Existing query processing approaches solve multi-way theta join queries as a cascade of intermediate joins, which is computationally expensive. Consequently, a need exists for techniques for processing multi-way theta joins without requiring a cascade of intermediate joins.