Various embodiments of this disclosure relate to query optimization and, more particularly, to using pilot runs (i.e., sample executions) to improve query optimization.
Traditional query optimizers rely on data statistics to estimate predicate selectivity (i.e., the fraction of values that satisfy a predicate) and the cardinality of operators (i.e., the number of rows each operator produces). These estimates are used to select a low-cost execution plan for a given query. Even in the relational setting, optimizers are plagued by incorrect cardinality estimates, mainly due to undetected data correlations, the presence of user-defined functions (UDFs) and other complex predicates, and external variables in parameterized queries. Because UDFs are opaque to optimizers, the optimizers cannot accurately optimize queries that include them. Various solutions have been proposed to capture data correlations, but these require detailed and targeted statistics. Collecting such statistics on all datasets may be prohibitively expensive in some cases, such as in the case of large clusters.
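By way of illustration only, the following minimal sketch shows how an undetected data correlation can lead to an incorrect cardinality estimate. It assumes a hypothetical toy table in which one column fully determines another, and a simplified estimator that multiplies per-predicate selectivities as though the predicates were independent. The table contents and the function names (build_rows, selectivity, estimate_independent, actual_cardinality) are assumptions introduced for this example and are not part of any particular optimizer.

```python
# Hypothetical toy table: "model" fully determines "make" (perfect correlation).
MAKE_OF_MODEL = {
    "civic": "honda",
    "accord": "honda",
    "corolla": "toyota",
    "camry": "toyota",
}


def build_rows(copies_per_model=250):
    """Build a table with the same number of rows for every model."""
    rows = []
    for model, make in MAKE_OF_MODEL.items():
        rows.extend({"make": make, "model": model} for _ in range(copies_per_model))
    return rows


def selectivity(rows, predicate):
    """Fraction of rows that satisfy a single predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


def estimate_independent(rows, predicates):
    """Estimate output rows by multiplying selectivities (independence assumption)."""
    estimate = len(rows)
    for p in predicates:
        estimate *= selectivity(rows, p)
    return estimate


def actual_cardinality(rows, predicates):
    """Count the rows that really satisfy every predicate."""
    return sum(1 for r in rows if all(p(r) for p in predicates))


rows = build_rows()
predicates = [
    lambda r: r["make"] == "honda",
    lambda r: r["model"] == "civic",
]
estimated = estimate_independent(rows, predicates)
actual = actual_cardinality(rows, predicates)
print(f"estimated={estimated:.0f}, actual={actual}")  # estimated=125, actual=250
```

In this sketch every "civic" row is also a "honda" row, so the second predicate alone determines the result, and the independence assumption underestimates the true cardinality by a factor of two. Errors of this kind may compound across multiple predicates and joins.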
The problem of query optimization is exacerbated in the context of large-scale data platforms, such as those built on the Hadoop Distributed File System (HDFS), which have recently become popular. In addition to large data volumes, other important characteristics distinguish query processing in such an environment from traditional relational query processing. For instance, nested data structures, such as structs, maps, and arrays, are pervasive in these environments because users commonly store data in denormalized form. Additionally, users push more complex business logic closer to the data, resulting in heavy usage of UDFs in queries. Such environments are often cloud-based, in which case query optimization plays a crucial role for scheduling as well as pricing purposes.