Today's enterprises often deploy mission-critical databases that comprise several hundred gigabytes, and often several terabytes, of data. The support and maintenance requirements of such very large databases (VLDBs) can be challenging for the enterprise. With the continued proliferation of information-sensing devices (e.g., mobile phones, online computers, RFID tags, sensors, etc.), increasingly larger volumes of data are collected for various business intelligence purposes. For example, the web browsing activities of online users are captured in various datasets (e.g., cookies, log files, etc.) for use by online advertisers in targeted advertising campaigns. Data from operational sources (e.g., point-of-sale systems, accounting systems, CRM systems, etc.) can also be combined with the data from online sources. In some cases, such VLDBs can be “append only” stores and can comprise fact tables with over a billion rows. In some cases, the foregoing VLDBs can comprise a large number of duplicate rows.
Still, the users (e.g., enterprise business analysts, etc.) of such VLDBs query the datasets with expectations of high performance, including fast query response times. Database providers strive to satisfy such user expectations by continually improving database query capabilities and/or performance. For example, the JOIN clause and/or other join conditions in structured query language (SQL) queries that combine two tables (e.g., views, etc.) in a database can be implemented using various methods for determining join results. For example, nested loop joins, hash joins, sort merge joins, Cartesian joins, and/or other join methods can be selected by the database system based on expected performance characteristics (e.g., latency, resource costs, etc.).
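As an illustration of the foregoing, the following sketch (with hypothetical table names and data, using SQLite merely as a convenient engine) shows a JOIN clause combining two tables; note that the query states only *what* to combine, while the database system selects *how* (e.g., a nested loop join, hash join, etc.) based on its cost estimates:

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(customer_id INTEGER, amount INTEGER);
    CREATE TABLE customers(customer_id INTEGER, region TEXT);
    INSERT INTO sales VALUES (1, 100), (1, 250), (2, 75);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');
""")
# The JOIN clause declares the join condition; the engine chooses
# the join method from its expected performance characteristics.
rows = conn.execute("""
    SELECT c.region, s.amount
    FROM sales AS s
    JOIN customers AS c ON s.customer_id = c.customer_id
""").fetchall()
```

Here each sale is paired with its customer's region; the row order of the result is engine-dependent unless an ORDER BY clause is supplied.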
Further, join types such as inner joins, outer joins, semi-joins, and anti-joins can be implemented based at least in part on detected join conditions. Various keywords (e.g., DISTINCT) can also be included in query blocks as directives to eliminate duplicate rows from the join results. The semantics of a “duplicate” can vary, depending on the query. In some cases, the presence of duplicates in query results is semantically insignificant as compared with the absence of duplicates in those results. In many cases, a query can be examined to determine whether the returned query results will (or might) contain such insignificant duplicates. The aforementioned query examination can be performed over any portions, statements, or blocks of a given query.
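To make these distinctions concrete, the following sketch (hypothetical tables and data, SQLite for convenience) contrasts an inner join, which yields one output row per matching pair, with a semi-join expressed via EXISTS, which yields at most one row per outer row, and shows the DISTINCT keyword acting as a directive to eliminate duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(customer_id INTEGER);
    CREATE TABLE customers(customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1), (1), (2);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
""")
# Inner join: one output row per matching order, so 'Ada' appears twice.
inner = conn.execute("""
    SELECT c.name FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
""").fetchall()
# Semi-join (EXISTS): at most one row per customer, hence no duplicates.
semi = conn.execute("""
    SELECT c.name FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o
                  WHERE o.customer_id = c.customer_id)
""").fetchall()
# DISTINCT as a directive to eliminate duplicate rows from the join result.
distinct = conn.execute("""
    SELECT DISTINCT c.name FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
""").fetchall()
```

In this data set, the inner join returns three rows, while the semi-join and the DISTINCT variant each return the same two duplicate-free rows.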
A duplicate insignificant query block is one where the semantics of the corresponding results are unaffected by the presence or absence of duplicates in the query block results. For example, eliminating duplicates from the query results of a duplicate insignificant query block does not alter the semantics of the query block, but can, in some cases, improve its overall performance. For some duplicate insignificant query blocks (e.g., query blocks with the DISTINCT keyword), a full removal of duplicates can be performed. For other duplicate insignificant query blocks, no duplicates, or only some duplicates, might be removed. In either case, full or partial removal of duplicates can serve to improve the efficiency of the overall set of operations of the query (e.g., JOIN, DISTINCT, GROUP BY, etc.) and/or of database operations that follow the query.
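For example, an IN subquery is a duplicate insignificant query block: its result feeds a membership test, so eliminating duplicates from the subquery cannot change the outer query's results. A minimal sketch (hypothetical data, SQLite for convenience):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(customer_id INTEGER);
    CREATE TABLE customers(customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1), (1), (2);  -- customer 1 appears twice
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
""")
# The inner block returns (1, 1, 2); removing the duplicate 1 is
# semantically insignificant to the outer membership test.
with_dups = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders)
    ORDER BY name
""").fetchall()
without_dups = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders)
    ORDER BY name
""").fetchall()
```

Both queries return the same rows, which is what licenses an optimizer to perform full or partial duplicate removal inside such a block purely on performance grounds.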
Attempted approaches might execute a duplicate insignificant query block by performing a join operation followed by a duplicate elimination operation. For example, an execution plan derived from a certain duplicate insignificant query block might perform a hash join operation followed by a DISTINCT operation performed in a set of post-processing steps. In most cases, such post-processing (e.g., to process the DISTINCT operation) for duplicate elimination can be expensive. What is needed is a way to perform duplicate elimination operations (e.g., operations that observe the distinct semantics) while performing other database operations (e.g., joins), so as to eliminate or reduce the costs of the post-processing steps.
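The contrast can be sketched in Python (an illustrative sketch, not any particular engine's implementation): the first routine materializes the full join result and then runs a separate duplicate elimination pass, while the second observes the distinct semantics during the hash join itself, so duplicate intermediate rows are never materialized and no post-processing step remains:

```python
def hash_join_then_distinct(build_rows, probe_rows):
    """Hash join followed by a separate DISTINCT post-processing pass."""
    table = {}
    for key, b in build_rows:           # build phase
        table.setdefault(key, []).append(b)
    # Probe phase materializes every matching row, duplicates included.
    joined = [(key, b, p)
              for key, p in probe_rows
              for b in table.get(key, [])]
    # Post-processing step: a second pass to eliminate duplicates.
    seen, result = set(), []
    for row in joined:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def hash_join_distinct_inline(build_rows, probe_rows):
    """Hash join that observes distinct semantics during the join."""
    table = {}
    for key, b in build_rows:           # deduplicate the build side
        table.setdefault(key, set()).add(b)
    seen, result = set(), []
    for key, p in probe_rows:           # probe phase emits each row once
        for b in table.get(key, ()):
            row = (key, b, p)
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result
```

In this sketch the inline variant still tracks a set of emitted rows, but it never materializes the duplicated intermediate join result; a real engine could exploit keys and sort orders to reduce even that bookkeeping.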
What is needed is a technique or techniques to improve over various considered approaches. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.