The present invention relates to systems of solving a set-level query in a database, and more specifically, to a system for efficiently computing a set comparison within a predicate associated with a GROUP BY using a bit-map index, and generalizations thereby to more complex queries.
Simple query language (SQL) supports the selection of tuples using conditions on each tuple (i.e., the predicates in WHERE), and a group of tuples (i.e., the predicates in HAVING). The comparison of conditions in a predicate, even for a group of tuples, is performed at a scalar level. In other words, for each comparison, only one value extracted or aggregated from one column is compared to another. Such scalar level predicates become increasingly inadequate to support a new class of operations that require set-level comparison semantics (i.e., comparing a group of tuples with multiple values). Complex SQL queries composed by scalar-level operations are often formed to obtain very simple set-level semantics. Such queries are not only difficult to write, but also challenging for a database engine to optimize, and may therefore result in costly evaluation.
Conventional methods include scanning a bitmap index of the grouping attributes in a hierarchical manner. During this scanning, the bitmap index of the set predicate attributes is used to help pruning and boost the performance of a query. More specifically, the scanning first checks the first bitmap vector of the grouping attribute, and then the second bitmap vector, and so on, through the last bitmap vector. When there are n bitmap vectors for the grouping attributes, the scanning is essentially a complete tree traversal with depth n. The set predicate condition, however, can be used to prune most tree paths, and thus can have certain efficiency benefits in practice. This is true, however, only when the set predicate condition is very selective (i.e., only a small percentage of groups satisfy the condition).
There is a high demand for querying data with the semantics of set-level comparisons. By way of example, a company may want to search a resume database for candidates having a set of mandatory skills; an executive may want to find a department whose monthly average customer service ratings in a year have been good and bad (i.e., the rating subsumes {excellent, poor}); or an online advertising marketing strategist may want to find websites that publish ads for ING and Emigrant, but not HSBC.
In a typical database design that does not include explicit support of set-level attributes, set-level attributes are often modeled as a separate table. For example, a table Resume_Skills (id, skill) can be used to connect a set of skills to job candidates, who are identified by id, in another table. Here, a GROUP BY clause may be used to dynamically group tuples by id, with the values on attribute skill in the same group forming a set. Current implementations of a GROUP BY clause can only perform scalar value comparison by an accompanying HAVING clause. By way of further example, SUM, COUNT, AVG, and MAX functions will produce a single numeric value that is compared to a constant or other single aggregate value. Therefore, a set-level comparison such as the set of skills subsuming {‘Java programming’, Web services’} cannot be accommodated without complex SQL queries.
Set-valued attributes provide, however, a concise and natural way to model complex data concepts. Database management systems (DBMS) support attributes involving a set of values (e.g., nested table in Oracle, and SET data type in MySQL). In the above example, the “skill” attribute can be defined as a set data type. Set processing may be performed on a set-valued attribute and a set containment join, which is a join operation between the set-valued attributes of two relations, where the join condition is specified by set containment relationship. Set-value attributes together with set containment joins may be used to support set-level comparisons, but doing so has inherent limitations: 1) set-value attributes must be pre-defined at the schema definition time, and 2) a set can only cover one attribute and cannot be across multiple attributes. Therefore, users can not form dynamic set-level comparisons based on their query needs due to limitations caused by the database schema, and set-level comparisons are only possible on set-valued attributes.
For real-world decision making, groups and corresponding sets are often dynamically formed according to the query needs. In the above customer service example, sets are formed by departments (i.e., the monthly ratings of each department form a set). In a different query, the sets may be formed by individual employees. It is self-evident that pre-defined set-valued attributes cannot support such dynamic set formation.
Additionally, a definition of a set may be across multiple attributes. In the online advertisement example above, a strategist may want to discover the websites that publish ads for ING with good profit returns, because those are proven to be profitable with bank ads. In this case, the set is defined over both the advertiser and the profit attribute. Many systems only provide a set-valued attribute defined on a single attribute, and therefore cannot capture the cross-attribute association in the above example. Nested tables in Oracle allow sets over multiple attributes, but do not easily support set-level comparisons on such attributes.
As can be seen there is a need for an improved system and method of performing set-level comparison operations.