A column of a database table may have an inclusion dependency with another column of the same database table or another database table. An exact inclusion dependency between column A and column B (expressed as A ⊂ B) exists if each and every value in column A is also in column B. For example, if values in column X are foreign key values that uniquely identify values within column Y, then column X is said to have an inclusion dependency with column Y since each value in column X is in column Y.
On the other hand, an approximate inclusion dependency exists between column A and column B if some, but not all, of the values in column A are also in column B. Current methods for detecting inclusion dependency relationships within data sets are largely directed towards identifying exact inclusion dependency relationships, although identifying approximate inclusion dependency relationships within a data set also yields useful information about the data set.
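The distinction between exact and approximate inclusion dependencies can be illustrated with a minimal sketch. The function names `inclusion_ratio` and `classify` are hypothetical, and the sketch assumes that inclusion is measured over the distinct values of each column, which the text above does not specify.

```python
def inclusion_ratio(col_a, col_b):
    """Fraction of the distinct values in col_a that also appear in col_b."""
    distinct_a = set(col_a)
    if not distinct_a:
        # Assumption: an empty column is treated as trivially included.
        return 1.0
    return len(distinct_a & set(col_b)) / len(distinct_a)


def classify(col_a, col_b):
    """Label the relationship of col_a to col_b as exact, approximate, or none."""
    ratio = inclusion_ratio(col_a, col_b)
    if ratio == 1.0:
        return "exact"        # every value of col_a is in col_b
    if ratio > 0.0:
        return "approximate"  # some, but not all, values of col_a are in col_b
    return "none"


foreign_keys = [1, 2, 2, 3]
primary_keys = [1, 2, 3, 4]
print(classify(foreign_keys, primary_keys))  # exact
print(classify([1, 2, 5], primary_keys))     # approximate
```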
The identification of exact inclusion dependencies and approximate inclusion dependencies that exist between columns of tables stored in a database may be desirable for a variety of reasons. The identification of inclusion dependencies helps a database administrator ensure the quality and consistency of the data stored in the database. Additionally, the identification of inclusion dependencies is a central task in data profiling.
Typically, to identify whether any inclusion dependencies exist between columns of database tables, a join must be performed on every ordered pair of columns in the database tables. This is undesirable, as even a single join is time and resource intensive, let alone the numerous joins that this approach requires. For example, if a first database table had 20 columns and a second database table had 18 columns, then in order to identify any inclusion dependencies, a join must be performed 1406 times (the number of permutations of the 38 columns taken two at a time, P(38,2) = 38 × 37 = 1406), which is very time and resource intensive for database tables with a large number of rows.
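The join count in the example above can be checked by enumerating the ordered column pairs. This is a small sketch only; the column labels such as `t1.c0` are hypothetical placeholders for the 20 and 18 columns of the two tables.

```python
from itertools import permutations


def candidate_pairs(columns):
    """Every ordered pair (A, B) of distinct columns that would need a join."""
    return list(permutations(columns, 2))


# Hypothetical column labels: 20 columns in the first table, 18 in the second.
columns = [f"t1.c{i}" for i in range(20)] + [f"t2.c{i}" for i in range(18)]
pairs = candidate_pairs(columns)
print(len(columns), len(pairs))  # 38 1406
```

The pairs are ordered because an inclusion dependency is directional: A being included in B says nothing about B being included in A.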
Other approaches towards identifying inclusion dependencies involve the use of minimum and maximum values. To illustrate, if one seeks to determine whether column A has an inclusion dependency with column B, and the minimum values and maximum values of column A and column B are known, then the nonexistence of an inclusion dependency between column A and column B may be verified if either the minimum value of column A is lower than the minimum value of column B or if the maximum value of column A is higher than the maximum value of column B. If the nonexistence of an inclusion dependency between column pairs can be identified, then the column pair may be eliminated from the potential set of column pairs to test for an inclusion dependency.
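The minimum/maximum test can be sketched as follows, assuming the per-column statistics are already available as `(min, max)` tuples. The names `may_be_included` and `prune` and the sample statistics are hypothetical. Note that this test only rules out exact inclusion dependencies; a column pair that fails it may still have an approximate inclusion dependency.

```python
def may_be_included(stats_a, stats_b):
    """Return False when min/max values prove that A cannot be exactly included in B."""
    min_a, max_a = stats_a
    min_b, max_b = stats_b
    # If A reaches below B's minimum or above B's maximum, some value of A
    # cannot be in B, so an exact inclusion dependency is impossible.
    return min_a >= min_b and max_a <= max_b


def prune(stats):
    """Keep only the ordered column pairs that the min/max test cannot eliminate."""
    return [
        (a, b)
        for a in stats
        for b in stats
        if a != b and may_be_included(stats[a], stats[b])
    ]


# Hypothetical (min, max) statistics for three columns.
stats = {"A": (1, 10), "B": (0, 20), "C": (5, 30)}
print(prune(stats))  # [('A', 'B')]
```

Only the surviving pairs would then need to be tested with a more expensive method.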
Transitive properties of exact inclusion dependencies state that if X ⊂ Y, and Y ⊂ Z, then X ⊂ Z. Transitive properties may be used to assist the identification of exact inclusion dependencies. However, transitive properties do not work with approximate inclusion dependencies. For example, consider columns A, B, and C, which contain the following values:
        A: {1, . . . , 10}
        B: {1, . . . , 8, 11, . . . , 80}
        C: {17, . . . , 96}
Note that A ⊂ B 80% of the time (i.e., 8 of the 10 values in column A are in column B), and B ⊂ C roughly 80% of the time (64 of the 78 values in column B, or about 82%, are in column C), but A ⊂ C 0% of the time. Clearly, transitive properties do not work for approximate inclusion dependencies.
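The counterexample can be verified by computing the three coverage figures directly. The helper name `coverage` is hypothetical; the sets are the columns A, B, and C given above.

```python
def coverage(x, y):
    """Fraction of the values in x that are also in y."""
    return len(x & y) / len(x)


A = set(range(1, 11))                       # {1, ..., 10}
B = set(range(1, 9)) | set(range(11, 81))   # {1, ..., 8, 11, ..., 80}
C = set(range(17, 97))                      # {17, ..., 96}

print(coverage(A, B))            # 0.8
print(round(coverage(B, C), 2))  # 0.82
print(coverage(A, C))            # 0.0
```

Both intermediate coverages are high, yet A and C share no values at all, so a high coverage from A to B and from B to C implies nothing about the coverage from A to C.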
Unfortunately, in order to determine whether an inclusion dependency exists between columns of tables in a database, a join still must be performed on the database tables. Further, the joins may be performed on database tables storing large string values or other data types that require increased time and resources to process. Moreover, defining indexes on all the columns of the database tables to be joined is impractical, which further impedes the efficient performance of the join.
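For illustration, a join-based check of a single column pair might look like the following sketch, which uses an in-memory SQLite database. The table names, column names, sample rows, and the helper `missing_count` are all hypothetical, and the query is only one of several ways such a check could be written.

```python
import sqlite3


def missing_count(conn, table_a, col_a, table_b, col_b):
    """Count the distinct non-null values of table_a.col_a absent from table_b.col_b."""
    # Identifiers are interpolated directly, so they must come from a trusted source.
    query = (
        f"SELECT COUNT(DISTINCT a.{col_a}) "
        f"FROM {table_a} AS a "
        f"LEFT JOIN {table_b} AS b ON a.{col_a} = b.{col_b} "
        f"WHERE b.{col_b} IS NULL AND a.{col_a} IS NOT NULL"
    )
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (2,), (7,)])
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])

# A result of 0 would indicate an exact inclusion dependency.
missing = missing_count(conn, "orders", "customer_id", "customers", "id")
print(missing)  # 1
```

Each call scans and joins both tables, which is why repeating it for every candidate column pair is costly.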
Consequently, there is a need in the art to discover approximate and exact unary inclusion dependencies without incurring the disadvantages of the approaches described above. The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.