In the field of data modeling and data analysis, a common problem is determining combinations of parameters whose respective parameter values, taken together, are unique for each individual data record in a database. Determining such unique parameter combinations allows index structures to be derived from the parameter values of the unique parameter combination. The index structures and/or knowledge of the unique combination of parameters make it possible to improve the performance of query optimizers in databases and thus to increase the speed of data retrieval and data analysis. Since one of the most common database types for storing large amounts of data for production or analysis purposes is the relational database management system, in which “parameters” or “attributes” of data objects are represented by columns, the problem of identifying unique combinations of parameters is also known as the problem of identifying multi-column unique key sets in a database.
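The uniqueness property described above can be illustrated with a minimal sketch. The function name `is_unique_combination`, the sample records, and the column names are hypothetical and chosen only for illustration; the sketch assumes the records fit in memory and is not meant as a practical profiling method.

```python
from typing import Hashable, Mapping, Sequence


def is_unique_combination(
    records: Sequence[Mapping[str, Hashable]],
    columns: Sequence[str],
) -> bool:
    """Return True if no two records share the same values for `columns`."""
    seen = set()
    for record in records:
        # Project the record onto the candidate column combination.
        key = tuple(record[column] for column in columns)
        if key in seen:
            return False  # A second record has the same combination of values.
        seen.add(key)
    return True


# Hypothetical sample data: no single name column is unique on its own.
records = [
    {"first_name": "Ann", "last_name": "Lee", "city": "Oslo"},
    {"first_name": "Ann", "last_name": "Ray", "city": "Oslo"},
    {"first_name": "Bob", "last_name": "Lee", "city": "Rome"},
]

first_only = is_unique_combination(records, ["first_name"])
last_only = is_unique_combination(records, ["last_name"])
both_names = is_unique_combination(records, ["first_name", "last_name"])
```

In this sample, neither `first_name` nor `last_name` alone distinguishes the records, whereas the two columns together do, so the pair would be a candidate multi-column unique key set.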
Current data profiling approaches for identifying multi-column unique key sets are too slow to be practically applicable to larger databases. Gunopulos (Gunopulos D., et al.: “Discovering all most specific sentences”. In: Transactions on Database Systems (TODS), Volume 28 Issue 2, 2003) showed that the detection of multi-column composite unique keys is very time-consuming, especially for larger databases, as the number of possible keys increases exponentially with the number of parameters/columns.
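The exponential growth can be made concrete: a table with n columns has 2^n - 1 non-empty column combinations, each of which is a candidate key. The following sketch merely enumerates and counts these candidates. It is a naive illustration, not the method of the cited reference, and the helper names `candidate_combinations` and `count_candidates` are hypothetical.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def candidate_combinations(columns: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty combination of columns, smallest first."""
    for size in range(1, len(columns) + 1):
        yield from combinations(columns, size)


def count_candidates(number_of_columns: int) -> int:
    """Number of non-empty column combinations for a table with n columns."""
    return 2 ** number_of_columns - 1


three_column_candidates = list(candidate_combinations(["a", "b", "c"]))
# Candidate counts for tables with 10, 20 and 40 columns.
growth = {n: count_candidates(n) for n in (10, 20, 40)}
```

A table with 20 columns already yields 1,048,575 candidate combinations, and each candidate would require a pass over the data in an exhaustive check.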
Identifying multi-column composite key sets manually by a data domain expert is also not an option, especially for larger databases. Multi-column composite key sets are a characteristic inherent in the data. This characteristic is not always known to the application developer or database administrator. Therefore, multi-column unique key sets cannot be foreseen without an in-depth data analysis. A manual data evaluation for identifying multi-column composite key sets would take too much time to be a practical option.
A further approach is to enforce multi-column unique key sets by creating a corresponding constraint, e.g., in a relational database. This approach has the disadvantage that not all existing multi-column unique key sets are detected. In addition, such a manually imposed constraint may result in errors if a further data record comprising an already existing combination of parameter values is inserted into the database.
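The error behavior of an enforced constraint can be sketched with Python's built-in `sqlite3` module. The table `person`, its columns, and the function name `insert_duplicate_raises` are hypothetical; other relational database systems report such violations in their own way.

```python
import sqlite3


def insert_duplicate_raises() -> bool:
    """Return True if a duplicate value combination is rejected by the constraint."""
    connection = sqlite3.connect(":memory:")
    try:
        # Manually imposed multi-column unique constraint.
        connection.execute(
            "CREATE TABLE person ("
            "first_name TEXT, last_name TEXT, city TEXT, "
            "UNIQUE (first_name, last_name))"
        )
        connection.execute("INSERT INTO person VALUES ('Ann', 'Lee', 'Oslo')")
        try:
            # Same (first_name, last_name) combination as the existing record.
            connection.execute("INSERT INTO person VALUES ('Ann', 'Lee', 'Rome')")
        except sqlite3.IntegrityError:
            return True
        return False
    finally:
        connection.close()


duplicate_rejected = insert_duplicate_raises()
```

The constraint only guards the one combination that was declared. Other column combinations that happen to be unique in the data remain undetected.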