Electronic storage mechanisms have enabled the accumulation of massive amounts of data. For instance, data that previously required volumes of books for recordation can now be stored electronically without the expense of printing on paper and in a fraction of the physical space needed to store paper. In one particular example, deeds and mortgages that were previously recorded in paper volumes can now be stored electronically. Moreover, advances in sensors and other electronic mechanisms now allow massive amounts of data to be collected and stored. For instance, GPS systems can determine the location of an individual or entity by way of satellites and GPS receivers, and electronic storage devices connected thereto can then be employed to retain the locations determined by such systems. Various other sensors and data collection devices can also be utilized to obtain and store data.
Database systems are often employed for storage and organization of data, wherein such databases can be queried by users to retrieve desired data. Poor data quality, including inconsistencies in data as well as missing values, is a well-recognized problem in database applications. Tasks performed over such data can be expensive and may run slowly due to poor data quality. In an exemplary database system, relational databases include index keys that are employed in connection with searching for data and analyzing data. For instance, a column can include social security numbers, and such column can be labeled as a key because each social security number is unique. Thus, if a user were to search for an individual based upon social security number, the user would be able to quickly locate desired information. Due to improper user input, corruption of a database, and the like, columns can become poorly suited for utilization as keys. For example, due to user input error a social security number can be repeated several times throughout a column within a database. Accordingly, if the database is searched using such a column and social security number, the user will be provided with a plurality of returned records and will then have to determine which record is relevant. If duplication problems such as that described above are commonplace within the column, then such column may not be desirable to utilize as a key.
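The duplicate-key problem described above can be illustrated with a minimal sketch. The records, the column name `ssn`, and the helper `duplicated_values` are hypothetical and chosen only for illustration; they do not correspond to any particular database system.

```python
from collections import Counter

def duplicated_values(records, column):
    """Return {value: count} for every value that appears more than once in `column`."""
    counts = Counter(record[column] for record in records)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical records; the repeated number stands in for a data-entry error.
records = [
    {"ssn": "111-11-1111", "name": "Alice"},
    {"ssn": "222-22-2222", "name": "Bob"},
    {"ssn": "111-11-1111", "name": "Carol"},
]

dups = duplicated_values(records, "ssn")
print(dups)
```

A lookup on the repeated value would return two records rather than one, which is the ambiguity that makes such a column less suitable as a key.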
In another example, a hospital can maintain a database that contains information relating to patients who have been provided services. Patients whose names are difficult to discover are often entered into a database under a common placeholder name, such as “John Doe” or “Jane Doe.” Thus, searching the name column for “John Doe” may prove fruitless, as hundreds if not thousands of records can be returned to a user. Outside of such default values, however, the name column may be quite useful as a key column. More particularly, the names “John Doe” and “Jane Doe” may be the only duplicated names within the column, in which case the column remains highly useful as a key column. Estimating the strength of such columns as key columns, however, has proven to be a difficult task. For instance, some conventional systems analyze each record within the database in connection with determining the strength of a column as a key column. Such exhaustive analysis, however, requires substantial time and resources, particularly as database systems rapidly increase in size. Other conventional estimation systems are inadequate, producing error that is too high to provide meaningful data relating to the strength of functional relationships in data.
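As a rough illustration of the exhaustive approach described above, the following sketch scores a column by the fraction of its values that are distinct, optionally ignoring known default values. The function `key_strength`, its `ignore` parameter, the scoring formula, and the sample names are all assumptions made for illustration; they are not taken from any conventional system.

```python
def key_strength(values, ignore=()):
    """Fraction of distinct values among those not listed in `ignore`.

    A score of 1.0 means every remaining value is unique. This reads every
    value, so its cost grows with the size of the column.
    """
    kept = [v for v in values if v not in ignore]
    if not kept:
        return 0.0
    return len(set(kept)) / len(kept)

# Hypothetical name column: six placeholder entries and four real names.
names = ["John Doe"] * 6 + ["Ann Lee", "Raj Patel", "Mei Chen", "Omar Haddad"]

raw = key_strength(names)                            # 5 distinct of 10 values
adjusted = key_strength(names, ignore={"John Doe"})  # 4 distinct of 4 values
print(raw, adjusted)
```

The column scores poorly when the placeholder entries are counted and perfectly when they are set aside, which mirrors the observation that a column can be a strong key apart from a few default values.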