1. Field of the Invention
The present invention generally relates to identifying mergeable data in a data processing system and, more particularly, to identifying correlated columns from one or more database tables.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems which can be organized in multiple different ways. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
An overall database organization is typically referred to as a schema for the database, such as a hierarchical or relational schema. A database schema is often compactly expressed using table names and names of columns in tables. Accordingly, a database may contain multiple tables, each containing a plurality of columns. The tables can be populated with data from different data sources. For instance, assume a pharmaceutical company having a database with a multiplicity of tables, each having a plurality of columns for storing data collected from different investigators. By way of example, the database includes two tables from two different investigators, i.e., a “Results Allergy A” table from a first investigator and a “Allergy A results” table from a second investigator. The two tables could be expressed as:Results Allergy A (Patient Name, . . . , Address, . . . )Allergy A Results (Patient, Location, . . . )
Both tables store information about an investigation concerning an allergy A which has been performed with respect to a multiplicity of individuals. The “Results Allergy A” table has been created by the first investigator for storing investigation results from the first investigator. Specifically, the “Results Allergy A” table contains personal information for the multiplicity of individuals, such as names and addresses, and additional information relevant to the allergy A and aspects of the investigation in a multiplicity of columns “PatientName”, “Address” etc. Each row of the “Results Allergy A” table is associated with a specific individual. The “Allergy A Results” table has been created by the second investigator for storing investigation results from the second investigator. The “Allergy A Results” table contains personal information for the multiplicity of individuals and additional information relevant to the allergy A and aspects of the investigation in a multiplicity of columns “Patient”, “Location”, etc. Assume now that the pharmaceutical company wants to determine triggering factors for the allergy A. To this end, the pharmaceutical company would analyze all collected data from both tables. To efficiently analyze the collected data, it is desirable to merge both tables into a single table.
However, one difficulty when merging independently created database tables consists in identifying correlated columns in the database tables which are mergeable to create a meaningful merged result. For example, assume that each of the two distinct database tables “Results Allergy A” and “Allergy A results” has 175 columns. By joining these two tables, a new table having 350 columns is obtained. However, in order to determine how these tables can be merged efficiently, a database administrator would need to compare each column of the 175 columns of the “Results Allergy A” table to each column of the 175 columns of the “Allergy A results” table to determine mergeable columns. In the given example, the database administrator may accordingly determine that the “PatientName” and the “Patient” columns both contain names of individual patients and that the “Address” and “Location” columns both contain address information for the patients. Accordingly, the database administrator could merge the columns “PatientName” and “Patient” into a column “Patient Name” and the columns “Address” and “Location” into a column “Patient Address” in order to provide a single resultant table “Allergy A Investigation” which can be expressed as follows:Allergy A Investigation(Patient Name, Patient Address, . . . )
However, even if the database administrator knows the exact meaning of each column in the two distinct database tables “Results Allergy A” and “Allergy A results”, this is a tedious process depending on the number of columns to be compared.
Therefore, there is a need for an efficient technique for identifying correlated columns from one or more database tables.