For performing data integration in large enterprises, database structure and semantic relations between schema elements of a database have to be known and understood. Due to legacy data management in large enterprises, the semantic relations between schema elements, for example, columns of tables in a database are often lost or corrupted. Determining the semantic relations between the schema elements is a time consuming and costly process. When an explicit semantic relation is not available, it becomes time consuming to identify the semantic relation due to the vast size of the database and nature of data stored in the tables.
A significant class of semantic relations between schema elements is primary key-foreign key relationships. A primary key is a column in a table that uniquely identifies each record in the table. A foreign key is an attribute or a column in another table that maps to the primary key to form a relationship between the tables. The relationship between the primary key and the foreign key is referred to as a “primary key-foreign key relationship”. The primary key-foreign key relationship helps in fetching records queried by a user of the database quickly without having to access numerous tables in an enterprise, thereby saving time and effort. The primary key-foreign key relationship allows determination of a data flow in a schema whenever any form of perturbation, insertion, or deletion occurs to any record in the database. However, determining primary key-foreign key relationships in databases of large enterprises is a time intensive, processor intensive, and computation intensive process. Consider an example where a primary key-foreign key relationship between 100 tables with 20 columns each has to be found. Around four million pairs of columns have to be explored to find a possible primary key-foreign key relationship. A brute force method cannot be used to find a possible primary key-foreign key relationship among four million pairs of columns. For a primary key-foreign key relationship to exist between the schema elements, a precondition of inclusion dependency needs to be satisfied. Inclusion dependency is a property of data which, when satisfied, requires every value of one column of a table to exist as a value of another column in a different or the same table. On determining inclusion dependency between the schema elements, the number of pairs of schema elements to be evaluated for primary key-foreign key relationships is significantly reduced, but is still significantly large for a manual determination of primary key-foreign key relationships. Therefore, there is a need for automating the process for determining primary key-foreign key relationships between the schema elements using computation intensive processors.
There are a few conventional methods for determining primary key-foreign key relationships. In one conventional method, metadata from the pairs of columns in tables is used to determine primary key-foreign key relationships. A single feature, for example, a column name is used to determine the primary key-foreign key relationships in this method. With the single feature, that is, the column name, the determination of the primary key-foreign key relationship may not be as accurate as the primary key identified may have repeated values and may not satisfy the property of a primary key. There is a need for a method for determining primary key-foreign key relationships between data in tables by using an intelligent combination of features of the data in the tables and by not relying on a single feature.
In another conventional method for determining primary key-foreign key relationships between data in tables, a pruning criterion and a ranking function are employed. The pruning criterion eliminates pairs of columns in the tables that are unlikely to have a primary key-foreign key relationship. The ranking function ranks the pairs of columns in the tables that satisfy the pruning criterion. The pruning criterion and the ranking function are limited by definition by a programmer. In other conventional methods, primary key-foreign key relationships are found using rule based approaches. One of the rule based approaches is to determine a cumulative probability distribution of the primary key and the foreign key in the pairs of columns in the tables and compute a score for each of the pairs of columns. However, the scope of determination of primary key-foreign key relationships using the rule based approaches is limited by the rules defined for the determination, which is typically not accurate. The rule based approaches also needs a human expert or a programmer to define the rules and execute an algorithm. The computation of a score is a processor intensive and time consuming process. There is a need for a method for determining primary key-foreign key relationships between data based on past observations more accurately while consuming less time. The machine learning based approach is one such approach for determining primary key-foreign key relationships accurately. The machine learning based approach learns inherent property of primary key-foreign key relationships from already classified data and hence is a robust approach for determining primary key-foreign key relationships.
In the machine learning approach, a machine learning classification algorithm learns from patterns of already classified data. The already classified data acts as training data for training the machine learning classification algorithm. If the machine learning classification algorithm is applied for a classification problem, the machine learning classification algorithm is trained on the already classified data that is classified into two categories: positive training data comprising primary key-foreign key pairs and negative training data comprising non-primary key-foreign key pairs. If the machine learning classification algorithm is trained on predominantly negative training data, a trained machine learning model generated from application of the machine learning classification algorithm will be biased towards the negative training data. When subjected to actual data whose primary key-foreign key relationship is to be determined, the trained machine learning model classifies the actual data spuriously. Furthermore, using a single machine learning classification algorithm for determining a primary key-foreign key relationship between schema elements may result in a spurious classification. A combination of multiple machine learning classification algorithms will result in a better determination of the primary key-foreign key relationship between the actual data.
Hence, there is a long felt need for a method and a system that determine primary key-foreign key relationships among schema elements, that is, data in multiple tables through machine learning using computation intensive processors. Moreover, there is a need for a method and a system that intelligently combine the features of data in a table for determining the primary key-foreign key relationship between the actual schema elements. Furthermore, there is a need for a method and a system that split the training data in a manner to avoid skewing of the training data towards negative training data to avoid an erroneous classification of the schema elements by a machine learning classification algorithm. Furthermore, there is a need for a method and a system that intelligently combines multiple machine learning classification algorithms for determining the primary key-foreign key relationship between the actual schema elements.