Big data refers to very large data sets. In big data paradigms, there may be a need to integrate various data assets. The data assets may be structured, semi-structured, or unstructured. A structured data asset may be described as a set of attributes and corresponding values. The data sets may be stored in tables having rows and columns (“attributes”) in a Relational Database Management System (RDBMS). The integration may be done using join, merge, or union operations. To perform these operations across data assets, entity mappings between the data assets need to be obtained. An entity mapping describes which column values should be compared to determine whether the same real-world entity is described in the two data assets. Currently, such column mapping is done manually, which is not suitable for data discovery in big data paradigms.
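As an illustration only, an entity mapping can be pictured as a list of column pairs, one column from each data asset, whose values are compared together when joining. The following minimal Python sketch assumes two small in-memory tables; the table contents, the column names, and the helper `join_on_mapping` are hypothetical and are not taken from any particular system.

```python
# Hypothetical example data; table and column names are illustrative only.
customers = [
    {"first_name": "Ada", "last_name": "Lovelace", "city": "London"},
    {"first_name": "Alan", "last_name": "Turing", "city": "Wilmslow"},
]
orders = [
    {"fname": "Ada", "lname": "Lovelace", "amount": 120},
    {"fname": "Alan", "lname": "Turing", "amount": 75},
    {"fname": "Grace", "lname": "Hopper", "amount": 60},
]

# An entity mapping: pairs of (left column, right column) to be compared.
entity_mapping = [("first_name", "fname"), ("last_name", "lname")]


def join_on_mapping(left, right, mapping):
    """Inner join of two lists of row dicts using a multi-column mapping."""
    # Index the right-hand rows by the tuple of their mapped column values.
    index = {}
    for row in right:
        key = tuple(row[right_col] for _, right_col in mapping)
        index.setdefault(key, []).append(row)

    # Look up each left-hand row by the tuple of its mapped column values.
    joined = []
    for row in left:
        key = tuple(row[left_col] for left_col, _ in mapping)
        for match in index.get(key, []):
            joined.append({**row, **match})
    return joined


result = join_on_mapping(customers, orders, entity_mapping)
```

In this sketch the mapping is supplied by hand, which corresponds to the manual column mapping described above; the row for "Grace Hopper" has no counterpart in the first table and is therefore left out of the join result.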
Some systems identify foreign keys in relational tables. The primary key (PK) to foreign key (FK) relationship may involve a single column or multiple columns. These systems assume that at least one key is a primary key and that the relationship has to be one-to-one (e.g., a 100% match); that is, each foreign key should be a primary key of the other data asset.
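The two assumptions named above can be sketched as two checks: that a candidate key is unique in one table, and that every candidate foreign key appears among those keys. The Python sketch below is a simplified illustration under assumed data, not the method of any specific system; the helpers `is_unique_key` and `containment` and the sample tables are hypothetical.

```python
def is_unique_key(rows, columns):
    """Return True if the given columns uniquely identify every row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


def containment(child_rows, child_cols, parent_rows, parent_cols):
    """Fraction of child rows whose key values appear among the parent keys."""
    parent_keys = {tuple(r[c] for c in parent_cols) for r in parent_rows}
    child_keys = [tuple(r[c] for c in child_cols) for r in child_rows]
    if not child_keys:
        return 0.0
    hits = sum(1 for key in child_keys if key in parent_keys)
    return hits / len(child_keys)


# Hypothetical tables with a two-column (composite) candidate key.
departments = [
    {"dept_id": 10, "site": "NYC"},
    {"dept_id": 10, "site": "SFO"},
    {"dept_id": 20, "site": "NYC"},
]
employees = [
    {"name": "A", "dept": 10, "location": "NYC"},
    {"name": "B", "dept": 10, "location": "SFO"},
    {"name": "C", "dept": 20, "location": "NYC"},
    {"name": "D", "dept": 30, "location": "NYC"},  # no matching department
]

pk_ok = is_unique_key(departments, ["dept_id", "site"])
coverage = containment(
    employees, ["dept", "location"], departments, ["dept_id", "site"]
)

# A strict 100% requirement rejects this pair because one row does not match.
is_strict_fk = pk_ok and coverage == 1.0
```

Under a strict 100% requirement, a single unmatched row is enough to reject the candidate relationship, even though most rows in the two tables do correspond.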
Some systems estimate individual column mappings using semantic similarities. However, these systems are not based on joins between the two data sets.
Some systems identify attribute pairs that may be used to link two tables, but such systems discover only single-attribute mappings.