Many applications need to compare two or more data sources and identify similarities between their contents. For example, consider an organization with multiple branches spread across the globe. The organization may maintain a global database containing information on the various products and services that it offers and/or manages. However, when the organization collects data from its branches, the data may arrive in heterogeneous formats, because each branch may customize its data according to local standards and/or requirements that help it manage activities effectively in its locality.
The inventors here have recognized several technical problems with conventional systems, as explained below. If the organization intends to collect data from different branches and analyze it, the heterogeneous format of the data becomes a hurdle for the analysis. Existing systems that facilitate heterogeneous data processing and analysis rely on techniques based on textual similarity features, which are unsupervised. Because the mechanism used is unsupervised, the quality of the outputs is affected.
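As an illustration only, the following minimal sketch shows one common form of unsupervised textual similarity matching, token-level Jaccard similarity, applied to field names from two schemas. It is not taken from any particular existing system; the field names, the choice of Jaccard similarity, and the helper functions `tokens`, `jaccard`, and `best_match` are all assumptions made for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(field: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest similarity to `field`."""
    return max(((c, jaccard(field, c)) for c in candidates), key=lambda p: p[1])


# Hypothetical column names: a global schema and one branch's local schema.
global_fields = ["product_name", "unit_price", "customer_id"]
branch_fields = ["name_of_product", "price_per_unit", "client_number"]

for g in global_fields:
    match, score = best_match(g, branch_fields)
    print(f"{g!r} -> {match!r} (score {score:.2f})")
```

In this sketch, "customer_id" and "client_number" share no tokens, so their similarity is 0.0 even though they plausibly denote the same attribute. A purely lexical measure of this kind has no labeled examples from which to learn such correspondences, which is one way the unsupervised nature of the mechanism can affect output quality.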