The present disclosure relates generally to data preparation and analysis. More particularly, techniques are disclosed for analysis, generation, and visualization of data obtained from multiple data sources.
Before “big data” systems can analyze data to provide useful results, the data needs to be added to the big data system and formatted such that it can be analyzed. This data onboarding presents a challenge for current cloud and “big data” systems. Typically, data being added to a big data system is noisy (e.g., the data is incorrectly formatted, erroneous, or outdated, or includes duplicates). When the data is analyzed (e.g., for reporting or predictive modeling), the poor signal-to-noise ratio of the data means the results are not useful. As a result, current solutions require substantial manual processes to clean and curate the data and/or the analyzed results. However, these manual processes cannot scale. As the amount of data being added and analyzed increases, the manual processes become impossible to implement.
The rapid proliferation of data from a variety of sources (internal and external, unstructured and structured, traditional and new data types) presents an enormous opportunity for businesses to gain valuable insights that can help them make improved and timely decisions that will win, serve, and retain customers. A key part of preparing these data sources for analysis is the ability to determine a degree of similarity between two or more datasets originating from different sources, so that the datasets can be merged into a single file ready to be used for further processing, such as by a big data analytics system. The heterogeneity and size of the datasets introduce tremendous challenges in finding similarities in the data, such as similar columns, which could be used as a basis for merging these datasets.
Data from disparate sources may include different types of data having different formats. For example, data from an enterprise may be different from data from click stream or error logs, or structured data from social media sources. Users may desire to use data from multiple sources to build a data lake, to perform downstream processing for applications, and to perform ETL (extract, transform, and load) processing. The data from web logs and social media may be completely unrelated to data about transactions for users. A significant amount of time, money, and computing resources may be spent on using data from various sources to enrich the data.
Data from multiple different data sources may have diverse forms. Using traditional methods of processing the data to determine similarity may be expensive in terms of computing resources. At the scale of extremely large datasets, such as big data, computing systems are unable to scale to handle the processing load required to determine similarities among the datasets.
Certain embodiments of the present disclosure address these and other problems.