The present disclosure relates generally to data preparation and analysis. More particularly, techniques are disclosed for analysis, generation, and visualization of data obtained from multiple data sources.
Before “big data” systems can analyze data to provide useful results, the data needs to be added to the big data system and formatted such that it can be analyzed. This data onboarding presents a challenge for current cloud and “big data” systems. Typically, data being added to a big data system is noisy (e.g., the data is formatted incorrectly, erroneous, outdated, includes duplicates, etc.). When the data is analyzed (e.g., for reporting, predictive modeling, etc.), the poor signal-to-noise ratio of the data means the results are not useful. As a result, current solutions require substantial manual processes to clean and curate the data and/or the analyzed results. However, these manual processes cannot scale. As the amount of data being added and analyzed increases, the manual processes become impossible to implement.
The rapid proliferation of data from a variety of sources (internal and external, unstructured and structured, traditional and new data types) presents an enormous opportunity for businesses to gain valuable insights that can help them make improved and timely decisions that will win, serve, and retain customers. A key part of preparing these data sources for analysis is the ability to merge or join (e.g., blend) two or more datasets originating from different sources into a single file ready to be used for further processing, such as by a big data analytics system. The heterogeneity and size of the datasets make it tremendously challenging to find similarities in the data, such as columns, that could be used as a basis for merging these datasets.
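By way of illustration only, the following minimal sketch shows one simple way that columns from two datasets might be compared to find a candidate basis for merging. It assumes each dataset is already available as a mapping from column name to a list of values, and it uses Jaccard similarity over distinct values as the comparison measure. The function names, the 0.5 threshold, and the sample data are hypothetical and are not taken from the present disclosure.

```python
def jaccard(a, b):
    """Jaccard similarity of the distinct values in two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_join_columns(left, right, threshold=0.5):
    """Return (left_column, right_column, score) tuples whose value
    overlap meets the threshold, best match first.

    left, right: dict mapping column name -> list of values
    (a hypothetical in-memory representation chosen for this sketch).
    """
    pairs = []
    for lname, lvals in left.items():
        for rname, rvals in right.items():
            score = jaccard(lvals, rvals)
            if score >= threshold:
                pairs.append((lname, rname, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical sample data: an enterprise table and a web log extract.
crm = {
    "customer_id": [101, 102, 103, 104],
    "region": ["east", "west", "east", "north"],
}
weblog = {
    "uid": [102, 103, 104, 105],
    "page": ["/home", "/cart", "/home", "/help"],
}

matches = candidate_join_columns(crm, weblog)
```

In this example, only the `customer_id` and `uid` columns share enough values to be proposed as a join key. A pairwise comparison of this kind grows with the product of the column counts and with the size of each column, which is one reason the approach becomes difficult for large, heterogeneous datasets.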
Data from disparate sources may include different types of data having different formats. For example, data from an enterprise may be different from data from clickstream or error logs, or structured data from social media sources. Users may desire to use data from multiple sources to build a data lake, to perform downstream processing for applications, and to perform ETL (extract, transform, and load) processing. The data from web logs and social media may be completely unrelated to data about transactions for users. A significant amount of time, money, and computing resources may be spent to utilize data from various sources to provide enrichment of the data.
The heterogeneity and size of the data being compared introduce an additional challenge: finding columns in the data that could be used for merging (e.g., joining or blending) the data. In addition to the difficulty of identifying columns for potential merging or joining, large datasets are hard to visualize and, therefore, difficult to combine. Some have tried to provide a visualization to show the options and results for combining data that can be merged or joined. However, visualizations may not be adequate to assist users in understanding the efficacy of options for merging data, or to help users request a better set of results for merging data. Data in a particular format, such as a tabular format, may aid in the identification and presentation of options for merging data. Often, large datasets from different sources may have little or no formatting, such as a columnar format. For example, the datasets may not be obtained from a database, which provides data in a structured format. The data may be obtained from different, often unrelated sources, such that merging the data becomes even more challenging. These datasets introduce greater challenges for merging. Certain embodiments of the present disclosure address these and other problems.