The disclosure relates in general to preprocessing data for analysis by big data analysis systems, such as parallel and distributed systems, and more specifically to analyzing data to generate a data profile of datasets.
Big data analysis systems process the huge amount of data generated by organizations. These big data analysis systems typically use parallel and distributed architectures to process the data. The raw data that is analyzed may be data representing transactions performed, data received from external systems, data generated by sensors, data entered manually by users, and so on. This data often includes structured data as well as unstructured and/or semi-structured data stored in a wide variety of formats. Big data analysis systems typically need the data to be available in a specific format to be able to analyze that data and exploit the parallelism inherent in the data.
However, the quality of raw data received for analysis by a big data analysis system is often poor. In particular, raw data generated by disparate sources is often not in a format that can be readily processed by big data systems. Such raw data often contains missing fields, data anomalies, erroneous values, duplicate values, nested structures that cannot be processed by the big data analysis system, data that does not conform to certain type constraints, and so on. The data that is in a proper format for processing by big data systems is often only a fraction of the overall data available. Consequently, the quality of results obtained by analyzing the data is limited by the amount of data that the big data system can process.
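By way of illustration only, the kinds of quality problems listed above can be counted with a short routine. The following Python sketch is not the disclosed system; the function name `profile_records`, the `schema` mapping, and the sample records are hypothetical and assume that records arrive as dictionaries.

```python
from collections import Counter


def profile_records(records, schema):
    """Count common quality problems in a list of dict records.

    `schema` is a hypothetical mapping of field name -> expected Python type.
    """
    profile = {
        "total": len(records),
        "missing": Counter(),        # field -> records with no value
        "type_mismatch": Counter(),  # field -> values of an unexpected type
        "nested": Counter(),         # field -> values that are dicts or lists
        "duplicates": 0,             # records identical to an earlier record
    }
    seen = set()
    for record in records:
        for field, expected in schema.items():
            value = record.get(field)
            if value is None or value == "":
                profile["missing"][field] += 1
            elif isinstance(value, (dict, list)):
                profile["nested"][field] += 1
            elif not isinstance(value, expected):
                profile["type_mismatch"][field] += 1
        # Use a canonical string form of the record to detect exact duplicates.
        key = repr(sorted(record.items(), key=lambda kv: kv[0]))
        if key in seen:
            profile["duplicates"] += 1
        else:
            seen.add(key)
    return profile


# Hypothetical sample data exhibiting each problem once.
sample_records = [
    {"id": 1, "amount": 9.5},
    {"id": 2, "amount": "n/a"},           # erroneous value (wrong type)
    {"id": 1, "amount": 9.5},             # duplicate of the first record
    {"id": 3},                            # missing field
    {"id": 4, "amount": {"value": 2.0}},  # nested structure
]
sample_profile = profile_records(sample_records, {"id": int, "amount": float})
```

In this sketch only the first record is usable as received, which mirrors the observation that properly formatted data is often a fraction of the data available.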
The amount of data that can be processed by the big data systems can be increased by preprocessing the raw data, that is, by transforming the data into a form that the big data systems can process efficiently. Preprocessing requires analyzing the data and performing transformations on the data to bring the data to a desired form. Analyzing raw data is often a tedious and time-consuming process that requires experts who can analyze the data and developers who can write scripts to transform it. As a result, preprocessing data in preparation for analysis is often an expensive process.
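As a hedged example of the kind of hand-written transformation script referred to above, the following Python sketch flattens nested structures, coerces values to expected types, and drops duplicate rows. The helper names `flatten` and `preprocess`, the dotted field naming, and the sample data are assumptions made for illustration, not part of the disclosure.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into top-level fields with dotted names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def preprocess(records, schema):
    """Flatten, type-coerce, and de-duplicate records.

    `schema` is a hypothetical mapping of flattened field name -> cast function.
    Values that are missing or cannot be cast are stored as None.
    """
    cleaned, seen = [], set()
    for record in records:
        flat = flatten(record)
        row = {}
        for field, cast in schema.items():
            raw = flat.get(field)
            try:
                row[field] = cast(raw) if raw not in (None, "") else None
            except (TypeError, ValueError):
                row[field] = None
        key = tuple(row[field] for field in schema)
        if key not in seen:  # keep only the first occurrence of each row
            seen.add(key)
            cleaned.append(row)
    return cleaned


# Hypothetical raw input: mixed types, a nested structure, and a bad value.
raw_records = [
    {"id": "1", "amount": {"value": "9.5"}},
    {"id": 1, "amount": {"value": 9.5}},     # same row once types are coerced
    {"id": "2", "amount": {"value": "n/a"}},
]
cleaned_records = preprocess(raw_records, {"id": int, "amount.value": float})
```

Even this small script encodes decisions about field names, types, and error handling that must be discovered by inspecting the data first, which is the source of the expense noted above.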
The steps of processes illustrated as flowcharts described herein can be executed in an order different from that described herein. Furthermore, actions described as being executed by certain software modules may be executed by software modules other than those indicated herein.
Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.