The disclosure relates in general to preprocessing data for analysis by big data analysis systems, for example, parallel and distributed systems, and more specifically to analyzing transformations for developing transformation scripts for preprocessing data.
Organizations generate large amounts of data during their normal operations. Data may be generated by systems as a result of transactions performed within the organization, as a result of interactions with external systems, by sensors, by manual entry of data, and so on. This data often includes structured data as well as unstructured and/or semi-structured data stored in a wide variety of formats.
Organizations perform data mining operations on the data generated to extract different types of information. This includes information indicating health of various components of the organization, information predicting performance of various projects within the organization, information describing allocation of resources within the organization, and so on. Big data analysis systems process the huge amount of data being generated by organizations. These big data analysis systems typically use parallel and distributed systems to process the data. Big data analysis systems typically need the data to be available in a specific format to be able to analyze that data and exploit the parallelism inherent in the data.
However, the quality of raw data generated by various systems within the organization is often poor. In other words, raw data generated by the disparate sources within the organization is often not in a format that can be readily processed by big data systems. Such raw data often contains missing fields, data anomalies, erroneous values, duplicate values, nested structures that cannot be processed by the big data analysis system, data that does not conform to certain type constraints, and so on. The amount of data that is in a proper format that can be processed by big data systems is often a fraction of the overall data available. The quality of results obtained by analyzing the data is limited by the amount of data that the big data system can process.
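By way of illustration only, the following minimal Python sketch shows how the defects listed above (missing fields, nested structures, type violations, and duplicate values) might be detected, and how the usable fraction of a data set might be estimated. The schema, the field names, and the functions `find_defects` and `usable_fraction` are hypothetical and are not part of the disclosure.

```python
from typing import Any, Dict, List, Set

# Hypothetical schema (an assumption for this sketch): field name -> expected type.
EXPECTED_SCHEMA = {"id": int, "amount": float, "region": str}


def find_defects(record: Dict[str, Any], seen_ids: Set[int]) -> List[str]:
    """Return a list of defect labels for one raw record (empty if clean)."""
    defects = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        value = record.get(field)
        if value is None:
            defects.append(f"missing:{field}")   # missing field
        elif isinstance(value, (dict, list)):
            defects.append(f"nested:{field}")    # nested structure
        elif not isinstance(value, expected_type):
            defects.append(f"type:{field}")      # type constraint violated
    record_id = record.get("id")
    if isinstance(record_id, int):
        if record_id in seen_ids:
            defects.append("duplicate:id")       # duplicate value
        seen_ids.add(record_id)
    return defects


def usable_fraction(records: List[Dict[str, Any]]) -> float:
    """Fraction of records that have no detected defect."""
    if not records:
        return 0.0
    seen_ids: Set[int] = set()
    usable = sum(1 for r in records if not find_defects(r, seen_ids))
    return usable / len(records)
```

In such a sketch, a low value returned by `usable_fraction` corresponds to the situation described above, in which only a fraction of the available data can be processed without preprocessing.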
The amount of data that can be processed by the big data systems can be increased by preprocessing the raw data, that is, by transforming the data to a form that can be efficiently processed by the big data systems. Preprocessing of data requires applying transformations to the data to bring the data to a desired form. Automatic transformation of data requires generation of scripts for performing the transformations. Developing these transformation scripts is often a tedious and time-consuming process that requires experts who can analyze the data and developers who can write the scripts. As a result, cleaning data generated by organizations is often an expensive process.
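For illustration, a transformation script of the kind discussed above may be viewed as an ordered sequence of transformations applied to the raw records. The following Python sketch is one possible example under stated assumptions: the record fields (`id`, `amount`, `region`) and the functions `flatten_nested`, `cast_amount`, `drop_duplicates`, and `run_script` are hypothetical names chosen for this example and do not represent any particular system.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Transform = Callable[[List[Record]], List[Record]]


def flatten_nested(records: List[Record]) -> List[Record]:
    """Promote keys of one-level nested dicts to 'parent_child' fields."""
    out = []
    for r in records:
        flat: Record = {}
        for key, value in r.items():
            if isinstance(value, dict):
                for child_key, child_value in value.items():
                    flat[f"{key}_{child_key}"] = child_value
            else:
                flat[key] = value
        out.append(flat)
    return out


def cast_amount(records: List[Record]) -> List[Record]:
    """Coerce 'amount' to float; drop records where that is not possible."""
    out = []
    for r in records:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # missing or erroneous value
    return out


def drop_duplicates(records: List[Record]) -> List[Record]:
    """Keep only the first record seen for each 'id'."""
    seen = set()
    out = []
    for r in records:
        if r.get("id") in seen:
            continue
        seen.add(r.get("id"))
        out.append(r)
    return out


def run_script(records: List[Record], steps: List[Transform]) -> List[Record]:
    """Apply each transformation in order, as a script would."""
    for step in steps:
        records = step(records)
    return records
```

A script in this sketch is simply the list `[flatten_nested, cast_amount, drop_duplicates]` passed to `run_script`. Even for this small example, choosing the steps and their order requires inspecting the data, which is the manual effort that makes script development expensive.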