The disclosure relates in general to preprocessing data for analysis by big data analysis systems, for example, parallel and distributed systems, and more specifically to determining transformations for preprocessing datasets with nested data structures.
Organizations generate large amounts of data from different sources, for example, data generated by systems within the organization, data generated by interactions with external systems, data generated by sensors, data entered manually, and so on. This data may include structured data as well as unstructured and/or semi-structured data that is stored in a wide variety of formats. Organizations perform data mining operations on this data to extract various types of information, for example, information indicating the health of various parts of the organization, information predicting the performance of various projects within the organization, information describing the allocation of resources within the organization, and so on. Big data analysis systems are being developed to process the huge amount of data being generated by organizations. These big data analysis systems typically use parallel and distributed systems to process the data. Big data analysis systems typically need the data to be available in a specific format to be able to analyze that data and exploit the parallelism inherent in the data.
However, the raw data generated by various systems within the organization is often not in a format that can be readily processed by big data systems. This is so because the raw data often contains missing fields, data anomalies, erroneous values, duplicate values, nested structures that cannot be processed by the big data analysis system, data that does not conform to certain type constraints, and so on. The amount of data that is in a format in which it can be processed by big data systems is typically a small fraction of the overall data available. The quality of results obtained by analyzing the data is therefore limited by the amount of data that the big data system can process. The amount of data that can be processed by the big data systems can be increased by preprocessing the raw data, that is, by transforming the data into a form that can be efficiently processed by the big data systems. Preprocessing of data requires performing transformations on the data to bring the data to a desired form. Automatic transformation of data requires generation of scripts for performing the transformations.
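As an illustration of the kind of preprocessing described above, the following is a minimal sketch in Python, not a method taken from the disclosure. It assumes records are dictionaries and that every required field must hold a numeric value; the function name preprocess and the parameter required_fields are hypothetical. It drops records with missing fields, records that violate the numeric type constraint, and duplicate records.

```python
def preprocess(records, required_fields):
    """Return cleaned copies of records (hypothetical helper, illustration only).

    Assumption: every required field must be present and convertible to float.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Missing fields: skip records where a required field is absent or None.
        if any(record.get(field) is None for field in required_fields):
            continue
        # Type constraint: skip records whose values are not numeric.
        try:
            row = {field: float(record[field]) for field in required_fields}
        except (TypeError, ValueError):
            continue
        # Duplicate values: keep only the first occurrence of each row.
        key = tuple(row[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Dropping a record is only one possible policy; a transformation script could instead fill in default values or route rejected records to a separate output for inspection.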
Datasets often include nested structures having arbitrary sizes and arbitrary levels of nesting of data. Generating transformations for such datasets is often difficult since a user has to understand the structure of a complex dataset to be able to specify transformations for it. These datasets may be obtained from sources that either do not provide documentation of the structure or provide inaccurate or outdated documentation. Trying to develop transformations for such datasets without proper documentation describing their structure can be a tedious and error-prone task. Conventional techniques for developing transformation scripts are inadequate for handling nested data structures in datasets.
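To make the difficulty with nested structures concrete, the following is a minimal sketch in Python of one common transformation, flattening a nested record into a single level of path-named columns. It is an illustrative assumption, not the technique of the disclosure; the function name flatten and the dotted and bracketed path convention are hypothetical choices.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by path strings.

    Hypothetical helper for illustration: dict keys are joined with "." and
    list positions are written as "[index]".
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Leaf value: record it under the path accumulated so far.
        flat[prefix] = value
    return flat


record = {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}}
columns = flatten(record)
# columns has the keys "id", "user.name", "user.tags[0]" and "user.tags[1]"
```

Even this simple sketch shows why undocumented structure is a problem: the set of resulting columns depends on how deep each record nests and how long each list is, so two records from the same dataset can flatten to different columns. Empty dictionaries and lists produce no columns at all in this version.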