The disclosure relates to preprocessing of data for purposes of big data analysis in general, and more specifically to sampling of data for use in developing transforms for preprocessing big data.
Organizations generate large amounts of data as a result of system activity within the organization, interactions with external systems, data generated by sensors, manual entry of data, and so on. This data is often unstructured or semi-structured and stored in a wide variety of formats. Organizations perform data mining operations on this data to extract various types of information, for example, information indicating the health of various parts of the organization, information for predicting the performance of various projects within the organization, information for determining the allocation of resources within the organization, and so on. Big data systems are being developed to process the huge amounts of data being generated by organizations. These big data systems use parallel and distributed systems to process the data. Big data systems need the data to be available in a specific format to be able to exploit the parallelism inherent in the data.
However, the raw data generated by various systems within the organization is often not in a format that can be readily processed by big data systems. This is so because the raw data often contains missing fields, data anomalies, erroneous values, duplicate values, data that does not conform to certain type constraints, and so on. The amount of data that is in a format in which it can be processed by big data systems is typically a small fraction of the overall data available. The quality of results obtained by analyzing the data is therefore limited by the amount of data that the big data system can process. The amount of data that can be processed by the big data systems can be increased by preprocessing the raw data, that is, by transforming the data to a form that can be efficiently processed by the big data systems.
Preprocessing of data requires performing transformations on the data to bring the data to a desired form. Automatic transformation of data requires generation of scripts for performing the transformations. The development of the transformation scripts requires working with sample data, since the entire dataset is typically very large. The quality of a transformation script depends on the quality of the sample dataset used to develop the transformation script. For example, if the full dataset includes anomalies that are absent in the sample dataset, the transformation script is unlikely to be able to process data having these anomalies. Conventional sampling techniques include random sampling and reading the first few rows of a dataset. These sampling techniques are very likely to provide data that does not exhibit the features of the entire dataset that are necessary to test the transformation scripts. For example, if the transformation script includes a join operation and the samples of the input datasets do not include rows that can be joined based on the join criteria of the join operation, the samples are not helpful for testing the join operation. Therefore, conventional sampling techniques are inadequate and do not provide a sample set that can be used for generating a robust transformation script.
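The join example above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions and is not part of the disclosed system: the datasets `orders` and `customers`, the join key `customer_id`, and the helper names `joinable_rows`, `head_sample`, and `random_sample` are all hypothetical. The two datasets are constructed so that only ten key values overlap, and the sketch counts how many sampled rows could participate in the join.

```python
import random

def joinable_rows(left, right, key):
    """Return the rows of `left` whose key value also appears in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]

def head_sample(rows, n):
    """Conventional sampling: read the first n rows."""
    return rows[:n]

def random_sample(rows, n, seed):
    """Conventional sampling: draw n rows uniformly at random."""
    return random.Random(seed).sample(rows, n)

# Hypothetical datasets: only customer_id values 9990..9999 appear in both.
orders = [{"customer_id": i, "amount": 10 * i} for i in range(10_000)]
customers = [{"customer_id": i, "name": f"c{i}"} for i in range(9_990, 20_000)]

# Rows of the full `orders` dataset that satisfy the join criteria.
full_matches = joinable_rows(orders, customers, "customer_id")
print("joinable rows, full datasets:", len(full_matches))  # 10

# First 100 rows of each dataset: the key ranges do not overlap at all.
head_matches = joinable_rows(
    head_sample(orders, 100), head_sample(customers, 100), "customer_id"
)
print("joinable rows, head samples:", len(head_matches))  # 0

# Independent random samples: a match requires the same rare key to be
# drawn on both sides, so the count is usually zero or very small.
random_matches = joinable_rows(
    random_sample(orders, 100, seed=1),
    random_sample(customers, 100, seed=2),
    "customer_id",
)
print("joinable rows, random samples:", len(random_matches))
```

In this sketch the head samples contain no joinable rows, so a join developed against them would produce an empty result and could not be meaningfully tested, which is the shortcoming described above.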