Profiling a large dataset is difficult and time-consuming. Existing profiling tools either take a very long time (many hours or days) or fail to generate extensive statistical metrics on multi-terabyte tables.
Specifically, data profiling tools that generate column statistics are currently available in relational database systems such as DB2 and Oracle. In the big data space, databases such as Hadoop-based Hive do not maintain many of these statistics upfront. A user must either build a custom solution to obtain the data statistics or use one of the commercial profiling tools in the marketplace. Almost all existing big data profiling tools use the traditional MapReduce approach to profile a large dataset from a Hadoop system, either directly through a MapReduce process or indirectly through a Hive or Pig query process. The MapReduce approach suffers from performance problems, which are especially severe for computation-intensive metrics such as histograms and topN values. It either takes a very long time (hours or days) to complete or fails outright when profiling a multi-terabyte dataset with billions of rows, thousands of columns, and trillions of values.
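To make the kinds of column metrics mentioned above concrete, the following is a minimal in-memory sketch of profiling a single numeric column. It is an illustration only and not the method of any particular tool: the function name `profile_column`, its parameters `top_n` and `bins`, and the sample data are all hypothetical, and the code assumes the whole column fits in memory, which is exactly what does not hold for multi-terabyte tables.

```python
from collections import Counter


def profile_column(values, top_n=2, bins=4):
    """Compute a few illustrative profiling metrics for one numeric column.

    Hypothetical helper for illustration; assumes values are numbers or None.
    """
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)  # frequency of each distinct value
    stats = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_n": counts.most_common(top_n),  # most frequent values first
    }
    if non_null:
        lo, hi = min(non_null), max(non_null)
        # Equal-width bins; fall back to width 1 when all values are equal.
        width = (hi - lo) / bins or 1
        histogram = [0] * bins
        for v in non_null:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            histogram[idx] += 1
        stats.update(min=lo, max=hi, histogram=histogram)
    return stats


column = [1, 2, 2, 3, 3, 3, 9, None]  # made-up sample data
result = profile_column(column)
print(result)
```

Even in this toy form, the topN and histogram metrics require tracking every distinct value and knowing the column's minimum and maximum before binning, which hints at why such metrics are expensive to compute across thousands of columns at scale.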