Embodiments of the invention relate generally to computing systems, and more particularly to processing a data set stored in a computer system.
Data profiling is a mandatory step in all types of information integration projects, such as data warehousing (DW), master data management (MDM), or application consolidation. The purpose of data profiling is to assess the quality of data residing in source systems before it is loaded into target systems.
In some projects, source systems may comprise over 70,000 tables containing several terabytes of data without any documentation. Running data profiling on, for example, 6,000 tables, and then understanding and assessing the resulting profiling result reports (PRRs), can take weeks or months. Even if weeks or months were available for processing the results, and even assuming that the analysis itself required no time, the review could not be completed because there would not be enough resources to assess and classify the PRRs within the given timeframe. This is because each PRR has to be evaluated individually to ensure that no quality problems are missed. In such a project, for cost reasons, there are simply not enough resources available to review 6,000+ PRRs. In addition, when PRRs are assessed under extreme time pressure, the probability of critical errors increases, which can cause load failures or broken processes later in the information integration project due to unhandled data quality issues.