Data profiling generally refers to the process of examining the data available in an existing data source (e.g., a database or a file) and collecting statistics and information about that data for various purposes. For example, data profiling may be used to optimize the execution of logic code in a computing environment by identifying data challenges early in a data-intensive or calculation-intensive project.
If complex calculations are not performed in advance, execution of the logic code may be delayed. Using data profiling, one can determine in advance, for example, the values that certain variables commonly take during execution of the logic code. Once such values are determined, the execution of the logic code may be optimized by using the precalculated values instead of calculating those values later, at the point when the executing logic code needs them.
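As an illustration only, the idea of determining commonly used values in advance and reusing calculated results can be sketched in Python. The function names (`expensive`, `profile_values`, `build_precomputed`, `run`) and the sample values are hypothetical and are not taken from any particular profiling tool; `expensive` merely stands in for a complex calculation.

```python
from collections import Counter


def expensive(x):
    # Hypothetical stand-in for a complex calculation.
    return sum(i * i for i in range(x))


def profile_values(observed):
    """Count how often each value was observed for a variable."""
    return Counter(observed)


def build_precomputed(profile, top_n):
    """Calculate, in advance, results for the top_n most common values."""
    return {value: expensive(value) for value, _ in profile.most_common(top_n)}


def run(x, precomputed):
    # Use the result calculated in advance when one exists;
    # otherwise fall back to calculating it at the point of use.
    if x in precomputed:
        return precomputed[x]
    return expensive(x)


# Values assumed to have been observed for one variable during profiling.
profile = profile_values([1000, 1000, 5, 1000, 7, 5])
table = build_precomputed(profile, top_n=2)
result = run(1000, table)  # served from the precomputed table
```

In this sketch only the two most frequent values are precalculated; less common values are still calculated on demand, which keeps the table small at the cost of occasional slower calls.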
In particular, data profiling is important to profile-based compilation and optimization. Many compilers and optimization tools, such as just-in-time (JIT) compilers, require efficient data profile gathering in order to optimize the program code. In JIT compilation, logic code segments are dynamically compiled at execution time from a high-level language to executable code. This is in contrast to static compilation, in which the entire program code is compiled into executable code once and is then executed multiple times.
Most data profiling schemes require instrumenting the program code and performing a test run in which the program code is allowed to run, fully or partially, through its course of execution. During the test run, all the related values for the target variables may be temporarily stored in memory or in a database. After all the values are stored, an analysis program is run on the stored data to perform a statistical analysis by applying the desired statistical formulas to the data.
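The store-then-analyze scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory store, the `record` hook, the instrumented `program`, and the chosen statistics (count, mean, mode) are all hypothetical examples rather than a description of any specific profiler.

```python
import statistics
from collections import defaultdict


def record(store, name, value):
    """Instrumentation hook: store every value seen for a target variable."""
    store[name].append(value)
    return value


def program(inputs, store):
    # Hypothetical program code, instrumented to profile the variable "scaled".
    total = 0
    for item in inputs:
        scaled = record(store, "scaled", item * 2)
        total += scaled
    return total


def analyze(store):
    """Analysis pass, run only after the test run has stored all values."""
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            "mode": statistics.mode(values),
        }
        for name, values in store.items()
    }


# Test run: every value is kept in memory until the analysis is performed.
store = defaultdict(list)
program([1, 2, 2, 3], store)
report = analyze(store)
```

Note that the store grows by one entry per profiled variable per execution of the instrumented statement, so its size scales with the length of the test run.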
Depending on the length of the test run, the number of variables involved, the complexity of the calculations, and the amount of data that is stored and analyzed, the above method may require substantial overhead in terms of execution and storage resources. Further, in an implementation in which the data analysis or calculation is performed during the execution of the program code, the time needed to analyze the data or to calculate the results of the test run may be prohibitively long.