1. Field of the Invention
This invention relates to computing system resource estimation and optimization, and more particularly relates to estimating and optimizing disk space and processor utilization in a parallel data integration environment.
2. Description of the Related Art
Data integration in a parallel computing environment typically involves computations on large data sets spread across many computers, or partitions. A data set consumes an area on a computer disk as a permanent storage location: an originating data set, or data source operator, consumes disk space on the originating computer, and a destination data set, or data sink operator, consumes disk space on the computer where the results are stored. While data sets are processed, some operations require storage for intermediate result sets. For example, a sorting operation may need to store intermediate results while the sort executes. This intermediate storage may be called scratch space, a term that describes an area of disk used to store intermediate result sets. Additionally, performing operations to process data sets consumes processing capacity on the computer performing the operations.
In a parallel computing environment, operations on data sets can be performed across many different computers, and the operational steps for processing the data sets can be intertwined and complicated. A description of the processing steps to be performed on one or more data sets may be called a data flow graph, or data graph. A data flow graph can be a pictorial or string representation of the relationships between operations in a data processing flow. The data flow graph describes the input data sources, the operations to be performed, and the data sink operators. The data flow graph may be coupled with various other information, for example a configuration file that describes which operations are performed on which partitions, and the data schema at the input and output of each operator.
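The data flow graph described above can be illustrated with a minimal Python sketch. It is offered only as an aid to understanding, not as the representation used by any particular data integration product. The names `Operator`, `DataFlowGraph`, and `build_example_graph`, the operator kinds, and the partition numbers are all hypothetical choices made for this illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Operator:
    """One node in the graph (hypothetical structure for illustration)."""
    name: str
    kind: str              # "source", "transform", or "sink"
    partitions: tuple = () # partitions the operator runs on (assumed config)


@dataclass
class DataFlowGraph:
    operators: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, op):
        self.operators[op.name] = op
        return self

    def connect(self, upstream, downstream):
        if upstream not in self.operators or downstream not in self.operators:
            raise KeyError("both operators must be added before connecting")
        self.edges.append((upstream, downstream))
        return self

    def execution_order(self):
        """Return operator names so that every operator follows its inputs."""
        indegree = {name: 0 for name in self.operators}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = deque(n for n, d in indegree.items() if d == 0)
        order = []
        while ready:
            current = ready.popleft()
            order.append(current)
            for src, dst in self.edges:
                if src == current:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.operators):
            raise ValueError("graph contains a cycle")
        return order

    def to_string(self):
        """String representation of the relationships between operations."""
        return "; ".join(f"{src} -> {dst}" for src, dst in self.edges)


def build_example_graph():
    """Two sources, a sort, a merge, and one sink (all names are made up)."""
    graph = DataFlowGraph()
    graph.add(Operator("customers", "source", (0, 1)))
    graph.add(Operator("orders", "source", (0, 1)))
    graph.add(Operator("sort_customers", "transform", (0, 1)))
    graph.add(Operator("merge", "transform", (0,)))
    graph.add(Operator("warehouse", "sink", (0,)))
    graph.connect("customers", "sort_customers")
    graph.connect("sort_customers", "merge")
    graph.connect("orders", "merge")
    graph.connect("merge", "warehouse")
    return graph
```

In this sketch the `partitions` field stands in for the configuration information that maps operations to partitions, and `to_string` corresponds to the string form of the graph; a data schema per operator could be attached in the same way.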
Predicting the resource utilization for a job (that is, its disk space usage, scratch space usage, and processor utilization) is important in designing the data processing job and in designing hardware or planning hardware upgrades. Also, if a fast and accurate prediction is available, the parameters of the job can be tested and optimized. For example, a job may comprise data sets to be merged and sorted, and the speed of completion for the job may vary considerably depending upon whether the merge or the sort occurs first. Current methods for predicting resource utilization depend upon rules of thumb and the extensive experience of the predictor. The nuances of a complicated parallel processing job are likely to be missed, and running multiple scenarios with a data set to determine the best order of job execution is prohibitively time-consuming. Further, only a person with expertise in estimating resource utilization from job parameters can make a reasonable estimate. Moreover, if the estimate proves to be incorrect, current methods make it difficult to systematically correct the estimate and develop a more reasonable model.
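The effect of operation order on cost can be illustrated with a toy model. The sketch below assumes, purely for illustration, that sorting n rows costs n·log2(n) comparisons and that merging two data sets costs one unit per row; it is not the estimation method of any existing tool, and the function names are hypothetical.

```python
import math


def sort_cost(rows):
    """Assumed cost of sorting: n * log2(n) comparisons."""
    return rows * math.log2(rows) if rows > 1 else 0.0


def merge_cost(left_rows, right_rows):
    """Assumed cost of merging: one unit of work per row handled."""
    return float(left_rows + right_rows)


def cost_sort_then_merge(left_rows, right_rows):
    # Sort each data set separately, then merge the two sorted results.
    return sort_cost(left_rows) + sort_cost(right_rows) + merge_cost(left_rows, right_rows)


def cost_merge_then_sort(left_rows, right_rows):
    # Combine the data sets first, then sort the combined result.
    return merge_cost(left_rows, right_rows) + sort_cost(left_rows + right_rows)
```

Under these assumptions, two data sets of 1024 rows each cost 22528 units when sorted first and 24576 units when merged first. A real job would also have to account for partitioning, scratch space, and input/output, which this toy model ignores.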