Internet companies have an increasing need to store and analyze massive data sets, such as search logs, web content, and click streams collected from a variety of web services. A single analysis may involve tens or hundreds of terabytes of data. To perform such massive analysis cost-effectively, distributed data storage and processing platforms have been developed to run on large clusters of shared-nothing commodity servers. A typical cluster may include hundreds or thousands of commodity machines connected via a high-bandwidth network.
Massive data analysis on such large clusters presents new opportunities and challenges for query optimization. A query optimizer must generate efficient query plans that make use of all cluster resources, which in practice means generating parallel plans. Query optimizers in traditional database systems, however, typically start with an optimal serial plan and then add parallelism in a post-processing step. Because the cheapest serial plan is not necessarily the cheapest plan once parallelized, this approach may result in sub-optimal plans.
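The pitfall of the two-phase approach can be illustrated with a toy cost model. The sketch below is purely illustrative, not drawn from the text: all plan names, costs, and the Amdahl-style parallel cost formula are assumptions chosen to show how the plan that is cheapest serially can lose once parallelism is taken into account.

```python
# Toy cost model (illustrative numbers only) showing why parallelizing
# the optimal serial plan can be sub-optimal overall.

# Two hypothetical candidate plans for the same query:
#   plan_a: cheapest when run serially, but much of its work
#           cannot be spread across machines.
#   plan_b: slightly more expensive serially, but almost all of
#           its work partitions evenly across the cluster.
plans = {
    "plan_a": {"serial_cost": 100, "parallel_fraction": 0.50},
    "plan_b": {"serial_cost": 120, "parallel_fraction": 0.95},
}

def parallel_cost(plan, machines):
    """Amdahl-style cost: the parallelizable fraction of the work is
    divided across machines; the remainder stays serial."""
    serial = plan["serial_cost"]
    p = plan["parallel_fraction"]
    return serial * (1 - p) + serial * p / machines

# Phase 1 of the two-phase optimizer: pick the best *serial* plan.
best_serial = min(plans, key=lambda name: plans[name]["serial_cost"])

# Phase 2: add parallelism to that one plan on a 100-machine cluster.
two_phase_cost = parallel_cost(plans[best_serial], machines=100)

# A parallel-aware optimizer instead compares *parallel* costs of all
# candidate plans and picks the overall winner.
best_parallel = min(plans, key=lambda name: parallel_cost(plans[name], 100))
joint_cost = parallel_cost(plans[best_parallel], machines=100)

print(best_serial, round(two_phase_cost, 2))    # plan_a 50.5
print(best_parallel, round(joint_cost, 2))      # plan_b 7.14
```

Under these assumed numbers, the two-phase optimizer commits to plan_a (serial cost 100 vs. 120) and ends with a parallel cost of 50.5, while a parallel-aware search would pick plan_b at a cost of 7.14, since its work divides almost entirely across the cluster.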