Database systems perform a vital role in the information technology infrastructure of a business or corporation. Specialized databases for data warehousing and data analysis are becoming an important segment of the total database market. Business intelligence systems transform raw data into useful information. Common applications of business intelligence systems are, for example, fraud detection, risk analysis, market segmentation, and profitability analysis.
In a business intelligence system, data is extracted from heterogeneous operational databases and external data sources, then cleansed, transformed, and loaded into a large data warehouse or data mart storage areas. Data warehouses are subject-oriented, integrated, and time-varied collections of data used primarily for making decisions. Data marts are departmentalized subsets of the data warehouse focusing on selected subjects, rather than the entire enterprise data.
Data is stored and managed by one or more data warehouse servers that provide data access to front end tools for querying, reporting, analysis, and mining. Specialized online analytical processing (OLAP) servers may also be used to construct multidimensional views of the data, where operations on data can be performed.
Business intelligence workloads have different characteristics than the traditional transaction processing workloads used in conventional capacity planning and sizing methods. Business intelligence workloads place a greater emphasis on summarized and consolidated data as opposed to individual records. Business intelligence workloads typically use a very large size database. Queries of business intelligence workloads are heterogeneous, complex, and ad-hoc in nature, varying greatly in the amount of time required to execute the queries. These queries often touch millions of records and may perform many table joins, sorts, or aggregations. Furthermore, queries of business intelligence workloads can produce very large results sets, requiring a large amount of concurrent I/O.
Computer capacity planning is the process of analyzing and projecting an existing workload to determine the type of hardware resources needed to meet future demand and to predict when system saturation occurs. The capacity planning process can be long and challenging, depending on the size and complexity of the application, the quality and quantity of information available, as well as the approaches and tools employed. A computer capacity-sizing expert often performs computer capacity planning manually with insufficient information and using an unstructured, informal approach.
Database system sizing attempts to arrive at an initial estimate of a hardware configuration that satisfies performance demands, cost constraints, and functional requirements of a new business intelligence system. Typically, detailed information about the system and its workload are not available during the sizing process. In conventional database system sizing, a sizing expert uses published performance results of a similar workload with similar performance requirements. The sizing expert extrapolates these results to the new business intelligence system. This extrapolation is performed using informal industry guidelines (“rules of thumb”) and published performance relationships between different types of hardware. The sizing expert thus obtains an initial estimate of the hardware configuration comprising the processor, disk, and memory required to meet resource demands of the expected workload and the expected size of the database.
Selecting the appropriate hardware resources can be a complicated task because of the wide variety of processor, disk, network, and memory technologies available. Further, determining the quantity of each resource needed and predicting how the different components interact under a specific workload are non-trivial tasks.
Many of the approaches and tools used for capacity planning were developed in the late-1970s and early-1980s when mainframe computers were the dominant computing platform. Mainframes were very expensive; therefore it was critical to perform detailed planning and analysis before a particular model was purchased. A variety of tools were created to help a planner with this task, including tools for performance monitoring, workload forecasting, performance simulation, and design/configuration advice.
As mainframe architectures slowly gave way to client-server, and more recently, n-tier architectures, the focus on planning was not as systematic. This may be partly attributed to financial factors; namely the declining cost and improving performance of computer hardware. The cost of a cluster of inexpensive server machines networked together became substantially less than that of a mainframe. Fixing configuration errors resulting from poor planning could cost in the thousands of dollars for n-tier architectures versus millions of dollars for mainframes. The additional cost to perform detailed planning analysis often exceeded the costs to correct configuration errors, thus complete planning studies were relatively unattractive.
The complexity of modeling performance in n-tier architectures also makes planning more difficult. Traditional methods used for mainframes are not directly transferable to n-tier architectures. In the mainframe domain, components such as processors, disks, and memory share similar designs and characteristics; however, this is not the case in n-tier environments. The proliferation of competing and sophisticated processor, disk, memory, and network technologies makes creating generic performance models very difficult.
The resource demands of modern applications are also more complex and demanding in nature than in the past, making their performance less predictable. The popularity and commercialization of the Internet and World Wide Web fostered the demand for newer and richer data such as graphics, audio, video, and XML. Whereas this data was once stored for archival purposes only, companies have now started analyzing it with specialized data analysis applications to discover new information about their business and customers. This places additional resource burdens on systems in addition to the traditional transaction processing workloads being handled.
Time and business pressures also make detailed capacity planning studies infeasible. In today's e-business on demand environment, customers demand and expect answers in a timely fashion. A day or week is often a critical amount of time for completing a hardware sale. This implies that any planning analysis needs to be performed quickly while maintaining a high degree of accuracy.
The result of the sizing process is an initial estimate of the hardware configuration (processor, disk, and memory) needed to meet the resource demands of the expected workload and size of database. Customers expect a cost-efficient and effective hardware solution that meets the performance requirements of their application while offering the maneuverability to accommodate future expansion. There is generally no opportunity for experts to validate their hardware recommendations because of financial and time constraints. The sizing process currently involves significant manual effort to complete.
Successful sizing of a business intelligence system requires a characterization of the anticipated workload. Workload characterization dates back to the 1970s when workloads largely comprised large transactions and batch jobs performed on mainframe computers. Techniques have evolved to accommodate some of the modern workloads encountered in current computing environments. The majority of conventional workload characterization approaches assume that detailed performance measurements from a production environment are available to build models of system performance.
One conventional workload characterization approach uses clustering analysis to construct a profile of a data warehousing workload to summarize the characteristics of the workload. The profile can be used to help a designer during logical and physical optimization. The profile can be further used to generate workloads useful for evaluating system performance in testing and benchmark settings.
Conventional techniques for workload characterization rely on a combination of structural query properties and statistical parameters to perform a clustering. The structural properties are based on the text of a query, such as the number of table joins, the number of predicates, and the type of predicates. Statistical parameters comprise quantitative values in the catalog tables of a database system, such as table size, size, the type of indexes on a table, and the skew of data values in the table. Although these techniques have proven to be useful, it would be desirable to present additional improvements. Currently, there is no known conventional technique for workload characterization that makes use of performance-oriented measurements, that are subsequently used to aid in the sizing of a new database system running a business intelligence workload.
Conventional techniques for selecting an initial size of a hardware configuration for a database system are manually performed by sizing experts. Furthermore, conventional techniques assume that little system environment information or performance measurements are available, thus a sizing expert relies on extrapolations from similar workloads, personal experience, industry benchmarks, informal industry guidelines, and hardware performance guidelines to determine the type and quantity of required resources. Currently, there is no available method for characterizing an anticipated workload based on performance-oriented characterization of a similar workload.
What is therefore needed is a system, a service, a computer program product, and an associated method for characterizing a business intelligence workload to aid in sizing the hardware configuration of a new database system. The need for such a solution has heretofore remained unsatisfied.