Relational database systems store data in tables that are organized into rows and columns. This allows large volumes of data to be stored and accessed efficiently and conveniently in response to user commands. User commands are typically submitted in the form of SQL (Structured Query Language) statements, such as SELECT, INSERT, DELETE, and UPDATE. Since SQL is a declarative language, i.e., it only specifies what data is to be accessed, not how that data is to be accessed, database systems include optimizers that formulate different plans for accessing data and evaluate those plans to select one for execution based on, for instance, costs or rules.
For cost-based optimizers, picking the “correct” plan, i.e., the least costly plan in terms of resources, time, or a combination of both, depends upon the statistics (i.e., information about the underlying data to be accessed) available to the optimizer. Database systems usually maintain some very basic statistics, such as size and column-data distribution of base tables. Using those basic statistics, the cost of certain operations, such as simple predicates (‘col=value’), can be accurately estimated. The cost associated with some operations, however, are much more difficult to estimate. For example, the cardinality (i.e., the number of rows in the result set) for join operations is notoriously difficult to estimate. Additionally, once the cardinality for one operation is wrongly estimated, that error propagates to subsequent operations.
To improve optimizer performance, database administrators have the option of manually declaring views on statistics collected for various operations handled by the database system (i.e., statistical views). However, since the number of statistical views that can be declared for a database system could be unlimited, it would be impractical to manually declare and maintain all of those views as the costs will quickly outweigh the benefits. For example, additional storage will be needed to store the numerous statistical views. In addition, the costs involved in keeping the statistical views up-to-date may be too high. Moreover, the time it would take an optimizer to search through all of the statistical views when compiling an execution plan may be too much of an overhead.
Accordingly, there is a need for an invention that not only automates the process of generating statistical views, but also balances the need to minimize optimizer errors with the costs associated with generating and maintaining statistical views.