1. Field of the Invention
This invention relates in general to computer-implemented database management systems, and, in particular, to cost-based optimization of queries by identifying and eliminating redundant execution steps in processing the queries and to identifying generic views for use in executing the queries.
2. Description of Related Art
Next generation decision support applications are typically capable of processing huge amounts of data, and they may have the ability to integrate data from multiple, heterogeneous data sources. Such data sources may include traditional database systems, repositories on the Internet/World Wide Web ("the Web"), semi-structured documents, and file systems. These data sources often differ in a variety of aspects, such as their data models, the query languages they support, and their network protocols. Additionally, they are frequently spread over a wide geographical area. Decision support queries may be used to analyze and compare information from diverse sources. Processing decision support queries in this setting often involves redundant processing because comparing information requires comparing the same data with different GROUP BY operations. A GROUP BY operation causes rows in an intermediate query answer set to be grouped according to the values in the column(s) specified in the GROUP BY operation. Exemplary redundancies may include repeated access of the same data source and multiple executions of similar processing sequences. Thus, the cost of processing decision support queries in this setting can be quite high.
This problem of efficiently processing heterogeneous decision support queries has recently received considerable attention from database researchers: Ahmed, R., Smedt, P., Du, W., Kent, W., Ketabchi, A., and Litwin, W, The Pegasus Heterogeneous Multidatabase System, IEEE Computer, December 1991, [hereinafter "[ASD+91]"]; Chawathe, S., Garcia-Molina, H., Hammer, H., Ireland, K., Papakonstantinou, Y., Ullman, J. D., and Widom, J.; The TSIMMIS Project: Integration of Heterogeneous Information Sources, In Proc. of IPSJ, Tokyo, Japan, 1994, [hereinafter "[CGH+94]"]; Christophides, V., Cluet, S. Abiteboul, S., and Scholl, M., From Structured Documents to Novel Query Facilities, In ACM SIGMOD Intl. Conf on Management ofData, 1994, [hereinafter "[CAS94]"]; Papakonstantinou, Yannis, Garcia-Molina, H., and Widom, Jennifer, Object Exchange Across Heterogeneous Information Sources, In Proc. Intl. Conf on Data Engineering, Taipei, Taiwan, February 1995, [hereinafter "[PGW95]"]; Subrahmanian, V. S., Adali, S., Brink, A., Emery, R., Lu, J. J., Raiput, A., Rogers, T. J., Ross, R., and Ward, C. Hermes, Heterogeneous Reasoning and Mediator System, Tech.report, submitted for publication, Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College Park, Md. 20742, 1995, [hereinafter "[SAB+95]"]; Levy, A. Y., Rajaraman, A., and Ordille, J. J., Querying Heterogeneous Informnation Sources Using Source Descriptions, In Proc. 22nd VLDB Conf, pages 251-262, 1996, [hereinafter "[LRO96]"]; Tomasic, A., Raschid, L., and Valduriez, P., Scaling Heterogeneous Databases and the Design of Disco, In Proc. IEEE Intl. Conf on Distributed Computing Systems, 1996, [hereinafter "[TRV96]"]; Lakshmanan, L.V.S., Sadri, F., and Subramanian, I. N. SchemaSQL--a language for querying and restructuring multidatabase systems, In Proc. IEEE Int. Conf on Very Large Databases (VLDB'96), pages 239-250, Bombay, India, September 1996, [hereinafter "[LSS96]"]; Atzeni, Paolo, Mecca, Giansalvatore, Merialdo, Paolo, and Tabet, Elena. Structures in the Web, Technical Report, DDS, Sezione Informatica, Universita di Roma Tre, 1996, [hereinafter "[ANIMT96]"]; L. M. Haas, D. Kossmann, E. L. Wimmers, and J. Yang, Optimizing Queries Across Diverse Data Sources, In Proceeding of the VLDB Conference, Aug. 1997, "[HKWY97]"]; and Abiteboul, Serge, Querying Semi-Structured Data, In 6th International Conf on Database Theory, Delphi, Greece, January 1997, [hereinafter "[Abi97]"], which are incorporated by reference herein.
The majority of the approaches are based on the idea of developing a database-like `wrapper` for data sources and implementing queries against these sources [CGH+94], [HKWY97], and Tork R. M. and P. Schwarz, Dont Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources, In Proceeding of the VLDB Conference, Aug. 1997, [hereinafter "[RS97]"], which are incorporated by reference herein. Typically, wrappers provide a relational or object-relational view of the data in the non-traditional sources and enable the user to use a common language/interface to query data from the diverse sources. Systems that provide end users with an integrated view of data in multiple data sources are referred to as Heterogeneous Database Systems (HDBS) and Multi-database Systems (MDBS) and are increasingly becoming relevant in the context of real-life business applications.
As an illustration, consider an application in which an investment broker manages the investment portfolios of his clients. The portfolio information may be stored in a relational database, which also contains other information about the clients such as their address, profession, etc. The broker obtains the latest stock price, as well as historical stock price information from the stock exchange servers on the Web. The broker also maintains account information in a spreadsheet for each client. In order to make complex decisions involving the buying and selling of stocks for the clients, the broker would have to use decision support queries to analyze and compare information from all of these sources.
Decision support queries analyze and compare information from diverse sources. Comparing information from diverse sources may require comparing the same data with different GROUP BY operations. Such a comparison may result in a query specification that contains computational redundancies. An analysis of TPCD benchmark queries, which were modeled after conventional decision support queries, reveal that redundancies even exist in computations of answers for simple queries, TPC, TPC Benchmark.RTM. D (Decision Support), Working draft 6.0, Transaction Processing Performance Council, August 1993, [hereinafter "[TPC93]"], which is incorporated by reference herein (see examples in Appendix A). Conventional database query optimizers generally lack the capability of identifying these redundancies. Hence, the results of one executed query segment are rarely used for processing another query segment. Since decision support queries are typically time consuming to run, especially in a HDBS setting, identifying and sharing computational results judiciously could lead to significant improvements in performance. The example that follows illustrates the kind of redundant computation that is typical of decision support queries.
Consider the following decision support query of the investment broker discussed above: list techno stocks owned by computer engineers that had a higher average sales volume over the past year than the maximum sales volume, which was reached in the first six months of the year, of any oil stock owned by a chemical engineer; and list the name of the computer engineer.
For this example, a relational wrapper is implemented which enables the user to utilize a common language/interface, e.g., a Structured Query Language (SQL) interface. Accordingly, a representative SQL query is shown below. In the example below, Rinvest in a relational database, represented as Rinvest(name, profession, ticker, qty, buyDate, buyPrice). Wstock is a Web data source, represented as Wstock(ticker, category, date, volume, endprice)
SELECT Rinvest.name, Rinvest.ticker PA1 FROM Rinvest, Wstock PA1 WHERE Wstock.cat=`Tech` AND PA1 GROUP BY Rinvest.name, Rinvest.ticker, Wstock.ticker PA1 HAVING AVG(Wstock.volume)&gt; PA1 1. query the web data source to identify the sub-relation of oil stocks in the Wstock source for the first six months of 1997; PA1 2. obtain portfolio information for investors who are chemical engineers; PA1 3. JOIN portfolio information with the volume information for oil stocks obtained from the web; PA1 4. perform a GROUP BY on volume to obtain the maximum volume of sales of any oil stock traded; PA1 5. query the web source to identify the tech stocks in the Wstock source for 1997; PA1 6. obtain portfolio information for investors who are computer engineers, PA1 7. JOIN the investor portfolio information with the volume information of tech stocks; PA1 8. perform a GROUP BY on the ticker to obtain the average volume of sales for each tech stock for each investor; and PA1 9. filter the tech stocks based on whether the average volume exceeds the average volume obtained in step (4). PA1 1. Query the web data source to identify the sub-relation of oil stocks and tech stocks at the web source for the year 1997; store the results in a temporary relation. PA1 2. Obtain portfolio information for investors who are chemical engineers or computer engineers; store the results in a temporary relation. PA1 3. JOIN the volume information obtained from the web (temporary relation of step (1)) with the portfolio information (temporary relation of step (2)). PA1 4. Perform a GROUP BY on investor name, investor profession, and volume to obtain the maximum volume, and average volume of sales for the technology and oil stocks and store this in a temporary relation. PA1 5. Scan the temporary relation of step (4) to obtain the maximum volume for the oil stocks owned by chemical engineers. PA1 6. Scan the temporary relation of step (4) to obtain the average volume of the tech stocks owned by computer engineers. PA1 7. Filter the tech stocks based on if the average volume exceeded the number obtained in step (5).
Rinvest.Profession=`Computer Engineer` AND PA2 Wstock.date&lt;=`Dec. 31, 1997` AND PA2 Wstock.date&gt;=`Jan. 1, 19997` AND PA2 Rinvest.ticker=Wstock.ticker PA2 (SELECT MAX(Wstock.volume) PA2 FROM Wstock, Rinvest PA2 WHERE Rinvest.cat=`Oil` AND PA2 Rinvest.Profession=`Chemical Engineer` AND PA2 Rinvest.ticker=Wstock.ticker AND PA2 Wstock.date&lt;=`Jun. 30, 1997` AND PA2 Wstock.date&lt;=`Jan. 1, 1997`)
The above query contains two query blocks. The first query block computes the average sales volume per day of the techno stocks owned by computer engineers over the 1997 year. The second query block computes the maximum sales volume, which was reached in the first six months of 1997, of any oil stock owned by a chemical engineer.
Operationally, one technique of processing this query could involve the following steps which closely reflect the way some relational query optimizers execute the above query:
Step (1) and step (5) are similar. Specifically, they both connect to the web source and obtain information on a certain type of stock. Connecting to the web source multiple times could be costly if the broker is charged based on the number of times he connects to the web source. Hence, it would be profitable to combine these steps together and connect to the web source once. A query could be issued that obtains both the oil stock information and the techno stock information at the same time. The necessary filtering operation could be performed later to separate the oil data from the techno data. Combining steps in this manner would result in the extra overhead of storing the `temporary data` and the cost of applying the filtering operations. However, a worthwhile trade-off exists between these two methods of performing the steps.
Similarly, steps (2) and (6) can be combined to obtain, with one scan of the Rinvest table, the portfolio information of investors who are either chemical engineers or computer engineers. Such a combination would especially be profitable if the Rinvest table lacks an index. Likewise, steps (3) and (7) can be combined to perform the JOIN operation on the information obtained from the web source and from the investor table. The GROUP BY operation obtained by combining (3) and (7) can also be combined into a single operation. This result can then be used to compute the average volume of technology stocks and the maximum volume of the oil stocks.
If the `redundancies` were reduced in the manner discussed above, the following steps would be involved in processing the query.
The above example illustrates that various steps in the processing of even some simple decision support queries, can be combined and later compensated with an aim towards improving overall performance.
However, this combine and compensate approach may not always lead to performance improvements. Thus, there is a need for a technique that reduces redundancies by exploiting the similarities in the steps involved when processing a complex decision support query.
One of the early techniques on optimizing queries with common sub-expression is discussed in Hall, P. V. Hall, Common Subexpression Identification in General Algebraic Systems. Technical, Report UKSC 0060, IBM United Kingdom Scientific Center, Nov. 1974, [hereinafter "[Hal74]"]; and P. V. Hall, Optimization of a Single Relational Expression in a Relational Database System, IBM Journal of Research and Development, 20(3), May 1976, [hereinafter "[Hal76]"], which are incorporated by reference herein. This work was based on identifying identical common sub-expressions by examining the query syntax using the operator tree approach. Yet, identifying redundant expressions based on the query syntax can have a detrimental effect of performance as it limits the optimizer's choice for generating optimal plans.
The Sellis article, T-K Sellis, Multiple-Ouery Optimization, ACM Transaction on Database Systems, 13(1):2352, March 1988, [hereinafter "[Sel88]"], which is incorporated by reference herein, provides a description of an alternative method for optimizing queries. Sellis focuses on optimizing multiple queries submitted as a batch of queries to the database. When multiple queries are submitted for batch execution, Sellis describes two techniques, Interleaved Execution (IE) and Heuristic Algorithm (HA), to identify identical sub-plans. The IE technique generates several redundant temporary relations for each pair of identical sub-plan operators. The HA technique is based on the work by Grant and Minker, J. Grant and J. Minker, On Optimizing the Evaluation of a Set of Expressions, Technical Report TR-916, University of Maryland College Park, July 1980, [hereinafter "[GM80]"]; and J. Grant and J. Minker, Optimization in Deductive and Conventional Relational Database Systems, Advances in Database Theory, 1:195-234,1981, [hereinafter "[GM81]"]. To optimize the plan for the batch of queries, such that the global plan is optimized, the HA uses many non-optimal intermediate query plans for each query in a query batch. The HA and the IE techniques combine query plans only when a strict notion of equality exists between the query plans. The techniques also assume that the identical sub-plans can be found easily. Moreover, they assume that the JOIN predicates are simple EQUI-JOINs.
In an attempt to find fast access paths for view processing, Grant Roussopoulos, N. Roussopoulos, The Logical Access Path Schema of A Database, IEEE Transactions on Software Engineering, 8(6):563-573, Nov. 1982, [hereinafter "[Rou82a]"]; and N. Roussopoulos, View Indexing in Relational Databases, ACM Transactions on Database Systems, 7(2):258-290, June 1982, [hereinafter "[Rou82b]"], which are incorporated by reference herein, provides a framework for interquery analysis based on query graphs introduced by Wong and Youssefi, E. Wong and K. Youssefl, Decomposition: A Strategy for Query Processing, ACM Transactions on Database Systems, 1(3):223-241, Sept. 1976, [hereinafter "[WY76]"], which is incorporated by reference herein. Kim, W. Kim, Global Optimization of Relational Oueries, Query Processing in Database Systems, Springer Verlag, 1984, [hereinafter "[Kim84]"], describes a two stage optimization procedure similar to that of [GM80]. The unit of sharing in queries in Kim's proposal is a relation and does not extend to arbitrary expressions that are part of a query. Thus, Kim's work is applicable in a restricted setting of single relation queries.
There are several papers that are based on caching query results, for answering future queries. Franklin S. Dar, M. Franklin, B Jonsson, D. Srivastava, and M. Tan. Semantic Data Caching and Replacement, In Proceedings of the 22nd VLDB Conference, Mumbai, India, Sept 1996, [hereinafter "[DFJ+96]"], which is incorporated by reference herein, uses semantic data caching to identify data in the client buffer in a client-server environment. Franklin replaces pages in the clients buffer based on the least used semantic data regions and compares this technique with tuple and page caching. Finkelstein, S. Finkelstein, Common Subexpression Analysis in Database Applications, In Proceedings of the ACM SIGMOD, 1982, [hereinafter "[Fin82]"], also uses similar methods to cache data, of previously executed queries to answer future queries. Jarke, M. Jarke, Common Subexpression Isolation in Multiple Query Optimization, Query Pro-cessing in Database Systems, Springer Verlab, pages 191-205,1984, [hereinafter "[Jar84]"], which is incorporated by reference herein, discusses the problem of common sub-expression isolation. Jarke presents several different formulations of the problem under various query language frameworks. Jarke discusses how common expressions can be detected and used according to their type.
Recently there has been a lot of work in the area of using materialized views to answer decision support queries. There are also related works that address the problem of maintaining materialized views so that they can be maintained efficiently when there are updates to the base tables. Ross et al., K. A. Ross, D. Srivastava, and S. Sudarshan, Materialized View Maintenance and Integrity Constraint Checking: Trading Space for Time, In Proceedings of the International Conference on Management of Data, May 1996, [hereinafter "[RSS96]"], which is incorporated by reference herein, addresses the problem of efficiently maintaining materialized views by maintaining other materialized views. Divesh et al., D. Srivastava, S. Dar, S. Jagadish, and A. Levy, Answering Queries with Aggregation Using Views, In Proceedings of the 22nd International Conference on Very Large Data Bases, Sept 1996, [hereinafter "[SDJL96]"]; and Levy et al., Levy, A. Y., Rajaraman, A., and Ordille, J. J., Querying Heterogeneous Information Sources Using Source Descriptions, In Proc. 22nd VLDB Conf. pages 251-262, 1996, [hereinafter "[LMSS95]"], which are incorporated by reference herein, describe techniques to determine the portions of the query that can be expressed efficiently using the definition of materialized views. Chaudhuri et al., S. Chaudhuri, R. Krishnamurthy, S. Potamianos, and K. Shim. Optimizing Oueries with Materialized Views; In Proceedings of the IEEE Conference on Data Engineering, March 1995, [hereinafter "[CKPS95]"], which is incorporated by reference herein, describes the problem of optimizing queries in the presence of materialized views. Chaudhuri identifies portions of the query that can be answered with the materialized view and determines if it is efficient to answer the query using the materialized view.
Papers in the area of Online Analytical Processing (OLAP) have studied the problem of determining the views to materialize in the presence of space constraints so that computation can be speeded up to compute the Cube By operator proposed by Gray et al, J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By Cross-Tab, and Sub-Totals, Technical Report MSR-TR95-22, Microsoft Technical Report, 1995, [hereinafter "[GBLP95]"], which is incorporated reference herein. Harinaryanan et al., V. Harinaryanan, A. Rajaraman, and J. D. Ullman, Implementing Data Cubes Efficiently, In Proceedings of the ACMSIGMOD Conference, pages 205-216, May 1996, [hereinafter "[HRU96]"], which is incorporated by reference herein, have studied the problem of computing data cubes efficiently by materializing intermediate results. They have also studied techniques on when to construct indexes on intermediate results, H. Gupta, V. Harinarayanan, A. Rajaraman, and J. Ullman, Index Selection for OLAP, In Proceedings of the International Conference on Data Engineering, May 1997, [hereinafter "[GHRU97]"], which is incorporated by reference herein. H. Gupta, Selection of Views to Materialize in a Data Warehouse, In Proceedings of the ICDT, January 1997, [hereinafter "[Gup97]"], which is incorporated by reference herein, develops a theoretical framework for identifying the views to materialize so that they can be maintained efficiently in the presence of storage constraints.
Some of the approaches discussed above, limit the optimizer's choice for generating optimal plans or require a strict notion of equality between query plans. Additionally, some of the plans are only applicable in a restricted setting of single relation queries. Thus, there is a need in the art for an improved query optimizer.