A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to information processing environments and, more particularly, to access and processing of information in a data processing system embodied, at least in part, in portable devices.
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level.
The general construction and operation of a database management system is known in the art. See e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC xe2x80x9cclientxe2x80x9d systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server), to form a client/server database system. In operation, clients issue one or more query language (e.g., SQL) commands to the server. A query language is a specialized language for accessing or processing information from a database. SQL commands may, for instance, specify a query for retrieving particular data (i.e., data records meeting the query condition) from a database table. The syntax of SQL (Structured Query Language) is well documented; see, e.g., the abovementioned An Introduction to Database Systems. As used herein, xe2x80x9cSQLxe2x80x9d shall also include vendor-specific variants of SQL, such as Sybase(copyright) Transact-SQL. In addition to retrieving the data from database server tables, the clients also include the ability to insert new rows of data records into the table; clients can also modify and/or delete existing records in the table. Commercial examples of these xe2x80x9cclient/serverxe2x80x9d systems include Powersoft(trademark) clients connected to one or more Sybase Adaptive Server(trademark) database servers. Both Powersoft(trademark) and Sybase Adaptive Server(trademark) (formerly Sybase SQL Server(trademark)) are available from Sybase, Inc. of Emeryville, Calif.
In today""s computing environment, database technology can be found on virtually any device, from traditional mainframe computers to cellular phones. Sophisticated applications, whether enterprise information portals or sales force automation systems, can xe2x80x9cpushxe2x80x9d much of their complexity into the database itself. Indeed, this represents one of the main benefits of database technology. The challenge, however, is to support these complex applications, and the queries they generate, on small computing devices. At the same time, users expect the productivity and reliability advantages of using an SQL database, while maintaining the size and performance advantages of hand-coded applications.
Consider, for instance, the execution of an SQL request or query. A query xe2x80x9coptimizerxe2x80x9d in a relational DBMS is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operators selection, projection, join, and so on. Typically, this is done by generating many different join strategies, evaluating the cost of each, and selecting the access plan with the lowest overall cost, where xe2x80x9ccostxe2x80x9d is a metric that measures a combination of factors, including but not limited to the estimated amount of computational overhead, number of physical I/O operations, and response time. The process of generating these alternative join strategies is termed xe2x80x9cjoin enumeration.xe2x80x9d However, producing an optimal access plan for an arbitrary SQL query is an NP-complete problem (see, e.g., Ibaraki, T. and Kameda, T., xe2x80x9cOn the optimal nesting order for computing n-relational joinsxe2x80x9d, ACM Transactions on Database Systems, 9(3): 482-502, September 1984; Ono, K. and Lohman, G. M., xe2x80x9cMeasuring the complexity of join enumeration in query optimizationxe2x80x9d, Proceedings of the 16th International Conference on Very Large Data Bases, pp. 314-325, Brisbane, Australia, August 1990, Morgan Kaufmann; Ozsu, M. T. and Valdariez, P., xe2x80x9cPrinciples of Distributed Database Systemsxe2x80x9d, Prentice-Hall, Englewood Cliffs, New Jersey, 1991; Steinbrunn, M., et. al., xe2x80x9cHeuristic and randomized optimization for the join ordering problemxe2x80x9d, The VLDB Journal, 6(3): 191-208, August 1997), to discover an optimal strategy requires an exhaustive search. Consequently, optimizers often use heuristics (see, e.g., Ono, K. and Lohman, G. M. above; Steinbrunn, M., et. al. above; Ullman, J. D., xe2x80x9cPrinciples of Database and Knowledge-Base Systems, Volume 2xe2x80x9d, Computer Science Press, Rockville, Md., 1989) to reduce the number of strategies that the plan selection phase must consider.
A common heuristic used in most commercial optimizers is to restrict the strategy space to those that perform unary operations (particularly restriction) first, thus reducing the size of intermediate results. See, e.g., Smith, J. M. and Chang, P. Y.-T., xe2x80x9cOptimizing the performance of a relational algebra database interfacexe2x80x9d, Communications of the ACM, 18(10): 568-579, October 1975; Ullman, J. D. above. Another common optimization heuristic, and one used by IBM""s STARBURST, is to defer the evaluation of any Cartesian products to as late in the strategy as possible. See, e.g., Morishita, S., xe2x80x9cAvoiding Cartesian products for multiple joinsxe2x80x9d, Journal of the ACM, 44(1): 57-85, January 1997; Ono, K. and Lohman, G. M. above. To further reduce the number of alternative plans, an optimizer may consider only left-deep processing trees. See, e.g., Cluet, S. and Moerkotte, G., xe2x80x9cOn the complexity of generating optimal left-deep processing trees with cross productsxe2x80x9d, Proceedings of the Fifth International Conference on Database Theoryxe2x80x94ICDT 1995, pp. 54-67, Prague, Czech Republic, January 1995, Springer-Verlag; Ibaraki, T. and Kameda, T. above; Selinger, P. G., et. al., xe2x80x9cAccess path selection in a relational database management systemxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 23-34, Boston, Mass., May 1979. For SPJ queries a left-deep processing tree is one where the right child of any join must be a base table. For more complex queries, a left-deep tree means that the right child of any binary operator cannot be a join, though it could be the (possibly materialized) result of a view or table expression containing Union, Group by, or aggregation. Left-deep trees are desirable because (1) they reduce the need to materialize intermediate results, (2) for several types of join implementations they result in more efficient execution plans, and (3) the space of xe2x80x9cbushyxe2x80x9d plans is considerably larger, and hence more expensive to search. See, e.g., Vance, B. and Maier, D., xe2x80x9cRapid bushy join-order optimization with Cartesian productsxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 35-46, Montreal, Quebec, June 1996, Association of Computing Machinery. In the worst case, a completely connected join graph for a query with n quantifiers has n! alternative strategies with left-deep trees, and (2nxe2x88x922)!/(nxe2x88x921)! alternatives when considering bushy processing trees. See, e.g., Pellenkoft, A., et. al., xe2x80x9cThe complexity of transformation-based join enumerationxe2x80x9d, Proceedings of the 23rd International Conference on Very Large Data Bases, pp. 306-315, Athens, Greece, August 1997, Morgan-Kaufmann.
Restricting the optimization space to left-deep trees still results in an NP-complete problem for queries of arbitrary complexity. See, e.g., Cluett, S. and Moerkotte, G. above; Ibaraki, T. and Kameda, T. above. Ono and Lohman (see, e.g., Ono, K. and Lohman, G. M. above) gave a lower bound of O(3n), with n the number of quantifiers, on the complexity. Vance and Maier (see, e.g., Pellenkoft, A., et. al. above; Vance, B. and Maier, D. above) show that bottom-up evaluation of alternatives does not achieve this lower bound in all cases; they give a complexity of O(4n) of join enumeration for the deterministic dynamic programming technique used by IBM""s System R (see, e.g., Selinger, P. G., et. al. above) and STARBURST. See, e.g., Ioannidis, Y. E. and Kang, Y. C., xe2x80x9cRandomized algorithms for optimizing large join queriesxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 312-321, Atlantic City, N.J., May 1990. Moreover, join enumeration is notorious for its space utilization (see, e.g., Kabra, N. and DeWitt, D. J., xe2x80x9cOPT++: An object-oriented implementation for extensible database query optimizationxe2x80x9d, The VLDB Journal, 8(1): 55-78, May 1999; Pellenkoft, A., xe2x80x9cProbabilistic and Transformation-based Query Optimizationxe2x80x9d, PhD Thesis, Wiskunde en Informatica, CWI, Amsterdam, The Netherlands, November, 1997; Scheufele, W. and Moerkotte, G., xe2x80x9cEfficient dynamic programming algorithms for ordering expensive joins and selectionsxe2x80x9d, Advances in Database Technologyxe2x80x94EDBT 1998, Proceedings of the 6th International Conference on Extending Database Technology, pp. 201-215, Springer-Verlag, Valencia, Spain, March 1998), particularly because dynamic programming approaches require the memorization of partial access plans.
A recent survey by Steinbrunn, Moerkotte, and Kemper classifies join enumeration algorithms into four categories: randomized algorithms, genetic algorithms, deterministic algorithms, and hybrid algorithms. Randomized algorithms view solutions as points in a solution space; they randomly xe2x80x9cwalkxe2x80x9d through this solution space from one point to another using a pre-defined set of moves. Two well-known examples of randomized approaches include iterative improvement (see, e.g., Ioannidis, Y. E. and Kang, Y. C. above; Swami, A., xe2x80x9cOptimization of large join queries: Combining heuristics and combinatorial techniquesxe2x80x9d, ACM SIGMOD International Conference on Management of Data, Portland, Oreg., June 1989; Swami, A. and Gupta, A., xe2x80x9cOptimization of large join queriesxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 8-17, Chicago, Ill., June 1988) and simulated annealing. See, e.g., Ioannidis, Y. E. and Wong, E., xe2x80x9cQuery optimization by simulated annealingxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 9-22, San Francisco, Calif., May 1987; Swami, A. and Gupta, A. above. Galindo-Legaria, Pellenkoft, and Kersten (see, e.g., Galindo-Legaria, C., et. al., xe2x80x9cRandomized join-order selection: Why use transformations?xe2x80x9d, Proceedings of the 20th International Conference on Very Large Data Bases, pp. 85-95, Santiago, Chile, September 1994, Morgan-Kaufmann; Galindo-Legaria, C. A., et. al., xe2x80x9cUniformly-distributed random generation of join ordersxe2x80x9d, Proceedings of the Fifth International Conference on Database Theoryxe2x80x94ICDT 1995, pp. 280-293, Prague, Czech Republic, January 1995, Springer-Verlag) have recently proposed a hybrid approach that randomly xe2x80x9cprobesxe2x80x9d the space of all valid join strategies in an attempt to quickly find a xe2x80x9creasonablexe2x80x9d plan, whose cost can then be used to limit a deterministic search of the entire strategy space. Purely randomized algorithms offer at least one advantage over deterministic ones: memorization of any strategy (or portion thereof) is unnecessary, so the memory requirement is fixed.
Several deterministic join enumeration algorithms have appeared in the literature. INGRES uses a dynamic optimization algorithm (see, e.g., Kooi, R. P., xe2x80x9cThe Optimization of Queries in Relational Databasesxe2x80x9d, PhD Thesis, Case Western Reserve University, Cleveland, Ohio, September 1980; Wong, E. and Youssefi, K., xe2x80x9cDecompositionxe2x80x94A strategy for query processingxe2x80x9d, ACM Transactions on Database Systems, 1(3): 223-241, September 1976) that recursively breaks up a calculus (QUEL) query into smaller pieces by decomposing queries over multiple relations into a sequence of queries having one relation (tuple variable) in common, using as a basis the estimated cardinality of each. Each single-relation query is optimized by assessing the access paths and statistical information for that relation in isolation. Ibaraki and Kameda (see, e.g., Ibaraki, T. and Kameda, T. above) showed that it is possible to compute the optimal join strategy in polynomial time, given certain restrictions on the query graph and properties of the cost model. Krishnamurthy et al. (see, e.g., Krishnamurthy, R., et. al., xe2x80x9cOptimization of nonrecursive queriesxe2x80x9d, Proceedings of the 12th International Conference on Very Large Data Bases, pp. 128-137, Kyoto, Japan, August 1986, Morgan Kaufmann) proposed a polynomial-time algorithm that provides an optimal solution, though it can handle only a simplified cost model and is restricted to nested-loop joins. Swami and Iyer (see, e.g., Swami, A. and Iyer, B., xe2x80x9cA polynomial time algorithm for optimizing join queriesxe2x80x9d, Proceedings, Ninth IEEE International Conference on Data Engineering, pp. 345-354, IEEE Computer Society Press, April 1993) subsequently extended their work in an attempt to remove some of its restrictions, and to also consider access plans containing sort-merge joins. Cluet and Moerkotte (see, e.g., Cluet, S. and Moerkotte, G. above) recently showed that generating an optimal left-deep processing tree possibly containing Cartesian products for xe2x80x9cstarxe2x80x9d queries is also NP-complete.
An NP-complete problem is any one of a class of computational problems for which no efficient solution algorithm has been found. Many significant computer-science problems belong to this class, including graph covering problems. In contrast, so-called easy, or tractable, problems can be solved by computer algorithms that run in polynomial time; i.e., for a problem of size n, the time or number of steps needed to find the solution is a polynomial function of n. Algorithms for solving hard, or intractable, problems, on the other hand, require times that are exponential functions of the problem size n. Polynomial-time algorithms are considered to be efficient, while exponential-time algorithms are considered inefficient, because the execution times of the latter grow much more rapidly as the problem size increases.
A problem is called NP (nondeterministic polynomial) if its solution (if one exists) can be guessed and verified in polynomial time; nondeterministic means that no particular rule is followed to make the guess. If a problem is NP and all other NP problems are polynomial-time reducible to it, the problem is NP-complete. Thus, finding an efficient algorithm for any NP-complete problem implies that an efficient algorithm can be found for all such problems, since any problem belonging to this class can be recast into any other member of the class. It is not known whether any polynomial-time algorithms will ever be found for NP-complete problems, and determining whether these problems are tractable or intractable remains one of the most important questions in theoretical computer science. When an NP-complete problem must be solved, one approach is to use a polynomial algorithm to approximate the solution; the answer thus obtained will not necessarily be optimal but will be reasonably close.
The best example of a deterministic algorithm is dynamic programming, the xe2x80x9cclassicalxe2x80x9d join enumeration algorithm used by SYSTEM R. See, e.g., Selinger, P. G., et. al. above. It performs static query optimization by searching the solution space using a modified dynamic programming approach. See, e.g., Ono, K. and Lohman, G. M. above; Selinger, P. G., et. al. above. The optimizer assigns a cost to every candidate access plan, and retains the one with the lowest cost. In addition, the algorithm keeps track of the xe2x80x9csorted-nessxe2x80x9d of each intermediate result, termed interesting orders (see, e.g., Selinger, P. G., et. al. above; Simmen, D., et. al., xe2x80x9cFundamental techniques for order optimizationxe2x80x9d, ACM SIGMOD International Conference on Management of Data, pp. 57-67, Montreal, Quebec, June 1996, Association of Computing Machinery), which can lead to the discovery of less expensive strategies by avoiding (usually expensive) sorts on intermediate results.
All told, a major challenge in database systems designed to run on small computing devices is to significantly reduce the amount of memory required for join enumeration. The present invention provides a solution to this challenge.
The present invention provides a deterministic join enumeration methodology for left-deep processing trees, currently implemented in Sybase SQL Anywhere, a small-footprint relational database system whose target market ranges from workgroup servers to small hand-held devices. By providing a deterministic branch-and-bound join enumeration method for left-deep processing trees, the invention is able to efficiently optimize complex queries with high join degree by employing a novel approach to cost-based pruning of the search space.
Plan generation involves the generation of alternative xe2x80x9cjoin strategiesxe2x80x9d for each subquery block. A xe2x80x9csubqueryxe2x80x9d may be thought of as a xe2x80x9cquery within a queryxe2x80x9d (i.e., a nested query). Unlike IBM""s Starburst optimizer, which separates join enumeration from plan generation and cost estimation (see, e.g., Ono, K. and Lohman, G. M. above), the ASA optimizer combines these processes into a single unified methodology. Cost estimation is an integral part of the enumeration method, because it is through comparing the costs of partial access plans that the ASA optimizer can quickly prune significant portions of the join strategy search space.
The plan generation phase optimizes each subquery in the QOG independently, starting with the leaves. For each subquery, plan generation involves the following four distinct steps:
1. adjust predicate selectivities to account for disjuncts, Between predicates, and user estimates of selectivities;
2. construct a join graph for the query that models inner and outer equijoin predicates, sargable single-variable predicates on single quantifiers, and Cartesian products;
3. enumerate join strategies and prune the search space using a branch-and-bound heuristic;
4. recall the cheapest strategy and construct the detailed access plan for that strategy.
Empirical performance results on several production queries obtained from SQL Anywhere customers show that this approach requires significantly less memory than other deterministic join enumeration approaches, which have been described in the literature.