The present invention relates generally to database systems and, more particularly, to methods for building an efficient query plan for vertical schema.
The usage of relational databases traditionally requires careful analysis of the data to be stored in order to define and create a set of tables that can efficiently represent elements and their relations to the data. For example, a red T-shirt with collar size 15 and a half and made of cotton is a typical description of a shirt, and this information needs to be captured. When data about the T-shirt is mapped into relational schema, it is well known in the art to create a table for T-shirt data with columns for a stock-keeping unit (SKU) to uniquely identify this product, a column for color, a column for collar size, and a column for type of material. For the example T-shirt, a row is inserted into the table with red in the color column, the value 15 and a half in the collar size column, and cotton in the material column. The traditional schema layout is referred to as horizontal schema. The above steps require the knowledge of T-shirt data to be available at the schema layout stage to name tables and columns.
In an electronic marketplace, the products that are traded and sold can vary from day to day, and there may be no prior knowledge about certain products to follow the above steps to define and create horizontal schema. It is therefore popular among software vendors of electronic marketplaces to choose a schema-neutral format for their catalogs. The layout of these electronic catalogs are flexible to accommodate all kinds of products without the need to create separate tables for each product. This layout can be best described as xe2x80x98name-value pairsxe2x80x99 or vertical schema. In the vertical schema, conceptually, there are three columns: the first column to uniquely identify the product, the second column to record the name of the attribute, and the third column to record the value of the attribute. To store the T-shirt data in the vertical schema, three rows are used. For all three rows, the first column is the same to store the SKU of the T-shirt. The second column stores attribute names, such as color, collar size, and material. The third column stores attribute values matched to the second column, such as red (matched to color), 15 and a half (matched to collar size), and cotton (matched to material).
The flexibility of the vertical schema can be seen by using the same table to store a totally different product such as transistors. Transistors may have a manufacturer, current, and voltage as their attributes. Again, three rows can be inserted with the SKU of the transistor in the first column. One row can store the attribute xe2x80x98Manufacturerxe2x80x99 and its value xe2x80x98IBMxe2x80x99; another row can store the attribute xe2x80x98Currentxe2x80x99 and its value xe2x80x986.4xe2x80x99; and the third row can store the attribute xe2x80x98Voltagexe2x80x99 and its value xe2x80x983.3xe2x80x99. There is no need to create a second table to manage transistor data. In an electronic marketplace where a variety of products are traded daily, such high flexibility is desirable to avoid the burden of creating and managing individual product tables.
Certain aspects of the vertical schema were discussed in the paper by Agrawal et al. (R. Agrawal, A. Somani, and Y. Xu, xe2x80x9cStorage and Querying of E-Commerce Data,xe2x80x9d Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, pp. 149-158, 2001). This paper did not address the query performance problem. It discussed the problem of xe2x80x98hidingxe2x80x99 the vertical schema from database users, who are more familiar with the traditional horizontal schema. The contribution of the paper is to propose an automated translator from queries written for the horizontal schema to queries written for the vertical schema. The present invention addresses the query processing issue of queries written for the vertical schema. It is adequate to say that the present invention and the above-mentioned paper by Agrawal et al. are targeted at different problems in the art.
A significant disadvantage of the vertical schema is its slow query performance. Thus, it would be desirable and highly advantageous to have methods for building efficient query plans for vertical schema.
The present invention offers methods to address the query processing issue by first observing two critical problems of optimizing query plans on the vertical schema. The first problem is inaccurate selectivity estimation; the second problem is inaccurate query cost and query plan selection based on inaccurate selectivity numbers. The novel method presented in the invention uses an external histogram and an external query planner to remedy these two problems.
According to a first aspect of the invention, there is provided a method for efficiently building a query plan for vertical schema. The method includes the step of receiving query information including at least one constraint. An external histogram is retrieved having information for name-value pairs of the vertical schema. Then, selectivity estimates are determined for at least one of the constrains using the external histogram. Finally, the query plan is generated using at least some of the selectivity estimates.
According to a second aspect of the invention, the method further includes the step of checking whether any of the constraints is a high selectivity constraint using the selectivity estimates. According to a third aspect of the invention, if there is a high selectivity constraint, it is used to drive a join operation. According to a fourth aspect of the invention, if there is no high selectivity constraint, separate queries are issued. According to a fifth aspect of the invention, the separate queries are joined in memory.
According to a sixth aspect of the invention, the vertical schema model information for an electronic catalog. According to a seventh aspect of the invention, each of the name-value pairs is associated with an attribute of a product and its value.
According to an eighth aspect of the invention, the method further includes the step of updating the external histogram with actual result size information. According to a ninth aspect of the invention, the actual result size information relates to at least one user query. According to a tenth aspect of the invention, the actual result size information relates to at least one query not issued by a user.
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.