1. Field of the Invention
The present invention generally relates to a query processing method for use in a relational database and, more particularly, to a data random sampling processing method for efficient execution of queries including random sampling processing for a large-scale database.
2. Description of Related Art
Recently, as corporate in-house information processing systems have become widespread, various kinds of business data, such as transaction information and customer information, have come to be stored in databases. As a result, the range of information made available through these databases is rapidly expanding.
At the same time, demand is increasing for data mining, which is intended to expand business opportunities and enhance business efficiency by analyzing the large-scale data stored in databases to extract features and regularities from the data.
Generally, data mining requires issuing queries with different combinations of data items and condition settings and repeating data analysis in order to extract the features and regularities of large-scale data.
However, as the size of the data stored in a database increases, the time required to process a single query increases, making it more difficult to extract data features and regularities efficiently. As a technology for enhancing the response of query processing, the data cube approach is known, as disclosed in "ACM SIGMOD International Conference on Management of Data (SIGMOD '96)," ACM Press, pp. 205-216.
In the data cube approach, some of the anticipated queries are processed before they are actually accepted. If an already processed query is issued to the database, the query is not actually processed; only the prepared result is returned. This approach, however, has drawbacks: it requires a large storage area for holding the prepared query results, and it incurs a long processing time for queries for which no results have been prepared, because the range of queries that can be handled by performing the necessary processing beforehand is restricted.
On the other hand, feature quantity calculation and regularity extraction from large-scale data in many cases require only the trends and features of the data, not an exact query result. Therefore, in order to significantly shorten query processing time, a method is used in which random sampling is introduced into query processing and the feature quantities and regularities are estimated from the randomly sampled data, thereby reducing the quantity of data to be processed and shortening the response time.
For the execution of a query including random sampling, it is important not only to simply reduce the data processing quantity by random sampling but also to convert the query, immediately before execution, into an equivalent query of higher execution efficiency without changing the query processing result, thereby significantly shortening the execution time. In other words, random sampling is applied as early as possible in query processing to reduce the quantity of data to be processed subsequently, thereby reducing the processing quantity of the entire query.
Generally, a logical structure of data to be processed for a query has the form of a table 20 shown in FIG. 2. The table 20 has records 21 in the lateral direction and columns 22 in the vertical direction. The same columns of these records hold data of the same form. A set of records obtained after database processing is performed on a table, which is itself a set of records, again forms a table.
The database processing performed on a table includes condition evaluation processing, projection processing, join processing, and grouping and aggregation processing.
The following describes the above-mentioned four types of processing.
The condition evaluation processing in a database processing system denotes that one or more condition evaluation columns, and conditions on the values of these columns, are designated. Of the records included in the table to be processed, those records are extracted which satisfy the condition designated for the designated column, and the table is reconstituted from these extracted records.
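As an illustrative sketch only, and not part of the disclosed method, the condition evaluation processing above may be modeled in Python with a table represented as a list of record dictionaries; the function name `condition_evaluation` and the sample data are assumptions for illustration:

```python
# Condition evaluation (selection): records satisfying the condition
# designated for a designated column are extracted, and the result
# is again a table (a list of records).
def condition_evaluation(table, column, predicate):
    return [record for record in table if predicate(record[column])]

sales = [
    {"customer": "A", "amount": 120},
    {"customer": "B", "amount": 40},
    {"customer": "A", "amount": 75},
]
# Extract the records whose "amount" column exceeds 50.
filtered = condition_evaluation(sales, "amount", lambda v: v > 50)
```

The result `filtered` is itself a table of records, so further database processing can be applied to it in turn.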
The projection processing in a database processing system denotes that one or more projection columns are designated, only the designated columns are extracted from the records included in the table, and the table is reconstituted from these extracted records.
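Continuing the same illustrative sketch (again, an assumption for exposition rather than the claimed method), projection keeps only the designated columns of each record:

```python
# Projection: only the designated columns are extracted from each
# record, and the result is again a table (a list of records).
def projection(table, columns):
    return [{c: record[c] for c in columns} for record in table]

sales = [
    {"customer": "A", "amount": 120, "item": "pen"},
    {"customer": "B", "amount": 40, "item": "ink"},
]
# Keep only the "customer" and "item" columns.
projected = projection(sales, ["customer", "item"])
```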
The join processing in a database processing system denotes that one or more columns commonly included in the two tables to be processed are designated, all records included in one table are joined with all records of the other table having the same value in the join column, and the table is reconstituted from the records newly generated by the join operation.
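The join processing above may be sketched, under the same illustrative representation of a table as a list of dictionaries, as pairing every record of one table with every record of the other that agrees on the join column (a naive nested-loop join, chosen here only for clarity):

```python
# Join: each record of the left table is combined with every record
# of the right table that has the same value in the join column;
# the newly generated records form the result table.
def join(left, right, column):
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[column] == r[column]
    ]

customers = [{"customer": "A", "region": "east"}]
orders = [
    {"customer": "A", "amount": 120},
    {"customer": "B", "amount": 40},
]
joined = join(customers, orders, "customer")
```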
The grouping and aggregation processing in a database processing system denotes that one or more grouping columns and one or more columns to be aggregated are designated, the records included in the table to be processed which have the same values in the designated grouping columns are classified into one group, a statistical quantity such as the sum or average of the values of a column to be aggregated is calculated for each group, and the result of this calculation is outputted as one record.
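As a final sketch of the four basic operations (illustrative only; the function and column names are assumptions), grouping and aggregation classifies records by the grouping column and emits one record per group:

```python
from collections import defaultdict

# Grouping and aggregation: records sharing the same value of the
# grouping column form one group; one record per group is output,
# here carrying the sum of the aggregated column.
def group_and_aggregate(table, group_column, agg_column):
    groups = defaultdict(list)
    for record in table:
        groups[record[group_column]].append(record[agg_column])
    return [{group_column: key, "sum": sum(values)}
            for key, values in groups.items()]

sales = [
    {"customer": "A", "amount": 120},
    {"customer": "B", "amount": 40},
    {"customer": "A", "amount": 75},
]
summary = group_and_aggregate(sales, "customer", "amount")
```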
The present invention introduces random sampling defined as follows into a database. The following describes a query converting method including the random sampling.
The random sampling in a database processing system denotes that records are randomly sampled from a table, which is a set of records, and the table is reconstituted from the sampled records. In the random sampling, the set of records taken out by one sampling operation is referred to as a sampling unit. It is ensured that, within one sampling process, the sampling probabilities of the sampling units are equal to each other.
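The definition above can be sketched as follows, assuming for illustration that each sampling unit is given as a list of records and that every unit is retained with the same probability (the function name and parameters are not part of the disclosure):

```python
import random

# Random sampling over sampling units: every unit (a set of records)
# is kept with equal probability, and the surviving units are
# flattened back into a table of records.
def random_sample(units, probability, seed=0):
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    kept = [unit for unit in units if rng.random() < probability]
    return [record for unit in kept for record in unit]

units = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
everything = random_sample(units, 1.0)  # probability 1 keeps all units
nothing = random_sample(units, 0.0)    # probability 0 keeps none
```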
Thus, a query issued to a database is constituted by applying the above-mentioned various types of processing in an appropriate sequence to the table to be queried. Therefore, in optimizing a query including random sampling, it is important to shorten the processing time by transforming the query within a range that does not lose the random sampling characteristic and by applying random sampling as early as possible in query processing so as to reduce the quantity of data handled in the subsequent processing.
As a conventional method of converting a query including random sampling, the query optimizing method is known as disclosed in "International Conference on Very Large Data Bases (VLDB '86)," Morgan Kaufmann Publishers, pp. 160-169. The disclosed method changes the sequence in which processing is applied, while retaining the random sampling characteristic, in queries including random sampling and such basic database processing as condition evaluation processing, projection processing, and join processing.
However, the conventional method involves a problem in that the query conversion for optimizing random sampling cannot be applied when grouping and aggregation processing is included in a query. Therefore, as a query optimizing method for data mining applications, the conventional method provides only limited availability.
A first problem involved in optimizing random sampling in a query conversion method covering grouping and aggregation processing is that the sampling unit of random sampling is not handled properly. For example, if customer purchase patterns are to be checked by classifying product sale detail information by customer, attempting to analyze the purchase pattern of each customer after performing random sampling simply at the product sale detail level does not provide the complete product purchase history of each customer, so that effective purchase pattern analysis cannot be obtained. This is because, although random sampling for per-customer purchase pattern analysis should be performed with the purchase history of each customer regarded as one sampling unit of the sampling process, the actual sampling is performed ignoring this sampling unit.
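The distinction described in this example can be sketched as follows; as an assumption for illustration, each customer's complete purchase history is treated as one sampling unit, so that a sampled customer retains a complete history rather than a random subset of detail records:

```python
import random
from collections import defaultdict

# Sampling-unit-aware random sampling: detail records are first
# grouped into per-customer purchase histories, and each history
# (one sampling unit) is then kept or dropped as a whole.
def sample_by_customer(detail_records, probability, seed=0):
    histories = defaultdict(list)
    for record in detail_records:
        histories[record["customer"]].append(record)
    rng = random.Random(seed)  # seeded only for a reproducible sketch
    sampled = []
    for customer, history in histories.items():
        if rng.random() < probability:
            sampled.extend(history)  # complete history kept together
    return sampled

details = [
    {"customer": "A", "item": "pen"},
    {"customer": "B", "item": "ink"},
    {"customer": "A", "item": "pad"},
]
all_kept = sample_by_customer(details, 1.0)
```

Sampling individual detail records instead would, as the text notes, leave most customers with an incomplete purchase history and distort the per-customer analysis.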
A second problem is that, if random sampling is applied in record read processing to records stored in a storage device such as a magnetic disk, random access to the storage device occurs because the storage locations of the records to be sampled are scattered at random, thereby increasing the random sampling time.
A third problem is that the conventional method has no feature for designating query processing time, query result precision, and so on at query issue time, so that the user cannot easily issue a query which reads part of the data by random sampling to realize user-defined query result estimation.