The present invention relates to a grouping method for discriminating, as a group, plural records of database in which values of particular columns are identical and a duplicate removal method for deleting the overlapping records in a database.
In a dataware house system, a user or an application server issues a query to a database. Particularly, in these years, query has often been issued to a database from application servers such as multiple dimensional analysis server, data mining server and ERP (Enterprise Resource Planning) (business integration package). A database is generally composed of plural records and each record includes the data for a plurality of items. For example, a parts database including information about parts of a certain product is composed of records indicating the data of a certain part and each record includes the data of a plurality of items determined for such part. For instance, such data include size of part, strength, price and supplier, etc. These items are identical not depending on the part, therefore, on the record. Therefore, the database having such structure can generally be considered as the table format data. Each record corresponds to one part and each column corresponds to an item determined for the part. Such table format database will be discussed below. In this case, a database is called as a table or a table data and a certain record as a record data or only as a record in some cases. Aggregation of data corresponding each other in plural records regarding the same item is called as column data or only as a column. In the data warehouse system, a plurality of record data satisfying the conditions indicated by the query from users are retrieved from the database. The process designated by the query regarding a plurality of retrieved record data is further executed and the result of this retrieval is then notified to users. Aggregating process and grouping process are particularly often executed recently among the query of database. The aggregating process is the numerical value computing process to be executed for query result data by obtaining the maximum value, minimum value or average value of particular column data designated by query or obtaining the number of records attained as a result of query in regard to a plurality of record data attained as a result of query. The grouping process for sorting, as one group, plural records in which data of a plurality of pairs of columns (grouping columns) designated by query among plural records attained as a result of query are identical. In the grouping process, aggregating process is conducted for each group. The grouping process is used, for example, for the processes to obtain the average value of sales amount in every factor of age, region and sex. For instance, result of the grouping is composed of a set of the grouping column value and result of aggregation process.
As the grouping method of the related art disclosed in the system based on the merge and sort disclosed in xe2x80x9cParallel Sorting Algorithms,xe2x80x9d by Selim G. Akl, Academic Press, Inc. (USA), p. 48 and 49 and the hash system disclosed in xe2x80x9cRelational Database Managementxe2x80x9d, by M. Papazoglou, W. Valder, PRENTICE HALL (USA), p. 262 to 264 can be listed.
FIG. 4 illustrates a flow chart of the grouping by the merge sort system. In this merge sort system, all records are classified so that plural records in which the value of the particular column of plural records forming the database is identical belong to the same group and the records in which value of the column is different from that of the other records belong to the other group. Therefore, in the merge sort system, all records are sorted in regard to the particular column before the grouping. Namely, plural records are first read sequentially from the secondary memory.
Here, the secondary memory is assumed to be structured to perform data reading therefrom or data writing thereto in unit of a constant size called a block. One block generally includes plural records. Therefore, one block including plural records is read into the main memory (40). A plural records in the block read out are sorted for the data of any column using a working area on the main memory (41) and thereafter such record is written back to the secondary memory as one block (42). This process is repeated for a plurality of blocks until all records are read (43). After plural records sorted for each block (namely, record list) are formed in the secondary memory as explained above, merge sort is performed between these blocks using the working area in the secondary memory and a list of the sorted records belonging to all blocks is obtained in the secondary memory (44). After the process explained above, while the data belonging to the particular column of all records of the sorted record list is sequentially read from the secondary memory (45, 47), the grouping process is performed for the relevant column for these records (46). Namely, plural records having the identical value of the column belong to the same grouping and all records having the column values different from that of the other records are classified so that these records belong to the other groups.
In this merge sort system, since the merge sort of all records are performed in the step 44 using the working area in the secondary memory, a large amount of I/O operations are generated for the secondary memory and therefore a longer time is required for the merge sort process. In addition, the merge sort process (step 44) and grouping process (step 46) cannot be executed in parallel different from the process to be conducted in the pipeline and therefore access must be done again to the secondary memory for the purpose of grouping process (step 46) after the access to the secondary memory for the merge sort process (step 44). Therefore, the processing time which is identical to the total time required for the merge sort process and grouping process is required in this system and the total processing time becomes longer. As a method of solving this problem, a hash system has been introduced.
FIG. 5 illustrates a flow chart of the grouping process using the hash system of the prior art. In the hash system, when one block stored in the secondary memory is read on the main memory (50), the grouping process is performed using the hash functions by making use of the working area on the main memory for each record among plural records within the read out block (51). The aggregation result of each group held in the aggregating area in the main memory is updated by executing the parallel aggregating processes (52).
Namely, each time when the record is newly read into the working area on the main memory, the hash value is obtained for the data belonging to the particular columns among the data in such record and this record is then registered to the hash table prepared in the working area on the main memory. The hash table designates the pointer area and memory area of each record explained below and aggregation of the collecting area and pointer area provided in the successive position of the memory area of each record. Namely, in the hash table, a plurality of pointer areas holding a plurality of pointers corresponding to different hash values is prepared and when any record is newly read to the main memory, the hash value is calculated using the hash function for the data belonging to the particular record used for the grouping process among the data of such record. If the record having the such hash value is not yet detected, the pointer indicating the storing position in the main memory area of such record is stored in the pointer area corresponding to this hash value in this pointer area. As explained above, such record is registered to the group having such hash value. Moreover, such record is assumed as only a member of the group having such hash value and the aggregating process is conducted for such group using the data in such record under the above assumption and thereafter the aggregating process result data is stored to the aggregating area next to the storing area of such record. The subsequent area of this aggregating area is reserved as the area for storing the pointer which points out the other records belonging to the same group.
When the other record having the hash value for the record newly read into the main memory explained above is already registered to the above hash table, an address of the storing area of this new record is stored to the successive pointer storing area of the record which is already registered. As explained above, a new record is registered to the group having such hash value. Moreover, new aggregating area and pointer area are reserved to the subsequent storing area of the storing area of this new record. The aggregating process result is calculated for the group including this new record based on the aggregating result data already stored for the record which is already registered and the data in this new record and it is then written to the subsequent aggregating result area of the storing area of this new record in the new aggregating area. The process explained above is repeated for all records (55). However, when the number of records read into the main memory increases, the area of above hash table becomes too large to hold the next new record into the main memory. In this case, it is determined whether the next record can be stored to the hash table area of the main memory or not (53). When determination result is NO, the other records registered to the hash table is ruled out to the secondary memory (54) and the next new record process is started.
These grouping process (51) and aggregating process (52) can be executed in parallel as can be executed for the pipeline. In other words, the grouping process is executed for the read newly read into the main memory and thereafter the aggregating process is sequentially executed for the groups to which such record belongs. On the occasion of executing this aggregating process, the grouping process is also executed to the further successive records newly read into the main memory in parallel. Accordingly, the grouping process is conducted without generation of a temporary file for the grouping. However, data transfer is generated at random between the main memory and external memory in the step 54. Therefore, here rises a problem of the time required for I/O access to the external memory.
The grouping process introducing the hash grouping system is based on the remarkable reduction in number of records of the data through the grouping process. Therefore, when an average value (it is called as an average duplicate degree of group) between the groups of the group duplicate degree (number of records belonging to one group) is almost equal to 1, this method is not so much effective.
However, in the actual multiple-dimensional analysis, when the number of axes of analysis increases, many groups have only one record belonging to each group. In this case, there rises a problem that the hash table cannot be developed on the main memory in the hash system, many I/O operations are generated to the external memory and the time required for the grouping process increases. In addition, since the effect of reduction in number of records by the pre- grouping process cannot be obtained even in the multiple- stage converting system, there rises a problem also that the processing cost increases due to the overhead disassembled into multiple-stage.
In addition, a duplicate degree of record becomes a problem here even in regard to the duplicate removal process which is the basic process of the other query process. Here, the number of records having the identical record value is called as a duplicate degree of such record. When an average value of duplicate degree of each record (hereinafter, it is called as an average record duplicate degree) is almost equal to one (1), the duplicate record does not almost exist. In this case, it has been a problem that efficient process cannot be attained in the existing duplicate removal process and the system depending on the sort merge process must inevitably be introduced in this case.
It is therefore an object of the present invention to provide a grouping method for realizing highly efficient grouping process in such a case that the average group duplicate degree is almost identical to one (1).
It is another object of the present invention to realize a duplicate removal method for realizing highly efficient duplicate removal process in such a case that the average group duplicate degree is almost identical to one (1).
In view of attaining the objects explained above, the grouping method of the present invention determines whether there is possibility or not for existence of at least another record having the identical value to that of at least a part of one or more predetermined columns of the records for each of plural records of the record list included in the database, also determines plural records of a part determined there is no possibility for existence of another record as the records belonging to one group respectively by above determination process, and executes the second process to classify a plurality of other groups other than a part of records explained above among plural records of the record list explained above so that plural records having the identical column value for the grouping belong to the same group.
In more practical, the step for determining possibility includes the steps for generating a hash value for each of plural records of record list included in the database with the hash function using as an argument the value of at least a part of one or more predetermined columns among those for the grouping of such records, determining whether at least another record having the hash value identical to that of the hash value generated for each record exists or not from a plurality of hash values generated for plural records, and also determining a part of plurality of records determined respectively that there is no other records having the identical hash values by such determination process as the records having no possibility that the other records having the identical value of at least a part of the columns among those for the grouping do not exist. This method is effective when the average grouping duplicate degree is almost identical one (1).
The duplicate removal method of the present invention can apply the process same as the grouping method explained above to the duplicate removal process.
Moreover, the present invention provides also the grouping process method in which bucket division is repeated and a plurality of hash functions are sequentially applied through the change-over operation for each repetition of bucket division.