1. Field of the Invention
The present invention relates to a computer-readable storage medium storing a program for sorting a large number of database records into a plurality of groups. More particularly, the present invention relates to a computer-readable storage medium storing a dataset sorting program for sorting records in a dataset in a manner suitable for counting and summarizing data records. The present invention also relates to a dataset sorting device and a dataset sorting method providing the features stated above.
2. Description of the Related Art
Database systems are widely used in the operations of businesses. A database system maintains various pieces of data, and the user summarizes data values accumulated in a database, when necessary. Here, the summarization of data values refers to a series of computational operations to obtain the number of occurrences (or frequency) of each different item value and compile the results into a summary report. The summarization is one of the most important fundamental processes for analysis and utilization of data.
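By way of illustration only, the summarization described above may be sketched as follows. The record layout, the field names, and the function name `summarize` are hypothetical and do not appear in any cited reference; the sketch merely counts the occurrences of each different value of one designated item.

```python
from collections import Counter

# Hypothetical records; the field names are illustrative assumptions.
records = [
    {"region": "east", "product": "A"},
    {"region": "west", "product": "B"},
    {"region": "east", "product": "B"},
]

def summarize(records, item):
    """Count how often each distinct value of `item` occurs in the records."""
    return Counter(record[item] for record in records)

# Frequency of each region value, ready to be compiled into a summary report.
summary = summarize(records, "region")
```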
One of the most desired features for such summarization processes is a capability of handling large-scale data at high speeds. As the use of computers has become widespread and network systems have been widely deployed in recent years, large amounts of data have come to be accumulated in distributed locations. Accordingly, the volume of data that needs to be summarized continues to increase, and therefore the development of high-speed summarization techniques capable of handling a large amount of data is becoming more and more important.
Also, in recent years, XML (Extensible Markup Language) data has rapidly come into widespread use. XML offers higher flexibility in changing data itself and modifying data definitions and data structures. This trend has led to a demand for data summarizing techniques that can be applied not only to structured data such as that handled by conventional RDB (Relational Database) systems or the like, but also to the newly emerged XML-formatted data.
To perform high-speed processing on large-scale data, a dataset is divided into a plurality of data groups, and a summarizing process is performed on those divided data groups. Conventional techniques to divide datasets include a method based on storage order, in which records are equally divided in the order in which they are stored. According to this method, ten million records are divided into, for example, ten destination groups as they are stored, with one million records in each group.
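The storage order-based division described above may be sketched, purely as an illustration, as follows. The function name `divide_by_storage_order` is a hypothetical name chosen for this sketch, and the handling of a remainder (giving the first groups one extra record each) is an assumption not stated in the description.

```python
def divide_by_storage_order(records, num_groups):
    """Split records into num_groups contiguous chunks of near-equal size,
    preserving the order in which the records are stored."""
    size, extra = divmod(len(records), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # Assumption: the first `extra` groups absorb one leftover record each.
        end = start + size + (1 if g < extra else 0)
        groups.append(records[start:end])
        start = end
    return groups

# Ten stored records divided into three destination groups.
groups = divide_by_storage_order(list(range(10)), 3)
```

Note that the division looks only at record positions, so records sharing the same key item value can fall into different groups.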
To summarize large-scale data divided by using the conventional sorting method based on storage order, some additional tasks for merging datasets are required. One example technique, disclosed in Japanese Patent No. 2959497, first divides given data by using a storage order-based method. After that, it sorts and summarizes records in each divided dataset while merging the divided data subsets. Such a storage order-based sorting method has an advantage in that it can be applied easily to any kind of data as long as the data has some definite structure and can be divided into groups with an equal number of members.
However, the conventional storage order-based method has a problem in that it does not take into consideration the distribution of data with respect to designated items, and therefore, resource-consuming merging tasks are subsequently required to summarize the data. For example, the technique disclosed in the above Japanese Patent 2959497, paragraph 0019, requires datasets to be sorted individually according to a specified key data item beforehand, and also requires sorting of data taken out of each dataset when merging the datasets.
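The cost of the merging tasks mentioned above can be seen in a generic sort-and-merge sketch such as the following. This is not the procedure of Japanese Patent No. 2959497; it is only an assumed, simplified illustration in which each divided group is sorted by the key item and the sorted groups are then merged before equal keys are counted. The function name `summarize_by_sort_merge` is hypothetical.

```python
import heapq
from itertools import groupby

def summarize_by_sort_merge(groups, key):
    """Sort each group by the key item, merge the sorted groups,
    and count runs of equal keys in the merged stream."""
    # Step 1: every divided group must be sorted individually beforehand.
    sorted_groups = [sorted(group, key=key) for group in groups]
    # Step 2: a k-way merge re-orders records taken out of each group.
    merged = heapq.merge(*sorted_groups, key=key)
    # Step 3: equal keys are now adjacent, so each run yields one count.
    return {k: sum(1 for _ in run) for k, run in groupby(merged, key=key)}

# The same key values occur in both groups because the division ignored them.
groups = [["b", "a", "c"], ["a", "b", "a"]]
counts = summarize_by_sort_merge(groups, key=lambda value: value)
```

Because the division ignores the distribution of key values, both the per-group sorting and the merge are needed before any count is final.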
A technique for summarizing bloated XML data is disclosed in Japanese Patent Application Publication No. 2002-108844. According to the technique disclosed in this patent application, the number of items (tags) and the number of occurrences of each particular item value are counted and the resulting count values are supplied to a user in list form. From that list, the user selects key items for sorting. A destination group is associated with every set of such selected item values, and the records in XML data are sorted according to those associations.
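As a rough illustration of the kind of processing described above, and not a reproduction of the cited publication, the following sketch counts items (tags) and item values in a small XML document and then sorts records into destination groups by one selected key item. The XML layout, the tag names, and the choice of `region` as the key item are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML data; the element names are illustrative assumptions.
xml_text = """<records>
  <record><region>east</region><product>A</product></record>
  <record><region>west</region><product>A</product></record>
  <record><region>east</region><product>B</product></record>
</records>"""

root = ET.fromstring(xml_text)

# Count each item (tag) and each (item, value) pair independently.
tag_counts = Counter()
value_counts = Counter()
for record in root:
    for field in record:
        tag_counts[field.tag] += 1
        value_counts[(field.tag, field.text)] += 1

# Assume the user, after viewing the counts, selects "region" as the key item.
# Each distinct value of the key item is associated with one destination group.
destination_groups = {}
for record in root:
    destination_groups.setdefault(record.findtext("region"), []).append(record)
```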
The method of the above publication No. 2002-108844 counts the items (tags) and the occurrences of each item value independently for each individual item. This makes it difficult to take into consideration the data distribution of values of each item when sorting dataset records. Also, the disclosed technique incurs high sorting costs because the data distribution changes as data is added, updated, and deleted.
Yet another technique for summarizing XML data is disclosed in Japanese Patent Application Publication No. 2004-358947. According to this technique, information about items required for summarization is extracted out of one or more files of data records described in the XML format. Records are summarized by constructing a trie based on that extracted information.
The same publication No. 2004-358947 also discloses a data summarizing device that summarizes data records through a single-pass scan, thus achieving very high summarizing speeds. However, this conventional technique requires a large main storage capacity, since the trie data structure it constructs grows in proportion to the number of different key item values. This means that the conventional device has still another problem in that, when the device is applied to summarization of large-scale data, the performance of the device decreases due to the lack of sufficient memory space for computation.
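The memory behavior noted above can be illustrated with a minimal trie-based counter. This sketch is an assumption-laden simplification and is not the device of publication No. 2004-358947: it uses nested dictionaries as trie nodes, a reserved `"#count"` entry for the occurrence count, and hypothetical function names.

```python
COUNT = "#count"  # Reserved entry; cannot collide with single-character edges.

def trie_insert(root, key):
    """Walk or create one node per character, then bump the count at the end."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[COUNT] = node.get(COUNT, 0) + 1

def trie_count(root, key):
    """Return the number of times `key` was inserted (0 if never)."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return 0
    return node.get(COUNT, 0)

def trie_nodes(node):
    """Count nodes, as a rough proxy for the memory the trie occupies."""
    return 1 + sum(trie_nodes(child) for edge, child in node.items() if edge != COUNT)

# A single pass over the key item values builds the whole summary.
root = {}
for value in ["tokyo", "tokyo", "toyama"]:
    trie_insert(root, value)
```

Repeated values add no nodes, whereas each new distinct value adds nodes for its unshared suffix, so the node count grows with the number of different key item values.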