1. Technical Field of the Invention
The present invention relates to an information processing method and an information processing apparatus for processing a large amount of data and, more particularly, to an information processing method and an information processing system that adopt an architecture of a parallel computer.
2. Description of Related Art/Background Art
Conventionally, data processing for accumulating a large amount of information and retrieving and tabulating the accumulated information has been performed. The data processing is used in, for example, a well-known computer system in which a CPU, a memory, peripheral equipment interfaces, an auxiliary storage device such as a hard disk, output devices such as a display and a printer, input devices such as a keyboard and a mouse, and a power supply unit are connected via a bus. In particular, the data processing is provided as software operable in a computer system easily available in the market. Various databases for accumulating a large amount of data are known as means for performing data processing such as retrieval and tabulation. In particular, there is a strong demand for processing of data that can be represented in a table format.
Whether a large amount of data can be efficiently retrieved and tabulated depends on the format in which the data is stored. Conventionally, a so-called “row unit” storage technique and a so-called “item unit” storage technique are known as general storage techniques. In the row unit storage technique, sets of item values (for example, sex, age, and occupation) formed for each record number are stored on a disk in order of record number, in the direction of increasing logical address. In the item unit storage technique, on the other hand, the item values of each item are stored on a disk item by item, in order of record number within each item, in the direction of increasing logical address.
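The difference between the two storage techniques can be illustrated with a minimal sketch. The sample records and the function names `row_unit_layout` and `item_unit_layout` are hypothetical and serve only to show the order in which item values would be laid out along increasing logical addresses.

```python
# Hypothetical sample records: (sex, age, occupation) for record numbers 0..2.
records = [
    ("F", 34, "engineer"),
    ("M", 27, "teacher"),
    ("F", 45, "doctor"),
]

def row_unit_layout(rows):
    """Row unit: all item values of record 0, then record 1, and so on."""
    return [value for row in rows for value in row]

def item_unit_layout(rows):
    """Item unit: all values of item 0 in record order, then item 1, and so on."""
    return [value for column in zip(*rows) for value in column]

row_major = row_unit_layout(records)
item_major = item_unit_layout(records)
```

In both layouts every item value of every record is stored directly, which is the property of the data table discussed in the following paragraphs.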
In the case of the conventional techniques, item values for all items of all record numbers are directly stored in a two-dimensional data structure (consisting of one dimension of the record numbers and one dimension of the other item values). Such a data structure will be hereinafter specifically referred to as “data table”. In the case of the conventional techniques, in retrieving and tabulating accumulated data, this data table is accessed to perform the retrieval and the tabulation.
Besides the method of directly storing values of items as item values, there is also known a method of converting values into codes and storing the codes as item values. Even in this case, the method is the same as the method described above in that the codes subjected to the code conversion are stored in the data table as the item values.
In retrieving and tabulating a large amount of data stored using the data structure of the data table type of the conventional techniques, there is a problem in that processing time for retrieval and tabulation lengthens because of an access time for accessing such a data table.
The data table at least has essential disadvantages described below.
(1) The data table increases in size. Moreover, for example, it is difficult to (physically) divide the data table for each item. Practically, it is difficult to expand the data table on a high-speed storage device such as a memory in order to tabulate and retrieve data.
(2) It is impossible to hold the data table in a form in which the respective item values are simultaneously sorted.
(3) An identical value appears in the data table many times.
On the other hand, in order to significantly improve speed of retrieval and tabulation of a large amount of data, the inventor has proposed a method of retrieving, tabulating, and sorting table format data and an apparatus for carrying out the method by providing a data management mechanism that has the function of the conventional data table and in which the problems of the data structure based on the data table are solved (see, for example, Patent Document 1).
The proposed method and apparatus for retrieving and tabulating table format data introduce a new data management mechanism usable in the usual computer system. In principle, this data management mechanism has a value management table and an array of pointers to the value management table.
FIG. 1 is a diagram for explaining the conventional data management mechanism. The figure shows a value management table 110 and an array of pointers 120 to the value management table. For each item of the table format data, the item values belonging to the item are ordered and assigned item value numbers (that is, the item values are represented as integers). The value management table 110 is a table that stores, in order of item value number, the item values (see reference numeral 111) corresponding to the item value numbers and the classification numbers (see reference numeral 112) related to the item values. The array of pointers 120 to the value management table is an array that stores, in order of record number of the table format data, the item value numbers of a certain column (i.e., item) of the table format data, that is, pointers to the value management table 110.
By combining the array of pointers 120 to the value management table and the value management table 110, an item value can be obtained from a given record number in two steps. First, for a predetermined item, the item value number stored in association with the record number is extracted from the array of pointers 120 to the value management table. Then, the item value stored in association with that item value number is extracted from the value management table 110. Therefore, as in the conventional data table, it is possible to refer to all data (item values) using coordinates of record numbers (rows) and items (columns).
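The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the implementation of Patent Document 1: the names `build_information_block` and `item_value` are hypothetical, the classification numbers are omitted, and sorting the distinct item values is an assumption made here for convenience.

```python
def build_information_block(column):
    """Build (value_list, pointer_array) for one item (column).

    value_list holds each distinct item value once, indexed by item value
    number. pointer_array holds, per record number, the item value number.
    """
    value_list = sorted(set(column))  # assumption: values kept in sorted order
    number_of = {value: number for number, value in enumerate(value_list)}
    pointer_array = [number_of[value] for value in column]
    return value_list, pointer_array

def item_value(block, record_number):
    """Record number -> item value number -> item value."""
    value_list, pointer_array = block
    return value_list[pointer_array[record_number]]

# Hypothetical "occupation" item for record numbers 0..4.
occupation = ["teacher", "engineer", "teacher", "doctor", "engineer"]
block = build_information_block(occupation)
```

Here `item_value(block, 3)` follows the pointer for record number 3 to item value number 0 and returns the item value stored there.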
The data management mechanism that includes, in this way, the value management table created for a certain item of the table format data and the array of pointers to the value management table is specifically referred to as an “information block” in the following explanation.
In the conventional data table, all data are integrally managed using coordinates consisting of rows corresponding to records and columns corresponding to items. On the other hand, this information block has a characteristic in that data are completely separated for each column of a table format, that is, for each item. According to this data management mechanism, since a large amount of data are separated for each item, it is possible to capture only data concerning items necessary for retrieval and tabulation onto a high-speed storage device such as a memory. As a result, time for access to the data is reduced. Thus, speed of processing for retrieval and tabulation is increased and, even in the case of data with an extremely large number of items, it is possible to treat the data without deteriorating performance.
In the case of this information block, the item values are stored in the value management table, and the record numbers indicating positions where the values are present are associated with the elements of the array of pointers to the value management table. Thus, it is unnecessary to arrange the item values in the order of the record numbers. Therefore, it is possible to sort the item values in a manner suitable for retrieval and tabulation. This makes it possible to quickly judge whether an item value coinciding with a target value is present in the data. Moreover, since the item values correspond to the item value numbers, even data with a long item value, such as a character string, can be treated as an integer.
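As one way to see why sorted item values allow a quick presence check, the following sketch uses a binary search over the value list. The use of binary search, the function name `find_item_value_number`, and the sample values are assumptions for illustration; the text above does not prescribe a particular search method.

```python
from bisect import bisect_left

def find_item_value_number(value_list, target):
    """Return the item value number of target in a sorted value list,
    or None if no item value coincides with target."""
    position = bisect_left(value_list, target)
    if position < len(value_list) and value_list[position] == target:
        return position
    return None

# Hypothetical sorted value list for an "occupation" item.
occupations = ["doctor", "engineer", "teacher"]
found = find_item_value_number(occupations, "engineer")
missing = find_item_value_number(occupations, "nurse")
```

Once the item value number is known, subsequent processing can work with that integer instead of the character string itself.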
Moreover, according to this data management mechanism, all the item value numbers in the value management table 110 correspond to different item values. Thus, in extracting records that include an item value having a specific value, the number of comparisons between the specific value and the item values required for the extraction is at most the number of kinds of item values, that is, the number of item value numbers. The number of comparison operations is remarkably reduced, and speed-up of retrieval and tabulation is realized. In that case, a place for storing the result of checking whether each item value is relevant is necessary; for example, the classification number 112 can be used as that storage place.
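The reduction in comparisons can be sketched as follows, under the assumption that the classification numbers are used as simple 0/1 flags. The function name `select_records` and the sample data are hypothetical.

```python
def select_records(value_list, pointer_array, predicate):
    """Return the record numbers whose item value satisfies predicate."""
    # One comparison per distinct item value, not one per record.
    classification = [1 if predicate(value) else 0 for value in value_list]
    # Per record, only an integer lookup of the stored flag is needed.
    return [record for record, number in enumerate(pointer_array)
            if classification[number]]

# Hypothetical "age" item: four distinct values, six records.
ages = [20, 34, 45, 61]
age_pointers = [1, 0, 1, 3, 2, 1]
hits = select_records(ages, age_pointers, lambda age: age >= 40)
```

In this sketch the predicate is evaluated four times, once per item value number, regardless of how many records the array of pointers contains.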
FIG. 2 shows an information block including a value management table 210 that consists of an item value array 211 in which item values are stored, a classification number array 212 in which classification numbers are stored, and a presence number array 214 in which numbers of presence are stored. The presence number array 214 stores, for each item value of a certain item, the number of times that item value is present in all the data, that is, the number of records having that item value. By preparing such a presence number array 214 in the value management table 210, it is possible to immediately obtain the information “what kinds of data (and how many data) are present” required in retrieval and tabulation. Thus, speed-up of retrieval and tabulation is realized.
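A presence number array of this kind could be derived from the array of pointers in a single pass, as in the following sketch. The function name `presence_numbers` and the sample data are hypothetical.

```python
def presence_numbers(value_count, pointer_array):
    """Count, for each item value number, how many records point to it."""
    counts = [0] * value_count
    for number in pointer_array:
        counts[number] += 1
    return counts

# Hypothetical "occupation" item with three distinct values and five records.
occupations = ["doctor", "engineer", "teacher"]
occupation_pointers = [2, 1, 2, 0, 1]
presence = presence_numbers(len(occupations), occupation_pointers)
tabulation = dict(zip(occupations, presence))
```

Pairing the item value array with the presence number array, as `tabulation` does here, yields the tabulation of the item without rescanning the records.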
However, even with such a data management mechanism, the value list and the array of pointers, in particular the array of pointers, become extremely large as the number of records increases, so that the processable data amount is limited by the hardware resources used.
Processing of large-size data is also required in fields other than the information processing for the table format data described above. Now that computers are installed throughout society and networks such as the Internet have spread, large-size data is accumulated in many places. Enormous calculations are required to process such large-size data. It is natural to attempt to introduce parallel processing for the calculations.
Parallel processing architectures are roughly classified into a “shared memory type” and a “distributed memory type”. The former (“shared memory type”) is a system in which plural processors share one huge memory space. In this system, since traffic between the processor group and the shared memory is a bottleneck, it is not easy to construct a realistic system using one hundred or more processors. Therefore, for example, in calculating the square roots of one billion floating-point variables, the acceleration ratio relative to a single CPU is one hundred times at most. Empirically, about thirty times is the upper limit.
In the latter (“distributed memory type”), each processor has its own local memory, and these processors are combined to construct a system. In this system, it is possible to design a hardware system including several hundred to several tens of thousands of processors. Therefore, in calculating the square roots of one billion floating-point variables, it is possible to set the acceleration ratio relative to a single CPU to several hundred to several tens of thousands of times.
Patent Document 1: International Publication No. WO00/10103 Pamphlet