The present invention relates to an information retrieval system that uses an index, and particularly to the addition and deletion of retrieval nodes, and to changes in the processing load shared among retrieval nodes, in a retrieval system in which the index is divided so that retrieval can be performed in parallel on a plurality of nodes.
A method of improving processing performance has been proposed for information processing systems that perform information retrieval, such as a database management system (hereinafter abbreviated as DBMS). Specifically, “Parallel Database Systems: The Future of High Performance Database Systems”, COMMUNICATIONS OF THE ACM, Vol. 35, No. 6, 1992, pp. 85-98, discloses an architecture that distributes the database processing load over a plurality of processors, each of which processes part of the load. In this prior art, the shared everything and shared disk type architectures allow every node or processor performing retrieval to access all disks, whereas the shared nothing type architecture allows each processor to access only the disks belonging to its own node. The shared nothing type architecture has fewer resources for which processors compete than the shared disk or shared everything type architectures, and thus it is excellent in scalability.
In an information system of the shared nothing type architecture, when the processing load on each node must be changed because a node is added or deleted or because access concentrates on a particular node, it is necessary to change the amount of data allocated to each node. The simplest method for changing the amounts of data allocated to the nodes is to back up the content of the database, redefine the data arrangement, and then reload the backed-up data. In this method, however, when the amount of data handled is large, a tremendous amount of processing time is needed for the backup and reload.
To solve this problem, U.S. Pat. No. 4,412,285 discloses a management technique in which data is divided in advance into a plurality of buckets by a hash function or the like, and a number of buckets are allocated to each processor.
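The bucket-based management described above can be illustrated with a minimal sketch. It is not taken from the cited patent: the bucket count, the use of CRC32 as the hash function, the round-robin allocation, and the names `bucket_of`, `allocate_round_robin` and `node_of` are all assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 64  # assumed value; fixed in advance, independent of the node count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a record key to a bucket number with a stable hash.

    CRC32 is only an illustrative choice of hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def allocate_round_robin(num_buckets, nodes):
    """Build a bucket-to-node table by dealing buckets out in turn (assumed policy)."""
    return {bucket: nodes[bucket % len(nodes)] for bucket in range(num_buckets)}


def node_of(key, table, num_buckets=NUM_BUCKETS):
    """Find the node responsible for a key via its bucket."""
    return table[bucket_of(key, num_buckets)]


table = allocate_round_robin(NUM_BUCKETS, ["node0", "node1", "node2"])
owner = node_of("document-42", table)
```

Because a key is always hashed to the same bucket, changing the allocation only requires editing the bucket-to-node table; the hash function itself never changes.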
In addition, JP-A-2001-142752 discloses a technique in which data is divided in advance into buckets and managed by a correspondence table between the buckets and a plurality of disks. When a disk is added because a retrieval node is added, the correspondence between the buckets and the disks is changed so that the amount of data moved is minimized, and the data is rearranged accordingly.
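One way to change a correspondence table so that little data moves when a disk is added can be sketched as follows. This is a simple greedy policy assumed for illustration, not the method of the cited publication; the function name `add_disk` and the rule of taking buckets from the most heavily loaded disks are hypothetical.

```python
from collections import defaultdict


def add_disk(table, new_disk):
    """Return a new bucket-to-disk table that includes new_disk.

    Assumed greedy policy: the new disk takes one bucket at a time from the
    currently most loaded disk until it holds its even share, so only the
    buckets handed to the new disk are moved. Assumes new_disk is not yet
    present in the table.
    """
    new_table = dict(table)
    owned = defaultdict(list)
    for bucket, disk in sorted(table.items()):
        owned[disk].append(bucket)
    owned[new_disk] = []

    target = len(table) // len(owned)  # even share, rounded down
    while len(owned[new_disk]) < target:
        donor = max(
            (d for d in owned if d != new_disk),
            key=lambda d: (len(owned[d]), d),
        )
        bucket = owned[donor].pop()
        owned[new_disk].append(bucket)
        new_table[bucket] = new_disk
    return new_table


old = {b: ["disk0", "disk1", "disk2"][b % 3] for b in range(12)}
new = add_disk(old, "disk3")
moved = [b for b in old if old[b] != new[b]]  # only buckets now on disk3
```

In this sketch, buckets that stay on their original disk are never touched, so the data to be rearranged is limited to the buckets reassigned to the added disk.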
Moreover, JP-A-2003-6021 discloses another technique in which data is logically divided in advance by a hash function into units corresponding to the buckets and managed in association with a plurality of disks. When a disk is added, the data is rearranged in these units, and processes such as retrieval, update and insertion continue to be performed during the rearrangement.
Furthermore, JP-A-2005-56077 discloses a technique in which the allocation of data among processors is changed without physical movement of data, by changing the mapping between physical disks and virtual disks corresponding to the buckets. With this technique, it is possible to greatly shorten the time taken to change the allocation of data among processors and to dynamically increase the number of nodes as the loads on the nodes rise.
These techniques address general data and do not specifically consider an index formed of an inverted file.
An information retrieval system having a shared nothing type index is required to alter the allocation of search-targeted ranges of the index to each node in order to add and delete nodes and to change the load balance among the nodes. The basic idea for meeting this requirement is that the search-targeted ranges of the index are divided in advance into buckets, as is done for general data with no index, and that the allocation of search-targeted ranges to each node is changed in units of buckets.
Here, in order to flexibly change the number of nodes and the load balance among the nodes in the information processing system, the data size of a bucket, which is the minimum unit of data arrangement, must be much smaller than the amount of data allocated to each node. When the bucket size is reduced, the number of buckets inevitably increases.
In addition, the index commonly used to increase the speed of information search is formed of an inverted file, which is a list of the index keys used in retrieval together with the addresses of the information items matching each index key.
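The structure of such an inverted file can be sketched as follows, assuming for illustration that the index keys are whitespace-separated lowercase words and that the address of an information item is an integer identifier. The function name `build_inverted_file` is hypothetical.

```python
from collections import defaultdict


def build_inverted_file(items):
    """Build an inverted file from {address: text}.

    Returns {index_key: sorted list of addresses}, with keys in sorted order.
    Splitting on whitespace is an assumed, simplified key extraction.
    """
    postings = defaultdict(set)
    for address, text in items.items():
        for key in text.lower().split():
            postings[key].add(address)
    return {key: sorted(addresses) for key, addresses in sorted(postings.items())}


items = {
    1: "parallel database systems",
    2: "parallel retrieval",
    3: "database index",
}
inverted = build_inverted_file(items)
hits = inverted.get("parallel", [])  # addresses of items containing "parallel"
```

A search for an index key then consists of locating the key in the inverted file and reading its address list, rather than scanning every information item.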
The simplest method for reallocating the index would be to produce a partial index for each bucket and to reallocate the partial indexes in units of buckets. However, since each node handles a large number of buckets as described above, each node holds many partial indexes, and thus an information retrieval operation must refer to many partial indexes with the same search key. Referring to many partial indexes means that the search of a partial inverted file for the target index key is repeated a large number of times. In addition, since the address lists associated with the target index key are divided among the buckets, they cannot be read in at one time. Therefore, this method is inefficient compared with the case in which a single large partial index allocated to each node is referred to only once. Such deterioration of retrieval performance is a serious problem for an information processing system whose chief purpose is information retrieval.
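The difference between the two cases can be made concrete with a small sketch. It assumes each partial index is an in-memory dictionary from index key to address list and counts key lookups as a rough stand-in for retrieval cost; the function names `search_per_bucket` and `search_single` are hypothetical.

```python
def search_per_bucket(partial_indexes, key):
    """Search one partial index per bucket.

    Each bucket needs its own key lookup, and the address list arrives in
    fragments that must be gathered afterwards.
    """
    lookups = 0
    addresses = []
    for index in partial_indexes:
        lookups += 1
        addresses.extend(index.get(key, []))
    return sorted(addresses), lookups


def search_single(node_index, key):
    """Search a single partial index covering the whole node: one lookup."""
    return list(node_index.get(key, [])), 1


per_bucket = [{"db": [1, 4]}, {"db": [2]}, {"ir": [3]}]
per_node = {"db": [1, 2, 4], "ir": [3]}

fragmented = search_per_bucket(per_bucket, "db")
unified = search_single(per_node, "db")
```

Both calls return the same addresses, but the per-bucket search performs one lookup for every bucket held by the node, so its cost grows with the number of buckets even when most buckets contain no match.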
On the other hand, in order to solve this deterioration problem, it is conceivable to regenerate the partial index for which each node is responsible from the original text of the information items whenever the allocation of search-targeted ranges among the nodes is changed. However, producing a partial index requires a large amount of computation, because index keys must be compared many times in order to produce the address list for each index key in the partial inverted file. A large amount of computation is also needed for other processing, such as analyzing the original text of the information items and extracting the portions associated with the index keys. When the allocation of search-targeted ranges is changed because nodes are added in response to increased loads on the nodes, producing the partial indexes increases the loads on the nodes still further. Therefore, it is not appropriate to regenerate partial indexes from the original text of the information items.