This invention relates to an improvement in a computer system and a method for distributing and storing graphs across a plurality of computers and searching the distributed graphs in parallel.
Graphs are used to express relations among data pieces. A graph is a set of data pieces in which each data piece holds a data value and at least one relation with another data piece, so that the data pieces are connected to one another by these relations.
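The definition above can be illustrated with a minimal sketch. The class name `DataPiece`, the identifiers, and the relation labels are hypothetical and chosen only for illustration; the sketch assumes a graph can be held as a mapping from an identifier to a data piece.

```python
from dataclasses import dataclass, field


@dataclass
class DataPiece:
    """One data piece: a data value plus its relations (hypothetical layout)."""
    value: str
    # Each relation is a (relation_label, target_id) pair.
    relations: list = field(default_factory=list)


# A graph as a mapping from an illustrative piece id to its data piece.
graph = {
    "a": DataPiece("Alice", [("knows", "b")]),
    "b": DataPiece("Bob", [("works_at", "c")]),
    "c": DataPiece("Acme", []),
}


def neighbors(g, piece_id):
    """Return the values of the pieces directly related to piece_id."""
    return [g[target].value for _, target in g[piece_id].relations]
```

For example, `neighbors(graph, "a")` follows the single relation held by the piece with value "Alice" and returns the value of the related piece.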
There exist a graph database apparatus for storing and managing graphs and a graph search apparatus for extracting a desired graph from the graphs stored in the graph database apparatus. The graph search apparatus extracts from the graph database apparatus one or more graphs matching conditions defined by data values and relations among data pieces.
To expedite searching a massive number of graphs, there is a known technique that distributes and shares graphs among a plurality of server nodes and conducts parallel searches at the server nodes.
Receiving and merging the results of the parallel searches from the plurality of server nodes at a management server yields the same graphs that would be obtained by searching all the graphs in one place. It should be noted, however, that if related data is distributed among a plurality of server nodes, those server nodes need to determine jointly whether the related data satisfies the conditions, because the search conditions for graphs consist of data values and data relations. Evaluating the conditions across a plurality of server nodes may require communication among the server nodes, which delays the processing. To prevent this delay, Non-Patent Literature 1 discloses a technique that stores data connected by relations in the same server node.
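The following sketch illustrates the parallel search and merge described above, under the assumption that each server node holds whole graphs so that no node needs to consult another. It is not the method of Non-Patent Literature 1. The server nodes are simulated by threads, a graph is modeled as a list of (source value, relation, target value) triples, and the names `NODES`, `search_node`, and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Each simulated server node holds complete graphs (assumption: related data
# is co-located, so a node can evaluate the conditions on its own).
NODES = [
    [[("Alice", "knows", "Bob")], [("Carol", "knows", "Dave")]],
    [[("Alice", "works_at", "Acme")], [("Alice", "knows", "Eve")]],
]


def search_node(graphs, value, relation):
    """Return the graphs on one node that contain an edge from value via relation."""
    return [g for g in graphs if any(s == value and r == relation for s, r, _ in g)]


def parallel_search(nodes, value, relation):
    """Search every node concurrently and merge the per-node results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda graphs: search_node(graphs, value, relation), nodes)
        # Merging step, performed here in the role of the management server.
        return [g for part in partials for g in part]
```

Under these assumptions, the merged result of `parallel_search(NODES, "Alice", "knows")` contains the same graphs as a single search over all graphs held together.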
The technique disclosed in Non-Patent Literature 1 eliminates communication among the server nodes during a search; the search time required at each server node is therefore the time that node takes to search the graphs it holds and extract the graphs matching the conditions designated with a data value and a data relation. Since the server nodes search in parallel, the time to obtain all the search results is determined by the server node that takes the longest. Because the details of the search are the same for all server nodes, this time in turn depends on the number of graphs each server node has to search.
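The dependence on the slowest server node can be sketched as follows. The function name and the linear cost model (a fixed time per graph searched) are assumptions made only for illustration.

```python
def total_search_time(graphs_per_node, time_per_graph=1.0):
    """Parallel completion time: the slowest node determines the total.

    Assumes, for illustration only, that every graph costs the same time
    to check on every node.
    """
    return max(n * time_per_graph for n in graphs_per_node)
```

With 300 graphs to be searched in total, an even split of 100 per node gives a completion time of 100 units, whereas a split of 250, 25, and 25 gives 250 units even though the total amount of work is unchanged.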
Now, the graphs to be searched are explained. In general, a data search uses a set of labels called an index to extract one or more data pieces matching a part or all of the search conditions. The index for graphs is dictionary data in which data values and data relations are sorted in a specific order. Extracting from this dictionary data the data range that matches a part or all of the search conditions yields the intended graphs without checking every graph. If the extracted data range matches only a part of the search conditions, the extracted data range is treated as a set of possible solutions, and each possible solution must be checked against the remaining search conditions. The number of possible solutions corresponds to the number of graphs to be searched. If no index is provided, all the graphs are possible solutions.
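The index lookup described above can be sketched with a sorted list and binary search. The index layout (data value, relation, graph id), the sample data, and the function names are hypothetical. The index here covers only the source value and the relation, so the target value is the remaining condition that must be checked for each possible solution.

```python
from bisect import bisect_left, bisect_right

# Graphs modeled as lists of (source_value, relation, target_value) triples.
GRAPHS = {
    1: [("Alice", "knows", "Bob")],
    2: [("Alice", "works_at", "Acme")],
    3: [("Bob", "knows", "Carol")],
    4: [("Alice", "knows", "Eve")],
}

# Dictionary data sorted by (data value, relation, graph id).
INDEX = sorted((s, r, gid) for gid, g in GRAPHS.items() for s, r, _ in g)


def candidates(value, relation):
    """Return the graph ids in the index range matching (value, relation)."""
    lo = bisect_left(INDEX, (value, relation))
    hi = bisect_right(INDEX, (value, relation, float("inf")))
    return [gid for _, _, gid in INDEX[lo:hi]]


def search(value, relation, target):
    """Narrow by index, then check each possible solution against the rest."""
    return [
        gid
        for gid in candidates(value, relation)
        if any(s == value and r == relation and t == target for s, r, t in GRAPHS[gid])
    ]
```

In this sketch, `candidates("Alice", "knows")` yields two possible solutions out of four graphs, and `search("Alice", "knows", "Eve")` checks only those two against the remaining condition.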
The number of graphs to be searched depends on the details of the search and on the allocation of the graphs to the server nodes. Accordingly, if a specific server node has more graphs to be searched than the other server nodes, the load on that server node increases and delays the search. To solve this problem, Patent Literature 1 discloses a technique that keeps records of the details of past searches and the volume of data searched, and reallocates data from a server node with a large volume of searched data to a server node with a small volume to achieve load balancing.
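The general idea of such reallocation can be sketched as follows. This is not the procedure of Patent Literature 1, whose details are not given here; the function name, the single-step transfer, and the cap on the amount moved are assumptions made for illustration.

```python
def rebalance(searched_volume, step):
    """Move up to `step` units from the most-loaded node to the least-loaded one.

    searched_volume maps a node name to its recorded volume of searched data.
    The transfer is capped at half the gap so the two nodes never swap roles.
    """
    loads = dict(searched_volume)  # work on a copy; leave the records untouched
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    moved = min(step, (loads[busiest] - loads[idlest]) // 2)
    loads[busiest] -= moved
    loads[idlest] += moved
    return loads
```

For recorded volumes of 300, 50, and 100 on three nodes, one call with a step of 100 lowers the largest volume to 200 and raises the smallest to 150, which reduces the load on the server node that would otherwise determine the overall search time.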