1. Field of the Invention
The present invention relates to a coordinator server configured to be connected to a plurality of database servers each having a database for storing extensible markup language (XML) data to constitute a distributed XML database server, and a distributed processing method.
2. Description of the Related Art
Conventionally, pipeline processing is known as a technique for speeding up a processor that reads, interprets, and executes commands and writes the results thereof. In pipeline processing, the process of each phase operates independently: the process of the next phase is started before the processing cycle of the previous phase finishes, and this is repeated. Accordingly, an assembly-line operation is realized and the performance of the entire processing is improved.
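The benefit of such an assembly-line operation can be illustrated with a simple cycle count. The following is a minimal sketch under idealized assumptions that are not stated in the text above: every phase takes exactly one cycle, and no stalls occur. The function names are hypothetical.

```python
def sequential_cycles(num_commands: int, num_phases: int) -> int:
    """Cycles needed when each command finishes all phases before the next starts."""
    return num_commands * num_phases


def pipelined_cycles(num_commands: int, num_phases: int) -> int:
    """Cycles needed when a new command enters the first phase every cycle.

    The first command needs num_phases cycles to drain through the pipeline;
    each later command completes one cycle after its predecessor.
    """
    if num_commands == 0:
        return 0
    return num_phases + num_commands - 1


if __name__ == "__main__":
    print(sequential_cycles(100, 4))  # 400
    print(pipelined_cycles(100, 4))   # 103
```

Under these assumptions, the pipelined execution approaches one completed command per cycle as the number of commands grows.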
Meanwhile, parallel database technology is known as a technique for managing a large amount of data. In the parallel database technology, a system including a plurality of servers is established to handle the large amount of data. A large data set having a uniform data format is arranged across a plurality of databases. In some cases, the data set is arranged not in a distributed manner but redundantly, with copies overlapping on a plurality of databases. Arranging the data set in this manner can be expected to improve throughput when the number of simultaneous accesses to the same data is high.
Systems for managing the data in such a parallel database are broadly classified into three types: a system in which a plurality of servers do not share a disk (disk nonshared system), a system in which the servers share a disk (disk sharing system), and a system in which the servers share a disk and a memory (memory sharing system).
The disk nonshared system is mainly explained here. When the data set is divided and arranged in a plurality of databases (fragmentation), two methods can be considered: vertical division of the data set and horizontal division of the data set. Horizontal division creates subsets of the data set, and the data partitioning technique described later becomes important for it. Vertical division divides the data in units of attributes or columns. Each division method has merits and demerits depending on the individual access pattern. For example, with vertical division, high speed can be achieved when a query can be satisfied by scanning only a small part of the data. However, if the original data is required, the data must be joined (coupled) between servers, and the performance is greatly deteriorated.
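The difference between the two division methods can be sketched as follows. This is an illustrative sketch only: the round-robin assignment used for the horizontal division, the sample rows, and all function names are assumptions made for the example, not part of any particular parallel database.

```python
def horizontal_split(rows, num_parts):
    """Horizontal division: each fragment holds a subset of whole rows."""
    parts = [[] for _ in range(num_parts)]
    for i, row in enumerate(rows):
        parts[i % num_parts].append(row)  # round-robin, chosen only for brevity
    return parts


def vertical_split(rows, key, column_groups):
    """Vertical division: each fragment holds the key plus one group of columns."""
    return [
        [{c: row[c] for c in (key, *group)} for row in rows]
        for group in column_groups
    ]


def rejoin(fragments, key):
    """Rebuild the original rows from vertical fragments by joining on the key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "pen", "price": 10},
        {"id": 2, "name": "ink", "price": 25},
        {"id": 3, "name": "pad", "price": 40},
    ]
    print(horizontal_split(rows, 2))
    fragments = vertical_split(rows, "id", [["name"], ["price"]])
    print(fragments[1])               # a price-only scan touches one fragment
    print(rejoin(fragments, "id"))    # the original rows require a join
```

A query that needs only `price` reads a single vertical fragment, whereas reconstructing whole rows requires the `rejoin` step, which in a disk nonshared system corresponds to a join between servers.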
Each server used in the parallel database in the disk nonshared system can perform parallel access by individually accessing a plurality of databases in which the data set is arranged in the divided manner, and improvement of performance corresponding to the number of databases can be expected. Accordingly, processing efficiency and response time can be improved (partition parallelization).
As the data partitioning method, key range partitioning and hash partitioning are known. For example, it is assumed here that a large data set is expressed as a relation. In the key range partitioning and the hash partitioning, there are a case of using one column value of the relation and a case of using a plurality of column values of the relation. When such data partitioning is performed, although loads are concentrated, a search with a range condition on the target column can avoid the inefficiency of accessing irrelevant databases. Further, a search including a natural join on the target column does not require joining between different databases, so that the performance can be considerably improved.
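The two partitioning methods can be sketched as follows. This is a minimal sketch, assuming a single integer or string column as the partitioning key; the boundary values, the use of CRC32 as the hash function, and the function names are assumptions made for illustration.

```python
import bisect
import zlib


def range_partition(key, boundaries):
    """Key range partitioning: boundaries [100, 200] give databases
    0 (key < 100), 1 (100 <= key < 200), and 2 (key >= 200)."""
    return bisect.bisect_right(boundaries, key)


def hash_partition(key, num_databases):
    """Hash partitioning: a stable hash of the key selects the database."""
    return zlib.crc32(str(key).encode("utf-8")) % num_databases


def databases_for_range(low, high, boundaries):
    """Databases that a range condition low <= key <= high must access."""
    first = range_partition(low, boundaries)
    last = range_partition(high, boundaries)
    return list(range(first, last + 1))


if __name__ == "__main__":
    boundaries = [100, 200]
    print(range_partition(150, boundaries))               # 1
    print(databases_for_range(120, 180, boundaries))      # [1]
    print(databases_for_range(50, 150, boundaries))       # [0, 1]
    print(hash_partition("order-42", 3))                  # some value in 0..2
```

With key range partitioning, the range condition `120 <= key <= 180` touches only one database, which corresponds to avoiding access to irrelevant databases. With hash partitioning, equal key values always map to the same database, so rows that join on the partitioning column are co-located.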
In the parallel database, unless balanced data partitioning is performed, loads are concentrated on a specific database at the time of search, which makes it difficult to obtain a parallelization effect. However, the respective data sizes may become unbalanced due to a change in the trend of input data, and this cannot be avoided by using a predetermined data division rule. Therefore, improvement techniques such as dynamically changing the key range and changing the hash value have been proposed. With these techniques, however, the load of moving data as a result of the change increases.
The parallel database often includes one coordinator server and a plurality of database (DB) servers. In such a configuration, the following processing is performed in the parallel database. That is, the coordinator server, having received a request from a client, analyzes the request to generate a plan, and divides and distributes the plan to each of the DB servers. Each DB server executes the distributed plan and transmits the data set of its processing result to the coordinator server. The coordinator server performs aggregation processing such as merging on the transmitted data sets, and transmits the aggregation result to the client. The data transferred between the servers is transmitted as a stream over a network such as a local area network (LAN). Therefore, in the parallel database, the network is also often realized on a distributed parallel platform such as a high-speed interconnect between servers.
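The flow of distributing a plan, executing it on each DB server, and merging the results at the coordinator server can be sketched as follows. This is an illustrative sketch only: the DB servers are modeled as in-memory lists, threads stand in for network communication, and the names `DB_SERVERS`, `execute_partial_plan`, and `coordinate` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the data held by three DB servers.
DB_SERVERS = {
    "db0": [5, 1, 9],
    "db1": [4, 8],
    "db2": [7, 2, 6],
}


def execute_partial_plan(server_name, threshold):
    """Work done on one DB server: scan local data, filter it, and sort it."""
    return sorted(v for v in DB_SERVERS[server_name] if v >= threshold)


def coordinate(threshold):
    """Work done on the coordinator: distribute the plan, then merge results."""
    with ThreadPoolExecutor(max_workers=len(DB_SERVERS)) as pool:
        futures = [
            pool.submit(execute_partial_plan, name, threshold)
            for name in DB_SERVERS
        ]
        partial_results = [future.result() for future in futures]
    # Each partial result is already sorted, so a k-way merge is sufficient.
    return list(heapq.merge(*partial_results))


if __name__ == "__main__":
    print(coordinate(4))  # [4, 5, 6, 7, 8, 9]
```

Because each DB server returns an already sorted result, the coordinator needs only a merge rather than a full sort, which is one example of the aggregation processing mentioned above.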
To realize high speed in the above processing, the parallel database incorporates a mechanism for performing the phases of the internal processing of structured query language (SQL), such as scanning, sorting, and joining, in parallel by a plurality of processes and a plurality of servers. Some database products adopt a pipeline system in which the process of each phase operates independently and the process of the next phase is started before the process of the previous phase finishes (pipeline parallelization).
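The overlap between phases can be sketched with iterator-style operators. This is a minimal sketch, assuming three hypothetical phases (scan, select, join) written as Python generators; it runs in a single thread, so it models only the fact that a later phase starts before the earlier phase has finished, not execution on a plurality of processes or servers.

```python
def scan(rows, log):
    """Scan phase: hand rows to the next phase one at a time."""
    for row in rows:
        log.append(("scan", row["id"]))
        yield row


def select(rows, min_price, log):
    """Selection phase: pass on only rows that satisfy the condition."""
    for row in rows:
        if row["price"] >= min_price:
            log.append(("select", row["id"]))
            yield row


def join(rows, categories, log):
    """Join phase: attach the category looked up by id."""
    for row in rows:
        log.append(("join", row["id"]))
        yield {**row, "category": categories[row["id"]]}


def run_pipeline(rows, categories, min_price):
    """Chain the phases and return the results with the order of events."""
    log = []
    stages = join(select(scan(rows, log), min_price, log), categories, log)
    return list(stages), log


if __name__ == "__main__":
    rows = [{"id": 1, "price": 5}, {"id": 2, "price": 20}, {"id": 3, "price": 30}]
    categories = {1: "x", 2: "y", 3: "z"}
    results, log = run_pipeline(rows, categories, min_price=10)
    print(results)
    print(log)
```

In the recorded event order, the join of row 2 appears before the scan of row 3, which shows that the join phase begins before the scan phase has finished.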
Further, in an XML database that stores XML data, a functional language referred to as XQuery, which has a static typing function, is used, for example, as the query language for requesting acquisition of the XML data, and a data set is acquired as the processing result. Single XQuery processing performed by using the XQuery is broadly divided into an approach that processes the XML data as a functional language and an approach that processes the XML data by tuple operations. Recently, distributed XQuery processing technology for realizing such XQuery processing by a distributed system has been developed. However, attempts at distributed XQuery processing have only just begun, and articles describing the distributed XQuery processing are still few. Examples include ‘Christopher Re, et al., “Distributed XQuery”, IIWeb 2004’ and ‘Mary Fernandez, et al., “Highly Distributed XQuery with DXQ”, SIGMOD 2007’.
The distributed XML database that performs the distributed XQuery processing has, for example, a master-slave distributed architecture in which a plurality of database servers are designated as slaves and a coordinator server that coordinates the database servers is designated as a master. In such a distributed XML database, it is assumed, for example, that an XQuery specification is used in which the XQuery specification for performing the conventional single XQuery processing is partially extended so that query shipping can be executed.
However, when such a partially extended XQuery specification (XQueryD) is used, a user who writes an XQuery needs to describe the distributed processing procedure, and therefore the distribution transparency is not necessarily high. Distribution transparency means that a distributed system appears to the user as if it were a centralized system, so that the user need not be conscious of the distributed configuration.