The present invention relates to a computer system and a data processing method in the computer system, and more particularly to a computer system in which a plurality of computers process a large volume of data in parallel and to a data processing method used in the event of a fault.
In recent years, the volume of data processed by computer systems has been growing explosively. This in turn has increased the time taken by data processing, giving rise to the problem that a job may fail to finish within a predetermined time. To speed up data processing, it is increasingly necessary to process a large volume of data with a plurality of computers operating in parallel.
One technology for processing large volumes of data using a plurality of computers is a distributed memory technology, such as the one described in GemStone Systems, Inc., “GemFireEnterprise,” Technical White Paper, 2007. The distributed memory technology integrates the memories provided in a plurality of computers into one logical memory space in which data is stored. In the distributed memory technology, since the data is physically distributed among the memories of the plurality of computers, these computers can process their respective portions of the data in parallel. Further, since the data is held in the memories of the computers, data transfers to and from external storages such as disk drives are reduced. This in turn results in an increased speed of data processing.
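The idea of one logical memory space backed by the memories of several computers can be illustrated with a minimal sketch. It is not taken from the cited white paper: the class names `Node` and `DistributedMemory`, the method `map_local`, and the placement rule (integer key modulo the number of computers) are all hypothetical, and the per-computer processing is run sequentially here only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One computer; `memory` stands in for its local main memory."""
    name: str
    memory: dict = field(default_factory=dict)


class DistributedMemory:
    """One logical key-value space spread over the memories of several nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def _owner(self, key: int) -> Node:
        # Assumed placement rule: integer key modulo the number of nodes.
        return self.nodes[key % len(self.nodes)]

    def put(self, key: int, value) -> None:
        self._owner(key).memory[key] = value

    def get(self, key: int):
        return self._owner(key).memory[key]

    def map_local(self, fn) -> dict:
        # Each node touches only the data in its own memory. A real system
        # would run these per-node loops in parallel, one on each computer.
        return {n.name: [fn(v) for v in n.memory.values()] for n in self.nodes}


nodes = [Node("A"), Node("B"), Node("C")]
dm = DistributedMemory(nodes)
for k in range(6):
    dm.put(k, k * 10)
```

A caller reads and writes through `dm` as if it were a single memory, while each value actually resides in the memory of exactly one computer.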
The distributed memory technology, on the other hand, carries a risk that, in the event of a fault in a computer, the data held in that computer may be lost. To deal with this problem, it is common practice in the distributed memory technology to replicate the data held in the memory of a computer and to dispose the replica in a memory of another computer, thereby avoiding the loss of data that would otherwise occur in the event of a fault. When a fault has occurred in a computer, the operation that was being executed by that computer at the time of the fault can be executed again by the second computer, which holds the data replica. It is noted, however, that the second computer re-executes that operation only after it has finished the operation that it was itself executing at the time of the fault, so the completion of the overall data processing is delayed by the fault.
To speed up the re-execution of an operation using the replicated data in the event of a computer fault, a technology is available that disposes the data in a distributed manner among secondary memory devices of other computers, as shown in JP-A-2000-322292 and JP-A-2001-100149. In the technology disclosed in these patent documents, a replica of the data held by a computer is distributed among the secondary memory devices of a plurality of other computers. When a fault occurs in a computer, the plurality of other computers whose secondary memory devices hold the replica of the data held by the faulted computer process that data in parallel, thus reducing the time taken by the re-execution.
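The effect of distributing the replica can be illustrated with a rough sketch that compares the two placements. This is a simplified model and not the method of the cited patent documents: the function names, the round-robin placement, the fixed cost per record, and the assumption that every surviving computer first finishes the same amount of its own remaining work are all hypothetical.

```python
def replicate_to_one(records, backups):
    """Place the whole replica on a single other computer."""
    return {backups[0]: list(records)}


def replicate_distributed(records, backups):
    """Spread the replica round-robin over several other computers."""
    placement = {b: [] for b in backups}
    for i, record in enumerate(records):
        placement[backups[i % len(backups)]].append(record)
    return placement


def reexecution_time(placement, own_remaining, cost_per_record=1):
    """Time until the slowest replica holder has finished its own remaining
    work (assumed equal for all holders) and then re-executed its share of
    the faulted computer's records."""
    return max(
        own_remaining + cost_per_record * len(recs)
        for recs in placement.values()
    )


records = list(range(12))   # data held by the faulted computer
backups = ["B", "C", "D"]   # surviving computers

single = reexecution_time(replicate_to_one(records, backups), own_remaining=4)
spread = reexecution_time(replicate_distributed(records, backups), own_remaining=4)
```

Under these assumptions, `single` is 16 time units and `spread` is 8, because each of the three replica holders re-executes only four of the twelve records.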