The present invention relates generally to the field of grid computing and, more particularly, to a method and system for integrated scheduling and replication in a grid computing system.
Many organizations and laboratories need to perform numerous computations to generate desired results. The data required for these computations may reside at different geographical locations. For example, a financial consultancy firm may need to access data from its computing systems in India and the United States of America to compute the best investment plans for its clients. As another example, consider a drug discovery lab in which various chemical compositions are developed and tested at different geographical locations. The data from these locations must be collated, and computationally intensive tasks must then be performed on it, to obtain the final product. The need to perform computational tasks on data accessed from heterogeneous sources, which may also be geographically separated, gave rise to the concept of grid computing.
In a grid computing system, a plurality of geographically dispersed Data Processing Units (DPUs), or computing systems, are interconnected with each other. A computation job to be processed by the grid computing system may require a plurality of files for processing. These files may be spread across more than one DPU in the grid computing system. The time required to process the computation job at a DPU thus depends on factors that include, but are not limited to, the processing time required for the computation job at the DPU and the time required to move those of the required files that are not present at the DPU from other DPUs in the grid computing system. The time required to move these files is in turn dependent on the bandwidth available to the DPU. Thus, there is a need for methods and systems that take these factors into consideration and schedule the computation job in such a way that it is processed in the minimum possible time.
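By way of illustration only, the dependence of job completion time on processing time, missing files, and bandwidth can be sketched as follows. This is a minimal sketch and not the method of the invention. The names `DPU`, `estimated_completion_time`, and `pick_dpu` are hypothetical, and the sketch assumes that missing files are transferred one after another at a single fixed inbound bandwidth and that the processing time at each DPU is known in advance.

```python
from dataclasses import dataclass


@dataclass
class DPU:
    """Hypothetical model of a data processing unit, for illustration only."""
    name: str
    local_files: set             # names of files already stored at this DPU
    bandwidth_mb_per_s: float    # assumed fixed inbound bandwidth
    processing_time_s: float     # assumed known time to run the job locally


def estimated_completion_time(dpu, required_files, file_sizes_mb):
    """Processing time plus the time to fetch every required file not held locally."""
    missing = [f for f in required_files if f not in dpu.local_files]
    transfer_time_s = sum(file_sizes_mb[f] for f in missing) / dpu.bandwidth_mb_per_s
    return dpu.processing_time_s + transfer_time_s


def pick_dpu(dpus, required_files, file_sizes_mb):
    """Return the DPU with the smallest estimated completion time."""
    return min(
        dpus,
        key=lambda d: estimated_completion_time(d, required_files, file_sizes_mb),
    )
```

Under these assumptions, a DPU with a slower processor may still be the better choice for a job if it already holds most of the files the job requires.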
In one such method, files present at one DPU are replicated at every other DPU in the grid computing system. Replicating data across the entire grid computing system requires a large amount of time. Moreover, the data is not optimally distributed; hence the available memory space is used inefficiently, which increases the time required to process computation jobs. Other replication methods include, but are not limited to, replicating each of the plurality of files required by the computation job at the DPU processing it. In case the DPU cannot hold some of the plurality of files due to space constraints, the oldest file stored in the memory/disk of the DPU is deleted. Although this reduces the time required for replicating data, the method does not make optimal use of the memory/disk available at the DPU and can lead to replication of every file at the DPU when different computation jobs are scheduled at the DPU.
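The oldest-file deletion policy described above can be sketched as follows, for illustration only. The class name `OldestFirstStore` and its interface are hypothetical, and the sketch assumes that "oldest" means the file that was replicated to the DPU earliest and that storage is measured in a single unit of size.

```python
from collections import OrderedDict


class OldestFirstStore:
    """Hypothetical local store that deletes the oldest file when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()   # file name -> size, oldest entry first
        self.used = 0

    def replicate(self, name, size):
        """Store a copy of the file locally and return the names of deleted files."""
        if name in self.files:
            return []                # already present, nothing to replicate
        if size > self.capacity:
            raise ValueError("file is larger than the total capacity of the store")
        evicted = []
        # Delete the oldest files until the new file fits.
        while self.used + size > self.capacity:
            old_name, old_size = self.files.popitem(last=False)
            self.used -= old_size
            evicted.append(old_name)
        self.files[name] = size
        self.used += size
        return evicted
```

Because the policy considers only the age of a file, it may delete a file that an upcoming computation job still requires, which illustrates why replication decisions made without scheduling information use the available space poorly.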
Other methods known in the art consider scheduling to be a primary concern over replication. These methods result in a longer waiting time to access files that are not present at the DPU and must be obtained from other DPUs in the grid computing system. Hence, there is a need for a system that manages replication of data files based on the scheduling decisions taken by the grid computing system.
Due to the distributed nature of grid computing systems, a centralized system that integrates replication and scheduling throughout the grid computing system would incur heavy costs and may not be feasible in many cases.
A system and method known in the art that integrates replication and scheduling determines whether to replicate data from a DPU to other DPUs based on the scheduling information. The system assumes that each computation job requires only one file at a time for processing. In practical scenarios, however, a computation job requires multiple files for processing. Further, a computation job may be scheduled not at a DPU that contains at least one of the files required by that computation job, but at a DPU that is close to the DPUs containing all the files required by the computation job.
The drawbacks of existing scheduling and replication methods and systems in grid computing systems give rise to the need for a method and system that integrates scheduling and replication in a grid computing system in a distributed and scalable manner.