There are many types of computational jobs that are sufficiently complex to justify use of a distributed computing environment in which a number of interconnected computing devices cooperate in carrying out the required computations to complete the job. One such job, which arises during the development of new software by a team of programmers, is handling the check-in and testing of new code by each developer on the team. Typically, programmers working on a software development project will submit the changes to the code that have just been made so that a new build of the software incorporating those changes can be produced. Due to human error, programming bugs can be introduced that either prevent a build from being completed or cause errors when an attempt is made to execute a build of the software. As the size of the development team increases, the amount of time required to find the bugs so that they can be corrected grows in proportion. Consequently, large teams of programmers frequently expend more than half of the development time in identifying bugs in the software, so that errors can be addressed. Also, it is not uncommon for an attempted fix of one bug to introduce yet other bugs that must then be corrected.
Automated testing of newly developed software code addresses at least a portion of this problem, since such testing can detect bugs that prevent a build of the code from being completed and can help the team identify and correct bugs that cause runtime product failures. In the past, distributed processing environments used for automated testing of software under development have employed scripting languages to define each of the tests and tasks that must be carried out by the plurality of computers comprising the distributed computing environment. A similar approach has been used for other jobs that are processed by distributed computing systems, to determine how portions of a job are parceled out to the computers in the system. However, scripts and other forms of defined programming used to specify the entire procedure for handling distributed computing jobs are typically very labor intensive to prepare and maintain. More importantly, such approaches have not been sufficiently flexible in handling the problems that inevitably occur because of errors or failures.
For example, in script-controlled distributed computing for handling software check-in and testing of new software builds, if one section of code breaks a new build, it may be difficult to restart the process once the problem that prevented the build from completing has been fixed. In the past, any problem that interrupted the preparation of a new build would likely require that the entire sequence of tasks be started over, which could unduly delay completion of the testing job.
In any distributed computing system, it is important to make good use of the available computing devices that are sharing in performing a distributed computing job. It is difficult to efficiently assign the tasks comprising the job to each computing device in accord with the capabilities of that computing device, because of the interdependence of the tasks, the time required to complete the tasks, and the capabilities of the computing devices that become available to do other tasks. As indicated above, resumption of a distributed computing job after an interruption has occurred should be carried out in a manner that enables available computing devices to be provided with tasks that remain to be done, without having to redo tasks that have already been completed. The state of each task must therefore be maintained at all times, to enable efficient resumption of a job following an interruption and correction of the problem that caused the interruption.
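The requirement that task state be maintained so a job can resume without redoing completed tasks can be illustrated with a minimal sketch. It is not drawn from any particular system: the function names (`load_state`, `save_state`, `run_job`), the JSON state file, and the two-value state ("pending" or "done") are all assumptions made for illustration.

```python
import json
from pathlib import Path

def load_state(path, task_names):
    """Return saved task states; tasks with no saved state default to 'pending'."""
    p = Path(path)
    saved = json.loads(p.read_text()) if p.exists() else {}
    return {name: saved.get(name, "pending") for name in task_names}

def save_state(path, state):
    """Persist the state of every task so an interrupted job can be resumed."""
    Path(path).write_text(json.dumps(state))

def run_job(tasks, deps, state_path):
    """Run each task whose prerequisites are done, skipping tasks already done.

    tasks: hypothetical mapping of task name -> callable
    deps:  hypothetical mapping of task name -> list of prerequisite task names
    Returns the names of the tasks executed during this call.
    """
    state = load_state(state_path, tasks)
    executed = []
    progress = True
    while progress:
        progress = False
        for name, action in tasks.items():
            if state[name] == "done":
                continue  # completed in this run or in an earlier, interrupted run
            if all(state[d] == "done" for d in deps.get(name, [])):
                action()  # may raise; earlier completions are already saved
                state[name] = "done"
                save_state(state_path, state)
                executed.append(name)
                progress = True
    return executed
```

If a task raises an exception, the states saved so far remain on disk, so calling `run_job` again after the problem is corrected executes only the tasks that were not yet completed.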
Accordingly, it is apparent that a more flexible approach is required than has been used in the past. It is important to predefine the tasks comprising a computing job to facilitate distribution of the tasks to the various computing devices as the computing devices become available. The capabilities of the available computing devices in a group of such devices must be matched to the tasks that remain to be done. The use of multiple computing devices to complete a job in this flexible manner will be particularly useful in handling automated check-in of newly prepared computer code and testing of the builds produced from such code, but this approach is not limited to just this application. Instead, it should be useful for handling almost any distributed computing task in which a job can be predefined as a plurality of tasks that will be implemented by multiple computing devices matched to the tasks remaining to be completed.
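Matching the capabilities of available computing devices to the tasks that remain can be sketched as follows. This is a simple first-fit illustration under stated assumptions, not a description of any specific scheduler: the function name `assign_tasks`, the representation of capabilities as sets of strings, and the sample task and device names are all hypothetical.

```python
def assign_tasks(pending, devices):
    """Give each available device the first remaining task it is capable of running.

    pending: list of (task_name, required_capabilities) pairs, in priority order
    devices: mapping of available device name -> set of capabilities it offers
    Returns (assignments, remaining): a device -> task mapping, and the
    tasks that no available device could take.
    """
    assignments = {}
    remaining = list(pending)
    for device, capabilities in devices.items():
        for i, (task, required) in enumerate(remaining):
            # A device qualifies only if it offers every required capability.
            if required <= capabilities:
                assignments[device] = task
                del remaining[i]
                break
    return assignments, remaining

# Hypothetical predefined tasks and available devices.
pending = [
    ("compile_windows", {"windows"}),
    ("unit_tests", set()),
    ("gpu_tests", {"gpu"}),
]
devices = {"machine_a": {"linux"}, "machine_b": {"windows", "gpu"}}
assignments, remaining = assign_tasks(pending, devices)
```

Tasks left in `remaining` would wait until a suitable device becomes available. A first-fit pass is only one possible policy; it takes no account of task duration or of interdependence among tasks.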