A multi-nodal system is a computer system that completes computing jobs using multiple collaborating compute nodes that are connected together, for example a server or a clustered computer system. These compute nodes may reside on the same physical system and be communicatively coupled by a bus, or they may be remotely located and communicate over a communication network.
Using a multi-nodal system has many advantages. For example, when operating in a load-balanced manner, the system can achieve higher efficiency by dividing work among multiple compute nodes. The multi-nodal system may also operate in a master/slave manner: if the master fails, a slave provides services to users in its place, which gives the system high fault tolerance.
Because a multi-nodal system normally comprises a large number of computing resources that work together, each incoming job must be apportioned an appropriate amount of system resources. This process is referred to as job scheduling. In general, job scheduling includes mapping jobs to corresponding computing resources for execution based on each job's characteristics and the scheduling policies in effect. As part of this process, a job may be divided into one or more tasks (i.e., processes or threads). One or more of these tasks may then be executed on a compute node within the multi-nodal system. If multiple tasks are allocated to a single compute node, the multi-nodal system may use barrier synchronization to coordinate the activities of the various tasks.
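The mapping of tasks to compute nodes described above can be illustrated with a minimal sketch. It assumes a simple "fewest tasks first" policy, which is only one of many possible scheduling policies and is not stated in the text; the names `ComputeNode`, `split_job`, and `schedule` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical compute node that records the tasks assigned to it."""
    name: str
    tasks: list = field(default_factory=list)


def split_job(job_name, num_tasks):
    """Divide a job into tasks (the naming scheme is an assumption)."""
    return [f"{job_name}/task{i}" for i in range(num_tasks)]


def schedule(tasks, nodes):
    """Assign each task to the node currently holding the fewest tasks.

    Ties go to the node that appears first in `nodes`.
    """
    for task in tasks:
        target = min(nodes, key=lambda node: len(node.tasks))
        target.tasks.append(task)
    return {node.name: list(node.tasks) for node in nodes}


nodes = [ComputeNode("node0"), ComputeNode("node1")]
placement = schedule(split_job("jobA", 3), nodes)
print(placement)
```

Here `node0` ends up with two tasks from the same job, which is the situation in which the tasks on one node may need to be coordinated with a barrier.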
Parallel processing, which distributes work among multiple concurrently executing tasks, may require synchronization between the tasks. One common way of providing this synchronization is barrier synchronization. In general, barrier synchronization requires that each task in a group of communicating tasks reach the same synchronization point (i.e., the barrier) before any task within the group can proceed beyond that point. By definition, a barrier involves a group of tasks. Once a task enters the barrier, it waits for all other members of the same group to enter the barrier before it exits.
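The barrier behavior described above can be sketched in software with Python's standard `threading.Barrier`. This is only an illustration of the semantics using threads on one machine, not the mechanism of any particular multi-nodal system; `NUM_TASKS`, `task`, and `events` are names introduced for the example.

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)  # the group consists of NUM_TASKS tasks
events = []                             # shared log of (phase, task_id) entries
events_lock = threading.Lock()


def task(task_id):
    # Work done before the synchronization point.
    with events_lock:
        events.append(("before", task_id))
    # Enter the barrier: block until every task in the group has arrived.
    barrier.wait()
    # Work done after the barrier; no task gets here early.
    with events_lock:
        events.append(("after", task_id))


threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in events]
print(phases)
```

Whatever order the threads run in, every "before" entry is logged ahead of every "after" entry, because no task leaves the barrier until the last one has entered it.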
When an application is processed in a parallel fashion, various jobs for the application are processed in parallel. Barrier synchronization provides a checkpoint mechanism that ensures that each job reaches a particular point before any of them proceeds. This checkpoint mechanism is typically implemented using data stored in a special-purpose register, the barrier synchronization register (BSR).
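To make the idea of a register-backed checkpoint concrete, the following toy software model tracks arrivals as bits in a single integer. The text does not describe the layout or access protocol of a real BSR, so the one-bit-per-participant encoding and the names `BarrierRegisterModel`, `arrive`, and `all_arrived` are assumptions made for illustration only.

```python
class BarrierRegisterModel:
    """Toy model of a barrier register: bit i is set when participant i arrives.

    This is an assumed encoding, not the layout of any actual hardware BSR.
    """

    def __init__(self, num_participants):
        self.num_participants = num_participants
        self.full_mask = (1 << num_participants) - 1  # all bits set
        self.bits = 0

    def arrive(self, participant_id):
        """Record that one participant has reached the checkpoint."""
        if not 0 <= participant_id < self.num_participants:
            raise ValueError("participant_id out of range")
        self.bits |= 1 << participant_id

    def all_arrived(self):
        """Return True once every participant has set its bit."""
        return self.bits == self.full_mask

    def reset(self):
        """Clear the register so the barrier can be reused."""
        self.bits = 0


reg = BarrierRegisterModel(3)
reg.arrive(0)
reg.arrive(2)
partial = reg.all_arrived()   # participant 1 has not arrived yet
reg.arrive(1)
complete = reg.all_arrived()  # all three bits are now set
print(partial, complete)
```

The model is single-threaded and omits the atomic updates and waiting logic that a real implementation would need; it only shows how the contents of one register can answer the question "has everyone reached the checkpoint?".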