A collection of computers connected by an electronic network and running the appropriate software can be used as a single computing system. In network-based computing, a computing resource is not necessarily tied to a single machine. Rather, it becomes a feature of the whole network-based computer. A network-based computer can exist in a number of contexts. It may be a heterogeneous collection of user workstations and server machines on a local area network, or a special-purpose cluster of machines with individual processors connected by a high-speed network. An enterprise or global network connecting several such environments can also be considered a single computing resource.
This model can create a high performance computing platform that consists of a set of workstations, processors, or PCs connected by a high-speed network such as Fast Ethernet, Gigabit Ethernet, or Myrinet. Each processor or computer in the system is usually called a node. One node in the system can act as a front-end node that controls system operation. The other nodes are called computing nodes, and their function is to perform the computation. Each node can run a full multitasking operating system (e.g., Linux or UNIX), and each user can log in to any node to work. A single system view is achieved through a shared file system such as NFS (Network File System) and a shared information system such as NIS (Network Information System). Many systems also have a parallel computing software library such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface) installed to form a network computing system.
One component that such environments usually include is a resource management system, also called a batch scheduling system. This tool helps users effectively apply high performance computing systems to their computing needs. Such a system generally has the capability of locating, scheduling, allocating, and delivering resources or services while respecting policy requirements for load balancing, fair-share scheduling, and optimal usage of resources. Batch scheduling systems have been used in many organizations to obtain supercomputing power at an affordable cost.
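The policy requirements named above can be illustrated with a minimal sketch. It assumes a very simplified model that is not taken from any particular scheduler: fair-share is approximated by serving users with the least recorded usage first, and load balancing by placing each job on the node with the most free CPUs. The names `Job`, `schedule`, `usage`, and `free_cpus` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    cpus: int  # number of CPUs the job requests


def schedule(jobs, usage, free_cpus):
    """Place jobs on nodes under two illustrative policies.

    jobs      -- pending jobs in submission order
    usage     -- assumed record of past usage per user (any consistent unit)
    free_cpus -- mapping of node name to currently free CPUs
    Returns (placements, waiting): (job name, node) pairs and unplaced job names.
    """
    free = dict(free_cpus)  # work on a copy so the caller's data is untouched
    placements, waiting = [], []
    # Fair-share (simplified): users with less past usage are served first.
    # sorted() is stable, so jobs of equal usage keep their submission order.
    for job in sorted(jobs, key=lambda j: usage.get(j.user, 0.0)):
        # Load balancing (simplified): choose the node with the most free CPUs.
        node = max(free, key=free.get)
        if free[node] >= job.cpus:
            free[node] -= job.cpus
            placements.append((job.name, node))
        else:
            waiting.append(job.name)  # does not fit anywhere right now
    return placements, waiting
```

For instance, with jobs from a heavy user and a light user, the light user's job is placed first, and a job that is larger than any node's free capacity remains in the waiting list. Production schedulers use considerably richer policies, such as decayed usage, reservations, and backfilling, which this sketch deliberately omits.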
Without a good batch scheduling program, a user has to know the details of the computing system in order to run jobs across multiple nodes in the cluster. With such a program, jobs are distributed across the nodes transparently, and resource allocation can be configured so that resource utilization is optimal.
Many implementations of batch scheduling systems have become available. Some of these include PBS from NASA, NQS as modified by several commercial groups, IDS's Resource Manager, and HP's Task Broker. These implementations solve similar problems and choose either a centralized or a decentralized approach. Many of the centralized approaches do not provide a significant amount of fault tolerance or flexibility. Centralized scheduling can be used in the local domain even though a decentralized approach is likely to scale better.
In a cluster processing system, batch scheduling has several important roles. First, the batch scheduling system helps users manage their job submissions. Second, the batch system controls the distribution and allocation of system resources, such as the allocation of computing nodes to each user's task, to maximize performance and optimize resource usage. Third, the system controls cluster functions according to a defined system policy. For example, some systems may restrict memory and disk usage for each user depending on the user's priority.
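The three roles can be sketched in a few lines of code. This is a toy model under stated assumptions, not the design of any existing batch system: the class `BatchSystem`, its methods, and the `MEMORY_LIMIT_MB` table are hypothetical, jobs are dispatched in simple first-in-first-out order, and the only policy enforced is a per-user memory cap that depends on the user's priority class.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count

# Hypothetical policy table: memory cap in MB for each user priority class.
MEMORY_LIMIT_MB = {"high": 8192, "normal": 2048}


@dataclass
class Job:
    job_id: int
    user: str
    nodes: int       # number of computing nodes requested
    memory_mb: int   # memory requested


class BatchSystem:
    def __init__(self, compute_nodes, user_priority):
        self.idle = list(compute_nodes)      # computing nodes not in use
        self.user_priority = user_priority   # user name -> priority class
        self.queue = deque()                 # submitted jobs awaiting nodes
        self.running = {}                    # job id -> nodes assigned
        self._ids = count(1)

    def submit(self, user, nodes, memory_mb):
        """Roles 1 and 3: check the request against policy, then queue it."""
        limit = MEMORY_LIMIT_MB[self.user_priority.get(user, "normal")]
        if memory_mb > limit:
            raise ValueError(f"{user} may request at most {limit} MB")
        job = Job(next(self._ids), user, nodes, memory_mb)
        self.queue.append(job)
        return job.job_id

    def dispatch(self):
        """Role 2: start queued jobs in FIFO order while idle nodes suffice."""
        started = []
        while self.queue and self.queue[0].nodes <= len(self.idle):
            job = self.queue.popleft()
            self.running[job.job_id] = [self.idle.pop() for _ in range(job.nodes)]
            started.append(job.job_id)
        return started

    def finish(self, job_id):
        """Return a completed job's nodes to the idle pool."""
        self.idle.extend(self.running.pop(job_id))
```

In this sketch a user submits work through `submit`, the front-end calls `dispatch` whenever nodes may have become free, and `finish` releases nodes when a job completes. Strict FIFO dispatch means a large job at the head of the queue can block smaller ones behind it; that choice keeps the example short and is not a recommendation.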