A collection of autonomous machines connected by a network and unified by appropriate software and management policies is itself a computer system. This concept of a network-based computer system is emerging as a significant paradigm in the computer industry. Network-based computing encompasses more than client-server computing and its emphasis on a two-party relationship. In network-based computing, a service (or resource) becomes a more general concept and no longer needs to be tied to a single machine; rather, it becomes a feature of the whole network-based computer. A network-based computer, or environment, can arise in a number of contexts, such as in a heterogeneous collection of user workstations and server machines on a local area network; in a special-purpose "clustered" machine consisting of individual processors connected by a high-speed network; or in a campus, enterprise, or global network connecting several such environments together.
An important component in all of these environments is a resource management system. Such a system should have the capability of locating, allocating, and delivering resources or services while respecting policy requirements for load balancing, fair-share scheduling, and optimal usage of resources. The facilities of traditional client-server computing allow a client to look up a service and contact a particular server. They do not specify the policies or mechanisms needed to effectively allocate services to clients or to arbitrate between clients requesting services. Furthermore, they do not include the notion of a service provided by a collection of machines. What is missing is a scheduling capability and a better means for describing resources.
Much of the work on global scheduling and load balancing (or load sharing) is theoretical in nature. See, for example, the article entitled "A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing" by Eager et al. in Performance Evaluation, 1986, pp. 53-68. Also see the article entitled "Load Sharing in Distributed Systems" by Wang et al. in IEEE Transactions on Computers, Vol. C-34, No. 3, March 1985, pp. 204-217. An article entitled "Finding Idle Machines in a Workstation-Based Distributed System" by Theimer et al., IEEE Transactions on Software Engineering, Vol. 15, No. 11, Nov. 1989, pp. 1444-1458, compares centralized and decentralized scheduling. There are also papers describing experiences with such systems, for example, the following:
Goscinski et al., "Resource Management in Large Distributed Systems," Operating Systems Review, Vol. 24, No. 4, Oct. 1990, pp. 7-25.
Litzkow et al., "Condor - A Hunter of Idle Workstations," in Proc. 8th International Conference on Distributed Computing Systems, IEEE, 1988, pp. 104-111.
Silverman et al., "A Distributed Batching System for Parallel Processing", Software-Practice and Experience, Vol. 19, No. 12, Dec. 1989, pp. 1163-1174.
Commercial implementations are also becoming available. Some of these include NQS/Exec from The Cummings Group, VXM's Balans, IDS's Resource Manager, and HP's Task Broker. These approaches solve similar problems, and each chooses either a centralized or a decentralized design. Many of the centralized approaches, other than IDS's Resource Manager, which sits on top of ISIS, do not provide fault tolerance. The Goscinski et al. article cited above discusses scalability issues and overall architectural questions, including how a hierarchy of scheduling domains might be constructed. The system it describes uses decentralized scheduling in the local domain despite agreeing that a centralized approach will scale better.
U.S. Pat. No. 4,827,411 of Arrowood et al. illustrates a network resource management system in which each node maintains a copy of the network topology database defining network resources. U.S. Pat. No. 4,747,130 of Ho discloses a distributed processing system which utilizes a local resource database at each processor which contains data about the availability of system resources. U.S. Pat. No. 4,800,488 of Agrawal et al. illustrates a computer with a resource availability database that is not centralized. U.S. Pat. No. 4,835,673 of Rushby et al. describes a local area network system with an administrator which allocates or deallocates resources. This system broadcasts availability and is not centralized. U.S. Pat. No. 4,890,227 of Watanabe et al. describes a resource management system which allocates resources using data stored in a knowledge database. U.S. Pat. No. 4,727,487 of Masui et al. illustrates resource allocation in a computer system in which the resource management function includes a policy-making function. U.S. Pat. No. 4,914,571 of Baratz et al. describes a system which locates resources in a network.