The present invention relates to both a method and an apparatus for executing a distributed algorithm or service on a Simple Network Management Protocol version 1 (SNMPv1) based computer network. In the Local Area Network (LAN) environment, particularly in networks based on Transmission Control Protocol (TCP) and Internet Protocol (IP), SNMPv1 has emerged as a standard tool for managing network devices. SNMPv1 normally operates by having one or more central manager nodes oversee multiple agent nodes as shown in FIG. 1. As depicted, each agent node 2 supports a local, tree-structured database, called a Management Information Base (MIB) 3, and software that allows a valid manager node 1 to access information in MIB 3. Agent node 2 responds to command messages sent by manager node 1. Messages that can be sent by manager node 1 to agent node 2 include: "Get," which is sent to read certain locations in MIB 3; "GetNext," which is similar to Get but reads the location that follows the specified location in the ordering of MIB 3, so that the manager can traverse the tree; and "Set," which is sent to write information to a location in MIB 3. Messages that may be sent by agent node 2 to manager node 1 include: "GetResponse," which is sent in response to a Get, GetNext, or Set command and returns information to manager node 1; and "Trap," which is sent asynchronously or, in other words, upon the occurrence of a predetermined event. Certain Traps are predefined by SNMPv1. Other Traps are "enterprise specific," which means they can be defined to carry information specific to a particular algorithm or service.
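The manager-to-agent exchange described above can be illustrated with a minimal sketch. It is a toy in-memory model written for this explanation, not an SNMPv1 implementation: the class name `ToyAgent`, its method names, and the tuple used to represent a GetResponse are all hypothetical, and no encoding, transport, or community-string checking is modeled. The only protocol details relied upon are that MIB locations are identified by object identifiers ordered lexicographically and that an agent answers Get, GetNext, and Set with a GetResponse.

```python
from bisect import bisect_right


class ToyAgent:
    """Hypothetical in-memory agent node; illustrative only."""

    def __init__(self, mib):
        # The MIB is modeled as a dict mapping object identifiers (tuples
        # of integers) to values. Python compares tuples lexicographically,
        # which matches the ordering of a tree-structured MIB.
        self.mib = dict(mib)

    def get(self, oid):
        """Read one location; reply with a GetResponse-style tuple."""
        if oid not in self.mib:
            return ("GetResponse", "noSuchName", oid, None)
        return ("GetResponse", "noError", oid, self.mib[oid])

    def get_next(self, oid):
        """Read the first location that follows `oid` in MIB order."""
        oids = sorted(self.mib)
        index = bisect_right(oids, oid)
        if index == len(oids):
            # Nothing follows `oid`: the traversal has reached the end.
            return ("GetResponse", "noSuchName", oid, None)
        following = oids[index]
        return ("GetResponse", "noError", following, self.mib[following])

    def set(self, oid, value):
        """Write one existing location; reply with a GetResponse-style tuple."""
        if oid not in self.mib:
            return ("GetResponse", "noSuchName", oid, None)
        self.mib[oid] = value
        return ("GetResponse", "noError", oid, value)


# Illustrative object identifiers (sysDescr.0 and sysName.0) and values.
SYS_DESCR = (1, 3, 6, 1, 2, 1, 1, 1, 0)
SYS_NAME = (1, 3, 6, 1, 2, 1, 1, 5, 0)

agent = ToyAgent({SYS_DESCR: "example device", SYS_NAME: "node-a"})
first = agent.get_next((1, 3, 6, 1, 2, 1, 1))  # walk starts at a subtree prefix
second = agent.get_next(first[2])              # continue from the returned location
```

Repeating GetNext with the identifier returned by the previous response walks the whole MIB without the manager knowing its contents in advance, which is why GetNext is more than a variant of Get.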
Although commonly used, a centralized manager configuration has several shortcomings. For example, it concentrates communication overhead in the vicinity of the management station. A centralized manager also constitutes a single point of failure. That is, if the manager goes down, management of the entire system goes with it. The problems facing diagnostic algorithms exemplify other limitations of a traditional SNMPv1 based network. Fault detection in SNMPv1 is limited to two methods: polling and trap-notification. In polling, managers poll agents to detect failed nodes. In a large network, however, the polling interval can become excessive, leading to large diagnostic latencies. Alternatively, in trap-notification, the agents inform the central manager of any failure. This, however, requires the device to remain partially operational under failure, which tends to be unreliable in SNMPv1. Additionally, centralized management systems rely on "multi-hop communication," in which a failure at an intermediate node may mask the fault state of the monitored node. These problems are solved through distributed diagnosis.
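The growth of diagnostic latency with network size can be made concrete with a small calculation. The sketch below rests on assumptions that are not taken from the text: a single manager polls its agents strictly one at a time in a fixed round-robin order, and every poll, including one that times out against a failed agent, occupies the same fixed time. The function name and parameters are hypothetical.

```python
def worst_case_detection_latency(agent_count, seconds_per_poll):
    """Worst-case delay between an agent failing and the manager noticing.

    Assumes sequential round-robin polling with a fixed time per poll.
    An agent that fails immediately after being polled is not examined
    again until every other agent has been polled, and the failure is
    only known once the agent's own next poll has finished.
    """
    if agent_count < 1 or seconds_per_poll <= 0:
        raise ValueError("need at least one agent and a positive poll time")
    other_agents = (agent_count - 1) * seconds_per_poll
    own_poll = seconds_per_poll
    return other_agents + own_poll


# Illustrative figures only: 50 agents versus 5000 agents at 2 s per poll.
small_network = worst_case_detection_latency(50, 2)
large_network = worst_case_detection_latency(5000, 2)
```

Under these assumptions the worst-case latency grows linearly with the number of agents, so a hundredfold increase in network size produces a hundredfold increase in the time a failure may go unnoticed.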
A large body of theoretical results exists in the area of system-level diagnosability and distributed diagnosis. Recently, these results have been applied in real systems. One of the most advanced applications to date was achieved by Ronald P. Bianchini, Jr. and Richard W. Buskens as described in "Implementation of On-Line Distributed System-Level Diagnosis Theory," IEEE Transactions on Computers, Vol. 41, No. 5, p. 616 (May 1992). This paper documents an early application of on-line distributed system-level diagnosis theory using Adaptive-Distributed System Diagnostics (ADSD). Key results of this paper include: an overview of earlier distributed system-level diagnosis algorithms, the specification of a new adaptive distributed system-level diagnosis algorithm, its comparison to previous centralized adaptive and distributed non-adaptive schemes, its application to an actual distributed network environment, and experimentation within that environment.
The system described in Bianchini et al. uses a Berkeley socket interface and Ethernet IP/UDP protocols to facilitate ADSD. These protocols, however, may be impractical in the long run. In the LAN environment, SNMPv1 is the standard protocol for managing network devices. Yet, to date, SNMPv1 is not fully distributed. SNMPv1 only performs fault diagnosis via a centralized manager. Furthermore, SNMP version 2 offers greater distributed control but still maintains a hierarchical arrangement as shown in FIG. 2. One top manager 21 manages several secondary agent/managers 22, one of which, in turn, manages third-level agent/managers 23, and so on until nodes are reached which act as dedicated agents 24. Therefore, a need exists for SNMPv1 to run fully distributed algorithms and services. The present invention fulfills this need.