This invention relates generally to computer networks and, more specifically, to a system for coordinating the collection and storing of information from multiple applications and processes.
Organizations, including businesses, governments and educational institutions, increasingly rely on computer networks to share and exchange information. A computer network typically comprises a plurality of interconnected entities. An entity may consist of any device, such as a host or server, that sources (i.e., transmits) and/or receives messages. A common type of computer network is a local area network ("LAN") which typically refers to a privately owned network within a single building or campus. In many instances, several LANs may be interconnected by point-to-point links, microwave transceivers, satellite hook-ups, etc. to form a wide area network ("WAN") or intranet that may span an entire city, country or continent. An organization employing multiple intranets, moreover, may interconnect them through the Internet. Remote users may also utilize the Internet to contact and exchange information with the organization's intranet.
One or more intermediate network devices are often used to couple LANs together and allow the corresponding entities to exchange information. For example, a bridge may be used to provide a "bridging" function between two or more LANs or a switch may be utilized to provide a "switching" function for transferring information between a plurality of LANs. A router is often used to interconnect LANs executing different LAN standards, to interconnect two or more intranets and/or to provide connectivity to the Internet. Routers typically provide higher level functionality than bridges or switches.
In many computer networks, applications or processes are distributed across numerous workstations and servers. For example, due to the complexity of many computer networks, network management applications have been developed to assist administrators in the configuration of their networks. These network management applications also facilitate the identification and correction of faults, and assist administrators in maintaining a high level of network performance. Examples of network management applications include HP OpenView® from Hewlett-Packard Co. of Palo Alto, Calif. and NetView 6000 from International Business Machines Corp. of Armonk, N.Y., each of which provides a suite of applications and processes for collecting, storing and displaying network information. These network management applications are typically distributed across several workstations or servers within the network, in part, because their processor and memory requirements often exceed the capabilities of a single workstation or server. Each instance of these applications, moreover, may be responsible for a different area or region of the respective computer network.
FIG. 1 is a highly schematic block diagram of a conventional computer network 100 including three workstations 102, 104 and 106, each running an instance of a network management application 108a, 108b and 108c, respectively, such as HP OpenView or IBM's NetView. Each instance 108a-c of the network management application, moreover, has been configured to acquire information from and to manage various network devices disposed throughout the computer network 100. Application 108a, for example, communicates with and obtains information from network devices 110, 112, 114 and 116, as indicated by arrows 118a-d. Application 108a may utilize the well-known Simple Network Management Protocol (SNMP) for acquiring information. Application 108b at workstation 104 acquires information from network devices 116, 120 and 122, as indicated by arrows 124a-c, and application 108c at workstation 106 acquires information from network devices 126, 128, 130 and 132, as indicated by arrows 134a-d.
These instances 108a-c of the network management application may also implement various processes for collecting particularized information or data. Application 108a, for example, may implement a trap receiver process 142. Traps are messages that are created in response to exceptional occurrences at devices in the network, such as illegal access, network connections transitioning to an inoperable state, loss of connectivity with neighboring devices, etc. Application 108b may implement a polling process 146 that periodically polls network devices 116, 120 and 122, and obtains data therefrom. A network topology process 150 may be implemented by application 108c at workstation 106 for discovering the various hosts, devices and communications links making up network 100.
Most application programs, including applications 108a-108c and processes 142, 146 and 150, can be configured to generate and locally store error, tracking, tracing, and other information. This information relates to the running or operation of the respective application or process, and is used to debug the application or process and trace faults or errors. For example, applications 108a-108c may each create a local file 136, 138 and 140, respectively, for storing error, auditing, tracing or other such information generated by the corresponding application 108a-108c. Trap receiver process 142 may similarly create a local file 144 in which it stores auditing information. Polling process 146 may create a local trace file 148 in connection with its polling of devices 116, 120 and 122. Network topology process 150 may create a local error file 152 for storing its error information. In addition, conventional computer workstations and servers typically include basic facilities for monitoring errors and other events occurring in the distributed applications that they are running. For example, UNIX workstations typically include a system log (syslog) daemon, which runs continuously under the operating system. The syslog daemon logs messages regarding discontinuous events, such as errors, warnings and state transitions that occur at that workstation. The syslog daemon writes the messages to a log file located at the workstation. Workstations and servers also include their own trap or interrupt facilities that record exceptional events. Each of these facilities may also have their own directories and files at the workstation or server for recording information.
During normal operation, applications and processes are generally configured so as to not log error, tracking and trace information. Casual logging of such information can consume significant resources and thus severely impact the performance of the application or process. However, when error conditions manifest, the error, tracking and trace facilities are enabled so as to ascertain the problem. Typically, the activation of such facilities must be performed on a per application basis. That is, commands are entered at the particular machine or workstation at which the subject application or process is running. Alternatively, the application or process may be stopped and its start-up configuration parameters changed so as to enable the desired facilities. After these configuration changes are saved, the application is re-started.
Although the distribution of applications, such as application 108, and processes across many workstations or servers typically improves accessibility and efficiency, it complicates the task of troubleshooting faults and error conditions. That is, with application 108 distributed across numerous machines 102-106, an error manifesting at one location (e.g., workstation 102) may actually be the result of a problem at some other location (e.g., workstation 104). In order to track down such problems, administrators and service personnel are forced to go to each machine included within the distributed system and configure each application or process to generate the appropriate log messages. The administrator or service personnel must then examine these files located at each machine. That is, to troubleshoot errors in distributed network management applications, administrators must typically activate and then examine the trap, log, poll, trace and other files at each workstation running an instance of the network management application. The administrator may, for example, need to examine files 136 and 144 at workstation 102, files 138 and 148 at workstation 104, and files 140 and 152 at workstation 106, among others.
Each workstation, moreover, may store the information for a given application or process in a different directory and/or with a different format. That is, file 136 at workstation 102 may be in a different directory and may contain different information or information in a different format from file 138 at workstation 104, depending on the particular software or version that is running at each machine. These varying storage and formatting conditions further complicate the task of troubleshooting problems. In addition, the closing and re-starting of the subject applications or processes, which is often required to enable the desired facilities, sometimes causes the system to change such that the problem no longer manifests itself. Indeed, as applications and processes are distributed across more and more heterogeneous machines, the task of troubleshooting and correcting problems can become almost unmanageable.
It is an object of the present invention to provide a system and method for organizing the collection of error, trace, audit and other information generated by distributed applications and processes.
It is a further object of the present invention to provide a system and method for centralizing the storage of error, trace, audit and other information generated by distributed applications.
It is a further object of the present invention to allow users to customize the error, trace, audit and other information centrally collected from distributed applications.
It is a further object of the present invention to allow users to selectively activate from a central point the collection and centralized storage of particular error, trace, audit and other information.
A still further object of the present invention is to allow users to enable the logging of error, trace, audit and other information without having to close and re-start the corresponding application process.
Briefly, the present invention is directed to a system and method for coordinating the organization, collection and storage of error, trace, audit and other such information in a computer network. According to the invention, a plurality of "debug" objects are established for collecting particularized information from heterogeneous applications or processes. Instantiations of one or more of these debug objects preferably exist at selected applications or processes, which may be distributed across multiple network entities, such as servers or workstations. Each network entity also includes a novel, extendable logging service layer that is in communicating relationship with the application or process, and is configured to provide common formatting and information storage services. The logging service layer includes a communications resource, one or more state machine engines and a callback generator. Upon initialization, the selected applications or processes issue methods or calls to the respective logging service layer identifying their one or more debug objects. The callback generator establishes a callback that identifies the application or process. In response to receiving or obtaining error, trace, audit or other information, the application or process preferably issues a method or call to one or more of its debug objects. The debug object passes this information to the logging service layer, which, in turn, decides whether or not to forward it to a selected logging facility. In the illustrative embodiment, there is a single, centralized logging facility within the network. The forwarding of information depends on the state of the debug object. If the debug object is in an enabled state, as determined by the state machine engine, then the logging service layer directs the communications resource to forward the information to the centralized logging facility. If the debug object is disabled, the information is discarded by the logging service layer.
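The forwarding decision described above may be illustrated by the following minimal Python sketch. It is not the implementation of the invention. The names LoggingServiceLayer, register, set_state and submit are hypothetical, a plain dictionary stands in for the state machine engine, a caller-supplied function stands in for the communications resource, and the choice to start every debug object in the disabled state is an assumption.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class State(Enum):
    DISABLED = "disabled"
    ENABLED = "enabled"


class LoggingServiceLayer:
    """Hypothetical per-entity layer that gates forwarding on debug-object state."""

    def __init__(self, forward: Callable[[str, str, str], None]) -> None:
        # 'forward' stands in for the communications resource (assumption).
        self._forward = forward
        # A dictionary stands in for the state machine engine (assumption).
        self._states: Dict[Tuple[str, str], State] = {}

    def register(self, app: str, debug_object: str) -> None:
        # Issued by an application at initialization to identify a debug object.
        # Starting in the disabled state is an assumption of this sketch.
        self._states.setdefault((app, debug_object), State.DISABLED)

    def set_state(self, app: str, debug_object: str, state: State) -> None:
        key = (app, debug_object)
        if key not in self._states:
            raise KeyError(f"unregistered debug object: {key}")
        self._states[key] = state

    def submit(self, app: str, debug_object: str, message: str) -> bool:
        """Forward the message if the debug object is enabled, else discard it."""
        if self._states.get((app, debug_object)) is State.ENABLED:
            self._forward(app, debug_object, message)
            return True
        return False


# Usage: a list stands in for the centralized logging facility.
received = []
layer = LoggingServiceLayer(lambda app, obj, msg: received.append((app, obj, msg)))
layer.register("abc", "trace")
layer.submit("abc", "trace", "discarded while disabled")
layer.set_state("abc", "trace", State.ENABLED)
layer.submit("abc", "trace", "forwarded once enabled")
```

In this sketch the application always calls submit, and the cost of a disabled debug object is a single dictionary lookup, which reflects the idea that logging can be switched on without closing and re-starting the application.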
At the centralized logging facility, received information is time-stamped and appended to a primary log file along with the application's name and the name of the network entity at which the application is running.
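By way of illustration only, one possible record layout for the primary log file is sketched below in Python. The tab-separated field order, the ISO 8601 UTC time stamp and the function name append_record are assumptions made for this example; the description above specifies only that each entry carries a time stamp, the application's name and the name of the network entity.

```python
import io
from datetime import datetime, timezone
from typing import Optional, TextIO


def append_record(log: TextIO, entity: str, app: str, message: str,
                  now: Optional[datetime] = None) -> str:
    """Append one time-stamped entry to the log and return the line written.

    The field layout (time stamp, entity, application, message) is hypothetical.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    line = f"{stamp}\t{entity}\t{app}\t{message}\n"
    log.write(line)
    return line


# Usage: an in-memory buffer stands in for the primary log file.
primary_log = io.StringIO()
append_record(primary_log, "123", "abc", "poll of device 116 timed out")
```

Because every entry shares one layout regardless of which application produced it, a single reader can process information from heterogeneous applications without per-application parsing.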
To retrieve the error, trace, audit and other such information generated by any application or process within the network, an administrator simply requests the desired information from the centralized logging facility. In response, the centralized logging facility retrieves the requested information and forwards it to the administrator. In a further aspect of the invention, the administrator may change the state associated with any debug object at a selected application or process. In particular, upon obtaining the respective callback for the selected application, the administrator communicates with the corresponding state machine engine via the communications resource and directs it to change state. For example, an administrator may be interested in obtaining trace messages from an instance of application program "abc" running at network entity "123". The administrator directs the corresponding trace message debug object for application program "abc" to transition to the enabled state, thereby causing trace message information to be forwarded to and stored by the centralized logging facility. The administrator may then retrieve this information at any time. After reviewing the information, the administrator may change the state of the trace message debug object to disabled, stopping the flow of trace message information from this instance of application program "abc" to the centralized logging facility.
In a still further aspect of the invention, the administrator may also obtain information directly from the applications or processes. In particular, the administrator may issue a GetDebugObjects service request to a particular instance of a distributed application. The GetDebugObjects service request is captured by the communications resource, which, in turn, returns to the administrator the list of debug objects that have been instantiated at the selected application. The administrator may use this information to set the states of the various debug objects at the application.
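The interaction described above, in which the administrator first lists the debug objects at an application and then sets their states, may be sketched as follows. This Python fragment is illustrative only. The class ApplicationInstance and the methods get_debug_objects, set_enabled and is_enabled are hypothetical names, the request is modeled as a direct method call rather than a network message, and the initial disabled state is an assumption.

```python
from typing import Dict, List


class ApplicationInstance:
    """Hypothetical stand-in for one instance of a distributed application."""

    def __init__(self, name: str, entity: str, debug_objects: List[str]) -> None:
        self.name = name
        self.entity = entity
        # Every debug object starts disabled in this sketch (assumption).
        self._enabled: Dict[str, bool] = {obj: False for obj in debug_objects}

    def get_debug_objects(self) -> List[str]:
        # Models the reply to a GetDebugObjects service request.
        return sorted(self._enabled)

    def set_enabled(self, debug_object: str, enabled: bool) -> None:
        if debug_object not in self._enabled:
            raise KeyError(f"unknown debug object: {debug_object}")
        self._enabled[debug_object] = enabled

    def is_enabled(self, debug_object: str) -> bool:
        return self._enabled[debug_object]


# Usage: list the debug objects at application "abc" on entity "123",
# then enable only the trace debug object.
app = ApplicationInstance("abc", "123", ["trace", "error", "audit"])
for name in app.get_debug_objects():
    app.set_enabled(name, name == "trace")
```

Listing the debug objects first means the administrator does not need prior knowledge of which debug objects a given application has instantiated before choosing which ones to enable.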