1. Field of the Invention
This invention relates to circuits, systems, and methods for handling electronically reported events in a computing environment, such as failure reports from a managed networked computer.
2. Description of the Related Art
In today's business environment, employee productivity and customer satisfaction can be negatively impacted by system failures and delays. In order to achieve high performance, system flaws must be diagnosed and resolved in a timely manner. With highly interconnected systems, the level of complexity in problem resolution is considerable. Frequently, multiple administrators and support staff will receive identical problem notifications, which creates duplicate work and often compromises work flow effectiveness, system response, and productivity.
A situation which commonly arises in the Systems Management arena in large accounts (e.g. computer systems for very large retail stores, government agencies, etc.) is the flow of events from multiple locations to multiple “home” servers, which are assigned various tasks based upon their defined tasks. Two of the main requirements encountered in this environment are (a) the need for the events to be sent to multiple event servers, and (b) the need to provide “failover” capabilities.
To date, the standard solution has been to have the events sent to different servers based upon ‘rules’ or logic defined either at the source of the event, or as part of the infrastructure of the system at a low level. This causes several problems, the first being the assumption that the network infrastructure can handle the resulting load, which is often 4 to 5 times greater. Secondly, the event ownership and failover scheme, along with the hardware needed to power this logic, needs to be coordinated among lower level parts of the infrastructure at many points, often including different geographies with many different locations, making system configuration and deployment difficult.
Therefore, in order to have technologies integrated with optimal configuration, today's network computing enterprise requires an open, scalable and cross-platform approach. One such system solution is IBM's Tivoli (TM) Management Framework (“TMF”), which is the basis for a suite of management applications for complex computing system and network management. TMF has the following features and services:
(a) enables users to create and execute tasks on multiple Tivoli resources via a task library;
(b) provides a scheduler to run the task library;
(c) includes a Relational Data Base Management System (“RDBMS”) Interface Module (RIM) that allows other Tivoli products to write application-specific information to relational databases; and
(d) incorporates a query capability that allows users to search and retrieve data from a relational database.
Another tool that works well with TMF is IBM's Tivoli Enterprise Console (“TEC”). TEC is a sophisticated, highly automated problem diagnosis and resolution tool aimed at improving system performance and reducing support costs. TEC is a rule-based event management application that integrates system, database, network, and application management. TEC has the ability to collect, process and automatically respond to common management events such as a server failure, a lost network connection or a successfully completed batch processing job. Each TEC acts as a central collection point for alerts and events from various sources, prioritizing tasks based on the level of severity of received events, and filtering redundant or low-priority events. TEC's coordination functionality also helps identify the reviewer to process specific events to resolve issues quickly.
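The severity-based prioritization and filtering described above can be illustrated with a minimal Python sketch. It is not TEC's rule language or API; the severity names and ranking, the `prioritize` function, and the tuple layout of an event are all hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical severity ranking (an assumption); a lower number is handled first.
SEVERITY_RANK = {"FATAL": 0, "CRITICAL": 1, "WARNING": 2, "HARMLESS": 3}


def prioritize(events, min_severity="WARNING"):
    """Order events by severity, dropping redundant and low-priority ones.

    Each event is a (source, message, severity) tuple (hypothetical layout).
    """
    seen = set()
    heap = []
    for order, (source, message, severity) in enumerate(events):
        key = (source, message)
        if key in seen:
            continue  # redundant copy of an event already collected
        seen.add(key)
        if SEVERITY_RANK[severity] > SEVERITY_RANK[min_severity]:
            continue  # below the configured priority threshold
        # 'order' breaks ties so equal-severity events keep arrival order.
        heapq.heappush(heap, (SEVERITY_RANK[severity], order, (source, message, severity)))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


events = [
    ("hostA", "printer offline", "WARNING"),
    ("hostB", "server down", "FATAL"),
    ("hostA", "printer offline", "WARNING"),       # duplicate, filtered
    ("hostC", "batch job completed", "HARMLESS"),  # low priority, filtered
]
ordered = prioritize(events)
```

In this sketch the fatal server failure is returned ahead of the printer warning, and the duplicate and low-priority events are dropped.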
IBM's Tivoli Management Framework serves as the foundation for Tivoli Enterprise Console. By utilizing the framework and console together, one can manage large distributed networks with multiple operating systems, which can use different network services performing diverse system operations that can change nodes and users constantly.
In order to understand TEC's existing processes, we turn now to FIG. 3. In this illustration of a typical event process (30), a number M of event reports (34a-34m) are generated by a single event source (32) in relation to a single failure. For example, if a system's printer is out of paper, a first event report may be generated by an application program which is attempting to print a document, a second event report may be generated by the operating system of that system when it discovers the printer is off-line or not communicating, a third event report may be generated by the printer management application program, etc.
Based on the configuration of the event source (32), a copy of each of the M related event reports is transmitted via a communications network (31), such as a dial-up modem with a telephone network, to one or more specified TEC servers (33a, 33b, 33n). These multiple copies of multiple event reports (e.g. M×n copies in total) are transmitted in an effort to assure that at least one TEC server successfully receives and acts on the failure. While this increases the reliability of the response system, these duplicate reports create redundant data and duplicate effort.
Turning to FIG. 4, a wider system view of typical handling of event reports is shown, wherein each TEC server receives event reports from a plurality of event sources. In this illustration (40), multiple event reports (42) from multiple event sources (32a, 32b, . . . 32x) are sent via the communications network (31), and are received (41) by multiple TEC servers (33a-33n), including the duplicate reports discussed in conjunction with FIG. 3. So, for example in this wider illustration, a total of M×n×x event reports are transmitted between event sources and TEC servers.
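As a rough illustration of this arithmetic, the following Python sketch counts the transmitted reports under the simplifying assumption that every event source generates the same number M of related reports and copies each of them to the same number n of TEC servers. The function and parameter names are hypothetical.

```python
def total_event_reports(reports_per_failure: int,
                        servers_per_source: int,
                        event_sources: int) -> int:
    """Return M * n * x, the number of report copies sent over the network.

    Assumes (for illustration only) that M and n are identical for all sources.
    """
    if min(reports_per_failure, servers_per_source, event_sources) < 0:
        raise ValueError("counts must be non-negative")
    return reports_per_failure * servers_per_source * event_sources


# Example: M = 3 related reports, n = 4 TEC servers, x = 50 event sources.
print(total_event_reports(3, 4, 50))  # prints 600
```

Even with these modest example values, a single failure at each source results in several hundred report copies crossing the network.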
Once each event report reaches a particular TEC server, appropriate data is stored in a local event database (43a, 43b, . . . 43n). Using a distributed relational database synchronization product such as IBM's Lotus Notes (TM) and/or Domino (TM) products, the TEC server databases are periodically synchronized (44) with each other based on predefined rules and time periods.
For example, event report A (not shown) is generated from event source (32x), and copies are sent to TEC servers 1 and 2 (33a, 33b). When it is received by TEC Server 1 (33a), it is immediately stored in TEC server 1's local database (43a). Likewise, when event report A from event source (32x) is received by TEC Server 2 (33b), it is immediately stored in TEC server 2's local database (43b). At this point, neither TEC server is aware that the other has received a copy of the event report, and neither becomes aware until the next database synchronization (44) occurs, which may be minutes, hours or even days later, depending on the database synchronization schedule rules, availability of a network resource for the database synchronization process, and amount of data to be synchronized. Following the next synchronization process, TEC servers 1 and 2 are notified whether or not event report A is being resolved by another TEC server. If not, then it may fall to the local personnel to resolve the problem as a backup resource for supporting the reporting event source system.
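The delayed awareness in this example can be sketched in Python as follows. This is a toy model only, not the behavior of any Lotus Notes, Domino, or TEC interface: the `EventServer` class, its in-memory dictionary standing in for a local event database, and the `synchronize` function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventServer:
    """Hypothetical stand-in for a TEC server with a local event database."""
    name: str
    # Maps event id -> names of servers known (locally) to hold a copy.
    local_db: dict = field(default_factory=dict)

    def receive(self, event_id: str) -> None:
        # Stored immediately; only this server is recorded as a holder.
        self.local_db.setdefault(event_id, set()).add(self.name)

    def known_holders(self, event_id: str) -> set:
        return set(self.local_db.get(event_id, set()))


def synchronize(servers: list) -> None:
    """Model one periodic synchronization: merge all records into every database."""
    merged = {}
    for server in servers:
        for event_id, holders in server.local_db.items():
            merged.setdefault(event_id, set()).update(holders)
    for server in servers:
        server.local_db = {eid: set(h) for eid, h in merged.items()}


server1 = EventServer("TEC Server 1")
server2 = EventServer("TEC Server 2")
server1.receive("event report A")
server2.receive("event report A")

before = server1.known_holders("event report A")  # only TEC Server 1 is known
synchronize([server1, server2])                    # may occur much later
after = server1.known_holders("event report A")   # both servers are now known
```

Until `synchronize` runs, each server's view contains only itself, which mirrors the window during which neither server knows the other holds the same report.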
Because database synchronization is not immediate (e.g. not “realtime”), an event report may not be addressed by other backup TEC servers until after at least one synchronization period has elapsed. Further, each event source may be configured to report events to multiple TEC servers, such as 3, 4 or more TEC servers, which also implies update and synchronization delays with each of their databases before a particular TEC server would be aware that resolution of the problem falls to it and not another one of the servers.
Consequently, an extended delay in response time occurs, which is an undesirable characteristic of the current processes. Further, as multiple copies of the same event reports are sent to multiple TEC servers simultaneously, the system stores duplicate data, which can create redundant support effort in an attempt to reduce delays in resolving the problem(s). For example, if TEC Server 3 is defined as the second backup to TEC Server 2, which is the first backup to TEC Server 1 for a particular event source, and if technicians learn by experience that it may take up to 2 hours for the system to determine that neither TEC Server 1 nor TEC Server 2 is addressing a reported problem, then technicians associated with TEC Server 3 may be dispatched to resolve a problem prior to actual notification that TEC Servers 1 and 2 are not addressing the problem, in anticipation of an undesired 2 hour delay. If, however, TEC Server 1 or 2 has begun to address the problem (e.g. a technician or software resolution process has been initiated), the effort of TEC Server 3 will be redundant, wasteful, and often a source of confusion.
For these reasons, there exists a need in the art for a system and method which addresses reported events in a timely manner, maintains a system of backup servers, and deals with duplicate event reports while avoiding duplication of problem resolution efforts and confusion surrounding responsibility for problem resolution.
Additionally, there exists a need in the art for a system and method which provides real-time event synchronization between multiple event servers in order to respond to event reports immediately. Furthermore, there exists a need in the art to eliminate duplicate support efforts, thereby reducing the time, energy, and cost spent on problem resolution.