In today's fast-moving environment, business activities rely more and more on information systems, and e-business infrastructures progressively merge with internal information technology (IT) infrastructures. Ultimately, IT becomes essential to the business. As a result, companies try to monitor and manage their IT not only from a technical perspective, but also from a business perspective. Ensuring that all IT components are available and performing well is still required, but such parameters, including any outage or slowness that might occur, must ultimately be interpreted in terms of business impact. Once the dependencies between IT components and business activities are understood by both the business and the IT organization, and are effectively controlled by the management system, Service Level Management (SLM) can be implemented.
IT managers are challenged by the need to manage a growing number of IT resources, including networks, systems, databases, and applications, that are distributed over global organizations. The ability to commit to service levels and to immediately detect and adequately react to alert conditions is critical in today's heterogeneous corporations. An enterprise management solution must cope with the distributed nature of IT resources and provide fault-tolerant capabilities in a dynamic environment.
Event management remains a fundamental area in IT Enterprise Management because it is impossible to predict the many failures that occur. More and more mission-critical applications require complex, heterogeneous, and distributed resources. These inter-related resources must be monitored in order to provide IT operations with an accurate picture of the enterprise. Because the IT infrastructure is critical to businesses, it is important that problems are discovered, analyzed, and fixed as soon as possible. The goal is to minimize the impact on the core business.
The various probes or monitors that watch the distributed resources in order to detect malfunctions or changes can produce huge amounts of data in the form of events. Existing event management solutions mostly rely on either a centralized or a two-tiered architecture. Historically, the centralized solutions appeared first, but their limitations were quickly identified, which led to the development of the two-tiered solutions.
Having limited analysis and automation capabilities at the agent level can result in losing information and building an inaccurate representation of what is happening. Indeed, if the agent cannot apply complex, configurable automation and analysis itself, it has to send the information to a server that has these capabilities and let the server react. The time needed to send the information to the server and for the server to react is sometimes long enough that the situation has completely changed by the time the server is able to query for more information. Therefore, the representation of the situation that the server builds can often be completely off the mark.
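The timing problem described above can be illustrated with a minimal sketch. It is not taken from any product; the names Resource, agent_local_decision, server_decision, and round_trip_ticks are hypothetical, and time is modeled as discrete ticks purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    """Hypothetical monitored resource with one CPU sample per tick."""
    cpu_history: List[int]

    def sample(self, tick: int) -> int:
        # Clamp to the last known sample if the tick runs past the history.
        return self.cpu_history[min(tick, len(self.cpu_history) - 1)]


def agent_local_decision(resource: Resource, tick: int, threshold: int) -> bool:
    """Agent evaluates the rule at the moment the condition is observed."""
    return resource.sample(tick) > threshold


def server_decision(resource: Resource, tick: int, threshold: int,
                    round_trip_ticks: int) -> bool:
    """Server queries the resource only after the forwarding delay (assumed)."""
    return resource.sample(tick + round_trip_ticks) > threshold


# CPU is overloaded for two ticks, then recovers.
resource = Resource(cpu_history=[95, 95, 20, 20])
seen_by_agent = agent_local_decision(resource, tick=0, threshold=90)
seen_by_server = server_decision(resource, tick=0, threshold=90, round_trip_ticks=2)
```

Under these assumptions the agent observes the overload, while the server, querying two ticks later, sees a healthy resource and builds a different picture of the same incident.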
Following this fundamental trend, several products claim to offer a business-oriented operation management capability and/or an SLM capability. As such, they relate to new market segments such as the Business to IT alignment market, the Enterprise Operations Enhancement market, or, more broadly, the SLM market. But there is a need in today's environment to manage not hundreds of ‘static’ devices but many thousands of objects, all of them distributed: some are “real” in that they pertain to the IT world; others are “logical and dynamic” in that they move closer to business concepts.
A number of established vendors offer solutions that were devised in the early nineties for client/server architectures and that are insufficient for this need. Each tries to manage an environment with an architecture that is dissimilar to what is being managed. This approach has not been successful in the long run. The product suites built on this approach include a plurality of components. The following description is limited to those components that directly contribute to the event processing architecture.
One example of these product suites is sold by BMC Software Corp. under the trademark PATROL 2000. This product includes a default 2-tier architecture including the Patrol Enterprise Manager™ (PEM) and the Patrol Agents. The PEM requires a dedicated, Unix-only hardware infrastructure. Moreover, it is slow and easily overloaded. This model can be extended to a three-tier architecture by adding an intermediate component: the Patrol Operations Manager™ (POM). The POM requires a dedicated, Windows® NT-only hardware infrastructure. POMs cannot communicate with one another, either horizontally or vertically. Moreover, there is neither an embedded POM-to-POM synchronization capability nor any fault tolerance capability. The only way to “synchronize” a POM is to have a Patrol Agent forward an event to another POM. This action has to be programmed as a customized function and is not offered as a feature. Also, POMs do not implement any event correlation, only a “filter and forward to PEM” model. From a functional standpoint, the approach lacks a logical layer to combine cross-domain, cross-discipline data for meaningful business impact determination. From a technical standpoint, this product has a strict hierarchical architecture capable of only bottom-up event flows. To date, PATROL 2000™ comes with three different consoles: the PEM console, the POM console, and the Patrol Agent console. A mix of dedicated Unix and Windows® NT servers is required.
Another product suite, sold by Tivoli Corp. (see http://www.tivoli.com) under the trademark Tivoli Enterprise™, also includes a default 2-tier architecture including the Tivoli Enterprise Console™ (TEC) and, at the low end, the Distributed Monitoring™ (DM) engines complemented with TEC Adapters. The TEC also requires a dedicated hardware infrastructure. It is also slow and easily overloaded. This model can be extended to a three-tier architecture by adding an intermediate component: the Availability Intermediate Manager™ (AIM). The AIM also requires a dedicated hardware infrastructure. Notably, AIM was built out of Tivoli IT Director, a newer technology that differs from that of TEC. Some notable problems are that rules must be written to keep events synchronized and that no security or encryption is available in the communication protocol between the AIMs and TEC. In addition, DM events can only be sent to AIMs through an external process that increases the load on the managed systems. Tivoli also provides a Standalone Prolog Rule Engine™ (SPRE), which is positioned as a fast intermediate event processor with no front-end graphical user interface (GUI) and no persistent event repository. If a SPRE crashes before saving its state, all events received since the last saved state are lost and unavailable on restart. From a technical standpoint, this product has a hierarchical architecture primarily aimed at supporting bottom-up event flows. To date, Tivoli Enterprise™ comes with two different consoles: the TEC JAVA™ GUI and the AIM JAVA™ GUI. A number of dedicated Unix or Windows® NT servers are required.
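The loss window that results from combining an in-memory working set with periodic state saves can be sketched generically as follows. This is an illustration only and does not reproduce the SPRE implementation; the class InMemoryProcessor, its methods, and the sample event strings are all hypothetical.

```python
class InMemoryProcessor:
    """Hypothetical event processor with no persistent event repository."""

    def __init__(self):
        self.events = []       # working set, held only in memory
        self.saved_state = []  # stand-in for the last state written to disk

    def receive(self, event):
        self.events.append(event)

    def save_state(self):
        # Snapshot the working set; later events are not covered.
        self.saved_state = list(self.events)

    def crash_and_restart(self):
        # On restart, only the last saved snapshot can be recovered.
        self.events = list(self.saved_state)


proc = InMemoryProcessor()
proc.receive("disk full on host-a")
proc.save_state()
proc.receive("db down on host-b")        # arrives after the last save
proc.receive("link flap on router-c")    # arrives after the last save
proc.crash_and_restart()
recovered = proc.events
```

In this sketch only the event received before the save survives the restart; the two later events fall inside the loss window.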
Another product suite, sold by Micromuse Corp. under the trademark Netcool/OMNIbus, also includes a 2-tier architecture when not used as a Manager of Managers (MOM). This architecture includes the Netcool ObjectServer™ and the Netcool/Impact™ application server on the high end, and the Netcool Probes&Monitors™ on the low end. The Netcool ObjectServer™ is a high-speed, in-memory central event database. Several Netcool ObjectServers™ can be chained in a peer-to-peer ‘hierarchy’ using Netcool Gateways™ to provide bi-directional interfaces between them, with synchronization and takeover capabilities. However, those components remain “central servers” in the sense that they are not designed to build a distributed multi-layered network of processors.
Thus, there is a need for an improved method and architecture for measuring and reporting availability and performance of Business Services in today's environment, where numerous objects with moving dependencies have to be managed in large distributed infrastructures. There is also a need for an intermediate functional layer providing configurable abstraction services. There is a further need for the processing component to be able to (a) collate, correlate, or generate instrumentation and dependency events; (b) communicate and synchronize with its peers; (c) implement some form of resilience; and (d) accept dynamic data updates as a means to support environment changes. There is yet another need for the processing component to be able to play various roles throughout the management architecture, without compromising its default capabilities. There is still another need for a unique console component to be able to interact with any of the processing components, whatever role each plays in the management architecture.