§ 1.1 Field of the Invention
The present invention concerns network management systems (“NMSs”). In particular, the present invention concerns combining fault and performance management.
§ 1.2 Description of Related Art
The description of art in this section is not, and should not be interpreted to be, an admission that such art is prior art to the present invention.
As computer hardware, software and networking systems, and systems combining one or more of these systems, have become more complex, it has become more difficult to monitor the “health” of these systems. For example, FIG. 1 illustrates components of a system 100 that may be used by a so-called e-commerce business. As shown, this system may include a web interface server 110, a search and navigation server 120 associated with a product inventory database 125, a purchase or “shopping cart” server 130 associated with a user database 135, a payment server 140 associated with a credit card database 145, a transaction server 150 associated with a transaction database 155, a shipping server 180 associated with a shipping database 185, a local area network (“LAN”) 160, and a network 170 including linked routers 175. As shown, the search and navigation server 120, the purchase or “shopping cart” server 130, the payment server 140 and the transaction server 150 may communicate with one another via the LAN 160. As further shown, these servers may communicate with the shipping server 180 via the network 170.
Each of the servers may include components (e.g., power supplies, power supply backups, printers, interfaces, CPUs, chassis, fans, memory, disk storage, etc.) and may run applications or operating systems (e.g., Windows, Linux, Solaris, Microsoft Exchange, etc.) that may need to be monitored. The various databases (e.g., Microsoft SQL Server, Oracle Database, etc.) may also need to be monitored. Finally, the networks, as well as their components (e.g., routers, firewalls, switches, interfaces, protocols, etc.), may need to be monitored.
Although the system 100 includes various discrete servers, networks, and databases, the system can be thought of as offering an end-to-end service. In this exemplary system, that end-to-end service is on-line shopping, from browsing inventory, to product selection, to payment, to shipping.
Tools have been developed to monitor these systems. Such tools have come to be known as network management systems (NMSs). (The term network management systems should not be interpreted to be limited to monitoring networks—network management systems have been used to monitor things other than networks.) Traditionally, NMSs have performed either fault management, or performance management, but not both. Fault management pertains to whether something is operating or not. Performance management pertains to a measure of how well something is working and to historical and future trends.
A fault management system generates and works with “real time” events (exceptions). It can query the state of a device and trigger an event upon a state change or threshold violation. However, fault management systems typically do not store the polled data—they only store events and alerts (including SNMP traps which are essentially events). Generally, the user interface console for a fault management system is “exception” driven. That is, if a managed element is functioning, it is typically not even displayed. Generally, higher severity fault events are displayed with more prominence (e.g., at the top of a list of faults), and less critical events are displayed with less prominence (e.g., lower in the list).
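The behavior described above can be illustrated with a minimal sketch. It is not taken from any particular fault management product; the class names, the numeric severity scale, and the threshold value are all hypothetical and chosen only for illustration. The sketch polls an element, raises an event on a state change or threshold violation, keeps only the events (the polled value itself is discarded), and lists events with the most severe first.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    severity: int   # hypothetical scale: larger number = more severe
    element: str
    message: str


@dataclass
class FaultMonitor:
    threshold: float                                  # assumed single numeric threshold
    last_state: dict = field(default_factory=dict)    # element -> last known up/down state
    events: list = field(default_factory=list)        # only events are retained

    def poll(self, element: str, is_up: bool, value: float) -> None:
        """Process one polled sample; the sample itself is not stored."""
        previous = self.last_state.get(element, True)  # assume "up" until told otherwise
        if is_up != previous:
            # State change: going down is treated as more severe than recovering.
            severity = 1 if is_up else 3
            self.events.append(Event(severity, element, "up" if is_up else "down"))
        if is_up and value > self.threshold:
            # Threshold violation on an element that is still operating.
            self.events.append(
                Event(2, element, f"value {value} exceeds threshold {self.threshold}")
            )
        self.last_state[element] = is_up

    def console(self) -> list:
        """Exception-driven view: only elements with events, most severe first."""
        return sorted(self.events, key=lambda e: e.severity, reverse=True)
```

In this sketch, an element that is polled and found to be operating within its threshold never appears in the console view, which mirrors the “exception” driven display described above.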
On the other hand, performance management systems generally store all polled data. This stored data can then be used to analyze trends or to generate historical reports on the numerical data collected. A major challenge in performance management systems is storing such large amounts of data. For example, just polling 20 variables every 5 minutes from 1000 devices generates about 6 million data samples per day. Assuming each data sample requires 50 bytes of storage, about 9 GB of storage will be needed per month. Consequently, performance management systems are designed to handle large volumes of data and to perform data warehousing and reporting functions.
Performance management systems are typically batch oriented. More specifically, distributed data collectors generally poll data and periodically (e.g., each night) feed it to a centralized database. Because the centralized database can become very large, database management is a prime concern in such products.
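The storage estimate above can be checked with a short calculation. The function names below are hypothetical, and a 30-day month is assumed; the other figures are the ones given in the example.

```python
def samples_per_day(variables: int, devices: int, poll_interval_min: int) -> int:
    """Number of data samples collected per day at a fixed polling interval."""
    polls_per_day = (24 * 60) // poll_interval_min   # 288 polls at 5-minute intervals
    return variables * devices * polls_per_day


def storage_bytes(samples: int, bytes_per_sample: int, days: int) -> int:
    """Raw storage needed to keep every sample for the given number of days."""
    return samples * bytes_per_sample * days


daily_samples = samples_per_day(variables=20, devices=1000, poll_interval_min=5)
monthly_bytes = storage_bytes(daily_samples, bytes_per_sample=50, days=30)  # assumed 30-day month

# 5,760,000 samples per day (about 6 million) and
# 8,640,000,000 bytes per month (about 9 GB, using 1 GB = 10**9 bytes).
```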
As can be appreciated from the foregoing, conventional fault management systems are limited in that they do not store data gathered for later use in performance analysis. Conventional performance management systems are limited in that they require huge amounts of storage. Furthermore, since data is batched and sent to a centralized location for storage, the stored data can become “stale” if enough time has elapsed since the last batch of data was stored.
Furthermore, most enterprises currently use at least two products for information technology management. It is common to find several independent products being used by various departments within an enterprise to meet the basic needs of monitoring and performance management across networks, servers and applications. Moreover, since the performance and fault monitoring systems are disjoint, correlating data from these different systems is not trivial.
Recognizing that correlation between the collective information technology (“IT”) infrastructure and business service is needed, several Manager of Manager (“MoM”) tools have appeared in the market. These products interface with the various well-known commercial tools and try to present a unified view to IT managers. Unfortunately, however, such integration is complex and requires depending on yet another product, which needs to be learned and which must be supported each time an underlying tool is updated. The addition of yet another tool just adds to the operational costs rather than reducing them.
In view of the foregoing limitations of existing network management systems, there is a need to simplify the processing related to monitoring faults and performance. There is also a need to monitor the faults and performance of an end-to-end service. Such needs should be met by a technique or system that is simple to install and administer, that has real-time capabilities, and that scales well in view of the large amount of data storage that may be required by a performance management system. Finally, there is a need to provide different users with different levels of monitoring, either for purposes of security, for purposes of software licensing, or both.