1. Field of the Invention
This invention relates to a method of predicting and displaying resource capacity information relating to computer resources. More specifically, this invention relates to how information is collected from network devices to predict when critical resources will exceed maximum capacity, how it is stored for future use, and how it is displayed for the user.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the US Patent and Trademark Office file or records, but otherwise reserves all rights provided under copyright law.
2. Background and Benefits of the Invention
Computer networks have experienced explosive growth in terms of size and complexity during the past ten years. Network designers have pushed the envelope of network capacity using innovative architectures. Asynchronous Transfer Mode (ATM), Fiber Distributed Data Interface (FDDI), Frame Relay, Ethernet, and token ring have aided the evolution of network computing resources from simplistic text-based mainframe software to sophisticated client-server imaging software, databases, and business-oriented applications in ever-growing geographical distribution. Both users and vendors have simultaneously demanded and provided networking technologies of greater and greater performance.
As networks grow, their complexity grows not in proportion to network size but exponentially. This cycle has created a nightmare for network managers who are required to control, manage, and maintain computer networks for their users. At a minimum, these users demand adequate levels of network performance to do their jobs, a demand that is not simple to meet.
As a result, network performance management has been a crucial area in which vendors have fielded a variety of tools to monitor, analyze, and model network performance.
Often, these tools are used in a reactive mode to monitor or troubleshoot a current (or recurring) problem. Thus the typical solution is to monitor a network and do one of two things:
Send an alarm when a monitored value exceeds a certain threshold; or
Monitor the network over a long period of time and try to predict when a threshold will be exceeded.
Unfortunately, these attempted solutions have not been successful. The inventors have discovered that the reasons are as follows:
1. An element exceeding a threshold for a brief moment is not necessarily a problem. Depending on how acutely you monitor the element, it could exceed a preselected threshold very frequently and still not cause a problem.
2. Solutions that attempt to predict future values have typically been unsuccessful because they base their calculations on historical values of mathematical averages. The average utilization of a network element may be well within acceptable ranges, but the upper operating range of the element during certain time intervals makes the element perform poorly in given circumstances. By the same token, projecting peak utilization is similarly insufficient, as it does not indicate how long peak utilization was maintained.
3. Typically, monitoring networks involves monitoring the devices that make up the networks, or more specifically, specific values maintained on the device, and graphing that value over time. Unfortunately, this rarely yields a view of the network as the user considers it. For example, a user may use three separate 1.5 Mbps WAN lines multiplexed together to connect two sites together. A typical monitoring solution would monitor each WAN line individually, yielding three separate graphs of utilization. In reality, the user considers these three WAN links to be a single 4.5 Mbps resource, not three separate resources. In such a situation the first WAN link may be completely utilized, the second moderately utilized, and the third minimally utilized. Looking at three separate graphs would just cause confusion.
Threshold-based systems have emerged time and time again to overcome a fundamental shortcoming of data used to determine if network performance has deteriorated. Ultimately, these systems try to determine when a resource is completely saturated with activity. Due to the “bursty” nature of data networks, a network can be completely saturated for short periods of time, followed by longer periods of inactivity. If a monitoring system were to sample often enough to actually record such occurrences, it would need terabytes of storage to collect enough samples to demonstrate a longer-term trend. Expanding the sample period to allow a more manageable data set causes averages that mask a potentially serious burst. When longer sample periods are used, the chance that any single sample might reflect 100% utilization decreases, as the resource would have to be used 100% during the entire sample period in order to return a 100% value.
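The masking effect of longer sample periods can be illustrated with a minimal sketch. The data, the window sizes, and the function name `average_over_windows` are assumptions chosen for illustration and do not come from any particular monitoring product.

```python
def average_over_windows(samples, window):
    """Average consecutive, non-overlapping windows of utilization samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical data: one utilization sample (in percent) per second for five
# minutes, idle at 10% except for a 30-second burst at 100%.
per_second = [10.0] * 300
per_second[120:150] = [100.0] * 30

per_minute = average_over_windows(per_second, 60)    # one value per minute
five_minute = average_over_windows(per_second, 300)  # one value for the whole run

print(max(per_second))  # the burst is visible at one-second resolution
print(per_minute)       # the minute containing the burst is diluted
print(five_minute)      # the burst is almost invisible
```

In this assumed data set the one-second samples reach 100%, the one-minute averages peak at 55%, and the five-minute average is 19%, even though the resource was fully saturated for 30 seconds.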
Thresholds are used because 100% samples are highly unlikely with systems that use long sample periods. In such systems, a 90% threshold may be used with the assumption that in order to return 90% utilization, a certain percentage of the sample must have been higher than the threshold (including the possibility of 100% utilization). These systems may return an alarm when such an event occurs, or the system may even predict when in the future utilization will crest above the selected threshold.
Unfortunately, cresting above such a threshold does not necessarily indicate a problem. Users familiar with such alarms, or even overexposed to them, ignore them or find them meaningless.
Compounding the problem, these systems perform their monitoring on individual elements that may make up a larger resource. For example, if a network maintained three WAN links between two locations, the first WAN link may be 100% utilized, the second 50%, and the third 10%. If a system monitored them individually, there would be a constant alarm on the first WAN link. In fact, as these three links are elements of a larger resource, the entire resource is little more than 50% utilized.
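The arithmetic behind the three-link example can be sketched as follows. The per-link capacity of 1.5 Mbps is an assumption borrowed from the earlier WAN example, and the function name `composite_utilization` is hypothetical.

```python
def composite_utilization(links):
    """Combine (capacity_mbps, utilization_percent) pairs into one resource.

    Returns the total capacity in Mbps and the capacity-weighted
    utilization of the bundle in percent.
    """
    total_capacity = sum(capacity for capacity, _ in links)
    used = sum(capacity * percent / 100.0 for capacity, percent in links)
    return total_capacity, 100.0 * used / total_capacity


# Assumed capacities: three 1.5 Mbps links at 100%, 50%, and 10% utilization.
links = [(1.5, 100.0), (1.5, 50.0), (1.5, 10.0)]
total, percent = composite_utilization(links)
print(f"{total} Mbps composite resource, {percent:.1f}% utilized")
```

With these assumed capacities the bundle carries about 2.4 Mbps of a possible 4.5 Mbps, or roughly 53%, which matches the "little more than 50%" figure above.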
In short, although there are solutions that try to solve this problem, due to difficulties with the granularity of the collected data, and the presentation of that data, the problem has yet to be solved effectively.
In typical network management solutions, polling intervals have often been problematic. If the management system polls resource elements frequently, the system would potentially need terabytes of storage to maintain this information for any meaningful time period. Conversely, if the polling interval is extended, brief events may be masked by longer periods of normal usage. In order to balance these concerns, conventional management systems poll frequently and store this high volume of data for a relatively short period of time. After the data has existed for a preselected period of time, the data is “rolled up”, or archived into larger and larger averages as time goes on. In such a system, data polled at fixed intervals ranging from seconds to minutes may be available in the database for the first day, but this quickly dissolves into daily averages once the data is several weeks old. This technique presents two problems. First, the quality of data is inconsistent throughout the database, making projections based on this data inaccurate. Second, when trying to determine when usage of the network exceeded a preselected value over a preselected period of time, the information will be unavailable once the data is processed through such a “roll-up” method.
In summary, therefore, one embodiment of the invention operates by tracking all network resources against their potential capacity and a user preferred operating range. This embodiment of the invention then determines when the critical operating range is going to exceed the current capacity of the resources that constitute the network.
One embodiment of the invention provides that devices are polled at a standard, fixed interval and information is stored only when a significant change is observed in that resource element. This, in essence, provides the capability of seeing events that last a potentially short period of time without requiring the hard drive storage that storing every sample would ultimately take. The information about resource elements is stored in the database and analyzed over a long period of time. The information can also be displayed by bundling individual resource elements, in a way that is meaningful to the user, to represent their overall resource, if applicable.
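One possible reading of "stored only when a significant change is observed" is sketched below. The definition of a significant change (an absolute difference from the last stored value of at least `threshold`), the data, and the function name are all assumptions made for illustration; the text does not specify them.

```python
def record_significant_changes(polls, threshold):
    """Keep a polled sample only when it differs from the last stored value
    by at least `threshold` (an assumed definition of "significant").

    `polls` is a list of (timestamp_seconds, value) pairs taken at a fixed
    polling interval.
    """
    stored = []
    for timestamp, value in polls:
        if not stored or abs(value - stored[-1][1]) >= threshold:
            stored.append((timestamp, value))
    return stored


# Hypothetical polls every 60 seconds, utilization in percent.
polls = [(0, 10), (60, 11), (120, 12), (180, 95), (240, 96), (300, 12), (360, 11)]
stored = record_significant_changes(polls, threshold=5)
print(stored)
```

In this example seven polls reduce to three stored rows, yet the brief excursion to roughly 95% and its start and end times remain recoverable from the stored data.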
Thus, the method of the invention bundles separate homogenous resource elements together as a single composite resource and plots and projects the usage of that resource as a value on a single graph using a single line in one embodiment of the invention.
The invention uses a preselected subset of a daily period to represent the upper operating range of the device. For example, the second standard deviation of a daily period (approximately the worst 36 minutes) can be used to represent the upper operating range of a 24-hour period. Some users may choose to use a 95th percentile value, which would be the worst 72 minutes of the day.
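The relationship between a chosen fraction of the day and the "worst N minutes" can be sketched as follows. The nearest-rank selection, the one-sample-per-minute data, and the function names are assumptions for illustration; 36 minutes corresponds to 2.5% of a 1,440-minute day and 72 minutes to 5%.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1440


def worst_minutes(percent_of_day):
    """Number of minutes covered by the worst `percent_of_day` of a day."""
    return percent_of_day * MINUTES_PER_DAY / 100


def upper_operating_value(minute_samples, percent_of_day):
    """Lowest utilization found within the worst `percent_of_day` of samples.

    Uses a simple nearest-rank rule (an assumption): sort descending and
    take the k-th value, where k is the size of the worst subset.
    """
    ordered = sorted(minute_samples, reverse=True)
    k = max(1, math.ceil(percent_of_day * len(ordered) / 100))
    return ordered[k - 1]


# Hypothetical day: 72 busy minutes at 90% utilization, the rest at 20%.
day = [90.0] * 72 + [20.0] * (MINUTES_PER_DAY - 72)

print(worst_minutes(2.5), worst_minutes(5))  # 36.0 72.0
print(upper_operating_value(day, 5))         # 90.0
print(sum(day) / len(day))                   # 23.5, the daily average
```

For this assumed day the daily average is 23.5%, while the value representing the worst 72 minutes is 90%, which is the distinction the upper operating range is meant to capture.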
Thus, this embodiment of the invention provides for a method that monitors resource elements and bundles them together in a way that is meaningful to the user. The usage is displayed in terms of a current period in graph (x-axis) and usage is projected into the future to determine potential saturation of network resources. A standard logistical regression is used to predict when resources will exceed their known capacity.
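The idea of projecting usage forward until it meets capacity can be illustrated with a simplified sketch. Note that this sketch substitutes an ordinary least-squares straight-line fit for the regression named above, purely as a stand-in; the data, units, and function names are likewise assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def day_capacity_reached(days, usage, capacity):
    """Day (on the x-axis) at which the fitted line reaches `capacity`.

    Returns None when the fitted trend is flat or falling, since the
    line then never reaches the capacity from below.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


# Hypothetical daily upper-operating-range values (Mbps) for a 4.5 Mbps resource.
days = [0, 1, 2, 3, 4]
usage = [2.0, 2.1, 2.2, 2.3, 2.4]
print(day_capacity_reached(days, usage, capacity=4.5))  # approximately day 25
```

The input here would be the daily upper-operating-range value of the composite resource rather than a daily average, so the projected date reflects when the busiest part of the day is expected to reach capacity.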
Further details of the present invention will become apparent to one skilled in the art from the following detailed description when taken in conjunction with the accompanying drawings.