Over the years, computer systems and the applications they run have grown in both size and complexity. Historically, computer applications were single-tier (also referred to as “single node”) applications in which all of the data and the processing of that data resided on a single computer system. In contrast, many modern applications run on multiple tiers (also referred to as “multiple nodes”) distributed across multiple computers.
The ability to create multi-nodal systems offers an organization great flexibility and scalability in application design and implementation.
However, regardless of the number of nodes or tiers, and regardless of their location, technology, organization and persistence, they must all function cooperatively and optimally in order both to process the data properly and to deliver the required level of performance, in terms of throughput and response time, to the users of the system.
The use of multi-node/multi-tier systems has also created technological challenges in testing and diagnosing poor overall system performance, especially in cases where the poor performance is intermittent. This is because, when there is an intermittent drop in performance, the use of traditional monitoring methods will typically degrade overall system performance significantly, to an unacceptable level.
Specifically, traditional approaches to addressing an intermittent drop in performance involve creating script files that run on the various tiers that comprise the application. The script files are either run on a continuous basis, creating large quantities of output data, or invoked only in response to some triggering event. In either case, post-processing attempts are then made to synchronize the output data across the multiple tiers by means of timestamps in the data.
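By way of illustration only, the timestamp-based post-processing described above might be sketched as follows. The record layout, tier names and sample messages are hypothetical and are not taken from any particular system, and the sketch assumes that each tier's output is already sorted by time and that the tiers' clocks are synchronized, which may not hold in practice.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    """One line of diagnostic output (hypothetical layout)."""
    timestamp: float  # seconds, as reported by the tier's own clock
    tier: str
    message: str


def merge_by_timestamp(*tier_logs: Iterable[Record]) -> Iterator[Record]:
    """Interleave per-tier logs, each already time-ordered, into one timeline."""
    return heapq.merge(*tier_logs, key=lambda r: r.timestamp)


# Hypothetical output collected separately on three tiers.
web = [Record(100.0, "web", "request received"),
       Record(100.9, "web", "response sent")]
app = [Record(100.1, "app", "query issued"),
       Record(100.8, "app", "result returned")]
db = [Record(100.2, "db", "lock wait begins"),
      Record(100.7, "db", "lock released")]

timeline = list(merge_by_timestamp(web, app, db))
for record in timeline:
    print(f"{record.timestamp:7.1f}  {record.tier:<3}  {record.message}")
```

The merged timeline is only as trustworthy as the timestamps themselves, so any clock skew between tiers may reorder events that are close together in time.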
Each of these traditional approaches has disadvantages.
The approach of running the scripts continuously, outputting all diagnostic data all the time on all tiers on which the application runs, is prohibitive, especially in a production environment in which response times are critical. More particularly, when using this approach, there is often not enough available disk space to store the data generated, the performance of the system experiencing the problem is adversely impacted and degraded by the running of the script files, and, in any event, depending on the frequency of the event occurrence, most of the data collected by running the scripts is useless but nevertheless must be analyzed.
The triggering event approach to running the scripts is less detrimental in a production environment, since data is only output when an intermittent event that causes a degradation in performance occurs. Nevertheless, this approach has its own inadequacies. For example, in some cases, the triggering event may only occur after the event that caused the degradation in performance has already occurred, or while it is in progress. Thus, the root cause of the performance problem may have already passed or terminated before the actual triggering event. In other cases, the triggering event may occur on a single node within the system and cause collection of data for that node, even though the cause of the event did not arise on that node. If that is the case, the collected data will be of little or no value, since the performance issue will actually have resulted from an event that occurred on some other node, and the node where the trigger occurred will only have “noticed” the trigger because the effect of the real event has “trickled over” to it.
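The second inadequacy can be illustrated with a minimal sketch, offered only as an example under stated assumptions. The event records, node names and the 500 ms threshold are all hypothetical, and the collector shown is a simplified stand-in for trigger-invoked scripts rather than any actual tool.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One observed event on one node (hypothetical layout)."""
    time: int
    node: str
    detail: str
    latency_ms: int


TRIGGER_THRESHOLD_MS = 500  # hypothetical threshold that fires the trigger


def collect_on_trigger(events: List[Event],
                       threshold_ms: int = TRIGGER_THRESHOLD_MS) -> List[Event]:
    """Capture data from the trigger onward, on the triggering node only."""
    trigger = next((e for e in events if e.latency_ms > threshold_ms), None)
    if trigger is None:
        return []
    return [e for e in events
            if e.node == trigger.node and e.time >= trigger.time]


events = [
    Event(1, "node-a", "normal request", 40),
    Event(2, "node-b", "long-running lock acquired", 45),   # actual root cause
    Event(3, "node-b", "lock released", 50),
    Event(4, "node-a", "request stalled waiting on node-b", 900),  # trigger
    Event(5, "node-a", "backlog draining", 300),
]

captured = collect_on_trigger(events)
root_cause_seen = any("lock acquired" in e.detail for e in captured)
print([e.time for e in captured], root_cause_seen)
```

In this sketch the collector captures only what happened on the node that noticed the slowdown, at and after the trigger, so the earlier event on the other node never appears in the collected data.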
Still further, neither approach can take into account dynamically created nodes, nor can either truly account for the possibility that a transaction that causes the performance problem may be processed on any available node, so the problem may be difficult to replicate by re-running the transactions that were being processed shortly before or during the time of the degraded performance.
Adding to the complexity in diagnosing intermittent degraded performance issues is the trend towards ever larger systems, composed of hundreds or thousands of computing nodes, any of which may be involved in the processing that results in a performance issue.
In an effort to avoid having an effect on the production system while it is in use, another approach to diagnosing intermittent poor performance typically involves creating a separate test environment, attempting to replicate the conditions of the production environment and, through monitoring, identifying the cause of the poor performance. This approach is likewise inadequate because the test environment is not actually the production environment.
With any of the foregoing traditional approaches, locating, diagnosing and fixing the cause of the performance issue typically takes months or more, so it is extremely costly.
Thus, there is an ongoing technological problem that requires a better solution than a replicated test environment can provide.