This invention relates to proactive fault detection and anomaly detection on transaction networks.
Transaction-oriented networks, such as AT&T's Transaction Access Services (TAS) network, provide ubiquitous dial-to-packet services for carrying short-duration transaction traffic. Average usage of the TAS network can amount to millions of transactions on a typical, non-busy day. Usage of such transaction networks continues to grow at a rapid pace. Typical transactions support point-of-sale applications and services (e.g., credit/debit card authorizations and settlements), health care applications, banking and vending applications, and other data-driven sales applications.
Such transaction-oriented networks can be data, telecom, or combined data and telecom wide area networks (WANs) that service such short-duration transactions (having a duration on the order of seconds) between a set of terminals (e.g., credit card scanners or personal computers) and a set of processing servers (e.g., credit processing servers). FIG. 1 shows a typical network 101, such as the AT&T TAS network, to which are connected a very large plurality of terminal equipment, such as credit card scanners 102-1-102-N, and a plurality of host processors 103-1-103-M. The AT&T TAS Network comprises three components: an AT&T 800 Network 104, TAS nodes 105, and the AT&T Packet network 106. The TAS network 101 enables transaction-oriented communication between any of the connected terminal devices 102, which are geographically scattered around the country, and their particular designated processing hosts. Access of the plural terminal devices 102 to the TAS network is through the 800 Network 104, which is terminated at the set of TAS nodes (modems) 105 where POTS-to-packet protocol conversion is effected. These nodes use the Dialed Number Identification Service (DNIS) digits provided by the 4ESS™ switches in the 800 network to establish switched virtual circuits (SVCs) in the packet network 106. The packet network 106 completes the connection between an individual terminal device 102-i initiating a transaction of a certain type and the particular host processor 103-j which can complete that type of transaction.
The TAS network 101 concurrently supports transactions of multiple service classes, where each service class represents a different type of transaction. Thus, VISA® credit card transactions directed to the particular VISA host processor that handles such transactions fall within one class, MasterCard® credit card transactions directed to the MasterCard host processor that handles such transactions fall within another class, health-card transactions for a particular pharmaceutical provider directed to that provider's host processor fall within a third class, etc. Transactions in different classes are likely to have vastly different temporal characteristics, such as average transaction duration and duration distribution, due to the diverse nature of the types of transactions being supported. Thus, for example, a credit card authorization for a purchase using a VISA credit card would be expected to be of shorter duration than a health-card transaction relating to a drug-refill order.
Because of the tremendously large number of transactions being supported on the network, it is extremely important that network failures and performance degradations be kept to a minimum. A network failure or performance degradation, for example, could easily affect the ability of consumers across the country to make purchases with their credit cards. Such a scenario could have severe economic repercussions for a large segment of the business community. Even the failure or performance degradation of one particular host processor can strongly impact the performance of other service classes and of the network as a whole.
Currently, network management systems monitor and manage the TAS nodes 105 in the TAS network 101 to detect performance failure in the modem circuitry. Such systems react to a hard failure only when it occurs. Similarly, management systems monitor switches in the 800 network 104 and switches in the packet network 106 and are also only reactive to actual failures on the monitored part of the network. Thus, only the network itself and the components that comprise the network are currently being monitored and managed from within the network. With such management systems, therefore, the failure or performance degradation of any non-managed element cannot be detected.
It is often, however, the failure or performance degradation of a non-managed element that can have a severely deleterious effect on the entire network. For example, the performance degradation of a host processor serving transactions in a particular service class can result in denial of service not only to that one service class, but also to transactions in other service classes which are being serviced by other host processors, since all service classes share the same network infrastructure and resources. Further, during periods of high traffic intensity, such as during the Christmas holiday shopping season, resources for the VISA and MasterCard service classes may be oversubscribed, so that these dominant service classes hijack resources from other, less dominant service classes, resulting in a denial of access for transactions in those less dominant service classes. Even further, on the transaction input side of the network, a general failure of unmanaged transaction terminals associated with a particular service class will remain undetected by a network management system that only monitors network elements.
It is desirable, therefore, to be able to measure and analyze network performance in real time so that an anomaly can be detected before an actual failure occurs and corrective actions can be executed in time to avert failures. Fault detection on a local area network using anomaly detection is described by F. Feather, D. Siewiorek and R. Maxion, in a paper entitled “Fault Detection in an Ethernet Using Anomaly Signature Matching”, Computer Communication Review (ACM SIGCOMM '93), Vol. 23, No. 4, October, 1993. As described in that paper, observable network performance data is directly analyzed to detect anomalies. It has been found, however, that an analysis of directly observed network performance data does not provide sufficient sensitivity to enable off-network anomalies to be detected and thus proactively corrected. In addition, the techniques suggested by Feather et al. apply primarily to an Ethernet local area network (LAN) environment.
In accordance with the present invention, proactive and automatic detection of network failures and performance degradations is achieved by first converting real-time network performance data into a performance-based objective function. The objective function, computed from current real-time data, is directly correlated with the particular anomalies which the network monitor is trying to detect. That objective function, generated from current data, is then compared with the same objective function as predicted from historical performance data to determine anomalies in the objective function generated from the current data which probabilistically signify a potential fault. Through such a real-time comparison, alarms can be generated when anomalies are detected in the objective function generated from the current data.
In the embodiment of the present invention, an objective function used to characterize performance of transaction-oriented networks is defined as traffic intensity. Each transaction is characterized by the service class that the transaction belongs to, the start time of the transaction, and the duration of the transaction. From current transaction data, for each service class, a time-dependent traffic intensity is computed that is defined as being equal to the total number of active transactions on the monitored network falling within a calculated binning interval. That binning interval, for each service class, is adaptively and dynamically determined from recent historical transaction records and is a function of the median and probability distribution of the transaction duration for that particular service class from such past data. From a larger past time-frame window of historical data, a predicted baseline and upper and lower time-dependent traffic intensity thresholds are determined for each service class using that binning interval. An anomaly is detected when the real-time traffic intensity at a certain time, computed from the current data using the determined binning interval, is greater than the predicted time-dependent upper threshold or less than the predicted time-dependent lower threshold for longer than a predetermined time interval. To account for the evolution of network traffic, the most recent performance data is periodically incorporated with the historical data to update the performance thresholds and baseline. Upon detecting an anomaly, an alarm can be sounded, such as through a graphical user interface (GUI), to alert a network operator of the presence of network anomalies and faults. The operator can then identify the service class or classes associated with the anomaly and determine the possible cause(s) of the anomaly.
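By way of illustration only, the traffic intensity computation and threshold comparison described above might be sketched as follows for a single service class. The function names, the use of the median duration alone as the binning interval, and the expression of the predetermined time interval as a count of consecutive bins (`min_run`) are assumptions made for this sketch; the text above states only that the binning interval is a function of the median and probability distribution of the transaction duration.

```python
from statistics import median


def binning_interval(durations):
    """Assumed choice for illustration: bin width equals the median duration."""
    return median(durations)


def traffic_intensity(transactions, bin_width, t_start, n_bins):
    """Count the transactions active in each bin.

    transactions: iterable of (start_time, duration) pairs for one service class.
    Bin k covers [t_start + k*bin_width, t_start + (k+1)*bin_width).
    """
    counts = [0] * n_bins
    for start, duration in transactions:
        end = start + duration
        for k in range(n_bins):
            lo = t_start + k * bin_width
            hi = lo + bin_width
            # A transaction is active in a bin if its lifetime overlaps the bin.
            if start < hi and end > lo:
                counts[k] += 1
    return counts


def detect_anomalies(intensity, lower, upper, min_run):
    """Return the bin indices at which an out-of-threshold run reaches min_run bins.

    lower and upper are per-bin (time-dependent) thresholds; min_run is a
    hypothetical stand-in for the predetermined time interval.
    """
    alarms = []
    run = 0
    for k, value in enumerate(intensity):
        if value > upper[k] or value < lower[k]:
            run += 1
        else:
            run = 0
        if run == min_run:
            alarms.append(k)
    return alarms
```

In such a sketch, a single bin outside the thresholds does not raise an alarm; only a sustained excursion does, which corresponds to the requirement that the intensity remain outside the thresholds for longer than a predetermined time interval.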
Alternatively or cooperatively with a GUI, upon detecting an anomaly, a network control module can be signaled for automatic feedback and control such as the detachment from the network of a potentially offending host processor, or the initiation of a rerouting module. Upon detection of an anomaly, the data giving rise to that anomaly is removed from the database to prevent use of that data as part of the historical data used to recalculate binning intervals, and the baseline and upper and lower thresholds.
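By way of illustration only, the exclusion of anomalous data from the historical record and the recalculation of the baseline and thresholds might be sketched as follows. The text above does not specify how the baseline and thresholds are derived from historical data; the sketch assumes, purely for illustration, a per-time-slot mean as the baseline and a band of `k` standard deviations around it as the thresholds. The names `update_history`, `recompute_thresholds`, and `k` are hypothetical.

```python
from statistics import mean, pstdev


def update_history(history, new_samples, anomalous_slots):
    """Merge recent intensities into the history, skipping anomalous slots.

    history: dict mapping a time slot to a list of past intensities.
    new_samples: dict mapping a time slot to the most recent intensity.
    anomalous_slots: set of slots whose recent data gave rise to an anomaly.
    """
    for slot, value in new_samples.items():
        if slot in anomalous_slots:
            continue  # anomalous data must not influence future thresholds
        history.setdefault(slot, []).append(value)
    return history


def recompute_thresholds(history, k=3.0):
    """Return slot -> (lower, baseline, upper).

    Assumed rule for this sketch: baseline is the mean, thresholds are
    mean -/+ k population standard deviations.
    """
    result = {}
    for slot, values in history.items():
        base = mean(values)
        spread = pstdev(values)
        result[slot] = (base - k * spread, base, base + k * spread)
    return result
```

Excluding the anomalous samples before recalculation keeps a fault from widening the thresholds and thereby masking a recurrence of the same fault.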