Network management in a large organization involves ensuring the availability and responsiveness of resources attached to the network, such as servers and printers. Users often think of the services they rely on, such as web sites and email, as part of “the network”. Although tools exist for ensuring the availability of the servers running these services, ensuring their performance is a more difficult problem.
Broadly speaking, there are two aspects of the end systems that may be managed: whether they are operational, and how well they are performing. The former is called fault management and the latter is called performance management. Both belong to the FCAPS model, the International Telecommunication Union (ITU) model that describes the space of network management tasks by dividing it into five layers [Udu96]: fault, configuration, accounting (or administration), performance, and security, which I will discuss in turn.
Fault management is concerned with whether the individual network components (the routers, switches, and links) are up and running. If not, there is a fault that needs to be detected and addressed. This is typically achieved with the Simple Network Management Protocol (SNMP).
Configuration management is concerned with achieving and maintaining a consistent configuration among all the devices in the network. For example, most networks will want to ensure that every side of every link is set to full-duplex (not half-duplex), and that the auto-negotiation of the link rate did not yield a lower rate than the link is capable of. Scaling these issues to a network with tens of thousands of end systems proves the need for an automatic approach to configuration management.
Accounting involves measuring the traffic volume sent between routing domains. Most domains, or autonomous systems (AS's), have agreements with neighboring AS's about the amount and price of network traffic that the neighbor will accept for transit. For example, UNC is charged by (one of) its upstream provider(s) to the Internet according to the 95th percentile of the link utilization each month, so that the more traffic UNC exchanges with this AS, the higher the price the AS will charge for its service. Other AS's have peering arrangements, so that transit is free as long as the service is mutual (and roughly similar in volume). In any event, accounting involves measuring the traffic volume sent to and from neighboring AS's and sometimes enforcing outgoing rate limits to keep costs down. Another area in the accounting domain is billing users (such as departments on campus) according to their usage.
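As a rough illustration of 95th-percentile billing, the sketch below takes a set of link utilization samples and returns the value at or below which 95% of the samples fall. The five-minute sampling interval, the nearest-rank method, the function name, and the sample values are all assumptions made for illustration; the text does not say how the provider actually computes the percentile.

```python
def percentile_95(samples_mbps):
    """Return the 95th-percentile sample using the nearest-rank method."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    # Nearest rank = ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]


# Hypothetical month, shortened to 100 five-minute readings (in Mbit/s):
# mostly quiet, with a few short bursts.
samples = [10.0] * 94 + [50.0] * 2 + [900.0] * 4
billable_rate = percentile_95(samples)  # 50.0: the top 5% of samples are ignored
```

Under this scheme, short bursts that fall within the top 5% of samples do not raise the bill, which is one reason an AS might enforce outgoing rate limits only on sustained traffic.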
Performance management is concerned with the same objects as fault management, but the concern is orthogonal. Instead of asking whether the objects are up and running, performance management assumes they are running and asks whether they are running well. My research falls within this layer.
Security management is concerned with keeping the network secure from attack. For example, the routers and switches should not accept configuration changes unless the change is appropriately authenticated and authorized. Also, network users should not be able to spy on the network transmissions of other users.
Network management may also include managing the performance of authorized services in the network, such as web services and email services. This extension does not comprise an additional layer of the FCAPS model; rather, it is an extension of the fault and performance layers. The fault layer is extended to detect faults not just of the internal network devices, but also of the servers attached to the network. Likewise, the performance layer is extended to manage the performance not just of routers, switches, and links, but also the servers.
From a user's perspective, the response time is a natural measure of performance: the longer the response time, the worse the performance. For example, if there is a noticeable delay between clicking on a “get new e-mail” button and seeing new e-mail appear in one's inbox, then the response time is high, and the performance is low. Such a conception of performance is naturally supported within the request-response paradigm.
Current passive solutions to monitoring server performance will typically target the most common types of servers, ignoring hundreds of other types. If a new type of service is created, the developers of the monitoring solution will need to understand the application-level protocol upon which the service is based, and incorporate the protocol into their solution. If the protocol is proprietary, then they must guess at its operation. In contrast, my solution incorporates no knowledge about specific application-level protocols, so it works equally well for all TCP-based request/response services, independent of whether they are based on new, proprietary, or standard application-level protocols. (Note that my solution does not work for UDP and other non-TCP types of traffic. Most servers of interest are TCP-based. Notable exceptions include voice-over-IP and streaming media.) As a result, my measurements of server performance do not require access to the application-level information present in a server's protocol messages.
A common problem in many network measurement techniques is the inability to monitor traffic in which the application payload is encrypted and thus unintelligible without the private decryption key. Because these techniques inspect the application payload, encryption foils their measurements. My technique depends only on the packet headers, and thus can be used with traffic that is encrypted. To use an analogy, consider packets as envelopes. Unlike other approaches, my approach looks only at address and other information on the outside of the envelope. Thus, when an envelope is securely sealed (as in the case of encryption), being unable to open the envelope is not a problem for my approach.
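To make the envelope analogy concrete, the following sketch reads only the IPv4 and TCP headers of a packet and derives the payload length from the header fields, without ever examining the payload bytes. It is a minimal illustration, not the measurement technique itself: the function names, the hand-built example packet, and the addresses and ports are hypothetical, and the sketch assumes a well-formed IPv4 packet carrying TCP.

```python
import socket
import struct


def summarize_headers(packet: bytes) -> dict:
    """Extract IPv4/TCP header fields; the payload bytes are never read."""
    version_ihl, _tos, total_len = struct.unpack("!BBH", packet[:4])
    ip_header_len = (version_ihl & 0x0F) * 4          # IHL is in 32-bit words
    src, dst = struct.unpack("!4s4s", packet[12:20])
    # First 14 bytes of the TCP header: ports, seq, ack, data offset + flags.
    tcp_fixed = packet[ip_header_len:ip_header_len + 14]
    sport, dport, seq, ack, offset_flags = struct.unpack("!HHIIH", tcp_fixed)
    tcp_header_len = (offset_flags >> 12) * 4         # data offset in 32-bit words
    return {
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "sport": sport,
        "dport": dport,
        "seq": seq,
        "ack": ack,
        "flags": offset_flags & 0x3F,                 # URG, ACK, PSH, RST, SYN, FIN
        "payload_len": total_len - ip_header_len - tcp_header_len,
    }


def build_example_packet(payload_len: int) -> bytes:
    """Build a hypothetical packet whose payload stands in for ciphertext."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + payload_len, 0, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
    tcp = struct.pack("!HHIIHHHH", 51000, 443, 1000, 2000,
                      (5 << 12) | 0x18, 65535, 0, 0)  # 20-byte header, PSH|ACK
    return ip + tcp + bytes(payload_len)


info = summarize_headers(build_example_packet(100))
```

Because the payload length comes from the IP total length and the two header lengths, the result is the same whether the payload is plaintext or encrypted.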
Many evaluations of anomaly (or intrusion) detection methods analyze the false positive rate, or the proportion of alarms that did not correspond to real anomalies (intrusions), and the false negative rate, or the proportion of real anomalies (intrusions) that went undiscovered by the methods. Such an analysis requires that all real anomalies (intrusions) be identified a priori. Unfortunately, there is no authoritative source that labels anomalies, partially because which issues are important to a network manager is not precisely defined.
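The two rates, as defined in this paragraph, can be computed as in the sketch below. The event identifiers and the ground-truth labels are hypothetical; in practice such labels are exactly what is missing, which is the difficulty described above.

```python
def error_rates(alarms, real_anomalies):
    """Return (false_positive_rate, false_negative_rate) as defined in the text.

    false positive rate: share of alarms that match no real anomaly
    false negative rate: share of real anomalies that raised no alarm
    """
    alarms, real = set(alarms), set(real_anomalies)
    fp_rate = len(alarms - real) / len(alarms) if alarms else 0.0
    fn_rate = len(real - alarms) / len(real) if real else 0.0
    return fp_rate, fn_rate


# Hypothetical event IDs: four alarms were raised, three anomalies were real.
fp_rate, fn_rate = error_rates(alarms={1, 2, 3, 4}, real_anomalies={3, 4, 5})
# fp_rate is 0.5 (alarms 1 and 2 were spurious); fn_rate is 1/3 (anomaly 5 was missed)
```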
One network management tool is Flowscan [Plo00]. Flowscan reads Cisco Netflow records (obtained through passive measurement), which contain byte and packet counts per flow, and aggregates them into simple volume metrics of byte, packet, and flow counts per five-minute interval. The timeseries of these counts are stored in a round-robin database (using RRDtool), which automatically averages older points to save space. Netflow records do not take internal flow dynamics into account, and they have no notion of anything like a quiet time. Furthermore, Flowscan does not correlate the two flows in a connection.
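The kind of aggregation described here can be sketched as follows. This is not Flowscan's code: the record layout (start time, byte count, packet count) and the function name are assumptions chosen to mirror the description of per-flow counts summed into five-minute bins.

```python
from collections import defaultdict

BIN_SECONDS = 300  # five-minute interval


def aggregate_flows(flow_records):
    """Sum byte, packet, and flow counts per five-minute bin.

    Each record is a hypothetical (start_time_seconds, byte_count, packet_count)
    tuple. Returns a dict keyed by the bin's start time, in ascending order.
    """
    bins = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for start_time, byte_count, packet_count in flow_records:
        bin_start = int(start_time) // BIN_SECONDS * BIN_SECONDS
        counts = bins[bin_start]
        counts["bytes"] += byte_count
        counts["packets"] += packet_count
        counts["flows"] += 1
    return dict(sorted(bins.items()))


records = [(0, 1000, 10), (299, 500, 5), (300, 200, 2)]
series = aggregate_flows(records)
```

Note what the summary discards: once the counts are summed per bin, nothing remains of the timing within a flow or of which two flows formed one connection.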
Estan et al's AutoFocus [ESV03] tool (and the similar Aguri [CKK01] tool by Cho et al) summarizes network traffic into a list of the most significant patterns, for some definition of “significant”. For example, a flood of SYN packets to a particular host would show up as an anomaly when viewing the number of flows per destination host. Thus, the task of AutoFocus is to look at traffic from various viewpoints and carefully aggregate on one or several dimensions to come up with the list most useful to the network manager. AutoFocus reports in an offline fashion.
Bahl et al's Sherlock system [BCG+07] attempts to localize the source of performance problems in large enterprise networks. They introduce an “inference graph” to model the dependencies in the network, providing algorithms for inferring and building these models automatically. The system is able to detect and diagnose performance problems in the network down to their root cause. However, the system requires deploying “agents” on multiple machines at multiple points in the network, so deployment is not simple. Also, the agents inject traceroutes into the network and make requests to servers, so the system is not passive.
Gilbert et al take a different approach [GKMS01]. They sacrifice exactness for the ability to produce meaningful summaries from high-speed traffic as it streams by.
Another approach is an open-source network management tool called “Nagios”. Nagios sends a probe out to the server and measures the response. In the case of web servers, the probe is an actual HTTP request. If the response time is above a certain threshold, then Nagios sends a warning. If three successive response times are above the threshold, then Nagios sends a more urgent alert, indicating that there is a performance issue with the server.
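The alerting rule described above can be sketched as a small state machine. This mirrors only the behavior stated in the text (a warning for a slow response, a more urgent alert after three successive slow responses); it is not Nagios's implementation or configuration syntax, and the function name, state labels, and sample values are hypothetical.

```python
def classify_probes(response_times, threshold, urgent_after=3):
    """Label each probe result as "ok", "warning", or "alert".

    A response time above the threshold yields "warning"; once `urgent_after`
    successive probes have exceeded the threshold, the label becomes "alert".
    Any response at or below the threshold resets the streak.
    """
    states = []
    consecutive_slow = 0
    for response_time in response_times:
        if response_time > threshold:
            consecutive_slow += 1
            states.append("alert" if consecutive_slow >= urgent_after else "warning")
        else:
            consecutive_slow = 0
            states.append("ok")
    return states


# Hypothetical probe results in seconds, with a one-second threshold.
probe_times = [0.2, 1.5, 1.7, 0.3, 1.2, 1.3, 1.4, 1.6]
states = classify_probes(probe_times, threshold=1.0)
```

Requiring several successive slow responses before escalating keeps a single transient delay from triggering an urgent alert.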
Lastly, there is a closely related network management solution called OPNET ACE. ACE is similarly concerned with network service performance, and it is also interested in application response times. ACE is capable of more precisely distinguishing between network transit time and application processing time, but it requires an extensive deployment of measurement infrastructure throughout the network, on clients, servers, and points in between. Furthermore, ACE constitutes an expensive, active measurement approach.
Barford and Crovella [BC98] used user-level think-time modeling to realistically model HTTP users. Lan and Heidemann extended this idea to empirically-derived distributions of think-times in [LH02].
A seminal work from Smith et al introduced the notion of inferring application behavior from transport-level information alone [SHCJ01]. This work eventually led to the A-B-T model and the “t-mix” traffic generator [HC06, HCJS07b, WAHC+06].
Vishwanath and Vahdat also create a general model and use it for traffic generation [VV06]. Like “t-mix” and my work, they model TCP connections as request/response pairs, with data unit sizes and “think-time” latencies. However, they take a different approach to modeling for the purpose of generating or replaying traffic, capturing information at higher “session” and request/response “exchange” layers.
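A connection modeled as request/response pairs with data unit sizes and latencies might be represented as in the sketch below. This is a deliberately simplified illustration, not the algorithm of any cited work: the segment tuples, the direction labels, and the class and function names are hypothetical, and the sketch assumes in-order delivery, no retransmissions, and strictly alternating requests and responses.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    request_bytes: int
    response_bytes: int
    latency: float  # last request segment to first response segment, in seconds


def build_exchanges(segments):
    """Group (timestamp, direction, payload_len) tuples into exchanges.

    direction is "c2s" (client to server) or "s2c" (server to client).
    Only header-derived values are used; payload contents are never needed.
    Server data that arrives before any request is ignored.
    """
    exchanges = []
    req_bytes = resp_bytes = 0
    last_req_time = first_resp_time = None
    for timestamp, direction, payload_len in segments:
        if payload_len == 0:
            continue  # pure acknowledgements carry no application data
        if direction == "c2s":
            if resp_bytes:  # a new request closes the previous exchange
                exchanges.append(
                    Exchange(req_bytes, resp_bytes, first_resp_time - last_req_time))
                req_bytes = resp_bytes = 0
            req_bytes += payload_len
            last_req_time = timestamp
        elif req_bytes:  # server data following a request
            if resp_bytes == 0:
                first_resp_time = timestamp
            resp_bytes += payload_len
    if req_bytes and resp_bytes:
        exchanges.append(
            Exchange(req_bytes, resp_bytes, first_resp_time - last_req_time))
    return exchanges


segments = [(0.00, "c2s", 300), (0.01, "s2c", 0), (0.25, "s2c", 1460),
            (0.26, "s2c", 540), (5.00, "c2s", 200), (5.10, "s2c", 100)]
exchanges = build_exchanges(segments)
```

In this example the first exchange is a 300-byte request answered by a 2000-byte response after 0.25 seconds; the gap before the second request at 5.00 seconds corresponds to a user-side think time.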
Fraleigh et al [FDL+01] propose a design for a system to passively collect and store packet header traces from high-speed links within an Internet service provider's point-of-presence (PoP). The system can enable workload characterization, among other things. However, packet header traces offer a high level of detail and consume vast amounts of storage space.
Hussain et al [HBP+05] describe a passive, continuous monitoring system to collect, archive, and analyze network data captured at a large ISP. They collect packet header traces as well as metadata, and they use vast disk resources to store months of data.
Malan and Jahanian's Windmill system [MJ98] also collects continuous passive measurements. However, they use incoming packets to trigger events in application protocol “replay” modules that they manually created for the most popular application protocols. This approach requires significant work to create each such module.
Feldmann's BLT [Fel00] is a method for passively capturing HTTP information, including request/response pairs, from a TCP stream. It abstracts network and transport effects such as out-of-order delivery and retransmissions, instead focusing on the application-level data units exchanged between hosts. However, BLT focuses only on HTTP and fundamentally requires information in the TCP payload (i.e., HTTP headers), so it is not appropriate for encrypted traffic. Furthermore, it is an off-line method that uses a multi-pass algorithm, and so is not suitable for continuous monitoring of high-speed links, which requires a one-pass approach. Another approach based on BLT, with similar drawbacks, is Fu et al.'s EtE [FVCT02].
In [2] and [3], Olshefski et al. introduce ksniffer and its sibling, RLM, which passively infer application-level response times for HTTP in a streaming fashion. However, both systems require access to HTTP headers, making them unsuitable for encrypted traffic. Furthermore, these approaches are not purely passive. Ksniffer requires a kernel module installed on the server system, and RLM places an active processing system in the network path of the server.
Commercial products that measure and manage the performance of servers include the OPNET ACE system. ACE also monitors response times of network services but is an active measurement system that requires an extensive deployment of measurement infrastructure throughout the network, on clients, servers, and points in between. Fluke's Visual Performance Manager is similar and also requires extensive configuration and integration. Also similar is Computer Associates' Wily Customer Experience Manager (CEM). CEM monitors the performance of a particular web server, and in the case of HTTPS, it requires knowledge of server encryption keys in order to function.
Thus, common approaches to monitoring the performance of servers in a network include installing monitoring software on the server and using an active monitoring system that generates service requests periodically and measures the response time. The first approach may include measuring the CPU utilization, memory utilization, and other metrics of the server operating system. One problem with this approach is that such metrics do not necessarily correlate with the user-perceived quality of service. Another is that the act of measuring computational resources consumes the very same resources that the service itself needs. In the second approach, probing or active measurement, the server is given actual requests, and the response time is measured. This approach more accurately reflects the quality of service as perceived by the user.
However, active measurement may negatively affect the service itself, sometimes noticeably. For example, when the quality of service is suffering because of high load, the act of measuring the quality of service will only exacerbate the problem. Yet another drawback is that these approaches typically require extensive customization to work with the specific server/service at hand.
Accordingly, there exists a need for methods, systems, and computer program products for network server performance anomaly detection.