Computer networks are becoming an integral part of daily life through technologies such as the Internet, wireless personal digital assistants, and ad-hoc wireless and mobile networks. The potential for using these networks to assist in illegal or terrorist-based activities is very large. The ability to monitor network traffic and police computer networks is therefore critically important.
Network speeds have been increasing at an incredible rate, doubling every 3-12 months. This is even faster than one version of the remarkably accurate “Moore's Law” observation made in 1965, which states that processing power doubles every 18-24 months. However, while network speeds have been increasing exponentially, disk and bus bandwidth have increased only linearly. The disparity between network, CPU, and disk speeds will therefore continue to grow and become increasingly problematic. Most analysts agree that enough new manufacturing methods, designs, and physics will be available in the near future for these trends to continue for some years to come. One factor fueling this trend is that networks are increasingly optical while computers are primarily electronic.
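The compounding effect of these doubling periods can be made concrete with a small calculation. The following is a minimal sketch, assuming idealized exponential growth at a fixed doubling period; the function names are hypothetical and the doubling periods are simply the ranges quoted above.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def network_to_cpu_gap(years: float,
                       network_doubling_months: float,
                       cpu_doubling_months: float) -> float:
    """How many times faster the network has grown than the CPU over `years`."""
    return (growth_factor(years, network_doubling_months)
            / growth_factor(years, cpu_doubling_months))


if __name__ == "__main__":
    # Doubling periods taken from the ranges quoted in the text (months).
    for net_months, cpu_months in [(12, 18), (12, 24), (3, 18)]:
        gap = network_to_cpu_gap(3, net_months, cpu_months)
        print(f"network x2 every {net_months} mo, CPU x2 every {cpu_months} mo: "
              f"gap after 3 years = {gap:g}x")
```

Even with the most conservative pairing (network doubling every 12 months, processing power every 18 months), the network outgrows the processor by a factor of two within three years under this model.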
The implications of these trends are critically important. Conventional methods will not suffice to capture network traffic in the future. As research technology becomes conventional technology, tcpdump/libpcap-style processing will have neither the requisite processing power nor the disk bandwidth to handle fully saturated network links. Any such single-machine design, no matter how well implemented, will be unable to “keep up” with the network. For example, field tests of tcpdump/libpcap running on a 400-MHz machine over a Gigabit Ethernet (GigE) backbone showed that tcpdump could monitor traffic at speeds of no more than 250 Mbps, with timestamp accuracy no finer than milliseconds. At times of high network utilization (e.g., approximately 85%), over half of the packets were lost because the system could not handle the load. In addition, because few network traffic monitoring systems are available, many vendors offer custom hardware solutions with proprietary software. These systems are very expensive.
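The difficulty a single machine faces can be illustrated by estimating how many processor cycles are available per packet on a saturated link. The following is a rough sketch, not a measurement: it assumes standard Ethernet framing (64-byte minimum and 1518-byte maximum frames, plus 20 bytes of preamble and inter-frame gap per frame), none of which is stated in the field test above, and the function names are hypothetical.

```python
# Per-frame overhead on the wire: 7-byte preamble + 1-byte start delimiter
# + 12-byte inter-frame gap (standard Ethernet; an assumption of this sketch).
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate of a fully saturated link for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(cpu_hz: float, link_bps: float, frame_bytes: int) -> float:
    """CPU cycles available per frame if the CPU does nothing but capture."""
    return cpu_hz / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    cpu_hz = 400e6    # the 400-MHz machine from the field test
    link_bps = 1e9    # Gigabit Ethernet
    for frame_bytes in (64, 1518):
        pps = packets_per_second(link_bps, frame_bytes)
        budget = cycles_per_packet(cpu_hz, link_bps, frame_bytes)
        print(f"{frame_bytes:>4}-byte frames: {pps:,.0f} packets/s, "
              f"{budget:,.1f} cycles per packet")
```

Under these assumptions, minimum-size frames leave fewer than 300 cycles per packet on the 400-MHz machine, before any copying, timestamping, or disk output is accounted for.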
It is possible to address disk and accessory bandwidth by using redundant arrays of independent disks (RAID) systems as network-attached storage (NAS) and the memory integrated network interface (MINI) paradigm, but both systems are currently somewhat expensive. Memory bandwidth issues have been addressed by dynamic random access memory chip technology such as RAMBUS or simple increases of processor cache size, but these techniques have failed to live up to their potential. These approaches in combination with highly tuned commercial-off-the-shelf (COTS) monitors and a specialty monitor have been tested. Unfortunately, this approach is expensive and still susceptible to the growing disparity between network, CPU, and disk speeds.
Most limitations in current network monitoring methodologies are due to compromises in scalability, user level versus kernel level tasks, and monitor reliability. Current approaches compromise scalability by forcing the use of a single machine or processor rather than symmetric multiprocessing (SMP) or cluster systems. At projected network speeds, however, the collection task alone can require so much of a single machine's processing power that too little remains to actually do anything useful with the data.
The user level/kernel level compromise gives up efficiency, real-time monitoring, and timer resolution for ease of use. Most currently available network monitoring approaches are user level applications tied to their operating system's performance; these approaches depend upon system calls such as gettimeofday(). Conversely, a dedicated operating system (OS) operates at kernel level and uses lower-level hardware calls such as rdtsc() (read time-stamp counter).
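The difference in timer resolution can be sketched numerically. gettimeofday() reports time in seconds and microseconds, so its granularity is at best one microsecond, whereas a cycle counter advances once per clock tick. The sketch below assumes, for illustration only, that the counter ticks at a constant 400 MHz (the clock rate of the machine in the field test above); on real hardware the counter frequency would have to be calibrated, and the function names here are hypothetical.

```python
GETTIMEOFDAY_GRANULARITY_NS = 1_000  # struct timeval carries microseconds


def tick_ns(counter_hz: int) -> float:
    """Duration of one cycle-counter tick in nanoseconds."""
    return 1e9 / counter_hz


def cycles_to_ns(cycles: int, counter_hz: int) -> int:
    """Convert a cycle-counter delta to whole nanoseconds (rounded down)."""
    return cycles * 1_000_000_000 // counter_hz


def ticks_per_microsecond(counter_hz: int) -> float:
    """How many counter ticks fit in one gettimeofday() time unit."""
    return GETTIMEOFDAY_GRANULARITY_NS / tick_ns(counter_hz)


if __name__ == "__main__":
    hz = 400_000_000  # assumed constant counter frequency
    print(f"one tick           = {tick_ns(hz)} ns")
    print(f"ticks per 1 us     = {ticks_per_microsecond(hz):g}")
    print(f"delta of 269 ticks = {cycles_to_ns(269, hz)} ns")
```

Under this assumption the cycle counter resolves intervals several hundred times finer than a microsecond-granularity system call, and it does so without a transition into the kernel.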
Reliability compromises are a misguided attempt to reduce cost, but they result in many implementations that fail to keep up with network speeds, fail to maintain a count of dropped packets, or crash or hang during use.
The performance of currently available network traffic monitoring systems can be illustrated by tcpdump, libpcap, and remote monitoring (RMON) systems. Tcpdump is an invaluable, easy-to-use, portable, free tool for network administrators. It is designed as a user interface application relying upon functionality contained in the lower-level libpcap library, which has also been used successfully with other applications. Unfortunately, tcpdump and libpcap are by nature limited by decisions made concerning the compromises previously discussed. In particular, libpcap executes on a single machine, uses system calls to obtain timestamps, and can be unreliable.
Furthermore, libpcap suffers from efficiency problems endemic to implementations of traffic collection at user level. The operating system performs the required packet copy in the network stack (for transparency); this can double the time required to process a packet. The exact method used by libpcap and other tools varies by operating system but always involves a context switch into kernel mode and a copy of memory from the kernel to the user level library. In Linux and some other operating systems, this “call-and-copy” approach is repeated for every packet observed, while other implementations use a ring buffer in an attempt to amortize costs over multiple packets. At high network speeds, the overhead of copying each individual packet between kernel and user space becomes so excessive that as much as 50% of the aggregate network traffic is dropped when using tcpdump/libpcap over a Gigabit Ethernet link.
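The trade-off between per-packet “call-and-copy” and a ring buffer can be expressed as a simple cost model. The following is an illustrative sketch only: it assumes that a ring buffer amortizes the context-switch cost across a batch of packets while the per-packet copy remains, and the nanosecond figures are invented placeholders rather than measurements of any system.

```python
def per_packet_cost_ns(switch_ns: float, copy_ns: float, batch_size: int = 1) -> float:
    """Average cost per packet when one kernel transition delivers `batch_size` packets.

    batch_size=1 models call-and-copy; larger values model a ring buffer
    that is drained once per kernel transition.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return switch_ns / batch_size + copy_ns


def max_packet_rate(switch_ns: float, copy_ns: float, batch_size: int = 1) -> float:
    """Upper bound on packets per second implied by the per-packet cost."""
    return 1e9 / per_packet_cost_ns(switch_ns, copy_ns, batch_size)


if __name__ == "__main__":
    SWITCH_NS = 1000.0  # hypothetical context-switch cost
    COPY_NS = 500.0     # hypothetical kernel-to-user copy cost per packet
    for batch in (1, 8, 64):
        cost = per_packet_cost_ns(SWITCH_NS, COPY_NS, batch)
        rate = max_packet_rate(SWITCH_NS, COPY_NS, batch)
        print(f"batch={batch:>2}: {cost:8.3f} ns/packet, {rate:,.0f} packets/s")
```

In this model batching shrinks the context-switch term toward zero, but the copy term sets a floor on per-packet cost that no batch size removes.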
Remote monitoring (RMON) devices contain some traffic-collection functionality as well as some management functionality. These devices provide a superset of the functionality of tcpdump/libpcap but work in much the same way. Although the management software provides a convenient interface to the hardware RMON device, it also introduces substantial overhead that limits the fidelity of packet timestamps to the order of seconds. This fidelity is a thousand times worse than that of tcpdump. RMON devices have several additional limitations.
First, the packet-capture mode of the RMON device often silently drops packets. Second, the data-transfer mode of the RMON device requires an active polling mechanism from another host to pull data across. Finally, the RMON devices themselves hang or crash often, e.g., every 36-72 hours.
Thus, there is a need for a new, specialized network traffic monitoring methodology that is scalable, efficient, reliable, and inexpensive. The need for such a system has heretofore remained unsatisfied.