With the growth of distributed enterprise computing comes the expensive task of testing and debugging software that is expected to run on large networks such as wide area networks (WANs). As competition between vendors increases and the critical role of software becomes more apparent, two conflicting needs have arisen: (1) testing software more quickly, thereby delivering products to the marketplace sooner, and (2) testing software more comprehensively, so as to ensure consistently high quality. With the growth of distributed network computing, a specific need has also arisen to test network-connected multi-computer software systems in the presence of network failures and imperfect network properties. Testing and debugging on real multi-site networks, including those that span large geographic distances, is very expensive and has limited value because of the inherent discrepancies between a particular test network and the much broader range of network imperfections that will be encountered by the eventual users of the software. A single unaltered real test network cannot adequately represent the broad range of user networks. Also, because several types of faults occur infrequently on real networks and the impact of such faults must be tested on a large number of product states, adequate testing confined to a real network becomes a very protracted, expensive, and often impractical prospect. It is also difficult to debug network-related software failures because the specific network behavior needed to reproduce the defect cannot be produced on demand on a real network.
Many types of networks exist, from relatively fault-free LANs (Local Area Networks) to fault-rich Internet-based WANs (Wide Area Networks). Software is normally tested on LANs located in test labs. The more error-prone WAN end of the spectrum is fundamentally different from the LAN in that WANs have relatively high and variable latency. Many defects found in software ultimately result from timing problems; hence it is very important to test software over a full range of network latency conditions. An additional characteristic of networks is that the delivery of packetized traffic is not guaranteed to be first-in, first-out (FIFO). On many networks successive packets may take routes of different lengths and thus arrive at their common destination out of order. Also, many networks do not guarantee delivery, so delivery failures and dropped or corrupted packets are not uncommon. Shuffled, dropped, and corrupted packets may be handled in the communications stack below the application level, but the communicating processes still see these events as erratic variations in network latency that may expose timing-related defects in the software being tested. Real networks also suffer from persistent communication outages, in which two or more computers lose the ability to communicate with each other for long periods of time.
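The imperfections listed above (variable latency, out-of-order arrival, and dropped packets) can be modeled in a few lines. The following is a minimal sketch for illustration only; the function name emulate_link and its parameters are assumptions introduced here and do not describe any particular emulator.

```python
import random

def emulate_link(packets, base_delay_ms, jitter_ms, drop_rate, seed=0):
    """Apply delay, jitter, and loss to a list of (seq, send_time_ms) packets.

    Returns the surviving packets as (arrival_time_ms, seq), sorted by
    arrival time, so reordering shows up as out-of-sequence seq values.
    All names and parameters are hypothetical.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    arrivals = []
    for seq, send_ms in packets:
        if rng.random() < drop_rate:
            continue  # packet lost in transit
        delay = base_delay_ms + rng.uniform(0, jitter_ms)
        arrivals.append((send_ms + delay, seq))
    arrivals.sort()
    return arrivals

# Ten packets sent 1 ms apart over a link with large jitter and some loss.
sent = [(seq, float(seq)) for seq in range(10)]
for arrival_ms, seq in emulate_link(sent, base_delay_ms=80, jitter_ms=40, drop_rate=0.2):
    print(f"packet {seq} arrived at {arrival_ms:.1f} ms")
```

Because the random generator is seeded, the same sequence of delays and losses can be replayed, which is the property that makes a defect reproducible for debugging.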
Emulating a broader range of representative user networks with a greater than normal frequency of network faults makes it possible for developers of software to test products more thoroughly, at lower cost, and at an accelerated pace. In addition, network emulation makes it practical and efficient to tune as well as measure the performance of software systems over various combinations of network latency and fault frequency, for example on networks with high latency and high packet loss rates. Further, network emulation allows the specific network behavior needed to reproduce a defect to be created on demand so that the defect can be debugged.
An ideal network free of imperfections would have zero latency and deliver all traffic in the order that it was sent. On real networks, latency varies erratically and can be very high. Latency is introduced by the distance that traffic must travel, by the speed of the alternative routes available, and by the temporary buffering of traffic by the various pieces of routing hardware encountered along the way.
On some networks, traffic can arrive out of order and must be correctly reordered before the communication is presented to the application. The time taken to recover from out-of-order arrivals is seen by the software that reads the communication as transient increases in latency. In addition to transient communication outages, persistent communication outages of much longer duration can occur. Transient and persistent outages are often handled by different mechanisms in the software system, and therefore network emulation must allow both conditions to be tested or reproduced for debugging purposes.
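The way reordering turns into apparent latency can be shown with a small worked example. The sketch below assumes a receiver that must hand packets to the application strictly in send order; the function name and the timing values are hypothetical.

```python
def in_order_delivery_times(arrival_times_ms):
    """arrival_times_ms[i] is when packet i (numbered in send order) arrives.

    A packet can be handed to the application only after every earlier
    packet has arrived, so its delivery time is the latest arrival time
    seen so far.
    """
    delivered = []
    latest = 0
    for t in arrival_times_ms:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered

send_times_ms = [0, 1, 2, 3]
arrival_times_ms = [10, 25, 12, 13]  # packet 1 took a longer route

delivery = in_order_delivery_times(arrival_times_ms)
network_latency = [a - s for a, s in zip(arrival_times_ms, send_times_ms)]
apparent_latency = [d - s for d, s in zip(delivery, send_times_ms)]

print(network_latency)   # [10, 24, 10, 10]
print(apparent_latency)  # [10, 24, 23, 22]
```

Packets 2 and 3 crossed the network in 10 ms each, but the application sees them after 23 ms and 22 ms because they had to wait behind the late packet 1.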
All of these imperfections are exaggerated and more pervasive on Internet-based wide area networks, but they also occur on more localized networks.
Distributed enterprise computing involves multiple asynchronous processes, threads, and multiple processors, which provide the opportunity for creating software race conditions. Race conditions are by definition sensitive to the timing of events and consequently to the erratic latency of imperfect networks. A race condition that resolves favorably in the presence of one degree of latency may resolve unfavorably under a different degree of latency. Race conditions present themselves when software transitions from one state to another. There are a very large number of state transitions in real-world software applications and thus many opportunities for race conditions. It is the concurrence of a fault-vulnerable state transition with the latency-of-the-moment that reveals the underlying software defect. This underscores the importance of testing a large number of state transitions in combination with a broad range of latencies. Emulation of a range of specific latencies makes this a much more tractable task.
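A toy discrete-event model illustrates how the same race can resolve favorably at one latency and unfavorably at another. The scenario is an assumption made for illustration: one host sends a write to a server while notifying a peer, and the peer then reads from the server. All names and values are hypothetical.

```python
import heapq

def read_result(write_latency_ms, notify_latency_ms, read_latency_ms):
    """Return the value the peer reads: 'new' if the write won the race."""
    events = []
    # The write travels directly to the server.
    heapq.heappush(events, (write_latency_ms, 0, "write"))
    # The read reaches the server after the notification and the request.
    heapq.heappush(events, (notify_latency_ms + read_latency_ms, 1, "read"))

    value = "old"
    observed = None
    while events:
        _, _, kind = heapq.heappop(events)  # process events in time order
        if kind == "write":
            value = "new"
        else:
            observed = value
    return observed

# Sweep the write latency while the other two paths stay at 20 ms each.
for write_latency in (5, 20, 50, 80):
    print(write_latency, read_result(write_latency, 20, 20))
```

With these values the read reaches the server at 40 ms, so write latencies of 5 ms and 20 ms yield "new" while 50 ms and 80 ms yield "old". A test run on a low-latency LAN would only ever exercise the favorable outcome.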
Besides latency, the other principal class of network faults is persistent communication outages. The software system must respond appropriately to the outage. Again, the large number of program state transitions comes into play because a communication outage could coincide with a transition. A program might correctly handle a communication outage at some but not all transitions. It is the concurrence of a fault-prone transition with a communication outage that reveals this underlying defect. This underscores the importance of testing a large number of state transitions in concurrence with communication outages. The frequency of communication outages on real networks is insufficient for testing purposes. Emulation allows a tester to increase the frequency of outages and thus accomplish the necessary testing in the available time, or reproduce a defect for the purposes of debugging in a relatively short period of time.
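The idea of pairing an outage with every state transition can be sketched as a small test harness. The client below is a toy with a deliberately planted defect, and the transition names, the LinkDown exception, and the result strings are all hypothetical.

```python
TRANSITIONS = ["connect", "send_request", "await_reply", "commit"]

class LinkDown(Exception):
    """Raised when the emulated network is in an outage."""

def network_call(step, outage_at):
    # The emulated outage coincides with exactly one transition.
    if step == outage_at:
        raise LinkDown(step)

def run_session(outage_at=None):
    """Toy client; returns 'ok', 'recovered', or 'inconsistent'."""
    for step in TRANSITIONS:
        try:
            network_call(step, outage_at)
        except LinkDown:
            if step == "commit":
                # Deliberate defect: an outage during commit is not handled.
                return "inconsistent"
            return "recovered"
    return "ok"

def find_vulnerable_transitions():
    """Inject an outage at each transition and report the ones that fail."""
    return [s for s in TRANSITIONS if run_session(outage_at=s) == "inconsistent"]

print(find_vulnerable_transitions())  # ['commit']
```

On a real network an outage would rarely happen to coincide with the commit step. Injecting one at every transition in turn finds the defect in four runs.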
Network emulation has been used in the past to test distributed software systems but has required some combination of modifications to the programs being tested, to the communication stack of the host sending or receiving the traffic, and to the contents of the packetized traffic. Such modifications are costly and make the system being tested different from the system delivered to users. This loss of relevance makes the testing less valuable. These modifications may also change the system so significantly that a defect cannot be reproduced for the purpose of debugging.
Other emulators are available in the industry, such as the Lancaster emulator described in "A Network Emulator to Support the Development of Adaptive Applications," written by Nigel Davies, Gordon S. Blair, Keith Cheverst and Adrian Friday and published in Proceedings of the 2nd Usenix Symposium on Mobile and Location independent Computing, Ann Arbor, Mich., U.S. on Apr. 10, 1995. However, they use some of the intrusive modifications mentioned above, which significantly change the system being tested and raise cost.
Presently known emulators are intrusive because they change the programs that are being tested or the communications stack of the hosts on which they run, thereby imposing a computational load on every participating host and consequently changing the timing of the distributed systems being tested. The present invention eliminates the need to make intrusive modifications and requires only a simple change to the table that maps the IP network addresses of hosts to MAC addresses of hosts and the turning off of the ARP (Address Resolution Protocol) mechanism that normally maintains this table. These configuration changes must be made on each computer that runs the software to be tested.
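The passage above does not state what the altered table entries contain. One possible reading, assumed here purely for illustration, is that each peer's IP address is mapped to the MAC address of a separate emulator machine, so that frames pass through it without any change to the tested programs. The sketch below models a host's IP-to-MAC table as a dictionary; every address and name in it is hypothetical.

```python
EMULATOR_MAC = "02:00:00:00:00:01"  # hypothetical emulator interface

# Table as ARP would normally maintain it: each peer IP maps to its own MAC.
normal_table = {
    "192.0.2.10": "02:00:00:00:00:aa",
    "192.0.2.11": "02:00:00:00:00:bb",
}

# Assumed static table for testing: every peer IP maps to the emulator.
redirected_table = {ip: EMULATOR_MAC for ip in normal_table}

def next_hop_mac(table, dest_ip):
    """Return the MAC address a host would place in an outgoing frame.

    With ARP turned off there is no query to fall back on, so a missing
    entry is an error rather than a trigger for address resolution.
    """
    try:
        return table[dest_ip]
    except KeyError:
        raise LookupError(f"no static entry for {dest_ip} and ARP is disabled")

print(next_hop_mac(normal_table, "192.0.2.10"))      # 02:00:00:00:00:aa
print(next_hop_mac(redirected_table, "192.0.2.10"))  # 02:00:00:00:00:01
```

Under this assumption the application still addresses its peer by the peer's real IP address. Only the link-layer destination changes, which is why the programs under test and the packet contents are left untouched.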