The Transmission Control Protocol (TCP) is one of the most widely used data transport protocols in network environments. TCP is used in the Internet and in many intranets and network backbones where application servers and Network Attached Storage (NAS) servers are interconnected by a high-speed network. TCP is also used as the reliable transport mechanism for many network attached storage protocols, including the Common Internet File System (CIFS) and the Network File System (NFS), and for Storage Area Network protocols such as Small Computer Systems Interface commands encapsulated within the Transmission Control Protocol/Internet Protocol (iSCSI). TCP is a robust protocol that provides reliable connection-oriented communication over a wide variety of network topologies and at a variety of speeds. However, the observed data transfer rate may be less than anticipated, or the actual availability of data servers may be less than estimated, because of numerous factors. The availability of data servers may be affected by, for example, problems with the network interconnecting the data servers and clients. Such problems could include, e.g., a failure of routers or switches and loss of connectivity due to incorrect cabling and/or failures in a cable. Several other problems may arise in the implementation of the protocol. Packet transmitters or receivers (or both) may be poorly configured or overloaded. The network, or a portion thereof, may lack sufficient bandwidth. Multiple data packet losses may occur and require coarse-grained retransmission timeouts. The file system protocol stack layered over TCP/IP may have implementation errors. The TCP stack at the endpoints may exhibit bad behavior. Optimizations used within the network path may have generated errors. Finally, interoperability errors may arise between heterogeneous operating system endpoints.
These errors and problems may be caused by faulty protocol implementations in software or hardware executing on a network. They may also be caused by, for example, improper configuration of hardware and/or software settings and parameters, faulty network wiring, or improperly planned network installations.
As most TCP/IP and file system protocol implementations are not designed for easy debugging of problems, various techniques have been designed to diagnose protocol-related problems. One common technique uses some form of packet capture mechanism, such as the Berkeley Packet Filter (BPF), to capture packets. Experts then manually analyze the captured low-level packet traces to isolate abnormal protocol behavior and to trace that behavior to misconfigured, overloaded or faulty elements in the network. Although this technique permits the analysis of specific transmissions, it is generally inconvenient, costly and error-prone.
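As a rough illustration of the kind of check an expert performs by hand on a captured trace, the following sketch flags data segments whose sequence number has already been seen on the same connection and reports the delay since the first copy. The Segment record layout and the function name find_retransmissions are hypothetical and are not part of any capture tool; a real analyzer would also have to handle sequence number wraparound, partially overlapping segments and keep-alive probes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One captured TCP segment, reduced to the fields this sketch needs.

    Hypothetical record layout; a capture tool would supply far more detail.
    """
    timestamp: float   # capture time in seconds
    src: str
    sport: int
    dst: str
    dport: int
    seq: int           # TCP sequence number of the first payload byte
    payload_len: int   # number of payload bytes carried


def find_retransmissions(trace: List[Segment]) -> List[Tuple[Segment, float]]:
    """Return (segment, delay) pairs for data segments seen more than once.

    The delay is measured from the first copy of the same sequence number
    in the same direction of the same connection.
    """
    first_seen: Dict[Tuple[str, int, str, int, int], float] = {}
    repeats: List[Tuple[Segment, float]] = []
    for seg in trace:
        if seg.payload_len == 0:
            # Pure acknowledgements carry no data and may repeat a
            # sequence number legitimately, so they are skipped.
            continue
        key = (seg.src, seg.sport, seg.dst, seg.dport, seg.seq)
        if key in first_seen:
            repeats.append((seg, seg.timestamp - first_seen[key]))
        else:
            first_seen[key] = seg.timestamp
    return repeats
```

A long delay between the first copy and the repeat (on the order of a second or more, as an assumed rule of thumb) would suggest a coarse-grained retransmission timeout rather than a fast retransmit.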
In a modern networking environment, the typical local area network (LAN) centered around a company's operations may utilize hardware and/or software from a variety of different vendors. The customer (or client) of these vendors may be faced with a variety of differing protocol implementations and error reporting mechanisms. As customers typically do not have a staff of well-trained network engineers to aid in debugging or isolating protocol errors, they rely on the customer service and technical staffs of the vendors from which they purchase or acquire their software and/or hardware. As network environments become increasingly heterogeneous, the growing number of protocol errors requires an ever increasing amount of customer service time and manpower. This results in an increased cost to vendors, which may be passed on to customers in the form of increased prices.
By way of example, a network environment 100 is shown in FIG. 1. The network environment 100 is centered around a wide area network, such as the well-known Internet 105. Connected to the Internet 105 is a customer service site 110. The customer service site 110 is associated with a provider of hardware and/or software executing on the LAN and provides, inter alia, a data connection through which customers upload packet traces for technical support analysis. Also connected to the Internet 105 is a Local Area Network (LAN) 115 that is associated with a particular customer's network. Attached to the LAN 115 may be an application server 120, a client 125 and a file server 130, which is also connected to a set of disks 135. It should be noted that the network configuration shown here is exemplary only. For example, the LAN 115 may instead be a Wide Area Network (WAN), a Virtual Private Network (VPN) or any other networking environment. Additionally, any number of devices, including, for example, additional servers, clients and file servers, may be connected to the LAN 115.
In known diagnostic systems, a customer needs to create a packet trace and upload it to the customer service site for manual analysis by a customer service engineer. A flowchart detailing the steps of such a manual analysis procedure 200 is shown in FIG. 2. Initially, in step 205, the customer creates a trace file. The trace file can be captured at the problem scene using well-known and conventional trace-capture programs, including freeware applications such as Ethereal, snoop, tcpdump or NetMon. Next, in step 210, the customer sends the trace file to the customer service site. Typically, the customer uses the File Transfer Protocol (FTP) or a similar protocol to send the trace file to the customer service site 110 over the Internet 105. A noted disadvantage of using FTP to send these trace files is their size. In some environments, the trace file may exceed 1 gigabyte (GB) in size. Thus, a related disadvantage is the network bandwidth required to transfer such a large quantity of data.
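To give a sense of scale for the bandwidth concern, the following sketch estimates how long an upload of a given size would take over a link of a given rate. The function name transfer_seconds, the efficiency parameter and the example link rate are assumptions chosen for illustration only; actual transfer times depend on protocol overhead, congestion and loss.

```python
def transfer_seconds(size_bytes: int,
                     link_bits_per_second: float,
                     efficiency: float = 1.0) -> float:
    """Estimate the time needed to send size_bytes over a link.

    efficiency is the assumed fraction of the raw link rate that is
    actually available to the transfer (1.0 means an ideal link).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


# Example: a 1 GB trace (taken here as 10**9 bytes) over an assumed
# 1.544 Mbit/s link, ignoring all overhead.
minutes = transfer_seconds(10**9, 1.544e6) / 60
```

Under these assumptions the upload takes roughly 86 minutes, before any analysis of the trace can begin.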
At some time after receipt of the trace file, in step 215, customer service manually analyzes the file to diagnose the problem. Customarily, each customer service engineer analyzing a given trace file draws on accumulated know-how and personal experience to identify particular problems. Individual engineers may be able to quickly recognize particular errors in a given trace file because those errors recur. Hence, the effectiveness of this manual analysis relies on the individual capabilities and skill of the customer service engineer performing the analysis. Depending upon the complexity of the trace file, it may take a customer service expert up to 24 hours to analyze it manually. Once the trace file has been analyzed and a diagnosis generated, the customer service center notifies the customer of the identified problem (step 220). The customer may then take the prescribed remedial action to correct the identified problem.
As the number and complexity of network traces delivered to customer service stations increase, customer service stations experience longer resolution cycles. However, customers demand quick resolution of networking errors because of the potential for lost productivity and income. Thus, a technique for automating trace analysis and diagnosis to identify protocol errors and to suggest possible solutions is highly desirable.