1. Field of the Invention
This invention relates to methods for testing computer systems, and more particularly to a method and an apparatus for finding a source of failure during a file system access by retrying the failed file system access from different locations in a computer system.
2. Related Art
As computer systems grow increasingly complex, it is becoming more and more difficult to isolate and identify the sources of some computer system failures. A distributed computing system is particularly prone to this type of problem because it typically spans numerous computers and file servers coupled together through a computer network. Consequently, failures can arise in any of the interacting hardware, software and even firmware components of the distributed computing system. Hence, it is often difficult, if not impossible, to identify the source of an error by testing from a single location in the distributed system.
Distributed computing systems provide significant advantages for computer system users. Distributed computing systems typically include a distributed file system, which allows processors on different nodes of the system to access files on other nodes of the system. For example, workstations can often access files residing on remote file servers. Additionally, distributed computing systems often provide facilities to make these file accesses transparent, so that application programs and workstation users can access files on the file server in the same way that they access files on a local disk drive.
Although distributed file systems provide significant advantages for computer system users, the process of designing, building and configuring a distributed file system can be very complicated. It is often very hard to identify the source of a failure in a distributed file system, because a design flaw or a component failure at any of many different locations in the distributed system may potentially cause the failure.
Existing diagnostic tools can often only detect that a failure has occurred, not the source of the failure. These diagnostic tools typically generate a stream of file references that systematically test accesses to different storage media in the distributed system. During testing, errors in writing to or reading from a particular file are typically indicated on a display. However, the fact that a failure occurred during a particular file operation is often not enough to identify the source of the failure. For example, a failure in writing a test pattern to a location in a file and then reading the same pattern back again may be caused by a number of factors, including: (1) a failed network interface card, (2) a failed network driver, (3) a bad disk sector or (4) a failure in a computer system's memory. In order to isolate the source of a failure, additional tests must be performed, during which various system components may be swapped, modified or otherwise manipulated.
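The write-then-read-back check described above can be sketched as follows. This is a minimal illustration only; the function name and parameters are hypothetical and do not refer to any existing diagnostic tool. As the passage notes, a mismatch tells the tester only that some component along the access path failed, not which one.

```python
import os

def verify_write_read(path, offset, pattern):
    """Write a test pattern at the given offset of a file, read it
    back, and compare.  Returns True if the pattern survived the
    round trip.  A mismatch indicates a failure somewhere along the
    access path (network interface, driver, disk sector, memory),
    but does not by itself identify the failing component."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(pattern)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the storage medium
        f.seek(offset)
        readback = f.read(len(pattern))
    return readback == pattern
```

In a distributed file system, `path` may name a file on a remote server mounted transparently, so the same check exercises the network interface, the network driver, and the remote disk at once.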
However, existing diagnostic tools do not facilitate this additional testing, because they do not provide facilities to retry a failed operation. It is typically necessary to rerun an entire test from the beginning to reproduce a failure. This process can take many hours, and possibly days. Consequently, reproducing a failure multiple times in order to locate its source can take a great deal of time.
Furthermore, in a large computer system, and especially in a distributed computing system, it may not be possible to locate the source of a failure by merely retrying the failure from a single location in the system. For example, it may be necessary to retry a failed file system access from both a workstation and a file server to determine that the failure is caused by a faulty network connection between the workstation and the file server.
Additionally, if an entire test must be rerun to retry a failure, it may not be possible to reproduce an error because system parameters can change over time. For example, the allocation of buffers to disk blocks in the computer system's disk cache will change over time. These changes can lead to different results for the same test sequence. Also, if the diagnostic tool generates randomized sequences of file references, these randomized sequences may be impossible to reproduce. Furthermore, intermittent errors may be impossible to detect if a great deal of time passes before a troublesome file access is retried. This is because environmental parameters, such as temperature, can change, leading to different results for the same test sequence.
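One conventional way to avoid the irreproducibility of randomized reference sequences noted above is to derive the sequence from a recorded seed, so the exact same sequence can be regenerated on demand. The sketch below is illustrative only; the names and reference format are assumptions, not part of any existing tool.

```python
import random

def generate_file_references(seed, count, num_files, max_offset):
    """Generate a randomized but reproducible sequence of file
    references.  Each reference is an (operation, file index, offset)
    tuple.  Recording the seed allows the identical sequence to be
    regenerated later, so a failing access can be revisited without
    relying on an unreproducible random stream."""
    rng = random.Random(seed)  # private generator, isolated from other code
    refs = []
    for _ in range(count):
        op = rng.choice(("read", "write"))
        refs.append((op, rng.randrange(num_files), rng.randrange(max_offset)))
    return refs
```

Two runs with the same seed yield identical sequences, which is the property a retry facility would need in order to replay the references leading up to a failure.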
What is needed is a diagnostic tool for testing file references and detecting the source of a failure during a file reference, which allows a failure in a file reference to be immediately retried "on the fly." This would allow system components to be manipulated between retries to more rapidly determine the source of a failure.
Additionally, what is needed is a diagnostic tool for testing file references and detecting the source of a failed file reference, which allows a failure in a file reference to be retried from different locations within a computer system in order to isolate a source of failure.