Developers of large disk arrays typically face long and difficult debugging processes when bringing a product on line and to completion. After all the components of such an array have been assembled and it has been confirmed that the array can successfully run one command at a time, the developers begin to confirm that the array will operate properly in a more realistic environment. Testing in such an environment involves a much more rigorous procedure, in which multiple computers issue multiple simultaneous or closely bunched commands to multiple disk drives in the array. Ultimately, in such a test environment, and also when the system is in use in the real world, there may be hundreds, or even thousands, of commands pending for processing at any one time.
During such a test, as well as in a real-world computing environment, operation often starts off smoothly and the array is able to handle the large number of issued commands. As the test runs for a time, performance may slow down, and the multiple computers begin to report that disk access is taking too long. In extreme cases, the entire array may “lock up”, so that no computer can do anything with any disk in the array, no matter how long the computer waits for a response.
These types of problems have traditionally been very hard to diagnose and fix, because the processing that actually causes the problem takes place long before any symptoms are noticed. By the time any of the symptoms described above are noticed, the commands and conditions that caused the problem have likely long since passed through the system.
Sometimes, such symptoms are caused by a particular command to a particular disk drive taking far longer than it should to complete. In extreme cases, the command may simply hang and never complete. Determining precisely which one command, in a sea of hundreds or thousands of other commands pending at the same time, caused such a problem has always been a difficult task.
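To make the scale of this search concrete, the following is a minimal illustrative sketch, not a description of any particular array's firmware. It assumes, hypothetically, that a snapshot of all pending commands is available and that each entry records when the command was issued; the names `PendingCommand`, `find_overdue`, and the 30-second limit are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingCommand:
    """One outstanding command. All field names are hypothetical."""
    command_id: int
    drive_id: int
    issued_at: float  # seconds, taken from any monotonic clock


def find_overdue(pending, now, limit_seconds):
    """Return commands outstanding longer than limit_seconds, oldest first."""
    overdue = [c for c in pending if now - c.issued_at > limit_seconds]
    return sorted(overdue, key=lambda c: c.issued_at)


# Snapshot of 1,000 recently issued commands spread over 8 drives,
# issued 1 ms apart starting at t = 100 s.
pending = [PendingCommand(i, i % 8, 100.0 + i * 0.001) for i in range(1000)]
# One command that was issued much earlier and never completed.
pending.append(PendingCommand(command_id=5000, drive_id=3, issued_at=40.0))

# Assumed limit: anything pending for more than 30 s is treated as suspect.
stuck = find_overdue(pending, now=101.0, limit_seconds=30.0)
for c in stuck:
    print(f"command {c.command_id} on drive {c.drive_id} "
          f"has been pending since t={c.issued_at}")
```

Such a scan only helps if issue times are recorded and the snapshot is taken while the offending command is still pending, which is exactly the information that is usually missing by the time symptoms appear.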
It would therefore be beneficial to have some way to detect such problems quickly, and to provide a picture of what type of processing was being performed at the time the problem occurred.