It is difficult and time-consuming to locate the cause of performance degradation in a complex system, particularly a production system. Although consistent performance is a common system design goal, almost every system, especially one built from many components, layers, or complex logic, suffers unexpected performance degradation in practice. Typical causes include design and/or implementation defects or limits, software configuration problems, and hardware limits. Design and/or implementation defects or limits may concern a specific component, e.g., locking or serialization in a key I/O path, or, more commonly, unexpected interactions between several components, e.g., resource contention or limited scalability. For example, a typical enterprise storage system comprises protocol handling, cache, data reduction, automatic thin provisioning, snapshots, and a growing number of background services. Any single component, or any interaction between components, may affect user-visible performance. Software configuration problems may involve, for example, block size, cache size, or queue size. Hardware limits arise, for example, when a specific hardware component (a NIC/FC port, CPU, or disk) reaches its upper limit and becomes a bottleneck, so that the end-to-end performance of the system cannot improve further.
At present, determining the cause of system performance degradation is usually a lengthy, postmortem process. The process generally requires manually collecting materials and building a simulated environment to reproduce the problem. Because performance behavior cannot be captured at the moment performance degrades, and because the process lacks orchestration, it is usually manual, repetitive, and interactive, which results in low efficiency, uncertain accuracy, and high cost.
Therefore, a more accurate and efficient approach is required in the art to solve the above problem.