The present invention relates to performance analysis of network computer storage systems and, more particularly, to rule-based analysis of metrics gathered from such systems and generation of recommendations for resolving actual or anticipated performance problems.
Computer workstations and application servers (collectively hereinafter referred to as “clients”) frequently access data that is stored remotely from the clients. In these cases, computer networks are used to connect the clients to storage devices (such as disks) that store the data. For example, Information Systems (IS) departments frequently maintain “disk farms,” tape backup facilities and optical and other storage devices (sometimes referred to as media) in one or more central locations and provide access to these storage devices via computer networks. This centralized storage (commonly referred to as “network storage”) enables data stored on the storage devices to be shared by many clients scattered throughout an organization. Centralized network storage also enables the IS departments to store the data on highly reliable (sometimes redundant) equipment, so the data remains available, even in case of a catastrophic failure of one or more of the storage devices. Centralized data storage also facilitates making frequent backup copies of the data and providing access to backed-up data, when necessary.
Specialized computers (variously referred to as file servers, storage servers, filers, etc., collectively hereinafter referred to as “storage appliances”) located in the central locations make the data on the storage devices available to the clients. Software in the storage appliances and other software in the clients cooperate to make the central storage devices appear to users and application programs as though the storage devices are locally connected to the clients.
In addition, the storage appliances can perform services that are not visible to the clients. For example, a storage appliance can present a logical “volume” to clients and implement the volume across a set of physical disks. That is, the appliance satisfies write and read requests issued by the clients to the logical volume by writing to, or reading from, one or more of the physical disks of the set. Spreading the contents of the volume across the set of physical disks (commonly referred to as “striping”) improves throughput by dividing the input/output (I/O) workload among the physical disks of the set. Thus, some I/O operations can be performed on one physical disk while other I/O operations are performed on other physical disks. Furthermore, a logical volume can provide more storage capacity than a single physical disk, and that capacity can be dynamically increased by adding physical disks to the logical volume.
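By way of illustration only, the mapping from a logical volume address to a physical disk can be sketched as follows. The sketch assumes a simple round-robin layout with a fixed-size stripe unit; the description above does not prescribe any particular layout, and the function name `locate_block` and its parameters are hypothetical.

```python
from typing import Tuple


def locate_block(logical_block: int, num_disks: int,
                 stripe_unit_blocks: int = 1) -> Tuple[int, int]:
    """Map a logical volume block to (disk index, block offset on that disk).

    Assumption (not from the source text): stripe units are laid out
    round-robin across the disks, each unit holding `stripe_unit_blocks`
    consecutive logical blocks.
    """
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    if num_disks < 1 or stripe_unit_blocks < 1:
        raise ValueError("num_disks and stripe_unit_blocks must be positive")

    stripe_unit = logical_block // stripe_unit_blocks     # which stripe unit overall
    offset_in_unit = logical_block % stripe_unit_blocks   # position inside that unit
    disk = stripe_unit % num_disks                        # round-robin disk choice
    unit_on_disk = stripe_unit // num_disks               # units already on that disk
    return disk, unit_on_disk * stripe_unit_blocks + offset_in_unit
```

Under these assumptions, consecutive stripe units land on different disks, so a large sequential request is divided among all members of the set, and requests for unrelated regions of the volume tend to be serviced by different disks at the same time.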
In another example, a storage appliance can redundantly store data on a set of storage devices, such as on a Redundant Array of Inexpensive (or Independent) Disks (RAID). If one member of the RAID fails, the storage appliance uses the remaining members of the RAID to continue satisfying read and write commands from the clients until (optionally) the failed RAID member is replaced.
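One common way such redundancy can be provided is single-parity protection, in which a parity block is computed as the bytewise exclusive-OR (XOR) of the corresponding data blocks. The following minimal sketch assumes that scheme; the description above does not identify a particular RAID level, and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Return the bytewise XOR of equally sized blocks.

    Assumption (not from the source text): single-parity redundancy.
    Applied to the data blocks, this yields the parity block. Applied to
    the surviving data blocks plus the parity block, it yields the
    contents of the one missing block.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the bytes found at the same offset in every block.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Usage: three data members plus one parity member.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Suppose the second member fails; rebuild its block from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine serves both to compute parity and to reconstruct a lost block, which is how the appliance in this sketch could continue to satisfy reads that target the failed member.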
In some cases, storage devices are directly connected to storage appliances. Such directly attached storage devices are sometimes referred to as Network Attached Storage (NAS). In other cases, storage devices are connected to storage appliances via dedicated, high-performance networks, commonly referred to as Storage Area Networks (SANs). Clients can be connected to storage appliances via local area networks (LANs), wide area networks (WANs) or a combination of LANs and WANs.
Maintaining a high level of performance (such as fast response time and/or high throughput) of storage appliances and their related storage devices can be challenging, especially given the constantly increasing amount of data that IS departments are called on to store and to make available to their respective clients. This challenge is compounded by ever-changing workloads placed on the storage appliances and storage devices as a result of shifting business priorities within user communities. Manually collecting and analyzing system and performance metrics from storage appliances is time-consuming and yields results of varying quality, depending, among other things, on the experience level of the person performing the analysis. Furthermore, when databases are moved within network storage systems, when the software in storage appliances is upgraded to newer versions, or when new features are added, these systems often suffer performance degradations. These performance problems are sometimes caused by “bottlenecks” created by the changes. Other times, the problems result from increased workloads that the changes place on processors, memory or other resources in the storage appliances. Thus, analyzing system and performance metrics of network storage components and determining appropriate actions that can be taken to correct performance problems are difficult tasks. Furthermore, predicting the performance impact of proposed changes to a network storage system is particularly challenging.