A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage (NAS) environment, a storage area network (SAN) and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information (parity) with respect to the striped data. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4 or equivalent high-reliability implementation. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
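The parity scheme described above can be illustrated with a minimal sketch. It assumes byte-wise XOR parity, as is conventional for RAID 4; the block contents and the three-data-disk layout are hypothetical values chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread across three hypothetical data disks.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block that would be stored on the dedicated parity disk.
parity = xor_blocks(stripe)

# Simulate the loss of data disk 1 and rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [stripe[0], stripe[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, any single missing block in the stripe can be recovered from the remaining blocks and the parity.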
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on the disks as a hierarchical structure of data containers, such as files and blocks. For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system.
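The relationship between per-file fbns and volume-wide vbns may be sketched as follows. The `Volume` and `File` classes, and the simple sequential vbn allocator, are hypothetical constructs introduced for illustration and are not drawn from any particular file system.

```python
class Volume:
    """Hypothetical volume that hands out vbns from one shared address space."""

    def __init__(self):
        self.next_vbn = 0
        self.blocks = {}  # vbn -> block data

    def allocate(self, data):
        vbn = self.next_vbn
        self.next_vbn += 1
        self.blocks[vbn] = data
        return vbn


class File:
    """Hypothetical file whose block map translates fbn -> vbn."""

    def __init__(self, volume):
        self.volume = volume
        self.block_map = []  # list index is the fbn, value is the vbn

    def append(self, data):
        self.block_map.append(self.volume.allocate(data))
        return len(self.block_map) - 1  # fbn of the new block

    def read(self, fbn):
        return self.volume.blocks[self.block_map[fbn]]


vol = Volume()
file_a, file_b = File(vol), File(vol)
file_a.append(b"a0")
file_b.append(b"b0")
file_a.append(b"a1")
```

Each file numbers its own blocks from fbn 0, while the interleaved writes draw vbns from the single space shared by the volume.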
A known type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL®) file system available from Network Appliance, Inc., Sunnyvale, Calif.
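The write-anywhere behavior described above may be sketched in simplified form. The `WriteAnywhereStore` class and its next-free-block allocator are assumptions made for illustration only and do not represent the WAFL implementation.

```python
class WriteAnywhereStore:
    """Hypothetical store that never overwrites a block in place."""

    def __init__(self):
        self.disk = {}        # vbn -> block data
        self.block_map = {}   # fbn -> current vbn
        self.next_free = 0    # next unused vbn

    def write(self, fbn, data):
        """Write data for fbn to a fresh vbn; return (old_vbn, new_vbn)."""
        new_vbn = self.next_free
        self.next_free += 1
        self.disk[new_vbn] = data
        old_vbn = self.block_map.get(fbn)
        self.block_map[fbn] = new_vbn  # repoint the file at the new location
        return old_vbn, new_vbn

    def read(self, fbn):
        return self.disk[self.block_map[fbn]]


store = WriteAnywhereStore()
first = store.write(0, b"v1")
second = store.write(0, b"v2")  # the "dirtied" block goes to a new vbn
```

After the second write, the previous contents remain at the old vbn while the block map refers to the new location.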
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing file-based and block-based protocol messages (in the form of packets) to the system over the network. In the case of block-based protocol packets, the client requests (and storage system responses) address the information in terms of block addressing on disk using, e.g., a logical unit number (lun).
A plurality of storage systems may be interconnected to provide a storage system environment configured to service many clients. Each storage system may be configured to service one or more volumes, wherein each volume stores one or more data containers. Yet often a large number of data access requests issued by the clients may be directed to a small number of data containers serviced by a particular storage system of the environment. A solution to such a problem is to distribute the volumes serviced by the particular storage system among all of the storage systems of the environment. This, in turn, distributes the data access requests, along with the processing resources needed to service such requests, among all of the storage systems, thereby reducing the individual processing load on each storage system. However, a noted disadvantage arises when only a single data container, such as a file, is heavily accessed by clients of the storage system environment. As a result, the storage system attempting to service the requests directed to that data container may exceed its processing resources and become overburdened, with a concomitant degradation of speed and performance.
One technique for overcoming the disadvantages of having a single data container that is heavily utilized is to stripe the data container across a plurality of volumes configured as a striped volume set (SVS), where each volume is serviced by a different storage system, thereby distributing the load for the single data container among a plurality of storage systems. One technique for data container striping is described in the above-incorporated U.S. Pat. No. 7,698,289, entitled STORAGE SYSTEM ARCHITECTURE FOR STRIPING DATA CONTAINER CONTENT ACROSS VOLUMES OF A CLUSTER. Here, stripes of content (data) of a data container are allocated to each volume of the SVS in a manner that balances data across the volumes of the SVS. In addition, various volumes of the SVS are configured to store (“cache”) meta-data associated with the container. As described in the above-incorporated patent, a SVS may be utilized on a storage system that services both file-based and block-based data access requests. As such, a data container may be a file or other file system entity or may be a logical unit number that is accessible via block-based requests such as SCSI, iSCSI or FCP.
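One way to allocate stripes so that data is balanced across the volumes of an SVS may be sketched as follows. The round-robin placement rule, the `locate_volume` name, and the 4096-byte stripe width are assumptions made for illustration; the text above does not specify the actual striping algorithm.

```python
from collections import Counter

STRIPE_WIDTH = 4096  # hypothetical stripe width in bytes

def locate_volume(offset, num_volumes, stripe_width=STRIPE_WIDTH):
    """Map a byte offset within a data container to an SVS volume index.

    Assumes simple round-robin placement of consecutive stripes.
    """
    stripe_index = offset // stripe_width
    return stripe_index % num_volumes

# Place twelve consecutive stripes of one container on a three-volume SVS.
placements = [locate_volume(i * STRIPE_WIDTH, 3) for i in range(12)]
per_volume = Counter(placements)
```

Under this assumed rule, each of the three volumes receives four of the twelve stripes, so requests spread over the whole container are spread evenly over the volumes.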
By striping the content of a data container across the volumes of a SVS, the load on the storage system environment is distributed across a plurality of nodes between the clients and the SVS volumes. Illustratively, each node comprises a network element (N-blade) and a disk element (D-blade). Each N-blade includes functionality that enables the node to connect to clients over a computer network, whereas each D-blade manages data storage on one or more storage devices. Within a node, the N-blade and D-blade share a high bandwidth system bus, and, between the nodes, each N-blade is operatively interconnected to every other D-blade by a cluster switching fabric, which may be, e.g., a Gigabit Ethernet switch. Generally, the system bus within a node has higher bandwidth and/or lower latency than the cluster switching fabric which interconnects the nodes. Each N-blade contains functionality, e.g., a Locate( ) function, that enables it to identify the appropriate D-blade to which a given data access request should be routed for processing. It should be noted that not all N-blades and D-blades are necessarily paired into nodes, as there may be more N-blades than D-blades (or vice versa) depending on the storage system architecture. Thus, some N-blades may lack a direct system bus connection to a D-blade, or vice versa. The N-blades or D-blades lacking a direct system bus connection utilize only the cluster switching fabric for intra-cluster communication.
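The choice between the two interconnects described above may be sketched as follows. The `Blade` type, the `choose_path` function, and the blade and node names are hypothetical; a blade that is not paired into a node is modeled with a node value of `None`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Blade:
    name: str
    node: Optional[str]  # None models a blade not paired into a node

def choose_path(n_blade, d_blade):
    """Return the interconnect a request would traverse from N-blade to D-blade."""
    if n_blade.node is not None and n_blade.node == d_blade.node:
        return "system_bus"      # both blades reside in the same node
    return "cluster_fabric"      # different nodes, or an unpaired blade

n1 = Blade("N1", "node-A")
d1 = Blade("D1", "node-A")
d2 = Blade("D2", "node-B")
n3 = Blade("N3", None)  # unpaired N-blade with no system bus connection
```

Here a request from N1 to D1 stays on the system bus, whereas a request from N1 to D2, or any request entering through the unpaired N3, must cross the cluster switching fabric.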
Although all data access requests may be directed to any N-blade, which then routes the requests to the appropriate D-blade to thereby balance the bandwidth and processing load on all D-blades and disk volumes, it is desirable to optimize the storage system by balancing the bandwidth and processing load on all N-blades. One solution is to equally divide all data access requests from a client among all available N-blades, presumably balancing the bandwidth and processing load among N-blades. Another solution is to utilize, e.g., a least-queue-depth algorithm to balance the load across all paths between the client and the N-blades. Under either solution, the N-blades route the data access requests to the appropriate D-blades via either the high bandwidth system bus within a particular node or the cluster switching fabric among the nodes, as necessary.
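The two N-blade selection policies mentioned above may be sketched as follows. The function names, blade names, and queue depths are hypothetical, and equal division is modeled here as a fixed rotation, which is one possible reading of dividing requests equally.

```python
from itertools import cycle

N_BLADES = ["N1", "N2", "N3"]  # hypothetical blade names

def equal_division(n_blades):
    """Yield N-blades in a fixed rotation, one per request."""
    return cycle(n_blades)

def least_queue_depth(queue_depths):
    """Return the N-blade with the fewest outstanding requests."""
    return min(queue_depths, key=queue_depths.get)

# Equal division: six requests are spread evenly over three N-blades.
rotation = equal_division(N_BLADES)
rr_choices = [next(rotation) for _ in range(6)]

# Least queue depth: the next request goes to the least busy path.
depths = {"N1": 3, "N2": 1, "N3": 2}
lqd_choice = least_queue_depth(depths)
```

Neither policy consults the location of the requested data; both consider only the N-blades themselves.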
Although the above-mentioned solutions balance the bandwidth and processing load among all N-blades, it is additionally desirable to optimize the storage system by reducing traffic on the cluster switching fabric interconnecting the N-blades and D-blades. In most applications, the shared system bus between the N-blade and D-blade within a node is the optimal route for a data access request to follow since, as noted, the system bus generally has higher bandwidth and/or lower latency than the cluster switching fabric. Thus, the techniques for balancing bandwidth and processing load among all N-blades, using either equal load division or a least-queue-depth algorithm, fail to achieve optimal data flow through the storage system because they utilize primarily the cluster switching fabric, rather than the shared system bus within each node, to route data access requests to the appropriate D-blades.
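The extent to which location-unaware balancing relies on the cluster switching fabric may be illustrated with a small count. The sketch assumes three fully paired nodes and requests whose target D-blades are uniformly distributed; the `fabric_fraction` function is hypothetical, and the locality-aware assignment is shown only as a contrasting baseline, not as a method described in the text.

```python
def fabric_fraction(assignments):
    """Fraction of (N-blade node, D-blade node) pairs that cross the fabric."""
    remote = sum(1 for n_node, d_node in assignments if n_node != d_node)
    return remote / len(assignments)

NODES = 3  # hypothetical cluster size, every node holding one N-blade and one D-blade

# Equal division ignores data location, so every pairing of receiving
# N-blade and target D-blade occurs equally often.
equal_division = [(n, d) for n in range(NODES) for d in range(NODES)]

# Contrasting baseline: each request enters through the N-blade that
# shares a node with its target D-blade.
locality_aware = [(d, d) for _, d in equal_division]

equal_fraction = fabric_fraction(equal_division)
local_fraction = fabric_fraction(locality_aware)
```

Under these assumptions, only one request in three reaches its D-blade over the system bus when load is divided equally, and the remaining two in three traverse the cluster switching fabric. In general, with k such nodes the fabric share is (k - 1)/k.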