A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the storage system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage (NAS) environment, a storage area network (SAN) and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information (parity) with respect to the striped data. The physical disks of each RAID group may include disks configured to store striped data (i.e., data disks) and disks configured to store parity for the data (i.e., parity disks). The parity may thereafter be retrieved to enable recovery of data lost when a disk fails. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
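By way of illustration only, the role of parity in recovering lost data may be sketched as follows. The sketch assumes a single parity block computed as the byte-wise exclusive-OR (XOR) of the data blocks in a stripe, as in RAID level 4 or 5 arrangements; the block contents and the helper name `xor_blocks` are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread across three hypothetical data disks.
data_blocks = [b"ACCT", b"RECV", b"2006"]

# The parity block is stored on a separate (parity) disk.
parity = xor_blocks(data_blocks)

# Simulate the failure of the second data disk, then rebuild its block
# from the surviving data blocks and the parity block.
lost = 1
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, which is why a single parity disk can tolerate the loss of any one disk in the group.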
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on the disks as a hierarchical structure of data containers, such as directories, files and blocks. For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system. The file system typically consists of a contiguous range of vbns from zero to n, for a file system of size n+1 blocks.
A known type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL®) file system available from Network Appliance, Inc., of Sunnyvale, Calif.
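The write-anywhere behavior described above may be sketched, under simplifying assumptions, as follows. The class `WriteAnywhereVolume`, its free list and its block map are hypothetical constructs introduced only for illustration; they do not represent the WAFL implementation or any actual on-disk format.

```python
class WriteAnywhereVolume:
    """Toy volume that never overwrites a block in place (illustrative only)."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks    # vbn -> block contents
        self.free = list(range(num_blocks))  # unallocated vbns, lowest first
        self.block_map = {}                  # (file name, file block no.) -> vbn

    def write(self, name, fbn, data):
        """Store data at a newly allocated vbn and remap the file block to it."""
        new_vbn = self.free.pop(0)
        self.blocks[new_vbn] = data
        old_vbn = self.block_map.get((name, fbn))
        self.block_map[(name, fbn)] = new_vbn
        # The old block is left untouched; this sketch does not reclaim it.
        return old_vbn, new_vbn

    def read(self, name, fbn):
        """Return the current contents of the given file block."""
        return self.blocks[self.block_map[(name, fbn)]]
```

In this sketch, writing the same file block twice places the second version at a different vbn while the first version remains intact at its original location, which mirrors the property that dirtied data is written to a new location rather than over the old one.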
The storage system may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access the directories, files and blocks stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the file system by issuing file system protocol messages (in the form of packets) to the storage system over the network. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS) and the Network File System (NFS) protocols, the utility of the storage system is enhanced.
Storage system users often desire to search the data containers stored within storage systems to identify those containers that satisfy one or more search criteria, such as phrases and terms. As noted, data containers may include a file, a directory, a virtual disk (vdisk), or other data construct that is addressable via a storage system. For example, a user may desire to search and locate all data containers serviced by the storage system that contain the phrase “Accounts Receivable.” By enabling searching of data containers on storage systems, users may make better use of their data, especially in large enterprises where the number of data containers may be in the tens or hundreds of millions.
To identify data containers that meet the search criteria, a search process may need to examine all of the data containers within a storage system every time a search is requested. In a typical storage system having tens or hundreds of millions of data containers, this is not a practical solution due to the substantial amount of time required to access every data container to determine if it meets the search criteria. To enable faster searching, a search index of information associated with the data containers may be generated for the storage system. The storage system search index may be constructed by performing a file system “crawl” through the entire file system (or other data container organizational structure) serviced by the storage system. Typically, a file system crawl involves accessing every data container within the file system to obtain the necessary index information. However, such a file system crawl is expensive both in terms of disk input/output operations and processing time, and suffers from the same practical problems as directly accessing each data container. That is, the file system crawl may slow access to the file system for tens of minutes at a time, which results in an unacceptable loss of performance.
Furthermore, the file system crawl must be performed at regular intervals to maintain up-to-date index information. As a result of the substantial processing time required, a further disadvantage of the file system crawl is that the search index information may be inconsistent with the current state of the file system, i.e., the index information only represents the file system as of the completion of the last file system crawl.
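A crawl-based search index of the kind discussed above may be sketched as a simple inverted index. The sketch assumes, purely for illustration, that the data containers are available as an in-memory mapping from a path to its text; the function names `crawl_and_index` and `search` are hypothetical, and the search matches containers holding every query term rather than the exact phrase.

```python
import re
from collections import defaultdict

def crawl_and_index(containers):
    """Visit every container and map each term to the containers holding it."""
    index = defaultdict(set)
    for path, text in containers.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(path)
    return index

def search(index, query):
    """Return the containers that hold every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

# Hypothetical containers present at the time of the crawl.
containers = {
    "/finance/q1.txt": "Accounts Receivable summary for Q1",
    "/finance/q2.txt": "Accounts payable summary for Q2",
}
index = crawl_and_index(containers)

# A container created after the crawl is invisible to the index
# until the next full crawl completes.
containers["/finance/q3.txt"] = "Accounts Receivable summary for Q3"
stale_hits = search(index, "Accounts Receivable")
```

Note that every container is read once per crawl, which corresponds to the input/output cost noted above, and that `stale_hits` omits the newly created container, which corresponds to the inconsistency between the index and the current state of the file system.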
One technique for improving search indexing capabilities is to utilize a search appliance operatively interconnected between a storage system and clients of the storage system. As used herein, a search appliance denotes a computer executing indexing and/or searching software for use in preparing search indices of data containers served by a storage system and/or for executing searches on the data containers. Illustratively, the search appliance executes indexing software that monitors data access requests as they flow through the search appliance to the storage system. By monitoring the data containers modified by the data access requests, the indexing software identifies which data containers should be retrieved from the storage system to update the index information, thereby obviating the need for a full file system crawl.
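The monitoring approach described above may be sketched as follows. The operation names in `MODIFYING_OPS` and the class `RequestMonitor` are hypothetical and serve only to illustrate the idea of recording which data containers have been modified; they do not correspond to any particular file system protocol or indexing product.

```python
# Hypothetical set of request types that change a data container.
MODIFYING_OPS = {"write", "create", "delete", "rename"}

class RequestMonitor:
    """Records the containers touched by modifying requests (illustrative only)."""

    def __init__(self):
        self.pending = set()  # containers whose index entries are out of date

    def observe(self, op, path):
        """Inspect one data access request as it passes through."""
        if op in MODIFYING_OPS:
            self.pending.add(path)

    def drain(self):
        """Return the containers to re-index and reset the pending set."""
        batch, self.pending = self.pending, set()
        return batch

monitor = RequestMonitor()
for op, path in [("read", "/a"), ("write", "/b"), ("write", "/b"), ("create", "/c")]:
    monitor.observe(op, path)
to_reindex = monitor.drain()
```

Only the containers returned by `drain` need to be retrieved from the storage system to refresh the index, so the work is proportional to the number of modified containers rather than to the size of the file system.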
Such a prior art storage system and search appliance environment 100 is shown in FIG. 1. One or more clients 105, which may comprise personal computers or other computers desiring access to the storage system 120, are interconnected with a search appliance 115, which is operatively interconnected with storage system 120. The search appliance is thus “in line” (or in-band) with the storage system 120 and clients 105. Coupled to the storage system 120 is a set of data storage devices 125, such as disks. In operation, a client 105 transmits a data access request to the search appliance 115, which examines the request and performs appropriate indexing operations before forwarding the request to the storage system 120 for processing. The search appliance 115 thus operates as a proxy for the storage system 120.
A noted disadvantage of such an environment 100 is that the search appliance 115 must perform indexing operations in real-time, i.e., as data access requests flow through to the storage system. Since all data access requests must flow through the in-band search appliance 115, each request incurs additional processing latency, which may result in an unacceptable level of performance. A further noted disadvantage is the possibility that the search appliance 115 may modify the data access requests or otherwise interject error conditions into data flowing to/from the storage system 120, thereby resulting in data corruption and/or data loss. Additionally, an in-band search appliance presents a single point of failure, i.e., if the search appliance fails, then the sole data path between the client and the storage system is lost. Furthermore, another noted disadvantage of in-band search appliances is that they must be as robust as the supporting storage system, as all data access requests flow through the search appliance. Any faults of protocol implementation in the search appliance may result in data corruption on the storage system.