The creation and storage of digitized data has proliferated in recent years. Accordingly, techniques and mechanisms that facilitate efficient and cost-effective storage of large amounts of digital data are common today. For example, a cluster network environment of nodes may be implemented as a data storage system to facilitate the creation, storage, retrieval, and/or processing of digital data. Such a data storage system may be implemented using a variety of storage architectures, such as a network-attached storage (NAS) environment, a storage area network (SAN), a direct-attached storage environment, and combinations thereof. The foregoing data storage systems may comprise one or more data storage devices configured to store digital data within data volumes.
A data storage system includes one or more storage devices. A storage device may be a disk drive organized as a disk array. Although the term “disk” often refers to a magnetic storage device, in this context a disk may, for example, be a hard disk drive (HDD) or a solid state drive (SSD).
In a data storage system, information is stored on physical disks as volumes that define a logical arrangement of disk space. The disks in a volume may be operated as a Redundant Array of Independent Disks (RAID). The RAID configuration enhances the reliability of data storage by the redundant writing of data stripes across a given number of physical disks in a RAID group and the storing of redundant information (parity) of the data stripes. The physical disks in a RAID group may include data disks and parity disks. The parity may be retrieved to recover data when a disk fails.
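The recovery role of parity can be illustrated with a minimal sketch. It assumes single-parity XOR, which the text above does not specify; the stripe contents and the helper name `xor_blocks` are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread over three hypothetical data disks.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# Redundant information that would be stored on the parity disk.
parity = xor_blocks(stripe)

# Simulate the failure of disk 1, then rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [stripe[0], stripe[2]]
rebuilt = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, which is why one failed disk per stripe can be tolerated under this assumption.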
Information on disks is typically organized in a file system, which is a hierarchical structure of directories, files and data blocks. A file may be implemented as a set of data blocks configured to store the actual data. The data blocks are organized within a volume block number (VBN) space maintained by the file system. The file system may also assign each data block in the file a corresponding file block number (FBN). The file system assigns sequences of FBNs on a per-file basis, while VBNs are assigned over a large volume address space. The file system generally comprises contiguous VBNs from zero to N−1, for a file system of size N blocks.
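The relationship between the two numbering spaces can be sketched as follows. This is an illustrative model only: the class `ToyVolume`, its sequential allocation policy, and the file names are assumptions, not part of any described file system.

```python
class ToyVolume:
    """Hypothetical volume whose VBNs run from 0 to n_blocks - 1."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.next_vbn = 0
        self.files = {}  # file name -> list of VBNs, indexed by FBN

    def append_block(self, name):
        """Add one block to a file and return its (FBN, VBN) pair."""
        if self.next_vbn >= self.n_blocks:
            raise RuntimeError("volume full")
        vbn = self.next_vbn          # VBNs are assigned volume-wide
        self.next_vbn += 1
        block_list = self.files.setdefault(name, [])
        block_list.append(vbn)
        return len(block_list) - 1, vbn  # FBNs restart at 0 for each file

vol = ToyVolume(n_blocks=8)
a0 = vol.append_block("a")
b0 = vol.append_block("b")
a1 = vol.append_block("a")
```

In this sketch both files have a block with FBN 0, yet each block has a distinct VBN, which reflects the per-file versus per-volume scope of the two number spaces.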
An example of a file system is a write-anywhere file system that does not overwrite data on disks. Instead, a data block is retrieved from a disk into a memory and is updated or modified (i.e., dirtied) with new data; the data block is thereafter written to a new location on the disk. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks, which results in efficient read operations. When accessing a block of a file in response to a request, the file system specifies a VBN that is translated into a disk block number (DBN) location on a particular disk within a RAID group. Since each block in the VBN space and in the DBN space is typically fixed (e.g., 4 KB) in size, there is typically a one-to-one mapping between the information stored on the disks in the DBN space and the information organized by the file system in the VBN space. The requested block is then retrieved from the disk and stored in a buffer cache of the memory as part of a buffer tree of the file. The buffer tree is an internal representation of blocks for a file stored in the buffer cache and maintained by the file system.
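The VBN-to-DBN translation can be sketched under a simplifying assumption: each data disk in the RAID group owns one contiguous range of VBNs. The actual mapping is not given in the text, so the function `vbn_to_dbn` and its parameters are hypothetical.

```python
BLOCK_SIZE = 4096  # fixed block size in bytes, matching the 4 KB example

def vbn_to_dbn(vbn, blocks_per_disk, n_data_disks):
    """Translate a VBN into (disk index, DBN).

    Assumes disk 0 holds VBNs 0 .. blocks_per_disk - 1, disk 1 holds
    the next range, and so on.
    """
    if not 0 <= vbn < blocks_per_disk * n_data_disks:
        raise ValueError("VBN outside the volume")
    return divmod(vbn, blocks_per_disk)

first = vbn_to_dbn(0, blocks_per_disk=100, n_data_disks=3)
later = vbn_to_dbn(250, blocks_per_disk=100, n_data_disks=3)

# Every VBN lands on a distinct (disk, DBN) pair: a one-to-one mapping.
locations = {vbn_to_dbn(v, 100, 3) for v in range(300)}
```

With fixed-size blocks, the byte offset on the chosen disk would simply be the DBN multiplied by `BLOCK_SIZE`.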
As discussed above, the requested data block is retrieved from the disk and stored in a buffer cache of the memory. If the data block is updated or modified by a CPU, the dirty data remains in the buffer cache. Multiple modifying operations by the CPU are cached before the dirty data is stored on the disk (i.e., the buffer is cleaned). The delayed sending of dirty data to the disk provides benefits such as amortized overhead of allocation and improved on-disk layout by grouping related data blocks together. In the write-anywhere file system, the point in time when a collection of changes to the data blocks is sent to the disk is known as a consistency point (CP). A CP may conceptually be considered a point-in-time image of the updates to the file system since the previous CP. The process of emptying the buffer cache by sending the dirty data to the disk is accomplished by collecting a list of inodes that have been modified since the last CP and then cleaning the inodes. It will be appreciated that cleaning dirty buffers involves assigning new locations on disk for the dirty buffers and then flushing the buffers to those locations on disk. An inode is a data structure used to store information, such as metadata, about a file, whereas data blocks are data structures used to store the actual data for the file. The information in an inode may include ownership of the file, access permission for the file, size of the file, file type, and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers which may reference the data blocks.
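The consistency point mechanism can be sketched in a few lines. This is a toy model under stated assumptions: the class `ToyFileSystem`, its dictionary-based disk, and its sequential choice of new locations are all hypothetical and stand in for the real buffer cache, inode, and allocation machinery.

```python
class ToyFileSystem:
    """Illustrative write-anywhere model with a dirty-buffer cache."""

    def __init__(self):
        self.disk = {}      # VBN -> block contents
        self.next_vbn = 0   # next never-used location on disk
        self.inodes = {}    # inode number -> {FBN: VBN} block pointers
        self.dirty = {}     # inode number -> {FBN: dirty data}

    def write(self, ino, fbn, data):
        """Modify a block in the cache only; nothing reaches the disk yet."""
        self.dirty.setdefault(ino, {})[fbn] = data

    def consistency_point(self):
        """Clean every dirty inode and return the number of buffers flushed."""
        flushed = 0
        for ino, buffers in self.dirty.items():
            pointers = self.inodes.setdefault(ino, {})
            for fbn, data in buffers.items():
                vbn = self.next_vbn   # assign a new location, never overwrite
                self.next_vbn += 1
                self.disk[vbn] = data
                pointers[fbn] = vbn   # the inode now references the new block
                flushed += 1
        self.dirty.clear()
        return flushed

fs = ToyFileSystem()
fs.write(1, 0, b"v1")
fs.write(1, 0, b"v2")   # a second modification of the same cached block
fs.write(1, 1, b"x")
first_cp = fs.consistency_point()

fs.write(1, 0, b"v3")
second_cp = fs.consistency_point()
```

Three modifying operations before the first CP produce only two disk writes, which illustrates the amortization described above. After the second CP the old copy of block 0 is still on disk at its original location, since the model never overwrites in place.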
Initially, a CPU issues a cleaner message indicating that the dirty buffers of one or more inodes need to be allocated on disk. In response, a block allocator in the file system selects free blocks on disks to which the dirty data is to be written and then queues the dirty buffers to a RAID group for storage. The block allocator examines a block allocation bitmap to select free blocks within the VBN space of a logical volume. The selected blocks are generally at consecutive locations on the disks in a RAID group for a plurality of blocks belonging to a particular file. When allocating blocks, the file system traverses a few blocks of each disk to lay down a plurality of stripes per RAID group. In particular, the file system chooses VBNs that are on the same stripe per RAID group to avoid RAID parity reads from disk.
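A stripe-aware selection from an allocation bitmap can be sketched as follows. The layout `bitmap[stripe][disk]`, the function name `allocate_on_free_stripes`, and the policy of using only entirely free stripes are assumptions made for illustration; the text does not describe the allocator at this level of detail.

```python
def allocate_on_free_stripes(bitmap, n_blocks):
    """Pick up to n_blocks free blocks, filling one free stripe at a time.

    bitmap[stripe][disk] is True when that block is already in use.
    Chosen blocks are marked as used and returned as (stripe, disk) pairs.
    """
    chosen = []
    for stripe, row in enumerate(bitmap):
        if len(chosen) == n_blocks:
            break
        if any(row):
            continue  # partially used stripe: writing here would need a parity read
        for disk in range(len(row)):
            if len(chosen) == n_blocks:
                break
            row[disk] = True
            chosen.append((stripe, disk))
    return chosen

# Three stripes over three data disks; stripe 0 already holds one block.
bitmap = [
    [False, True, False],
    [False, False, False],
    [False, False, False],
]
picked = allocate_on_free_stripes(bitmap, 4)
```

When a whole stripe is written at once, parity can be computed from the new data alone. In this sketch the last stripe may be filled only partially if the request does not end on a stripe boundary.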
In a cluster network environment having a plurality of multi-processors (MPs), multiple cleaner messages may be executing concurrently on the MPs. The block allocator of the file system is required to respond to the multiple cleaner messages by selecting free blocks on disks in a RAID group and then queuing dirty buffers to the RAID group for writing. With new hardware platforms providing an increasing number of CPUs, it becomes difficult for existing block allocators to respond to the cleaner messages in a timely manner, resulting in processing delays. Also, for efficient utilization of storage resources, depending on the particular type of data, it is advantageous to store the data in a specific type of disk or a specific location on disk. For example, if a particular data block is frequently accessed, it is advantageous to store the data block in an SSD or the outer cylinder of an HDD for quick retrieval. If, on the other hand, the data is not frequently accessed, it may be acceptable to store the data block in the inner cylinder of an HDD. Many existing block allocators do not allow a user to select the type of disk or a location on disk to write the dirty buffers.