1. Field of the Invention
The present invention relates to managing metadata.
2. Background of the Invention
Storage devices are employed to store data that is accessed by computer systems. Examples of basic storage devices include volatile and non-volatile memory, floppy drives, hard disk drives, tape drives, optical drives, etc. A storage device may be locally attached to an input/output (I/O) channel of a computer. For example, a hard disk drive may be connected to a computer's disk controller.
As is known in the art, a disk drive contains at least one magnetic disk which rotates relative to a read/write head and which stores data nonvolatilely. Data to be stored on a magnetic disk is generally divided into a plurality of equal length data sectors. A typical data sector, for example, may contain 512 bytes of data. A disk drive is capable of performing a write operation and a read operation. During a write operation, the disk drive receives data from a host computer along with instructions to store the data to a specific location, or set of locations, on the magnetic disk. The disk drive then moves the read/write head to that location, or set of locations, and writes the received data. During a read operation, the disk drive receives instructions from a host computer to access data stored at a specific location, or set of locations, and to transfer that data to the host computer. The disk drive then moves the read/write head to that location, or set of locations, senses the data stored there, and transfers that data to the host.
A storage device may also be accessible over a network. Examples of such a storage device include network attached storage (NAS) and storage area network (SAN) devices. A storage device may be a single stand-alone component or be comprised of a system of storage devices such as in the case of Redundant Array of Inexpensive Disks (RAID) groups.
Virtually all computer application programs rely on such storage devices which may be used to store computer code and data manipulated by the computer code. A typical computer system includes one or more host computers that execute such application programs and one or more storage systems that provide storage.
The host computers may access data by sending access requests to the one or more storage systems. Some storage systems require that the access requests identify units of data to be accessed using logical volume (“LUN”) and block addresses that define where the units of data are stored on the storage system. Such storage systems are known as “block I/O” storage systems. In some block I/O storage systems, the logical volumes presented by the storage system to the host correspond directly to physical storage devices (e.g., disk drives) on the storage system, so that the specification of a logical volume and block address specifies where the data is physically stored within the storage system. In other block I/O storage systems (referred to as intelligent storage systems), internal mapping techniques may be employed so that the logical volumes presented by the storage system do not necessarily map in a one-to-one manner to physical storage devices within the storage system. Nevertheless, the specification of a logical volume and a block address used with an intelligent storage system specifies where associated content is logically stored within the storage system, and from the perspective of devices outside of the storage system (e.g., a host) is perceived as specifying where the data is physically stored.
In contrast to block I/O storage systems, some storage systems receive and process access requests that identify a data unit or other content unit (also referred to as an object) using an object identifier, rather than an address that specifies where the data unit is physically or logically stored in the storage system. Such storage systems are referred to as object addressable storage (OAS) systems. In object addressable storage, a content unit may be identified (e.g., by host computers requesting access to the content unit) using its object identifier and the object identifier may be independent of both the physical and logical location(s) at which the content unit is stored (although it is not required to be because in some embodiments the storage system may use the object identifier to inform where a content unit is stored in a storage system). From the perspective of the host computer (or user) accessing a content unit on an OAS system, the object identifier does not control where the content unit is logically (or physically) stored. Thus, in an OAS system, if the physical or logical location at which the unit of content is stored changes, the identifier by which host computer(s) access the unit of content may remain the same. In contrast, in a block I/O storage system, if the location at which the unit of content is stored changes in a manner that impacts the logical volume and block address used to access it, any host computer accessing the unit of content must be made aware of the location change and then use the new location of the unit of content for future accesses.
One example of an OAS system is a content addressable storage (CAS) system. In a CAS system, the object identifiers that identify content units are content addresses. A content address is an identifier that is computed, at least in part, from at least a portion of the content (which can be data and/or metadata) of its corresponding unit of content. For example, a content address for a unit of content may be computed by hashing the unit of content and using the resulting hash value as the content address. Storage systems that identify content by a content address are referred to as content addressable storage (CAS) systems.
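As a minimal sketch of the hashing example above, a content address can be derived from the bytes of a content unit. The function name content_address and the choice of SHA-256 are assumptions made for illustration; the text does not name a particular hash function.

```python
import hashlib


def content_address(content: bytes) -> str:
    """Return a content address computed by hashing the content unit.

    SHA-256 is an assumed choice; the text only says "hashing".
    """
    return hashlib.sha256(content).hexdigest()


# Identical content yields the same address; different content does not.
first = content_address(b"example content unit")
second = content_address(b"example content unit")
changed = content_address(b"example content unit!")
print(first == second, first == changed)
```

Because the address depends only on the content, it stays the same wherever the content unit is physically or logically stored.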
Some OAS systems employ file systems to manage storage of objects on one or more storage devices. A file system is a logical construct that translates physical blocks of storage on a storage device into logical files and directories. In this way, the file system aids in organizing content stored on a disk. For example, an application program having ten logically related blocks of content to store on disk may store the content in a single file in the file system. Thus, the application program may simply track the name and/or location of the file, rather than tracking the block addresses of each of the ten blocks on disk that store the content.
File systems maintain metadata for each file that, inter alia, indicates the physical disk locations of the content logically stored in the file. For example, in UNIX file systems an inode is associated with each file and stores metadata about the file. The metadata includes information such as access permissions, time of last access of the file, time of last modification of the file, and which blocks on the physical storage devices store its content. The file system may also maintain a map, referred to as a free map in UNIX file systems, of all the blocks on the physical storage system at which the file system may store content. The file system tracks which blocks in the map are currently in use to store file content and which are available to store file content.
When an application program requests that the file system store content in a file, the file system may use the map to select available blocks and send a request to the physical storage devices to store the file content at the selected blocks. The file system may then store metadata (e.g., in an inode) that associates the filename for the file with the physical location of the content on the storage device(s). When the file system receives a subsequent request to access the file, the file system may access the metadata, use it to determine the blocks on the physical storage device at which the file's content is physically stored, request the content from the physical storage device(s), and return the content in response to the request.
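The store and access sequence just described can be sketched with a toy in-memory model. The class ToyFileSystem, its attribute names, and the first-fit block selection are hypothetical simplifications and are not taken from any real file system.

```python
class ToyFileSystem:
    """Toy model: a free map plus per-file block lists (a stand-in for inodes)."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.device = [b""] * num_blocks  # stands in for the physical storage device
        self.free = [True] * num_blocks   # free map: True means available
        self.inodes = {}                  # filename -> list of block numbers

    def write_file(self, name, data):
        if name in self.inodes:
            raise FileExistsError(name)   # overwriting is out of scope for this sketch
        size = self.block_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        available = [n for n, is_free in enumerate(self.free) if is_free]
        if len(chunks) > len(available):
            raise OSError("no space left on device")
        blocks = available[:len(chunks)]  # select available blocks from the map
        for n, chunk in zip(blocks, chunks):
            self.device[n] = chunk        # store content at the selected block
            self.free[n] = False          # mark the block as in use
        self.inodes[name] = blocks        # metadata: filename -> block locations

    def read_file(self, name):
        # Use the metadata to find the blocks, then gather their content.
        return b"".join(self.device[n] for n in self.inodes[name])


fs = ToyFileSystem(num_blocks=8, block_size=4)
fs.write_file("report", b"0123456789")
print(fs.inodes["report"], fs.read_file("report"))
```

The caller tracks only the file name; the block numbers stay inside the file system's metadata.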
As mentioned above, some OAS systems may store content in a file system. FIG. 1 shows an example of such an OAS system 1010 that includes an OAS interface 1030, a file system 1050, and one or more storage devices 1070. When OAS Interface 1030 receives a request (e.g., from an application program) to store a content unit, the OAS Interface may assign an object identifier to the content unit (which may be generated either by the OAS system, the entity that issued the request, or some other entity), and may issue a request to file system 1050 to store the content unit in one or more files. The file system may store the content unit on physical storage device(s) 1070, and may store metadata associating the file(s) in which the content of the content unit is stored with the physical location(s) of the content on the physical storage device(s).
When a request to access the content unit (that identifies the content unit using its object identifier) is subsequently received by OAS Interface 1030, the OAS Interface determines the file or files in file system 1050 that logically store the content of the content unit in any suitable way.
In some OAS systems, when the OAS Interface receives a request to store a content unit and stores the content unit in one or more files in the file system, the OAS Interface may store metadata that associates the object identifier for the content unit with the filename(s) and/or file system location(s) of the file. The OAS Interface may use this metadata to determine the file(s) that store the content of the content unit. In some OAS systems, when the OAS Interface, in response to a request to store a content unit, stores the content of the content unit in one or more file(s) in the file system, the OAS Interface may instruct the file system to give the one or more file(s) a file name that includes all or a portion of the object identifier for the content unit. When a subsequent access request for the content unit (that identifies the content unit using its object identifier) is received, the OAS Interface may determine the file(s) that store the content of the content unit by locating the file(s) that have the object identifier (or a portion thereof) in their filename.
Once the OAS Interface determines the file(s) in file system 1050 that store(s) the content of the content unit, the OAS Interface may send a request to the file system to access the file(s). In response, the file system may determine the physical storage location(s) of the content unit on the physical storage device(s), and request the content stored at the determined physical storage location(s) from the physical storage device. Upon receiving the requested content, the file system may return the content to the OAS Interface, which may return it to the requesting entity.
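One of the approaches described above, naming the file after the object identifier, can be sketched as follows. The class ToyOASInterface, its method names, and the use of a SHA-256 digest as the object identifier are assumptions made for illustration; an ordinary directory stands in for file system 1050.

```python
import hashlib
import os
import tempfile


class ToyOASInterface:
    """Stores each content unit in a file named after its object identifier."""

    def __init__(self, root):
        self.root = root  # directory standing in for the file system

    def store(self, content: bytes) -> str:
        # Assumed identifier scheme: a hash of the content.
        object_id = hashlib.sha256(content).hexdigest()
        with open(os.path.join(self.root, object_id), "wb") as f:
            f.write(content)
        return object_id

    def retrieve(self, object_id: str) -> bytes:
        # Locate the file by the object identifier in its name.
        with open(os.path.join(self.root, object_id), "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    oas = ToyOASInterface(root)
    oid = oas.store(b"content unit")
    print(oas.retrieve(oid))
```

With this naming scheme the interface needs no separate table mapping object identifiers to file names, at the cost of tying the file layout to the identifier format.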
The simplified block diagram of OAS system 1010 shows file system 1050 directly accessing storage device(s) 1070. However, this is provided merely as a simplified example, as file system 1050 may access the storage device(s) in any suitable way. For example, in some embodiments file system 1050 may access the storage device(s) via a device driver that provides an interface to the storage device(s) or via an operating system that interfaces with the device driver for the storage device(s).
A major requirement of storage systems is the transfer and retrieval of data without error. Thus, storage systems and storage array controllers employ error detection and recovery techniques to ensure data integrity.
One such technique is to provide a “mirror” for each storage device. In a mirror arrangement, data are written to at least two storage devices. Thus, data may be read from either of the two storage devices so long as the two devices are operational and contain the same data. That is, either of the two storage devices may process read requests so long as the two devices are in synchronization. When one of the storage devices fails, its mirror may be used to continue processing read and write requests.
RAID parity schemes may be utilized to provide error detection during the transfer and retrieval of data across a storage system.
The industry has defined several levels of RAID systems. The first level, RAID-0, combines two or more drives to create a larger virtual disk. In a dual drive RAID-0 system one disk contains the low numbered sectors or blocks and the other disk contains the high numbered sectors or blocks, forming one complete storage space. RAID-0 systems generally interleave the sectors of the virtual disk across the component drives, thereby improving the bandwidth of the combined virtual disk. Interleaving the data in that fashion is referred to as striping. RAID-0 systems provide no redundancy of data, so if a drive fails or data becomes corrupted, no recovery is possible short of backups made prior to the failure.
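Striping can be illustrated by the mapping from a virtual block number to a component drive. The function stripe_location is hypothetical, and a stripe unit of exactly one block is assumed; real arrays commonly use larger stripe units.

```python
def stripe_location(logical_block, num_disks):
    """Map a virtual-disk block to (disk index, block index on that disk).

    Assumes a stripe unit of one block, interleaved round-robin.
    """
    disk = logical_block % num_disks
    block_on_disk = logical_block // num_disks
    return disk, block_on_disk


# With two drives, consecutive virtual blocks alternate between the drives.
for block in range(4):
    print(block, stripe_location(block, 2))
```

Consecutive blocks land on different drives, which is why sequential transfers can proceed on several drives at once.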
RAID-1 systems include one or more disks that provide redundancy of the virtual disk. One disk is required to contain the data of the virtual disk, as if it were the only disk of the array. One or more additional disks contain the same data as the first disk, providing a “mirror” of the data of the virtual disk. A RAID-1 system will contain at least two disks, the virtual disk being the size of the smallest of the component disks. A disadvantage of RAID-1 systems is that a write operation must be performed for each mirror disk, reducing the bandwidth of the overall array. In a dual drive RAID-1 system, the first disk and the second disk contain the same sectors or blocks, each disk holding exactly the same data.
RAID-2 systems provide for error correction through Hamming codes. The component drives each contain a particular bit of a word, or an error correction bit of that word. RAID-2 systems automatically and transparently detect and correct single-bit defects, or single drive failures, while the array is running. Although RAID-2 systems improve the reliability of the array over other RAID types, they are less popular than some other systems due to the expense of the additional drives, and redundant onboard hardware error correction.
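A minimal sketch of mirroring follows, assuming an in-memory model in which each disk is a dictionary from block number to data. The class MirroredVolume and its failed flags are hypothetical names used only for this illustration.

```python
class MirroredVolume:
    """Toy mirror: every write goes to each working disk; reads use any working disk."""

    def __init__(self, num_mirrors=2):
        self.disks = [dict() for _ in range(num_mirrors)]
        self.failed = [False] * num_mirrors

    def write(self, block, data):
        # One write operation per mirror disk, as noted in the text.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                disk[block] = data

    def read(self, block):
        # Serve the read from the first disk that has not failed.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                return disk[block]
        raise IOError("all mirrors have failed")


volume = MirroredVolume()
volume.write(0, b"payload")
volume.failed[0] = True  # simulate failure of the first disk
print(volume.read(0))    # still served by the surviving mirror
```

The sketch omits resynchronizing a replaced disk, which a real array would have to do before that disk could serve reads again.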
RAID-4 systems are similar to RAID-0 systems, in that data is striped over multiple drives. For example, the storage spaces of two disks are added together in interleaved fashion, while a third disk contains the parity of the first two disks. RAID-4 systems are unique in that they include an additional disk containing parity. For each byte of data at the same position on the striped drives, parity is computed over the bytes of all the drives and stored to the parity disk. The XOR operation is used to compute parity, providing a fast and symmetric operation that can regenerate the data of a single drive, given that the data of the remaining drives remains intact. RAID-3 systems are essentially RAID-4 systems except that the entire stripe across all disks is always written together and is usually read as one; this argues for large, well formed writes and reads. RAID-4 and RAID-3 systems therefore are useful to provide virtual disks with redundancy, and additionally to provide large virtual drives, both with only one additional disk drive for the parity information. They have the disadvantage that the data throughput is limited by the throughput of the drive containing the parity information, which must be accessed for every read and write operation to the array.
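The XOR property described above can be shown with two short data blocks. The helper xor_blocks and the sample byte values are hypothetical; equal-length blocks are assumed.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


d0 = b"\x0f\xf0\xaa"  # data on the first drive
d1 = b"\x01\x02\x03"  # data on the second drive
parity = xor_blocks([d0, d1])  # contents of the parity drive

# If the second drive is lost, XOR of the survivors regenerates its data.
rebuilt_d1 = xor_blocks([d0, parity])
print(rebuilt_d1 == d1)
```

The same function serves both to compute parity and to rebuild a lost block, which is the symmetry the text refers to.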
RAID-5 systems are similar to RAID-4 systems, with the difference that the parity information is striped over all the disks with the data. For example, first, second, and third disks may each contain data and parity in interleaved fashion. Distributing the parity data generally increases the throughput of the array as compared to a RAID-4 system. RAID-5 systems may continue to operate though one of the disks has failed. RAID-6 systems are like RAID-5 systems, except that dual parity is kept to provide for normal operation despite the failure of up to two drives.
Combinations of RAID systems are also possible. For example, in a four disk RAID 1+0 system, the first and second disks are mirrored, as are the third and fourth disks. The combination of the mirrored sets forms a storage space that is twice the size of one individual drive, assuming that all four are of equal size. Many other combinations of RAID systems are possible.
In at least some cases, when a logical volume is configured so that its data is written across multiple disk drives in the striping technique, the logical volume is operating in RAID-0 mode. Alternatively, if the logical volume's parity information is stored on one disk drive and its data is striped across multiple other disk drives, the logical volume is operating in RAID-4 mode. If both data and parity information are striped across multiple disk drives, the logical volume is operating in RAID-5 mode.
In addition to RAID parity schemes, a storage array controller may utilize an error detection code to provide additional path and/or drive anomaly protection. Data path and drive anomaly protection schemes typically employ metadata that is stored on disk drives along with user data. This may require that the metadata be managed on a per input/output (I/O) basis. Further, each time user data is read from or written to media, the accompanying metadata must also be read from or written to media.
An approach known to the art for managing metadata involves interleaving metadata with the user data utilizing a 512-byte sector format. Generally, metadata is interleaved with the user data at fixed intervals, for example, a segment of user data may be followed by a sector of metadata. Typically, the size of the user data block in each interval matches the size of the cache block used to manage a storage controller's data cache. This allows the user data and the metadata for a given cache block to be picked up with a single read directed to that cache block.
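The fixed-interval layout can be illustrated with simple address arithmetic. The sketch assumes each segment holds a fixed number of user-data sectors followed by exactly one metadata sector; the function names and the segment size of eight sectors are hypothetical.

```python
def physical_sector(user_sector, sectors_per_segment):
    """On-disk sector holding the given user-data sector.

    Assumes one metadata sector follows every segment of user data.
    """
    segment, offset = divmod(user_sector, sectors_per_segment)
    return segment * (sectors_per_segment + 1) + offset


def metadata_sector(user_sector, sectors_per_segment):
    """On-disk sector holding the metadata for the given user-data sector."""
    segment = user_sector // sectors_per_segment
    return segment * (sectors_per_segment + 1) + sectors_per_segment


# With 8 user sectors per segment, user sector 8 starts the second segment.
print(physical_sector(8, 8), metadata_sector(8, 8))
```

Because a segment and its metadata sector are contiguous, one read covering sectors_per_segment + 1 sectors returns both.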
In general, as a device for improving performance, a cache can copy the data from lower-speed storage devices (e.g., disks) to higher-speed storage devices (e.g., fast memories) to perform writing or reading commands so as to speed up the responses of systems.
Caching is basically implemented by using a higher-speed storage device that retains a copy of data from a lower-speed storage device, so that reads or writes directed to the lower-speed storage device can be performed first on the higher-speed storage device, thus speeding up the responses of systems.
For example, a random access memory (RAM), which constitutes the main memory of a computer system, runs much faster than a disk, so part of the RAM can be used as a cache of the disk. While reading the data of the disk, a copy of the read data will be stored in the cache. If the system repeats requests to read or write the same data or sectors that are already stored in the cache, the system can directly execute the reading or writing actions on the cache memory instead. This method can improve the accessing speed of the system.
In a storage system, for example, a cache may also be used with the storage system's file systems to improve the overall performance of the system.
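The read path described above can be sketched as follows. The class CachedDisk is hypothetical, a dictionary stands in for the disk, and the cache is unbounded, so eviction is not modeled.

```python
class CachedDisk:
    """Toy read cache: the first read of a sector goes to the disk, repeats do not."""

    def __init__(self, disk):
        self.disk = disk     # dict of sector number -> data, stands in for the disk
        self.cache = {}      # copies of sectors already read, held in RAM
        self.disk_reads = 0  # counts accesses to the slower device

    def read(self, sector):
        if sector not in self.cache:
            self.disk_reads += 1
            self.cache[sector] = self.disk[sector]  # keep a copy of the read data
        return self.cache[sector]


disk = CachedDisk({0: b"boot", 1: b"data"})
disk.read(1)
disk.read(1)            # served from the cache
print(disk.disk_reads)  # only one access reached the disk
```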
In general, the term “file system” refers to the system designed to provide computer application programs with access to data stored on storage devices in a logical, coherent way. File systems hide the details of how data is stored on storage devices from application programs. For instance, storage devices are generally block addressable, in that data is addressed with the smallest granularity of one block; multiple, contiguous blocks form an extent. The size of the particular block, typically 512 bytes in length, depends upon the actual devices involved. Application programs generally request data from file systems byte by byte. Consequently, file systems are responsible for seamlessly mapping between application program address-space and storage device address-space.
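The mapping from an application's byte range to device blocks can be sketched with integer arithmetic. The function blocks_for_byte_range is hypothetical, and the file is assumed to start at block 0 of a contiguous extent, which real file systems do not guarantee.

```python
def blocks_for_byte_range(offset, length, block_size=512):
    """Return the range of block numbers covering bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A 20-byte request starting at byte 500 straddles blocks 0 and 1.
print(list(blocks_for_byte_range(500, 20)))
```

Even a small request may require whole blocks to be transferred, which is one reason file systems buffer data between applications and devices.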
File systems store volumes of data on storage devices, i.e., collections of data blocks, each for one complete file system instance. These storage devices may be partitions of single physical devices or logical collections of several physical devices. Computers may have access to multiple file system volumes stored on one or more storage devices.
File systems maintain several different types of files, including regular files and directory files. Application programs store and retrieve data from regular files as contiguous, randomly accessible segments of bytes. With a byte-addressable address-space, applications may read and write data at any byte offset within a file. Applications can grow files by writing data to the end of a file; the size of the file increases by the amount of data written. Conversely, applications can truncate files by reducing the file size to any particular length. Applications are solely responsible for organizing data stored within regular files, since file systems are not aware of the content of each regular file.
Files are presented to application programs through directory files that form a tree-like hierarchy of files and subdirectories containing more files. Filenames are unique to directories but not to file system volumes. Application programs identify files by pathnames comprised of the filename and the names of all encompassing directories. The complete directory structure is called the file system namespace. For each file, file systems maintain attributes such as ownership information, access privileges, access times, and modification times.
File systems often utilize the services of operating system memory caches known as buffer caches and page caches. These caches generally consist of system memory buffers stored in volatile, solid-state memory of the computer. In this context, caching is a technique to speed up data requests from application programs by saving frequently accessed data in memory for quick recall by the file system without having to physically retrieve the data from the storage devices. Caching is also useful during file writes; the file system may write data to the memory cache and return control to the application before the data is actually written to non-volatile storage. Eventually, the cached data is written to the storage devices.
The state of the cache depends upon the consistency between the cache and the storage devices. A cache is “clean” when its contents are exactly the same as the data stored on the underlying storage devices. A cache is “dirty” when its data is newer than the data stored on storage devices; a cache becomes dirty when the file system has written to the cache, but the data has not yet been written to the storage devices. A cache is “stale” when its contents are older than data stored on the storage devices; a cache becomes stale when it has not been updated to reflect changes to the data stored on the storage devices.
In order to maintain consistency between the caches and the storage devices, file systems perform “flush” and “invalidate” operations on cached data. A flush operation writes dirty cached data to the storage devices before returning control to the caller. An invalidation operation removes stale data from the cache without invoking calls to the storage devices. File systems may flush or invalidate caches for specific byte-ranges of the cached files.
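The dirty and stale states and the flush and invalidate operations can be sketched together. The class WriteBackCache is hypothetical, a dictionary stands in for the storage device, and operations act on whole blocks rather than byte ranges.

```python
class WriteBackCache:
    """Toy write-back cache with flush and invalidate operations."""

    def __init__(self, storage):
        self.storage = storage  # dict of block -> data, stands in for the device
        self.data = {}          # cached blocks
        self.dirty = set()      # blocks newer in the cache than on storage

    def write(self, block, value):
        self.data[block] = value
        self.dirty.add(block)   # the cache is now dirty for this block

    def read(self, block):
        if block not in self.data:
            self.data[block] = self.storage[block]
        return self.data[block]

    def flush(self):
        # Write dirty cached data to storage before returning to the caller.
        for block in sorted(self.dirty):
            self.storage[block] = self.data[block]
        self.dirty.clear()

    def invalidate(self, block):
        # Drop the cached copy without touching storage. A real file system
        # would flush or otherwise protect dirty data before discarding it.
        self.data.pop(block, None)
        self.dirty.discard(block)


storage = {0: b"old"}
cache = WriteBackCache(storage)
cache.write(0, b"new")  # cache is dirty, storage still holds b"old"
cache.flush()           # storage now holds b"new", cache is clean
storage[0] = b"newer"   # storage changed behind the cache: cache is stale
cache.invalidate(0)
print(cache.read(0))    # re-read from storage
```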
Many file systems utilize the data structures mentioned above, called inodes, to store information specific to each file. Copies of these data structures are maintained in memory and within the storage devices. Inodes contain attribute information such as file type, ownership information, access permissions, access times, modification times, and file size. Inodes also contain lists of pointers that address data blocks. These pointers may address single data blocks or address an extent of several consecutive blocks. The addressed data blocks contain either actual data stored by the application programs or lists of pointers to other data blocks. With the information specified by these pointers, the contents of a file can be read or written by application programs. When application programs write to files, data blocks may be allocated by the file system. Such allocation modifies the inodes.
Additionally, file systems maintain information, called “allocation tables”, that indicate which data blocks are assigned to files and which are available for allocation to files. File systems modify these allocation tables during file allocation and de-allocation. Most modern file systems store allocation tables within the file system volume as bitmap fields. File systems set bits to signify blocks that are presently allocated to files and clear bits to signify blocks available for future allocation.
The terms real-data and metadata classify application program data and file system structure data, respectively. In other words, real-data is data that application programs store in regular files. Conversely, file systems create metadata to store volume layout information, such as inodes, pointer blocks (called indirect blocks), and allocation tables (called bitmaps). Metadata may not be directly visible to applications. Metadata requires a fraction of the amount of storage space that real-data occupies and has significant locality of reference. As a result, metadata caching drastically influences file system performance.
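A bitmap allocation table of the kind described can be sketched with one bit per block. The class AllocationBitmap and its first-fit search are hypothetical simplifications.

```python
class AllocationBitmap:
    """One bit per data block: set means allocated, clear means available."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def is_allocated(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        # First-fit search for a clear bit; set it and return the block number.
        for block in range(self.num_blocks):
            if not self.is_allocated(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no free blocks")

    def free(self, block):
        # Clear the bit so the block is available for future allocation.
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


bitmap = AllocationBitmap(10)
first = bitmap.allocate()
second = bitmap.allocate()
bitmap.free(first)
print(first, second, bitmap.is_allocated(first), bitmap.is_allocated(second))
```

A bitmap costs one bit per block, so the table for a large volume remains small enough to cache, consistent with the locality noted for metadata.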
Metadata consistency is vital to file system integrity. Corruption of metadata may result in the complete destruction of the file system volume. Corruption of real-data may have bad consequences to users but will not affect the integrity of the whole file system.
A file may have other descriptive and referential information, i.e., other file metadata, associated with it. This information may be relative to the source, content, generation date and place, ownership or copyright notice, central storage location, conditions to use, related documentation, applications associated with the file or services.
Today there are different approaches for implementing the association of a file with metadata of that file. Basically, metadata of a file can be encoded onto the same filename of the file, they can be prepended or appended onto the file as part of a file wrapper structure, they can be embedded at a well-defined convenient point elsewhere within the file, or they can be created as an entirely separate file.
I/O interfaces transport data among the computers and the storage devices. Traditionally, interfaces fall into two categories: channels and networks. Computers generally communicate with storage devices via channel interfaces. Channels predictably transfer data with low-latency and high-bandwidth performance; however, channels typically span short distances and provide low connectivity. Performance requirements often dictate that hardware mechanisms control channel operations. The Small Computer System Interface (SCSI) is a common channel interface. Storage devices that are connected directly to computers are known as direct-attached storage (DAS) devices.
Computers communicate with other computers through networks. Networks are interfaces with more flexibility than channels. Software mechanisms control substantial network operations, providing networks with flexibility but large latencies and low bandwidth performance. Local area networks (LAN) connect computers over medium distances, such as within buildings, whereas wide area networks (WAN) span long distances, like across campuses or even across the world. LANs normally consist of shared media networks, like Ethernet, while WANs are often point-to-point connections, like Asynchronous Transfer Mode (ATM). Transmission Control Protocol/Internet Protocol (TCP/IP) is a popular network protocol for both LANs and WANs. Because LANs and WANs utilize very similar protocols, for the purpose of this application, the term LAN is used to include both LAN and WAN interfaces.
Recent interface trends combine channel and network technologies into single interfaces capable of supporting multiple protocols. For instance, Fibre Channel (FC) is a serial interface that supports network protocols like TCP/IP as well as channel protocols such as SCSI-3. Other technologies, such as iSCSI, map the SCSI storage protocol onto TCP/IP network protocols, thus utilizing LAN infrastructures for storage transfers.
In at least some cases, SAN refers to network interfaces that support storage protocols. Storage devices connected to SANs are referred to as SAN-attached storage devices. These storage devices are block and object-addressable and may be dedicated devices or general purpose computers serving block and object-level data.
Block and object-addressable devices connect to SANs and share storage among multiple computers. As noted herein, block-address devices are common storage devices that are addressable by fixed length data blocks or sectors; in contrast, object-addressable devices are impending devices that are addressable by an object identifier and an offset into the object. Each object-addressable device may support numerous objects.
SANs are often dedicated networks specifically designed to transport block data; however, SANs may also operate as subsets of general purpose LANs and share the same physical network connections. Therefore, the type of data moving on the network dictates whether a network is a SAN or a LAN.
Local file systems service file-level requests for application programs only running on the same computer that maintains the non-shared file system volume. To achieve the highest levels of performance, local file systems extensively cache metadata and real-data within operating system buffer caches and page caches. Because local file systems do not share data among multiple computer systems, performance is generally very good.
Local file systems traditionally store volumes on DAS devices connected directly to the computer. A weakness of using DAS is that should the computer fail, volumes located on the DAS devices become inaccessible. To reclaim access to these volumes, the DAS devices must be physically detached from the original computer and connected to a backup computer.
SAN technologies enable local file system volumes to be stored on SAN-attached devices. These volumes are accessible to several computers; however, at any point in time, each volume is only assigned to one computer. Storing local file system volumes on SAN-attached devices rather than DAS devices has the benefit that the volumes may be easily reassigned to other computers in the event of failures or maintenance.
Distributed file systems provide users and application programs with transparent access to files from multiple computers networked together. Distributed file systems may lack the high-performance found in local file systems due to resource sharing and lack of data locality. However, the sharing capabilities of distributed file systems may compensate for poor performance.
Architectures for distributed file systems fall into two main categories: NAS-based and SAN-based. NAS-based file sharing places server computers between storage devices and client computers connected via LANs. In contrast, SAN-based file sharing, traditionally known as “shared disk” or “shared storage”, uses SANs to directly transfer data between storage devices and networked computers.
NAS-based distributed file systems transfer data between server computers and client computers across LAN connections. The server computers store volumes in units of blocks on DAS devices and present this data to client computers in a file-level format. These NAS servers communicate with NAS clients via NAS protocols. Both read and write data-paths traverse from the clients, across the LAN, to the NAS servers. In turn, the servers read from and write to the DAS devices. NAS servers may be dedicated appliances or general-purpose computers.
NFS is a common NAS protocol that uses central servers and DAS devices to store real-data and metadata for the file system volume. These central servers locally maintain metadata and transport only real-data to clients. The central server design is simple yet efficient, since all metadata remains local to the server. Like local file systems, central servers only need to manage metadata consistency between main memory and DAS devices. In fact, central server distributed file systems often use local file systems to manage and store data for the file system. In this regard, the only job of the central server file system is to transport real-data between clients and servers.
Central server designs were early NAS-based distributed file systems. As the need for greater parallelism and enhanced availability grew, distributed file system designs evolved from central servers to multiple server configurations. As with central servers, multiple servers, also known as distributed servers, store all file system data on DAS devices connected to server computers. Since multiple servers cooperatively manage the file system, servers may share metadata between computers. The complexity of these designs increases by an order of magnitude, since distributed system integrity requires strong metadata consistency between servers. Such systems often cannot use local file systems to store data. As a result, server software must manage, store, and transport metadata and real-data between servers.
Distributed server file systems have further evolved into designs where clients and servers are often difficult to distinguish. In these systems, clients manage, store, and transport metadata and real-data between servers and other clients.
One aspect of NAS-based file system designs that has remained unchanged among central server, distributed server, and merged client-server designs is the direct attachment of storage devices to computers. With devices directly attached to computers, however, a single computer failure renders data stored on the storage devices inaccessible. Although redundant devices on separate computers can be added to improve availability, such techniques add complexity and cost to the system.
Furthermore, the NAS architecture limits performance when clients access data stored on remote devices, because the data-path between client and storage device includes server computers. These servers add overheads caused by server workloads as well as overheads relating to the translations from channel interface protocols to network interface protocols. Server computers designed to support large workloads are very expensive.
At least some distributed file system designs that use SAN technologies have followed a different evolutionary path. Instead of storing data on storage devices connected directly to computers, SAN-based designs store data on SAN-attached devices shared among several client computers. SAN-based designs have high-bandwidth, low-latency data-paths between clients and devices.
SAN-based file systems require arbitration for the storage devices and consistency management of any data cached on the clients. Consistency mechanisms are either centrally located or distributed within the system. The consistency mechanisms may include software running on computers, hardware mechanisms attached to the networks, or a combination of both hardware and software.
There are at least two distinct SAN-based file system designs. The first design uses private file managers, in which client computers independently access metadata and real-data directly from the storage devices. Private file manager schemes do not require dedicated servers, since all necessary data is taken directly from the SAN-attached devices. With private file manager designs, clients only service local file requests.
As a result of their designs, clients utilizing private file managers remain independent from the failures and bottlenecks of other clients. Similarly, client resources such as memory, CPUs, and bus bandwidth are not spent servicing requests from other clients.
The second type of SAN-based distributed file system design utilizes file manager server computers. These file servers manage file system namespace and metadata. SAN clients make requests to the SAN servers, and the servers determine the location of real-data on SAN devices by examining and modifying file metadata. Once the location is determined, the servers either initiate transfers between clients and storage devices or inform the clients how to invoke the transfers. Servers must maintain and store metadata, manage real-data, and control transfers between clients and storage devices. The server design is complex, since servers need to provide a great deal of functionality (e.g., file locking, which is used to maintain data consistency between hosts). Servers that fail or become overworked tend to disrupt file system operation. The Celerra HighRoad multiplex file system (MPFS) from EMC Corporation is an example of a SAN-based file system that utilizes SAN server file managers to facilitate file transfers between SAN devices and SAN clients. A detailed description of the Celerra system and HighRoad software is given in U.S. Pat. No. 6,324,581 issued Nov. 27, 2001 and U.S. Pat. No. 7,325,097 issued Jan. 29, 2008, both assigned to EMC, the assignee of the present invention, and hereby incorporated by reference herein.
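By way of illustration only, the division of labor in a file manager server design may be sketched as follows. The sketch assumes the variant in which the server informs the client where the real-data resides and the client then performs the transfer itself. All class and method names (SanDevice, FileManagerServer, SanClient, allocate, locate) are hypothetical and do not describe the Celerra HighRoad software or any other product.

```python
class SanDevice:
    """Hypothetical SAN-attached device: an array of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read_block(self, n):
        return self.blocks[n]

    def write_block(self, n, data):
        # Pad short chunks with zero bytes so every block has the same size.
        self.blocks[n] = data.ljust(self.block_size, b"\0")[: self.block_size]


class FileManagerServer:
    """Manages namespace and metadata only; real-data never passes through it."""

    def __init__(self, num_blocks):
        self.block_map = {}                  # file name -> list of block numbers
        self.free = list(range(num_blocks))  # unallocated block numbers

    def allocate(self, name, nblocks):
        # Modifying metadata: assign free blocks to the named file.
        blocks = [self.free.pop(0) for _ in range(nblocks)]
        self.block_map.setdefault(name, []).extend(blocks)
        return blocks

    def locate(self, name):
        # Examining metadata: report where the file's real-data resides.
        return list(self.block_map[name])


class SanClient:
    """Asks the server for locations, then moves real-data over the SAN itself."""

    def __init__(self, server, device):
        self.server = server
        self.device = device

    def write_file(self, name, data):
        bs = self.device.block_size
        chunks = [data[i:i + bs] for i in range(0, len(data), bs)]
        blocks = self.server.allocate(name, len(chunks))
        for block, chunk in zip(blocks, chunks):
            self.device.write_block(block, chunk)

    def read_file(self, name):
        blocks = self.server.locate(name)
        return b"".join(self.device.read_block(b) for b in blocks)
```

In this sketch a file written by one client can be read by a second client that shares the same server and device, because only the block map is held by the server. File length tracking and locking are omitted, so reads return whole, zero-padded blocks.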
Local file systems may be used in SAN file sharing environments under various restrictions. For instance, most local file system volumes may be mounted by multiple SAN clients as long as all clients mount the volume in read-only mode. Since the volume does not change, caching performed by the clients does not affect the state of the SAN environment. When files of the volume need to be modified, however, all clients must unmount the volume and then one client re-mounts the volume in read-write mode. This client makes the appropriate modifications and then unmounts the volume. Finally, all clients re-mount the volume in read-only mode. This scheme promotes high-speed file sharing.
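The unmount and re-mount sequence described above may be sketched as follows. This is a minimal model under stated assumptions: SharedVolume and modify_volume are hypothetical names, and the mount rules enforced here (a single read-write mount only when no other mounts exist) are an assumed reading of the restriction described above, not the behavior of any particular local file system.

```python
class SharedVolume:
    """Hypothetical model of a local file system volume on a SAN-attached device."""

    def __init__(self):
        self.mounts = {}  # client name -> "ro" or "rw"
        self.files = {}   # file name -> contents

    def mount(self, client, mode):
        if mode not in ("ro", "rw"):
            raise ValueError(mode)
        if mode == "rw" and self.mounts:
            raise RuntimeError("read-write mount requires no other mounts")
        if mode == "ro" and "rw" in self.mounts.values():
            raise RuntimeError("volume is mounted read-write elsewhere")
        self.mounts[client] = mode

    def unmount(self, client):
        self.mounts.pop(client, None)

    def write(self, client, name, data):
        if self.mounts.get(client) != "rw":
            raise PermissionError("volume is not mounted read-write by " + client)
        self.files[name] = data


def modify_volume(volume, clients, writer, name, data):
    """Apply one modification using the unmount / re-mount sequence."""
    for c in clients:           # 1. all clients unmount the volume
        volume.unmount(c)
    volume.mount(writer, "rw")  # 2. one client re-mounts read-write
    volume.write(writer, name, data)
    volume.unmount(writer)      # 3. that client unmounts again
    for c in clients:           # 4. all clients re-mount read-only
        volume.mount(c, "ro")
```

The sketch makes the cost of the scheme visible: every modification interrupts every client, so the arrangement suits volumes that change rarely.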
Some local file systems are specifically designed to support SAN file sharing environments where one SAN client mounts the volume in read-write mode and all other SAN clients mount the volume read-only. These SAN-based local file systems must frequently flush dirty caches on the read-write client and regularly invalidate stale caches on the read-only clients.
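One possible way to picture the flushing and invalidation described above is sketched below. The mechanism shown, a generation counter kept on the shared device, is purely an assumption made for illustration; the file systems referred to above may detect stale caches in entirely different ways. All names are hypothetical.

```python
class SharedDevice:
    """Hypothetical SAN-attached device with a change counter."""

    def __init__(self):
        self.blocks = {}     # block number -> data
        self.generation = 0  # incremented on every flush (assumed mechanism)


class ReadWriteClient:
    def __init__(self, device):
        self.device = device
        self.dirty = {}  # write-back cache of modified blocks

    def write(self, block, data):
        self.dirty[block] = data  # buffered; not yet visible to other clients

    def flush(self):
        if not self.dirty:
            return
        self.device.blocks.update(self.dirty)
        self.dirty.clear()
        self.device.generation += 1


class ReadOnlyClient:
    def __init__(self, device):
        self.device = device
        self.cache = {}
        self.seen_generation = device.generation

    def invalidate_if_stale(self):
        # Drop every cached block if the device changed since the last check.
        if self.seen_generation != self.device.generation:
            self.cache.clear()
            self.seen_generation = self.device.generation

    def read(self, block):
        if block not in self.cache:
            self.cache[block] = self.device.blocks.get(block)
        return self.cache[block]
```

In this sketch a read-only client continues to return a cached block until invalidate_if_stale is called after the read-write client has flushed, which is why both the flush and the invalidation must occur frequently.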
A SAN-based file sharing environment may be configured to serve a large number of NAS client computers using NAS file sharing protocols. SAN clients act as NAS servers that serve volumes stored on the SAN-attached devices to a large number of NAS clients connected to the NAS servers through LANs. Such systems, also known as clusters, combine SAN and NAS technologies into a two-tiered scheme. In effect, a NAS cluster can be viewed as a single, large NAS server.
SAN appliances are prior art systems that consist of a variety of components including storage devices, file servers, and network connections. SAN appliances provide block-level, and possibly file-level, access to data stored and managed by the appliance. Despite the ability to serve both block-level and file-level data, SAN appliances may not possess the needed management mechanisms to actually share data between the SAN and NAS connections. The storage devices are usually partitioned so that a portion of the available storage is available to the SAN and a different portion is available for NAS file sharing. Therefore, for the purpose of this application, SAN appliances are treated as the subsystems they represent.
Another adaptation of a SAN appliance is simply a general purpose computer with DAS devices. This computer converts the DAS protocols into SAN protocols in order to serve block-level data to the SAN. The computer may also act as a NAS server and serve file-level data to the LAN.
File system designers can construct complete file systems by layering, or stacking, partial designs on top of existing file systems. The new designs reuse existing services by inheriting functionality of the lower level file system software. For instance, NFS is a central-server architecture that utilizes existing local file systems to store and retrieve data from storage devices attached directly to servers. By layering NFS on top of local file systems, NFS software is free from the complexities of namespace, file attribute, and storage management. NFS software consists of simple caching and transport functions. As a result, NFS benefits from performance and recovery improvements made to local file systems.
Other examples of file system layering include adding quota support to an existing file system, strengthening consistency of cached data in an existing distributed file system, and adding compression or encryption to file systems without such support.
Most modern operating systems include installable file system interfaces to support multiple file system types within a single computer. In UNIX, the Virtual File System (VFS) interface is an object-oriented, installable interface. While several UNIX implementations incorporate VFS, the interfaces differ slightly between platforms. Several non-UNIX operating systems, such as Microsoft Windows NT, have interfaces similar to VFS.
VFS occupies the level between the system call interface and installed file systems. Each installed file system provides the UNIX kernel with functions associated with VFS and vnode operations. VFS functions operate on whole file systems to perform tasks such as mounting, unmounting, and reading file system statistics. Vnode operations manipulate individual files. Vnode operations include opening, closing, looking up, creating, removing, reading, writing, and renaming files.
Vnode structures are the objects upon which vnode functions operate. The VFS interface creates and passes vnodes to file system vnode functions. A vnode is the VFS virtual equivalent of an inode. Each vnode maintains a pointer called “v_data” to attached file system specific, in-core memory structures such as inodes.
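The relationship between a vnode, its file system specific data, and the vnode operations may be sketched as follows. Only the v_data pointer is taken from the description above; the v_op field, the ToyFsVnodeOps operations, and the vn_read and vn_write helpers are hypothetical names chosen for illustration and are not the interface of any particular UNIX kernel.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Inode:
    """Hypothetical file system specific, in-core structure."""
    number: int
    data: bytes = b""


class ToyFsVnodeOps:
    """Vnode operations supplied by a hypothetical installed file system."""

    @staticmethod
    def read(vnode):
        # The file system reaches its own inode through v_data.
        return vnode.v_data.data

    @staticmethod
    def write(vnode, data):
        vnode.v_data.data = data


@dataclass
class Vnode:
    v_op: Any    # operations table of the owning file system (assumed field name)
    v_data: Any  # pointer to the file system specific structure, e.g. an inode


def vn_read(vnode):
    """Generic entry point: dispatch to whichever file system owns the vnode."""
    return vnode.v_op.read(vnode)


def vn_write(vnode, data):
    vnode.v_op.write(vnode, data)
```

For example, after inode = Inode(7) and vn = Vnode(ToyFsVnodeOps, inode), a call to vn_write(vn, b"hi") updates inode.data without the caller knowing which file system type is installed.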
Many file system interfaces support layering. With layering, file systems are capable of making calls to other file systems through the virtual file system interface. For instance, NFS server software may be implemented to access local file systems through VFS. In this manner, the server software does not need to be specifically coded for any particular local file system type; new local file systems may be added to an operating system without reconfiguring NFS.
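The layering described above may be sketched as follows. The FileSystem interface stands in, in a highly simplified and hypothetical form, for a virtual file system interface; DictFs and LogFs are two invented local file system types; and FileServer is a stand-in for server software layered on top of them. None of these names correspond to actual NFS or VFS code.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Simplified stand-in for a virtual file system interface."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class DictFs(FileSystem):
    """Hypothetical local file system that keeps files in a dictionary."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = data


class LogFs(FileSystem):
    """Hypothetical local file system that appends every write to a log."""

    def __init__(self):
        self._log = []

    def read(self, path):
        # The most recent log entry for the path holds the current contents.
        for p, d in reversed(self._log):
            if p == path:
                return d
        raise KeyError(path)

    def write(self, path, data):
        self._log.append((path, data))


class FileServer:
    """Server layer coded only against the interface, not a file system type."""

    def __init__(self, lower: FileSystem):
        self.lower = lower

    def handle(self, op, path, data=None):
        if op == "READ":
            return self.lower.read(path)
        if op == "WRITE":
            self.lower.write(path, data)
            return b""
        raise ValueError(op)
```

Because FileServer calls only the two interface methods, either local file system type can be placed beneath it, and a further type could be added later, without changing the server code.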
Because flash memory is non-volatile, vibration-free, small in size, and low in power consumption, it is an excellent component for use in various flash storage devices. Flash storage devices are widely used as memory storage for computer and consumer system products such as notebook and desktop computers, set-top boxes, digital cameras, mobile phones, PDAs, and GPS devices. The increasing demand for more storage in these products has driven the need to expand the capacity of the flash storage devices.
There are two types of flash storage devices. The first type has a pre-defined mechanical dimension. This type includes: (a) Secure Digital (SD) card, (b) Multi Media Card (MMC), (c) Memory Stick (MS) card, (d) Compact Flash (CF) card, (e) Express Flash card, (f) Serial ATA Flash disk, (g) IDE Flash disk, (h) SCSI Flash disk, etc.
The second type of flash storage device has no pre-defined physical dimension and includes USB flash disks, Disk On Module (DOM) devices, MP3 players, etc. However, based upon the need for system compactness, it is generally desirable to make this type of flash storage device as small in size and as high in capacity as possible.
Space constraints and available flash memory density are the major obstacles in expanding the capacity of the flash storage devices. A secure digital (SD) card is defined with a fixed form factor. This fixed dimension restricts the number of components populated on a printed circuit board (PCB). For instance, if a thin, small out-line package (TSOP) type of flash memory is used, only a flash memory chip and a flash controller can be placed within the available space. The available flash memory density further limits the overall SD card capacity.
A flash memory die is the basic element of flash memory. A typical flash memory chip comprises a flash memory die mounted on a substrate within an enclosure, with the electrical signals bonded out to the metal contacts of the package. Popular package types for flash memory chips are TSOP, WSOP (Very Very Thin Small Out-line Package), BGA (Ball Grid Array), etc.
Advances in semiconductor technology have led to an increase in the use of a semiconductor solid state drive (also known as a solid state disk or SSD), which uses a flash memory as a storage device, in areas such as computer systems. Thus, in at least some cases there seems to be a trend towards the use of an SSD as a storage device instead of a magnetic disk. In spite of having features such as, for example, a relatively small storage capacity and a relatively high price, the SSD has some other features that can make it more attractive as a storage device than the conventional magnetic disk in at least some cases.
Features that can make SSDs preferable as storage devices are, for example, a fast access rate, high throughput, a high integration density, and stability against an external impact. SSDs can move much larger amounts of data and process far more I/O requests, per time period, than conventional magnetic disks. This allows users to complete data transactions much more quickly.
Furthermore, advances in manufacturing technologies for SSDs may reduce the production costs of SSDs and also increase the storage capacities of SSDs. These developments may provide further incentive to use SSDs in place of magnetic disks in at least some cases.
Solid state disk systems may also comprise communication controllers, such as Fibre Channel (FC) controllers, Ethernet mechanisms, ATA or serial ATA interfaces, or SCSI controllers for managing data communication with external computing devices.