The present invention pertains to a method and apparatus for storing data in a mass storage system and, in particular, to a method and apparatus for storing data in a mass storage system implementing RAID technology by data type and topological categorization and ordering.
A continuing problem in computer systems is in providing secure, fault tolerant resources, such as communications and data storage resources, such that communications between the computer system and clients or users of the computer system are maintained in the event of failure and such that data is not lost and can be recovered or reconstructed without loss in the event of a failure. This problem is particularly severe in networked systems wherein a shared resource, such as a system data storage facility, is typically comprised of one or more system resources, such as file servers, shared among a number of clients and accessed through the system network. A failure in a shared resource, such as in the data storage functions of a file server or in communications between clients of the file server and the client file systems supported by the file server, can result in failure of the entire system. This problem is particularly severe in that the volume of data and communications and the number of data transactions supported by a shared resource such as a file server are significantly greater than within a single client system, resulting in significantly increased complexity in the resource, in the data transactions and in the client/server communications. This increased complexity results in increased probability of failure and increased difficulty in recovering from failures. In addition, the problem is multidimensional in that a failure may occur in any of a number of resource components or related functions, such as in a disk drive, in a control processor, or in the network communications.
Considering networked file server systems as a typical example of a shared system resource of the prior art, the file server systems of the prior art have adopted a number of methods for achieving fault tolerance in client/server communications and in the file transaction functions of the file server, and for data recovery or reconstruction. These methods are typically based upon redundancy, that is, the provision of duplicate system elements and the replacement of a failed element with a duplicate element or the creation of duplicate copies of information to be used in reconstructing lost information. For example, many systems of the prior art employ multiple, duplicate parallel communications paths or multiple, duplicate parallel processing units, with appropriate switching to switch communications or file transactions from a failed communications path or file processor to an equivalent, parallel path or processor, to enhance the reliability and availability of client/file server communications and client/client file system communications. Yet other methods of the prior art utilize information redundancy to allow the recovery and reconstruction of transactions lost due to failures occurring during execution of the transactions. These methods include caching, transaction logging and mirroring. Caching is the temporary storage of data in memory in the data flow path to and from the stable storage until the data transaction is committed to stable storage by transfer of the data into stable storage, that is, a disk drive, or until the data is read from stable storage and transferred to a recipient. Transaction logging, or journaling, temporarily stores information describing a data transaction, that is, the requested file server operation, until the data transaction is committed to stable storage, that is, completed in the file server, and allows lost data transactions to be reconstructed or re-executed from the stored information. 
Mirroring, in turn, is often used in conjunction with caching or transaction logging and is essentially the storing of a copy of the contents of a cache or transaction log in, for example, the memory or stable storage space of a separate processor as the cache or transaction log entries are generated in the file processor.
The use of multiple, duplicate parallel communications paths or multiple, duplicate parallel processing units, caching, transaction logging and mirroring, however, is often unsatisfactory because these methods are costly in system resources, require complex administrative and synchronization operations and mechanisms to manage the caching, transaction logging and mirroring functions and subsequent transaction recovery operations, and significantly increase the file server latency, that is, the time required to complete a file transaction.
One of the most frequently used methods of the prior art for the preservation and recovery of data and file transactions is RAID technology, which is a family of industry standard methods for distributing redundant data and error correction information across a redundant array of disk drives that essentially operates as a single, very large mass storage device, which is often implemented as a networked file server. RAID technology allows a failed disk drive to be replaced by a redundant drive and allows the data in the failed disk to be reconstructed from the redundant data and error correction information.
The increased power and speed of contemporary networked computer systems, however, have resulted in a corresponding demand for significantly increased mass storage capability because of the increased volumes of data dealt with by the systems and the increased size of the operating system and applications programs executed by such systems. Most mass storage devices, however, are characterized by relatively low data access and transfer rates compared to the computer systems which operate with the data and programs stored therein. As a consequence, and although the mass storage capabilities of host computer systems have been increased significantly, the speed of data read and write access has not increased proportionally. While there have been many attempts in the prior art to solve the problem of data access speed for mass storage systems, they have typically taken the form of increasing the number of disk drives, for example, to store related data items and their associated parity information across several drives in parallel, thereby overlapping the initial data access time to each drive and increasing the efficiency of bus transfers. An extreme manifestation of this approach was found, for example, in the Thinking Machines Corporation CM-2 system, which operated with 39 bit words, each containing 32 data bits and 7 parity bits, and stored the bits of each word in parallel across 39 disk drives, one bit to each drive.
A more typical method for increasing the speed of data read and write access is "striping", wherein data and parity information are spread over several disk drives in a pattern referred to as a "stripe" and wherein a "stripe" is the amount of information for which a given RAID system generates and stores parity. Because the parity information for a stripe is generated for and from all of the data in a stripe, a stripe is effectively the smallest unit of data storage in a RAID striped system, that is, a stripe is always written as an entity. A RAID 5 system, for example, uses five disk drives and a stripe is comprised of four blocks of information, with one block being stored on each of four of the disk drives and with a fifth block containing parity information for the four information blocks being stored in the fifth disk drive. Striping is customarily employed to increase the speed with which information may be written to or read from the disk drives of a mass storage system, as the information is distributed across the disk drives so that reads and writes of segments of information from and to the disk drives can be overlapped. Striping also facilitates the reconstruction of information in the event of a disk drive failure when used with parity information or an error detection and correction code. That is, the storing of information across a plurality of disk drives so that a single disk drive contains only a relatively small part of any body of information limits the damage to a given body of information in the event of a failure or error, and allows the damaged information to be more easily recovered or reconstructed from the surviving information.
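The relationship between the four information blocks and the parity block of such a stripe may be illustrated by the following minimal sketch. It assumes byte-wise exclusive-OR parity, which the foregoing description does not itself specify, and the function names xor_blocks, make_stripe and reconstruct are hypothetical names chosen for illustration only.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def make_stripe(data_blocks):
    """Build a stripe: the data blocks followed by one parity block.

    XOR parity is an assumption made for this sketch.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks.

    With XOR parity, any single missing block (data or parity)
    equals the XOR of all the other blocks in the stripe.
    """
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Four data blocks, one per drive, plus parity on a fifth drive.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
rebuilt = reconstruct(stripe, 2)  # simulate loss of the third drive
```

Because the parity block depends on every information block of the stripe, the sketch also shows why a stripe is written as an entity: changing any one block invalidates the stored parity unless the parity is updated with it.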
A limiting factor in the various methods for enhancing the speed of information read and write access, however, is the need to store not only parity information or error correcting codes but also several different types of data with very different storage characteristics and very different access requirements. That is, data and parity information are usually stored in units of fixed but possibly different sizes, which will typically depend upon the type of data, and the amount of data in a given file, as well as the amount of data to be read or written in a given read or write operation, will typically vary substantially. Storage space in the disk drives, however, is typically allocated in units of fixed size, which may be optimum for only a single type of data, and the storage space is formatted according to the selected RAID method implemented in the system. As a result, there are often significant differences between the optimum storage formats of various forms of information and the storage topology of the disks. As a result, the amount and location of the data in a write operation, for example, will rarely coincide with the format in which the data is stored on the disks and the reading or writing of a given type of information will often result in inefficient disk read/write operations, such as increased disk traverse and search times and frequent and time consuming read-modify-write operations, thereby reducing the information transfer rates. This problem is further compounded in that the systems of the prior art typically distinguish only between data and parity information when writing information to the disks and not between types of data and are optimized to maximize the use of storage space by avoiding or eliminating unused blocks of storage space. As a result, logically contiguous blocks of a given type of data are often physically stored on the disks as smaller, non-contiguous blocks separated by blocks of other types of data. 
This optimizes the use of physical storage space, but increases the disk traverse and search time required for a read or write operation, thereby further reducing the data transfer rate. These problems are compounded still further because the read/write access requirements for parity information and data, and for different types of data, vary significantly. For example, parity information is typically written or read, modified and rewritten upon each data write to disk and thus has high write access requirements, but has low read access requirements because the parity information is rarely read except to reconstruct data from a failed disk drive. As a result, information having widely varying read and write access requirements is typically intermixed on the disks, so that rarely accessed information must often be traversed and searched in order to access frequently accessed data, thereby still further reducing the data transfer rate.
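The cost of the read-modify-write operations referred to above may be illustrated by the following minimal sketch of a partial-stripe write. It assumes exclusive-OR parity, which is an assumption of the sketch rather than a statement of the foregoing description, and the names xor_bytes, full_parity and small_write_parity are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_blocks):
    """Parity computed from every data block of a stripe."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = xor_bytes(parity, block)
    return parity


def small_write_parity(old_data, new_data, old_parity):
    """Parity after overwriting a single block of the stripe.

    The old data and old parity must first be read from disk, so a
    one-block update costs two reads and two writes.
    """
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = full_parity(blocks)

# Overwrite only the second block of the stripe.
new_block = b"bbbb"
updated_parity = small_write_parity(blocks[1], new_block, parity)
blocks[1] = new_block
```

When an entire stripe is written at once, the parity can be computed from the new data alone with no preliminary reads, which is the motivation for ordering data into groups that conform to the stripe.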
The present invention provides a solution to these and other problems of the prior art.
The present invention is directed to a method and apparatus for storing data in a mass storage system, and in particular a mass storage system implementing RAID technology, by data type and topological categorization and ordering.
According to the present invention, a mass storage system includes a mass storage space for storing data items of a plurality of data types wherein each data item contains data of a corresponding data type and each data type is defined by the characteristics of the information represented by the data and wherein the storage space is topologically organized as a plurality of basic units of storage space wherein each basic unit of storage space contains storage space for a predetermined number of data blocks of predetermined sizes. A topological data formatter for storing the data in the storage space includes a write data buffer for and corresponding to each data type and an initial data classifier for data type categorizing of each data item to be written into the storage space as a member of a data type by performing an initial categorization of each data item to identify whether a data item is a member of a structured data type having defined data characteristics or a general data type having variable data characteristics. The initial data classifier then writes each data item that is a member of a structured data type into a corresponding type buffer and provides each data item that is a member of a general data type to a topological data classifier. The topological data classifier performs a topological categorization of each data item that is a member of a general data type and identifies whether each data item that is a member of a general data type is a full-basic unit data type wherein the data of the data items form one or more data block groups conforming to the basic unit of storage space or a partial-basic unit data type wherein the data of the data items form one or more data block groups differing from the basic unit of storage space. 
The topological data classifier then writes each data item that is a full-basic unit data type into a full-basic type buffer and writes each data item that is a partial-basic unit data type into a partial-basic type buffer. Subsequently, and upon performing a write of the data items to the mass storage, the initial data classifier reads each data item from the corresponding type buffer and for each data item of a structured data type, orders the data of the data items into one or more data block groups wherein each data block group corresponds to a basic unit of storage and writes the one or more data block groups of each structured data type into a corresponding data type area of the storage space. For each data item of the full-basic unit type and of the partial-basic unit type, the topological data classifier re-executes the topological classification of each data item as a full-basic unit data type or as a partial-basic unit data type, orders the data of each data item of a full-basic unit data type and the data of each data item of a partial-basic unit data type into one or more corresponding full-basic unit data block groups or one or more corresponding partial-basic unit data block groups, and writes the full-basic unit data block groups and the partial-basic unit data block groups into corresponding data type areas of the storage space.
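The two-stage categorization described above may be sketched as follows. The sketch is illustrative only and rests on several assumptions not stated in the foregoing description: the block and basic unit sizes, the set of structured data type names, and the reading of "conforming to the basic unit" as a data length that is a whole multiple of the basic unit size. The names DataItem, TopologicalFormatter and STRUCTURED_TYPES are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

BLOCK_SIZE = 4        # bytes per data block (hypothetical, kept tiny)
BLOCKS_PER_UNIT = 4   # data blocks per basic unit (hypothetical)
UNIT_SIZE = BLOCK_SIZE * BLOCKS_PER_UNIT

# Hypothetical structured data types having defined characteristics.
STRUCTURED_TYPES = {"metadata", "log"}


@dataclass
class DataItem:
    kind: str       # data type as presented by the writer
    payload: bytes  # data to be stored


class TopologicalFormatter:
    """Sorts data items into one write buffer per data type."""

    def __init__(self):
        self.buffers = defaultdict(list)

    def initial_classify(self, item):
        # Structured types go directly to their own type buffer;
        # general data is passed on to the topological classifier.
        if item.kind in STRUCTURED_TYPES:
            return item.kind
        return self.topological_classify(item)

    @staticmethod
    def topological_classify(item):
        # Assumption: data "conforms" to the basic unit when it
        # fills a whole number of basic units exactly.
        size = len(item.payload)
        if size > 0 and size % UNIT_SIZE == 0:
            return "full-basic"
        return "partial-basic"

    def submit(self, item):
        buffer_name = self.initial_classify(item)
        self.buffers[buffer_name].append(item)
        return buffer_name


formatter = TopologicalFormatter()
formatter.submit(DataItem("metadata", b"inode"))
formatter.submit(DataItem("file", b"x" * 32))  # two full basic units
formatter.submit(DataItem("file", b"x" * 20))  # one unit plus a remainder
```

In this sketch the type buffers are simple lists; the ordering of buffered data into data block groups at write time is a separate step that is not shown here.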
In a further embodiment of the present invention, and upon writing each data item into a corresponding type buffer, the topological data formatter combines the data of a data item being written into a type buffer and the data of one or more data items residing in the type buffer to form a full-basic unit of storage space. In still further embodiments of the present invention, and upon performing a write of the data items to the mass storage, the topological data formatter returns all data items forming partial-basic unit data block groups to the corresponding type buffers for re-ordering into full-basic unit data block groups in a subsequent write to the storage space.
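The combining of buffered data into full basic units, and the retention of any partial remainder for a subsequent write, may be sketched as follows. The sketch is a simplification under stated assumptions: the class name TypeBuffer and the 16 byte basic unit size are hypothetical, and the data of successive data items is treated as a single byte stream, whereas an actual formatter would also need to record the boundaries of the individual data items.

```python
UNIT_SIZE = 16  # bytes of data per basic unit (hypothetical)


class TypeBuffer:
    """Accumulates data of one data type until full units can be written."""

    def __init__(self, unit_size=UNIT_SIZE):
        self.unit_size = unit_size
        self.pending = bytearray()

    def add(self, payload):
        # New data is combined with data already residing in the buffer.
        self.pending += payload

    def flush(self):
        """Return the full basic units ready to be written.

        Any remainder smaller than one basic unit stays in the
        buffer to be combined with data arriving later.
        """
        cut = (len(self.pending) // self.unit_size) * self.unit_size
        units = [
            bytes(self.pending[i:i + self.unit_size])
            for i in range(0, cut, self.unit_size)
        ]
        del self.pending[:cut]
        return units


buf = TypeBuffer()
buf.add(b"a" * 10)
buf.add(b"b" * 10)           # 20 bytes buffered: one full unit plus 4 bytes
first_write = buf.flush()    # one full unit; 4 bytes remain buffered
buf.add(b"c" * 12)           # completes a second unit with the remainder
second_write = buf.flush()
```

Deferring the remainder in this way means that every unit actually written fills a whole basic unit, at the cost of holding some data in the buffer until enough further data of the same type arrives.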
In a presently preferred embodiment, each basic unit of storage space is a stripe of a striped mass storage system and each stripe contains storage space for a predetermined number of data blocks. Also, the mass storage space is preferably structured into a plurality of data partitions wherein each data partition corresponds to a data type and is used to store data of the corresponding data type. In addition, and in a presently preferred embodiment of the mass storage system, the mass storage system is a RAID technology storage system and each stripe further includes at least one data block for storing data recovery information.