With increasing reliance on electronic means of data communication, different models have been proposed to store large amounts of data efficiently and economically. A data storage mechanism requires not only a sufficient amount of physical disk space to store data, but also various levels of fault tolerance or redundancy (depending on how critical the data is) to preserve data integrity in the event of one or more disk failures.
In a traditional RAID networked storage system, a data storage device, such as a hard disk, is connected to a RAID controller and associated either with a particular server or with a particular server that has a particular backup server. Thus, access to the data storage device is available only through the server associated with that data storage device. A client processor desiring access to the data storage device would, therefore, access the associated server through the network, and the server would access the data storage device as requested by the client. In such systems, RAID recovery is performed in a manner that is transparent to the file system client.
By contrast, in a distributed object-based data storage system that uses RAID, each object-based storage device communicates directly with clients over a network. An example of a distributed object-based storage system is shown in co-pending, commonly-owned, U.S. patent application Ser. No. 10/109,998, filed on Mar. 29, 2002, titled “Data File Migration from a Mirrored RAID to a Non-Mirrored XOR-Based RAID Without Rewriting the Data,” incorporated by reference herein in its entirety.
In many failure scenarios in a distributed object-based file system, the failure can only be correctly diagnosed and corrected by a system manager that knows about and can control system-specific devices. For example, a failure can be caused by a malfunctioning object-storage device, and the ability to reset such a device is reserved, for security reasons, to the system manager unit. Therefore, when a client fails to write to a set of objects, the client needs to report that failure to the system manager so that the failure can be diagnosed and corrective actions can be taken. In addition, the file system manager must take steps to repair the object's parity equation.
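By way of illustration only, the notion of an object's parity equation can be sketched as follows, assuming simple XOR-based parity over equal-sized stripe units. The helper names `xor_blocks` and `parity_holds` and the sample data are hypothetical and are not taken from the referenced application; the sketch merely shows how a write that updates a data unit without updating the parity unit leaves the equation unsatisfied until the parity is recomputed.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_holds(data_blocks, parity):
    """The parity equation holds when the XOR of all data units equals the parity unit."""
    return xor_blocks(data_blocks) == parity


# A stripe of three data units plus one parity unit (illustrative values).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)
assert parity_holds(data, parity)

# Simulate a partially failed write: a data unit changes, the parity unit does not.
data[1] = b"\xaa\xbb"
assert not parity_holds(data, parity)

# Repairing the parity equation here means recomputing parity from the current data.
parity = xor_blocks(data)
assert parity_holds(data, parity)
```

In a real system the data and parity units would reside on separate object-based storage devices; the in-memory lists above stand in for those devices only for the purpose of the example.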
In instances where a client fails to write to a set of objects, it would be desirable if the role of the system manager were not limited to repairing the error condition, but also extended to repairing the parity equation of the affected file system object. Expanding the role of the system manager to include correction of the parity equation is advantageous because the system no longer needs to depend on the file system client that encountered the failure to repair the object's parity equation. The present invention provides an improved system and method that, in instances where there is an I/O error, transmits information to the system manager sufficient to permit the system manager to repair the parity equation of the object associated with the I/O error.
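By way of illustration only, the hand-off of an I/O error from a client to the system manager might be sketched as follows. Everything in this sketch is an assumption made for the example: the `IOErrorReport` fields, the `SystemManager` class, the in-memory stripe layout (last unit is XOR parity), and the repair strategy of recomputing parity from the data units. It is not the claimed protocol, only a toy model of a report that carries enough information for the manager, rather than the client, to restore the parity equation.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class IOErrorReport:
    """Hypothetical report a client sends to the system manager after a failed write."""
    file_id: str
    stripe_index: int
    failed_components: list  # indices of stripe units whose write failed


class SystemManager:
    """Toy manager; each stripe is a list of byte strings whose last unit is XOR parity."""

    def __init__(self, stripes):
        self.stripes = stripes  # {(file_id, stripe_index): [data..., parity]}
        self.reset_log = []     # components the manager chose to reset

    def handle_report(self, report):
        # Only the manager is permitted to reset a suspect device (assumed policy).
        self.reset_log.extend(report.failed_components)
        stripe = self.stripes[(report.file_id, report.stripe_index)]
        data_units = stripe[:-1]
        # Restore the parity equation by recomputing parity from the data units.
        stripe[-1] = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_units))
        return stripe[-1]


# Usage: the parity unit (index 2) was never written, so it is stale.
manager = SystemManager({("file-a", 0): [b"\x01", b"\x02", b"\x00"]})
repaired = manager.handle_report(IOErrorReport("file-a", 0, [2]))
assert repaired == b"\x03"
```

The design point the sketch is meant to convey is that the client's responsibility ends once the report is delivered; the repair itself does not depend on the client remaining available.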