1. Field of the Invention
The present invention relates to controlling storage, modification, and transfer of data in a network providing distributed data storage services.
2. Description of the Related Art
Data storage and retrieval technology requires the availability of data in a timely manner. Basic data storage techniques involve generating a copy of the original data as a backup: such backup systems include simultaneous copying to two or more storage locations (e.g., simultaneous copying to two hard drives), and archival of data. Data archival has progressed from tape backup systems, to backups using compact disc (CD-R) technology, etc.
Distributed storage systems typically maintain duplicate copies of data on multiple, distinct network nodes to ensure that the loss of a network node does not cause a loss of data. In addition, distributed storage systems enable multiple client devices to access the same data files throughout a network.
The presence of multiple duplicate copies of data throughout the network introduces a new problem in terms of data integrity, namely ensuring that any change to one of the data copies is a valid (i.e., authorized) change, and that any valid change is implemented in each of the copies to ensure consistency between the multiple copies.
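By way of illustration only, the two requirements stated above (a change must be authorized, and an authorized change must reach every copy) can be sketched as follows. The function name `apply_change`, the `authorized` set, and the use of in-memory dictionaries to stand in for copies stored on network nodes are hypothetical and are not taken from any existing system.

```python
def apply_change(replicas, authorized, client, key, value):
    """Apply one change to every copy, or to none if the client is not authorized.

    replicas   -- list of dicts, each standing in for one stored copy (assumption)
    authorized -- set of client names permitted to write (assumption)
    """
    if client not in authorized:
        # Reject before touching any copy so the copies stay consistent.
        raise PermissionError(f"{client} may not modify {key}")
    for copy in replicas:
        copy[key] = value


# Three copies of the same data file, one authorized writer.
replicas = [{"report": "v1"} for _ in range(3)]
apply_change(replicas, {"author"}, "author", "report", "v2")
```

The sketch assumes every copy is reachable at the time of the change; it does not address what happens when some copies cannot be reached.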
In addition, a multi-client environment that enables multiple clients to access the same data files increases the level of complexity in terms of controlling write access to ensure changes by different clients can be merged into the data files. Although existing database technologies control modification of data files accessible by multiple users, such database technologies assume that the multiple users are accessing the same data file, as opposed to a duplicate copy elsewhere in the network.
This problem becomes readily apparent when considering a scenario where a network having multiple nodes for storage of respective copies of a data file is divided into two groups of nodes, for example due to a link failure between the groups or due to the first group of nodes moving together to another location inaccessible by the second group of nodes. In this case, one of three options is available to provide data integrity: read-only access; write access statically assigned to only one of the groups; or synchronization when the two groups of nodes are rejoined into the original network.
The read-only option suffers from the disadvantage that the data cannot be modified by any node in either of the two groups of nodes. The static assignment of write access to only one of the groups requires manual selection and configuration, and still prevents the other group from any modification of the data. Synchronization between the two modified data files upon rejoining of the two groups requires a manual synchronization or source code management (e.g., version tracking). Further, a data merge problem may arise if both groups of nodes perform changes that are mutually exclusive.
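The data merge problem can be made concrete with a minimal sketch of a three-way comparison performed when the two groups rejoin. This is an illustration under stated assumptions, not a description of any particular synchronization tool: the function `merge_after_partition` is hypothetical, a data file is modeled as a dictionary of named fields, and the copy held before the division (`base`) is assumed to be available to both groups.

```python
def merge_after_partition(base, copy_a, copy_b):
    """Three-way merge of two diverged copies against their common ancestor.

    Returns (merged, conflicts). A field changed by only one group takes that
    group's value; a field changed differently by both groups is reported as a
    conflict and keeps its pre-division value pending manual resolution.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(copy_a) | set(copy_b)):
        orig, a, b = base.get(key), copy_a.get(key), copy_b.get(key)
        if a == b:
            value = a              # both groups agree, or neither changed it
        elif a == orig:
            value = b              # only the second group changed it
        elif b == orig:
            value = a              # only the first group changed it
        else:
            conflicts.append(key)  # mutually exclusive changes
            value = orig
        if value is not None:      # None models a field that is absent
            merged[key] = value
    return merged, conflicts


base = {"title": "Draft", "body": "v1"}
group_a = {"title": "Draft", "body": "v2-east"}
group_b = {"title": "Final", "body": "v2-west"}
merged, conflicts = merge_after_partition(base, group_a, group_b)
```

In this example the `title` change made by one group merges cleanly, while `body` was changed differently by both groups and is left for manual resolution, which is the case the paragraph above identifies as problematic.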
Further, the relative importance of data files changes between entities over time: during initial creation, an author will need substantial write access to save the data file during its creation; following its creation, reviewers will prefer write access to review and edit the data file, while the author may require occasional write access to a lesser degree; following approval and publication, little or no write access is needed by the author or reviewers, resulting in the data file and its copies on the network needlessly using up storage space in the network.
An application which places data uniformly across all storage nodes will optimize data integrity, but cause worst-case performance in terms of client access. However, placing all data on one node close to the client optimizes data access but causes a worst-case scenario in terms of data integrity (loss of the one node results in data loss). Moving the data closer to a client (for example, in terms of hop count) is referred to herein as “locality”.
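The trade-off between the two placement extremes can be illustrated with a small calculation. The sketch below is a simplification under stated assumptions: the function `placement_cost` and the node names are hypothetical, the cost of an update is modeled as the hop count to the farthest node holding a copy (since every copy must receive the change), and each listed node is assumed to hold a full copy.

```python
def placement_cost(hops, nodes_with_copy):
    """Return (update_hops, tolerated_failures) for one placement.

    hops            -- dict mapping node name to hop count from the client
    nodes_with_copy -- nodes that hold a full copy of the data (assumption)
    """
    # Every copy must be updated, so the farthest copy bounds the cost.
    update_hops = max(hops[n] for n in nodes_with_copy)
    # Data survives as long as at least one copy remains.
    tolerated_failures = len(nodes_with_copy) - 1
    return update_hops, tolerated_failures


hops = {"n1": 1, "n2": 4, "n3": 7}
everywhere = placement_cost(hops, ["n1", "n2", "n3"])  # uniform placement
nearest_only = placement_cost(hops, ["n1"])            # maximum locality
```

Under these assumptions, placing a copy on every node tolerates the loss of two nodes but ties update cost to the node seven hops away, whereas placing the only copy on the nearest node costs one hop but tolerates no node loss.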
An example of locality involves an e-mail client executed from a laptop computer, where the e-mail client is configured for accessing a pre-assigned e-mail server in a home office (e.g., in the Eastern United States); if the user travels to a distant office (e.g., in the Western United States or Europe), and the e-mail client sends a request to a second e-mail server in the distant office, the second e-mail server needs to retrieve the e-mail messages from the pre-assigned e-mail server in the home office for delivery to the e-mail client in the distant office. In this case, there is little locality between the e-mail messages and the e-mail client due to the necessity of the pre-assigned e-mail server being assigned as the destination repository for e-mail messages for the e-mail client.