1. Technical Field
This invention relates to de-duplication of data items in storage media devices. More specifically, the invention relates to enhancing the impact of data de-duplication by preferential selection of item(s) to be retained based upon the current loads and performance metrics of the devices.
2. Description of the Prior Art
A computer is a programmable machine that responds to a specific set of instructions in a well-defined manner, and executes a list of instructions, also known as a program. Computers generally include the following hardware components: memory, storage, an input device, an output device, and a central processing unit. There are various techniques and devices known in the art for storing large amounts of data. Examples of storage devices include, but are not limited to, hard disks, optical disks, and tapes. In a networked computer system, it is known to group two or more storage devices into a storage area network or a mass storage device. A storage area network is a high speed sub-network of shared storage devices, wherein each storage device is a machine that contains one or more disks or storage devices. In one embodiment, a storage area network allows all storage devices to be available to all servers on a local or wide area network. The data resides on the storage devices and not on the servers. This configuration of storage devices with respect to servers releases network capacity to the end user.
It is known in the art of storage technology for multiple copies of the same data to be stored on one or more storage devices in a storage area network. The redundant copies of data are also known as duplicate data. Recent developments in the art have encouraged removal of duplicate copies of data to make room available for non-duplicate data on the storage device(s). In storage technology, de-duplication refers to the elimination of redundant data. More specifically, the process of de-duplication deletes duplicate data, leaving only one copy of the data to be stored on storage media. At the same time, de-duplication retains an index of all retained data, so that the data can be located should it ever be required. Accordingly, de-duplication is able to reduce the required storage capacity, since only one copy of each unique data item is stored.
FIG. 1 is a flow chart (100) illustrating a prior art de-duplication process. A hash value is computed for each data item retained on the storage device (102), also known as an existing data item. A data item, D, is selected (104). All data items that have the same hash value as D are found (106). The data items found at step (106) are considered duplicates of D. The set S(D) is selected as the set of data items that have the same hash value as D (108). Following the creation of the set at step (108), it is determined whether the set includes more than one data item (110). A negative response to the determination at step (110) is followed by marking data item D as processed (112) and determining whether there are other unprocessed data items (114). A positive response to the determination at step (114) is followed by selection of the next unprocessed data item (116) and a return to step (106). In contrast, a negative response to the determination at step (114) is an indication that all of the data items have been identified and processed (118). If the response to the determination at step (110) is positive, any one of the data items from the set is selected and retained in storage, and the other identified copies in the set are removed from storage (120). Accordingly, the prior art solution for selecting among identified duplicate copies does not include an evaluation of the copies to determine an optimal copy to retain.
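The prior art flow of FIG. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular product: the function name deduplicate, the representation of a data item as a (device, path) location mapped to its bytes, the use of SHA-256 as the hash function, and the returned index are all hypothetical choices made for the example. The step numbers in the comments refer to FIG. 1.

```python
import hashlib


def deduplicate(items):
    """Sketch of the prior art process of FIG. 1.

    items: dict mapping a hypothetical (device, path) location to the
           bytes stored there.
    Returns (retained, index): the copies kept in storage, and a map
    from every original location to the location of the retained copy.
    """
    # Step 102: compute a hash value for each existing data item.
    # SHA-256 is an assumption; the source does not name a hash function.
    hashes = {loc: hashlib.sha256(data).hexdigest()
              for loc, data in items.items()}

    processed = set()
    retained = {}
    index = {}

    for d in items:  # steps 104 and 116: select the next data item D
        if d in processed:
            continue
        # Steps 106-108: S(D) is every item with the same hash value as D.
        s_d = [loc for loc in items if hashes[loc] == hashes[d]]
        # Step 120: retain any one item of the set. Taking the first is
        # arbitrary; no device load or access rate is consulted. When the
        # set holds a single item (step 110 negative), that item is D.
        keep = s_d[0]
        retained[keep] = items[keep]
        for loc in s_d:
            index[loc] = keep  # indexing retained for later lookups
        # Step 112: mark the items as processed.
        processed.update(s_d)

    # Step 118: all data items have been identified and processed.
    return retained, index


# Usage: two identical items on different hypothetical devices.
items = {
    ("dev_a", "/x"): b"hello",
    ("dev_b", "/y"): b"hello",
    ("dev_c", "/z"): b"world",
}
retained, index = deduplicate(items)
print(sorted(retained))  # [('dev_a', '/x'), ('dev_c', '/z')]
```

In this sketch the copy on dev_a survives only because it was encountered first, which mirrors the arbitrary selection at step (120).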
Once the duplicate copies of data have been removed, a single copy of the data remains on storage media. Each server that needs access to that data will have to retrieve it from the single storage medium that stores it. However, different storage media devices are known to have different access rates and may have different current loads. The prior art de-duplication process does not address the access rates or the current loads of the storage media devices. Rather, the prior art is restricted to retaining a single copy of each data item and removing the duplicate copies. Accordingly, there is a need to evaluate the current loads and other characteristics of the storage media devices before deciding which of the multiple copies of data should be retained, and on which storage media in the storage area network.