1. Field of the Invention
This invention relates to the areas of data storage, data deduplication and data security; in particular, to a method for multi-tenant, secure, data deduplication using data association tables.
2. Description of Related Art Including Information Disclosed Under 37 CFR 1.97 and 37 CFR 1.98
Much of the data in a data storage system, such as a data storage computer server or servers, is typically duplicated, repetitive or redundant data or information that may nonetheless have different data creators, users, controllers or owners. Examples of such duplicated data are application files, system files, image files, video files and email files. Within an organization or operational unit, these examples of duplicate data tend to have high redundancy levels.
The cost to store electronic data is directly proportional to the amount of data stored. The amount of data stored is amplified, often unnecessarily, by storage of duplicate data. As a result, both the size of the data storage systems and the cost of data storage are increased.
An element of cost containment for a data storage system is the elimination of duplicate data. Additionally, the efficiency of a data storage system is highly dependent on the system's ability to eliminate redundant information and is improved by the elimination of duplicated data. One method to eliminate duplicate data is painstaking, manual human intervention, in which duplicate data is identified and eliminated before the data is placed into a data storage system.
Another method to eliminate duplicate data is automated or computerized reduction of duplicate data, in which data is hashed so that identical data chunks or data sets can be identified and matched by their data hash. The data hash serves as a unique identifier for a particular identified and matched data set. Through use of a data hash, only one copy of a particular data set need be associated with the data hash and stored in a data storage system. Although a data hash serves as a unique identifier of a data set, a data hash does not afford secure access to a data set. In a large data storage system or environment, it is important that data deduplication be done in a way such that security of the data is maintained.
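As an illustration only, hash-based deduplication of the kind described above might be sketched as follows. The class name DedupStore, its methods, and the choice of SHA-256 as the hash function are assumptions made for this sketch and are not taken from any particular existing system.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one stored copy per distinct data hash."""

    def __init__(self):
        self._chunks = {}  # data hash (hex string) -> stored bytes

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256 is used as the data hash.
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this hash has not been seen before.
        self._chunks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._chunks[digest]

    def stored_copies(self) -> int:
        return len(self._chunks)


store = DedupStore()
hash_a = store.put(b"quarterly report")  # written by one user
hash_b = store.put(b"quarterly report")  # identical data from another user
hash_c = store.put(b"meeting notes")     # different data
```

In this sketch, the two identical data sets yield the same data hash, so only one copy is kept. Note that get returns the data to any caller that presents the hash, which is the access limitation noted above.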
Some existing data hash systems try to use obfuscation of a data hash as a means of security for the hashed data set. Obfuscation of a data hash may make locating a hashed data set or object within a data storage system more difficult, but the approach remains vulnerable to a random data hash request (or to a hashed object identification obtained through another means) that may allow unauthorized access to a hashed data set. In a private or closed data storage system, this may be acceptable. However, in an open or shared data storage system, this lack of data security is not acceptable.
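The weakness can be illustrated with a minimal sketch, assuming a store in which the data hash is the only thing needed to retrieve a data set. The function names put and get, the requester parameter, and the tenant labels are hypothetical and serve only to show the problem.

```python
import hashlib

chunks = {}  # data hash -> stored bytes; no record of who may read what


def put(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()  # assumption: SHA-256 as the data hash
    chunks.setdefault(digest, data)
    return digest


def get(requester: str, digest: str):
    # The requester is accepted but never checked: the hash alone acts as the key.
    return chunks.get(digest)


owner_hash = put(b"tenant A payroll data")

# A different party that obtains the hash by any means can read the data set.
leaked = get("tenant B", owner_hash)

# A request with a hash that matches nothing stored simply returns None.
miss = get("tenant B", "0" * 64)
```

Because nothing ties a data hash to an authorized owner or user, possession of the hash is sufficient for access in this sketch, regardless of who makes the request.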