1. Field of the Invention
The present invention is generally related to a storage system and in particular to a method and apparatus for reducing the amount of data stored in the storage system.
2. Description of the Related Art
A main concern of many storage administrators is rapid data growth, wherein the amount of data stored in a storage system increases so rapidly that it threatens to outstrip the capacity of the storage system. For example, data growth in some industries can be as high as 30-50 percent per year, which can require frequent upgrades and increases in the capacity of storage systems. Furthermore, increases in the amount of data stored in a storage system also cause increases in the costs of managing the data. Thus, it would be desirable to decrease the amount of data stored in storage systems, thereby decreasing the management costs and decreasing the required frequency of system upgrades.
One cause of the recent increases in the amount of data being stored in enterprise datacenters is data vaulting or long-term data preservation. It has become increasingly important for many businesses to keep data for long periods of time, often because of governmental regulatory requirements and similar requirements particular to a number of industries. Examples of government regulations that require long-term data preservation include SEC Rule 17a-4, HIPAA (The Health Insurance Portability and Accountability Act), and SOX (The Sarbanes-Oxley Act). The data required to be preserved is sometimes referred to as “Fixed Content” or “Reference Information”, which means that the data cannot be changed after it is stored. This differs from an active database, in which the data may be updated dynamically.
Another reason for recent increases in the amount of data being stored is data replication, mirroring or copying. In order to improve data accessibility, reliability, and the like, businesses keep one or more copies of data. Sometimes data is replicated periodically at a certain point in time, and the replicated data and the function itself are called a “snapshot” or “point-in-time copy” (PiT copy). For example, some businesses may keep more than three or four different copies and a number of different generations of data within their datacenters. Accordingly, preserving copied data for the long term is another main cause leading to rapid growth in the amount of stored data.
One well-known prior-art technology for reducing the amount of copied data is Copy On Write (COW) technology. COW is a technique for maintaining a point-in-time copy of a collection of data by copying only data which is modified or updated after the instant of replicate initiation. The original source data is used to satisfy read requests for both the source data itself and for the unmodified portion of the point in time copy. Because only differential data are kept in the storage system, the amount of redundant data can be reduced (see, e.g., www.snia.org/education/dictionary/c/). An example of a product that uses COW is QuickShadow™ available from Hitachi Data Systems Corporation of Santa Clara, Calif. Prior art patents related to COW include U.S. Pat. No. 5,649,152 to Ohran et al. and U.S. Pat. No. 5,555,389 to Satoh et al., the disclosures of which are incorporated herein by reference.
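The copy-on-write behavior described above may be illustrated by the following minimal Python sketch. It is illustrative only and does not represent the implementation of any cited product or patent; the class name `CowSnapshot`, its methods, and the use of an in-memory list as the source volume are assumptions made for this example.

```python
class CowSnapshot:
    """Point-in-time copy that keeps only blocks overwritten after creation.

    Hypothetical illustration: the source volume is modeled as a Python list.
    """

    def __init__(self, source):
        self.source = source   # live source blocks, shared with the writer
        self.preserved = {}    # block index -> contents at snapshot time

    def write_source(self, index, data):
        # Copy the original block aside only the first time it is overwritten.
        if index not in self.preserved:
            self.preserved[index] = self.source[index]
        self.source[index] = data

    def read_snapshot(self, index):
        # Unmodified blocks are read from the source data itself.
        return self.preserved.get(index, self.source[index])


source = [b"A", b"B", b"C"]
snap = CowSnapshot(source)
snap.write_source(1, b"X")   # the source changes; the snapshot keeps b"B"
```

In this sketch only one block is held by the snapshot after the write, while reads of the other two blocks of the point-in-time copy are satisfied from the source.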
Furthermore, it is known to use a technology called “pointer remapping” in COW systems. Pointer remapping is a technique for maintaining a point in time copy in which pointers to all of the source data and copy data are maintained. When data is overwritten, a new location is chosen for the updated data, and the pointer for that data is remapped to point to it. If the copy is read-only, pointers to its data are never modified (see, e.g., www.snia.org/education/dictionary/p/).
FIG. 2 illustrates a basic pointer remapping technique used in a snapshot COW system. This technique includes a base volume 100, which is a volume referred to by the snapshot, a virtual volume 110, which is a window volume for a host to access the snapshot, having no physical disk space, and a pool volume 120, which is a set of logical volumes storing differential data between the base volume and the snapshot. A mapping table 130 is stored in a memory area containing mapping information and snapshot control information. Pointer 111 is a reference to data 101 in the base volume 100, defined in the mapping table 130, while pointer 112 is a reference to data 122 in the pool volume 120, defined in the mapping table 130. When data is updated, a new location is designated for the updated data, and the pointer for that data is remapped in the mapping table so as to point to the location of the updated data.
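The relationship among the base volume, the virtual volume, the pool volume, and the mapping table may be sketched in Python as follows. This is a simplified illustration, not the structure of FIG. 2 itself: the class `VirtualVolume`, the tuple form of the pointers, and the assumption that updates arrive through the virtual volume are all hypothetical choices made for the example.

```python
class VirtualVolume:
    """Window onto a snapshot; holds pointers only, no block data of its own."""

    def __init__(self, base):
        self.base = base   # base volume: list of blocks
        self.pool = []     # pool volume: differential blocks
        # Mapping table: one pointer per block, initially into the base volume.
        self.mapping = [("base", i) for i in range(len(base))]

    def read(self, block):
        # Follow the pointer to whichever volume currently holds the block.
        volume, offset = self.mapping[block]
        return self.base[offset] if volume == "base" else self.pool[offset]

    def write(self, block, data):
        # Designate a new location in the pool for the updated data ...
        self.pool.append(data)
        # ... and remap the pointer for that block to the new location.
        self.mapping[block] = ("pool", len(self.pool) - 1)


base = [b"d0", b"d1"]
vv = VirtualVolume(base)
vv.write(1, b"new")   # block 1 now resolves to the pool; base is untouched
```

After the write, the pointer for block 0 still refers to the base volume, while the pointer for block 1 refers to the pool volume, analogous to pointers 111 and 112 respectively.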
However, conventional COW techniques do not work to reduce the amount of data already stored in storage systems. Although COW is a well-accepted technology in storage systems, COW is in operation only when the storage systems write data to disk. The COW technology has not been applied for reducing the amount of data that is already stored in a storage system.
Other techniques for reducing the amount of stored data in storage systems are also known. For example, it is also known in certain applications to use data commonality factoring, coalescence or de-duplication technology to discover any commonality in a storage system. Once the commonality is discovered, the redundant data may be eliminated to reduce the amount of data in the storage system. In order to find commonality, chunking (cutting data into smaller sizes of data) and hashing technologies may be used. Examples of the companies providing such technologies are Avamar Technologies, Inc. of Irvine, Calif., Data Domain of Palo Alto, Calif., Diligent Technologies of Framingham, Mass., and Rocksoft of Adelaide, Australia. Patents disclosing related technologies include U.S. Pat. No. 6,826,711 to Moulton et al. and U.S. Pat. No. 6,704,730 to Moulton et al., the disclosures of which are incorporated herein by reference.
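By way of illustration only, chunking and hashing may be combined to discover and eliminate commonality roughly as in the following Python sketch. The fixed chunk size, the choice of SHA-256, and the function names `deduplicate` and `restore` are assumptions for this example and are not taken from the above-referenced patents or products, which may use different chunking and hashing schemes.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Cut data into fixed-size chunks and store each distinct chunk once."""
    store = {}    # digest -> chunk contents (one entry per unique chunk)
    recipe = []   # ordered digests needed to rebuild the original data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # redundant chunks are not stored again
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


data = b"ABCDABCDWXYZABCD"
store, recipe = deduplicate(data)
```

In this example the sixteen bytes of input are divided into four chunks, of which only two are distinct, so only two chunks are stored while the original data remains fully recoverable.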
However, the coalescence technology described in the above-referenced patents requires new investment to enable it to be implemented in storage systems. Since the technology is new and not widely employed, it requires additional research and development costs, and, as a result, customers may be asked to pay more. Accordingly, there is a need for a technology that enables reducing the amount of data stored in storage systems and that leverages existing technologies to reduce development costs.
Further, it is known to use algorithms and mathematical techniques for searching and classifying the nearest neighbor among a set of data structures. For example, the paper “An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions”, by Sunil Arya et al., Journal of the ACM (JACM), v. 45 n. 6, p. 891-923, November 1998, discusses techniques for calculating a nearest neighbor using a balanced box-decomposition tree. These and similar mathematical techniques, generally known as the “nearest neighbor method”, may be applied to the storage system environment for classifying storage volumes into neighborhood groups having a desired degree of commonality, as will be described in more detail below in the Detailed Description of the Invention.
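As a rough illustration of the general idea of a nearest neighbor search over volumes, the following Python sketch uses a brute-force comparison rather than the balanced box-decomposition tree of the cited paper. Representing each volume as a set of chunk identifiers, measuring commonality as the Jaccard ratio, and the names `commonality` and `nearest_neighbor` are all assumptions made for this example and are not drawn from the source.

```python
def commonality(a: set, b: set) -> float:
    """Fraction of chunk identifiers shared by two volumes (Jaccard ratio)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def nearest_neighbor(name, volumes):
    """Return the other volume with the highest commonality to `name`."""
    others = (n for n in volumes if n != name)
    return max(others, key=lambda n: commonality(volumes[name], volumes[n]))


# Hypothetical volumes, each modeled as a set of chunk identifiers.
volumes = {
    "v1": {1, 2, 3, 4},
    "v2": {1, 2, 3, 5},
    "v3": {7, 8, 9},
}
closest = nearest_neighbor("v1", volumes)   # "v2", which shares 3 of 5 chunks
```

A grouping step could then place volumes whose commonality exceeds a chosen threshold into the same neighborhood group; the threshold itself would be a design parameter.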