Computers tend to access program and data memory unevenly: some memory addresses are accessed more frequently than others. Some types of memory can be accessed more rapidly than others, so the computer spends less time idle while waiting on an access. Fast memory devices, like those used for cache memory, are expensive and impractical to use for the whole program and data space. Disk storage is much slower to access, but is very attractive because its cost per byte of storage is low compared to other memory systems. The best balance between performance and system cost generally means using a combination of cache memory, main random access memory (RAM), and disk storage. System performance is least adversely impacted when the program and data that are accessed most frequently are kept available in the cache memory.
Determining what to put in the cache can be difficult. Many different algorithms and schemes have been tried. The problem generally boils down to which policies to use for populating the cache store and for page eviction. The primary objective is to have a page of memory already fetched to the cache before it is needed, and flushed out of the cache back to main memory when other pages need the room and will be needed sooner or more often. A number of algorithms exist that define the policies employed for both pre-fetch and eviction. One assumption often employed is that pages that have been recently used will be needed again soon. Another assumption is that the page following a recently accessed page will probably be accessed in the near future.
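The two assumptions above can be illustrated with a minimal sketch, which is not drawn from any particular product. It assumes pages are identified by integer page numbers, that the backing store is a plain dictionary, and that the cache holds at least two pages. The class name PrefetchingLRUCache and its members are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class PrefetchingLRUCache:
    """Page cache with LRU eviction and a simple next-page pre-fetch."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity            # assumed to be at least 2
        self.backing_store = backing_store  # hypothetical: page number -> data
        self.pages = OrderedDict()          # least recently used page first
        self.hits = 0
        self.misses = 0

    def _load(self, page_no):
        """Bring a page into the cache, evicting the LRU page if full."""
        if page_no in self.pages:
            return
        data = self.backing_store[page_no]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page_no] = data

    def read(self, page_no):
        if page_no in self.pages:
            self.hits += 1
        else:
            self.misses += 1
            self._load(page_no)
        self.pages.move_to_end(page_no)     # mark as most recently used
        data = self.pages[page_no]
        # Pre-fetch: assume the following page will be wanted soon.
        if page_no + 1 in self.backing_store:
            self._load(page_no + 1)
        return data
```

In a sequential scan under these assumptions, only the first read misses, because each read pulls the following page into the cache ahead of time.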
Conventional least recently used (LRU) techniques have been included in the cache components of storage controllers and tape servers to control which pages to evict and when to evict them. LRU techniques usually manage, control, and access metadata describing which data in cache memory has not yet been written back into main storage, so-called dirty data, and how long the data has been dirty. Improving cache replacement performance by assigning weights to the LRU, e.g., the time to fetch an object or the object size, is conventional prior art.
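As a rough illustration of the dirty-data metadata described above, the following sketch keeps, for each cached entry, the time at which it first became dirty, and writes dirty entries back to main storage when they are evicted. It is a simplified assumption of how such bookkeeping might look, not a description of any specific controller. The names WriteBackLRUCache, dirty_since, and dirty_age are hypothetical, and main storage is modeled as a dictionary.

```python
import time
from collections import OrderedDict


class WriteBackLRUCache:
    """LRU cache that tracks dirty entries and flushes them on eviction."""

    def __init__(self, capacity, main_storage, clock=time.monotonic):
        self.capacity = capacity
        self.main_storage = main_storage  # hypothetical: key -> data
        self.clock = clock                # injectable so it can be faked in tests
        self.entries = OrderedDict()      # key -> data, least recently used first
        self.dirty_since = {}             # key -> time the entry first became dirty

    def write(self, key, data):
        self.entries[key] = data
        self.entries.move_to_end(key)
        # Keep the earliest dirty time if the entry was already dirty.
        self.dirty_since.setdefault(key, self.clock())
        self._evict_if_needed()

    def read(self, key):
        if key not in self.entries:
            self.entries[key] = self.main_storage[key]  # loaded clean
        self.entries.move_to_end(key)
        data = self.entries[key]
        self._evict_if_needed()
        return data

    def dirty_age(self, key):
        """How long the entry has been dirty, in clock units."""
        return self.clock() - self.dirty_since[key]

    def _evict_if_needed(self):
        while len(self.entries) > self.capacity:
            key, data = self.entries.popitem(last=False)
            if key in self.dirty_since:   # write dirty data back before dropping it
                self.main_storage[key] = data
                del self.dirty_since[key]
```

A weighted variant could consult additional per-entry metadata, such as fetch time or object size, when choosing the eviction victim; that refinement is left out of this sketch.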
Data de-duplication, a relatively new technology employed in storage and tape servers, reduces storage needs by eliminating redundant data. Only one instance of each unique piece of data is retained on a storage device such as disk or tape. Redundant data is replaced with a pointer to the unique data. For example, a typical email system may have one hundred instances of the same one megabyte file attachment. If the email platform is backed up or archived, then all one hundred instances will be saved, requiring 100 MB of storage space. With data de-duplication, the one hundred instances are reduced to one unique instance of the attachment, and that one instance is all that actually needs to be stored. Each duplicate instance is referenced back to the one saved copy. In this example, a need for 100 MB of storage could be reduced to approximately 1 MB.
Data de-duplication, or “single instance storage” technology, scans incoming data for similarities, and creates and stores an index of duplicate blocks for data retrieval. It compares the incoming data with the most similar data already in a storage unit. If the incoming data is determined to be new, the new data is compressed and stored, and an index metric is updated with knowledge of the new data. In the process of data reduction or commonality factoring, a table or index is constructed and maintained that maps duplicate objects to a single copy of the object. Later, when a request for the duplicated object comes in, the object mapping is used to index to the single copy of the object.
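The store-and-index flow described above might be sketched as follows. This is a simplified illustration under stated assumptions: duplicates are detected by exact content match using a SHA-256 digest rather than by a similarity search, compression uses zlib, and the names SingleInstanceStore, blobs, and index are hypothetical.

```python
import hashlib
import zlib


class SingleInstanceStore:
    """Keeps one compressed copy per unique object plus a name-to-copy index."""

    def __init__(self):
        self.blobs = {}  # digest -> zlib-compressed bytes, one per unique object
        self.index = {}  # object name -> digest of its content

    def put(self, name, data):
        """Store data under name; return True if the content was new."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = zlib.compress(data)  # compress only new data
        self.index[name] = digest                     # duplicates become pointers
        return is_new

    def get(self, name):
        """Follow the index to the single stored copy and decompress it."""
        return zlib.decompress(self.blobs[self.index[name]])


# The email example: one hundred copies of the same 1 MB attachment.
store = SingleInstanceStore()
attachment = b"x" * 1_000_000
new_count = sum(store.put(f"mailbox-{i}/attachment", attachment) for i in range(100))
print(new_count, len(store.index), len(store.blobs))
```

Under these assumptions, one hundred names are recorded in the index while only a single copy of the attachment is held in the store.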
Data de-duplication offers other benefits. Lower storage space requirements may reduce disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which keeps recovery time objectives (RTO) favorable for a longer time and reduces the need for tape backups. Data de-duplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.
Data de-duplication can generally operate at the file, block, and even the bit level. File de-duplication eliminates duplicate files (as in the example above), but this is not an efficient means of de-duplication. Block and bit de-duplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a number for each piece that is, for practical purposes, unique, and that number is then stored in an index. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes do not constitute an entirely new file. This behavior makes block and bit de-duplication far more efficient. However, block and bit de-duplication take more processing power and use a much larger index to track the individual pieces.
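Block-level de-duplication of the kind described above can be sketched as follows, assuming fixed-size 4096-byte blocks and SHA-1 as the hash algorithm. The block size and the names BLOCK_SIZE, split_blocks, and BlockStore are illustrative assumptions. Each saved file is represented by a "recipe", the ordered list of block hashes needed to rebuild it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Stores each unique block once, keyed by its SHA-1 digest."""

    def __init__(self):
        self.blocks = {}  # SHA-1 hex digest -> block bytes

    def save(self, data):
        """Store data block by block; return (recipe, number of new blocks)."""
        recipe, new_blocks = [], 0
        for block in split_blocks(data):
            digest = hashlib.sha1(block).hexdigest()
            if digest not in self.blocks:  # only unseen blocks consume space
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        return recipe, new_blocks

    def load(self, recipe):
        """Rebuild the original data from its list of block digests."""
        return b"".join(self.blocks[d] for d in recipe)


# A ten-block file, then the same file with four bytes changed in place.
store = BlockStore()
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
edited = bytearray(original)
edited[5000:5004] = b"EDIT"
_, first_save = store.save(original)
recipe, second_save = store.save(bytes(edited))
print(first_save, second_save)
```

In this sketch the second save adds only the one block that contains the edit. Note that fixed-size blocks handle in-place changes well, whereas an insertion that shifts the remaining bytes would change every later block.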
Hash collisions are a potential problem with de-duplication. When a piece of data receives a hash number, that number is then compared with the index of other existing hash numbers. If that hash number is already in the index, the piece of data is considered a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When a hash collision occurs, the system will not store the new data because it sees that its hash number already exists in the index. This is called a false positive, and can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors also examine metadata to identify data and prevent collisions.
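The false-positive scenario can be reproduced with a deliberately weak hash, as in the sketch below. The function weak_hash is a toy 8-bit checksum invented purely to make a collision easy to trigger; it is not a hash any system would use. The function combined_hash illustrates one possible reading of "combining hash algorithms", namely requiring both an MD5 and a SHA-1 digest to match before a chunk is treated as a duplicate. All names here are hypothetical.

```python
import hashlib


def weak_hash(chunk):
    """Toy 8-bit checksum, chosen only so that collisions are easy to show."""
    return sum(chunk) % 256


def combined_hash(chunk):
    """Pair of two digests; both must match for a chunk to count as a duplicate."""
    return (hashlib.md5(chunk).hexdigest(), hashlib.sha1(chunk).hexdigest())


def dedupe(chunks, hash_fn):
    """Index chunks by hash; a chunk whose hash already exists is not stored."""
    index = {}
    for chunk in chunks:
        index.setdefault(hash_fn(chunk), chunk)
    return index


chunks = [b"ab", b"ba"]                # different data, same byte sum
lossy = dedupe(chunks, weak_hash)      # b"ba" is wrongly treated as a duplicate
safe = dedupe(chunks, combined_hash)   # both chunks are kept
print(len(lossy), len(safe))
```

With the toy hash, the second chunk is silently dropped, which is the data loss described above. Combining digests makes a collision far less likely but does not rule it out; a byte-for-byte comparison on a hash match would be another safeguard, at the cost of an extra read.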
In actual practice, data de-duplication is often used in conjunction with other forms of data reduction such as conventional compression and delta differencing. Taken together, these three techniques can be very effective at optimizing the use of storage space.