With the rise of computer systems and the ever-increasing reliance of industries, businesses, and individuals on electronic data, there has arisen a need to store and retrieve large amounts of data in electronic form in a fast, efficient, and economical way. For purposes of storing electronic data, hereinafter simply referred to as data, data is often broken up into blocks of a particular size. For example, data may be broken into 4 kilobyte blocks, referred to as 4 k blocks of data. The amount of data to be stored normally determines the size of the physical storage device required to store the data. The larger the storage device required to meet the storage demands, the higher the cost of the storage system. As a result, compression of blocks of data received for storage has been used in some data storage systems to minimize the amount of data to be stored.
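By way of illustration only, the blocking and compression described above may be sketched as follows. The sketch is not drawn from any particular known system. The 4096 byte block size, the helper name split_into_blocks, and the use of the zlib compressor from the Python standard library are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed size of a "4 k" block, in bytes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Break data into fixed-size blocks; the final block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Highly repetitive sample data (15000 bytes), chosen so that it compresses well.
data = b"example record " * 1000

blocks = split_into_blocks(data)

# Each block is compressed independently before it would be written to storage.
compressed = [zlib.compress(block) for block in blocks]

# Retrieval reverses the process: decompress each block and concatenate.
restored = b"".join(zlib.decompress(c) for c in compressed)
```

Compressing each block independently allows a single block to be retrieved without decompressing its neighbors, at the cost of a somewhat lower compression ratio than compressing the data as a whole.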
In addition to compressing blocks of data, prior known systems have also attempted to de-duplicate data, e.g., to eliminate the storage of duplicate blocks of data. However, systems which use de-duplication require additional data processing and memory to track and properly manage the de-duplication of data blocks and the requests to store and retrieve blocks of data that may be duplicative. One approach that has met with some limited success in tracking de-duplicated data blocks has been to use a one-way hash function to create a hash value that is associated with the physical address at which the block of data is stored. In such systems, hash values corresponding to blocks of data can be compared to determine if the blocks of data are the same. Given that the hash value in such systems is usually around 128 bits, which is normally shorter than the length of the blocks of data being stored, two different blocks of data may result in the production of the same hash value. The use of a relatively long hash value in combination with a good hash function minimizes the probability that two blocks of data will result in the same hash value when processed. However, the possibility of two different blocks of data resulting in the same hash value, sometimes referred to as a collision, remains a real possibility. The known systems which use a very computationally heavy and complex hash function such as MD-5 and a hash value of 128 bits to provide a low risk of collisions have the distinct disadvantage of requiring specialized hardware to implement the hash function. Thus, the current approach has disadvantages in terms of cost, due to the hardware requirements, as well as in terms of flexibility in how a system can be implemented, since support for the specialized hardware used to perform the MD-5 hash function needs to be provided in at least some known systems.
Additionally, in some systems the process for determining whether a block of data received for storage is duplicative includes retrieving from the storage media each previously stored block of data having a matching hash value, which is very time consuming.
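By way of illustration only, the hash-based tracking and the read-back comparison described above may be sketched as follows. The sketch is not drawn from any particular known system. The class name DedupStore, the use of an in-memory list in place of a physical storage device, and the use of hashlib.md5 from the Python standard library to produce a 128 bit hash value are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory stand-in for a de-duplicating block store."""

    def __init__(self):
        # Position in this list plays the role of a physical address.
        self.blocks = []
        # Maps a 128 bit hash value to the addresses of blocks with that hash.
        self.index = {}

    def put(self, block: bytes) -> int:
        """Store a block and return its address, reusing a duplicate if found."""
        digest = hashlib.md5(block).digest()
        for addr in self.index.get(digest, []):
            # A matching hash value does not prove the blocks are identical
            # (a collision is possible), so each candidate block is read back
            # and compared byte for byte.
            if self.blocks[addr] == block:
                return addr
        # No identical block is stored yet: write it and record its hash value.
        self.blocks.append(block)
        addr = len(self.blocks) - 1
        self.index.setdefault(digest, []).append(addr)
        return addr

    def get(self, addr: int) -> bytes:
        """Retrieve the block stored at the given address."""
        return self.blocks[addr]


store = DedupStore()
first = store.put(b"A" * 4096)
second = store.put(b"B" * 4096)
repeat = store.put(b"A" * 4096)  # same content as the first block
```

In this sketch the repeated block is assigned the address of the first block and is not stored a second time. The comparison inside put is inexpensive here because the blocks are held in memory; where the candidate blocks must instead be read from physical storage media, that same step accounts for the delay noted above.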
In some instances, users wish to optimize the speed at which data is stored on and/or retrieved from the storage media and are willing to forgo the de-duplication of data to avoid the delays in storage and/or retrieval caused by the de-duplication process.
The management of the storage and retrieval of data in a data storage apparatus is important to ensuring that data is properly tracked, especially when de-duplication of data blocks is utilized to reduce the amount of data that needs to be stored. Moreover, the methods used to manage the storage and retrieval of data from the physical storage device are also important to the amount of time it takes to store and/or retrieve data from the physical storage device.
The type of physical storage device or media, e.g., ROM, RAM, magnetic disk, optical disk, hard drives, solid state memory, upon which the data is stored is an additional aspect of a storage apparatus that affects the speed at which data can be stored on and retrieved from the physical storage device. For example, magnetic disks or drums have mechanical limitations that reduce the speed with which data can be read from the media.
Thus, there is a need for data processing methods and apparatus that can efficiently and effectively manage the storage and retrieval of data while reducing the amount of data to be stored as well as the amount of memory used to track the storage of data. Furthermore, there is a need for data processing methods and apparatus that can use lighter weight and computationally simpler hash functions than those currently used in the management of data storage systems. In particular, there is a need for methods and apparatus which allow for data de-duplication, e.g., using a hash function and/or other techniques, but without requiring specialized hardware, e.g., to implement the hash function. Moreover, there is a need for improved data de-duplication that reduces and/or minimizes the time it takes to identify duplicative data blocks without retrieving each potentially duplicative data block from the physical storage media. There is also a need for improved data storage with de-duplication methods and apparatus that reduce and/or minimize the time for storing blocks of data while also performing some de-duplication.