In computer hardware technology it is well-known to use disk storage devices like hard disk drives (HDDs) or optical disk drives built-up of one or a stack of multiple hard disks (platters) on which data is stored on a concentric pattern of magnetic/optical tracks using read/write heads. These tracks are divided into equal arcs or sectors. Two kinds of sectors on such disks are known. The first and at the very lowest level is a servo sector. In the case of a magnetic storage device, when the hard disks are manufactured, a special binary digit (bit) pattern is written in a code called ‘gray code’ on the surface of the disks, while the drive is open in a clean room, with a device called “servo writer”.
This gray code consists of successive numbers that differ by only a single bit like the three bit code sequence, 000′, 001′, 011′, 010′, 110′, etc. Although many gray codes are possible, one specific type of gray code is considered the gray code of choice because of its efficiency in computation. Although there are other schemes, the gray code is written in a wedge at the start of each sector. There are a fixed number of servo sectors per track and the sectors are adjacent to one another. This pattern is permanent and cannot be changed by writing normal data to the disk drive. It also cannot be changed by low-level formatting the drive.
Disk drive electronics use feedback from the heads which read the gray code pattern, to very accurately position and constantly correct the radial position of the appropriate head over the desired track, at the beginning of each sector, to compensate for variations in disk (platter) geometry caused by mechanical stress and thermal expansion or contraction.
At the end of the manufacturing process, the hard disk storage devices generally are low-level formatted. Afterwards, only high-level operations are performed such as known partitioning procedures, high-level formatting and read/write of data in the form of blocks as mentioned above. All high-level operations can be derived from only two base operations, namely a BlockRead and a BlockWrite operation. Thus even partitioning and formatting, the latter independently of the underlying formatting scheme like MS-DOS FAT, FAT32, NTFS or LINUX EXT2, are accomplished using the mentioned base operations.
When high-level formatting such a disk drive, each disk (platter) is arranged into blocks of fixed length by repeatedly writing with a definite patch like “$5A”. After formatting, when storing data in such disk storage devices, these data are stored as continuous data segments on the disk (platter). These continuous data segments are also referred to as “data” or, simply, “blocks” and such terminology will be used hereinafter.
It is to be noted that, in known tape storage devices, data are stored in form of data blocks, as well. The only difference between the above described hard disk devices and these tape storage devices is that data stored on HDDs are directly accessible by means of the read/write head (so-called direct memory access DMA operation mode), whereas data stored on tapes are only accessible in a sequential manner since the tape has to be wound to the location where the data of interest are stored before these data can be accessed.
In order to minimize storage occupancy in those storage devices, it is known to avoid duplicate data. A disk drive system comprising a sector buffer having a plurality of segments for storing data and reducing storage occupancy is disclosed in U.S. Pat. No. 6,092,145 assigned to the assignee of the present invention. Generally, HDD systems require a sector buffer memory to temporarily store data in the HDD system because the data transfer rate of the disk is not equal to the data transfer rate of a host computer and thus a sector buffer is provided in order to increase the data I/O rate of new high capacity HDD systems. The system described therein particularly includes a controller for classifying data to be stored in the sector buffer and for storing a portion of the classified data in a segment of the sector buffer such that the portion of classified data stored in the segment is not stored in any other segment in the sector buffer. Therefore, the sector buffer is handled more efficiently, and the computational load to check for duplicated data is reduced and the disk drive thus improves data transfer efficiency.
The subject matter of the U.S. Pat. No. 6,092,145, in other words, concerns an improved method for read-ahead and write-ahead operations using a sector buffer wherein duplicates are eliminated only in the sector buffer implemented on the hard disk or a separate Random Access Memory (RAM), in order to provide the improved transfer efficiency mentioned above.
Another approach for optimizing storage occupancy is disclosed in U.S. Pat. No. 5,732,265 assigned to Microsoft Corporation. Particularly disclosed is an encoder for use in CD-ROM pre-mastering software. The storage in the computer readable recording medium (CD-ROM) is optimized by eliminating redundant storage of identical data streams for duplicate files whereby two files having equivalent data streams are detected and encoded as a single data stream referenced by the respective directory entries of the files. More particularly addressed therein is the problem of data consistency that arises when multiple files are encoded as a single data stream and when these files are separately modified by an operating system or application program. In U.S. Pat. No. 5,732,265 it is further disclosed to implement such an encoder in an operating system or file system to dynamically optimize storage in the memory system of the computer wherein the above-described mechanism is applied at the time a file is created or saved on a data volume to detect whether the file is a duplicate of another existing file on the data volume.
The above-discussed prior art approaches, however, have a disadvantage in that they do not address reduction of storage occupancy of stored user data (e.g. within a file or between files) which is stored in an above identified data storage device. Only as an example, it is referred to text or picture files where blocks frequently are fully represented by a recurring data byte being regarded as duplicate data in the present context. Nevertheless, as computer usage and application programs supporting it has become more sophisticated, there is an increased likelihood that relatively large portions of individual (possibly large) files comprising many blocks of data may be duplicated in many stored files; letterhead and watermarks stored in documents and portions of image files representing relatively large image areas having little detail therein being only a few examples. Further, as the capacity of memory devices increases, it becomes even more clearly impractical to compare a block to the stored with all blocks which may have been previously stored in one or more memory devices to deter mine if an identical block has been previously stored.