Conventional computer storage systems typically store sequences of bytes as named files in file systems. Although many files may be very similar to one another and share large portions of data 130, 132 (FIG. 13), these systems may not eliminate this redundancy. Instead, they may store each file separately 140, 142, keeping multiple copies 130, 132 of the same data (FIG. 14).
Some conventional file systems incorporate conventional lossless text compression algorithms (such as GZip) to compress individual files, but this can be viewed as a “keyhole” redundancy elimination technique because it analyses the redundancy of a single file at a time rather than that of the file system as a whole. These conventional text compression algorithms may be incapable of spotting similarities between widely separated data 150, 152, such as two similar files 130, 132 in different parts of a file system (FIG. 15).
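The "keyhole" limitation can be illustrated with a minimal sketch. It assumes Python's standard zlib module, which implements DEFLATE (the algorithm used by GZip) with a 32 KiB history window. The helper names random_bytes, compressed_size and keyhole_vs_joint are hypothetical, and random bytes stand in for file content so that the only redundancy present is the duplication itself. The sketch compares compressing two identical files separately with compressing them as one stream.

```python
import random
import zlib


def random_bytes(n, seed=0):
    """Return n pseudo-random bytes (essentially incompressible on their own)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def compressed_size(data):
    """Size in bytes of the zlib (DEFLATE) compressed form of data."""
    return len(zlib.compress(data, 9))


def keyhole_vs_joint(block_size):
    """Compare per-file compression with compression of both files together.

    Two "files" hold identical content of block_size bytes.
    Returns (total size when compressed separately, size when compressed jointly).
    """
    shared = random_bytes(block_size)
    separate = 2 * compressed_size(shared)   # each file seen through its own keyhole
    joint = compressed_size(shared + shared)  # both copies in a single stream
    return separate, joint


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024):
        separate, joint = keyhole_vs_joint(size)
        print(f"{size:6d}-byte files: separate={separate}  joint={joint}")
```

Under these assumptions, per-file compression never removes the duplicate copy. Joint compression removes it only when the two copies lie within the 32 KiB window of each other: the 4 KiB pair shrinks to roughly the size of one copy, while the 64 KiB pair does not shrink at all, which corresponds to the widely separated data 150, 152 described above.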
What is desired is a method and apparatus for representing data in a form that makes it possible to identify some of its repeated sequences and to reduce the number of copies of the repeated data that are stored.