The present disclosure relates to file optimization.
When backup technology first became available, most of the data being backed up was in simple un-encoded data formats, such as ASCII text or un-encoded databases. The amount of data was typically on the order of megabytes or gigabytes. In contrast, in more recent backup workflows, the data is typically on the order of terabytes or petabytes and the files are in rich media formats, including encoded images, compressed text sections, and other complex object formats and data structures such as CAD drawings or seismic information.
Encoded files come in a number of encoded file formats. An encoded file (e.g., a rich media file) is one that has some level of obfuscation on top of the information contained in the file. This is usually the result of a compression or encoding scheme applied to the file, such as JPEG compression, ZIP (LZ) compression, or something simpler such as RLE (Run Length Encoding) or BASE64 (found in email systems). The nature of most encoding schemes is such that the encoding of a particular byte depends on the preceding bytes. This means that a particular byte may be encoded (represented in bits) differently each time it occurs, if its history of preceding bytes is different. As a result, two encoded rich media documents that appear similar or even identical in information content may have entirely different bit representations on storage media.
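The dependence of an encoding on preceding bytes can be illustrated with a minimal sketch of run-length encoding. The encoder below is a toy written for this illustration, not a format described in this disclosure: the function names `rle_encode` and `rle_decode` and the (run length, byte value) pair layout are assumptions. It encodes the same three trailing bytes after two different leading bytes and shows that the encoded form of those trailing bytes changes.

```python
def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoder (hypothetical layout).

    Emits one (run_length, byte_value) pair per run, with run_length
    capped at 255 so that it fits in a single byte.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats the current one.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode: expand each (run_length, byte_value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


# The same three trailing bytes, preceded by two different single bytes.
tail = b"AAB"
first = rle_encode(b"A" + tail)   # the tail's "A" bytes merge into the preceding run
second = rle_encode(b"B" + tail)  # the tail's "A" bytes start a run of their own

print(first.hex())
print(second.hex())
print(second.endswith(rle_encode(tail)), first.endswith(rle_encode(tail)))
```

In the second case the encoded output ends with the standalone encoding of the tail, while in the first case it does not, because the preceding byte was absorbed into the same run. Dictionary-based schemes such as LZ show the same effect on a larger scale, since each match refers back to earlier data.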