Differential or delta encoding is commonly used in data processing, and is particularly useful for compressing strongly correlated, ordered data sets, such as sequences of video images. Since there is usually a strong correlation in content between the successive images in a video sequence, it is possible to achieve significant reduction in data traffic by only transmitting the difference between one image and the previous image, instead of transmitting each complete image.
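As an illustration only, the principle of transmitting differences between successive images can be sketched as follows. The sketch assumes each image is a flat list of integer pixel values of equal length; the function names `delta_encode` and `delta_decode` are hypothetical and do not correspond to any particular video codec.

```python
def delta_encode(frames):
    """Keep the first frame whole; replace each later frame by its
    element-wise difference from the preceding frame."""
    if not frames:
        return []
    encoded = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded):
    """Rebuild the original frames by accumulating the differences."""
    if not encoded:
        return []
    frames = [list(encoded[0])]
    for delta in encoded[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames


# Three strongly correlated "images" (assumed toy data).
frames = [[10, 10, 12, 200], [10, 11, 12, 200], [10, 11, 13, 201]]
encoded = delta_encode(frames)
decoded = delta_decode(encoded)
```

Because successive frames are strongly correlated, the difference frames consist mostly of zeros and small values, which a subsequent entropy coder can represent compactly.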
Other applications for delta encoding include data backup systems in which, instead of storing a complete new copy of the data to be backed up, an incremental backup is created which stores only the differences from the previous backup.
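By way of a minimal sketch, an incremental backup can be approximated at file granularity by comparing content digests against those recorded for the previous backup. The helper names `digest` and `incremental_backup`, and the representation of a file collection as a dictionary mapping names to bytes, are assumptions made for the example; practical backup systems may instead compute differences within files.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to detect changes between backups."""
    return hashlib.sha256(data).hexdigest()


def incremental_backup(current, previous_digests):
    """Return the files that are new or changed since the previous
    backup, and the names of files that have been deleted."""
    changed = {
        name: data
        for name, data in current.items()
        if previous_digests.get(name) != digest(data)
    }
    deleted = sorted(set(previous_digests) - set(current))
    return changed, deleted


# State recorded at the previous (full) backup: assumed toy data.
previous = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
previous_digests = {name: digest(data) for name, data in previous.items()}

# Current state: b.txt modified, c.txt removed, d.txt added.
current = {"a.txt": b"alpha", "b.txt": b"beta v2", "d.txt": b"delta"}
changed, deleted = incremental_backup(current, previous_digests)
```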
Delta encoding techniques have also been proposed as a substitute, or supplement, for web page caching. In this case, a web page may be cached locally by a browser, for example, and then, instead of refreshing entire web pages when necessary, the browser and web server negotiate subsequent transactions such that only incremental changes to the web pages are communicated to the browser. Similarly, websites which are to be replicated (mirrored) in order to improve their accessibility and reliability can be kept synchronized with each other by exchanging only the changed (delta) content between the mirrored sites.
Correlation encoding may be lossless, in which case data is encoded in such a way that it can subsequently be completely reconstituted to its original state, or lossy, in which case certain approximations are made during the encoding process, with the result that the encoded data no longer contains all the information required to reconstitute the data in its original state. In general, lossy encoding offers a significantly greater compression ratio than lossless encoding.
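The distinction between lossless and lossy encoding can be illustrated with the following sketch. It assumes 8-bit sample values, uses the standard `zlib` module for the lossless case, and uses a simple uniform quantization with an assumed step size `STEP` for the lossy case; the functions `quantize` and `dequantize` are hypothetical and chosen only for illustration.

```python
import zlib

samples = bytes([12, 13, 15, 14, 200, 201, 199, 202])

# Lossless: the original data is reconstituted exactly.
restored = zlib.decompress(zlib.compress(samples))

# Lossy: each sample is approximated by the centre of its quantization
# bin, so fewer distinct values need to be encoded.
STEP = 4  # assumed quantization step


def quantize(data, step=STEP):
    return bytes(v // step for v in data)


def dequantize(codes, step=STEP):
    return bytes(c * step + step // 2 for c in codes)


codes = quantize(samples)
approx = dequantize(codes)
max_error = max(abs(a - b) for a, b in zip(samples, approx))
```

In this sketch the lossy reconstruction differs from the original by at most half the step size per sample, while the quantized codes take fewer distinct values and are therefore easier to compress.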
Modern mobile phones are capable of taking many photographs and uploading them via wireless internet connections. One reason why mobile phones and smartphones have become major image capture devices is that the quality of the photos taken with their built-in cameras is increasing. There are many web-based services which allow users to archive and share their private photos. However, the access bandwidth needed for frequent, automatic uploading of photos to an internet-based photo management service, for example, may be unavailable or too expensive.
A problem for network-based photo services (e.g. photo archives or photo sharing services) is that the upload of images may take a long time because of the large size (resolution) of the images. For archiving services, the user usually wants to store the best quality available (raw images with high spatial and color resolution). With the limited upload bandwidths available using DSL or mobile networks, the process of image transfer can take a long time. In some cases the process may need to be scheduled in advance, and may take many hours.
Data sets, such as batches of images, are routinely compressed, for example when the data sets are to be archived or transmitted, and where it is important to reduce the amount of bandwidth or data storage space required for the transmission or storage of the data sets. General purpose data-compression algorithms, such as the well-known Lempel-Ziv algorithm and its successors, involve identifying recurring patterns of data in a batch of data, and building a dictionary of such patterns, such that each pattern can be referred to by its dictionary reference.
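A dictionary-based scheme of this kind can be sketched, purely by way of example, in the style of LZ78, one member of the Lempel-Ziv family. The function names `lz78_encode` and `lz78_decode` are hypothetical; each output pair consists of the dictionary reference of the longest previously seen pattern and the character which follows it.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next character) pairs.
    Index 0 denotes the empty pattern."""
    dictionary = {}  # pattern -> 1-based dictionary reference
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending a known pattern
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:  # flush a trailing pattern that is already known
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


text = "abababababab"
pairs = lz78_encode(text)
decoded = lz78_decode(pairs)
```

The decoder rebuilds the same dictionary as the encoder while decoding, so the dictionary itself does not need to be transmitted.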
The term “data set” is used in this application to refer to any item of data which may be the subject of correlation processing with other items of data, for example processing for compression, analysis or other types of data manipulation. Several examples of such data sets are given in this application, such as a collection of photographs in a digital camera, which are to be uploaded to a web server. Another example would be batch processing of the content of a batch of digital images, for example if a user wishes to carry out an image enhancement operation such as sharpening or contrast enhancement on all the images in the batch, or to convert a batch of images from one color space to another. Batch processing in this context means performing the same operation, or the same type of operation, on a batch (plurality) of data sets.
Another example of data sets is in the batch-processing of OCR (optical character recognition) documents. Text documents can be subjected to a correlation encoding process, for example, in which similar pieces of scanned text (phrases, words or word segments, for example), which occur multiple times in the scanned images, are encoded as references to entries in a dictionary. In this case, each data set may be a whole scanned image (e.g. a page), or it may be a section of an image, such as a part of a text document which has been identified as a word or other group of characters or symbols.
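The following sketch illustrates the idea of encoding recurring pieces of text as references to dictionary entries shared across several pages. It assumes, for simplicity, that the text has already been recognized and that pieces of text are matched by exact comparison of whitespace-separated words, whereas an actual OCR process may match similar image regions. The function names `encode_pages` and `decode_pages` are hypothetical.

```python
def encode_pages(pages):
    """Encode each page as a list of references into one dictionary
    that is shared by all pages in the batch."""
    dictionary = []  # reference -> word
    index = {}       # word -> reference
    encoded = []
    for page in pages:
        refs = []
        for word in page.split():
            if word not in index:
                index[word] = len(dictionary)
                dictionary.append(word)
            refs.append(index[word])
        encoded.append(refs)
    return dictionary, encoded


def decode_pages(dictionary, encoded):
    """Rebuild each page from its dictionary references."""
    return [" ".join(dictionary[i] for i in refs) for refs in encoded]


# Assumed toy data: two scanned pages with recurring words.
pages = ["the cat sat on the mat", "the dog sat on the cat"]
dictionary, encoded = encode_pages(pages)
decoded = decode_pages(dictionary, encoded)
```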
Another example of data sets might be a collection of diverse computer files in a directory. Groups of such files may be collectively subjected to various forms of processing such as data compression, data backup, virus checking, file-system defragmentation, synchronization, etc.
It is known to use general-purpose data-compression algorithms to compress such unordered data sets. Files, or batches of files, can be compressed into the ubiquitous Zip format, for example (typically using the DEFLATE algorithm), which losslessly generates a compressed file containing all the information required to reconstitute the original file or files.
It is often necessary to process such data sets in batches. For example, a folder of files may need to be copied in one operation from a computer's internal storage to an external storage device such as a USB stick. Or a batch of holiday photographs may need to be uploaded from a mobile phone (for example via a wireless mobile network) to a social-media website server. Or a collection of pages of text may need to be scanned and OCR'ed in one operation.
Each of these operations can be sped up by reducing the amount of information to be processed. In the case of compressing batches of data sets for transfer, conventional methods either compress the data sets individually, in which case the data sets can be transferred individually but the compression is sub-optimal, or compress them together as one file, in which case the compression is improved but the data sets cannot be transferred individually. Similarly, in the case of the batch OCR example, the amount of processing to be carried out can be reduced by encoding the whole batch of pages together. The larger the data sample to be encoded, the more efficient the encoding, since similar patterns (and therefore greater correlation) are more likely to recur.
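The trade-off between the two conventional approaches can be illustrated with the following sketch, which uses the standard `zlib` module. The three data sets are assumed toy data: each has a short distinct header followed by the same block of pseudo-random content, so that there is little redundancy within any one data set but strong correlation between them.

```python
import random
import zlib

# Content shared by all data sets (seeded for repeatability).
rng = random.Random(0)
shared = bytes(rng.getrandbits(8) for _ in range(4000))
files = [b"photo-%d:" % i + shared for i in range(3)]

# Approach 1: compress each data set individually. Each result can be
# transferred on its own, but correlation between sets is not exploited.
individual = [zlib.compress(f) for f in files]
individual_total = sum(len(c) for c in individual)

# Approach 2: compress the whole batch as one stream. Patterns recurring
# across data sets are exploited, but the result is a single unit.
batch = zlib.compress(b"".join(files))
batch_total = len(batch)
```

In this sketch, `batch_total` is considerably smaller than `individual_total`, but no single data set can be recovered from the batch stream without decompressing from its beginning.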
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.