Various known documents providing a technological background for the present disclosure are listed in the APPENDIX associated with the present disclosure.
Generally, algorithms used to compress data are based either on a lossless compression method [3] or on a lossy compression method [4]. In lossless compression, various files, namely data (D1), are compressed in such a manner that the data (D1) can later be recovered as it was originally.
Conventionally, it is known to employ data de-duplication methods [5] when encoding the data (D1). Such methods attempt to eliminate duplicate copies of segments of data in the data (D1), namely those data segments which recur unchanged when the data (D1) is, for example, temporally streamed. Known data de-duplication methods are able to find efficiently those data blocks that are identical to a desired data block.
Generally, known data de-duplication methods recognize previously occurring data segments by using various different methods, such as:
(i) by detecting a number of changed data elements in a given data segment relative to a reference data segment;
(ii) by computing a sum of absolute differences between data elements of the given data segment and the reference data segment;
(iii) by utilizing redundancy check tables; or
(iv) by employing sliding-block methods.
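By way of illustration only, approaches (i) and (ii) above can be sketched as follows. The sketch assumes that data segments are equal-length byte strings and that each byte is one data element; the function names count_changed_elements, sum_of_absolute_differences and is_duplicate are hypothetical and are not taken from any of the cited documents.

```python
def count_changed_elements(segment: bytes, reference: bytes) -> int:
    """Approach (i): number of positions at which the two segments differ."""
    if len(segment) != len(reference):
        raise ValueError("segments must have equal length")
    return sum(1 for a, b in zip(segment, reference) if a != b)


def sum_of_absolute_differences(segment: bytes, reference: bytes) -> int:
    """Approach (ii): sum of |a - b| over corresponding data elements."""
    if len(segment) != len(reference):
        raise ValueError("segments must have equal length")
    return sum(abs(a - b) for a, b in zip(segment, reference))


def is_duplicate(segment: bytes, reference: bytes) -> bool:
    """A segment is an exact duplicate when no data element has changed."""
    return count_changed_elements(segment, reference) == 0
```

In this sketch, both measures are zero exactly when the given data segment is identical to the reference data segment, which is the condition that the known methods rely upon.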
Entire files can also be de-duplicated, in which case a symbol used to replace duplicate files produces an excellent compression ratio [5].
Moreover, data de-duplication can be executed in a post-processed manner, in which case associated data processing is performed retroactively after the data (D1) has been written. Alternatively, data de-duplication can be performed in real-time, namely just as the data (D1) enters a given system, in which case a given recognized data block is not written at all, but instead, a reference is made to an earlier data block which is mutually similar to the given recognized data block.
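A minimal sketch of the real-time case is given below, under assumptions that are not stated in the cited documents: the data (D1) is split into fixed-size blocks, and a SHA-256 digest is used as a lookup key for previously written blocks, two blocks with the same digest being treated as identical. The names deduplicate_inline, reconstruct and block_size are hypothetical.

```python
import hashlib


def deduplicate_inline(data: bytes, block_size: int = 4):
    """Write each distinct block once; repeated blocks become references."""
    store = []    # unique blocks that are actually written
    index = {}    # digest -> position of the block in the store
    layout = []   # one reference into the store per incoming block
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        position = index.get(digest)
        if position is None:
            # First occurrence: the block is written and indexed.
            position = len(store)
            store.append(block)
            index[digest] = position
        # A recognized block is not written again; only a reference is kept.
        layout.append(position)
    return store, layout


def reconstruct(store, layout) -> bytes:
    """Recover the original data by following the references."""
    return b"".join(store[position] for position in layout)
```

For example, the input b"ABCDABCDEFGHABCD" with a block size of 4 yields two stored blocks and four references, and reconstruct returns the original input unchanged.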
Data de-duplication is used in various branches of the contemporary information technology industry, such as in data storage and in data transfer networks. For example, data de-duplication is used in cloud services, in system backup copying and in e-mail servers, wherein mutually similar files, or only slightly changed and substantially mutually similar files, are transferred continuously. Moreover, in Internet communication networks, where responses to requests are sent, data bytes are transmitted back and forth, and those bytes mostly contain partly or entirely the same Internet Protocol (IP) packet data; data de-duplication is therefore relevant to Wide Area Network (WAN) optimization, for example.
It is well known that known data de-duplication methods are more cost-efficient than traditional data compression methods. However, the known data de-duplication methods suffer from several disadvantages. Firstly, the known de-duplication methods often use considerable data memory and processing power as they attempt to achieve a desired data compression ratio. Generally, an associated search area, namely an amount of memory used to find similarities, needs to be increased to improve the data compression ratio. Moreover, CPU-intensive methods, such as a sliding search method, need to be used to improve the data compression ratio. The sliding search method seeks to identify a target data block or data packet in a raw fashion by shifting inside a search area in a direction determined by an algorithm employed for implementing the sliding search method.
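The processing cost of such a sliding search can be illustrated by the following minimal sketch. It assumes the simplest possible variant, in which the search shifts forward one byte at a time through the search area; practical implementations may choose the shift direction and step differently. The name sliding_search is hypothetical.

```python
from typing import Optional


def sliding_search(target: bytes, search_area: bytes) -> Optional[int]:
    """Return the first offset at which target occurs in search_area, or None."""
    length = len(target)
    # Try every candidate offset in turn; the number of comparisons grows
    # with the size of the search area, which is why the method is CPU-intensive.
    for offset in range(len(search_area) - length + 1):
        if search_area[offset:offset + length] == target:
            return offset
    return None
```

In this sketch, a larger search area increases the chance of finding a match, and hence the achievable data compression ratio, but it also increases both the memory that must be held and the number of comparisons performed.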
Secondly, the known data de-duplication methods are not able to find those data blocks or data packets whose content has changed slightly, but which still contain many unchanged data elements relative to the desired data block.
Thirdly, the known data de-duplication methods potentially result in data fragmentation, especially when the processing associated with these de-duplication methods is executed in real time.