Data compression is widely applied in data storage and transmission. Because transmitted data often contains a large amount of redundancy, a network device at the transmitting end compresses the data before transmitting it, which may effectively reduce the volume of data carried over the network and reduce transmission delay. Correspondingly, a network device at the receiving end needs to decompress the received data.
At present, compression technologies used for data transmission may be categorized into two types. One is compression based on LZ (Lempel-Ziv) algorithms, and the other is referred to as data deduplication. With LZ compression, the transmitting end generally performs matching inside a data block by using a sliding window, so as to generate a compression dictionary and perform compression, and the receiving end generates a corresponding dictionary and performs decompression. Data deduplication relies on the fact that large blocks of repeated data exist during data transmission: the network device stores a large data block transmitted through the device and uses it as a dictionary entry. During subsequent data transmission, each time a repeated data block is detected, the repeated data block is replaced with the short code index of the matching dictionary entry. The receiving end restores the original data according to the received code index and its stored dictionary entries.
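The deduplication exchange described above can be illustrated with a minimal Python sketch. It is not taken from any particular device: the function names `dedup_encode` and `dedup_decode`, the `("raw", ...)` / `("ref", ...)` token format, and the use of SHA-256 to detect repeated blocks are all assumptions made for illustration.

```python
import hashlib

def dedup_encode(blocks, dictionary):
    """Transmitting end: replace repeated blocks with a short code index.

    `dictionary` maps a block digest to the index assigned when the block
    was first seen (hypothetical layout, chosen for this sketch).
    """
    tokens = []
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key in dictionary:
            # Repeated block: send only its index in the dictionary.
            tokens.append(("ref", dictionary[key]))
        else:
            # New block: store it as a dictionary entry and send it in full.
            dictionary[key] = len(dictionary)
            tokens.append(("raw", block))
    return tokens

def dedup_decode(tokens, entries):
    """Receiving end: rebuild the blocks from raw data and code indexes.

    `entries` is the receiver's own dictionary, a list indexed by code index.
    """
    blocks = []
    for kind, value in tokens:
        if kind == "raw":
            entries.append(value)          # same index as on the sender side
            blocks.append(value)
        else:
            blocks.append(entries[value])  # look up the stored entry
    return blocks

blocks = [b"A" * 1000, b"B" * 1000, b"A" * 1000]
tokens = dedup_encode(blocks, {})
restored = dedup_decode(tokens, [])
```

In this sketch both ends assign indexes in order of first occurrence, so the two dictionaries stay synchronized without being exchanged. The third block is sent as a short reference rather than as 1000 bytes of data.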
If the data transmitted over the network is treated as a bit stream, the network device needs to properly segment the data stream that passes through it and take the resulting data segments as dictionary entries for data compression. The length of a data segment affects both the utilization efficiency of the dictionary and the compression ratio: a length that is too large reduces the utilization efficiency of the dictionary, and a length that is too small reduces the compression ratio.
If a segmentation method with a fixed number of bytes is used, then when the data of a data segment changes, for example because bytes are inserted or deleted, the boundaries of all following data segments shift, so that the dictionary entries created from the subsequent data segments can no longer be used effectively. To solve this problem caused by fixed-size segmentation, the prior art may use a content fingerprint and slide a window of size W through the data stream to be processed. The window may slide one byte at a time or two bytes at a time. At each window position, the content fingerprint of the data block in the window is calculated. When the content fingerprint satisfies a preset rule, the boundary of the window along the sliding direction is taken as a segmentation point; otherwise, the window continues to slide and the content fingerprint is recalculated until a segmentation point is determined.
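The fingerprint-based segmentation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `segmentation_points` and `split`, the use of CRC-32 as a stand-in for the content fingerprint, and the preset rule "the low bits of the fingerprint selected by `mask` are all zero" are chosen for this sketch and are not specified by the text. A practical implementation would typically update the fingerprint incrementally instead of recomputing it for every window position.

```python
import random
import zlib

def segmentation_points(data, window=16, mask=0x3F, step=1):
    """Slide a window of `window` bytes through `data`, `step` bytes at a time.

    A window's leading boundary (its end offset) becomes a segmentation
    point when the fingerprint of the window content satisfies the rule
    `fingerprint & mask == 0` (an assumed rule, for illustration only).
    """
    points = []
    for end in range(window, len(data) + 1, step):
        fingerprint = zlib.crc32(data[end - window:end])  # stand-in fingerprint
        if fingerprint & mask == 0:
            points.append(end)
    return points

def split(data, window=16, mask=0x3F, step=1):
    """Cut `data` into data segments at the segmentation points."""
    segments, start = [], 0
    for point in segmentation_points(data, window, mask, step):
        segments.append(data[start:point])
        start = point
    if start < len(data):
        segments.append(data[start:])  # trailing data after the last point
    return segments

# Deterministic pseudo-random test data.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))

original_points = segmentation_points(data)
shifted_points = segmentation_points(b"XYZ" + data)  # 3 bytes inserted in front
```

Because each decision depends only on the bytes inside the window, inserting three bytes at the front moves every original segmentation point by exactly three offsets instead of invalidating all later boundaries, which is the advantage over fixed-size segmentation.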
During implementation of the present invention, the inventor finds at least the following problem in the prior art. With the above segmentation method, a data segment may become too long, which lowers the matching probability and reduces the utilization efficiency of the dictionary.