The present disclosure, in some embodiments thereof, relates to segmenting an input data stream and, more specifically, but not exclusively, to segmenting an input data stream using vector processing.
Today, the volume of electronic data that needs to be stored or transferred between locations is rapidly increasing. Enormous quantities of data may present major cost and complexity challenges with respect to the storage space needed to hold the data or the network bandwidth needed to transfer it.
One solution commonly used for reducing the amount of data for storage or transfer is data deduplication (often called “intelligent compression” or “single-instance storage”), which is a method of reducing the data volume by eliminating redundant data. While there are methods for file deduplication, block deduplication may present better results with respect to data compression. In block deduplication, only one unique instance of a data segment (block) of a data stream is actually retained, while redundant data segments that are identical to the already retained data segment are replaced with a pointer to the retained copy of the data segment. Block deduplication processes a data stream that may be of one of multiple data types, for example, data files, media files, stream data and the like, to identify unique instances of one or more data segments (blocks). A unique number (hash value) is generated for each segment using a hash algorithm. A cryptographic-strength hash algorithm, for example, MD5 or SHA-1, is usually used for this purpose. The hash value generated for each segment is compared to existing hash values generated for previous segments, and in case the hash value equals an existing hash value, the segment is not retained but rather replaced with a pointer to the copy of the existing segment. Furthermore, in case the segment is updated, only the changed data may be retained, while the remaining unchanged data, which may include a significant portion of the segment, is not retained.
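By way of a non-limiting illustration, the block deduplication flow described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the segments are assumed to be fixed-size blocks purely for simplicity, SHA-1 is used because it is one of the hash algorithms named above, and the names `deduplicate`, `reconstruct`, `store` and `recipe` are hypothetical names introduced only for this example.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and retain each unique block once.

    Assumption: fixed-size blocks are used only to keep the sketch short;
    the segmentation method itself is outside the scope of this example.
    """
    store = {}   # hash value -> the single retained copy of the block
    recipe = []  # ordered hash values acting as pointers to retained blocks
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha1(block).hexdigest()
        # Retain the block only if its hash value has not been seen before.
        if digest not in store:
            store[digest] = block
        # In every case, record a pointer to the retained copy.
        recipe.append(digest)
    return store, recipe


def reconstruct(store, recipe) -> bytes:
    """Rebuild the original data by following the pointers in order."""
    return b"".join(store[digest] for digest in recipe)
```

In this sketch, a stream that contains the same block several times stores that block once in `store`, while `recipe` holds one pointer per occurrence, so the original stream can still be rebuilt.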
One of the main challenges is effectively segmenting the data stream such that the segments are affected as little as possible by changes to the segments' data contents. Rolling hash techniques, as known in the industry, may be used for segmenting the data stream. Using a rolling hash, a hash value is calculated for shifting sequences of data in the data stream (in each rolling sequence, the data item at one end of the sequence is omitted and a new data item is inserted at the other end). The calculated hash value is checked for compliance with one or more pre-defined segmentation criteria, and in case compliance is identified, the end of the respective rolling sequence is designated as a segment boundary or cut point.
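By way of a non-limiting illustration, rolling hash segmentation of the kind described above may be sketched as follows. This is a minimal sketch under stated assumptions: a simple polynomial rolling hash is assumed, the segmentation criterion is assumed to be that the low bits of the hash value selected by `mask` are all zero, and the function name `find_cut_points` together with the parameters `window`, `mask`, `base` and `mod` are hypothetical choices made only for this example. Practical implementations typically also enforce minimum and maximum segment sizes, which are omitted here.

```python
def find_cut_points(data: bytes, window: int = 16, mask: int = 0x3F,
                    base: int = 257, mod: int = (1 << 31) - 1):
    """Return the end offsets of rolling sequences that satisfy the criterion.

    Assumed criterion (illustrative only): hash_value & mask == 0.
    """
    if len(data) < window:
        return []
    # Weight of the oldest data item inside the rolling sequence.
    top = pow(base, window - 1, mod)
    # Hash value of the first rolling sequence, computed directly.
    h = 0
    for item in data[:window]:
        h = (h * base + item) % mod
    cuts = []
    if h & mask == 0:
        cuts.append(window)
    for i in range(window, len(data)):
        # Omit the oldest data item from the rolling sequence...
        h = (h - data[i - window] * top) % mod
        # ...and insert the new data item.
        h = (h * base + data[i]) % mod
        # Designate the end of the sequence as a cut point on compliance.
        if h & mask == 0:
            cuts.append(i + 1)
    return cuts
```

Because each hash value in this sketch depends only on the data items inside its rolling sequence, inserting data at the start of the stream shifts the later cut points by the length of the inserted data rather than changing which data items end a segment.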