The present disclosure, in some embodiments thereof, relates to segmenting an input data stream and, more specifically, but not exclusively, to segmenting an input data stream by splitting the input data stream to a plurality of data sub-streams segmented in parallel using vector processing.
In present times being an information era, the volumes of data which need to be stored and/or transferred between locations rapidly increase. The enormous quantities of data may present major cost and/or complexity challenges with respect to storage space for storing the data and/or network bandwidth for transferring it.
One solution commonly used for reducing the amount of data for storage and/or transfer is data deduplication (often called “intelligent compression” or “single-instance storage”) which is a method of reducing the data volume by eliminating redundant data. While there are methods for file deduplication, block deduplication may present better results with respect to data compression. In block deduplication only one unique instance of a data segment (block) of a data stream is actually retained while redundant data segment(s) which are identical to the already retained data segment are replaced with a pointer to a copy of the retained data segment. Block deduplication processes a data stream that may include multiple data types, for example, data files, media files, stream data and the like to identify unique instances of one or more data segments (blocks). A unique number (hash value) is generated for each segment using a hash algorithm, for example, a Rabin-Karp rolling hash and/or a Buzhash. The hash value generated for each segment is compared to existing hash values generated for previous segments and in case the hash value equals to an existing hash value, the segment is not retained but rather replaced with a pointer to the copy of the existing segment. Furthermore, in case the segment is updated, only the changed data may be retained while the remaining unchanged data which may include a significant amount of the segment is not retained.
One of the main challenges is effectively segmenting the data stream such that the segments are affected as little as possible by changes to the segments' data contents. Rolling hash techniques may be used for segmenting the data stream as known in the industry. Using a rolling hash, a hash value is calculated for shifting sequences of data in the data stream (in each rolling sequence an ending data item is omitted and a new data item is inserted). The calculated hash value is checked for compliance with pre-defined one or more segmentation criteria and in case the compliance is identified, the start of the respective rolling sequence is designated as a segment boundary or cut point.