The present invention relates to the area of segmenting data for the purposes of data communications, storage, search, and compression. Many applications use data segmentation to process data. Data segmentation breaks a large or continuous stream of data into multiple smaller data segments. The application then processes each segment to perform the desired function.
For example, packet-based data networks communicate data in discrete packets. Typically, there is a maximum limit on the size of each packet. A network communications application can use data segmentation to break a large amount of data or a continuous stream of data into packet-sized segments. In a further example, a network protocol acceleration application can segment a large amount of data or a continuous stream of data into segments to exploit the similarity between different segments. The network protocol acceleration application then uses data suppression and/or compression techniques to minimize the amount of data sent over a network connection and/or to minimize actual or apparent network latency.
There are many prior data segmentation techniques. These techniques typically segment data according to a shift-invariant, deterministic predicate function applied to a fixed-size window of the input at each offset of the buffer. A sliding window of fixed size is moved over data in a buffer. A predicate function, or set of rules, is applied to the data within the window at each window position. The predicate function can be constructed to output a true or false value based on its input data. If the predicate function evaluates to false for a given window position, the sliding window is moved to the next window position. If the predicate function evaluates to true for the data within the window at a given window position, a segmentation boundary is selected based on the current window position. By evaluating the predicate function at all possible window positions in the buffer, a set of segmentation boundaries for the buffer data is created. The buffer data between adjacent segmentation boundaries forms a segment.
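The sliding-window procedure described above can be sketched as follows. This is a minimal illustration only, not the method of any particular prior system: the names `toy_predicate`, `find_boundaries`, and `split_at`, the window size of 3 bytes, the byte-sum rule, and the choice to place each boundary at the end of the window are all assumptions made for the example.

```python
from typing import Callable, List


def toy_predicate(window: bytes) -> bool:
    # Hypothetical rule: true when the byte sum of the window is divisible by 8.
    # It depends only on the window contents, not on the window position.
    return sum(window) % 8 == 0


def find_boundaries(buf: bytes, window_size: int,
                    predicate: Callable[[bytes], bool]) -> List[int]:
    """Slide a fixed-size window over buf and collect boundary offsets."""
    boundaries = []
    for start in range(len(buf) - window_size + 1):
        end = start + window_size
        if predicate(buf[start:end]):
            # Assumption: the boundary is placed at the end of the window.
            boundaries.append(end)
    return boundaries


def split_at(buf: bytes, boundaries: List[int]) -> List[bytes]:
    """Cut buf into segments lying between adjacent boundaries."""
    segments = []
    prev = 0
    for b in boundaries:
        segments.append(buf[prev:b])
        prev = b
    if prev < len(buf):
        segments.append(buf[prev:])  # trailing data after the last boundary
    return segments


data = bytes(range(64))
cuts = find_boundaries(data, 3, toy_predicate)
segments = split_at(data, cuts)
print(cuts)           # [10, 18, 26, 34, 42, 50, 58]
print(len(segments))  # 8
```

Concatenating the resulting segments reproduces the original buffer, since the boundaries only mark cut points and no data is dropped.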
Prior segmentation techniques utilize a deterministic, shift-invariant predicate function. This type of predicate function outputs the same value for a given set of data, regardless of the position of the window within the buffer. For example, a shift-invariant predicate function will output the same value for a given set of data regardless of whether this data is located at the beginning of the buffer, in the middle of the buffer, or at the end of the buffer.
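Shift invariance can be illustrated with a short sketch. The predicate, the 3-byte window, and the names `predicate`, `boundaries`, `payload`, and `prefix` are hypothetical and chosen only for this example. The same payload is scanned once on its own and once after a prefix has been placed in front of it, and every boundary found in the first scan reappears in the second, displaced by the length of the prefix.

```python
from typing import List

WINDOW_SIZE = 3  # hypothetical window size for this sketch


def predicate(window: bytes) -> bool:
    # Hypothetical shift-invariant rule: the result depends only on the bytes
    # in the window, never on where the window sits in the buffer.
    return sum(window) % 8 == 0


def boundaries(buf: bytes) -> List[int]:
    # Assumption: each boundary is placed at the end of a matching window.
    return [start + WINDOW_SIZE
            for start in range(len(buf) - WINDOW_SIZE + 1)
            if predicate(buf[start:start + WINDOW_SIZE])]


payload = bytes(range(40))
prefix = b"\x01\x02\x03"

alone = boundaries(payload)              # payload at the start of the buffer
embedded = boundaries(prefix + payload)  # same payload, shifted by the prefix

# Every boundary found in the payload alone reappears at the shifted offset.
print(all(b + len(prefix) in embedded for b in alone))  # True
```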
One advantage of a shift-invariant predicate function is that the same data is segmented in the same fashion regardless of how or where it is encountered, e.g., whether it is in a file in a file system, a packet in a network, a row in a database, a transport buffer in a TCP connection, and so forth.
However, shift-invariant predicate functions also have disadvantages. One disadvantage is that certain data inputs will not generate any segment boundaries. This can occur when the predicate function evaluates to false for a particular byte pattern (meaning that no segment boundary is chosen) and that byte pattern repeats continuously in the input buffer.
To overcome this problem, prior segmentation systems impose an upper bound on a segment length. If the distance between the last segment boundary detected and the current position of the sliding window of the predicate function exceeds the upper bound on segment length, a segment boundary is created regardless of the output of the predicate function. Thus, the maximum segment length in these prior segmentation systems is the value of this upper bound.
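The failure case and the upper-bound workaround can be sketched together. The function `segment_with_cap`, the predicate `never_on_zeros`, and the specific sizes are assumptions made for illustration. The sketch also assumes that the boundary is placed at the end of the window, that the cap is at least as large as the window, and that a boundary is forced as soon as the distance reaches the cap, so that no segment is longer than the cap.

```python
from typing import Callable, List, Optional


def never_on_zeros(window: bytes) -> bool:
    # Hypothetical predicate that is false for any all-zero window.
    return sum(window) % 8 == 7


def segment_with_cap(buf: bytes, window_size: int,
                     predicate: Callable[[bytes], bool],
                     max_len: Optional[int] = None) -> List[int]:
    """Return boundary offsets; force a boundary once max_len is reached."""
    boundaries = []
    last = 0  # offset of the most recent boundary
    for start in range(len(buf) - window_size + 1):
        pos = start + window_size  # candidate boundary at end of window
        forced = max_len is not None and pos - last >= max_len
        if predicate(buf[start:pos]) or forced:
            boundaries.append(pos)
            last = pos
    return boundaries


zeros = bytes(100)  # a repeating pattern the predicate never accepts

print(segment_with_cap(zeros, 4, never_on_zeros))              # []
print(segment_with_cap(zeros, 4, never_on_zeros, max_len=32))  # [32, 64, 96]
```

Without the cap, the entire 100-byte buffer becomes a single segment. With the cap, boundaries are forced at fixed intervals, and those forced boundaries depend on position rather than on content.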
A further problem with shift-invariant segmentation processes is that they tend to produce segments whose sizes are distributed in a skewed fashion. That is, the segment sizes tend to vary significantly rather than being clustered near a common value. This, in turn, can create inefficiencies for implementations that utilize segments, because such a system must accommodate a wide range of sizes rather than being tuned or optimized for a narrow range of sizes.
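One way to inspect the spread of segment sizes is to segment a buffer of pseudo-random bytes and summarize the resulting lengths. The sketch below is illustrative only: the predicate, the 4-byte window, the buffer length, and the random seed are all assumptions, and the figures it prints describe this toy setup rather than any real system.

```python
import random
import statistics
from typing import List

WINDOW = 4  # hypothetical window size for this sketch


def predicate(window: bytes) -> bool:
    # Hypothetical content-only rule that fires at a small fraction of positions.
    return sum(window) % 16 == 0


def segment_sizes(buf: bytes) -> List[int]:
    """Return the length of each segment produced by the sliding window."""
    sizes = []
    last = 0
    for start in range(len(buf) - WINDOW + 1):
        pos = start + WINDOW  # assumption: boundary at end of window
        if predicate(buf[start:pos]):
            sizes.append(pos - last)
            last = pos
    if last < len(buf):
        sizes.append(len(buf) - last)  # trailing segment
    return sizes


rng = random.Random(0)  # fixed seed so repeated runs give the same figures
data = bytes(rng.randrange(256) for _ in range(20000))
sizes = segment_sizes(data)

print("segments:", len(sizes))
print("min:", min(sizes), "median:", statistics.median(sizes),
      "mean:", round(statistics.mean(sizes), 1), "max:", max(sizes))
```

Comparing the minimum, median, mean, and maximum gives a quick indication of how widely the sizes vary around a typical value.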
It is therefore desirable for a system and method of data segmentation to overcome the disadvantages of prior data segmentation schemes and to provide improved identification of redundant data for typical data inputs and improved distribution of segment sizes for more efficient communications and storage. It is also desirable for the system and method of data segmentation to be adaptable to a variety of different data communications, compression, and storage applications.