A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below, inclusive of the drawing figures where applicable: Copyright © 2000, Undoo Technologies.
The present invention relates, in general, to the field of systems and methods for the unorchestrated determination of data sequences using “sticky byte” factoring to determine breakpoints in digital sequences. More particularly, the present invention relates to an efficient and effective method of dividing a data set into pieces that generally yields near optimal commonality.
Modern computer systems hold vast quantities of data, on the order of a billion billion bytes in aggregate. Incredibly, this volume tends to quadruple each year and even the most impressive advances in computer mass storage architectures cannot keep pace.
The data maintained in most computer mass storage systems has been observed to have the following interesting characteristics: 1) it is almost never random and is, in fact, highly redundant; 2) the number of unique sequences in this data sums to a very small fraction of the storage space it actually occupies; 3) a considerable amount of effort is required in attempting to manage this volume of data, with much of that being involved in the identification and removal of redundancies (e.g. deleting duplicate files and old versions of files, purging logs, archiving, etc.); and 4) large amounts of capital resources are dedicated to making unnecessary copies, saving those copies to local media and the like.
A system that factored redundant copies would reduce the number of storage volumes otherwise needed by orders of magnitude. However, a system that factors large volumes of data into their common sequences must employ a method by which to determine those sequences. Conventional methods that attempt to compare one data sequence to another typically suffer from extreme computational complexity and these methods can, therefore, only be employed to factor relatively small data sets. Factoring larger data sets is generally only done using simplistic methods such as using arbitrary fixed sizes. These methods factor poorly under many circumstances and the efficient factoring of large data sets has long been a persistent and heretofore intractable problem in the field of computer science.
Disclosed herein is a system and method for unorchestrated determination of data sequences using “sticky byte” factoring to determine breakpoints in digital sequences such that common sequences can be identified. Sticky byte factoring provides an efficient method of dividing a data set into pieces that generally yields near optimal commonality. As disclosed herein, this may be effectuated by employing a hash function with periodic reset of the hash value or, in a preferred embodiment, a rolling hashsum. Further, in the particular exemplary embodiment disclosed herein, a threshold function is utilized to deterministically set divisions in a digital or numeric sequence, such as a sequence of data. Both the rolling hash and the threshold function are designed to require minimal computation. This low overhead makes it possible to rapidly partition a data sequence for presentation to a factoring engine or other applications that prefer subsequent synchronization across the entire data set.
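By way of illustration only, the combination of a rolling hash and a threshold function might be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed embodiment: the polynomial hash, the constants BASE, MOD, WINDOW and MASK, and the function name sticky_breakpoints are all hypothetical choices made for the example.

```python
# Illustrative sketch only: a polynomial rolling hash with a bit-mask threshold.
# All constants and names below are assumptions, not taken from the disclosure.
BASE = 257              # polynomial base (assumed)
MOD = (1 << 61) - 1     # large prime modulus (assumed)
WINDOW = 16             # number of trailing bytes covered by the hash (assumed)
MASK = (1 << 6) - 1     # threshold: cut when the low 6 bits are zero (assumed)


def sticky_breakpoints(data: bytes) -> list:
    """Return the offsets at which a piece ends (exclusive end indexes)."""
    top = pow(BASE, WINDOW, MOD)   # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, b in enumerate(data):
        # Roll the new byte into the hash (+1 so zero bytes still contribute).
        h = (h * BASE + b + 1) % MOD
        if i >= WINDOW:
            # Roll the byte that just left the window out of the hash.
            h = (h - (data[i - WINDOW] + 1) * top) % MOD
        # Threshold function: a deterministic test on the hash of the
        # last WINDOW bytes decides whether a division is set here.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            cuts.append(i + 1)
    return cuts


if __name__ == "__main__":
    sample = bytes((i * 7919 + (i >> 3)) % 256 for i in range(4096))
    print(sticky_breakpoints(sample)[:5])
```

Each step costs a few integer operations per byte. Because the decision at any position depends only on the preceding WINDOW bytes, inserting bytes earlier in the sequence shifts the later breakpoints by the insertion length but does not otherwise change them. If the hash bits are roughly uniform, a 6-bit mask would give pieces averaging on the order of 64 bytes; a practical system would presumably choose a much wider mask.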
Among the significant advantages of the system and method disclosed herein is that, unlike conventional factoring systems, its calculation requires neither communication nor comparisons to perform well. This is particularly true in a distributed environment where, while conventional systems require communication to compare one sequence to another, the system and method of the present invention can be performed in isolation using only the sequence then being considered.
In operation, the system and method of the present invention provides a fully automated means for dividing a sequence of numbers (e.g. bytes in a file) such that common elements may be found on multiple related and unrelated computer systems without the need for communication between the computers and without regard to the data content of the files. Broadly, what is disclosed herein is a system and method for a data processing system which includes a fully automated means to partition a sequence of numeric elements (i.e. a sequence of bytes) so that common sequences may be found without the need for searching, comparing, communicating or coordinating with other processing elements in the operation of finding those sequences. The system and method of the present invention produces “sticky byte” points that partition numeric sequences with a distribution that produces subsequences of the type and size desired to optimize commonality between partitions.
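As a hedged illustration of partitioning in isolation, the sketch below has two hypothetical systems each divide their own file using only local content, after which the resulting pieces are found to coincide wherever the files share data. Everything here is an assumption made for the example: the window hash is recomputed with zlib.crc32 rather than rolled (for brevity, not efficiency), SHA-1 digests stand in for whatever piece identifier a factoring engine might use, and the names breakpoints, chunk_digests, WINDOW and MASK are invented.

```python
# Illustrative sketch only; names, constants and hash choices are assumptions.
import hashlib
import random
import zlib

WINDOW = 16    # assumed number of trailing bytes examined at each position
MASK = 0x3F    # assumed threshold: cut when the low 6 bits of the hash are zero


def breakpoints(data: bytes) -> list:
    """Offsets at which a piece ends, decided from local content only."""
    return [i for i in range(WINDOW, len(data) + 1)
            if (zlib.crc32(data[i - WINDOW:i]) & MASK) == 0]


def chunk_digests(data: bytes) -> set:
    """Partition data at its breakpoints; return the SHA-1 digest of each piece."""
    edges = [0] + breakpoints(data)
    if edges[-1] != len(data):
        edges.append(len(data))   # trailing remainder forms the final piece
    return {hashlib.sha1(data[a:b]).hexdigest() for a, b in zip(edges, edges[1:])}


rng = random.Random(0)


def noise(n: int) -> bytes:
    """Pseudo-random filler standing in for unrelated file content."""
    return bytes(rng.getrandbits(8) for _ in range(n))


shared = noise(4096)                          # content both systems happen to hold
file_a = noise(300) + shared + noise(200)     # file on hypothetical system A
file_b = noise(77) + shared                   # file on hypothetical system B

# Each system partitions its own file without consulting the other.
common = chunk_digests(file_a) & chunk_digests(file_b)
print(len(common), "pieces in common")
```

The set intersection at the end is only a way of inspecting the result. The partitioning itself involves no searching, comparison or communication, yet the pieces lying wholly within the shared content come out identical on both systems, even though that content sits at different offsets in the two files.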