The present invention relates, in general, to the field of systems and methods for the unorchestrated determination of data sequences using “sticky byte” factoring to determine breakpoints in digital sequences. More particularly, the present invention relates to an efficient and effective method of dividing a data set into pieces that generally yields near-optimal commonality.
Modern computer systems hold vast quantities of data, on the order of a billion billion (10^18) bytes in aggregate. Incredibly, this volume tends to quadruple each year, and even the most impressive advances in computer mass storage architectures cannot keep pace.
The data maintained in most computer mass storage systems has been observed to have the following interesting characteristics: 1) it is almost never random and is, in fact, highly redundant; 2) the number of unique sequences in this data sums to a very small fraction of the storage space it actually occupies; 3) a considerable amount of effort is required to manage this volume of data, with much of that effort going to the identification and removal of redundancies (e.g., removing duplicate files and old versions of files, purging logs, archiving, etc.); and 4) large amounts of capital resources are dedicated to making unnecessary copies, saving those copies to local media and the like.
A system that factored redundant copies would reduce the number of storage volumes otherwise needed by orders of magnitude. However, a system that factors large volumes of data into their common sequences must employ a method by which to determine those sequences. Conventional methods that attempt to compare one data sequence to another typically suffer from extreme computational complexity, and these methods can, therefore, only be employed to factor relatively small data sets. Factoring larger data sets is generally done only with simplistic methods, such as dividing the data into pieces of an arbitrary fixed size. These methods factor poorly under many circumstances, and the efficient factoring of large data sets has been a persistent and heretofore intractable problem in the field of computer science.
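By way of illustration only, the following Python sketch shows one circumstance in which fixed-size division factors poorly. It is not part of the disclosed method. The function names `fixed_size_chunks` and `shared_chunk_count`, the 64-byte piece size, and the synthetic test data are all assumptions chosen for this example. The sketch divides two data sets into fixed-size pieces and counts how many pieces they have in common, first when a single byte is appended to the data and then when a single byte is inserted at the front.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Divide data into consecutive pieces of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_chunk_count(a: bytes, b: bytes, size: int) -> int:
    """Count the distinct fixed-size pieces that appear in both a and b."""
    digests_a = {hashlib.sha256(c).digest() for c in fixed_size_chunks(a, size)}
    digests_b = {hashlib.sha256(c).digest() for c in fixed_size_chunks(b, size)}
    return len(digests_a & digests_b)


# Synthetic, non-repeating 4096-byte data set (hypothetical test input).
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(128))

appended = original + b"!"   # one byte added at the end
prepended = b"!" + original  # one byte inserted at the front

CHUNK_SIZE = 64  # arbitrary fixed piece size, assumed for illustration

print("pieces in original:", len(fixed_size_chunks(original, CHUNK_SIZE)))
print("shared after append:", shared_chunk_count(original, appended, CHUNK_SIZE))
print("shared after prepend:", shared_chunk_count(original, prepended, CHUNK_SIZE))
```

Appending a byte leaves every original piece intact, so all of the original pieces are still found in common. Inserting a single byte at the front shifts every subsequent piece boundary by one position, so for this input none of the pieces match, even though the two data sets differ by only one byte.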