In the fields of information management and governance, it is frequently desirable to be able to identify a subset of documents that are considered “similar” to one another. Although readily understood in a colloquial or qualitative sense, “similarity” is a poorly defined concept in quantitative terms.
A number of challenges present themselves when trying to determine the similarity of a plurality of digital data files. These challenges include, but are not limited to, file format impedance, performance considerations, sample window alignment and comprehensibility.
With respect to file format impedance, it is well known that different software vendors write files in different ways, such that substantially similar content can be produced from completely dissimilar (and incomparable) representations on disk. This can significantly complicate the task of detecting the similarity of two (or more) digital data files.
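By way of a non-limiting illustrative sketch (the sentence, wrapper markup and compression scheme below are hypothetical, chosen only to mimic how many formats wrap and compress their content), the same text can yield incomparable byte streams on disk while a format-aware extraction step recovers identical content:

```python
# Illustrative sketch: identical content, dissimilar on-disk representations.
import re
import zlib

text = "The quick brown fox jumps over the lazy dog."

# Representation A: plain UTF-8 text.
file_a = text.encode("utf-8")

# Representation B: the same text in a simple XML wrapper, zlib-compressed
# (many office formats similarly compress their internal markup).
file_b = zlib.compress(("<doc><p>%s</p></doc>" % text).encode("utf-8"))

# The raw byte streams share essentially nothing and are incomparable...
assert file_a != file_b

# ...yet stripping the format-specific wrapping recovers identical content.
extracted_b = re.sub(r"<[^>]+>", "", zlib.decompress(file_b).decode("utf-8"))
assert extracted_b == text
```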
With respect to performance considerations, it will be readily appreciated that detecting similarity is generally only useful when applied to large volumes of content. Simple implementations of similarity detection compare every file against every other file, so their cost grows sharply, on the order of the square of the collection size, as more (and larger) documents are added to the plurality of digital data files being considered.
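The quadratic growth of a naive all-pairs comparison may be sketched as follows (the collection sizes are arbitrary examples):

```python
# Illustrative sketch: a naive similarity pass compares every file against
# every other file, so the number of comparisons among n files is the
# number of unordered pairs, n * (n - 1) / 2.
def pairwise_comparisons(n):
    """Number of unordered pairs among n files."""
    return n * (n - 1) // 2

for n in (100, 1_000, 10_000):
    print(n, pairwise_comparisons(n))
# 100x more files -> roughly 10,000x more comparisons.
```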
With respect to sample window alignment, it will be readily appreciated that known methods of block-based sampling have an inherent problem when determining digital data file similarity: an insertion or deletion in the middle of a file shifts which bytes or characters fall into every subsequent sampled block.
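The alignment problem may be sketched as follows (the file contents and block size below are hypothetical): a single inserted byte leaves every fixed-size block after the insertion point with no match in the original file.

```python
# Illustrative sketch: fixed-size block sampling breaks when a single byte
# is inserted mid-file, because every block after the insertion shifts.
def fixed_blocks(data, size=8):
    """Split a byte string into consecutive fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"abcdefghijklmnopqrstuvwxyz012345"
edited   = b"abcdefgh!ijklmnopqrstuvwxyz012345"  # one byte inserted mid-file

shared = set(fixed_blocks(original)) & set(fixed_blocks(edited))
print(shared)  # only the block before the insertion point survives
```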
Finally, with respect to the concept of comprehensibility, a given collection can have a resultant set of similarity relationships that ranges from no similarity at all to every file being identical, with no clear way to make the results presentable. In these instances, the only guaranteed outcome is an undirected graph containing some number of nodes joined by some number of relationships, with an arbitrary number of roots and no predictable structure.
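Such a structure may be sketched as follows (the file names and pairwise relationships below are hypothetical): the similarity relationships form an undirected graph, and grouping the files amounts to finding its connected components, whose number and size cannot be predicted in advance.

```python
# Illustrative sketch: pairwise similarity relationships form an undirected
# graph; presentable groups are its connected components.
from collections import defaultdict

# Hypothetical pairwise similarity relationships between files.
edges = [("a.doc", "b.pdf"), ("b.pdf", "c.txt"), ("d.txt", "e.txt")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

def components(adjacency):
    """Return the connected components of an undirected graph."""
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups

print(components(adjacency))
# Two groups: {a.doc, b.pdf, c.txt} and {d.txt, e.txt}
```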
A number of prior art solutions have been developed in an attempt to address at least some of the challenges discussed above when trying to identify a subset of similar digital data files from within a plurality of seemingly otherwise disparate digital data files.
For example, known block-deduplicating file systems, such as ZFS, can detect files with identical beginnings; however, such systems can deduplicate only the content preceding the first change, since the blocks that follow it no longer align between the files.
Further, known genetic sequence alignment algorithms, such as BLAST and FASTA, can detect subsequence segments in long stretches of DNA; however, these algorithms cannot perform many-to-many comparisons of segments simultaneously or efficiently.
Moreover, Fuzzy-Hash algorithms such as SSDEEP can find one-to-one similarities of raw byte streams, but cannot detect similarity across file formats, nor can they efficiently do many-to-many comparisons.
Furthermore, the aforementioned comprehensibility problem has been addressed in a variety of different ways. Two such examples include microarray heatmaps and subsample cluster analysis.
Microarray heatmaps often employ rectangular arrays of colour for many-to-many comparison of large arrays, but this approach has two fundamental weaknesses: the appearance depends on the sort order applied to each axis, and every cell in the array must be calculated, which has the potential to be very expensive in terms of processing resources.
Alternatively, subsample cluster analysis considers only a subset of the matched documents, simply rendering, for a selected document, a list of the relationships attached to that particular document.
Therefore, it would be desirable to detect subsets of digital data files that are all composed of substantially similar subsequences, regardless of where the differences between the files are actually located.
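One possible approach to such position-insensitive detection (offered only as an illustrative sketch, not as the method disclosed herein; the documents and shingle length below are hypothetical) is to decompose each file into overlapping subsequences ("shingles") and compare the resulting sets, so that the measure does not depend on where in the file an edit occurs:

```python
# Illustrative sketch: shingle sets compared by Jaccard similarity are
# insensitive to the position of an edit within the file.
def shingles(text, k=4):
    """Set of all overlapping k-character subsequences of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"  # edit in the middle
doc3 = "an entirely unrelated piece of content here"

print(round(jaccard(shingles(doc1), shingles(doc2)), 2))  # high: near-duplicates
print(round(jaccard(shingles(doc1), shingles(doc3)), 2))  # low: unrelated
```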
Accordingly, there is a need for methods and systems for the improved semantic meshing of digital data files.