As is known in the art, computer users create and store data files as documents in computer systems. As is also known, these same computer users, for a variety of reasons, are often interested in determining the similarity of two documents.
One approach, for example, is to record samples of each document and to declare documents to be similar if they have many samples in common. The samples could be sequences of a fixed number of any convenient units, such as English words. Such a method requires a set of samples whose size is proportional to the length of the documents.
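As a minimal sketch of the sampling approach just described, the following Python fragment takes every run of a fixed number of consecutive words as a sample and measures how many samples two documents share. The function names, the default width of three words, and the use of the shared fraction of distinct samples (intersection over union) as the measure of "many samples in common" are assumptions made for illustration only; the approach as described does not prescribe them.

```python
def word_samples(text, width=3):
    """Return the set of all runs of `width` consecutive words in `text`.

    Splitting on whitespace and lowercasing are illustrative assumptions.
    """
    words = text.lower().split()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def shared_fraction(text_a, text_b, width=3):
    """Fraction of distinct samples that the two documents have in common.

    Intersection over union is one possible measure, chosen here as an
    assumption; it is 0.0 when either document is shorter than `width` words.
    """
    samples_a = word_samples(text_a, width)
    samples_b = word_samples(text_b, width)
    if not samples_a or not samples_b:
        return 0.0
    return len(samples_a & samples_b) / len(samples_a | samples_b)
```

Because a document of n words yields up to n - width + 1 samples, the stored sample set grows with document length, which is the drawback noted above.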
Another approach to this problem is based on single-word "chunks." Such a method employs a registration server that maintains registered documents against which new documents can be checked for overlap. The method detects copies by comparing the word occurrence frequencies of the new document against those of the registered documents.
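The word-frequency approach can be illustrated with a small in-memory stand-in for the registration server. This is a sketch under stated assumptions, not the method of any particular system: the class name Registry, the cosine measure over word counts, and the 0.9 threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def word_counts(text):
    """Count occurrences of each single-word chunk (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (an assumed measure)."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class Registry:
    """Hypothetical in-memory stand-in for a registration server."""

    def __init__(self):
        self._docs = {}

    def register(self, name, text):
        # Only the word counts are kept for each registered document.
        self._docs[name] = word_counts(text)

    def check(self, text, threshold=0.9):
        """Return the names of registered documents that `text` overlaps with."""
        counts = word_counts(text)
        return sorted(
            name
            for name, registered in self._docs.items()
            if cosine_similarity(counts, registered) >= threshold
        )
```

Because only word counts are compared, a document whose words have been reordered is still reported as a copy of the registered original under this sketch.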
What is needed is a method to determine whether two documents have the same content except for modifications such as formatting, minor corrections, a webmaster signature, or a logo, using small sketches of the documents rather than the full text.