It is possible for documents stored in an electronic form to link or refer to other electronic documents stored elsewhere. For example, a web page is a document published to a computer network which can be accessed by any computing entity with a valid connection to the network. Web pages can refer or link to other web pages located elsewhere on the same network. One problem with references between electronic documents is that the content of a referenced document may change, or the referenced document may be relocated on the network or removed altogether. It is therefore important for an owner of a document which includes references to verify the content of referenced documents and the continued existence of referenced documents at a referenced location. If the content of a referenced document is amended it may be necessary to compare the content of the amended document with the content of the document prior to amendment to ensure the document continues to be suitable for reference. Similarly, if a document is relocated or removed it is necessary to identify a new location of the document or a replacement document and to compare the content of the newly located or replacement document with the original document to ensure that it is suitable for reference. Such comparisons of the content of documents are often undertaken manually and are therefore time-consuming and arduous, especially where the compared documents are lengthy.
The content of a document can be considered as comprising substantive content and supplementary content. The substantive content of a document is that content which relates to the meaningful substance of the document in the context of the purpose or meaning of the document. In contrast, supplementary content in a document is that content which does not relate to the meaningful substance of the document, such as insignificant elements including links to other documents, advertisements or navigation features. It may also be appropriate to consider titles, headings and short annotations as supplementary content. In practice it can be useful to distinguish between the substantive content of a document and supplementary content of a document in terms of the number of words making up such content. For example, short paragraphs or lines of text consisting of fewer than three words are unlikely to constitute complete sentences with substantive meaning. Such short paragraphs or lines typically relate to document links (such as web page hyperlinks). Thus, for a given document, it may be defined that paragraphs consisting of fewer than three words constitute supplementary content of the document. All other content may constitute substantive content. Supplementary content within documents can be ignored when comparing the contents of documents. Similarly, two documents may differ in only an insignificant respect, such as difference in use of punctuation, layout, formatting, wording or style. These differences may have no impact on the substantive content of a document but nonetheless a literal comparison of the documents would identify these as differences. Such problems make it difficult to automate a method for the comparison of documents, such as through a computer program, since such automatic methods are inherently pedantic in their approach to comparison.
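The word-count rule described above can be sketched as follows. This is a minimal illustration only: the function name `split_content`, the parameter `min_words` and the sample page content are hypothetical, and the threshold of three words is simply the example value given in the text.

```python
def split_content(paragraphs, min_words=3):
    """Partition paragraphs into (substantive, supplementary) lists.

    A paragraph with fewer than `min_words` words is treated as
    supplementary content; every other paragraph is substantive.
    """
    substantive, supplementary = [], []
    for paragraph in paragraphs:
        # str.split() with no arguments splits on any run of whitespace.
        if len(paragraph.split()) < min_words:
            supplementary.append(paragraph)
        else:
            substantive.append(paragraph)
    return substantive, supplementary


# Hypothetical page: two navigation-style lines and one full sentence.
page = [
    "Home",
    "Contact us",
    "Shingling represents a document as a set of word sequences.",
]
substantive, supplementary = split_content(page)
print(substantive)
print(supplementary)
```

Only the full sentence is kept as substantive content here; the two short lines would be ignored in a comparison.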
It would therefore be advantageous to provide a fingerprint for a document which reflects only the substantive content of the document and which is smaller than the document itself. Further, if the substantive content of the document is changed, it would be advantageous if the fingerprint for the document also changes to a measurable extent corresponding to the change to the substantive content of the document. That is, the significance of the change to the meaning of the document would be reflected by an equivalent significance of change to the fingerprint. Thus, two documents can be compared by comparing their associated fingerprints. Any differences between the substantive content of the documents would result in a measurable and equivalent difference between the fingerprints of the documents.
One technique for providing a fingerprint for a document reflecting the content of the document is known as hashing. Hashing is a technique for generating a digest, such as a numerical value, corresponding to an input element such as a document. For example, the Message Digest 5 algorithm (MD5) is disclosed in RFC 1321 available from the world wide web at www.faqs.org/rfcs/rfc1321.html. This algorithm takes as input a document of arbitrary length and produces as output a digest of the document which is based on the content of the document. It is commonly accepted in the art that it is computationally infeasible to produce two documents with different content having the same document digest, or to produce any document having a particular document digest using the MD5 algorithm. Whilst the MD5 algorithm provides a fingerprint for a document, it does so for the whole contents of a document and does not distinguish the substantive content. Furthermore, a change to the document does not result in a measured change to the fingerprint generated by the MD5 algorithm. In fact, a small change to the content of a document can result in a radically different MD5 digest. Thus, comparing MD5 digests for two documents gives no indication of the similarity of the two documents themselves.
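The sensitivity of an MD5 digest to a small change can be observed with a short sketch using the `hashlib` module of the Python standard library. The two sample strings and the helper name `md5_hex` are assumptions chosen for illustration; the strings differ only by a trailing full stop.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of `text` as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Illustrative inputs: the second differs only by a final full stop.
original = "The quick brown fox jumps over the lazy dog"
amended = "The quick brown fox jumps over the lazy dog."

digest_a = md5_hex(original)
digest_b = md5_hex(amended)

# Count the positions at which the two hexadecimal digests disagree.
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex digits differ")
```

The number of differing digits bears no relation to the size of the edit, which is why the digests cannot serve as a measure of similarity.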
Another approach for generating a fingerprint for a document is known as shingling. Shingling is a method for generating a representation of the content of the document based on a set of shingles. A shingle is a contiguous subsequence of elements, such as words, contained in a document. The number of elements contained in a shingle is defined as the shingle size. The set of shingles for a document is the set of all unique shingles having the shingle size contained in the document. The shingling approach to generating a fingerprint for a document will now be considered with reference to FIGS. 1a to 1f. 
FIG. 1a is a representation of a document 1 including sentences, clauses, words and punctuation. Document 1 comprises a set of words represented by the elements of the document labelled ‘a’ to ‘l’. The document includes two sentences, s1 102 and s2 104. Sentence s1 102 is separated from sentence s2 104 by punctuation, in particular, ‘PERIOD’. Sentence s1 102 is thus comprised of words ‘a’ to ‘f’. Sentence s2 104 is comprised of words ‘g’ to ‘l’. Sentence s2 104 is further divided into clauses c1 106 and c2 108 which are divided by ‘COMMA’, and terminated by a further ‘PERIOD’ in clause c2 108. Document 1 can be divided into a set of shingles for a given shingle size. Taking a shingle size of three words, for example, a first shingle of document 1 includes the first three words ‘a’, ‘b’ and ‘c’. A second shingle of document 1 includes the second three words ‘b’, ‘c’ and ‘d’ and so on.
FIG. 1b is a representation of a set of shingles 10 with a shingle size of three words for the document 1 of FIG. 1a according to methods of the prior art. As can be seen from FIG. 1b a complete shingling of document 1 results in a set 10 of ten shingles starting with {‘a’, ‘b’, ‘c’ } and ending with {‘j’, ‘k’, ‘l’}. The set of shingles 10 therefore includes thirty words in total (a total number of words in all of the shingles). Thus, the set of shingles 10 is larger than the number of words in the original document 1 which included only twelve words (‘a’ to ‘l’). This results in a drawback of the shingling technique in that a comparison of documents by comparing sets of shingles results in comparing more elements than comparing the content of the documents themselves.
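The shingling of document 1 can be reproduced with a short sketch. It assumes the document is treated as a flat sequence of the words ‘a’ to ‘l’ with punctuation ignored; the function name `shingles` is hypothetical.

```python
def shingles(words, size=3):
    """Return every contiguous run of `size` words, in document order."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


# Document 1 as a flat word sequence 'a' to 'l' (punctuation ignored).
document_1 = list("abcdefghijkl")
shingle_list = shingles(document_1)

print(len(document_1))                    # 12 words in the document
print(len(shingle_list))                  # 10 shingles
print(sum(len(s) for s in shingle_list))  # 30 words across all shingles
print(shingle_list[0], shingle_list[-1])
```

The shingles of document 1 happen to be unique, so the set of shingles has the same ten members as the list.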
FIG. 1b is also annotated to include an indication of which shingles correspond to the semantic constructs of document 1. Thus, set of shingles 112 corresponds to the words included in sentence s1 102. Set of shingles 114 corresponds to the words included in sentence s2 104. Further, set of shingles 114 includes subset 116 corresponding to clause c1 106 and subset 118 corresponding to clause c2 108. It is noted that sets 112 and 114 intersect and that the two shingles {‘e’, ‘f’, ‘g’} and {‘f’, ‘g’, ‘h’} relate to both sentence s1 102 and sentence s2 104. Similarly, sets 116 and 118 intersect and the two shingles {‘h’, ‘i’, ‘j’} and {‘i’, ‘j’, ‘k’} relate to both clause c1 106 and c2 108. Thus the existence of semantic constructs (such as ‘PERIOD’ and ‘COMMA’) in the substantive content of document 1 has no effect on the set of shingles 10 generated for document 1. This has the drawback that changes to the semantic structure of a document (e.g. removal or addition of punctuation) do not affect a set of shingles generated for the document.
FIG. 1c is a representation of a document 2 which corresponds to the document 1 with the addition of a word ‘x’ at the end of the first sentence s1 122. In every other way the document 2 is identical to the document 1 and shall not be described in further detail. FIG. 1d is a representation of a set of shingles 20 with a shingle size of three words for the document 2 of FIG. 1c according to methods of the prior art. By comparing the set of shingles 20 for document 2 with the set of shingles 10 for document 1 it can be seen that the addition of the word ‘x’ at the end of sentence s1 122 has resulted in a change to the set of shingles 20 for the document 2. In particular, shingles including the word ‘x’ have been introduced. FIG. 1d is also annotated to include an indication of which shingles correspond to the semantic constructs of document 2. Thus, set of shingles 132 corresponds to the words included in sentence s1 122. Set of shingles 134 corresponds to the words included in sentence s2 104, and so on. These sets of shingles 132 and 134 for document 2 can be compared with the corresponding sets of shingles 112 and 114 for document 1 to quantify the change in the set of shingles for each sentence s1 122 and s2 104 following the addition of the word ‘x’ to sentence s1 122. It can be seen that whilst the word ‘x’ only affects sentence s1 122 in the substantive content of the document 2, set of shingles 132 for sentence s1 122 and set of shingles 134 for sentence s2 104 are both affected. Thus shingling has the drawback that changes to one semantic construct (such as sentence s1 122) affect the shingles generated with respect to a separate semantic construct (such as s2 104).
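The effect of inserting the word ‘x’ can be examined with a small sketch that compares the shingle sets of the two documents. It assumes both documents are flat word sequences with punctuation ignored; the names `shingle_set`, `added` and `removed` are hypothetical.

```python
def shingle_set(words, size=3):
    """Return the set of unique shingles of `size` words."""
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


# Document 1, and document 2 with 'x' appended to the first sentence.
document_1 = list("abcdef") + list("ghijkl")
document_2 = list("abcdef") + ["x"] + list("ghijkl")

set_10 = shingle_set(document_1)
set_20 = shingle_set(document_2)

added = sorted(set_20 - set_10)    # shingles present only in document 2
removed = sorted(set_10 - set_20)  # shingles present only in document 1

print("added:  ", added)
print("removed:", removed)
```

Two of the added shingles, ('f', 'x', 'g') and ('x', 'g', 'h'), contain words of the second sentence, even though the edit was confined to the first sentence.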
FIG. 1e is a representation of a document 3 which corresponds to the document 1 with the sentence s1 102 swapped with the sentence s2 104. In every other way document 3 is identical to the document 1 and in particular, the swapping of sentence s1 102 with sentence s2 104 does not change the substantive content of document 3 as compared with document 1. FIG. 1f is a representation of a set of shingles 30 with a shingle size of three words for the document 3 of FIG. 1e according to methods of the prior art. As can be seen from FIG. 1f a complete shingling of document 3 results in a set 30 of ten shingles starting with {‘g’, ‘h’, ‘i’} and ending with {‘d’, ‘e’, ‘f’}. FIG. 1f is also annotated to include an indication of which shingles correspond to the semantic constructs of document 3. Thus, set of shingles 144 corresponds to the words included in sentence s2 104. Set of shingles 142 corresponds to the words included in sentence s1 102. Further, set of shingles 144 includes subset 146 corresponding to clause c1 106 and subset 148 corresponding to clause c2 108. Whilst the substantive content of document 3 is identical to that of document 1 it can be seen that the set of shingles 30 for document 3 differs from the set of shingles 10 for document 1.
An approach to quantifying the similarity of documents by sets of shingles is disclosed in the document “Syntactic Clustering of the Web” by Broder et al. (Computer Networks and ISDN Systems, September 1997, Volume 29, no. 8, pp 1157-1166). This approach defines that, for a given shingle size, the containment of a set of shingles A in a set of shingles B is:
$$C(A, B) = \frac{|A \cap B|}{|A|}$$
where |X| is the size of set X. Applying this to the sets of shingles 10 and 30, with A corresponding to the set of shingles 10 and B corresponding to the set of shingles 30, the containment can be calculated as:
$$C(A, B) = \frac{|A \cap B|}{|A|} = \frac{7}{10} = 0.7$$
Thus, even though the substantive content of documents 1 and 3 is identical, the similarity quantified by the containment of the set of shingles 10 in the set of shingles 30 is ‘0.7’ or 70%. Shingling thus has the drawback that a mere rearrangement of the semantic construct of a document can cause a significantly different set of shingles.
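The containment measure can be sketched in a few lines. The function names `shingle_set` and `containment` are hypothetical, and the two word sequences below are illustrative inputs chosen for this sketch rather than the documents of the figures.

```python
def shingle_set(words, size=3):
    """Return the set of unique shingles of `size` words."""
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(a, b):
    """Containment of shingle set `a` in shingle set `b`: |a & b| / |a|."""
    if not a:
        raise ValueError("containment is undefined for an empty set A")
    return len(a & b) / len(a)


# Illustrative inputs only: two six-word sequences differing in the last word.
set_a = shingle_set("the cat sat on the mat".split())
set_b = shingle_set("the cat sat on the rug".split())

print(len(set_a & set_b), "of", len(set_a), "shingles of A also occur in B")
print(containment(set_a, set_b))
```

Containment is not symmetric in general: C(A, B) divides by the size of A, so swapping the arguments can give a different value when the two sets differ in size.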
Thus, whilst shingling provides a technique for representing the content of a document, it is not limited to representing the substantive content of the document and it does not accommodate the significance or insignificance of semantic construct within the document. Consequently, the extent of a change to a document for which a set of shingles is generated is not measurably reflected in a regenerated set of shingles for the document.
Thus there exists a need to provide a method for generating a fingerprint for a document which overcomes these drawbacks and provides the advantageous features described above. In particular, these advantageous features are: providing a fingerprint for a document which reflects only the substantive content of the document and which is smaller than the document itself; the fingerprint reflecting the organisation of the document into semantic constructs; the fingerprint changing to a measurable extent corresponding to a change to the substantive content of the document; and the fingerprint being unaffected by mere rearrangement of the content of the document.