In the electronic publication arts, it is often desirable to determine when two digital volumes contain the same, or approximately the same, content. As an example, consider two instances of the same book made available by different publishers. A simple word-for-word comparison may identify numerous superficial differences, including, for example, different spellings (e.g., American versus British English), different cultural references (e.g., “pharmacist” versus “chemist”), differences in formatting and/or layout, and the like. While such differences may be numerous, the underlying content may nonetheless be substantially the same.
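By way of illustration only, the following Python sketch shows how a naive word-for-word comparison may report several differences between two sentences that convey the same content. The function names `words` and `word_mismatches`, the tokenization rule, and the two sample sentences are assumptions introduced for this example and are not drawn from any particular system.

```python
import re

def words(text):
    # Deliberately naive tokenizer (an assumption for this sketch):
    # lowercase the text and keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())

def word_mismatches(a, b):
    # Compare the two word sequences position by position and count
    # the positions that differ, plus any difference in length.
    wa, wb = words(a), words(b)
    diffs = sum(1 for x, y in zip(wa, wb) if x != y)
    return diffs + abs(len(wa) - len(wb))

# Hypothetical sample sentences: same content, different regional wording.
us = "The pharmacist admired the color of the harbor."
uk = "The chemist admired the colour of the harbour."

mismatches = word_mismatches(us, uk)
total = len(words(us))
print(f"{mismatches} of {total} words differ")
```

In this sketch, three of the eight words are flagged as different even though a reader would regard the two sentences as equivalent, which is the kind of superficial difference described above.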
Metadata is often provided with the digitized text of a book. Such metadata may include, for example, the author, title, publisher, International Standard Book Number (ISBN), and additional identifying information. While it may seem that such metadata could be used to determine whether two books are similar, it has been observed that, for a variety of reasons, including, for example, human error, formatting inconsistencies, and deliberate deceit, metadata is not a sufficiently reliable basis for determining content similarity.