1. Field of Disclosure
The disclosure generally relates to the field of digitizing books, and in particular to identifying and classifying related volumes in a corpus.
2. Background Information
A digital text corpus may be created by scanning books and other texts from libraries and other repositories. The collections held by different repositories often overlap significantly in content. As a result, the digital text corpus often contains multiple copies of similar, or even identical, volumes. Identifying similar volumes in the corpus is useful for purposes such as selecting a representative version of a volume, of its pages, or even of its text, as well as catching anomalous volumes, detecting different but related volumes that share content, and detecting piracy.
However, for a large corpus, it is difficult to compare the digital text volumes to each other. The number of volume pairs grows quadratically with the number of volumes, so comparing every pair of volumes in the corpus is infeasible, and comparing every pair of pages within the volumes is even more computationally prohibitive. Therefore, it is hard to identify similar volumes within the corpus.
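To illustrate the scale of the exhaustive comparison described above, the following minimal sketch counts the unordered pairs that would have to be examined. The corpus size and page count are hypothetical values chosen only for illustration and do not come from the disclosure. The helper name pair_count is likewise an assumption, and the page count treats all pages across the corpus as candidates for comparison.

```python
def pair_count(n: int) -> int:
    """Return the number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical sizes, assumed for illustration only (not from the disclosure).
num_volumes = 1_000_000
pages_per_volume = 300

# Exhaustive volume-to-volume comparison.
volume_pairs = pair_count(num_volumes)

# Exhaustive page-to-page comparison across all pages in the corpus.
page_pairs = pair_count(num_volumes * pages_per_volume)

print(f"volume pairs: {volume_pairs:,}")
print(f"page pairs:   {page_pairs:,}")
```

Under these assumed sizes, the volume-level comparison already requires roughly 5 x 10^11 pairwise comparisons, and the page-level comparison requires roughly 4.5 x 10^16.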