1. Field of the Invention
This invention pertains in general to data mining a large corpus of text and in particular to identifying and navigating similar passages in a digital text corpus.
2. Description of the Related Art
Advances in digital technology have changed the way people acquire information. For example, people can now view electronic documents that are stored in a predominantly textual corpus, such as a digital library accessible via the Internet. Such a digital text corpus is established, for example, by scanning paper copies of documents, including books and newspapers, and then applying an optical character recognition (OCR) process to produce computer-readable text from the scans. The corpus can also be established by receiving documents and other texts already in machine-readable form.
Unlike a document in a hypertext corpus, a document in a digital text corpus rarely contains functional links to other documents, either in the same corpus or in other corpora. Moreover, mining references from the text of documents in a digital text corpus to support general link-based browsing is a difficult task: functional hypertext references such as URLs are rare, and citations and other forms of inline references are seldom used outside of scholarly and professional works.
This lack of a link structure makes it difficult to browse documents in the corpus in the same manner that one might browse a set of web pages on the Internet. As a result, browsing the documents in the corpus can be less stimulating than traditional web browsing because one cannot browse by related concepts or by other characteristics.