Many websites today offer search engines dedicated to specific types of content, and books are one such type. Book search engines typically require that books supplied by libraries, publishers, and other book providers be digitized and indexed. In addition, metadata for each book typically must be identified, associated with, and indexed alongside the actual contents of the book. Such metadata includes the author, title, publisher, copyright year, and subjects; the correlation between the leaf numbers of pages and the page numbers printed in the book; the book structure (leaf number of the title page, leaf numbers of the table of contents pages, leaf numbers of the index pages); table of contents data (a list of chapter names with corresponding page leaf numbers); and index data (a list of index terms with corresponding page leaf numbers).
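As an illustration only, the metadata listed above could be held in a record such as the following minimal Python sketch. The class name `BookMetadata`, its field names, the helper methods, and the sample values in `example_book` are all hypothetical and are not taken from any particular system; the sketch merely shows one way the leaf-number correlations, book structure, table of contents data, and index data might be kept together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class BookMetadata:
    """Hypothetical container for the book metadata described above."""
    author: str
    title: str
    publisher: str
    copyright_year: int
    subjects: List[str] = field(default_factory=list)
    # Leaf number (position of the page in the scan) -> page number as printed.
    leaf_to_page: Dict[int, str] = field(default_factory=dict)
    # Book structure: leaf numbers of the title, contents, and index pages.
    title_page_leaf: Optional[int] = None
    toc_leaves: List[int] = field(default_factory=list)
    index_leaves: List[int] = field(default_factory=list)
    # Table of contents data: (chapter name, leaf number) pairs.
    toc_entries: List[Tuple[str, int]] = field(default_factory=list)
    # Index data: index term -> leaf numbers on which the term appears.
    index_entries: Dict[str, List[int]] = field(default_factory=dict)

    def printed_page(self, leaf: int) -> Optional[str]:
        """Return the printed page number for a leaf, or None if unknown."""
        return self.leaf_to_page.get(leaf)

    def leaves_for_term(self, term: str) -> List[int]:
        """Return the leaf numbers listed in the index for a term."""
        return self.index_entries.get(term, [])


def example_book() -> BookMetadata:
    """Build a record with made-up sample values, for illustration only."""
    return BookMetadata(
        author="A. Author",
        title="An Example Book",
        publisher="Example Press",
        copyright_year=1950,
        subjects=["Bookbinding"],
        # Front matter is numbered in roman numerals, the body in arabic ones.
        leaf_to_page={3: "iii", 4: "iv", 7: "1", 8: "2", 9: "3"},
        title_page_leaf=1,
        toc_leaves=[3, 4],
        index_leaves=[40, 41],
        toc_entries=[("Introduction", 7), ("Materials", 9)],
        index_entries={"binding": [9, 12]},
    )
```

Storing the printed page number as a string rather than an integer is a design choice made here so that one mapping can hold both roman-numeral front matter and arabic body pages.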
Books that are not available in digital form are usually scanned and processed using optical character recognition (OCR) technology. However, OCR technology has several problems: OCR software typically does not perform any metadata extraction; OCR output is imperfect, because some words are not recognized correctly; OCR software usually cannot detect differences in formatting among publishers and copyright years; and OCR software may not be able to detect more than one sequence of page numbers in a book.