The information age has produced an explosion of content for people to read. This content is obtained from traditional sources such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, and other documents that exist in print, as well as from electronic media in which the aforesaid sources are provided in digital form. The Internet has enabled even wider publication of content in digital form, such as Portable Document Format (PDF) files and e-books.
Technological advances in digital imaging devices have enabled the conversion of content from printed sources to digital form. For example, digital imaging systems, including scanners equipped with automatic document feeders or scanning robots, are now available that obtain digital images of pages of printed content and translate the images into computer-readable text using character recognition techniques. The digital images of the pages, referred to as “page images,” may then be stored in a computing device and disseminated to users. Page images may also be provided from other sources, such as electronic files, including electronic files in Portable Document Format (.pdf).
When a user attempts to access images of one or more pages of content from a book or other source stored on a computing device, it may be desirable to facilitate such access based on the type or classification of the page represented by each image, thus enhancing the user experience. For example, rather than forcing the user to reach a certain portion of the content by accessing the content serially, page image by page image, direct links may be provided to a page image classified as, for instance, a table of contents or the start of the text.
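By way of illustration only, the direct-link idea described above can be sketched in a few lines of Python. The sketch assumes that each page image already carries a classification label; the label names ("table_of_contents", "start_of_text") and the function name build_direct_links are hypothetical and are not taken from any particular system.

```python
from typing import Dict, List

# Hypothetical labels: the text names only a table of contents and the
# start of the text as examples of classifications worth linking to.
LINKABLE_LABELS = ("table_of_contents", "start_of_text")


def build_direct_links(page_labels: List[str]) -> Dict[str, int]:
    """Map each linkable label to the index of the first page image bearing it."""
    links: Dict[str, int] = {}
    for index, label in enumerate(page_labels):
        # Keep only the first page of each linkable classification.
        if label in LINKABLE_LABELS and label not in links:
            links[label] = index
    return links


# Assumed example input: one classification label per page image, in page order.
labels = [
    "cover",
    "copyright",
    "table_of_contents",
    "table_of_contents",
    "start_of_text",
    "body",
]
links = build_direct_links(labels)
```

With such a mapping, a viewer could jump directly to the page image at links["table_of_contents"] instead of stepping through the page images serially.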
Currently, classification of page images is done manually, which is time-consuming and costly. Accordingly, a method and system are needed for automatically classifying images of pages of content.