Existing Optical Character Recognition (OCR) software can convert scanned paper documents into an electronic version, which contains the ASCII/Unicode text and page-level structures such as text/graphics blocks. However, for books/journals with many pages it is desirable to automatically generate the book/journal level structure, for example, to split a whole book/journal into individual articles.
One benefit of splitting a book/journal into individual articles is faster Web browsing and reduced Web traffic. A common use of the electronic versions of books/journals is to make them available on the Web for users to browse or download remotely. The file for a whole book/journal typically ranges in size from several megabytes to tens of megabytes, depending on the number of pages and their contents. Downloading such a file takes a long time if the user has only low-speed Internet access, such as dial-up service. If the book/journal is already split into smaller units such as individual articles, the user can download only the contents he/she is interested in, which also reduces the Web traffic load on the content provider.
Another benefit of splitting a book/journal into individual articles is to satisfy business considerations. In certain instances, the publisher provides the electronic contents for a fee. In this case, the user often wants to pay for only the parts that interest him/her. When the content is split into individual articles, the user can access only the parts he/she desires.
Yet another benefit of splitting a book/journal into individual articles is ease of use. Even for users who have the whole book/journal, it is still desirable to organize the contents logically to facilitate navigation and browsing.
Although human operators can split a book/journal/magazine into separate articles, this kind of manual processing tends to be slow, tedious and expensive. It is desirable to have a system that enables a computer to automatically analyze the logical structure of books/journals and split them into individual chapters/articles. In the prior art, there are two categories of methods for obtaining the logical structure.
One category of methods obtains the logical structure by using only the page/chapter/article numbers printed on the table-of-contents (TOC) pages (or the content pages) to find individual articles, and by analyzing only the TOC pages to get other relevant information such as the article titles and author names. There are several disadvantages associated with relying only on the TOC pages. For one, OCR may make errors in recognizing the page numbers on the content pages. In this situation, wrong page numbers are obtained or page numbers are missed, and the journal is accordingly split in the wrong way. Additionally, even where the page numbers are correctly recognized, there still may be false-negative or false-positive errors in deciding whether a digit string is a page number, because there are often other digit strings besides page numbers on the TOC pages. Furthermore, in some magazines the page numbers on the TOC pages and/or individual articles are printed in special formats so that the OCR engine cannot recognize them at all. Moreover, in order to get the article titles and author names, the layout of the TOC pages must comply with certain templates. Some implementations depend on natural language processing (NLP) to extract article names and author names and accordingly are limited to specific languages.
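The digit-string ambiguity described above can be illustrated with a minimal sketch. It is not taken from any prior-art implementation: the function names, the sample TOC lines and the "last digit string on the line" heuristic are all assumptions chosen only to show how false positives and OCR errors arise.

```python
import re

def candidate_page_numbers(toc_line):
    """Return every digit string on an OCR'd TOC line as an integer.

    Hypothetical helper: any of these could be mistaken for a page number.
    """
    return [int(digits) for digits in re.findall(r"\d+", toc_line)]

def trailing_page_number(toc_line):
    """Assumed heuristic: accept only a digit string that ends the line."""
    match = re.search(r"(\d+)\s*$", toc_line)
    return int(match.group(1)) if match else None

# A title that itself contains digits yields several candidates.
clean_line = "Advances in 3D Imaging since 1995 .... 47"
all_candidates = candidate_page_numbers(clean_line)
chosen = trailing_page_number(clean_line)

# Assumed OCR error: the printed "17" was recognized as "l7" (letter l).
ocr_error_line = "Volume 12 Review .... l7"
wrong_page = trailing_page_number(ocr_error_line)
```

In this sketch the heuristic picks the right value for the clean line, but the misrecognized line silently yields a wrong page number, which is the kind of error that causes a journal to be split in the wrong place.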
Another category of methods obtains the logical structure by depending on text format information in the body pages (e.g., the pages of individual articles/chapters, excluding TOC pages). For example, if article titles, and only article titles, are set in “30 point bold Times Roman”, that format can be used to locate the boundaries of articles. There are disadvantages associated with this category of methods. For one, current OCR technology is not reliable enough to accurately determine the text format, limiting this method to analyzing computer-originated documents (for example, PDF or PostScript files generated from word-processing software). Furthermore, sometimes the same format serves different functions within a book/journal (e.g., the same format is used for both the article title and sub-section title).
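The format-based approach and its weakness can be sketched as follows. This is an illustrative assumption, not a prior-art implementation: the `TextLine` record, the `TITLE_FORMAT` tuple and the sample data are hypothetical, and the sketch presumes the font attributes are already known exactly, which OCR output generally cannot guarantee.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextLine:
    """Hypothetical record for one line of body text with its format."""
    page: int
    text: str
    font: str
    size_pt: float
    bold: bool

# Assumed title format, taken from the example in the text.
TITLE_FORMAT = ("Times Roman", 30.0, True)

def article_start_pages(lines, title_format=TITLE_FORMAT):
    """Return the page of every line whose format equals the title format."""
    return [
        line.page
        for line in lines
        if (line.font, line.size_pt, line.bold) == title_format
    ]

sample_lines = [
    TextLine(1, "First Article", "Times Roman", 30.0, True),
    TextLine(1, "Body text of the first article.", "Times Roman", 10.0, False),
    # A sub-section heading that reuses the title format.
    TextLine(4, "Results", "Times Roman", 30.0, True),
    TextLine(9, "Second Article", "Times Roman", 30.0, True),
]

boundaries = article_start_pages(sample_lines)
```

Here the sub-section heading on page 4 is reported as an article boundary alongside the two real titles, because the rule cannot tell apart two functions that share one format.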