A newspaper is a printed document that contains informative articles on a variety of topics. Newspapers are usually printed on relatively inexpensive, low-grade paper. A newspaper article is composed of several elements and can span one or multiple pages: a title, one or more subtitles, an author, one or more text boxes containing the body of the article, and, in recent newspapers, usually one or more images. Furthermore, a single newspaper page typically contains multiple articles.
The conversion of newspaper pages into digital resources is an important task that greatly contributes to the preservation of, and access to, newspaper archives. Moreover, in developing regions such as parts of Africa, where digital data are still sparse and difficult to gather, digitalized newspapers can add to the information available. While traditional paper-based newspapers are easy to distribute to resource-constrained areas, digitized newspapers would enable more intelligent online and offline services, such as smart search tools for journalists' timelines and sentiment, and predictive analytics for generating article templates.
In document digitalization, newspaper article extraction remains an open problem due to the complexity and variety of multi-article page layouts. The process typically used to digitalize newspapers is complex and comprises several phases: scanning the document; segmenting the page into its structural and logical units (zones or regions); labeling the detected zones according to their type (title, text, image, line, or table); extracting the articles, in which all the elements belonging to the same article are clustered together; and identifying the reading order of the clustered elements.
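The last two phases described above can be illustrated with a minimal Python sketch. It is not the method of any particular system: the `Zone` record, the `article_id` field, the `column_width` parameter, and both function names are hypothetical, and the sketch assumes that segmentation, labeling, and article clustering have already assigned each zone a type and an article identifier. Reading order is approximated here by sorting zones column by column and then top to bottom, which holds only for simple grid-like layouts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Zone:
    """One labeled region of a scanned page (hypothetical record)."""
    kind: str             # "title", "text", "image", "line", or "table"
    x: int                # left edge of the bounding box, in pixels
    y: int                # top edge of the bounding box, in pixels
    text: str = ""        # recognized text, empty for non-text zones
    article_id: int = -1  # assumed to be set by the article-extraction phase


def reading_order(zones, column_width=300):
    """Sort zones column by column, then top to bottom within a column."""
    return sorted(zones, key=lambda z: (z.x // column_width, z.y))


def assemble_articles(zones, column_width=300):
    """Group zones by article and put each group in reading order."""
    groups = defaultdict(list)
    for zone in zones:
        groups[zone.article_id].append(zone)
    return {aid: reading_order(group, column_width)
            for aid, group in groups.items()}
```

A real page would need a more robust ordering rule, since titles and images frequently span several columns and columns are rarely of uniform width.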
The most challenging problem in the digitalization process is article extraction. Several solutions have been proposed previously: using layout-based information (rules) to detect the elements belonging to the same article; using the text content of the page to determine which text blocks belong to the same article; analyzing the text content to extract topics and using this information to merge text boxes; or using syntactic rules to identify consecutive text boxes. However, challenges remain with these methods, such as those that follow.
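As an illustration of the text-content approach mentioned above, the following Python sketch groups text blocks whose word usage is similar. It is only one simple way to realize the idea, not a reconstruction of any cited method: the bag-of-words cosine similarity, the small stop-word list, the `threshold` value, and all function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "by", "and", "to"}


def bag_of_words(text):
    """Count the lowercase alphabetic words in text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_blocks(blocks, threshold=0.3):
    """Return one cluster label per block; similar blocks share a label."""
    vectors = [bag_of_words(b) for b in blocks]
    parent = list(range(len(blocks)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters
    return [find(i) for i in range(len(blocks))]
```

For example, calling `cluster_blocks` on two blocks about a council budget vote and one block about a football match would be expected to give the first two blocks the same label and the third a different one. Purely lexical similarity of this kind is fragile on short or poorly recognized text blocks.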
First, newspaper pages may appear in a variety of formats without necessarily sharing a common structure. This occurs in part because of repeated changes in layout habits over time, changes in editorial staff, and the like. Some newspapers may tend to have the bulk of a story on a single page, with only a small portion continued on a subsequent page, whereas others may have small portions of the story spread over several pages. Second, newspapers are not meant to be read sequentially: the reader can choose which elements to read and read them in any order he or she prefers. A single page may contain six or more story fragments, and the reader can elect to read each story serially or several stories in parallel. Third, the quality of the scanned documents to be digitalized is often very poor due to low print quality or deterioration over time. Portions of an article may be missing, blurred, or otherwise unreadable, including the portion that instructs the reader where to find the continuation of an article on a subsequent page. Finally, newspaper pages have a very complex structure, in particular those where the text columns are located very close to each other or are formatted to follow the outline of an image. Different components may be placed in seemingly arbitrary positions depending on the content.
The present invention is intended to address one or more of the above-mentioned difficulties in order to provide more useful digital data from old newspapers and other print publications.