Nowadays, in industries such as newspaper and other publishing, there is often a need to extract an article and its related metadata from the layout of a digital file for further use, for example, to reconstruct or index article information. In order to restore the contents of the layout accurately, it is necessary to extract not only the content information in the file, such as the title, cited title, sub-theme, author, and text, but also the position, font, size, and other attributes of each required text block.
Recently, when a digital newspaper is indexed, for example, when content information of the newspaper (such as the date of publication, edition, and version name) is organized, the layout to be processed may contain a large number of tables. Generally, such tabular data cannot be processed automatically, and manual processing is very complex. Therefore, a common approach is either to discard the data or to store it as pictures. However, this approach results in the loss of the tabular data.