References reflect the foundation of prior research. Citing references indicates that the current research extends the results of predecessors, and gives credit to those predecessors for their work. Therefore, various documents, particularly papers, usually provide information on the relevant references (hereinafter "reference information").
For example, FIG. 1 shows a layout of single-column reference information with a leading word. FIG. 2 shows a layout of reference information without a leading word. FIG. 3 shows a layout of multi-column reference information with a leading word. As shown in FIGS. 1-3, reference information normally conforms to a particular format. Accordingly, when contents are extracted from a layout file, the reference information may be extracted as structured data. For example, each of the first to fourth rows in FIG. 1 is one item, and the fifth and sixth rows together form another item.
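By way of illustration only, the grouping of rows into items described above may be sketched as follows. The sketch assumes that each item begins with a bracketed number such as "[1]" and that any other row continues the preceding item. That marker convention, the function name `group_reference_items`, and the placeholder rows are hypothetical and are not taken from the figures.

```python
import re

# Assumption: an item starts with a bracketed number such as "[1]".
# Real layouts may use other markers, or none at all.
ITEM_START = re.compile(r"^\s*\[\d+\]")


def group_reference_items(rows):
    """Group layout rows into reference items.

    A row that matches ITEM_START opens a new item; any other
    non-empty row is appended to the item currently being built.
    """
    items = []
    for row in rows:
        text = row.strip()
        if not text:
            continue  # skip blank rows
        if ITEM_START.match(text) or not items:
            items.append(text)
        else:
            items[-1] = items[-1] + " " + text
    return items


# Placeholder rows mirroring the row/item pattern described for FIG. 1:
# rows 1-4 are one item each, rows 5-6 together form one item.
rows = [
    "[1] Placeholder entry one.",
    "[2] Placeholder entry two.",
    "[3] Placeholder entry three.",
    "[4] Placeholder entry four.",
    "[5] Placeholder entry five, whose text is long enough",
    "to continue on a second row.",
]
items = group_reference_items(rows)
print(len(items))
```

Under these assumptions, the six rows yield five items, with the fifth and sixth rows merged into one. A layout without such markers would need a different cue for where an item begins, for example indentation or line spacing.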
Currently, reference information is obtained from digital layout files mainly by first extracting metadata, using, for example, machine learning methods or template methods, and then extracting reference items from the metadata. However, such metadata-based extraction is usually inefficient. As a result, there exists a need for an improved apparatus and method for extracting a document structure.