The portable document format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. Released as an open and royalty-free standard in 2008, it has become a de-facto format for publication of documents intended for global distribution, such as scientific, technical, and scholarly papers, in large part because storing articles in the PDF format provides a uniform way to view and print data.
PDF is based on PostScript, which is a page description language that is used to generate text layout and graphics. Like PostScript, PDF can describe page elements in any arbitrary order—i.e., not necessarily in the “left to right, top to bottom” order in which many languages are read. As a result, while PDF documents visually display text flow readable by humans, the stored text objects within the PDF file are not ordered, which is to say that they are not necessarily described in the order that the objects will be read by the reader, a characteristic that is herein referred to as “unordered”. PDF files, PostScript files, and other PDL file formats, to name a few, are “unordered” text files. This creates a challenge for systems to automatically extract the data for use in analyses such as text-mining. Scientific articles, for example, are often presented in a multi-column format, which poses even more challenges to correctly order the text. Tables, figures, and inline citations are also features of scientific and other documents. These features may disrupt the flow of the main body of text, which raises even more challenges.
This need is particularly felt in the scientific and technical communities, especially in cases where research is conducted based on a thorough review of a large body of existing literature. A person doing research, for example, may have access to online libraries that contain large numbers of reports, documents, and papers in PDF format. In this scenario, a simple word search—i.e., a search that returns documents containing a particular word—is relatively easy to do: the tool need only determine whether the document contains the word or does not contain the word. Such a search has very limited utility, especially if the word or phrase being sought includes common words.
However, more valuable would be the ability to find words that exist in a particular context. For example, it may be desirable to search for words that occur next to (or near to) each other in a sentence. This kind of search requires that the tool not only extract the text but also assemble it into a text flow that reproduces the original text. In other words, the tool should not only extract words but be able to stitch words back in its original intended order to create sentences and sentences ordered as intended paragraphs of text. This requires a tool that can handle multi-column layout and that does not get confused by embedded tables and figures.
Since scholarly papers most often conform to a particular structure, for example having an abstract, a description of the problem, a description of the methods used, a description of the results, a bibliography, and so on, these documents are said to have distinct “sections”. From a text extraction perspective, it would be tremendously more valuable to be able to find words that exist inside or outside of a particular ‘section’. In order to do this, however, not only must the text content of the unstructured document be properly reassembled into text flows (or else the sections are incorrectly assembled) but the tool must also have the capability to analyze a properly reordered text flow to identify the sections contained within and be able to categorize the words according to its sections.
Documents of scholarly articles typically include a bibliography section that lists references, usually located at the end of the document. The text within such articles often includes inline references to an entry in the bibliography. A reader who wants to pay particular attention to the source of a particular fact set forth in a scholarly work may find himself or herself flipping back and forth between the current location in the text and the bibliography information at the back of the document, which is time consuming, can be irritating, and provides opportunities for human error in which the reader erroneously attributes a piece of information to the wrong source. Thus, yet another valuable feature would be a tool that can replace inline bibliographic references with the actual bibliographic information or a subset thereof. In order to do this, however, the tool would need to (i) properly reassemble text flows, (ii) correctly identify sections, such as the Bibliography, and (iii) use that information to replace inline citations with bibliographic information in rest of the document.
There is a need for methods and systems that can accurately and efficiently reconstruct text flows for simple to complex, multi-column page layouts. There is a need for methods and systems that can additionally sectionalize the documents and categorize words by section. There is also a need for methods and systems that can additionally use section information to perform section-specific operations, such as the substitution of inline references to bibliographic information with the bibliographic data itself. More specifically, there is a need for methods and systems for efficient and accurate text extraction from unstructured documents.