This disclosure relates generally to the information storage and processing arts. It finds particular application in conjunction with the conversion of documents available in print-ready or image format into a structured format that reflects the logical structure of the document through the recognition of ordered sequences of identifiers in a document.
Many companies and organizations are desirous of converting data and documents originally drafted in an unstructured form (legacy documents) into a structured format to facilitate storing the documents, reusing or repurposing parts of the documents, providing document uniformity across a database of stored information, and accessing content within the documents. The unstructured documents may exist in various page description language formats, such as Adobe's portable document format (PDF), PostScript, PCL-5, PCL-5E, PCL-6, PCL-XL, and the like. The converted structured documents may employ a markup language such as extensible markup language (XML), standard generalized markup language (SGML), or hypertext markup language (HTML), among others. Technical manuals, user manuals, and other proprietary reference documents are common candidates for such legacy conversions.
In structured documents, content is organized into delineated sections such as document pages with suitable headers/footers and so forth. Such organization typically is implemented using markup tags. In some structured document formats, such as XML, a document type definition (DTD) or similar document portion provides overall information about the document, such as an identification of the sections, and facilitates complex document structures such as nested sections.
A particular issue that arises during the conversion process is associated with classes of documents containing normalized identifiers. Normalized identifiers are associated with specific document elements, often corresponding to logical parts of a document. Identifying sequences of such identifiers allows collection of useful information about these parts of the document. Examples of normalized identifiers are the CSI numbers defined in the Construction Specifications Institute's industry standard, the CSI's MasterFormat™ (http://www.csinet.org/masterformat). This standard is the specifications-writing standard for most commercial building design and construction projects in North America. It lists titles and section numbers for organizing data about construction requirements, products, and activities. For example, 081323 refers to “bronze doors”. More generally, this numbering technique is used for many document types and with generic document elements such as ‘chapter’.
For the purposes of document conversion, it is necessary both to detect normalized identifiers and to recognize the part of the document that describes the associated object. The primary difficulty in detecting normalized identifiers is that the same identifier may serve two purposes: it may be present in the part of the document that describes the given object, or it may merely reference that object, in which case it may appear almost anywhere in the document. Additionally, there may be variations of style within a single document when that document is obtained by composing parts of different documents. This may arise in industry when multiple different providers author a product maintenance manual. These difficulties are illustrated in FIGS. 1 through 4.
The examples illustrated in FIGS. 1-3 occur in the same document. Turning to FIG. 1, the CSI number 110 is a six-digit number (in this case 01 31 19) found at the top of the page, above the CSI title, following the term “Section”. In FIG. 2, the six-digit CSI number (01 32 13) identified as 210 is located in the page footer area above the page number and is underlined. In FIG. 3, the CSI number (014000) identified as 310 is located next to the page number in the page footer area and is separated from it by a dashed line. Finally, in FIG. 4, the CSI number (01 70 00) identified as 410 occurs in the second half of a page. In this example it follows the term “Section” and is underlined. As can be observed from these examples, the positions, textual context, and typography of the numbers may vary, not only from document to document, but also within a single document. Additionally, the number of digits may vary, for example, if a leading 0 is omitted or if a previous version of the CSI MasterFormat, which follows a different standard, is used, resulting in 5-digit numbers. Further complicating detection, this form of numbering differs from pagination-related numbering in that it does not necessarily correlate with the pagination: zero, one, or many valid numbers, identical or different, may appear on a given page. Also, the sequence of valid identifiers may include gaps or redundancy.
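The variability described above can be made concrete with a minimal Python sketch of candidate extraction. It is offered only as an illustration under stated assumptions, not as the disclosed method: the regular expression, the function name `find_candidates`, and the sample fragments are hypothetical. The pattern assumes six digits written either contiguously or as three two-digit groups separated by single spaces, and it does not handle the 5-digit variants mentioned above.

```python
import re

# Hypothetical pattern (an assumption, not taken from the disclosure):
# three two-digit groups, optionally separated by a single space, and
# not embedded in a longer run of digits.
CSI_PATTERN = re.compile(r"(?<!\d)(\d{2}) ?(\d{2}) ?(\d{2})(?!\d)")


def find_candidates(fragments):
    """Return (fragment_index, normalized_number) pairs for every match.

    The numbers are normalized by removing the optional spaces, so that
    "01 31 19" and "013119" yield the same candidate.
    """
    candidates = []
    for index, text in enumerate(fragments):
        for match in CSI_PATTERN.finditer(text):
            candidates.append((index, "".join(match.groups())))
    return candidates


# Illustrative fragments loosely modeled on the situations described above.
fragments = [
    "Section 01 31 19",
    "PROJECT MEETINGS",
    "See Section 014000 for quality requirements",
    "01 32 13 - 4",
]
print(find_candidates(fragments))
```

Such a pattern returns every textual match, including mere references to an object, which is why pattern matching alone does not suffice to locate the part of the document that describes the object.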
Accordingly, there is a need in the art for methods and apparatuses for detecting these identifiers and identifying the parts of the document associated with them as a component in a chain of components for performing automatic conversion to XML of documents available in an unstructured format.
All U.S. patents and published U.S. patent applications cited herein are fully incorporated by reference. The following patents or publications are noted.
U.S. Patent Application Publication No. 2004/0006742 to Slocombe (“Document Structure Identifier”) describes a method of automated document structure identification based on visual cues. The two-dimensional layout of the document is analyzed to discern visual cues related to the structure of the document, and the text of the document is tokenized so that similarly structured elements are treated similarly. However, this application operates differently from the disclosure herein in that it first looks for lines starting with a number or a bullet.
U.S. patent application Ser. No. 11/599,947 to Dejean et al. (“Versatile Page Number Detector”) describes a method for detection of page numbers in a document utilizing the sequentiality property to recognize page numbers by looking for a series of increasing sequences with a fixed increment.
The disclosed embodiments provide examples of improved solutions to the problems noted in the above Background discussion and the art cited therein. There is shown in these examples an improved method for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being a longest increasing sub-sequence. The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on a graphical user interface.
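The notion of a longest increasing sub-sequence can be illustrated with a short Python sketch. It uses the standard binary-search (patience) technique and is only one possible way to compute such a sub-sequence; the function name, the choice of strict increase, and the sample terms are assumptions made for illustration and are not taken from the disclosure.

```python
from bisect import bisect_left


def longest_increasing_subsequence(terms):
    """Return the indices in `terms` of one longest strictly increasing
    sub-sequence (strictness is an assumption of this sketch)."""
    tail_values = []   # tail_values[k]: smallest tail of a run of length k + 1
    tail_indices = []  # position in `terms` of each such tail
    previous = [None] * len(terms)  # back-pointers used to rebuild the run

    for i, value in enumerate(terms):
        k = bisect_left(tail_values, value)
        if k == len(tail_values):
            tail_values.append(value)
            tail_indices.append(i)
        else:
            tail_values[k] = value
            tail_indices[k] = i
        previous[i] = tail_indices[k - 1] if k > 0 else None

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_indices[-1] if tail_indices else None
    while i is not None:
        result.append(i)
        i = previous[i]
    result.reverse()
    return result


# Illustrative terms: 14000 first occurs as an out-of-order reference
# (index 1) before the position where it fits the increasing order.
terms = [13119, 14000, 13213, 14000, 17000]
selected = longest_increasing_subsequence(terms)
print(selected)
print([terms[i] for i in selected])
```

In this illustration the early occurrence of 14000 is left out of the selected sub-sequence, which mirrors the situation in which an identifier that merely references an object appears out of order.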
In an alternate embodiment there is disclosed a system for creating a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The system includes a document conversion graphical user interface and a conversion processor for converting computer parsable text documents of any format to an alternative structured language format to form a converted document. A text fragmenter fragments the text of the converted document(s), breaking the converted document(s) into an ordered sequence of text fragments within a text format. An enumeration module enumerates the text fragments to obtain a sequence of terms, with each term being a matching fragment. A selection module identifies one or more optimal sub-sequences of terms, with an optimal sub-sequence defined as a longest increasing sub-sequence from among the sequence of terms. An association module annotates the computer parsable text document with tags, which include information derived from identification of the optimal sub-sequence(s).
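As a hedged illustration of the kind of annotation an association module might produce, the following Python sketch groups text fragments under section tags using the standard `xml.etree.ElementTree` module. The element names `document`, `section`, and `fragment`, the `id` attribute, the function name `annotate`, and the rule that a section extends from one selected identifier to the next are all assumptions of this sketch and are not specified by the disclosure.

```python
import xml.etree.ElementTree as ET


def annotate(fragments, selected):
    """Group fragments into <section> elements, one per selected identifier.

    `selected` maps a fragment index to the identifier detected in that
    fragment. Tag and attribute names are hypothetical.
    """
    root = ET.Element("document")
    current = root  # fragments before the first identifier stay at top level
    for index, text in enumerate(fragments):
        if index in selected:
            # Assumption: a new section starts at each selected identifier.
            current = ET.SubElement(root, "section", {"id": selected[index]})
        fragment = ET.SubElement(current, "fragment")
        fragment.text = text
    return root


# Illustrative input: fragment indices 1 and 3 carry selected identifiers.
fragments = [
    "Cover page",
    "Section 01 31 19",
    "PROJECT MEETINGS",
    "Section 01 32 13",
    "SCHEDULES",
]
selected = {1: "013119", 3: "013213"}
root = annotate(fragments, selected)
print(ET.tostring(root, encoding="unicode"))
```

Other delimiting rules are possible; the sketch only shows how information derived from the selected identifiers could be carried into tags.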
In yet another embodiment there is disclosed a computer-readable storage medium having computer readable program code embodied in the medium which, when the program code is executed by a computer, causes the computer to perform method steps for creating a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes navigating to a document conversion graphical user interface and converting at least one computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented to break the converted document into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms, with each term comprising a matching fragment. One or more optimal sub-sequences of terms are identified, with an optimal sub-sequence defined as a longest increasing sub-sequence from among the sequence of terms. The computer parsable text document is annotated with tags, which include information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.