A common approach to “content analysis” applications that focus on text documents is to extract only the textual content from the documents for further analysis. These applications pay little attention to the document's visual layout or the document structure and its associated metadata, e.g. section heading level or prominence. However, very useful information is often conveyed via the visual layout of a document. For example, visual layout may be used to denote the start and/or end of a possibly untitled section or to highlight the most important point in a section or document.
Complementing the text content analysis with the information contained in the document's visual layout may greatly improve the performance of downstream content analysis. The success of certain text analysis technologies often depends on their being applied in the appropriate context. In these cases, a crucial pre-processing step for subsequent text analysis is to segment a document into identifiable sections and/or subsection blocks using cues, including visual indicators, and to assign each section a type that is meaningful for the specific application.
The formats of documents of a specific type, for example a particular type of pathology report or resume, may vary significantly from organization to organization. Different organizations may use different keywords or punctuation in the section headings to mark sections of the same type. Therefore, multiple modules for detecting section types, typically created manually, may be required in order to process the documents of each organization.
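To illustrate why such manually created modules multiply, the following is a minimal sketch of keyword-and-punctuation heading detection, assuming two hypothetical organizations. The organization names (`org_a`, `org_b`), the section types, and the patterns are all invented for illustration and are not taken from any actual system.

```python
import re

# Hypothetical, manually written heading rules: one rule set per organization.
# The keywords and punctuation are assumptions chosen only to show how the
# same section type can be marked differently by different organizations.
HEADING_RULES = {
    "org_a": {
        "diagnosis": re.compile(r"^\s*(FINAL\s+)?DIAGNOSIS\s*:", re.IGNORECASE),
        "specimen": re.compile(r"^\s*SPECIMEN(S)?\s*:", re.IGNORECASE),
    },
    "org_b": {
        "diagnosis": re.compile(r"^\s*Dx\s*-", re.IGNORECASE),
        "specimen": re.compile(r"^\s*Specimen\s+Submitted\s*-", re.IGNORECASE),
    },
}


def detect_section_type(line, org):
    """Return the section type whose heading pattern matches the line, or None."""
    for section_type, pattern in HEADING_RULES[org].items():
        if pattern.search(line):
            return section_type
    return None
```

In this sketch, a heading written for one organization's format is not recognized by the other organization's rules, so each new organization requires another hand-written rule set.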