In the field of artificially intelligent computer systems capable of answering questions posed in natural language, cognitive question answering (QA) systems (such as the IBM Watson™ artificially intelligent computer system and other natural language question answering systems) process questions posed in natural language to determine answers and associated confidence scores based on knowledge acquired by the QA system. For such a system to return accurate, well-formatted, and concise answers, an important preprocessing step of the corpus ingestion process is to segment any documents being added to the corpus. Document segmentation is a difficult task that is typically performed with a software-based algorithmic language modeling approach, but compared to human processing, such algorithmic approaches have limited accuracy and are not well suited for processing documents with images, sophisticated layouts, or rich text formatting (e.g., HTML). Thus, while a variety of document segmentation tools exist, their limited ability to evaluate non-textual document information limits their accuracy. As a result, efficiently preprocessing and segmenting documents with existing solutions is extremely difficult at a practical level.