To extract content from documents, engines typically execute Optical Character Recognition (OCR) or other known routines. Training the engines often requires document templates. When entities receive disparate documents from third parties, or documents of a similar type but with highly variable layout, template training requires lengthy sessions and often produces poor results.
With student transcripts, for example, schools differ from one another in how they arrange courses, grades, student information, etc. on a document. Even within the same school, transcripts vary in layout between students in that courses differ, grades differ, and student information is unique to each person. Transcripts typify the problem of documents lacking common alignment, common structure, and common hierarchy, despite being of a similar type. Tabular extraction techniques seeking common line breaks, line patterns, cells, headers, etc. are ineffective for discerning content in documents of this type.
Accordingly, a need exists to improve content extraction. The inventors have further identified the need to transform inconsistently arranged documents and seemingly disparate structure into ascertainable structure and groupings of content. They also appreciate the value of making improvements without first executing OCR extraction or other computationally intensive routines. Since certain hardware devices have scanners or screen capture and resident controllers, the inventors have further identified the goal of implementing their techniques as executable code on imaging devices and handheld computing devices. Additional benefits and alternatives are also sought when devising solutions.