The common use case of digitizing a paper document or form and converting it into an adaptive or reflowable document presents many challenges. Simply scanning a document is not sufficient, as scanning provides only an “image” version of the document, and further processing is required to perform tasks such as structure extraction and text extraction. For the particular case of text extraction, the simplest approach is to perform an Optical Character Recognition (“OCR”) process on the scanned document and store the recognized textual content.
However, this simple approach has several significant shortcomings. In particular, a general document comprises sentences, paragraphs, headings, images, tables and other elements arranged arbitrarily over a number of rows and columns. Thus, a natural problem that arises in parsing scanned documents is determining the correct reading order of the document. A human reader can naturally infer the correct reading order because the human reader recognizes the context of the document, which allows the reader to infer where the reading order proceeds next based upon the point in the document that the reader has reached. A computing device, however, is not naturally adapted to this type of inference. Because documents are typically arranged in multiple columns and rows, the reading order of a document is not obvious, and extracting it is not easily codified as a set of rules to be performed by a computing device. For example, an OCR system by itself cannot determine the correct reading order of a document. Rather, some additional intelligence is needed to understand the correct reading order so that the correct reading context can be maintained in the digital version.
One specific instance of parsing scanned documents is parsing paper forms and converting them to digital forms. Reading order is important because a critical aspect of creating a reflowable document from a scanned document is maintaining the reading order of text amongst the various parts of the document, and the same applies to a paper form. Conventional approaches attempt to solve this problem through the use of visual modalities, which means that they process a form only as an image. In doing so, they do not explicitly take into account the text written in the form and thus drop the essential information required to maintain the context of the form, making it impossible to maintain the correct reading order in the form while parsing it. As a result, conventional approaches to determining the reading order of a document heuristically assume a reading order of left-to-right and top-to-bottom. This heuristic approach breaks down for even simple, common cases where, for example, a document assumes a two-column layout.
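By way of illustration only, the following minimal Python sketch shows how a left-to-right, top-to-bottom heuristic can interleave the columns of a two-column layout. The `Block` type, the coordinates, and the function name `naive_reading_order` are hypothetical and are introduced solely for this example; they do not describe any particular conventional system.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A hypothetical text block with the position of its top-left corner."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def naive_reading_order(blocks: List[Block]) -> List[Block]:
    """Order blocks top-to-bottom, breaking ties left-to-right."""
    return sorted(blocks, key=lambda b: (b.y, b.x))


# Assumed two-column page: column A at x=0, column B at x=300.
# The intended reading order is A1, A2, B1, B2.
blocks = [
    Block("A1", 0, 0),
    Block("A2", 0, 20),
    Block("B1", 300, 0),
    Block("B2", 300, 20),
]

order = [b.text for b in naive_reading_order(blocks)]
print(order)  # ['A1', 'B1', 'A2', 'B2']
```

In this sketch the heuristic jumps from the first line of the left column to the first line of the right column, so the text of the two columns is interleaved rather than read one column at a time.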
Another approach to maintaining the reading order of text amongst the various parts of the document is to extract n-gram features and feed them into a language model. Alternatively, a simple recurrent neural network (“RNN”) model may be applied to detect and extract features. However, these approaches have several limitations. In determining the correct reading order, it is important to model contextually all the text seen so far in the form. While RNN-based language models are known to outperform n-gram models in capturing long-term dependencies, language model approaches still incur significant limitations. In particular, a word-level model needs the text to be typo-free, as otherwise the word-level features are not extracted correctly. When text is extracted using a visual system such as an OCR system, the text extraction itself is not perfect, and typos in the form of missing characters, split words, etc. lead to errors in the overall performance of reading order determination.
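As a rough illustration of the sensitivity of word-level features to OCR errors, the following minimal Python sketch maps each token to a vocabulary entry or to an unknown marker. The vocabulary, the `<unk>` marker, the sample strings, and the function name `word_features` are all assumptions made for this example and are not drawn from any particular language model.

```python
from typing import List

# Hypothetical word-level vocabulary, assumed for illustration.
VOCAB = {"please", "enter", "your", "name"}


def word_features(text: str) -> List[str]:
    """Map each whitespace-separated token to itself if known, else to '<unk>'."""
    return [w if w in VOCAB else "<unk>" for w in text.lower().split()]


clean = "Please enter your name"
# Assumed OCR output containing a split word ("Pleas e")
# and a missing character ("yur").
noisy = "Pleas e enter yur name"

clean_features = word_features(clean)
noisy_features = word_features(noisy)

print(clean_features)  # ['please', 'enter', 'your', 'name']
print(noisy_features)  # ['<unk>', '<unk>', 'enter', '<unk>', 'name']
```

Under these assumptions, a single split word and a single missing character cause three of the five tokens to fall outside the vocabulary, so most of the word-level context for the line is lost.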
Thus, there exists a significant and unsolved problem in automatically determining the reading order of a document in a robust manner.