Many companies and individuals are inundated with large numbers of documents (e.g., thousands of documents) that must be processed, analyzed, and transformed in order to carry out day-to-day operations. Some examples of such documents may include receipts, invoices, forms, statements, contracts, and many other pieces of unstructured data. It may be important to be able to quickly understand the information embedded within the unstructured data in these documents.
The extraction of text from images may be thought of as a two-step problem: text localization followed by text recognition. In the first step, a model may identify which areas of an image correspond to text. The second step may then involve recognizing text (predicting the character sequence) for each of those image segments. The problem of text localization may share many features with the more general task of object detection.
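As a non-limiting illustration, the two-step structure described above may be sketched in Python as follows. The names `TextRegion`, `crop`, `extract_text`, `toy_localizer`, and `toy_recognizer` are hypothetical and are introduced only for this sketch; the toy functions stand in for trained localization and recognition models and do not describe any particular system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A box is (left, top, right, bottom) in pixel coordinates; right and bottom are exclusive.
Box = Tuple[int, int, int, int]
Image = Sequence[Sequence[int]]  # a grayscale image stored as rows of pixel values


@dataclass
class TextRegion:
    box: Box   # where the text was found (output of step 1)
    text: str  # the predicted character sequence (output of step 2)


def crop(image: Image, box: Box) -> List[List[int]]:
    """Return the image segment covered by `box`."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def extract_text(
    image: Image,
    localize: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[TextRegion]:
    """Step 1: localize text boxes. Step 2: recognize the text in each segment."""
    return [TextRegion(box, recognize(crop(image, box))) for box in localize(image)]


# Toy stand-ins for trained models (assumptions for illustration only).
def toy_localizer(image: Image) -> List[Box]:
    # Pretends that two text areas were detected at fixed positions.
    return [(0, 0, 2, 1), (1, 1, 3, 2)]


def toy_recognizer(segment: Image) -> str:
    # Pretends that each pixel value is the code point of one character.
    return "".join(chr(pixel) for row in segment for pixel in row)


image = [[72, 105, 0], [0, 79, 75]]
regions = extract_text(image, toy_localizer, toy_recognizer)
```

In this sketch the localizer and the recognizer are passed in as separate callables, which mirrors the separation between the two steps: either model may be replaced without changing the other.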
The challenge of extracting text from images of documents has traditionally been referred to as Optical Character Recognition (OCR). When documents are clearly laid out and have a global structure (for example, a business letter), existing tools for OCR may perform well.
There are, however, many use cases that may be referred to as non-traditional OCR. One such non-traditional OCR use case may include detecting arbitrary text from images of natural scenes. Problems of this nature may be formalized in the COCO-Text challenge in which the goal is to extract text that might be included in road signs, house numbers, advertisements, and so on.
Another area that may present a similar challenge is text extraction from images of complex documents. In contrast to documents with a global layout (such as a letter, a page from a book, or a column from a newspaper), many types of documents (hereinafter called “complex documents”) are relatively unstructured in their layout and have text elements scattered throughout (such as, for example, receipts, forms, and invoices). Furthermore, text extraction from complex documents may present different problems than traditional OCR and text extraction from natural scenes. For instance, in contrast to documents handled by traditional OCR, complex documents may not be laid out in a clear fashion; and in contrast to natural scenes, complex documents may not have a small number of relatively large text boxes to be extracted from images and video. Specifically, text extraction from complex documents may require detecting a large number of relatively small text objects in an image. Further, the text objects may be characterized by a large variety of lengths, sizes, and orientations.
Problems in text extraction from complex documents have recently been formalized in the ICDAR DeTEXT Text Extraction from Biomedical Literature Figures challenge. Images of complex documents are characterized by complex arrangements of text bodies scattered throughout a document and surrounded by many “distraction” objects. In these images, a primary challenge lies in properly segmenting objects in an image to identify reasonable text blocks.
Collectively, these regimes of non-traditional OCR pose unique challenges. Some challenges may include background/text separation, font-size variation, coloration, text orientation, text length diversity, font diversity, distraction objects, and occlusions.
FIG. 1A depicts exemplary input images 100 and 105 of image recognition systems. In the images 100/105, the challenge may be to detect text objects 100A/105A as separate from background pixels and other distractions 100B/105B.
FIG. 1B depicts an exemplary output image of image recognition systems. Specifically, image 110 is an output of Mask R-CNN, which is an object detection algorithm. Mask R-CNN is an example of a multi-task network: from a single input (e.g., an image), the model must predict multiple kinds of outputs. Specifically, Mask R-CNN is split into three “heads,” where a first head is concerned with proposing bounding boxes that likely contain objects of interest, a second head is concerned with classifying which type of object is contained within each box, and a third head predicts a high-quality segmentation mask for each box. Importantly, all three of the heads rely upon a shared representation that is calculated from a deep convolutional backbone model, such as a residual neural network (ResNet). Furthermore, Mask R-CNN uses a pooling mechanism called RoIAlign. Models that preceded Mask R-CNN relied on pooling mechanisms (e.g., RoIPool) that estimate boundary values less accurately, which adds too much noise for a segmentation mask to be predicted reliably. To overcome this, RoIAlign uses interpolation methods to accurately align feature maps with input pixels. For instance, RoIPool may divide large resolution feature maps into smaller feature maps by quantization, thereby creating misalignment on boundaries (e.g., because of a rounding operation). RoIAlign may avoid the misalignment problem, but RoIAlign may still not retain high spatial resolution information (e.g., because RoIAlign may calculate values of sample locations directly through bilinear interpolation), and high spatial resolution information may be needed for high accuracy text recognition.
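As a non-limiting illustration, the difference between a quantized lookup and a bilinear lookup on a feature map may be sketched in Python as follows. This is a simplified sketch and not an implementation of RoIPool or RoIAlign: it samples a single point, whereas those mechanisms pool several values per output bin. The function names `quantized_sample` and `bilinear_sample`, the toy feature map, and the choice of flooring as the quantization step are assumptions made for this example, and the coordinates are assumed to lie within the feature map.

```python
import math
from typing import Sequence

FeatureMap = Sequence[Sequence[float]]


def quantized_sample(fmap: FeatureMap, y: float, x: float) -> float:
    """Quantized lookup: snap the continuous coordinate to one cell by flooring."""
    return fmap[math.floor(y)][math.floor(x)]


def bilinear_sample(fmap: FeatureMap, y: float, x: float) -> float:
    """Bilinear lookup: blend the four cells that surround (y, x)."""
    y0, x0 = math.floor(y), math.floor(x)
    # Clamp the neighbor indices so that points on the last row or column stay valid.
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    wy, wx = y - y0, x - x0  # fractional offsets inside the cell
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom


# Toy 4x4 feature map in which every value equals its column index.
fmap = [[float(col) for col in range(4)] for _ in range(4)]

q = quantized_sample(fmap, 1.5, 2.6)  # reads the cell at row 1, column 2
b = bilinear_sample(fmap, 1.5, 2.6)   # interpolates between columns 2 and 3
```

On this toy feature map the quantized lookup returns the value of column 2 and discards the fractional offset of 0.6, whereas the bilinear lookup returns a value of approximately 2.6. The discarded offset corresponds to the boundary misalignment described above.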
In the image 110, Mask R-CNN tries to accomplish three things: object detection (indicated by boxes 110A), object classification (indicated by text-string 110C), and segmentation (indicated by regions 110B).
The present disclosure is directed to overcoming one or more of the above-referenced challenges. The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art, or suggestions of the prior art, by inclusion in this section.