Optical Character Recognition (OCR) software was introduced in the 1970s and converts scanned images of printed or typewritten text into computer-readable text using computer software and/or algorithms.
Image preprocessing is often a desirable step before OCR. With an image preprocessing step, the image can be optimized prior to OCR so as to improve OCR efficacy. Typical image preprocessing operations include de-speckle, which removes extraneous noise from the image, de-skew, which straightens tilted images, and binarization, which converts the image from color or gray scale to a black-and-white “binary” image.
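As an illustration of two of the preprocessing operations named above, the following is a minimal sketch in Python. It assumes the image is held as a list of rows of 0-255 grayscale values; the function names, the fixed threshold of 128, and the rule that a speckle is a black pixel with no black neighbors are all assumptions made for this example, not features of any particular OCR package. De-skew is omitted for brevity.

```python
def binarize(gray, threshold=128):
    """Convert grayscale rows (0-255) to binary rows: 1 = black, 0 = white.

    The fixed threshold is an illustrative assumption; practical systems
    often choose it adaptively.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Remove isolated black pixels (no black pixel among the 8 neighbors)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            # Count black pixels in the 3x3 window, excluding the pixel itself.
            black_neighbors = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ) - 1
            if black_neighbors == 0:
                out[y][x] = 0
    return out
```

Binarizing first keeps the later steps simple, since every subsequent test reduces to asking whether a pixel is black or white.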
To convert image-based text into characters, internal OCR software algorithms typically scan images using successive horizontal scans, similar to how a fax machine scans a page or an ink-jet printer prints a page. Each horizontal scan of the width of the page is termed a raster, and OCR software will analyze rasters from the top of the page to the bottom of the page. Margins and vertical white space (leading) are ignored until a raster containing valid image data (often black text pixels versus white background pixels) is detected. Character recognition then begins assembling rows of pixels representing text by accumulating subsequent rasters vertically until vertical white space/leading is again detected, indicating that a line or “row” of text has been found and that the accumulated pixel data is ready for character-by-character processing of the row of text.
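The raster accumulation described above can be sketched as follows. This is a simplified illustration, assuming a binary image stored as a list of rasters in which 1 is a black pixel and 0 is white; the function name find_text_rows is hypothetical, and real OCR engines are considerably more elaborate.

```python
def find_text_rows(binary):
    """Return (top, bottom) raster indices, bottom exclusive, of each text row.

    A text row is a run of consecutive rasters that each contain at least
    one black pixel; blank rasters (margins, leading) separate the rows.
    """
    rows = []
    start = None  # index of the first raster of the row being accumulated
    for y, raster in enumerate(binary):
        has_ink = any(raster)
        if has_ink and start is None:
            start = y                 # first inked raster: a row begins
        elif not has_ink and start is not None:
            rows.append((start, y))   # blank raster: the row is complete
            start = None
    if start is not None:             # row runs to the bottom of the page
        rows.append((start, len(binary)))
    return rows
```

For a page whose rasters are blank, inked, inked, blank, inked, the sketch reports two rows, spanning rasters 1 to 2 and raster 4.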
The recognition process traverses the row of text, segmenting the row into individual components of connected black pixels separated by white space. These components are candidates for character recognition.
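One way to sketch this segmentation step is a breadth-first flood fill over the black pixels of a single row of text. The example assumes the same binary list-of-rasters representation, treats diagonally touching pixels as connected (8-connectivity), and uses the hypothetical name connected_components; it is an illustration of the idea rather than the method of any specific OCR product.

```python
from collections import deque


def connected_components(binary, top, bottom):
    """Bounding boxes of connected black components in rasters [top, bottom).

    Each box is (left, top, right, bottom) with right and bottom exclusive.
    Boxes are returned sorted, so they are ordered left to right.
    """
    width = len(binary[0]) if binary else 0
    seen = set()
    boxes = []
    for y in range(top, bottom):
        for x in range(width):
            if binary[y][x] != 1 or (y, x) in seen:
                continue
            # Flood fill from this unvisited black pixel.
            queue = deque([(y, x)])
            seen.add((y, x))
            min_x = max_x = x
            min_y = max_y = y
            while queue:
                cy, cx = queue.popleft()
                min_x, max_x = min(min_x, cx), max(max_x, cx)
                min_y, max_y = min(min_y, cy), max(max_y, cy)
                for ny in range(max(top, cy - 1), min(bottom, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if binary[ny][nx] == 1 and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((min_x, min_y, max_x + 1, max_y + 1))
    return sorted(boxes)
```

Each returned box is a candidate region that a recognizer would then attempt to classify as a character.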
For each component that is recognized as containing a character, various information is collected such as the alphabet letter, the point-size and other font characteristics of the character, the confidence percentage of the recognition, and the bounding rectangle containing the character.
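The per-character information listed above could be held in a record such as the following. The class name, field names, and the low_confidence helper are hypothetical and chosen only for illustration; actual OCR packages expose this data in their own formats.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RecognizedChar:
    """Illustrative record for one recognized character (names are assumed)."""
    char: str                          # the recognized letter
    point_size: float                  # estimated font size in points
    confidence: float                  # recognition confidence, 0-100 percent
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels
    font_name: Optional[str] = None    # other font characteristics, if known


def low_confidence(chars: List[RecognizedChar], cutoff: float = 80.0):
    """Return the characters whose confidence falls below the cutoff."""
    return [c for c in chars if c.confidence < cutoff]
```

Keeping the bounding rectangle with each character allows later stages to relate recognized text back to its position on the page.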
Ideally, the rows of text in the image are neatly aligned so that any particular raster bisects text uniformly within the row. For example, a raster scanning horizontally might bisect text at its baseline, or at the middle of the character heights, or at the top of the characters consistently along the line of text. In this case, the text is considered vertically aligned.
In some cases, rows of text in the image are not neatly aligned, and a raster might bisect different characters at different heights since the character placement on the image varies. Such is the case in Example 1 where a raster might bisect adjacent characters at differing heights. In this case, the text is considered not vertically aligned.
Example 1: [Figure: the words “The quick brown fox jumps over the lazy dog” printed in a single row, with the words set at differing heights.]
OCR software that scans using the raster approach can experience difficulty with non-vertically aligned text within the same row, producing erroneous OCR results. This can occur when no consistent white space marks the start and end of a line of text across the row, so that the raster can be fractured. Sophisticated algorithms could conceivably be designed as part of the OCR software to accommodate non-vertically aligned text rows, but few if any commercially available OCR software packages have implemented such algorithms, and such remediation would necessarily require new software releases.
Non-vertically aligned row-based text is often observed in forms, spreadsheets and tabular data. These document types may or may not contain graphical cell borders delineating the edges of one or more cells. These documents are often generated on computers having the ability to justify text vertically. For example, a cell containing a single line of text would have the text appear vertically centered, with equal white space above and below the centered text, whereas two lines of text within a table cell would appear vertically distributed, with white space appearing above line one, between lines one and two (in the vertical center of the cell, separating the two lines of text), and below line two. Comparing these two cases, as in Example 2, the vertical center of the cell holds text when the cell contains a single line and holds white space when the cell contains two lines.
Example 2
                                   Aaron Burr (1801-1805)
3. Thomas Jefferson (1801-1809)
                                   George Clinton (1805-1809)
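The effect of this layout on raster-based row detection can be shown with a toy page. The sketch below assumes a binary image stored as a list of rasters (1 = black, 0 = white) and builds an 8 by 8 page whose left cell holds one vertically centered line and whose right cell holds two lines; the names ink_bands and make_page and the pixel dimensions are invented for the illustration.

```python
def ink_bands(binary, left, right):
    """(top, bottom) bands, bottom exclusive, of consecutive rasters that
    contain a black pixel within columns [left, right)."""
    bands, start = [], None
    for y, raster in enumerate(binary):
        has_ink = any(raster[left:right])
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(binary)))
    return bands


def make_page():
    """Toy 8x8 page: left cell has one centered line, right cell has two."""
    page = [[0] * 8 for _ in range(8)]
    for y in (3, 4):            # left cell: single line at rasters 3-4
        for x in range(0, 3):
            page[y][x] = 1
    for y in (1, 2, 5, 6):      # right cell: lines at rasters 1-2 and 5-6
        for x in range(5, 8):
            page[y][x] = 1
    return page


page = make_page()
full_width = ink_bands(page, 0, 8)   # rasters scanned across the whole page
left_cell = ink_bands(page, 0, 4)    # left cell considered on its own
right_cell = ink_bands(page, 4, 8)   # right cell considered on its own
```

Scanned across the full page width, rasters 1 through 6 all contain ink, so the three lines of text merge into a single band, whereas each cell examined on its own yields cleanly separated lines. No blank raster separates the lines when the cells are scanned together, which is the condition under which the difficulties described here arise.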
Most legacy and contemporary OCR software programs have difficulty with this non-vertical alignment of text within rasters. Artifacts generated by the OCR software due to this difficulty can include, but are not limited to, misread text, double-read text and missing text. This can greatly affect the quality and integrity of the OCR-generated output, especially for documents such as forms, spreadsheets and tabular data that contain many occurrences of non-vertically aligned text.
One way to improve the efficacy of OCR for non-vertically aligned text is to preprocess the image(s) so that text is vertically aligned prior to conventional OCR. Accordingly, there is a need in the relevant art for a system and method for preprocessing images so as to vertically align text prior to OCR, thus providing higher quality OCR output.
There is also a need in the art for a system that preprocesses the image so that text is vertically aligned prior to OCR, and that can be implemented either internal to the OCR software or external to it. An external implementation eliminates the need to develop updated OCR software containing this functionality, thus potentially saving cost and time.
There is also a need in the art for a system that can be added to current OCR workflows instead of requiring the entire OCR solution to be replaced or upgraded to add this functionality.
Another need exists in the art for a preprocessing algorithm that vertically aligns raster text and that can be integrated into existing image preprocessing or OCR software, so as to optimize the performance of the preprocessing or OCR process in terms of speed, memory usage and other processing resource demands.
However, in view of the prior art at the time the present invention was made, it was not obvious to those of ordinary skill in the pertinent art how the identified needs could be fulfilled.