The subject application relates to vectorization of text in a scanned document. While the systems and methods described herein relate to vectorization of text characters in a scanned document, it will be appreciated that the described techniques may find application in other image classification systems, other xerographic applications, and/or other document analysis systems.
The shapes of text characters are usually represented in bitmap or outline (vector) form. In the latter representation, a character is specified by a set of curves describing its outline, which has the advantage of resolution independence. Outline (vector) fonts are used extensively in electronically created files. However, they are not native to scanned documents. In scanned document images, the text is obtained as bitmaps.
Vectorization of text specifies text characters with sets of curves. Compared with a more traditional bitmap, vectorization of text in a scanned document generates a resolution-independent representation. It has the following advantages: 1) text is smooth instead of jagged and bumpy as in a bitmap; 2) image quality is better when scaling and/or printing on devices with different output resolutions (desirable for multi-function devices and important for mobile devices); and 3) the shape of the text can be edited using standard graphics tools such as Adobe Illustrator, which enables easy modification of font attributes (size, boldness, etc.) for repurposing.
Typically, text in a scanned document is stored as a bitmap with binary values (e.g., 0 and 1), each corresponding to a white or black pixel color value. Vector representation is used for electronically generated text because it is resolution independent, whereas a bitmap is not. Additionally, vector representations are more easily manipulated (e.g., bolded, etc.) than bitmaps.
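The contrast between the two representations can be illustrated with a minimal sketch. It is not taken from the subject application; the function names `scale_outline` and `scale_bitmap` are hypothetical, the outline is simplified to straight-line vertices rather than curves, and the bitmap is enlarged by plain pixel replication. Scaling the outline only multiplies coordinates, so the shape stays exact at any size, whereas scaling the bitmap enlarges every pixel, including the stair-steps along slanted edges.

```python
def scale_outline(points, factor):
    """Scale an outline given as (x, y) vertices; the geometry stays exact."""
    return [(x * factor, y * factor) for x, y in points]


def scale_bitmap(bitmap, factor):
    """Enlarge a binary bitmap by an integer factor using pixel replication.

    Each source pixel becomes a factor-by-factor block, so any jagged
    edge in the source is magnified rather than smoothed.
    """
    scaled = []
    for row in bitmap:
        # Repeat every pixel horizontally ...
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # ... then repeat the widened row vertically.
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


# A right triangle as an outline and as a 2x2 bitmap (1 = black, 0 = white).
triangle_outline = [(0, 0), (2, 0), (0, 2)]
triangle_bitmap = [
    [1, 0],
    [1, 1],
]

big_outline = scale_outline(triangle_outline, 3)
big_bitmap = scale_bitmap(triangle_bitmap, 2)
```

In this sketch the scaled outline still describes a triangle with a straight hypotenuse, while the scaled bitmap contains a single step that is now two pixels wide and two pixels tall.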
To achieve high-quality text vectorization, dominant point detection is a critical step. Conventional algorithms were originally designed for graphical objects with a high signal-to-noise ratio, and are not accurate for text, particularly small text, which has a low signal-to-noise ratio.
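To make the notion of a dominant point concrete, the following is a minimal sketch of a generic angle-based heuristic. It is neither the method of the subject application nor any specific conventional algorithm; the function name `dominant_points` and the parameters `k` and `max_angle_deg` are assumptions chosen for illustration. For each point on a closed contour, it measures the angle formed with the points `k` steps before and after it, and keeps the points where that angle is sharp.

```python
import math


def dominant_points(contour, k=2, max_angle_deg=120.0):
    """Return indices of contour points whose k-neighbor angle is sharp.

    contour: list of (x, y) points tracing a closed outline in order.
    k: how many steps away the two neighbors are taken (assumed parameter).
    max_angle_deg: angles at or below this value count as dominant.
    """
    n = len(contour)
    selected = []
    for i in range(n):
        px, py = contour[i]
        ax, ay = contour[(i - k) % n]  # neighbor k steps back (wraps around)
        bx, by = contour[(i + k) % n]  # neighbor k steps ahead (wraps around)
        v1 = (ax - px, ay - py)
        v2 = (bx - px, by - py)
        norm = math.hypot(*v1) * math.hypot(*v2)
        if norm == 0:
            continue  # degenerate case: a neighbor coincides with the point
        cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        if math.degrees(math.acos(cos_angle)) <= max_angle_deg:
            selected.append(i)
    return selected


# A 4x4 square traced pixel by pixel, starting at (0, 0).
square = (
    [(x, 0) for x in range(0, 4)]
    + [(4, y) for y in range(0, 4)]
    + [(x, 4) for x in range(4, 0, -1)]
    + [(0, y) for y in range(4, 0, -1)]
)

print(dominant_points(square))  # [0, 4, 8, 12], the four corners
```

On a clean shape such as this square, the heuristic recovers the corners. On a small scanned character, a one-pixel bump along an edge can produce an angle as sharp as a true corner, which is one way to picture the low signal-to-noise problem described above.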
Accordingly, there is an unmet need for systems and/or methods that facilitate dominant point detection and vectorization while overcoming the aforementioned deficiencies.