1. Field of Disclosure
The disclosure generally relates to the field of optical character recognition (OCR), and in particular to determining the script and orientation of text.
2. Description of the Related Art
To accurately recognize the text in an image, optical character recognition (OCR) modules often rely on a great deal of prior knowledge, such as the shapes of characters, lists of words, and the frequencies and patterns with which they occur. Much of this knowledge is language-specific. The language of the text contained in an image is therefore a crucial input parameter to specify when using an OCR algorithm.
Thus, it is desirable to identify the language of text that is subjected to an OCR process. In some situations, however, the language of the text is not known a priori and must be determined from the text itself. For example, the language of the text may be unknown when the text is drawn from a large corpus containing documents in a variety of different languages.
Often, it is desirable to identify the script in which the text appears as a prelude to identifying the language. Identifying the script can limit the set of potential languages, and thus aid in machine-performed language identification. For example, if the script is identified as Cyrillic, then the possible languages of the text include Russian, Bulgarian and Ukrainian. If the script is identified as Latin, then the language is likely one of the at least 26 languages that use the Latin script. Some scripts, including many Indic scripts such as Telugu, Kannada and Tamil, have only one language associated with them, so identifying the script inherently identifies the language of the text. Thus, determining the script in which a text is written either indicates the language of the text or simplifies the language determination process.
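By way of illustration only, the narrowing described above can be sketched as a simple lookup from an identified script to its candidate languages. The table name `SCRIPT_TO_LANGUAGES` and both function names are hypothetical and are not part of the disclosure. The table lists only the languages named above and is deliberately incomplete; Latin is omitted because its languages are not enumerated here.

```python
# Hypothetical lookup table: identified script -> candidate languages.
# Entries are limited to the examples named in the text and are not exhaustive.
SCRIPT_TO_LANGUAGES = {
    "Cyrillic": ["Russian", "Bulgarian", "Ukrainian"],
    "Telugu": ["Telugu"],
    "Kannada": ["Kannada"],
    "Tamil": ["Tamil"],
}


def candidate_languages(script):
    """Return the candidate languages for a script (empty list if not in the table)."""
    return list(SCRIPT_TO_LANGUAGES.get(script, []))


def language_if_unambiguous(script):
    """Return the language when the script maps to exactly one language, else None."""
    candidates = candidate_languages(script)
    return candidates[0] if len(candidates) == 1 else None
```

Under these assumptions, a script such as Telugu resolves directly to a single language, whereas Cyrillic yields a reduced candidate set that a separate language identification step would still need to resolve.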