I. Field of the Invention
This disclosure relates generally to optical character recognition (OCR), and more particularly to lower modifier (also referred to as lower maatra) detection and extraction in Devanagari script based languages.
II. Background
Most North Indic scripts (e.g., the Devanagari script, also called Nāgarī, which is used in India and Nepal) are written from left to right, do not have distinct letter cases, and are recognizable by a horizontal line that runs along the top of letters. Devanagari script is commonly used to write standard Hindi, Marathi, Nepali and Sanskrit. The Devanagari script may be used for many other languages as well, including Bhojpuri, Gujari, Pahari, Garhwali, Kumaoni, Konkani, Magahi, Maithili, Marwari, Bhili, Newari, Santhali, Tharu, Sindhi, Dogri and Sherpa.
To explain the fundamental principles of many North Indic scripts, Devanagari is used as an example. In Devanagari, a character is positioned in a core zone (a center horizontal strip) and may extend to an upper zone 110 above the core zone and/or a lower zone 130 below the core zone. A letter or character may occupy just the core zone, both the core zone and the upper zone 110, both the core zone and the lower zone 130, or all three zones. Some base letters represent a standalone vowel. Other base letters represent a consonant and carry an inherent ‘a’ vowel sound. A different vowel sound, as well as an absence of a vowel sound, requires modification of these base consonants (having an inherent ‘a’) or requires a separate letter. Vowels other than the inherent ‘a’ are written with diacritics (also termed top or upper modifiers if positioned above the core zone, bottom or lower modifiers if positioned below the core zone, diacritical marks, diacritical points, or diacritical signs) placed either below or above the consonant. A horizontal line (sometimes referred to as a headline, or a Shirorekha in Devanagari) delineates the top of an unmodified consonant and often joins a word together with a single headline per word. In some cases, a headline is broken or disjointed in a word. In other cases, a headline is unbroken across the length of the word. To cancel the inherent vowel of a final consonant, a mark (sometimes called a virāma, halant or “killer stroke”) is written below the consonant.
In some North Indic scripts, a full-letter form is used to represent a vowel sound that is unattached to a consonant. Two to five consonants may be concatenated or otherwise combined to form a compound character, for example, using accent marks placed above, below, or to a side of the base consonant, or using abbreviated consonant symbols. When applied to North Indic scripts in general, these modifiers (e.g., upper and lower modifiers) add a great deal of complexity to the script because of their large variety. In fact, over 1000 character combinations and contractions are possible. Currently, OCR systems have difficulty parsing such a complex set of character variations, especially distinguishing between a lower modifier and a stroke of a consonant that protrudes into the lower zone 130.
What is needed is an improved method of performing OCR on characters having a headline as well as a possible modifier, for example, below a core zone of a character, to extract both the manner and placement of articulations of consonants.