Text enhancement systems are used in the area of human language technology (HLT), where manual correction of text is time-consuming and creates a bottleneck in HLT applications. Systems in HLT, e.g., document understanding systems and speech recognition systems, depend on reliable automatic misspelling correction capabilities. Although spell checkers are widely available for a number of languages, most spell checkers only detect errors and propose corrections without regard to the context of the misspelled word, which increases ambiguity and produces incorrect suggestions. Also, the available systems are unable to detect and correct all kinds of errors and are subject to other constraints.
An automatic misspelling correction process includes three main components: (i) an error detection process; (ii) a candidates generating process; and (iii) a best candidate selection process. The most common types of errors can be separated into the following categories based on detection and correction difficulty:
(i) 1st order errors, in which a correctly spelled word undergoes one or more letter insertions, deletions, substitutions, or transpositions, resulting in a non-word (i.e., a word that does not follow the morphological rules of the target language and is not included in a dictionary of the target language), e.g., the Arabic word “Yaktoboha”, which means “He writes it”, becomes the non-word “Yatobota” after the deletion and substitution of some letters;
(ii) 2nd order errors, in which correctly spelled words undergo one or more space insertions or deletions, resulting in non-word(s), e.g., the Arabic phrase “Bareeq Althahab”, which means “The glitter of gold”, becomes the non-word “Bareeqalthahab” after the space deletion;
(iii) 3rd order errors, which are similar to the 1st order errors except that the error(s) result in another correctly spelled word, e.g., the Arabic word “Yashrab”, which means “He drinks”, becomes the correctly spelled word “Bishurb”, which means “By drinking”;
(iv) 4th order errors, which are similar to the 2nd order errors except that the error(s) result in correctly spelled words, e.g., the Arabic word “Mutatawir”, which means “Advanced”, becomes the correctly spelled phrase “Mot Tawar”, which means “Die Developed”;
(v) 5th order errors, which are 1st order errors followed by space insertions or space deletions that result in other non-word(s), e.g., the original Arabic word “Yaktoboha”, which means “He writes it”, becomes the non-word “Yatobota” by a 1st order error, and then becomes “Yato Bota” by space insertion; and
(vi) 6th order errors, which are 1st order errors followed by space insertions or space deletions that result in other correctly spelled word(s), e.g., the original Arabic word “Yaktoboha”, which means “He writes it”, becomes the non-word “Saktoboha” by a 1st order error, and then becomes “Sakata Beha”, which means “Stopped-talking With-it”, by space insertion.
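The six categories above vary along three properties: whether letters were edited, whether spaces were edited, and whether the result consists of correctly spelled words. The following Python sketch restates the taxonomy as a lookup on those three properties. It is illustrative only; the names `Corruption` and `error_order` are hypothetical and do not come from any existing system.

```python
from typing import NamedTuple


class Corruption(NamedTuple):
    """Hypothetical description of how a correct text was corrupted."""
    letter_edits: bool   # letter insertion, deletion, substitution, or transposition
    space_edits: bool    # space insertion or deletion
    yields_words: bool   # True if every resulting token is correctly spelled


def error_order(c: Corruption) -> int:
    """Map a corruption onto the 1st..6th order categories listed above."""
    if not (c.letter_edits or c.space_edits):
        raise ValueError("text was not corrupted")
    if c.letter_edits and c.space_edits:
        return 6 if c.yields_words else 5
    if c.space_edits:
        return 4 if c.yields_words else 2
    return 3 if c.yields_words else 1


# "Yaktoboha" -> "Yatobota": letter edits only, result is a non-word.
print(error_order(Corruption(True, False, False)))   # 1
# "Mutatawir" -> "Mot Tawar": space insertion, result is two valid words.
print(error_order(Corruption(False, True, True)))    # 4
```

The odd/even split of the first four orders and the 5th/6th pair both fall out of the `yields_words` flag, which is the property that makes an error hard to detect.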
Two common methods for error detection are a rule based method and a dictionary based method. The rule based method depends on morphological analyzers to check whether a word follows the morphological rules of the language. The dictionary based method depends on a large, balanced, and revised training corpus to generate a dictionary that covers the most frequently used words in the target language. The rule based method has better coverage of possible words, but the morphological analysis process adversely affects system performance and cannot manage transliterated words. For example, the English word “computer” is now a common word in Arabic in its transliterated form; however, that transliterated form is a non-word from the point of view of Arabic morphological analyzers. In contrast, the dictionary based method considers the word “computer” in Arabic to be a correctly spelled word, since it is a frequently used word in the training corpus.
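As an illustration of the dictionary based method, the following Python sketch builds a dictionary from the frequently occurring tokens of a training corpus and flags any token outside it as a non-word. The function names, the `min_count` threshold, and the tiny corpus are assumptions made for this example, and the tokens are Latin transliterations standing in for Arabic script.

```python
from collections import Counter


def build_dictionary(corpus_tokens, min_count=2):
    """Keep tokens that occur at least min_count times in the training corpus.

    min_count is an illustrative threshold, not a recommended value.
    """
    counts = Counter(corpus_tokens)
    return {word for word, n in counts.items() if n >= min_count}


def detect_non_words(tokens, dictionary):
    """Return the tokens that are absent from the dictionary."""
    return [t for t in tokens if t not in dictionary]


# Toy corpus: "computer" stands for the transliterated loanword, which is
# frequent in the corpus and therefore accepted as correctly spelled.
corpus = ["yaktoboha", "computer", "yaktoboha", "computer",
          "yashrab", "yashrab", "yatobota"]
dictionary = build_dictionary(corpus)

print(detect_non_words(["yashrab", "computer", "yatobota"], dictionary))
# ['yatobota']
```

A lookup of this kind only detects non-words. An error that yields another correctly spelled word passes the check unchanged.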
The 3rd, 4th, and 6th order errors are also known as semantic hidden errors because they are correctly spelled words that cause semantic irregularities in their contexts. The error detection process is responsible for detecting the misspelled words, whether they are non-words or semantic hidden errors. Detecting semantic hidden errors is more difficult than detecting non-words.
Techniques used to detect the semantic hidden errors include semantic distance, confusion set, and neural network techniques. The semantic distance technique is based on comparing the semantics of a word with those of the surrounding words. However, this approach faces other HLT challenges, such as word sense disambiguation. The confusion set technique relies on a dictionary of words that occur together, but it requires a large dictionary, which increases computational complexity. The neural network technique detects non-words, but faces challenges such as detecting space insertions and deletions.
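One simple way to picture a confusion set check is sketched below in Python: a word is flagged when another member of its confusion set co-occurs more often with the surrounding words. This is a minimal sketch under stated assumptions, not a description of any particular system; the function names, the window size, and the English toy data are all hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, neighbor) pairs within +/- window positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def context_score(word, context, counts):
    """Sum how often word was seen next to each context word."""
    return sum(counts[(word, c)] for c in context)


def flag_hidden_error(tokens, i, confusion_set, counts, window=2):
    """Return a better-fitting member of confusion_set, or None."""
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
    best = max(confusion_set, key=lambda w: context_score(w, context, counts))
    if best != tokens[i] and (context_score(best, context, counts)
                              > context_score(tokens[i], context, counts)):
        return best
    return None


training = [["a", "piece", "of", "cake"], ["a", "piece", "of", "paper"],
            ["war", "and", "peace"], ["peace", "and", "quiet"]]
counts = cooccurrence_counts(training)

print(flag_hidden_error(["a", "peace", "of", "cake"], 1, {"piece", "peace"}, counts))
# piece
```

The co-occurrence table grows with the square of the vocabulary in the worst case, which illustrates why the dictionary size becomes a computational burden.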
The candidates generating process is responsible for finding the most probable candidates for the misspelled words. Edit distance is a commonly used method for this process. The best candidate selection process is responsible for determining and selecting one or more solutions to correct the misspelled word(s). Most systems stop at the candidates generating process and do not include a best candidate selection process. However, the systems that do attempt to provide solutions have constraints and rely on assumptions to perform automatic misspelling detection and correction.
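As an illustration of edit distance based candidate generation, the following Python sketch computes the optimal string alignment distance, a restricted form of Damerau-Levenshtein distance that counts insertions, deletions, substitutions, and adjacent transpositions. It then returns the dictionary words within a distance threshold. The function names, the threshold, and the transliterated toy dictionary are assumptions for this example. A practical system would avoid scanning the whole dictionary.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def generate_candidates(word, dictionary, max_distance=2):
    """Return (distance, candidate) pairs within max_distance, nearest first."""
    scored = ((osa_distance(word, w), w) for w in dictionary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


dictionary = {"yaktoboha", "yashrab", "bareeq"}
print(generate_candidates("yatobota", dictionary))
# [(2, 'yaktoboha')]
```

The returned list is only a ranked set of candidates. Choosing among them is the job of the best candidate selection process.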
Accordingly, there exists a need in the art to overcome the deficiencies and limitations described hereinabove.