1. Field of the Invention
This disclosure relates to a method and computer-implemented system for detecting and correcting real-word errors in Arabic text.
2. Description of the Related Art
The “background” description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventor, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.
Research on spell checking of the Arabic language has increased dramatically in recent years due to the growing demand for Arabic applications that require spell checking and correction facilities. Relatively little Arabic spell checking research has been disclosed on non-word error detection and correction, and even less on real-word error detection and correction.
(Haddad, B., and M. Yaseen, 2007, “Detection and Correction of Non-Words in Arabic: A Hybrid Approach.” International Journal of Computer Processing of Oriental Languages (IJCPOL) 20(4): 237-257, incorporated herein by reference) presented a hybrid model for detecting and correcting non-words in Arabic. Their work was based on semi-isolated word recognition and correction techniques that consider the morphological characteristics of Arabic in the context of morpho-syntactical, morphographemic and phonetic bi-gram binary rules. The hybrid approach utilized morphological knowledge in the form of consistent root-pattern relationships, together with some morpho-syntactical knowledge based on affixation and morphographemic rules, to recognize words and correct non-words.
(Hassan, A, H Hassan, and S Noeman. 2008. “Language Independent Text Correction using Finite State Automata.” Proceedings of the 2008 International Joint Conference on Natural Language Processing (IJCNLP), incorporated herein by reference) proposed an approach for correcting spelling mistakes automatically. Their approach used finite state techniques to detect misspelled words. The dictionary was assumed to be represented as a deterministic finite state automaton. They built a finite state machine (FSM) that contains a path for each word in the input string. The difference between the generated FSM and the dictionary FSM was then calculated, resulting in an FSM with a path for each misspelled word. They created a Levenshtein transducer to generate a set of candidate corrections with edit distances of 1 and 2 from the misspelled word. A confusion matrix was also used to reduce the number of candidate corrections. They selected the best correction by assigning a score to each candidate correction using a language model. Their prototype was tested on a test set composed of 556 misspelled words of edit distances of 1 and 2 in both Arabic and English text, and they reported an accuracy of 89%. However, using finite-state transducer composition to detect and correct misspelled words is time consuming.
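By way of illustration only, the candidate-generation and scoring steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the finite-state implementation of Hassan et al.: it replaces the Levenshtein transducer with a brute-force enumeration of edits, uses a plain dictionary lookup instead of an automaton, and uses unigram counts as a stand-in for the language model. The names `ALPHABET`, `edits1`, `candidates` and `best_correction` are hypothetical, and the toy Latin alphabet would be replaced by the Arabic letters for Arabic text.

```python
from collections import Counter

# Assumption: a toy Latin alphabet; use the Arabic letters for Arabic text.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one Levenshtein edit away from `word`
    (deletion, substitution or insertion of a single character)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    substitutions = [left + c + right[1:]
                     for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + substitutions + inserts)


def candidates(word, dictionary, max_distance=2):
    """Map each dictionary word within `max_distance` edits of `word`
    to its edit distance, using a breadth-first expansion of edits."""
    seen = {word}
    frontier = {word}
    found = {}
    for distance in range(1, max_distance + 1):
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
        for w in frontier:
            if w in dictionary:
                found[w] = distance
    return found


def best_correction(word, unigram_counts, max_distance=2):
    """Pick the candidate with the highest unigram count
    (a stand-in for a real language-model score)."""
    found = candidates(word, unigram_counts.keys(), max_distance)
    if not found:
        return None
    return max(found, key=lambda w: unigram_counts[w])


if __name__ == "__main__":
    counts = Counter({"spelling": 10, "spilling": 2, "selling": 5})
    print(candidates("speling", counts.keys()))
    print(best_correction("speling", counts))
```

The brute-force enumeration grows quickly with word length and alphabet size, which is one reason finite-state or trie-based candidate generation is often preferred in practice.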
(Ben Othmane Zribi, C., Hanene Mejri, and M. Ben Ahmed. 2010. “Combining Methods for Detecting and Correcting Semantic Hidden Errors in Arabic Texts.” Computational Linguistics and Intelligent Text Processing: 634-645, incorporated herein by reference) proposed a method for detecting and correcting semantic hidden errors in Arabic text based on their previous work on a Multi-Agent System (MAS)
(Ben Othmane Z C, Ben Fraj F, Ben Ahmed M. 2005. “A Multi-Agent System for Detecting and Correcting ‘Hidden’ Spelling Errors in Arabic Texts.” In Proceedings of the 2nd International Workshop on Natural Language Understanding and Cognitive Science NLUCS, ed. Bernadette Sharp. INSTICC Press, p. 149-154, incorporated herein by reference). Their technique is based on checking the semantic validity of each word in a text. They combined four statistical and linguistic methods to represent the distance of each word to its surrounding context. These methods are the co-occurrence-collocation method, the context-vector method, the vocabulary-vector method and the Latent Semantic Analysis method. They compared this representation with the ones obtained from a textual corpus made of 30 economic texts (29,332 words). They assumed that there is only one error in each sentence and, on that basis, used a voting method to select one of the suspected errors found by each method. Once an error was detected, all candidate suggestions at a minimum edit distance of one were generated in order to correct the error. A list of all candidates was maintained, and each candidate was substituted for the erroneous word, forming a set of candidate sentences. Sentences with semantic anomalies were eliminated from the list using the detection module of the system. The remaining sentences were then sorted using combined classification criteria, namely typographical distance, proximity value and position of error. The system was tested on a test set of 1,564 words and 50 hidden errors in 100 sentences, and an accuracy of 97.05% was reported. A limitation of their work is the assumption that a sentence can have at most one error. In addition, the corpus used in the training phase is small and the number of errors in testing is limited.
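The voting step mentioned above can be pictured with a short sketch. The cited work does not specify its voting scheme here, so the following assumes a simple plurality vote over the word position that each detection method suspects, with ties broken in favor of the earliest position. The function name `vote_on_error` and the method labels are hypothetical.

```python
from collections import Counter


def vote_on_error(suspects_by_method):
    """Return the word position suspected by the most methods, or None.

    `suspects_by_method` maps a method name to the position it suspects
    (or None if that method flags nothing). Assumption: plurality vote,
    ties broken by the earliest position in the sentence.
    """
    votes = Counter(pos for pos in suspects_by_method.values()
                    if pos is not None)
    if not votes:
        return None
    top = max(votes.values())
    return min(pos for pos, n in votes.items() if n == top)


if __name__ == "__main__":
    suspects = {
        "cooccurrence_collocation": 3,
        "context_vector": 3,
        "vocabulary_vector": 5,
        "latent_semantic_analysis": None,
    }
    print(vote_on_error(suspects))
```

Because the vote returns a single position, a sketch of this kind inherits the one-error-per-sentence assumption noted above as a limitation.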
(Shaalan, K, R Aref, and A Fahmy. 2010. “An Approach for Analyzing and Correcting Spelling Errors for Non-native Arabic learners.” In the Proceedings of the 7th International Conference on Informatics and Systems, INFOS2010, Cairo, p. 53-59, incorporated herein by reference) proposed an approach for detecting and correcting non-word spelling errors made by non-native Arabic learners. They utilized Buckwalter's Arabic morphological analyzer to detect the spelling errors. To correct a misspelled word, they used edit distance techniques in conjunction with a rule-based transformation approach. They applied an edit distance algorithm to generate all possible corrections, and transformation rules to convert the misspelled word into a possible correction. Their rules were based on common spelling mistakes made by Arabic learners. They then applied a multiple filtering mechanism to reduce the lists of proposed corrections. They evaluated their approach using a test set composed of 190 misspelled words. The test set was designed to cover only common errors made by non-native Arabic learners, such as Tanween errors, phonetic errors and Shadda errors. They measured the performance of the system using precision and recall for both spelling error detection and correction, and reported recall above 80% and precision above 90%.
(Alkanhal, Mohamed I., Mohamed A. Al-Badrashiny, Mansour M. Alghamdi, and Abdulaziz O. Al-Qabbany. 2012. “Automatic Stochastic Arabic Spelling Correction with Emphasis on Space Insertions and Deletions.” IEEE Transactions on Audio, Speech, and Language Processing 20(7): 2111-2122, incorporated herein by reference) presented a stochastic technique for correcting misspelled words in Arabic texts, targeting non-word errors. They also considered the problem of space insertion and deletion in Arabic text. Their system consists of two components, one for generating candidates and the other for correcting the spelling error. In the first component, the Damerau-Levenshtein edit distance was used to rank possible candidates for misspelled words. This component also addresses merged and split word errors: it uses an A* lattice search and a 15-gram language model at the letter level to split merged words, and for split words it finds all possible merging choices to produce the correct word. In the second component, they used an A* lattice search and a 3-gram language model at the word level to find the most probable candidate. They reported that their system achieved a 97.9% F1 score for detection and a 92.3% F1 score for correction.
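For reference, ranking candidates by a Damerau-Levenshtein style distance can be sketched as below. This is a minimal sketch and not the implementation of Alkanhal et al.: it computes the restricted variant of the distance (often called optimal string alignment), in which a transposition of two adjacent characters costs one edit but no substring is edited more than once. The names `osa_distance` and `rank_candidates` are hypothetical, and ties are broken alphabetically as an arbitrary assumption.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions and adjacent transpositions,
    each with cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def rank_candidates(word, vocabulary):
    """Sort vocabulary entries by distance to `word`, then alphabetically."""
    return sorted(vocabulary, key=lambda v: (osa_distance(word, v), v))


if __name__ == "__main__":
    print(osa_distance("abcd", "abdc"))
    print(rank_candidates("teh", ["ten", "the", "tea", "other"]))
```

The function operates on any Python string, so Arabic words can be passed directly; in the cited system the ranked candidates would then be rescored by the word-level language model rather than used as the final answer.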
(Ben Othmane Zribi, C., and M. Ben Ahmed. 2012. “Detection of semantic errors in Arabic texts.” Artificial Intelligence 1: 1-16, incorporated herein by reference) proposed an approach for detecting and correcting real-word errors by combining four contextual methods. They used statistical and linguistic information to check whether a word is semantically valid in a sentence. They implemented their approach on a distributed architecture, with reported precision and recall rates of 90% and 83%, respectively. They focused only on errors that cause total semantic inconsistencies; this can be considered a limitation, as they ignored partial semantic inconsistencies and semantic incompleteness errors. In addition, they assumed that a sentence can have at most one error. Moreover, the corpus used is relatively small (1,134,632 words) and contains only economics articles (i.e., no variation in topics).
To address the deficiencies of conventional spell checking and correction for Arabic text, a method and system are disclosed that detect and correct real-word errors automatically, using N-gram language models and context words to detect spelling errors. Two techniques for addressing real-word errors are disclosed: an unsupervised learning technique and a supervised learning technique.
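The general idea of using N-gram statistics to catch a real-word error can be illustrated with a toy sketch. It is offered only as an illustration of the concept and should not be read as the disclosed method: it assumes a bigram model built from raw counts, hand-written confusion sets of easily confused words, and a greedy left-to-right pass, none of which is taken from this disclosure. All names (`train_bigrams`, `context_score`, `correct_real_word_errors`) are hypothetical, and English words are used purely for readability.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count word bigrams, padding each sentence with boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def context_score(bigrams, left, word, right):
    """Score a word by how often it was seen next to its two neighbors."""
    return bigrams[(left, word)] + bigrams[(word, right)]


def correct_real_word_errors(sentence, bigrams, confusion_sets):
    """Replace a word with a confusable alternative when the alternative
    fits the surrounding context strictly better (greedy, left to right)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(tokens) - 1):
        best = tokens[i]
        best_score = context_score(bigrams, tokens[i - 1], best, tokens[i + 1])
        for alternative in confusion_sets.get(tokens[i], ()):
            score = context_score(bigrams, tokens[i - 1], alternative,
                                  tokens[i + 1])
            if score > best_score:
                best, best_score = alternative, score
        tokens[i] = best
    return " ".join(tokens[1:-1])


if __name__ == "__main__":
    corpus = ["a piece of cake", "peace in the region", "a piece of paper"]
    bigrams = train_bigrams(corpus)
    confusion = {"peace": ["piece"], "piece": ["peace"]}
    print(correct_real_word_errors("a peace of cake", bigrams, confusion))
```

A practical system would use smoothed probabilities from a large corpus rather than raw counts, since unseen but valid word pairs would otherwise be treated the same as genuine errors.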