The Internet is witnessing explosive growth in the field of information search and retrieval. When retrieval systems and search engines were first introduced, they were directed at academic professionals seeking knowledge purely in English. Thus, only English language skills were needed to form meaningful queries and to read and analyze search results. The Internet evolved from serving only academia to serving other fields such as tourism, politics, industry, medicine, and commerce, in addition to the general population. The following are examples of some of these services:
Medicine: retrieving information about a specific disease and its possible treatments, especially with the appearance of rare and new diseases in specific regions.
Research: searching for all publications on a specific topic.
Environmental: checking weather forecasts and analyses.
Tourism: viewing historical tourism information on countries and regions.
Political: analyzing the news.
Commerce: monitoring stock market activities.
Homeland Security: checking suspects' records online, and tracking suspects' e-mails and connections.
The Internet's rapid growth extended its reach to many non-English-speaking nations.
Arabic is the seventh most widely spoken language in the world and the official language of more than 29 countries. In addition, native Arabic speakers are scattered all over the world.
According to Internet World Stats, there are 59,853,630 Arabic-speaking people using the Internet, representing 4.1% of all Internet users in the world. Additionally, the number of Arabic-speaking Internet users grew by 2,063.7% over the last eight years (2000-2008).
Due to the complicated morphological structure of the Arabic language, text processing is harder to perform than in other languages. Text processing is the main step shared among Information Retrieval (IR), text mining, natural language processing, and many other applications. Compared to other languages, the efforts to improve Arabic information search and retrieval have been limited and modest; thus there is an urgent need for effective Arabic information search and retrieval tools.
Text processing (or document processing) is a process that includes tokenization, normalization, stop-word removal, and stemming. Tokenization breaks the text into tokens (i.e., words), while normalization transforms text to ensure consistency; examples include converting uppercase letters to lowercase, Unicode conversion, and removing diacritics, punctuation, or numbers. Stop words (or stopwords), collected in stop lists (or stoplists), are words that are filtered out prior to, or after, the processing of text based on their level of usefulness in a given context. Finally, stemming is a computational process for reducing words to their roots (or stems).
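The four steps above can be sketched as a minimal pipeline. This is an illustrative toy, not the invention's method: the stop-word list and the suffix-stripping rule are small hypothetical stand-ins for real resources.

```python
import re

# Toy stop-word list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "to"}

def tokenize(text):
    # Break the text into word tokens; lowercasing here is the
    # normalization step for English.
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    # Filter out words of low usefulness in this context.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Naive suffix stripping, standing in for a real stemmer.
    for suffix in ("ers", "ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text):
    # Full pipeline: tokenize -> normalize -> remove stop words -> stem.
    return [stem(t) for t in remove_stopwords(tokenize(text))]

print(process("The drinkers are drinking."))
```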
Stemming can be viewed as either a recall-enhancing device or a precision-enhancing device.
Stemmers are basic elements in query systems, indexing, web search engines and information retrieval systems (IRS).
Arabic is a Semitic language with a composite morphology. Words are categorized as particles, nouns, or verbs. There are 29 letters in Arabic, and words are formed by linking letters of the alphabet.
Table 1 below shows a list of Arabic letters.
TABLE 1
Alif, Baa, Taa, Thaa, Jeem, Haa,
Kha, Daal, Thaal, Raa, Zaay, Seen,
Sheen, Saad, Daad, Taa, Thaa, Ayn,
Ghayn, Faa, Qaaf, Kaaf, Laam, Meem,
Noon, Ha, Waaw, Yaa, Hamza
Unlike most Western languages, Arabic script is written from right to left. The letters are connected, and words do not begin with a capital letter as in English. Due to these unique characteristics of the Arabic language, one particularly challenging task for machines is recognizing and extracting proper nouns from Arabic texts.
In English, words are formed by attaching prefixes and suffixes to either or both sides of the root. For example, the word Untouchables is formed as follows:
Un (prefix) + touch (root) + able (first suffix) + s (second suffix)
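The decomposition above can be reproduced with a short affix-stripping sketch. The one-entry prefix table and two-entry suffix table are hypothetical illustrations, sufficient only for this example word.

```python
# Toy affix inventories for the "Untouchables" example only.
PREFIXES = ("un",)
SUFFIXES = ("s", "able")  # outermost suffix ("s") is stripped first

def decompose(word):
    # Split a word into (prefix, root, [suffixes]) by peeling affixes.
    word = word.lower()
    prefix = ""
    for p in PREFIXES:
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            break
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, s)   # keep suffixes in root-outward order
                word = word[: -len(s)]
                changed = True
                break
    return prefix, word, suffixes

print(decompose("Untouchables"))  # -> ('un', 'touch', ['able', 's'])
```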
In Arabic, additions to the root can occur within the root itself (not only at the word's edges); such an insertion is called a pattern. This poses a serious issue in stemming Arabic documents because it is hard to differentiate between root particles and affix particles.
Table 2 below displays an example of the Arabic word for “drinker” and its stems with the common prefixes and suffixes.
TABLE 2
Word structure: Prefixes + Stem (Root + Pattern) + Suffixes
Root: (drink)
Prefix: (the)
Stem: (drinker)
Suffixes: dual, plural, feminine
Resulting forms: the drinkers (dual); the drinkers (plural); the drinker (masculine); the drinker (feminine)
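The forms in Table 2 can be reduced by a light-stemming sketch that strips the definite-article prefix and common number/gender suffixes. The affix inventory here is a small illustrative subset chosen for this example, not a complete stemmer.

```python
# Illustrative affix subset: the definite article "al-" (ال) and the
# masculine-plural (ون), dual (ان), and feminine (ة) suffixes.
PREFIXES = ("ال",)
SUFFIXES = ("ون", "ان", "ة")

def light_stem(word):
    # Strip at most one prefix and one suffix, keeping a minimum stem length.
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            word = word[: -len(s)]
            break
    return word

# "the drinkers (plural)" reduces to the stem "drinker".
print(light_stem("الشاربون"))  # -> شارب
```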
Mis-stemming is defined as “taking off what looks like an ending, but is really part of the stem,” and over-stemming is “taking off a true ending which results in the conflation of words of different meanings”.
Arabic stemmers blindly stem all words and perform poorly, especially with compound words, proper nouns, and foreign Arabized words. The main cause of this problem is the stemmer's lack of knowledge of the word's lexical category (i.e., noun, verb, preposition, etc.).
A possible solution for this problem is to add a lookup dictionary to check the roots. Although this solution seems straightforward, the process is computationally expensive. It has been estimated that there are around 10,000 independent roots, and each root can take prefixes, suffixes, infixes, and regular and irregular tenses.
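The dictionary check itself is simple, as the sketch below shows; the three-entry ROOTS set is an illustrative stand-in for the roughly 10,000 independent roots mentioned above, and the cost lies in building and consulting the full inventory of roots and their inflected forms.

```python
# Toy root dictionary: write (كتب), drink (شرب), study (درس).
# A real dictionary would hold on the order of 10,000 roots.
ROOTS = {"كتب", "شرب", "درس"}

def is_valid_root(candidate):
    # Accept a stemmer's output only if it is a known root.
    return candidate in ROOTS

print(is_valid_root("شرب"))  # a real root
print(is_valid_root("شاب"))  # not in the toy dictionary
```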
Another solution is to define rules for stemming words instead of chopping off letters blindly; the rules are set by the syntactical structure of the word. For example, verbs require aggressive stemming and need to be represented by their roots. Nouns, on the contrary, require only light suffix and prefix elimination. This advanced stemming is known as lemmatization.
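The rule can be sketched as a dispatch on the word's syntactic category. The part-of-speech tags, the toy verb-root table, and the light affix list below are hypothetical placeholders for a real morphological analyzer.

```python
# Toy verb-root table standing in for aggressive root extraction.
VERB_ROOTS = {"drinking": "drink", "drank": "drink"}

def root_lookup(word):
    # Aggressive stemming: reduce a verb to its root.
    return VERB_ROOTS.get(word, word)

def light(word):
    # Light stemming: strip only common affixes from a noun.
    for affix in ("ers", "er", "s"):
        if word.endswith(affix) and len(word) > len(affix) + 2:
            return word[: -len(affix)]
    return word

def lemmatize(word, pos):
    # Dispatch on syntactic category: verbs stem aggressively,
    # nouns lightly, everything else is left untouched.
    if pos == "verb":
        return root_lookup(word)
    if pos == "noun":
        return light(word)
    return word

print(lemmatize("drinking", "verb"), lemmatize("drinkers", "noun"))
```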
Lemmatization is a normalization technique, generally defined as “the transformation of all inflected word forms contained in a text to their dictionary look-up form”.
To the inventor's knowledge, there has been no proposed algorithm for Arabic Lemmatization.
The structure of Arabic makes it harder to stem the words to their roots. Common stemming errors that stemmers suffer from include over-stemming, under-stemming, and mis-stemming.
Automatically adding syntactic knowledge reduces both stemming errors and stemming cost.
The current Arabic stemming approaches focus only on the morphological structure. Ignoring basic Arabic rules can cause errors in automatic translation, text clustering, text summarization, and NLP.
Stemming algorithms rely on Arabic morphology only; adding syntactical knowledge creates what is known in linguistics as a lemmatizer.
Both stemming and lemmatization share a common goal of reducing a word to its base. However, lemmatization is more robust than stemming as it often involves usage of vocabulary and morphological analysis, as opposed to simply removing the suffix of the word.
Arabic is a Semitic language with a composite morphology. Prior to TREC (the Text Retrieval Conference), Arabic stemming was done manually and applied only to small corpora. The most common Arabic stemming approaches are root-based and light stemmers.
Automatic Arabic stemmers proved to be an effective technique for Text Processing for small collections (Al-Kharashi, 1991; Al-Kharashi & Evans, 1994) and large collections (Larkey, Ballesteros, and Connell 2002).
Xu et al. (2002) showed that spelling normalization combined with the use of tri-grams and stemming could significantly improve Arabic text processing by 40%. The two most effective Arabic stemmers are Larkey et al.'s (2003) light stemmer and Khoja's root-extraction stemmer (Khoja and Garside, 1999). On the other hand, Duwairi (2005), El-Kourd et al. (2004), and Mustafa et al. (2004) found that the N-gram stemming technique is not efficient for Arabic text processing. In summary, Arabic stemming has produced promising results in some applications and failed in others.
This approach can also be applied to serve English stemming and lemmatization: current English stemmers aggressively stem words, which can cause loss of meaning. Prior knowledge of the word's type (i.e., noun or verb) can lead to the appropriate stemming.
Tokenization is a fundamental step in processing textual data, preceding the tasks of information retrieval, text mining, and natural language processing. Tokenization is language dependent, as are the accompanying steps of normalization, stop-word removal, lemmatization, and stemming.
The current approach to word tokenization (segmentation) is to extract single words (tokens) from a given document by replacing punctuation marks and non-textual characters with white spaces.
This approach is language independent and is applied to both Arabic and English.
Due to this approach, compound words and phrases composed of two or more words are processed separately, forming new tokens with totally different meanings.
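The punctuation-replacement tokenizer just described, and the compound-word problem it causes, can be demonstrated in a few lines. The English example sentence is an illustration of the general issue.

```python
import string

def naive_tokenize(text):
    # The current approach: every punctuation mark becomes a space,
    # then the text splits on whitespace.
    table = str.maketrans({c: " " for c in string.punctuation})
    return text.translate(table).split()

# The compound "New York" becomes two unrelated tokens,
# and "D.C." is shattered into "D" and "C".
print(naive_tokenize("She moved to New York, then to Washington D.C."))
```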
This issue is more pronounced in the Arabic language, where specific phrases and complex (compound) words are used heavily.
The replacement of punctuation marks with white spaces causes another problem: in Arabic, sentences are usually longer and are separated either by the Arabic comma “،” or by a period “.”.
The current approach replaces the “،” with a space, which merges individual sentences.
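The sentence-merging effect can be shown directly. Assuming the separator in question is the Arabic comma (،), splitting on the two separators finds two sentences before the replacement but only one after it, because the boundary has been erased.

```python
import re

def split_sentences(text):
    # Split on the two Arabic sentence separators noted above:
    # the Arabic comma "،" and the period ".".
    return [s.strip() for s in re.split(r"[،.]", text) if s.strip()]

# "The boy went، the boy returned." (illustrative two-clause text)
text = "ذهب الولد، عاد الولد."

print(len(split_sentences(text)))                      # two sentences
print(len(split_sentences(text.replace("،", " "))))    # boundary lost: one
```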