User-generated content has become ubiquitous and accounts for a major portion of media content. For example, news portals, television stations, and similar outlets allow people to interact with articles and live programs through user-generated content. Further, online stores allow customers to write reviews of their products, and social media provides another avenue for users to quickly post user-generated content.
However, across all of these forms of media, there is a continuous and large demand to filter inappropriate language. Usually, these forms of media rely on simple blacklist filtering and manual human evaluation. Manual human evaluation, however, is very slow and cannot keep up with live commenting (e.g., live chats on television). Moreover, even when dealing with a single language (e.g., Arabic), blacklisting and manual filtration are not very effective because the same word may be pronounced or written differently across dialects and countries. Further, for some of these languages (e.g., Arabic), there is no working dictionary for the informal language and no current methodology for mapping words and phrases to a unified dialect.