Currently, electronic mail (email) and text messaging are among the most common means of text-based communication. Some estimates indicate that even an average computer user receives 40-50 electronic mail messages per day. For this reason, performing text mining applications (such as document content analysis, email routing in customer care and support, filtering, summarization, information extraction, news group analysis, etc.) on electronic mail and other messages may be highly desirable. However, such applications must receive electronic mail messages and text messages as inputs.
Unfortunately, electronic mail data and text messaging data can be very noisy. For instance, the data may contain headers, signatures, quotations from previous electronic mails in a string of messages, and program code. It may contain extra line breaks, extra spaces, and special character tokens. Spaces and periods may be superfluous, in which case they must be removed, or they may be missing. The data may also contain words that are badly cased, or even non-cased, and words that are misspelled.
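To make several of these noise types concrete, the following is a minimal illustrative sketch in Python, not a technique taken from this disclosure. The function name `strip_simple_noise` is hypothetical, and the heuristics are assumptions chosen for illustration: leading "Name: value" lines are treated as headers, lines beginning with ">" as quotations, and a line consisting of "--" as the start of a signature. Such crude rules can misfire, for example on a body line that happens to begin with "Note: ".

```python
import re

# Assumption: a header line looks like "Name: value" at the top of the message.
HEADER_RE = re.compile(r"^[A-Za-z-]+:\s")


def strip_simple_noise(message: str) -> str:
    """Remove a few simple noise types from a raw message (illustrative only)."""
    lines = message.splitlines()

    # Skip the leading block of header lines.
    start = 0
    while start < len(lines) and HEADER_RE.match(lines[start]):
        start += 1

    kept = []
    for line in lines[start:]:
        # Assumption: "--" on its own line marks the start of a signature.
        if line.rstrip() == "--":
            break
        # Assumption: lines starting with ">" quote an earlier message.
        if line.lstrip().startswith(">"):
            continue
        # Collapse runs of spaces and tabs into a single space.
        kept.append(re.sub(r"[ \t]+", " ", line).strip())

    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return text.strip()


raw = (
    "From: a@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Hello   there.\n"
    "\n\n\n"
    "> old text\n"
    "See you  soon.\n"
    "-- \n"
    "Bob\n"
)
cleaned = strip_simple_noise(raw)
```

The sketch addresses only headers, quotations, signatures, and extra whitespace. Missing periods, badly cased words, and misspellings are not handled by pattern rules of this kind.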
Some current text mining products have electronic mail data cleaning features. However, these products have conventionally used single-pass cleaning techniques that identify and remove a very limited number of noise types. They are currently rules-based systems, wherein the rules are defined by users.
Cleaning of noisy data has also been addressed in the research community. However, most of this work has been done on structured tabular data. In natural language processing contexts, for instance, sentence boundary detection, case restoration, spelling error correction, and word normalization have been studied, but primarily as separate and unrelated issues, and not in relation to email or text messaging.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.