In the age of Web 2.0, Internet users create a very broad range of content. A large amount of text content is generated on the Internet every day, such as posts on BBS (Bulletin Board System) forums, articles on blogs, and text messages on the recently booming micro-blogs. The text content created by users covers almost every topic. However, some of this content involves eroticism, fraud, or politically sensitive information. Such content may degrade the online experience of readers or cause them mental or even economic harm. Therefore, each ICP (Internet Content Provider, such as a forum, blog, or micro-blog provider) urgently needs to filter user-created data effectively and in a timely manner, thereby cleaning forum data and improving the user experience.
In the prior art, a common method for filtering content containing sensitive information in a timely manner is a scanning technique based on keywords, that is, scanning the text for one or more keywords related to the sensitive information. For example, keywords such as “eroticism gate”, “sex picture”, and “surreptitious photograph” may be scanned for in order to find a post related to “eroticism gate”. The text content of the post is scanned, and once any of these keywords is found, the content is determined to contain sensitive information related to “eroticism gate”. In practice, however, some users deliberately make subtle modifications to the text they post in order to evade censorship and filtering. Taking the keyword “eroticism gate” as an example, a user can change it in the text to be posted into variants such as “eroX gate”, “ero ◯ gate”, “ero tici sm gate”, “ero×ticism×gate”, “erox0tici0sm gate”, “ero*****ticism**************** gate”, etc. Although these variants do not affect a reader's understanding of the text, they no longer match the keyword exactly and are therefore easily missed by the prior-art keyword scanning techniques. As a result, content containing eroticism, fraud, or politically sensitive information can be posted successfully, and the prior-art scanning techniques based on keyword content fail.
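The limitation described above can be illustrated with a minimal sketch of exact-match keyword scanning. The function name `contains_sensitive_keyword`, the keyword list, and the sample sentences are assumptions made purely for illustration; they do not reproduce any particular prior-art implementation.

```python
# Hypothetical keyword list, using the example keywords from the text.
SENSITIVE_KEYWORDS = ["eroticism gate", "sex picture", "surreptitious photograph"]


def contains_sensitive_keyword(text, keywords=SENSITIVE_KEYWORDS):
    """Return True if any keyword appears verbatim (case-insensitively) in text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# An unmodified post is caught by the exact-match scan.
original = "Latest news about the eroticism gate"

# Subtly modified variants of the same keyword, as described in the text.
variants = [
    "Latest news about the eroX gate",
    "Latest news about the ero tici sm gate",
    "Latest news about the ero×ticism×gate",
    "Latest news about the ero*****ticism**** gate",
]

print(contains_sensitive_keyword(original))  # True
for variant in variants:
    # Each variant breaks the literal substring, so the scan reports False.
    print(variant, "->", contains_sensitive_keyword(variant))
```

In this sketch every variant is reported as clean, because an inserted, substituted, or separating character is enough to break a literal substring match, even though a human reader still recognizes the keyword.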