More and more users send and receive large volumes of information over networks, and make full use of the Internet for information exchange and resource sharing. However, this information usually contains a vast amount of junk information, which is not only of no value to users but may also be maliciously distributed in bulk for illegal purposes. The most common form of junk information is junk email. A user may receive advertisements, propaganda for illegal activities, and even viruses in his/her email account. Junk emails occupy a large amount of network resources and place heavy pressure on servers and network data flow. Furthermore, certain illegal information may pose serious security risks to the network.
In response to these circumstances, current websites normally provide filtering functions for junk emails, and adopt various anti-spam methods to prevent the distribution of junk information. Such methods include indexing the information content distributed by users, deploying irregular time delays, using manual inspection, and using keyword filtering. Of these methods, keyword filtering is the most intelligent and efficient. Herein, keywords refer to words, phrases or word groups that appear frequently in junk information and are representative of it. In keyword filtering, a common practice is to pre-define a certain number of keywords for junk information. When a user distributes information over the Internet, a system scans the information and determines, based on the keywords and various rules, whether the information contains any of the pre-defined junk information. If it does, the information is either not allowed to be distributed or is treated as junk information for processing. The user who attempts to distribute such information may even be placed on a blacklist. Because the keyword filtering method can recognize junk emails automatically, it is the method most often used for filtering junk emails.
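The keyword filtering flow described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class name `KeywordFilter`, the `threshold` rule (number of distinct keywords that must match), and the policy of blacklisting a sender on the first rejected message are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class KeywordFilter:
    """Hypothetical keyword filter: scan text, block on match, blacklist sender."""

    keywords: set[str]                     # pre-defined junk keywords, given in lowercase
    threshold: int = 1                     # assumed rule: distinct keywords needed to flag
    blacklist: set[str] = field(default_factory=set)

    def matched_keywords(self, text: str) -> set[str]:
        """Return the pre-defined keywords that occur in the text as whole words."""
        lowered = text.lower()
        return {
            kw
            for kw in self.keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
        }

    def submit(self, sender: str, text: str) -> bool:
        """Return True if the information may be distributed, False if blocked."""
        if sender in self.blacklist:
            return False                   # previously blacklisted senders are rejected
        if len(self.matched_keywords(text)) >= self.threshold:
            self.blacklist.add(sender)     # assumed policy: blacklist on first violation
            return False
        return True
```

For example, with `KeywordFilter(keywords={"free money", "lottery"})`, a message containing "FREE MONEY" would be blocked and its sender blacklisted, while a message with none of the keywords would pass. A real system would typically combine such matching with additional rules, as the text notes.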
A crucial factor in existing information filtering is how to pre-define junk information reasonably. If it is reasonably defined, junk information can be correctly recognized within a massive amount of information. If it is defined improperly, filtering may perform poorly. A common practice is to select keywords based on experience or from information that has already been identified as junk information, and to manually pre-define those keywords as the contents of junk information. Although this approach may filter junk information, keywords determined by human judgment are somewhat arbitrary, and the filtering result thus obtained may have a large error rate. For example, this approach may fail to recognize junk information that falls outside the scope of the keywords, or junk information in which the keywords occur infrequently. Moreover, the approach may mistake information that is not junk information, but has some characteristics of junk information, for junk information.
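The two kinds of error described above can be made concrete with a small sketch. The keyword list and the four labeled messages below are invented purely for illustration, and the helper names `is_flagged` and `error_counts` are hypothetical; the sketch only shows how missed junk and wrongly flagged legitimate information might be counted.

```python
def is_flagged(text, keywords):
    """Flag the text if any manually chosen keyword appears in it (case-insensitive)."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)


def error_counts(samples, keywords):
    """Count (missed junk, wrongly flagged legitimate) over (text, is_junk) pairs."""
    missed = sum(
        1 for text, is_junk in samples if is_junk and not is_flagged(text, keywords)
    )
    wrongly_flagged = sum(
        1 for text, is_junk in samples if not is_junk and is_flagged(text, keywords)
    )
    return missed, wrongly_flagged


# Illustrative, hand-picked keywords and labeled messages (not real data).
HAND_PICKED = ["lottery", "free money"]
SAMPLES = [
    ("You won the lottery, claim your free money", True),          # junk, caught
    ("Cheap replica watches, order today", True),                  # junk outside keyword scope
    ("Office lottery pool: bring five dollars on Friday", False),  # legitimate, contains keyword
    ("Minutes from Tuesday's meeting attached", False),            # legitimate, passes
]

missed, wrongly_flagged = error_counts(SAMPLES, HAND_PICKED)
```

In this toy sample, one junk message escapes the filter because none of its words were anticipated, and one legitimate message is flagged because it happens to contain a keyword.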