With the development of computer and communication technologies, the Internet has become a very important means of propagating and communicating information in people's work and life. It operates in real time, is fast and convenient, is rich in content, and is not limited by time or place. Examples of such information propagation and communication include online media, bulletin board systems (BBS), instant messaging (IM) and electronic mail. The proliferation of junk messages, however, has greatly frustrated people who regularly use these tools, wasting not only network bandwidth and memory space, but also the time and energy of users.
Existing methods for filtering junk messages are usually based on a Bayesian algorithm. A method of this type collects a large number of junk messages and non-junk messages as sample messages, performs word segmentation on these sample messages, computes the frequencies and probabilities of the characteristic elements obtained, and builds a junk message hash table and a non-junk message hash table. The method then computes, for each characteristic element, the probability that a message containing that element is a junk message, and builds a new hash table which is used as the basis for verifying whether a target message is a junk message. When a new target message needing verification is received, the method re-computes and re-builds the junk message hash table and the non-junk message hash table based upon the results of verification and word segmentation of the new target message. A new hash table is then created as the basis for verifying a subsequent target message.
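The flow described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art implementation. The function names, the whitespace-based `segment` stand-in for word segmentation, the clamping of probabilities to the range 0.01 to 0.99, the neutral value 0.5 for unseen elements, and the 0.9 threshold are all assumptions made for illustration only.

```python
from collections import Counter

def segment(message):
    # Stand-in for word segmentation (assumption): lowercase whitespace split.
    return message.lower().split()

def build_table(messages):
    # Hash table mapping each characteristic element to its occurrence count.
    table = Counter()
    for m in messages:
        table.update(segment(m))
    return table

def build_probability_table(junk_table, ham_table):
    # For each element, estimate the probability that a message containing it
    # is junk, from its relative frequency in the two tables.
    junk_total = sum(junk_table.values()) or 1
    ham_total = sum(ham_table.values()) or 1
    prob = {}
    for token in set(junk_table) | set(ham_table):
        pj = junk_table[token] / junk_total
        ph = ham_table[token] / ham_total
        # Clamp so a single element cannot force the result to 0 or 1.
        prob[token] = min(0.99, max(0.01, pj / (pj + ph)))
    return prob

def junk_score(message, prob_table):
    # Combine per-element probabilities; unseen elements count as 0.5.
    p_junk, p_ham = 1.0, 1.0
    for token in set(segment(message)):
        p = prob_table.get(token, 0.5)
        p_junk *= p
        p_ham *= 1.0 - p
    return p_junk / (p_junk + p_ham)

def verify_and_rebuild(message, junk_msgs, ham_msgs, prob_table, threshold=0.9):
    # Verify the target message, add it to the matching sample set, then
    # rebuild all three hash tables from scratch, as the text describes.
    is_junk = junk_score(message, prob_table) > threshold
    (junk_msgs if is_junk else ham_msgs).append(message)
    new_table = build_probability_table(build_table(junk_msgs),
                                        build_table(ham_msgs))
    return is_junk, new_table

junk_samples = ["cheap pills buy now", "buy cheap watches now"]
ham_samples = ["meeting agenda for monday", "lunch on monday"]
table = build_probability_table(build_table(junk_samples),
                                build_table(ham_samples))
flagged, table = verify_and_rebuild("buy cheap pills", junk_samples,
                                    ham_samples, table)
```

Note that `verify_and_rebuild` re-scans every sample message on each call, so in this sketch the cost of handling one target message grows with the size of the whole sample set.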
However, the above junk message filtering method is not suitable for an application environment that has a large number of sample messages and a high demand for instantaneous processing. For example, if the number of junk messages and the number of non-junk messages are each one hundred thousand, with each individual message being 4 k in length and containing five hundred words, the junk message hash table and the non-junk message hash table built will occupy a huge amount of space. Each time junk message verification is performed on a new target message, the above-described method re-builds the junk message hash table and the non-junk message hash table based upon the results of the verification and word segmentation of the message, and uses these two hash tables to create a new hash table by computing each characteristic element's probability of being associated with a junk message. The new hash table is then used as the basis for verifying a subsequent target message. This massive computation occupies a large amount of system resources and time, causing delays which severely hamper verification of the subsequent target message. This may eventually render real-time filtering of junk messages impossible.
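The scale of the example above can be checked with a short back-of-envelope calculation. This sketch assumes that "4 k" means 4 KiB per message, which the text does not state, and it counts only raw sample text and word occurrences. The actual hash table size depends on the number of distinct characteristic elements, which the example does not give.

```python
MESSAGES_PER_CLASS = 100_000      # junk and non-junk sample sets, each
WORDS_PER_MESSAGE = 500
MESSAGE_SIZE_BYTES = 4 * 1024     # assumption: "4 k" read as 4 KiB

# Word occurrences that must be counted to build one class's hash table.
tokens_per_class = MESSAGES_PER_CLASS * WORDS_PER_MESSAGE

# Raw sample text that must be segmented for one class.
raw_bytes_per_class = MESSAGES_PER_CLASS * MESSAGE_SIZE_BYTES

# A full rebuild touches both the junk and the non-junk sample sets.
tokens_per_rebuild = 2 * tokens_per_class

print(f"word occurrences per class: {tokens_per_class:,}")
print(f"raw text per class: {raw_bytes_per_class / 1e6:.1f} MB")
print(f"word occurrences per full rebuild: {tokens_per_rebuild:,}")
```

Under these assumptions, a full rebuild processes one hundred million word occurrences drawn from roughly 0.8 GB of sample text, and the described method repeats this for every target message verified.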