Internet instant messaging tools, i.e. instant messengers (IM), continue to develop, have been accepted by a majority of Internet users, and have become necessary software tools for network users. Instant messaging tools are used not only for everyday recreation and entertainment but also widely in users' work. IM software mainly provides a message chatting mode for one-to-one chatting between friends or one-to-N chatting within a group or a discussion group. With the continuous development of Internet applications, microblog applications similar to Twitter are also developing continuously.
A microblog is a miniature blog with high information transmission efficiency and a low threshold for information transmission. Users may quickly spread and transmit information through the microblog, which expands the message chatting mode from one-to-one chatting or one-to-N chatting to one-to-infinite chatting. The one-to-infinite chatting mode means that a person can spread messages to countless people and can also receive messages from users on the order of more than ten thousand. However, an application with so many users will inevitably be exploited by many advertisement publishers that forward large numbers of advertisements or spam messages, which not only wastes network resources but also degrades the user experience of the product.
In the prior art, a microblog operator collects a large number of spam messages and non-spam messages to build a spam message library and a non-spam message library. After a new message to be detected is received, word segmentation is first performed on the message. The number of occurrences of each resulting word in the normal message samples and in the spam message samples is then obtained, and the probability that each word belongs to a spam message is calculated from these counts. Finally, the probability that the received message is a spam message is calculated from the per-word probabilities according to the Bayesian formula.
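The prior-art procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the operator's actual implementation: the function names (`tokenize`, `train`, `word_spam_probability`, `message_spam_probability`) are hypothetical, whitespace splitting stands in for real word segmentation, a word is counted once per sample that contains it, add-one smoothing is assumed so that unseen words do not yield zero probabilities, and equal prior probabilities for spam and normal messages are assumed when combining the per-word probabilities.

```python
import math
from collections import Counter


def tokenize(message):
    # Assumption: whitespace splitting stands in for real word segmentation.
    return message.lower().split()


def train(spam_samples, normal_samples):
    # Count, for each word, how many samples of each library contain it.
    spam_counts = Counter(w for m in spam_samples for w in set(tokenize(m)))
    normal_counts = Counter(w for m in normal_samples for w in set(tokenize(m)))
    return spam_counts, normal_counts, len(spam_samples), len(normal_samples)


def word_spam_probability(word, model):
    spam_counts, normal_counts, n_spam, n_normal = model
    # Assumption: add-one smoothing keeps both frequencies strictly positive.
    spam_freq = (spam_counts[word] + 1) / (n_spam + 2)
    normal_freq = (normal_counts[word] + 1) / (n_normal + 2)
    return spam_freq / (spam_freq + normal_freq)


def message_spam_probability(message, model):
    # Combine per-word probabilities with the Bayesian formula, assuming
    # equal priors: P = prod(p) / (prod(p) + prod(1 - p)), computed in logs.
    log_spam = 0.0
    log_normal = 0.0
    for word in set(tokenize(message)):
        p = word_spam_probability(word, model)
        log_spam += math.log(p)
        log_normal += math.log(1.0 - p)
    diff = log_normal - log_spam
    if diff > 700:  # avoid overflow in math.exp for very long messages
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


model = train(
    ["buy cheap pills now", "cheap pills discount"],
    ["lunch at noon today", "see you at the meeting"],
)
print(message_spam_probability("cheap pills", model))
print(message_spam_probability("meeting at noon", model))
```

With these toy libraries, a message made of words seen only in the spam library scores above 0.5, and a message made of words seen only in the normal library scores below 0.5. The result depends entirely on the quality of the two sample libraries and on the words being segmented correctly.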
However, in practice, the inventor of the present invention found that the above method has severe disadvantages, i.e. the method cannot handle most spam messages on a microblog, mainly for the reasons below.
(1) A spam message sample library is difficult to accurately obtain.
Spam message samples in the spam message sample library are generally detected manually or by behavior detection algorithms. A spam message is generally found several hours after it first appears, and misjudgments of spam messages often occur. This has a significant impact on the completeness and accuracy of the sample library, and may even cause a great deviation between the probability, obtained by the above method, that a word belongs to a spam message and the true value of that probability.
(2) Spammers process their spam messages and advertisements to evade the above word segmentation, so that the spam messages or advertisements are not properly segmented.
The traditional detection method relies on word segmentation of the messages being detected. Thus, before sending a spam message or advertisement, a spammer may process it by adding one or more interfering symbols to a word or sentence, or by replacing a commonly used character with an uncommon homophonic character. As a result, after word segmentation the spam message is divided into isolated characters, which cannot be accurately matched with the words in the sample library.
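The effect of interfering symbols on word segmentation can be illustrated with a small sketch. It is not the segmentation algorithm of any particular prior-art system: a simple dictionary-based forward maximum matching segmenter is assumed, and the function name `segment`, the two-word dictionary, and the example strings are all hypothetical. When no dictionary word matches at a position, the segmenter falls back to emitting a single character.

```python
def segment(text, dictionary):
    # Assumption: forward maximum matching against a known-word dictionary,
    # falling back to a single character when no dictionary word matches.
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical dictionary of words known from the spam sample library.
dictionary = {"cheap", "pills"}

clean_tokens = segment("cheappills", dictionary)
obfuscated_tokens = segment("che*appi.lls", dictionary)

# Words from the sample library that survive segmentation in each case.
clean_matches = [t for t in clean_tokens if t in dictionary]
obfuscated_matches = [t for t in obfuscated_tokens if t in dictionary]

print(clean_tokens, clean_matches)
print(obfuscated_tokens, obfuscated_matches)
```

In this sketch the unmodified text segments into two library words, whereas the text with one interfering symbol inserted into each word breaks into isolated characters and matches no library word, so the per-word probabilities of the Bayesian method have nothing to work with.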