A known problem in electronic mailing systems is the receipt of unsolicited and/or unwanted electronic mail messages by a large number of users of such systems. Such messages are typically known as spam. There is no formal standard as such for classifying whether an electronic mail message is spam or not. However, if a message contains a keyword that is considered to be suspicious and/or common to a large volume of electronic mail messages that have been sent to multiple users, then that message may potentially be spam. Electronic mail messages that are spam are differentiated from legitimate messages intended for a respective user.
A drawback associated with spam is that such messages occupy storage space in an electronic mailing system and/or individual mailboxes of users of that system, which may prevent the receipt of legitimate messages. A further drawback associated with spam is that such messages may often be used to propagate viruses and the like, which may cause more serious malfunctioning of the system and/or user machines connected thereto. Deleting spam is time-consuming and may also be error-prone, resulting in the deletion of legitimate messages.
For the identification and removal of spam in the electronic mailing environment, it is known to use Bayesian filters. Such filters rely on an initial training phase where a newly-received electronic mail message is first analyzed by a human to separate those words in its content that are known to be and/or occur in spam from those words that are legitimate. These results are used in the Bayesian filter to adjust individual probabilities as to whether a word in the newly-received and now analyzed electronic mail message will appear in subsequently-received spam or legitimate electronic mail messages. These individual probabilities are then used to compute an overall probability of whether a subsequently-received electronic mail message with a specific set of words is spam or not. A threshold may be specified which, if exceeded by the overall probability, denotes that the electronic mail message is spam. That message may then be removed or allocated to another database dedicated to storing spam.
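The training and classification phases described above can be illustrated with a minimal sketch. It is not the implementation of any particular product: the class name, the whitespace tokenization, the Laplace smoothing, the assumption of equal prior probabilities for spam and legitimate messages, and the default threshold of 0.9 are all assumptions made for illustration.

```python
import math
from collections import Counter


class BayesianSpamFilter:
    """Illustrative word-based Bayesian filter (hypothetical, simplified)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold      # assumed decision threshold
        self.spam_words = Counter()     # number of spam messages containing each word
        self.ham_words = Counter()      # number of legitimate messages containing each word
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        """Training phase: a human-supplied label adjusts the per-word counts."""
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.spam_msgs += 1
        else:
            self.ham_words.update(words)
            self.ham_msgs += 1

    def word_spam_probability(self, word):
        """Individual probability that a message containing `word` is spam."""
        # Laplace smoothing keeps both estimates strictly between 0 and 1.
        p_word_given_spam = (self.spam_words[word] + 1) / (self.spam_msgs + 2)
        p_word_given_ham = (self.ham_words[word] + 1) / (self.ham_msgs + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

    def spam_probability(self, text):
        """Overall probability, combining the individual word probabilities."""
        log_odds = 0.0
        for word in set(text.lower().split()):
            p = self.word_spam_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Clamp so that math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def is_spam(self, text):
        """A message is denoted spam if the overall probability exceeds the threshold."""
        return self.spam_probability(text) > self.threshold
```

In use, `train` would be called once per human-labelled message, after which `is_spam` may be applied to subsequently-received messages. Each message requires tokenization and a probability lookup for every word it contains, which indicates the per-message cost of this kind of analysis.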
Both the size of electronic mail messages and the number of such messages processed per second by an electronic mailing system may be relatively large, and the Bayesian filter performs the level of analysis described above on each of such messages. For these reasons, such a filter is typically implemented and operated in the machine of an individual user and/or in the mail server of such a system at a specific site.
The use of Bloom filters, which are space-efficient representations of sets, in database and network applications can be found in: “Network applications of Bloom filters: A survey” by Andrei Broder and Michael Mitzenmacher, “Content-based overlay networks of XML peers based on multi-level bloom filters” by Georgia Koloniari et al., and “Distance-sensitive bloom filters” by Adam Kirsch and Michael Mitzenmacher. These documents are accessible on the respective web pages eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf, cs.uoi.gr/~pitoura/distribution/p2pw03.pdf, and eecs.harvard.edu/~michaelm/postscripts/alenex2006.pdf.
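For readers unfamiliar with the data structure, a Bloom filter may be sketched as a bit array together with several hash functions. The following is a minimal illustration only: the class name, the default sizes, and the use of salted SHA-256 digests as the hash functions are assumptions, not details taken from the cited documents.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a compact, probabilistic set representation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits        # assumed size of the bit array
        self.num_hashes = num_hashes    # assumed number of hash functions
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Yield one bit position per hash function for the given string."""
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            # Salting the digest with the index i simulates independent hashes.
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Insert an item by setting all of its bit positions."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """Return False if the item is definitely absent, True if possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

The space efficiency comes from storing only bits rather than the items themselves. The price is that a membership test may return a false positive, although it never returns a false negative for an item that was added.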
It is anticipated that spam will pose a problem in mobile communications systems comparable to that described above in the context of electronic mail messaging. This is especially so since messages of the short messaging service (SMS), by way of which messages are typically transmitted to and from mobile phones, may also be automatically generated by Internet-connected computers, for example, and sent from them to mobile phones. The content of an SMS spam message may, for example, be “Please phone this telephone number”, whereby the telephone number is more expensive to call than other telephone numbers and would accordingly result in a higher telephone bill being incurred by the owner of the mobile phone than would usually be the case.
Differences exist between electronic mail messages and SMS messages. Firstly, unlike electronic mail messages that may be routed to a user via an arbitrary mail server, SMS messages are transmitted to a user through a mobile phone operator with which that user is registered, this being done via a switching centre associated with the operator. If, for example, an SMS message is received by a mobile phone operator, the message being destined for a mobile phone managed by another operator, then the message is sent on to that other operator. Thus, unlike for electronic mail messages where a spam filter may be executed on an arbitrary mail server, only a mobile phone operator may be able to implement and execute an SMS spam filter. Furthermore, SMS messages typically have fewer characters than electronic mail messages. Specifically, the maximum length of an SMS message is 160 characters, which, at 7 bits per character, are mapped to 140 bytes. Due to the size constraints imposed on SMS messages, it is expected that the volume of SMS messages processed per second by a mobile phone operator is higher than by a mail server.
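The mapping of 160 characters to 140 bytes follows from packing 7-bit characters into 8-bit bytes: 160 × 7 = 1120 bits = 140 bytes. A minimal sketch of such packing is given below. The function name is hypothetical, and for simplicity the sketch uses ASCII code points as a stand-in for the 7-bit SMS alphabet, which is an assumption: the actual default alphabet assigns different codes to some characters.

```python
def pack_septets(text):
    """Pack 7-bit character codes into bytes, least significant bit first.

    Assumption: ASCII code points stand in for the 7-bit SMS alphabet.
    """
    septets = [ord(ch) for ch in text]
    if any(code > 0x7F for code in septets):
        raise ValueError("character does not fit in 7 bits")

    # Place septet i at bit offset 7 * i of one large integer.
    packed = 0
    for i, code in enumerate(septets):
        packed |= code << (7 * i)

    # Round the total bit count up to a whole number of bytes.
    num_bytes = (7 * len(septets) + 7) // 8
    return packed.to_bytes(num_bytes, "little")
```

Under these assumptions, `len(pack_septets("A" * 160))` evaluates to 140, which matches the size limit stated above.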
For the identification of spam SMS messages, the following factors may have to be taken into account. SMS messages have a limited size and therefore there is restricted scope for incorporating spam in the content of an SMS message and in the manner in which this is done. Thus, a less complicated analysis of SMS messages compared to what is done for electronic mail messages may be needed in order to identify spam. Also, spam SMS messages are typically transmitted by gaining unauthorized access to the cellular network. Since the cellular network is closed, such unauthorized access could only be gained for a limited time, so it is more likely that a large volume of SMS spam messages of the same or similar content is transmitted in a limited time-window. Furthermore, spam identification may only be done centrally at the mobile phone operator, which typically processes a larger volume of messages than a mail server. Taking these factors and the above discussion into account, Bayesian filters, whilst being suitable for spam identification in electronic mail messaging systems, may not be suitable for this purpose in mobile communications systems.
Accordingly, it is desirable to provide a method and an apparatus for identifying spam in SMS messages that may be implemented centrally by a mobile phone operator and that may perform such identification in a manner that is compatible with the number of SMS messages typically processed per second by such an operator and the typical size constraints associated with SMS messages.