Spamming is a revenue-generating business that dramatically degrades the user experience of subscribers who receive the spam messages. Spam messages typically contain unsolicited advertisements or invitations to take actions such as sending an opt-out message for certain services or calling a special number. Sending such opt-out messages and calling the special numbers promoted by spam typically generate high termination costs for the user.
SMS messages have certain particularities as a result of which spam detection techniques that are deployed for other types of messages, e.g. e-mail, are not necessarily applicable or efficient. SMS messages have a maximum payload of 140 bytes, i.e. 160 characters in the default 7-bit alphabet, as a consequence of which the information contained therein is very condensed. Spam SMS messages are sent to mobile subscribers in different countries. The language of the content of spam SMS messages is often adapted to the home network or visited country. Hence, different subscribers will receive different text patterns depending on their home network or visited network. Further, spam SMS generators apply various obfuscation techniques to avoid being detected. For instance, some letters or characters may be removed, special characters may be inserted, unique numbers or URLs may be generated randomly and inserted in spam SMS messages, etc., to hide the repetitive character of such messages from spam filters installed by mobile network operators.
There is a general demand by mobile network operators for a technique that adequately detects and reports spam SMS messages such that mobile network operators can, in real time, white-list certain traffic (and consequently allow such traffic) and black-list the spam messages (and consequently block such spam messages).
A first type of existing SMS spam detection techniques relies on content screening and predefined rules.
United States Patent Application US20050278620A1 entitled “Methods, Systems and Computer Program Products for Content-Based Screening of Messaging Service Messages” for instance describes content-based SMS/MMS spam detection and discarding. The content of SMS/MMS messages is screened. Through a GUI, a subscriber must upfront specify in rules which content he/she does not want to receive. For instance, SMS messages containing the word “Viagra” may be discarded automatically (see [0013] of US20050278620A1).
Spam detection and reporting techniques that rely on content inspection are privacy-intrusive by definition. Moreover, the user or operator must specify in advance which content elements represent spam indicators. Thus, the filter will not block spam messages with content that has not been specified in rules in advance. Furthermore, spam generators may apply obfuscation mechanisms such as replacing letters or characters to avoid being detected.
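By way of illustration only, the rule-based screening discussed above and its sensitivity to obfuscation can be sketched as follows. The function name `is_blocked` and the rule set are hypothetical and are not taken from US20050278620A1; the sketch merely assumes that a rule is a list of words that the subscriber does not want to receive.

```python
import re


def is_blocked(message: str, blocked_words: set[str]) -> bool:
    """Return True when the message contains one of the predefined words.

    Hypothetical sketch of a content rule: the message is lower-cased,
    split into alphanumeric tokens, and each token is compared with the
    words specified in advance.
    """
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return any(token in blocked_words for token in tokens)


rules = {"viagra"}

# A message that literally contains the predefined word is discarded.
plain_hit = is_blocked("Buy Viagra now", rules)

# Replacing a single letter by a digit is enough to evade the rule.
obfuscated_hit = is_blocked("Buy V1agra now", rules)
```

In this sketch the first message matches the rule whereas the second, obfuscated message does not, which illustrates why rules specified in advance are easily circumvented.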
A second type of known spam SMS detection techniques relies on monitoring of traffic sources. When traffic sources display certain behaviour, they are identified as spam sources and their traffic gets blocked.
United States Patent Application US20110106890A1 entitled “Methods, Systems and Computer Program Products for a Mobile-Terminated Message Spam Restrictor” for instance describes an SMS gateway that monitors the traffic from various sources. When SMS traffic of a source exceeds a certain limit, the source gets blocked.
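A minimal sketch of such source monitoring is given below, purely as an illustration. The class name `SourceRateMonitor`, the sliding time window and the parameter names are assumptions; US20110106890A1 is only cited here as blocking a source whose SMS traffic exceeds a certain limit.

```python
from collections import defaultdict, deque


class SourceRateMonitor:
    """Hypothetical per-source limiter: block a source that sends more
    than max_messages within a sliding window of window_seconds."""

    def __init__(self, max_messages: int, window_seconds: float) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # source -> recent send times
        self.blocked = set()

    def allow(self, source: str, now: float) -> bool:
        """Record one message from source at time now; return False if blocked."""
        if source in self.blocked:
            return False
        recent = self._timestamps[source]
        # Forget messages that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_messages:
            self.blocked.add(source)
            return False
        return True


monitor = SourceRateMonitor(max_messages=3, window_seconds=60.0)

# One source sending four messages within a minute gets blocked.
single_source = [monitor.allow("source-A", t) for t in (0.0, 1.0, 2.0, 3.0)]

# The same volume spread over several sources stays under the limit.
spread = [monitor.allow(f"source-{i}", float(i)) for i in range(4) for _ in range(3)]
```

The second usage shows that the same number of messages, when distributed over several sources, never triggers the limit of any individual source.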
The main disadvantage of spam detection through source monitoring is that it can be circumvented through anti-detection measures such as transmitting spam SMS messages from different sources.
A third type of existing spam SMS detection techniques detects traffic that exhibits a repeated pattern and applies a hash technique to reduce the sensitivity to obfuscation measures taken by spam generators.
The article “An Open Digest-based Technique for Spam Detection” by E. Damiani, S. De Capitani di Vimercati, S. Paraboschi and P. Samarati for instance describes an open source technique based on locality sensitive hashing to group identical/similar messages while remaining insensitive to the anti-detection measures described in paragraph 4 of this article. The article seems to focus on bulk e-mail rather than SMS. In order to be able to establish that a group of identical messages constitutes spam, the system described in this article relies on a user's opinion that a certain e-mail is spam. The system in other words requires user collaboration.
U.S. Pat. No. 8,925,087B1 entitled “Apparatus and Methods for In-The-Cloud Identification of Spam and/or Malware” describes another spam detection technique for e-mail. It is mentioned however in Col. 3, lines 43-44 of U.S. Pat. No. 8,925,087B1, that the scanned messages may be text messages as an alternative to e-mail. A host computer that receives an e-mail message calculates a locality sensitive hash, e.g. the Nilsimsa hash, to group identical or similar messages. The calculated hash value is sent to a central system for analysis. The central system aims at detecting a group of similar hash codes to identify spam clusters. The technique known from U.S. Pat. No. 8,925,087B1 however also relies on a query from the user or receiving host computer and is therefore reactive, just like the method described by E. Damiani et al.
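The principle of locality sensitive hashing may be illustrated by the following sketch. It does not implement the Nilsimsa hash; instead it uses a simple SimHash over character trigrams as a stand-in, and the function names `simhash` and `hamming_distance` as well as the example messages are assumptions introduced for illustration. Similar messages are expected to yield hash values that differ in few bits, whereas unrelated messages are expected to differ in roughly half of the bits.

```python
import hashlib

BITS = 64  # width of the fingerprint; matches the 8-byte digest below


def _feature_hash(feature: str) -> int:
    """Map one trigram to a stable 64-bit integer."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, ngram: int = 3) -> int:
    """SimHash fingerprint over character n-grams (stand-in for Nilsimsa)."""
    normalised = " ".join(text.lower().split())
    count = max(1, len(normalised) - ngram + 1)
    grams = [normalised[i:i + ngram] for i in range(count)]
    weights = [0] * BITS
    for gram in grams:
        value = _feature_hash(gram)
        for bit in range(BITS):
            # Each n-gram votes +1 or -1 on every bit position.
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "Congratulations! You have won a free prize. Call 0900 123 456 now to claim your reward."
obfuscated = "Congratulations! You have won a fr3e prize. Call 0900 123 456 now to claim your reward."
unrelated = "Hi mum, the train is delayed so I will be home around eight tonight, do not wait for dinner."

near = hamming_distance(simhash(original), simhash(obfuscated))
far = hamming_distance(simhash(original), simhash(unrelated))
```

A central system could then group messages whose fingerprints lie within a chosen Hamming distance of each other; the choice of that distance is left open in this sketch.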
Reactive spam detection techniques such as the ones known from Damiani et al. or U.S. Pat. No. 8,925,087B1 have the disadvantage that a received message must first be identified as suspicious or potential spam by the user or operator before a query is launched and further analysis takes place. As a consequence, a significant number of spam messages may already have been delivered before any action is taken.
United States Patent Application US 2004/0148330 A1, entitled “Group Based Spam Classification”, describes a technique for detecting spam e-mail that could also be used for SMS messages, as indicated in paragraph of US 2004/0148330 A1. E-mails are clustered into groups of substantially similar e-mails. Thereafter, the spam signature of one or more test e-mails is determined (FIG. 6: 630). When a single test mail is evaluated as spam (FIG. 4: 470; FIG. 5: 540; FIG. 7: 770), or when a predetermined threshold proportion of test mails is evaluated as spam, the entire cluster of similar mails becomes classified and labeled as spam (FIG. 4: 480; FIG. 5: 550; FIG. 7: 780). Optionally, the growth rate of a cluster is monitored and clusters that do not experience a predetermined minimum growth rate are eliminated, i.e. not considered spam, before the evaluation of test mails takes place.
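The group-based classification summarised above can be sketched as follows, as an illustration only. The data structure `Cluster`, the function `label_clusters` and all parameter names are hypothetical and are not taken from US 2004/0148330 A1; the sketch assumes that the verdicts on the test e-mails are already available as booleans.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster of substantially similar messages."""
    name: str
    size_before: int            # cluster size at the previous observation
    size_now: int               # cluster size at the current observation
    test_verdicts: list[bool]   # True when a sampled test message was judged spam


def label_clusters(clusters: list[Cluster], min_growth: int, threshold: float) -> dict[str, bool]:
    """Label each cluster as spam (True) or not spam (False)."""
    labels = {}
    for cluster in clusters:
        growth = cluster.size_now - cluster.size_before
        if growth < min_growth or not cluster.test_verdicts:
            # Slowly growing clusters are eliminated before any test
            # message is evaluated, i.e. they are not considered spam.
            labels[cluster.name] = False
            continue
        spam_share = sum(cluster.test_verdicts) / len(cluster.test_verdicts)
        # The whole cluster inherits the verdict of its test messages.
        labels[cluster.name] = spam_share >= threshold
    return labels


labels = label_clusters(
    [
        Cluster("fast-spam", 100, 150, [True, False, True]),
        Cluster("slow", 100, 101, [True, True, True]),
        Cluster("fast-legit", 100, 150, [False, False, True]),
    ],
    min_growth=10,
    threshold=0.5,
)
```

The sketch makes the reactive character of the approach visible: no cluster can be labeled as spam until at least one of its test messages has been evaluated as spam.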
Just like the method described in U.S. Pat. No. 8,925,087B1, the spam detection technique known from US 2004/0148330 A1 relies on the identification of at least one test mail as spam, and therefore is rather reactive: the user initially is not protected from spam. Moreover, such a technique typically relies on word matching, i.e. the presence of words that are indicative of spam. This is feasible for e-mail but is more difficult to apply successfully to SMS given the particularities of SMS: restriction in length to 140 bytes with 8-bit characters, elimination of vowels or use of alternative abbreviations instead of full words, etc.