Electronic messaging, in the form of both email and text messages, is a widely used and convenient communication mechanism. Unfortunately, it is also widely used to disseminate spam, phishing scams, malware and other threats. Such message-based threats range from unwanted advertising material, to pornographic material, to attempts to get users to click on links or send text messages to phone numbers in order to scam them into entering personal or financial information, or to infect their computers or smartphones with malware. Many utilities exist that attempt to filter out message threats before they reach the user's inbox.
Conventional messaging threat detection examines messages for specific features indicative of threats, such as certain keywords commonly used in spam, specific links or phone numbers known to be used by scammers, or cryptographic (e.g., MD5, SHA-2) hashes of messages. Conventional message filtering technology typically focuses on two main approaches: 1) looking for an exact text match of some keyword or preselected feature; and 2) applying regular expressions for known spam templates. To evade such detection, attackers may change or intentionally misspell the filtered keywords, or change the domain and/or phone number used. Surprisingly, these small variations are often sufficient to evade conventional detection.
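The evasion described above can be illustrated with a minimal Python sketch. The blocklist, the sample messages, and the helper names `keyword_match` and `exact_hash` are hypothetical and chosen only for illustration; SHA-256 stands in for whichever cryptographic hash a given filter uses.

```python
import hashlib
import re

# Hypothetical blocklist, chosen only for illustration.
BLOCKED_KEYWORDS = {"lottery", "prize"}


def keyword_match(message: str) -> bool:
    """Return True if any token exactly equals a blocked keyword."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return any(token in BLOCKED_KEYWORDS for token in tokens)


def exact_hash(message: str) -> str:
    """Cryptographic hash of the whole message (SHA-256 in this sketch)."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


known_threat = "You have won the lottery, call now"
variant = "You have won the l0ttery, call now"  # a single character changed

# Dataset of hashes of known threats (here, just one entry).
known_hashes = {exact_hash(known_threat)}

print(keyword_match(known_threat), exact_hash(known_threat) in known_hashes)
print(keyword_match(variant), exact_hash(variant) in known_hashes)
```

The known threat is caught by both checks, whereas the variant with a single substituted character is caught by neither.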
For example, one filtering technique is to scan messages for a specific URL or phone number that is known to be used in spam or other message threats. Such a URL or phone number is often called a CTA (Call-to-Action), because the receiver of the message is encouraged to take an action (e.g., click the link or text/call the number). CTA-based filtering can fail to detect message threats where the URL on which the user is encouraged to click is modified slightly, for example by using a shortening service such as Bitly. Furthermore, CTA-based scanning fails to take into account the context of the message. For example, the same domain might appear in both benign and malicious messages, resulting in either false positives or false negatives depending upon how the domain is classified.
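As a rough illustration of CTA-based filtering and its blind spot, consider the following Python sketch. The blocked host `scam.example`, the stand-in shortener host `short.example`, and the function names are all assumptions made for this example, not part of any particular product.

```python
import re
from urllib.parse import urlparse

# Hypothetical set of hosts known to appear in message threats.
BLOCKED_HOSTS = {"scam.example"}

URL_PATTERN = re.compile(r"https?://\S+")


def extract_hosts(message: str) -> list:
    """Return the host name of every http(s) URL found in the message."""
    return [urlparse(url).hostname for url in URL_PATTERN.findall(message)]


def cta_match(message: str) -> bool:
    """Flag the message if any URL points at a blocked host."""
    return any(host in BLOCKED_HOSTS for host in extract_hosts(message))


direct = "Claim your prize at https://scam.example/claim today"
# Same campaign, but the link goes through a (made-up) shortening service.
shortened = "Claim your prize at https://short.example/x1 today"

print(cta_match(direct))
print(cta_match(shortened))
```

Only the host of the link is inspected, so the shortened link is not flagged, and the surrounding message text plays no part in the decision.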
In the case of regular-expression-based detection, an analyst writes a regular expression for a known spam template. In other cases, the regular expression is automatically generated by a script or the like. In either case, the regular expression may contain errors. Furthermore, small changes to the text in question, such as intentional typos or misspellings, often evade this type of detection. Each time such minor changes are detected, updates to the previously provided regular expressions must be created and distributed.
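The brittleness of template matching can be seen in a short Python sketch. The template and the sample messages below are hypothetical, written only to show how a one-character change defeats the pattern.

```python
import re

# Hypothetical template an analyst might write for one known spam campaign.
SPAM_TEMPLATE = re.compile(r"you have won \$\d+ in the lottery", re.IGNORECASE)


def template_match(message: str) -> bool:
    """Return True if the message contains text matching the template."""
    return SPAM_TEMPLATE.search(message) is not None


known_threat = "Congratulations, you have won $500 in the lottery!"
typo_variant = "Congratulations, you have w0n $500 in the lottery!"
misspelled_variant = "Congratulations, you have won $500 in the lotery!"

for message in (known_threat, typo_variant, misspelled_variant):
    print(template_match(message), message)
```

Only the first message matches. Covering the two variants would require a revised pattern, which then has to be distributed to every filter that uses it.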
In contrast to cryptographic hashes, similarity hashing functions are designed so that similar inputs result in similar, but not necessarily identical, hashes. For example, using the algorithm called simhash to calculate a hash of the text “Some spam message a” would result in a hash of value 0x5d5eaa672f24cc0c, whereas calculating a hash of the similar message “Some spam message b” would result in a hash value of 0x5d5eaa632a66cc04. These two 64-bit values differ in only 6 bit positions. Such small modifications to content are common in message threats, for example when spammers use URL shortening services or intentionally misspell words.
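A minimal Python sketch of a simhash-style fingerprint follows. It makes several assumptions that the text does not specify: features are lowercased whitespace-delimited tokens with equal weight, each token is hashed with SHA-256 truncated to 64 bits, and the function names are invented for this example. Because of these choices, it is not expected to reproduce the hash values quoted above; it only shows the general technique and how similarity between two hashes can be measured as a Hamming distance.

```python
import hashlib

HASH_BITS = 64


def token_hash(token: str) -> int:
    """64-bit hash of one token (assumption: SHA-256 truncated to 8 bytes)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Simhash-style fingerprint over whitespace-delimited tokens."""
    counts = [0] * HASH_BITS
    for token in text.lower().split():
        h = token_hash(token)
        for bit in range(HASH_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# The two example hash values quoted in the text.
hash_a = 0x5D5EAA672F24CC0C
hash_b = 0x5D5EAA632A66CC04
print(hamming_distance(hash_a, hash_b))

# Fingerprints produced by this sketch for the two example messages.
print(hex(simhash("Some spam message a")))
print(hex(simhash("Some spam message b")))
```

Because three of the four tokens are shared between the two messages, most bit positions receive the same majority vote, which is why the resulting fingerprints tend to differ in only a few bits.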
A serious difficulty arises when trying to apply simhash to detect message threats. Even if a message being evaluated produces a hash similar to that of a known threat, it is still challenging to find that similar hash in a dataset of hashes of known message threats. This is because small changes to the message content can affect both the least significant bits and the most significant bits of the resulting hash. Thus, simply sorting a dataset of hashes of known message threats by value would not place similar hashes next to one another. For this reason, when attempting to determine whether the dataset contains a hash that matches, or is within a given similarity threshold of, the hash of a message being evaluated, binary search and other efficient approaches for locating members of ordered sets are not usable. Instead, a simhash of a given threat candidate conventionally needs to be compared to each simhash in the dataset of confirmed message threats, using pairwise comparisons, until a hit meeting a predetermined similarity threshold is found or the set of simhashes of confirmed threats is exhausted. This is computationally expensive, and the number of required pairwise comparisons grows linearly with the number of simhashes of known message threats. Such an approach is simply not practical for any large dataset.
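The conventional pairwise comparison, and the reason sorting does not help, can be sketched in Python as follows. The function names, the threshold of 8 bits, and the dataset contents are assumptions made for illustration; only the two hash values 0x5d5eaa672f24cc0c and 0x5d5eaa632a66cc04 come from the example above.

```python
from typing import Iterable, Optional


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def linear_scan(candidate: int, known: Iterable[int], threshold: int) -> Optional[int]:
    """Compare the candidate against every known hash until one is close enough."""
    for known_hash in known:
        if hamming_distance(candidate, known_hash) <= threshold:
            return known_hash
    return None  # dataset exhausted without a hit


# Hypothetical dataset; the last entry is the known-threat hash from the text.
known_threats = [0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x5D5EAA672F24CC0C]
candidate = 0x5D5EAA632A66CC04

hit = linear_scan(candidate, known_threats, threshold=8)
print(hex(hit) if hit is not None else None)

# Why sorting does not help: numeric order ignores which bits differ.
base = 0x0000000000000001
flipped_top_bit = 0x8000000000000001  # 1 bit away from base, numerically far
numeric_neighbor = 0x0000000000000002  # 2 bits away from base, numerically adjacent
print([hex(h) for h in sorted([flipped_top_bit, base, numeric_neighbor])])
```

In the sorted output, the hash that is closest to `base` in Hamming distance ends up farthest from it in numeric order, so a binary search around the candidate's value would look in the wrong place. The scan itself performs one comparison per stored hash in the worst case.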
It would be desirable to address these issues.