1. Field of the Invention
The present invention is generally related to automated systems for blocking receipt of undesired electronic mail and, in particular, to a system and methods for reliably detecting and filtering out undesired and unsolicited electronic mail stealthily prepared specifically to avoid detection by conventional filtering systems.
2. Description of the Related Art
The use of electronic mail (email) is widely accepted to the extent that email is often considered an essential tool in conducting business and communicating between individuals. Key aspects of the adoption of email for such use include the immediacy of the communication, the ability to specifically identify and define recipients by email addresses, the fact that email addresses are often publicly available by choice and by standard operation of conventional Internet tools, and the normally user-to-user personal nature of email messages.
A number of different uses of email have evolved beyond the most basic person-to-person communications. One of the earliest developments was the use of list-servers (Listserv) to broadcast email messages sent to the Listserv to a well-defined and selective list of Listserv subscribers. Another development was the direct use of email lists to send announcements and other notices to subscriber lists. Similarly, electronic periodicals and information clipping services often use email lists to distribute content or content summaries to subscribing users.
A generally undesired use of email, hereinafter referred to as the delivery of undesired email (UEM) and loosely referred to as “spam” email or “spamming,” is the typically unsolicited mass emailing to open or unsubscribed email addresses. Other names, such as Unsolicited Bulk Email (UBE), are also used to describe the same problematic email. Information describing the problems and attempts to deal with UEM is available from the Internet Mail Consortium (IMC; www.imc.org). The impact of UEM is generally assessed in a survey, ISPs and Spam: The Impact of Spam on Customer Retention and Acquisition, Engagement #1802565, by the GartnerGroup, San Jose, Calif. In summary, UEM is received by some 90% of email users with publicly accessible email addresses. Estimates vary greatly on the financial impact of UEM on the businesses and services, both large and small, that receive or unintentionally forward UEM to users. Although the individual costs of email message delivery are quite small, the size of mass-emailings creates real costs in terms of connectivity, temporary storage, and auditing. Even the cost to individual users may be significant, since many pay directly for connectivity time.
The email addresses used by UEM vendors are typically harvested from the Internet through scans of the information available on the Web, Usenet postings, and the various user directories and lists accessible through the Internet. For example, corporate Web pages may include an email address list of the different employees of the company. Postings to the public Usenet forums and hosted chat-rooms provided by different services often include the email address of each user who posts a message. User addresses may also be obtained directly, though more typically indirectly, from the different Listservs and subscriber list services that operate over the Internet.
Various email address harvesting tools, ranging from the simple to the quite complex and tailorable to scan the Internet using many different scanning techniques, are conventionally available from a variety of sources. Therefore, virtually any user can set up as a mass-emailing vendor of UEM messages. Such UEM vendors typically hide the address and other source identifiers of their mass-mailings, relay their mass-emailings through third-party email forwarding gateways, and further change their internet service providers (ISPs) to obscure their identity. Consequently, there is no effective mechanism currently available that will permit or enable users to avoid having their email addresses harvested. There is also no simple way users can pre-identify and refuse email messages from UEM vendors.
A number of products and services exist that attempt to identify and filter UEM messages from the ordinary email received by their users. The general operating principles of these systems are documented in Unsolicited Bulk Email: Mechanisms for Control, Internet Mail Consortium Report: UBE-SOL, IMCR-008, revised May 4, 1998. In general, conventional products and services rely on specific identification of known UEM vendors by source address, the specific content of manually identified UEM content, or heuristic-based identification of UEM messages.
Conventional heuristic-based systems, which are generally more effective than specific identification systems, are executed by a host computer system to review and filter already received email messages into ordinary and UEM categories prior to actual delivery to the email addressees. These systems may utilize a variety of different analyses of email message content to discern UEM, including detection of key-words and key-phrases, which are predefined or learned over time, and detection of improper and missing email header fields. The predefined keys are static and typically include occurrences of “$$$” and variations of “make money fast.” Learning of other key-words and phrases conventionally requires manual intervention. UEM vendors, however, are known to be highly creative and have generally demonstrated that these heuristics can be defeated by selective crafting of the content of the UEM message through the use of automated technical mechanisms.
A known problem with heuristic systems is that the heuristics, when operative at a level sufficient to identify the bulk of the UEM received by a user, also create a substantial likelihood that non-UEM messages will be improperly identified and filtered. Any loss of non-UEM messages, however, is generally considered completely unacceptable by users. As a result, there does not appear to be any reliable and practically acceptable way to prevent the harvesting of email addresses, and thereby to limit the ability of UEM vendors to originate their mass-mailings, or to reliably identify UEM messages once received.
Consequently, there is a clear need for some reliable and effective manner of detecting and filtering out UEM messages from the stream of email messages received by email users.
Thus, a general purpose of the present invention is to provide an efficient method and system for qualifying and thereby providing a basis for protecting users against receipt of UEM messages.
This is achieved in the present invention by providing a method and system that actively qualifies undesirable email messages sent to the email address of a user-recipient. The content of a received email message is processed to produce multiple signatures representing aspects of the contents of the received email message. These signatures are compared against a database of signatures produced from a plurality of presumed undesirable email messages. A relative-identity of the signatures is scored to provide a basis for distinguishing the received email message from the presumed undesirable email messages.
The multiple signatures generated to represent the received email are produced as digests of algorithmically selected portions of the received email, with the signatures defined by like algorithms being comparable. The rate of comparison matches between the multiplicity of suspect email message signatures and the signatures stored by the database serves as the basis for distinguishing the received email message from the presumed undesirable email messages.
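The signature generation and match-rate scoring described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions: the three selection algorithms, the use of a truncated SHA-256 value as the digest, and the names `signatures` and `match_rate` are all hypothetical and are not taken from the present description.

```python
import hashlib
import re


def _digest(text: str) -> str:
    # Assumption: a truncated SHA-256 hex string stands in for the "digest".
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Hypothetical selection algorithms; each picks a portion of the message body.
def select_all(body: str) -> str:
    # Whole body, lowercased, with runs of whitespace collapsed.
    return re.sub(r"\s+", " ", body.lower()).strip()


def select_letters(body: str) -> str:
    # Letters only, so punctuation and spacing changes have no effect.
    return re.sub(r"[^a-z]", "", body.lower())


def select_long_words(body: str) -> str:
    # Only alphabetic runs of six or more letters.
    return " ".join(re.findall(r"[a-z]{6,}", body.lower()))


ALGORITHMS = {
    "all": select_all,
    "letters": select_letters,
    "long_words": select_long_words,
}


def signatures(body: str) -> dict:
    """Produce one signature per algorithm for a message body."""
    return {name: _digest(select(body)) for name, select in ALGORITHMS.items()}


def match_rate(suspect: dict, stored: dict) -> float:
    """Fraction of like-algorithm signatures that agree.

    Only signatures produced by the same algorithm are compared.
    """
    shared = suspect.keys() & stored.keys()
    if not shared:
        return 0.0
    hits = sum(1 for name in shared if suspect[name] == stored[name])
    return hits / len(shared)
```

In this sketch, a variant of a known message that differs only in letter case, spacing, and punctuation no longer matches on the whole-body signature but still matches on the other two, so the match rate degrades gradually rather than dropping to zero.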
The system includes a data store providing for updateable storage of signature records that correspond to undesirable email messages potentially sent to the user-recipient email address. An email filter processor is coupled to the store of signature records and operates against the email messages received to discern and qualify email messages determined by the system to sufficiently correspond to the signature records. The store of signature records is coupleable to automatically receive signature records corresponding to additional instances of undesirable email messages, which are then stored to the data store for subsequent use in comparisons.
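As a rough sketch of how an updateable store of signature records and a filter processor might fit together, the following Python fragment may be considered. The class name `SignatureStore`, the function `qualifies_as_uem`, the record layout (a mapping from algorithm name to digest), and the threshold of 0.5 are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SignatureStore:
    """Hypothetical updateable store of signature records.

    Each record maps an algorithm name to the digest that algorithm produced.
    """
    records: list = field(default_factory=list)

    def update(self, new_records) -> int:
        """Add records not already present; return how many were added."""
        added = 0
        for record in new_records:
            if record not in self.records:
                self.records.append(dict(record))
                added += 1
        return added

    def best_match_rate(self, suspect: dict) -> float:
        """Highest fraction of agreeing like-algorithm signatures over all records."""
        best = 0.0
        for record in self.records:
            shared = suspect.keys() & record.keys()
            if shared:
                hits = sum(suspect[name] == record[name] for name in shared)
                best = max(best, hits / len(shared))
        return best


def qualifies_as_uem(store: SignatureStore, suspect: dict,
                     threshold: float = 0.5) -> bool:
    # Assumption: a fixed threshold decides "sufficient" correspondence.
    return store.best_match_rate(suspect) >= threshold
```

Because `update` can be called at any time with newly received records, later comparisons automatically take the additional records into account.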
The system can be implemented to include at least a portion of the email processor system within a client site email transport system, which receives the email messages addressed to the set of email addresses assigned or associated with the client site, including the predetermined email address.
An advantage of the present invention is that, through an effectively reverse harvesting of UEM vendors, a controlled yield of UEM is made available to the system as a basis for qualifying and potentially filtering out UEM sent to protected users.
Another advantage of the present invention is that the signature records generated through processing of UEM messages effectively implement a one-to-many relationship between an undesirable email, as directed to a recipient-user, and the UEM represented by the signatures stored in the data store. Thus, comparison of the signatures of a suspect email with those contained in the stored signature records enables both ordinary and intentionally created variants of UEM messages to be automatically detectable by the system.
A further advantage of the present invention is that a broad set of algorithms may be implemented as a basis for generating signature records, with only subsets of the algorithms used for particular signature records. A UEM vendor, therefore, cannot reasonably anticipate a particular subset of algorithms to evade in preparing a UEM message, since the subset is not static at least over time. Furthermore, variants of the top-level set of algorithms as well as new algorithms may be introduced with the effect that the top-level set of algorithms is not necessarily closed.
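The idea of drawing a varying subset from an open top-level set of algorithms can be sketched as follows. This is an illustrative assumption only: the four selectors, the random choice of subset, and the names `make_record` and `register_algorithm` are hypothetical, and the present description does not state how a subset is actually chosen.

```python
import hashlib
import random
import re

# Hypothetical top-level set of selection algorithms (name -> selector).
ALGORITHMS = {
    "letters": lambda body: re.sub(r"[^a-z]", "", body.lower()),
    "digits": lambda body: re.sub(r"[^0-9]", "", body),
    "long_words": lambda body: " ".join(re.findall(r"[a-z]{6,}", body.lower())),
    "first_line": lambda body: (body.strip().splitlines()[0].lower()
                                if body.strip() else ""),
}


def make_record(body: str, k: int, rng: random.Random) -> dict:
    """Build a signature record using k algorithms drawn from the current set.

    The record carries the algorithm names, so a client can tell which
    algorithms to apply when comparing a suspect message against it.
    """
    chosen = rng.sample(sorted(ALGORITHMS), k)
    return {
        name: hashlib.sha256(ALGORITHMS[name](body).encode("utf-8")).hexdigest()[:16]
        for name in chosen
    }


def register_algorithm(name: str, selector) -> None:
    """Add a new or variant algorithm; the top-level set is not closed."""
    ALGORITHMS[name] = selector
```

A record built this way uses only the algorithms named in its keys, and records created after `register_algorithm` is called may draw on the newly added algorithm as well.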
Still another advantage of the present invention is that the algorithms, based on their different bases for determining content similarity, are automatically resilient against ordinary and intentional variations. The resulting scoring of potential UEM messages is based not on exact identities, but on similarity matching. A particular signature record will therefore match the variants of a UEM message that might be produced by automated rewriting. Complex rewriting of individual UEM messages, requiring direct and non-automatable involvement of the UEM vendor, is both impractical and unprofitable for a UEM vendor.
Yet another advantage of the present invention is that a centrally maintained and managed server can be efficiently operated to collect, analyze, and produce signature records. By the proxy collection of UEM, the server is highly responsive to new mass emailings of UEM and can quickly prepare and hot-update client-site filtering systems against at least any further passage of these UEM messages. Further, the server can automatically update the client systems with new algorithms and specify different selections of algorithms for use in connection with particular signature records. As a result, client site administration is essentially automatic, yet maintained entirely up-to-date under the secure control of the server. The operation of the server, however, is also substantially autonomous. Little or no human interaction is required on an ongoing basis to manage or operate the system of the present invention. Specifically, no human interaction or contribution is required in order for the present system and methods to develop UEM message signatures. These signatures are automatically developed and distributed.
A still further advantage of the present invention is that the algorithmic operations of the server in analyzing potential UEM and of the client in detecting UEM are both fast and highly reproducible. Computational speed is proportional to and primarily controlled by the size of an email message being processed. A secondary consideration is the computational complexity of the algorithms. In the preferred embodiments of the present invention, the computational complexity is little greater than the computation of checksums of text characters.
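To make the cost argument concrete, the following Python sketch computes two checksum-style signatures in a single pass over the message text, so the work grows linearly with message size. The polynomial checksum, the multiplier 31, the modulus 65521, and the function name `checksum_signatures` are assumptions chosen for illustration and are not the algorithms of the preferred embodiments.

```python
def checksum_signatures(text: str, modulus: int = 65521) -> dict:
    """Compute two simple checksums in one pass over the text.

    Each character is visited once, so running time is proportional to the
    length of the message.
    """
    letters = 0  # checksum over alphabetic characters only
    digits = 0   # checksum over numeric characters only
    for ch in text.lower():
        code = ord(ch)
        if ch.isalpha():
            letters = (letters * 31 + code) % modulus
        elif ch.isdigit():
            digits = (digits * 31 + code) % modulus
        # All other characters (spacing, punctuation) are ignored.
    return {"letters": letters, "digits": digits}
```

Because spacing, punctuation, and letter case are ignored, the inputs `"a1"` and `"A 1"` yield the same pair of checksums, which also illustrates the reproducibility of the result for trivially altered text.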