The present invention relates to a method of, and system for, replacing external links in electronic documents such as email with links which can be controlled. One use of this is to ensure that email that attempts to bypass email content scanners no longer succeeds. Another use is to reduce the effectiveness of web bugs.
Content scanning can be carried out at a number of places in the passage of electronic documents from one system to another. Taking email as an example, it may be carried out by software operated by the user, e.g. incorporated in, or an adjunct to, his email client, or it may be carried out on a mail server to which the user connects, over a LAN or WAN, in order to retrieve email. Also, Internet Service Providers (ISPs) can carry out content scanning as a value-added service on behalf of customers who, for example, then retrieve their content-scanned email via a POP3 account or similar.
One trick which can be used to bypass email content scanners is to create an email which just contains a link (such as an HTML hyperlink) to the undesirable or “nasty” content. Such content may include viruses and other varieties of malware as well as potentially offensive material such as pornographic images and text, and other material to which the email recipient may not wish to be subjected, such as spam. The content scanner sees only the link, which is not suspicious, and the email is let through. However, when the email is viewed in the email client, the object referred to may be brought in either automatically by the email client or when the reader clicks on the link. Thus, the nasty object ends up on the user's desktop, without ever passing through the email content scanner.
It is possible for the content scanner to download the object by following the link itself, and then to scan the object. However, this method is not foolproof: for instance, the server delivering the object to the content scanner may be able to detect that the request is from a content scanner and not from the end user. It may then serve up a different, innocent object to be scanned, while serving the nasty one when the end user later requests the object.
The present invention seeks to reduce or eliminate the problems of embedded links in electronic documents and does so by having the content scanner attempt to follow a link found in an electronic document and scan the object which is the target of the link. If the object is found to be acceptable from the point of view of content-scanning criteria, it is retrieved by the scanner and stored on a local, trusted server which is under the control of the person or organisation operating the invention. The link in the electronic document is adjusted to point at the copy of the object stored on the trusted server rather than the original; the document can then be delivered to the recipient without the possibility that the version received by the recipient differs from the one originally scanned. Note that it does not matter to the principle of the invention whether the linked object is stored on the trusted server before or after it has been scanned for acceptability; if it is stored first and found unacceptable on scanning, the link to it can simply be deleted.
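The scan, store and rewrite steps described above can be illustrated with a minimal Python sketch. It is not a definitive implementation: the names rewrite_links, fetch, is_acceptable and store, the regular expression used to find links, and the address trusted.example are all assumptions introduced for illustration. The trusted server is modelled as a dictionary, and links whose objects fail the scan are left untouched and reported, so that the caller can decide what to do with them.

```python
import hashlib
import re

# Hypothetical base URL of the trusted server (an assumption for this sketch).
TRUSTED_BASE = "https://trusted.example/objects/"

# Simplified pattern for external links in HTML attributes (assumption:
# double-quoted href/src values with an http or https scheme).
LINK_RE = re.compile(r'(href|src)="(https?://[^"]+)"', re.IGNORECASE)


def rewrite_links(html, fetch, is_acceptable, store):
    """Point each acceptable external link at a copy held on the trusted server.

    fetch(url) returns the object's bytes, is_acceptable(data) applies the
    content-scanning rules, and store is a dict standing in for the trusted
    server. Returns the rewritten document and the list of rejected URLs.
    """
    rejected = []

    def replace(match):
        attr, url = match.group(1), match.group(2)
        data = fetch(url)                   # the scanner follows the link
        if not is_acceptable(data):
            rejected.append(url)            # left as-is for the caller to handle
            return match.group(0)
        key = hashlib.sha256(data).hexdigest()
        store[key] = data                   # the exact bytes that were scanned
        return f'{attr}="{TRUSTED_BASE}{key}"'

    return LINK_RE.sub(replace, html), rejected
```

Because the recipient's link now resolves to the stored bytes, the object delivered is the one that was scanned, whatever the original server might later choose to serve.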
If the object is not found to be acceptable, one or more remedial actions may be taken: for example, the link may be replaced by a non-functional link and/or a notice that the original link has been removed and why; another possibility is that the electronic document can be quarantined and an email or alert generated and sent to the intended recipient advising him that this has been done and perhaps including a link via which he can retrieve it nevertheless or delete it. The process of following links, scanning the linked object and replacing it or not with an embedded copy and an adjusted link may be applied recursively. An upper limit may be placed on the number of recursion levels, to stop the system getting stuck in an infinite loop (e.g. because there are circular links) and to effectively limit the amount of time the processing will take.
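The recursive application with an upper limit on recursion levels can be sketched as follows. This is an illustration only: scan_recursively, MAX_DEPTH, the link pattern and the callables fetch and is_acceptable are hypothetical names, and treating an object that lies beyond the depth limit as unacceptable is a policy assumption, since the text only requires that the recursion be bounded.

```python
import re

# Simplified link pattern and depth limit; both are assumptions for this sketch.
LINK_RE = re.compile(r'href="([^"]+)"')
MAX_DEPTH = 3


def scan_recursively(url, fetch, is_acceptable, depth=0, seen=None):
    """Return True if url and every object it links to passes the scan.

    fetch(url) returns the object's text and is_acceptable(text) applies the
    scanning rules. Circular links are handled by remembering visited URLs.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return True       # already being checked: a circular link, not a new object
    if depth > MAX_DEPTH:
        return False      # assumed policy: too deep to verify, so reject
    seen.add(url)
    content = fetch(url)
    if not is_acceptable(content):
        return False
    # Any failure below propagates upward, because all() stops at the first False.
    return all(
        scan_recursively(link, fetch, is_acceptable, depth + 1, seen)
        for link in LINK_RE.findall(content)
    )
```

The set of visited URLs guards against circular links, while the depth limit bounds the processing time for long chains of links.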
Thus according to the present invention there is provided a content scanning system for electronic documents such as emails comprising:
a) a link analyser for identifying hyperlinks in document content;
b) means for causing a content scanner to scan objects referenced by links identified by the link analyser and to determine their acceptability according to predefined rules, the means being operative, when the link is to an object external to the document and is determined by the content scanner to be acceptable, to retrieve the external object and modify the document by replacing the link to the external object by one to a copy of the object stored on a trusted server.
The invention also provides a method of content-scanning electronic documents such as emails comprising:
a) using a link analyser for identifying hyperlinks in document content;
b) using a content scanner to scan objects referenced by links identified by the link analyser and to determine their acceptability according to predefined rules and, when the link is to an object external to the document and the object is determined by the content scanner to be acceptable, retrieving the external object and modifying the document by replacing the link to the external object by one to a copy of the object stored on a trusted server.
Thus the content scanner can follow the link, and download and scan the object. If the object is judged satisfactory, a copy of it is stored on the trusted server, and the link to the external object replaced by a link to that copy.
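By way of illustration, a link analyser for HTML content might be sketched with Python's standard html.parser module. The class name LinkAnalyser, the helper find_external_links, and the choice to treat only href and src attributes with an http or https scheme as external links are assumptions made for this example; a practical analyser would need to consider further attributes and document formats.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAnalyser(HTMLParser):
    """Collects links to objects external to the document (illustrative only)."""

    LINK_ATTRS = {"href", "src"}          # assumption: attributes that carry links
    EXTERNAL_SCHEMES = {"http", "https"}  # assumption: schemes counted as external

    def __init__(self):
        super().__init__()
        self.external_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lower-cased names.
        for name, value in attrs:
            if name in self.LINK_ATTRS and value:
                if urlparse(value).scheme in self.EXTERNAL_SCHEMES:
                    self.external_links.append(value)


def find_external_links(html):
    """Return the external links found in an HTML document, in document order."""
    analyser = LinkAnalyser()
    analyser.feed(html)
    return analyser.external_links
```

References to parts of the message itself, such as cid: URLs, are skipped, since those objects already pass through the content scanner with the document.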
One trick used by spammers is to embed ‘web bugs’ in their spam emails. These are unique or semi-unique links to web sites, so a spammer sending out 1000 emails would use 1000 different links. When the email is read, a connection is made to the web site, and by finding which link has been hit, the spammer can match it with their records to tell which person has read the spam email. This then confirms that the email address is a genuine one. The spammer can continue to send email to that address, or perhaps even sell the address on to other spammers.
By following every external link in every email that passes through the content scanner, all the web bugs the spammer sends out will be activated. Their effectiveness therefore becomes much reduced, because they can no longer be used to tell which email addresses are valid and which are not.
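The effect on web bugs can be illustrated with a small simulation of the spammer's bookkeeping. The functions make_campaign and confirmed_addresses and the token format are hypothetical and serve only to show the principle: when every link is hit, the set of hits no longer distinguishes one address from another.

```python
import secrets


def make_campaign(addresses):
    """Spammer side: assign one unique tracking token (web bug) per address."""
    return {secrets.token_hex(8): address for address in addresses}


def confirmed_addresses(campaign, hits):
    """Spammer side: addresses whose tracking token was requested."""
    return {campaign[token] for token in hits if token in campaign}
```

If only one recipient opens the email, the spammer learns exactly that address. If the content scanner follows every link, confirmed_addresses returns the whole mailing list, which tells the spammer nothing about which addresses are actually read.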