Transmitting messages over networks (e.g., in the form of e-mails or instant messages) has become a highly popular means of communication. Yet it remains difficult to trace such messages with certainty (i.e., to be able to state with a high level of confidence that a particular part of the network, such as a router, participated in delivering the message to its destination). The ability to trace with certainty is nonetheless important in light of cybercrime, which uses networked message communication to transmit malware and other threats. Without this ability, it is difficult to identify the offender and to take appropriate remedial action.
One example is spam. Spam continues to consume vast amounts of network and server resources worldwide, with some studies suggesting that it accounts for over 80% of all e-mails sent each month. Another study estimates that 183 billion e-mails per day are propagated through the Internet just to be, ideally, deleted upon receipt. While wide-scale deployment of advanced anti-spam technologies has significantly reduced the impact of spam on the end user, large amounts of network resources are still consumed to transmit large volumes of spam. For organizations receiving spam, this translates into high costs due to factors such as:
1. Paying an upstream service provider for the network bandwidth wasted by spam;
2. Keeping spam defenses up to date, requiring the purchase of up-to-date hardware and software, as well as training of personnel;
3. False positives (i.e., legitimate messages that were mistaken for spam) resulting in lost business; and
4. False negatives (i.e., spam messages that were mistaken for legitimate messages) which, in the best case, cause the recipient to waste time erasing the messages and, in the worst case, expose the recipient to malware or fraud.
In the prior art, there are a number of methods that can be used to trace the path that a packet took through a network. Most of the methods are designed for determining the source of a denial of service (DoS) attack.
One approach is to let routers (network devices capable of forwarding packets from one network to another) mark the packets they transmit with a probability p. This allows DoS victims to determine which routers are involved by checking for these marks.
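By way of illustration only, the marking approach can be sketched as a small simulation. The router identifiers, the list-of-marks packet representation, and the function names below are assumptions made for this sketch; they are not taken from any particular prior-art scheme.

```python
import random
from collections import Counter


def forward(packet_marks, router_id, p, rng):
    """Hypothetical router step: append this router's ID with probability p."""
    if rng.random() < p:
        packet_marks.append(router_id)


def simulate(path, num_packets, p, seed=0):
    """Send num_packets along path and count the marks the victim observes."""
    rng = random.Random(seed)  # seeded so that the sketch is repeatable
    seen = Counter()
    for _ in range(num_packets):
        marks = []
        for router_id in path:
            forward(marks, router_id, p, rng)
        seen.update(marks)
    return seen


if __name__ == "__main__":
    # A flood of packets reveals every router on the (assumed) path.
    print(simulate(["r1", "r2", "r3"], num_packets=1000, p=0.05))
```

With a large flood, every router on the path eventually appears in the victim's tally, which is what makes the approach workable in the DoS case.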
Another approach is to ask routers to copy packets they receive and encapsulate them in separate trace packets that are then sent to their final destination using Internet Control Message Protocol (ICMP). To avoid flooding the network, routers only send these trace packets with a very low probability.
A third approach is to flood upstream routers selectively and see how this affects the attack packets the client receives. If there is a reduction, the router that was just flooded is likely a participant in the attack.
A fourth approach is to take suspicious flows and route them over a special analysis network.
The problem with adapting these approaches to non-DoS scenarios, such as spam, is that while in a DoS situation a large flood of packets is sent to a single victim, in spam there are many victims, each of which receives, at most, only a few copies of a given spam message. Thus chances are high that a given spam message will comprise too few packets to generate sufficient information to trace it to its original sender. Furthermore, the packet marking technique can be subverted by a malicious router that either does not mark a packet or uses the markings of another router.
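The effect of message size can be made concrete with a short calculation. Assuming each router marks packets independently with probability p, the chance that a given router marks at least one of n packets is 1 - (1 - p)^n. The value p = 0.04 and the packet counts below are illustrative assumptions, not figures from the prior art.

```python
def prob_router_observed(p, n):
    """Probability that a router marks at least one of n packets,
    assuming independent marking with probability p per packet."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.04  # assumed marking probability, for illustration only
    for n in (3, 100, 10_000):  # a short spam message versus a DoS flood
        print(f"n={n:>6}: {prob_router_observed(p, n):.4f}")
```

Under these assumptions a message of only a few packets leaves most routers on its path unobserved, whereas a flood of thousands of packets reveals each router with near certainty.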
An alternative is to ask routers to log the packets they receive. Victims of an attack are then theoretically able to trace back by querying the log databases of various routers. Adapting this approach to non-DoS domains, such as spam, is possible, but because very little information is retained in these databases in practice, the possibility of false positives exists. Also, subverted routers can deny that they were responsible for transmitting a given spam packet. Finally, a network is very unlikely to allow other networks to probe its router logs in order to determine where an attack is coming from; without this co-operation, tracing the message path will not succeed.
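The false-positive risk can be illustrated with a minimal sketch in which a router keeps only a truncated digest of each packet. The class name, the use of SHA-256, and the digest lengths are assumptions chosen for this example; actual logging schemes may store different information.

```python
import hashlib


class DigestLog:
    """Hypothetical router log that retains only a truncated packet digest."""

    def __init__(self, digest_bytes=2):
        self.digest_bytes = digest_bytes
        self._seen = set()

    def _digest(self, packet: bytes) -> bytes:
        # Keep only the first digest_bytes bytes to save storage.
        return hashlib.sha256(packet).digest()[: self.digest_bytes]

    def record(self, packet: bytes) -> None:
        self._seen.add(self._digest(packet))

    def may_have_forwarded(self, packet: bytes) -> bool:
        # True means "possibly forwarded": distinct packets can share a digest.
        return self._digest(packet) in self._seen


if __name__ == "__main__":
    log = DigestLog(digest_bytes=1)
    for i in range(5000):
        log.record(f"pkt-{i}".encode())
    # A packet that was never recorded may still appear to match.
    print(log.may_have_forwarded(b"never-sent"))
```

The shorter the retained digest, the more likely it is that a router appears to have forwarded a packet it never saw.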
E-Mail
Outside of DoS, some traceback work has been done in the area of e-mail transmissions. For example, SpamCop™ uses mail headers to make an educated guess about the source of a spam message and allows complainants to send an automated message to the originating ISP. SpamCop™ also publishes a blacklist of spamming ISPs as an enforcement mechanism. However, the fact that e-mail headers can be forged means that SpamCop™ cannot always positively identify the sender.
DomainKeys Identified Mail (DKIM) goes a step further by requiring the originating mail servers to sign every outgoing mail message. This allows positive identification of the source. While there is no direct defense against spammers who authenticate their spam, DKIM allows positive trace-back to the spammer's mail server, exposing it to blacklist inclusion. DKIM is often promoted in conjunction with the Sender Policy Framework (SPF), which requires domains to publish the IP addresses of their mail servers. If a message was not delivered by one of these mail servers, this could indicate a forged return path or a bot, either of which is indicative of spam.
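The SPF idea of checking the delivering server against a published address list can be sketched as follows. This is a simplified illustration only: it does not parse real SPF records or perform DNS lookups, and the function name and the example address ranges (reserved documentation ranges) are assumptions.

```python
import ipaddress


def sender_is_authorized(sender_ip, published_networks):
    """Return True if sender_ip falls within any network the domain published.

    published_networks stands in for the address list a domain would
    publish; a real deployment would obtain it from DNS.
    """
    addr = ipaddress.ip_address(sender_ip)
    return any(addr in ipaddress.ip_network(net) for net in published_networks)


if __name__ == "__main__":
    published = ["192.0.2.0/24", "198.51.100.17"]  # assumed example list
    print(sender_is_authorized("192.0.2.10", published))   # listed server
    print(sender_is_authorized("203.0.113.5", published))  # unlisted sender
```

A message arriving from an address outside the published list would be treated as suspect, for example as a forged return path or a bot.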
Armorpost™ also requires all senders to authenticate themselves. However, Armorpost™ is stricter than DKIM in the sense that, in order to send a message to a client protected by this system, the sender must join Armorpost™ as well; otherwise, the message will not be delivered. Given the wide range of Internet users, such a system is likely to be too complicated or too intrusive for many netizens, resulting in messages not being transmitted to the intended recipient. Furthermore, Armorpost™ requires the setup of an extensive, hierarchical certificate architecture in order for previously unintroduced Armorpost™ agents to trust each other.
Digital Postmarks provide another means to trace offending ISPs. Using this protocol, the first border router along a packet's path inserts a postmark based on the router's IP address, allowing the recipient to narrow down the source of the packet. However, the path taken once inside the network cannot be traced with this method.
A problem with traceback schemes in general is that there is no effective enforcement once an offender has been identified. This is especially the case with the large botnets that are now responsible for sending most spam. Simply blacklisting a machine that was responsible for sending spam is not effective, since a spammer can easily subvert another vulnerable machine, often in another network, and continue to send spam from there. Blacklisting an entire domain from which large volumes of spam were sent is not an attractive option either, since doing so also filters out legitimate e-mail messages; for this reason, the larger domains from which spam messages are sent are unlikely to be blacklisted. Other alternatives, such as lawsuits against offending ISPs, may prove fruitless if the ISP is located in a jurisdiction that has permissive spam legislation. A recent study by the Committee on Critical Information Infrastructure Protection and the Law (National Academy of Engineering) describes some of the challenges posed by international boundaries. First, there is the challenge of securing evidence quickly. Informing a party in a foreign jurisdiction that a violation is taking place, or has recently taken place, and then waiting for a response, risks losing the evidence needed at the source. Second, there is the challenge of prosecuting the parties responsible. This requires international treaties and changes in the laws of the countries involved, which is a tedious process fraught with difficulties. Furthermore, any nation that does not sign on to such a treaty is likely to become a haven for spammers and other dubious netizens. Thus a treaty on tracing and stopping spammers would have to apply globally to be effective, which is unlikely to occur given the current global political climate.
Other Art
Much of the academic literature on spam prevention has focused on detecting spam at the recipient's end, using scanning techniques such as naive Bayesian algorithms, support vector machines, memory-based classifiers, and boosting trees. Commercial products also tend to focus on detecting spam at the recipient's end, often combining this with complementary techniques such as blacklisting IPs of machines that have sent spam in the past.
For example, Cisco™'s Ironport Antispam™ evaluates messages based on: the reputation of the sender, considering factors such as the country of origin and recent suspicious activity; the reputation of included URL links, based on factors such as the age of the domain registration; the structure of the message, such as missing or suspicious SMTP headers; and the actual contents of the message. Ironport™ also runs an operations center that generates signatures for spam messages that make it past the four checks corresponding to the above factors.
Cloudmark™, which grew out of the Vipul's Razor™ open-source project, takes a different approach by relying on a large user community to flag spam. In essence, the program computes a fingerprint for every incoming message and compares this to existing spam fingerprints in a catalogue server. If there is no match, the message is delivered to the recipient's mailbox. In the case where the recipient feels that the message is spam, the program is directed to nominate the message as spam. If a sufficient number of users have designated the same message as spam, it is forwarded for inclusion in the catalogue server database.
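The fingerprint-and-nominate flow described above can be sketched as follows. This is not the Cloudmark™ or Vipul's Razor™ implementation: the class name, the nomination threshold, and the use of a SHA-256 hash over whitespace-normalized text as the fingerprint are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


class SpamCatalogue:
    """Hypothetical catalogue server that admits a fingerprint after enough votes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._nominations = defaultdict(set)  # fingerprint -> nominating users
        self._catalogue = set()               # fingerprints accepted as spam

    @staticmethod
    def fingerprint(message: str) -> str:
        # Stand-in fingerprint: hash of lower-cased, whitespace-normalized text.
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_known_spam(self, message: str) -> bool:
        return self.fingerprint(message) in self._catalogue

    def nominate(self, message: str, user_id: str) -> None:
        fp = self.fingerprint(message)
        self._nominations[fp].add(user_id)  # each user counts only once
        if len(self._nominations[fp]) >= self.threshold:
            self._catalogue.add(fp)


if __name__ == "__main__":
    catalogue = SpamCatalogue(threshold=3)
    for user in ("alice", "bob", "carol"):
        catalogue.nominate("Buy   cheap watches NOW", user)
    print(catalogue.is_known_spam("buy cheap watches now"))
```

Counting distinct users rather than raw nominations is one simple way to keep a single recipient from forcing a message into the catalogue.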
The problem with client-side detection in general is that network bandwidth has already been consumed to transmit the message. Thus even if the message is ultimately discarded, the recipient will nonetheless have to pay for the cost of transmission, detection, and elimination. Furthermore, history has shown that client-side scanning for spam is easily subverted, by varying the spelling of certain words, adding large amounts of unrelated text, using graphics, and other means. For client-side scanning to become less vulnerable to subversion, one must first solve the open artificial intelligence problem of building an automated system that correctly understands context within free-form text and arbitrary images.
In addition to scanning and trace-back techniques, there have been proposals that would require significant changes to the way the Internet operates. For example, charging small amounts of money for every mail message, or requiring the originating mail server to complete certain computations, have been suggested as ways to make it economically unfeasible to transmit spam. Another idea involves requiring every message to be labeled according to a universal labeling scheme.
The problem with these large-scale changes is that they do not work unless a sufficiently large number of participants opt in at around the same time. As the experience with other protocols that depend on broad adoption, such as IPv6, shows, this is proving rather difficult to achieve.