Traffic on networks, such as the Internet, has grown significantly in recent years. Monitoring trends and similarities in the characteristics of network traffic can provide useful information for a variety of different entities, such as providers of network services and products. One specific area where this information is of use is in the field of malware detection.
Malware, short for malicious software, is software that is designed for hostile or intrusive purposes. For example, malware may be designed with the intent of gathering information, denying or disrupting operations, accessing resources without authorization, and other abusive purposes. Types of malware include, for example, computer viruses, worms, trojan horses, spyware, adware, and botnets. Malware developers typically distribute their software via the Internet, often clandestinely. As Internet use continues to grow around the world, malware developers have more incentives than ever for releasing this software. In fact, studies indicate that the release rate of malicious software today could even be exceeding that of legitimate software.
To protect computers from malware, there has been a growing demand for anti-malware software, including secure web gateways (SWGs). An SWG is software designed and optimized for controlling whether to permit transmission of incoming or outgoing content on a network. SWGs are typically installed locally in a corporate office or other entity that expects incoming and outgoing network traffic. SWGs assign scores to traffic destinations and/or origination points based on how suspicious each destination or origination point appears. These scores are then used to determine whether to allow transmission of packets to a particular destination or from a particular origination point.
Botnets are one example of malware that has become a major security threat in recent years. A botnet is a network of “innocent” host computers that have been infected with malicious software in such a way that a remote attacker is able to control the host computers. The malicious software used to infect the host computers is referred to as a “bot,” which is short for “robot.” Botnets operate under a command and control (C&C) architecture, where a remote attacker is able to control the infected computers, often referred to as “zombie” computers. An attacker may control the infected computers to carry out online anti-social or criminal activities, such as e-mail spam, click fraud, distributed denial-of-service (DDoS) attacks, or identity theft.
FIG. 1 illustrates an exemplary C&C architecture of a botnet 100. The C&C master 101, often referred to as a “botmaster” or “bot herder,” distributes malicious bot software, typically over the Internet 102. This bot software stores information or has an algorithm identifying a future time and domain names to contact at the indicated future time. The bot software infects a number of host computers 103 causing them to become compromised. Users of host computers 103 typically do not know that the bot software is running on their computers. C&C master 101 also registers temporary domain names to be used as C&C servers 104. Then, at the indicated future time, the bots instruct host computers 103 to contact C&C servers 104 to get instructions. The instructions are sent over a C&C channel. The ability to send instructions to host computers 103 provides C&C master 101 with control over a large number of host computers. This enables C&C master 101 to generate huge volumes of network traffic, which can be used for e-mailing spam messages, shutting down or slowing web sites through DDoS attacks, or other purposes.
Botnets exploit the domain name system (DNS) to rally infected host computers. DNS is the Internet's hierarchical lookup service for mapping character-based domain names meaningful to humans into numerical Internet Protocol (IP) addresses. Domains exist at various different levels within the DNS hierarchy. For example, a top-level domain (TLD), such as .com or .net, is a domain at the highest level in the DNS hierarchy. A second-level domain (SLD) is a subdomain of a TLD that is directly below the TLD in the DNS hierarchy. For example, “com” is the TLD and “example” is the SLD for the domain name “www.example.com.”
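As a minimal illustration of the labeling described above, the following Python sketch splits a domain name into its TLD and SLD. The function name "split_levels" is hypothetical, and the sketch assumes that the TLD is the single right-most label, so it does not handle multi-label suffixes.

```python
def split_levels(domain_name):
    """Return (tld, sld) for a dotted domain name.

    Assumption for this sketch: the TLD is the single right-most label
    and the SLD is the label immediately to its left.
    """
    # Drop an optional trailing dot and normalize case before splitting.
    labels = domain_name.rstrip(".").lower().split(".")
    if len(labels) < 2:
        raise ValueError("expected at least two labels, e.g. 'example.com'")
    return labels[-1], labels[-2]


tld, sld = split_levels("www.example.com")
```

For “www.example.com,” the sketch returns “com” as the TLD and “example” as the SLD, matching the example in the text.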
A name server is a server that translates domain names into IP addresses. Each domain has at least one authoritative DNS name server that publishes information about the domain. Domain name resolvers determine appropriate domain name servers for a domain name by performing a sequence of queries beginning with the right-most domain, which is the TLD. In domain name resolution, a query is submitted to one of the root servers to find the authoritative server for the TLD. A query is then submitted to the TLD server for the address of an authoritative server for the second-level domain. This process is continued through the levels of the DNS hierarchy until the IP address sought is returned. For example, in resolving a query for “www.example.com,” a query would be submitted to one of the root servers to find the authoritative server for “com.” A query would then be sent to the server for “com” requesting the address of the authoritative server for “example.com.” A query would then be sent to the server for “example.com,” and this server would respond with the IP address corresponding to “www.example.com.”
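The sequence of queries described above can be sketched in Python as a walk over a small in-memory table. This is only a simulation under stated assumptions: the server names, the "SERVERS" table layout, and the address 192.0.2.10 (from a range reserved for documentation) are all hypothetical, and no real DNS traffic is sent.

```python
# Hypothetical stand-in for the DNS hierarchy. Each server either delegates
# a domain to another server or holds the final address for a full name.
SERVERS = {
    "root": {"delegations": {"com": "com-server"}, "addresses": {}},
    "com-server": {"delegations": {"example.com": "example-server"}, "addresses": {}},
    "example-server": {"delegations": {}, "addresses": {"www.example.com": "192.0.2.10"}},
}


def resolve(name):
    """Resolve `name` starting at the root; return (address_or_None, trace).

    `trace` lists each (server, domain asked about) pair in query order.
    """
    name = name.rstrip(".").lower()
    labels = name.split(".")
    server = "root"
    trace = []
    # Work through the name right to left: "com", "example.com", ...
    for i in range(len(labels) - 1, -1, -1):
        suffix = ".".join(labels[i:])
        trace.append((server, suffix))
        zone = SERVERS[server]
        # The server for the last level answers with the address itself.
        if suffix == name and suffix in zone["addresses"]:
            return zone["addresses"][suffix], trace
        next_server = zone["delegations"].get(suffix)
        if next_server is None:
            return None, trace  # no such domain at this level
        server = next_server
    return None, trace


address, trace = resolve("www.example.com")
```

In this sketch, resolving “www.example.com” queries the root server, then the server for “com,” and then the server for “example.com,” which returns the address. A name whose second-level domain is not delegated stops at the “com” server with no address.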
Many C&C masters dynamically change the IP addresses associated with the domain names of the C&C servers to avoid detection. Infecting the host computers with bots containing domain names of the C&C servers allows the host computers to contact the appropriate C&C servers through DNS resolution, even if the IP addresses of the C&C servers have changed. Thus, bots may locate C&C servers according to their domain names. Some remote attackers also change the domain names of the C&C servers to even further avoid detection. Nevertheless, bots in a botnet usually act as a group, sending periodic DNS queries to join C&C channels. Because bots within the same botnet are likely to generate similar DNS traffic, analyzing DNS traffic data to detect this behavior is an effective way of detecting botnets.
When a network device queries for a domain name, DNS resolution determines whether or not the domain name exists. These queries leave resource request information indicating the network device requesting the domain name. Some botnets frequently change the domain names they use for resolution, and these domain names may even be randomly assigned. When this is the case, each bot is programmed to query for a multitude of domain names in hopes that the C&C master has registered at least one of them via DNS. Because the C&C master will likely register only a few of these domain names at most, the rest of the domain name requests will be for domains that do not exist. These requests may return nonexistent domain (NXDomain) replies and leave NXDomain records in the DNS. Alternatively, if a requested domain name does exist, the request may return an existent domain (YXDomain) reply and leave a YXDomain record in the DNS.
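To make the resource request information concrete, the following Python sketch groups the domain names that drew an NXDomain reply by the requesting network device. The record layout (requester, domain name, reply), the reply strings, and the function name "nxdomains_by_client" are assumptions made for illustration, and the sample records are made up.

```python
from collections import defaultdict


def nxdomains_by_client(records):
    """Map each requester to the set of names that returned NXDomain.

    `records` is an iterable of (requester, domain_name, reply) tuples;
    this layout is an assumption of the sketch, not a standard log format.
    """
    grouped = defaultdict(set)
    for requester, domain_name, reply in records:
        if reply == "NXDOMAIN":
            grouped[requester].add(domain_name.lower())
    return dict(grouped)


# Made-up sample records using reserved documentation addresses and names.
sample_records = [
    ("192.0.2.1", "qzxv1.example", "NXDOMAIN"),
    ("192.0.2.1", "qzxv2.example", "NXDOMAIN"),
    ("192.0.2.1", "www.example.com", "NOERROR"),
    ("192.0.2.2", "qzxv1.example", "NXDOMAIN"),
]

grouped = nxdomains_by_client(sample_records)
```

Here the two requesters share one nonexistent name, which is the kind of overlap that a similarity computation over NXDomain data would look for.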
Approaches have been taken to utilize DNS record data to detect suspicious DNS behavior. However, these approaches have been conducted by computing a similarity matrix using small sets of data. Analyzing a larger set of data would provide for a more accurate and comprehensive means of detecting suspicious DNS behavior. Nevertheless, computing similarity of DNS behavior with a large amount of data, such as the NXDomain data from a TLD, requires a large amount of processing power and can take a long time.
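One way to picture a similarity matrix over NXDomain data is the following Python sketch. The text does not name a similarity measure, so the Jaccard index used here is an assumption, as are the function names and the made-up input sets. The sketch compares every pair of requesters, so n requesters need n(n-1)/2 comparisons, which illustrates why the computation grows costly on large data sets.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(nx_sets):
    """Return a nested dict of pairwise similarities between requesters.

    `nx_sets` maps a requester to the set of names it queried that
    returned NXDomain (an assumed input shape for this sketch).
    """
    requesters = sorted(nx_sets)
    matrix = {r: {r: 1.0} for r in requesters}
    # Every unordered pair is compared once and stored symmetrically.
    for a, b in combinations(requesters, 2):
        score = jaccard(nx_sets[a], nx_sets[b])
        matrix[a][b] = score
        matrix[b][a] = score
    return matrix


# Made-up input: two requesters overlap heavily, a third does not.
nx_sets = {
    "host-a": {"n1.example", "n2.example", "n3.example"},
    "host-b": {"n2.example", "n3.example", "n4.example"},
    "host-c": {"other.example"},
}
matrix = similarity_matrix(nx_sets)
```

With these inputs, "host-a" and "host-b" share two of four distinct names, while "host-c" shares none with either.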
As botnet domain names and IP addresses can change quickly over time, it is important to compute this similarity as quickly as possible. An effective system performing this type of similarity matrix computation on a large amount of data would therefore be difficult and expensive to implement. What is needed is a scalable approach that detects network traffic exhibiting similarities associated with specific behaviors, such as suspicious DNS behavior, and that can quickly perform similarity computation on a large amount of data with reduced processing requirements.