Dictionary domain generation algorithms (DGAs) are algorithms that randomly concatenate words drawn from English dictionaries or other dictionaries to create new internet domain names. Dictionary DGAs are used in a variety of contexts, including electronic advertising networks and malware.
Malware is any software, script, or program that is designed to damage a computer or network by, for example, disrupting computing functions, performing a denial of service, destroying data, redirecting internet traffic, sending spam, stealing data, or performing other malicious activities. Malware is often installed on a computer or device automatically, without user knowledge. Typical attack methods include viruses, phishing emails, worms, or bots launched through a network.
Networks may involve hundreds or thousands of devices accessing tens of thousands or even millions of websites per day. In environments containing a high volume of requests to visit websites, where any site is potentially malicious, providing real-time network security is a challenge. Network administrators may lack the resources to address this problem and look to services that provide security from malware.
To install malware, an internet domain name may be sent to a device and, when the device accesses the domain name, malware may automatically download and install onto the device. Some types of malware may then contact a command and control network located at the same internet domain or another internet domain, receive instructions from the command and control network, and perform malicious activities.
Some malware leads users to domains whose names comprise randomly generated strings of unrelated characters, i.e., names produced by character-based DGAs. Such DGAs may use randomization seeds to generate these names in real time. Character-based DGA domain names may be easily spotted by a human user as potentially unsafe or may be easily detected by models.
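As an illustration of the seeded generation described above, the following is a minimal sketch of a character-based generator, not an algorithm taken from any particular malware. The function name `char_dga`, the label length, the count, and the ".com" suffix are all assumptions made for the example. It shows why a party that knows both the algorithm and the seed can reproduce the same list of names.

```python
import random
import string


def char_dga(seed, count=3, length=12, tld=".com"):
    """Return `count` pseudo-random domain names derived from `seed`.

    Hypothetical toy generator: the same seed always yields the same names.
    """
    rng = random.Random(seed)  # private generator, so the global state is untouched
    names = []
    for _ in range(count):
        label = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        names.append(label + tld)
    return names


if __name__ == "__main__":
    # Two runs with the same seed produce identical lists.
    print(char_dga(2024) == char_dga(2024))
```

The resulting labels are strings of unrelated letters, which is what makes this class of names comparatively easy to recognize.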
Other malware may use dictionary DGAs. As compared to character-based DGA domain names, dictionary DGA domain names are harder for human users to identify as potentially unsafe. As an example, a dictionary DGA may perform a randomization process to identify the words “look” and “hurt,” then combine these words to generate “lookhurt.com.” Dictionary DGA domain names may appear to be authentic to unsuspecting users because component words may be related to each other, related to known concepts, related to real events, or related to other aspects of the real world. Dictionary DGAs may have varying levels of sophistication to lure users into visiting these websites.
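The word-combination step described above may be sketched as follows. This is a simplified illustration under stated assumptions rather than the algorithm of any real dictionary DGA: the word list `WORDS`, the function name `dictionary_dga`, the two-word format, and the ".com" suffix are hypothetical.

```python
import random

# Hypothetical word list; a real dictionary DGA may draw on thousands of words.
WORDS = ["look", "hurt", "table", "river", "stone", "light"]


def dictionary_dga(seed, count=3, words_per_name=2, tld=".com"):
    """Return `count` domain names, each a concatenation of distinct dictionary words."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    names = []
    for _ in range(count):
        # sample() picks distinct words in random order, e.g. "look" + "hurt"
        label = "".join(rng.sample(WORDS, words_per_name))
        names.append(label + tld)
    return names


if __name__ == "__main__":
    for name in dictionary_dga(seed=1):
        print(name)
```

Because every label is built from ordinary words, the output has character statistics close to those of legitimate domain names, which is the property that makes these names harder to identify.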
Dictionary DGA domain names present challenges to network administrators. Conventional approaches to detecting dictionary DGA domain names are inaccurate, and, hence, services that are available to network administrators may be unable to provide adequate security.
Conventional approaches to network security from malware that uses DGAs may involve reverse engineering a piece of malware and identifying its DGA and seed. Other conventional approaches may involve developing blacklists, i.e., lists of known malicious sites that network users are prevented from accessing. Blacklisting and reverse engineering may be slow or infeasible. While various groups or agencies may share blacklists or the results of reverse engineering to increase a pool of known malicious sites, these approaches have limited efficacy in preventing users from accessing unknown malicious sites. The status of a site as either malicious or benign may change; previously malicious sites may become benign, and static blacklists may not keep pace with changes. Further, new DGA domain names may be generated nearly instantaneously in bulk, with a single malware sample producing hundreds to tens of thousands of domain names per day, easily outpacing the rate at which malware may be reverse engineered or at which blacklists may grow.
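The limitation of a static blacklist can be seen in a minimal sketch. The set `BLACKLIST`, its entries, and the function `is_blocked` are hypothetical and chosen only for illustration; real blacklist services differ in format and matching rules.

```python
# Hypothetical static blacklist of known malicious domain names.
BLACKLIST = {"lookhurt.com", "stonelight.com"}


def is_blocked(domain, blacklist=BLACKLIST):
    """Return True only if the normalized domain is an exact blacklist entry."""
    normalized = domain.strip().lower().rstrip(".")  # ignore case and a trailing dot
    return normalized in blacklist


if __name__ == "__main__":
    print(is_blocked("LookHurt.com"))  # a known name is caught
    print(is_blocked("hurtlook.com"))  # a freshly generated name is not
```

An exact-match lookup of this kind blocks only names that have already been observed and listed, so a name generated after the list was compiled passes through unchecked.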
Another conventional approach to detecting DGAs may involve making predictions based on models. For example, random forest classifiers and clustering techniques may be used. These conventional approaches require identifying manually selected features, such as entropy, string length, vowel-to-consonant ratio, the number of English words present in the domain, or the like. Other conventional approaches involve using natural language models. Conventional approaches based on random forest models, clustering models, or natural language models made to target generic DGA domains have some success predicting character-based DGA domain names, but have not been shown to perform well when predicting dictionary DGA domain names.
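The manually selected features named above may be computed along the following lines. This is a sketch under stated assumptions, not a feature set taken from any particular classifier: the function names, the small word set `WORDS`, the use of the leftmost label only, and the substring-based word count are all hypothetical choices.

```python
import math
from collections import Counter

# Hypothetical word set used for the "number of English words" feature.
WORDS = {"look", "hurt", "table", "river"}
VOWELS = set("aeiou")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def domain_features(domain):
    """Compute hand-picked features from the leftmost label of a domain name."""
    label = domain.lower().split(".")[0]
    vowels = sum(ch in VOWELS for ch in label)
    consonants = sum(ch.isalpha() and ch not in VOWELS for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_consonant_ratio": vowels / consonants if consonants else 0.0,
        "word_count": sum(1 for word in WORDS if word in label),
    }


if __name__ == "__main__":
    print(domain_features("lookhurt.com"))
```

A feature vector of this kind could be passed to a classifier such as a random forest. Because a dictionary DGA name is built from real words, its entropy and vowel-to-consonant ratio tend to resemble those of benign names, which is consistent with the weaker performance noted above.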
In view of the shortcomings of current systems and methods for detecting dictionary DGA domain names, an unconventional method with improved accuracy, flexibility, and ability to handle high-volume requests is desired. Further, because conventional approaches are inaccurate, and because many network administrators lack the resources to address dictionary DGAs on their own, there is a need for accurate dictionary DGA detection as a service.