Learning-based detectors identify malicious communication using generic features that describe the communication. For example, features extracted from proxy log attributes can be used in training to discriminate between malicious and legitimate Hypertext Transfer Protocol (HTTP) requests.
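As an illustration of what such generic features might look like, the following minimal sketch maps one proxy log record to a few numeric values. The record layout (the `url`, `bytes_up`, and `bytes_down` keys) and the particular features are assumptions made for this example only; the text above does not specify which proxy log attributes or features are used.

```python
from urllib.parse import urlsplit, parse_qsl


def extract_features(record: dict) -> dict:
    """Map one proxy-log record to a few generic numeric features.

    The record keys and the chosen features are hypothetical; they only
    illustrate the idea of describing a request without using labels.
    """
    url = record["url"]
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Non-empty path segments, e.g. "/a/b/c.php" has three.
    segments = [s for s in parts.path.split("/") if s]
    # Query parameters as (key, value) pairs, keeping empty values.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_length": len(url),
        "path_depth": len(segments),
        "query_param_count": len(params),
        "host_digit_ratio": sum(c.isdigit() for c in host) / max(len(host), 1),
        "bytes_up": record.get("bytes_up", 0),
        "bytes_down": record.get("bytes_down", 0),
    }


example_record = {
    "url": "http://example.com/a/b/c.php?id=1&x=2",
    "bytes_up": 100,
    "bytes_down": 2000,
}
example_features = extract_features(example_record)
```

A feature vector of this kind could then be passed to any supervised classifier, provided that labeled samples are available.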
A problem for supervised training in network security is the limited availability of a sufficiently large and representative dataset of labeled malicious and legitimate samples. Labels are expensive to obtain because the labeling process involves forensic analysis performed by security experts. Sometimes labels cannot be assigned at all, especially if the context of the network communication is small or unknown and the assignment is desired at the proxy-log level.
Furthermore, a labeled dataset becomes obsolete quickly, within weeks or months, because malware evolves constantly. As a compromise, domain-level labeling has frequently been adopted by compiling blacklists of malicious domains registered by attackers. Domain blacklists can be used to block network communication based on the domain of the destination Uniform Resource Locator (URL) in the proxy log. However, malicious domains typically change frequently as a basic detection evasion technique. Even though the domains might change, the other parts of the HTTP request (and the behavior of the malware) remain the same or similar.
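The limitation of domain blacklists described above can be made concrete with a small sketch. The blacklist contents, the example URLs, and the `request_signature` helper are all hypothetical and chosen only for illustration; the signature shown (URL path plus sorted query parameter names) is one possible way to capture the parts of a request that stay the same, not a method stated in the text.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical blacklist of known malicious domains.
BLACKLIST = {"bad-domain.example"}


def is_blocked(url: str, blacklist=BLACKLIST) -> bool:
    """Return True if the destination domain of the URL is blacklisted."""
    host = (urlsplit(url).hostname or "").lower()
    return host in blacklist


def request_signature(url: str) -> tuple:
    """Describe a request by its path and query parameter names only.

    The domain and the parameter values are deliberately ignored, so the
    signature survives a change of domain.
    """
    parts = urlsplit(url)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return parts.path, tuple(keys)


# The same (hypothetical) malware before and after it rotates its domain.
old_url = "http://bad-domain.example/gate.php?id=42&v=3"
new_url = "http://fresh-domain.example/gate.php?id=77&v=3"

blocked_old = is_blocked(old_url)
blocked_new = is_blocked(new_url)
same_shape = request_signature(old_url) == request_signature(new_url)
```

In this sketch the blacklist matches only the old URL, while the two requests still share the same path and parameter names. This is the situation in which features of the whole request carry information that a domain lookup misses.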