Malware is a significant problem, and considerable effort is devoted to detecting it automatically. Single-instance files associated with malware typically have two sources: polymorphic and metamorphic malware, which installs a unique instance of the attack on each new computer, and legitimate software, which creates a unique file for each installation. It is impractical for human analysts to investigate each new file detected in the wild. Because malware authors rely on automation to avoid detection, commercial anti-malware companies must likewise rely on automation to detect new malware.
Machine learning and data mining technologies provide tools to improve automated malware detection. Typical approaches classify individual files in isolation and fall into two categories: static analysis of binaries and dynamic analysis of program execution. Conventional algorithms range from computationally efficient methods to more complex static analysis algorithms that can be expensive in processing time and processor power consumption. Other methods improve classification by using a file's reputation in relation to the machines that report the file. Attackers typically use automation to create new malware variants that can bypass anti-virus products. To combat this threat, some systems automatically classify unknown files as malware. Some automated malware classification systems go further and attempt to assign a probability that a file belongs to a specific family of malware. However, false positive rates may be unacceptably high for completely automated classification systems.
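As an illustration only, the following minimal sketch shows one way a system might turn per-family scores into probabilities and decline to decide automatically when no class is confident enough, which is one common way to limit false positives. It is not drawn from any particular product or from the systems described above. The function names, the class labels (benign, family_a, family_b), the "needs_review" outcome, and the 0.9 threshold are all hypothetical.

```python
import math


def family_probabilities(scores):
    """Convert raw per-class scores into probabilities with a softmax.

    `scores` maps a hypothetical class label to an arbitrary real-valued
    score produced by some upstream classifier (not shown here).
    """
    highest = max(scores.values())
    # Subtract the maximum score before exponentiating for numerical stability.
    exps = {label: math.exp(value - highest) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def classify_with_review(scores, threshold=0.9):
    """Return the most probable label, or "needs_review" if it is not confident.

    The threshold is an assumed value chosen for illustration; a real system
    would tune it against an acceptable false positive rate.
    """
    probabilities = family_probabilities(scores)
    best_label = max(probabilities, key=probabilities.get)
    if probabilities[best_label] < threshold:
        return "needs_review", probabilities
    return best_label, probabilities


if __name__ == "__main__":
    # One clearly dominant score and one ambiguous set of scores.
    confident = {"benign": 0.0, "family_a": 5.0, "family_b": 0.0}
    ambiguous = {"benign": 1.0, "family_a": 1.2, "family_b": 0.8}
    print(classify_with_review(confident)[0])
    print(classify_with_review(ambiguous)[0])
```

In this sketch, files whose best class falls below the threshold are routed to a review outcome rather than labeled automatically, trading some automation for fewer confident mistakes.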