Computer security software often uses file attributes to help determine whether a file is potentially malicious or contains malware. To detect potentially malicious files, some security software may also use file prevalence to determine whether a computer file is unique or unusual. For example, a file that has a low prevalence among a large and varied set of known files may be highly unusual and, therefore, suspicious. Conversely, a high prevalence score may suggest that a file is common and likely to be legitimate, since many computers or users have the same or a similar file. File prevalence may also be used to determine the scope of a potential threat posed by the proliferation of a malicious file.
Traditionally, a hash function may be applied to computer files to help identify identical files on multiple computing systems and to verify file integrity. File prevalence may then be calculated from file hash values across multiple systems to determine whether files are common to similar systems. However, in some cases, a minor change to a file, or a change of hash algorithm, may result in a vastly different hash value for the file. The resulting hash value may appear uncommon and yield a low file prevalence score even though the file is, in substance, very common. For example, a malicious file may be slightly altered to escape detection by security software that recognizes only certain hash values as malware. Thus, reliance on file hashes may cause false positives in identifying potentially malicious files or may miss actual malicious files due to small differences in the files. The instant disclosure, therefore, identifies and addresses a need for improved systems and methods for identifying malicious computer files.
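By way of illustration only, the sensitivity of hash-based prevalence to small file changes may be sketched as follows. The sketch assumes SHA-256 as the hash function and defines prevalence as the fraction of systems on which a given hash appears; neither choice is specified above, and the names `file_hash`, `hash_prevalence`, `common`, and `variant` are hypothetical.

```python
import hashlib
from collections import Counter


def file_hash(data: bytes) -> str:
    """Return the SHA-256 digest of a file's contents (assumed hash function)."""
    return hashlib.sha256(data).hexdigest()


def hash_prevalence(systems: list[list[bytes]]) -> dict[str, float]:
    """Return, for each file hash, the fraction of systems on which it appears."""
    counts: Counter = Counter()
    for files in systems:
        # Count each distinct hash at most once per system.
        counts.update({file_hash(contents) for contents in files})
    total = len(systems)
    return {digest: seen / total for digest, seen in counts.items()}


# Hypothetical data: three systems hold the same file, while a fourth
# holds a copy that differs by a single appended byte.
common = b"example program contents"
variant = common + b"\x00"

systems = [[common], [common], [common], [variant]]
prevalence = hash_prevalence(systems)

print(prevalence[file_hash(common)])
print(prevalence[file_hash(variant)])
```

In this sketch the one-byte variant receives an entirely different hash value and is therefore scored as a separate, low-prevalence file, even though its contents are nearly identical to those of the common file.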