The clustering of executable files into sets can be used for file classification. For example, clustering can be used to classify files as malware or goodware. To do so, in a training phase, a set (cluster) of files known to comprise malware is labeled accordingly, as is a set (cluster) of files known to comprise goodware. Files of unknown status (samples) can then be classified as belonging to one cluster or another based on properties of the samples and the files in the clusters. This categorization can be more specific than a binary classification of a file as malicious (malware) versus benign (goodware). Multiple clusters of files can be used, for example clusters of files known to belong to different families of malware, each specifically labeled per family. A family of malware typically includes different files resulting from modifications made to a common parent codebase.
It can be assumed that different samples belonging to a classification or family will exhibit similar runtime behavior, and hence the similarity in runtime behavior can be used to classify a sample. Therefore, to classify a sample into a specific cluster, the runtime behavior of the sample is determined. The sample is then added to one of the clusters, based on the similarity of its runtime behavior to that of the members of the cluster. In other words, the sample is assigned to the cluster whose members' behavior most closely matches its own. The appropriate label is added to the sample, according to the cluster-based classification of the sample. In this context, a label comprises an indicator of the classification of the sample (e.g., malware, goodware, malware of a specific family, etc.).
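The assignment step described above can be sketched as follows. This is a minimal illustration only, not a definitive implementation: it assumes that runtime behavior is represented as a set of observed event names, that similarity is measured with the Jaccard index, and that a sample is compared against the average similarity to a cluster's members. None of these choices is specified in the text, and the names `jaccard`, `classify`, and the event strings are hypothetical.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two behavior sets (assumed representation)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty behavior sets are treated as identical
    return len(a & b) / len(a | b)


def classify(sample, clusters):
    """Return (label, score) of the cluster whose members are, on average,
    most similar to the sample.

    clusters maps a label (e.g. "malware") to a list of behavior sets.
    Empty clusters are skipped.
    """
    scores = {
        label: mean(jaccard(sample, member) for member in members)
        for label, members in clusters.items()
        if members
    }
    if not scores:
        raise ValueError("no non-empty clusters to compare against")
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training clusters and an unknown sample.
clusters = {
    "malware": [
        {"write_registry", "inject_process"},
        {"inject_process", "open_socket"},
    ],
    "goodware": [
        {"read_file", "draw_window"},
    ],
}
sample = {"inject_process", "open_socket", "write_registry"}
label, score = classify(sample, clusters)
```

Here the sample shares behavior only with the members of the cluster labeled "malware", so it receives that label. Other similarity measures or linkage rules (for example, nearest member rather than average) could be substituted without changing the overall idea.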
Not all training clusters, however, are created equal. For example, if a cluster, C1, contains only malware samples and another cluster, C2, contains half malware and half goodware samples, then C1 is of better quality than C2. Moreover, not all cluster-based classifications of unknown samples are equally reliable. For example, if a first sample, S1, is assigned to a better quality cluster than a second sample, S2, then the classification of S1 is typically more reliable than that of S2.
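One way to make the C1 versus C2 comparison concrete is to compute the fraction of a cluster's members that carry its most common label, often called purity. The sketch below uses purity purely as an illustrative assumption; the text does not name a particular quality measure, and the function name `purity` is hypothetical.

```python
from collections import Counter


def purity(labels):
    """Fraction of members carrying the cluster's most common label.

    labels is a list of known training labels for one cluster.
    """
    if not labels:
        raise ValueError("cluster has no labeled members")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the dominant label
    return counts.most_common(1)[0][1] / len(labels)


# C1 contains only malware; C2 is half malware, half goodware.
c1 = ["malware", "malware", "malware", "malware"]
c2 = ["malware", "malware", "goodware", "goodware"]

quality_c1 = purity(c1)  # 1.0
quality_c2 = purity(c2)  # 0.5
```

Under this assumed measure, a sample assigned to C1 would receive a more reliable classification than a sample assigned to C2, consistent with the S1 and S2 example.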
It would be desirable to address these issues.