Byte-distribution analysis is a statistical technique that has been used to classify digital data. It generally involves examining a binary file in terms of its constituent bytes. That is, a binary file is a sequence of bytes, each having a value i in the range 0 ≤ i ≤ 255, and each byte value i has a frequency of occurrence, f_i, within the file. Byte-distribution analysis uses the histogram of frequencies f_i, 0 ≤ i ≤ 255, to classify a file.
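By way of illustration only, the byte-frequency histogram described above may be computed as in the following minimal Python sketch. The function names byte_histogram and relative_frequencies are hypothetical and are not taken from any cited implementation; how the resulting histogram is used to classify a file is not shown here.

```python
def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-entry list whose entry i is the number of
    occurrences of byte value i in data."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


def relative_frequencies(data: bytes) -> list[float]:
    """Return the histogram normalized by the file length, so that
    files of different sizes can be compared. An empty input yields
    all zeros (an assumption made for this sketch)."""
    counts = byte_histogram(data)
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [c / total for c in counts]
```

For example, byte_histogram applied to the contents of a file read in binary mode yields one count per possible byte value, and the counts sum to the file length.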
Byte-distribution analysis is described in Abou-Assaleh, T., Cercone, N., Keselj, V. and Sweidan, R., N-gram based Detection of New Malicious Code, Proceedings of the 28th Annual International Computer Software and Applications Conference, IEEE, 2004. N-gram analysis is a generalization of byte-distribution analysis to sequences of N consecutive bytes (i_1, i_2, ..., i_N).
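By way of illustration only, N-gram frequencies may be gathered by sliding a window of N bytes across the file, one byte at a time, as in the following minimal Python sketch. The function name ngram_counts is hypothetical, and the sketch is not a reproduction of the method of the cited reference; with N = 1 it reduces to the single-byte histogram.

```python
from collections import Counter


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count every run of n consecutive bytes in data.

    Keys are bytes objects of length n; values are occurrence counts.
    Windows overlap, so a file of length L yields L - n + 1 N-grams
    (none if the file is shorter than n).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(data[k:k + n] for k in range(len(data) - n + 1))
```

Because the number of possible N-grams is 256 to the power N, the sketch stores only the N-grams actually observed rather than a fixed-size array.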
Prior art implementations of byte-distribution analysis for security analysis of files have not been sufficiently robust and accurate to make their way into commercial products. Such implementations suffer from false negatives and false positives. False negatives are malicious files that elude detection, and false positives are non-malicious files that are reported as being malicious. It is thus desirable to find an implementation of byte-distribution analysis whose rates of false negatives and false positives are low enough to warrant its commercial use.