Via the Internet, individuals and organizations with malicious intent distribute software that damages computer systems and/or is used to steal the personal information of users (including individual users or entities such as companies). Such malicious software, or malware, often exploits code vulnerabilities and/or gets installed onto users' computer systems by tricking end users (that is, socially engineering them) into taking some action.
One particular exploit is to create malicious input files in well-known document formats, such as malicious Microsoft Word or .pdf documents, and trick users into opening them. Once opened, the malicious input files, typically by exploiting vulnerabilities in the application that opens them, run and/or plant executable code that gives malware authors illicit control of their victims' computers and opens those systems to attack.
Moreover, these malicious input files are one of the biggest sources of re-infections. A re-infection may be generally defined as a recurrence of a malware threat with similar characteristics within a short period of time after the threat is believed to have been successfully removed.
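By way of illustration only, the general definition of a re-infection given above can be expressed as a simple predicate. The following is a minimal sketch, not a description of any particular product: the `ThreatEvent` record, the `is_reinfection` function, the use of a threat-family label as a stand-in for "similar characteristics," the restriction to a single machine, and the 24-hour default window are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ThreatEvent:
    """Hypothetical record of a threat being removed from or detected on a machine."""
    machine_id: str
    threat_family: str  # assumed stand-in for "similar characteristics"
    timestamp: datetime


def is_reinfection(removal: ThreatEvent,
                   detection: ThreatEvent,
                   window: timedelta = timedelta(hours=24)) -> bool:
    """Return True if `detection` looks like a recurrence of the removed threat.

    The 24-hour window is an assumed value for the "short period of time";
    the text does not specify one.
    """
    same_machine = removal.machine_id == detection.machine_id
    similar = removal.threat_family == detection.threat_family
    elapsed = detection.timestamp - removal.timestamp
    # The detection must come after the removal and fall inside the window.
    return same_machine and similar and timedelta(0) <= elapsed <= window


removed = ThreatEvent("pc-1", "family-a", datetime(2024, 1, 1, 9, 0))
seen_again = ThreatEvent("pc-1", "family-a", datetime(2024, 1, 1, 15, 0))
seen_much_later = ThreatEvent("pc-1", "family-a", datetime(2024, 2, 1, 9, 0))
```

Under these assumptions, `is_reinfection(removed, seen_again)` is true because the same family reappears on the same machine six hours after removal, whereas `is_reinfection(removed, seen_much_later)` is false because the recurrence falls outside the window.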
In order to protect users, anti-malware vendors need to get samples of these malicious input files for analysis. In general, the more rapidly the files are obtained the better, so that remedial actions may be taken and other users may be protected.
However, heretofore there has been no effective, rapid mechanism for distinguishing the small number of newly-created malicious input files from the vast number of new non-malicious input files that continuously appear across the Internet, so as to acquire samples of only the malicious ones for analysis. As a result, common scenarios in which malware continually attacks the same machine in this way lead to a degraded user experience from repeated notifications, and to wasted system and network resources from repeatedly addressing the infection rather than the root cause, namely the malicious input files.
Still further, as virtualized distributed environments become more prevalent, there exists a gap in preventing infection across the machines of such environments based on information collected from a subset of those machines. For example, if a malicious input file is discovered on only one particular machine, that discovery is not used to inform the other machines, and thus they risk becoming infected.