In today's world, many companies rely on computing systems and software applications to conduct their business. Computing systems and software applications deal with various aspects of companies' businesses, which can include finances, product development, human resources, customer service, management, and many other aspects. Businesses further rely on communications for a variety of purposes, such as exchange of information, data, and software. Computing systems and software are frequently subject to cyberattacks by viruses, malicious software (malware), and/or other means that can be highly disruptive to their operations. Malware can disrupt computer operations, gather sensitive information, gain access to private computer systems, or the like. Malware is typically defined by its malicious intent and does not include software that causes unintentional harm due to some deficiency.
Malware typically operates in a stealthy mode and can steal information and/or spy on computer users over a period of time, which can be extended. It operates without the knowledge of the users and can cause significant harm, including sabotage of a computing system, extortion of payment, etc. Malware can include, but is not limited to, computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other malicious programs. It can take the form of executable code, scripts, active content, and/or other software. In order to gain access to computing systems, malware is often disguised as, or embedded in, non-malicious files. Occasionally, malware can be found embedded in programs officially supplied by legitimate companies, e.g., programs downloadable from websites, which can be useful or attractive but have hidden tracking functionalities that gather marketing statistics.
A variety of methods have been implemented in the computing world to combat malware and its variants. These include anti-virus and/or anti-malware software, firewalls, etc. These methods can actively and/or passively protect against malicious activity and/or can be used to recover from a malware attack. Training sets are developed for the purposes of training machine learning models that can be used to detect the presence of malicious code in data. To generate such training sets, significant analysis of data and pre-processing activities may need to be performed, which can cause a delay. Further, existing training sets may be large, which may make it difficult to train machine learning models. Thus, there is a need for a way to perform expedient analysis of data, extraction of features contained in the data, generation of a reduced-size training set, and determination of whether malware may exist in the data using such a training set.