Currently, the number of software applications that may be installed on user devices (e.g., personal computer, smartphone, tablet, etc.) is growing significantly and the number of files that may be created by these applications is rising exponentially. Certain files which are created by the software applications upon installation and operation of the application are unique, i.e., the files may exist as a single copy. It is very difficult to categorize such files without performing a detailed analysis of their contents.
Often, these files can be images of parent assemblies in machine code (i.e., native images), which are part of the .NET technology. A .NET application may be created using a certain number of assemblies together, where an assembly is a binary file serviced by a Common Language Runtime (“CLR”) environment. A .NET assembly includes the following elements:
a portable executable (“PE”) file header;
a CLR header;
Common Intermediate Language (“CIL”) code;
metadata of the types used in the assembly (e.g., classes, interfaces, structures, enumerations, delegates);
a manifest of the assembly; and
additional built-in resources.
In general, the PE header identifies that the assembly can be loaded and executed in operating systems of the Windows® family. The PE header also identifies the type of application (e.g., console application, application with graphic user interface, code library and the like).
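For illustration only, the following Python sketch reads the header fields that carry this information. It relies on the publicly documented PE/COFF layout (the offset of the PE signature at 0x3C, the Machine and Characteristics fields of the COFF header, and the Subsystem field at offset 68 of the optional header). The helper names `describe_pe` and `build_minimal_header` are hypothetical, and the header built in the usage example is a synthetic stub for demonstration, not a loadable file.

```python
import struct

MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64"}
SUBSYSTEM_NAMES = {2: "GUI application", 3: "console application"}
FORMAT_NAMES = {0x10B: "PE32", 0x20B: "PE32+"}
IMAGE_FILE_DLL = 0x2000  # Characteristics flag marking a code library


def describe_pe(data: bytes) -> dict:
    """Return the target machine, image format and application type of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte PE signature.
    machine, _, _, _, _, _, characteristics = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    optional = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, optional)
    (subsystem,) = struct.unpack_from("<H", data, optional + 68)
    if characteristics & IMAGE_FILE_DLL:
        kind = "code library"
    else:
        kind = SUBSYSTEM_NAMES.get(subsystem, "other")
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "format": FORMAT_NAMES.get(magic, "unknown"),
        "kind": kind,
    }


def build_minimal_header(machine: int, subsystem: int, characteristics: int = 0) -> bytes:
    """Build a synthetic header stub (hypothetical helper, for demonstration only)."""
    pe_offset = 0x80
    data = bytearray(pe_offset + 24 + 70)
    data[0:2] = b"MZ"
    struct.pack_into("<I", data, 0x3C, pe_offset)
    data[pe_offset:pe_offset + 4] = b"PE\0\0"
    struct.pack_into("<HHIIIHH", data, pe_offset + 4, machine, 0, 0, 0, 0, 0, characteristics)
    struct.pack_into("<H", data, pe_offset + 24, 0x10B if machine == 0x014C else 0x20B)
    struct.pack_into("<H", data, pe_offset + 24 + 68, subsystem)
    return bytes(data)


if __name__ == "__main__":
    print(describe_pe(build_minimal_header(0x8664, 3)))
```

On a real assembly, the same function could be applied to the first few hundred bytes read from the *.exe or *.dll file.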
The CLR header constitutes data that can support all .NET assemblies so that they can be maintained in the CLR environment. The CLR header contains such data as flags, CLR versions, entry point (e.g., in a particular instance, the address for the beginning of the function Main( )), which allows the executing environment to determine the makeup of the file being managed (i.e., a file containing managed code).
Each assembly contains CIL code, which is an intermediate code not dependent on the processor. During execution, the CIL code is compiled in real time mode by a JIT (just in time, i.e., dynamic compilation) compiler into instructions corresponding to the requirements of the specific processor.
In any given assembly, there is also metadata that fully describes the format of the types (e.g., classes, interfaces, structures, enumerations, delegates, etc.) present within the assembly, as well as external types to which the assembly makes reference (i.e., types described in other assemblies). In the executable environment, the metadata is used to determine the location of types in the binary file, for the placement of the types in memory, and to simplify the process of a remote call for the methods of the types.
The assembly may also contain a manifest, which describes each module making up the assembly, the version of the assembly, and also any external assemblies to which the current assembly makes reference. The manifest also contains all metadata needed to specify the requirements of the assembly for versions and the identity of the assembly, as well as all the metadata needed to determine the scope of the assembly and to allow links to resources and classes. The following table shows the data contained in the manifest of an assembly. The first four elements—name of the assembly, version number, language and regional parameters, as well as the strong name data—constitute the identity of the assembly.
Information | Description
Name of Assembly | Text line giving the name of the assembly.
Version Number | Main and supplemental version numbers, revision number and build number. The CLR environment uses them to apply the version management policy.
Language and Regional Parameters | Information on languages or regional parameters supported by the assembly. This information should be used only to designate the assembly as an accompanying assembly containing information about the language or regional parameters (an assembly containing information about language and regional parameters is automatically considered an accompanying assembly).
Strong Name Data | The publisher's public key, if a strong name is assigned to the assembly.
List of All Files of the Assembly | Hash and name of each file making up the assembly. All files entering into the assembly should be located in the same folder as the file with the manifest of the assembly.
Information on Links to Types | Information used by the execution environment to compare the links to types with the files containing their declarations and implementations. This involves types which are exported by the assembly.
Information on Links to Assemblies | A list of other assemblies for which there are static links from the given assembly. Each link includes the name of the dependent assembly, the metadata of the assembly (version, language and regional parameters, operating system, etc.) and the public key, if the assembly has a strong name.
Any .NET assembly may contain any given number of embedded resources, such as application icons, graphic files, audio fragments or string tables.
An assembly can consist of several modules. A module is a part of an assembly, i.e., a logical collection of code or resources. The hierarchy of entities used in the assembly is: assembly>module>type (classes, interfaces, structures, enumerations, delegates)>method. A module can be internal (i.e., inside a file of the assembly) or external (i.e., a separate file). A module does not have an entry point, nor does it have any individual version number, and therefore it cannot be loaded directly by the CLR environment. Modules can only be loaded by the main module of the assembly, such as a file containing the manifest of the assembly. The manifest of the module contains only an enumeration of all external assemblies. Each module has a Module Version Identifier (“MVID”), which is a unique identifier written out in each module of the assembly, which changes during each compilation.
FIG. 1A illustrates an exemplary layout of a single-file assembly. As shown, in single-file assemblies, all required elements (e.g., headers, CIL code, metadata of types, the manifest and resources) are situated inside a single file *.exe or *.dll.
FIG. 1B illustrates an example of a multiple-file assembly. A multiple-file assembly consists of a set of .NET modules that are deployed in the form of a single logical unit and provided with the same version number. Typically, one of these modules is called the main module and contains the manifest of the assembly and also may contain all necessary CIL instructions, metadata, headers and additional resources.
The manifest of the main module describes all the other related modules on which the operation of the main module depends. The secondary modules in a multiple-file assembly may be assigned the extension *.netmodule. The secondary *.netmodule modules also contain CIL code and metadata of types, as well as a manifest of the level of the module, in which the external assemblies needed by the given module are enumerated.
As with any PE file, an assembly can be signed with a digital signature (e.g., an X.509 certificate-based signature) that is situated in the overlay of the PE file or in a digitally signed catalog file (.cat). A StrongName signature, i.e., a hash generated by using the contents of the assembly and the RSA private key, can be used in addition or separately. The hash is situated in the assembly between the PE header and the metadata. The hash makes it possible to verify that the assembly has not changed since it was compiled. For a single-file assembly, free bytes are left after the PE header when the file is compiled. The hash of the file is then calculated using the private key, and the resulting hash is entered into these free bytes.
The technology is different for multiple-file assemblies. Besides the hash of the main file of the assembly, hashes are also calculated for the external modules, after which the data is entered into the main assembly. The modules do not have their own signatures and they have different MVIDs from the main module. The following items are entered into the manifest of the assembly:
the PublicKey, i.e., the public key of the StrongName signature; and
the PublicKeyToken, i.e., the hash of the public part of the key of the StrongName signature.
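As a hedged illustration of how a PublicKeyToken relates to a PublicKey, the sketch below follows the commonly documented convention in which the token is the last eight bytes of the SHA-1 hash of the public key data, written in reverse order. The text above only states that the token is a hash of the public key, so the exact derivation is an assumption of this example, and the function name and the key bytes in the usage line are hypothetical placeholders.

```python
import hashlib


def public_key_token(public_key_blob: bytes) -> str:
    """Derive a 16-hex-digit token from public key bytes.

    Assumption: the token is the last 8 bytes of the SHA-1 digest
    of the public key data, in reverse byte order.
    """
    digest = hashlib.sha1(public_key_blob).digest()
    return digest[-8:][::-1].hex()


if __name__ == "__main__":
    # Placeholder bytes standing in for a real public key (not a real key).
    print(public_key_token(b"example public key bytes"))
```

The short token is convenient in paths and display names, where embedding the full public key would be impractical.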
Typically, assemblies are divided into: private and public/shared. Private assemblies should always be located in the same catalog as the client application in which they are used (i.e., the application catalog) or in one of its subcatalogs.
In contrast, a public assembly can be used at the same time in several applications on the same device. Public assemblies are not situated inside the same catalog as the applications in which they are supposed to be used. Instead, they can be installed in a global assembly cache (GAC). The GAC can be located in several places at the same time as shown in the following table:
Path to GAC | .NET Framework version | Assembly word length
%WINDIR%\assembly\GAC | 1.x | -
%WINDIR%\assembly\GAC_32 | 2.x-3.x | x32
%WINDIR%\assembly\GAC_64 | 2.x-3.x | x64
%WINDIR%\assembly\GAC_MSIL | 2.x-3.x | AnyProcessor
%WINDIR%\Microsoft.NET\assembly\GAC_32 | 4.x and higher | x32
%WINDIR%\Microsoft.NET\assembly\GAC_64 | 4.x and higher | x64
%WINDIR%\Microsoft.NET\assembly\GAC_MSIL | 4.x and higher | AnyProcessor
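The table can be read as a lookup from the framework version and the assembly word length to a GAC location. The following Python sketch encodes exactly the rows of the table; the function name `gac_directory` and its parameters are hypothetical, and the %WINDIR% variable is left unexpanded.

```python
# Folder name for each assembly word length listed in the table.
GAC_SUFFIX = {"x32": "GAC_32", "x64": "GAC_64", "AnyProcessor": "GAC_MSIL"}


def gac_directory(framework_major: int, word_length: str = "") -> str:
    """Return the GAC location for a .NET Framework major version and word length."""
    if framework_major < 1:
        raise ValueError("unknown framework version")
    if framework_major == 1:
        # Version 1.x uses a single GAC folder regardless of word length.
        return "\\".join(["%WINDIR%", "assembly", "GAC"])
    if word_length not in GAC_SUFFIX:
        raise ValueError("unknown word length: " + repr(word_length))
    suffix = GAC_SUFFIX[word_length]
    if framework_major <= 3:
        return "\\".join(["%WINDIR%", "assembly", suffix])
    return "\\".join(["%WINDIR%", "Microsoft.NET", "assembly", suffix])


if __name__ == "__main__":
    print(gac_directory(2, "x32"))
    print(gac_directory(4, "AnyProcessor"))
```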
An assembly being installed in a GAC should have a strong name. A strong name is the modern-day .NET equivalent of the global unique identifier (GUID) that was used in COM. Unlike the GUID values in COM, which are 128-bit numbers, the .NET strong names are based in part on two interrelated cryptographic keys, known as a public key and a secret (private) key.
A strong name consists of a set of interrelated data, including, at least:
the name of the assembly (being the name of the assembly without the file extension);
the version number of the assembly;
the public key value;
a value designating the region, which is not mandatory and can be used for localization of the application; and
the digital signature created with use of the hash obtained from the contents of the assembly and the value of the private key.
In order to create the strong name of an assembly, a user first obtains a public and private key pair. For example, the key data can be generated by the utility sn.exe, provided as part of the .NET Framework SDK. This utility generates a file containing data for two different, yet mathematically related keys: the public key and the private key. The location of this file is then indicated to the compiler, which writes the full value of the public key in the manifest of the assembly.
In a particular case, the compiler generates a corresponding hash on the basis of the entire content of the assembly (e.g., CIL code, metadata, etc.). The hash is a numerical value that is statistically unique to fixed input data. Consequently, in the event of a change in any data of a .NET assembly (even a single character in a string literal), the compiler will generate a different hash. The generated hash is then combined with the private key data contained inside the file to obtain the digital signature, which is inserted in the assembly inside the CLR header data.
FIG. 1C illustrates an exemplary process for generating a strong name. Typically, the private key data is not indicated in the manifest, but used only to identify the content of the assembly by the digital signature (along with the generated hash). After completing the process of creating and assigning a strong name, the assembly can be installed in the GAC.
The path to the assembly in the GAC can be, for example:
C:\Windows\assembly\GAC_32\KasperskyLab\2.0.0.0_b03f5f7f11d50a3a\KasperskyLab.dll, where:
C:\Windows\assembly is the path to the GAC;
\GAC_32 is GAC_<processor architecture>;
\KasperskyLab is the name of the assembly;
\2.0.0.0_b03f5f7f11d50a3a is <assembly version>_<public key token>; and
\KasperskyLab.dll is <assembly name>.<extension>.
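A minimal Python sketch of splitting such a path into its components is given below. It assumes the simplified four-level layout shown in the example above (GAC folder, assembly name, version and token, file name); the function name and the keys of the returned dictionary are hypothetical.

```python
from pathlib import PureWindowsPath


def parse_gac_path(path: str) -> dict:
    """Split a GAC path of the form shown above into its named components."""
    parts = PureWindowsPath(path).parts
    if len(parts) < 5:
        raise ValueError("path is too short to be a GAC assembly path")
    gac_dir, assembly_name, version_token, filename = parts[-4:]
    version, _, token = version_token.partition("_")
    return {
        "gac_root": str(PureWindowsPath(*parts[:-4])),
        "architecture": gac_dir.partition("_")[2],  # text after "GAC_"
        "assembly": assembly_name,
        "version": version,
        "public_key_token": token,
        "file": filename,
    }


if __name__ == "__main__":
    example = (r"C:\Windows\assembly\GAC_32\KasperskyLab"
               r"\2.0.0.0_b03f5f7f11d50a3a\KasperskyLab.dll")
    for key, value in parse_gac_path(example).items():
        print(key, "=", value)
```

`PureWindowsPath` is used so that the sketch handles Windows-style separators on any operating system.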
The execution of the code of an assembly, in one particular case, occurs as follows. First, the PE header is analyzed to determine which kind of process should be started (32-bit or 64-bit). Next, the corresponding version of the file MSCorEE.dll is loaded (C:\Windows\System32\MSCorEE.dll for 32-bit systems). An example of the source code of an assembly is presented as follows:
static void Main()
{
    System.Console.WriteLine("Kaspersky");
    System.Console.WriteLine("Lab");
}
For the execution of a method such as System.Console.WriteLine("Kaspersky"), the JIT compiler transforms the CIL code into machine commands (for convenience, the code above is presented in its original source form, and not compiled into CIL code).
FIG. 2 illustrates an exemplary method of executing assembly code. Initially, before executing the function Main( ), the CLR environment finds all the declared types (classes), for example, the type Console. Next, the CLR environment determines the methods of each type and combines them in records inside a unified “structure” (one record for each method defined in the type Console). The records contain the addresses at which the implementations of the methods can be found. On the first access to the method WriteLine, the JIT compiler is called. The JIT compiler knows which method is being called and which type defines it. Once called, the JIT compiler searches the metadata of the corresponding assembly for the implementation of the method (i.e., the code implementing the method WriteLine(string str)). The JIT compiler then compiles the CIL code into machine code and saves the compiled code in dynamic memory. Next, the JIT compiler returns to the internal “structure” of the type data (Console) and replaces the address of the called method with the address of the memory section containing the machine code. When the method Main( ) accesses the method WriteLine(string str) again, the code has already been compiled, so the access occurs without a JIT compiler call. After the method WriteLine(string str) is executed, control returns to the method Main( ).
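The compile-on-first-call behavior described above can be modeled in a few lines of Python. This is a conceptual sketch, not the CLR implementation: the class `MethodTable` stands in for the internal “structure” of a type, ordinary Python callables stand in for CIL bodies and machine code, and all names are hypothetical.

```python
class MethodTable:
    """Model of a per-type record table whose entries start as JIT stubs."""

    def __init__(self, type_name, cil_bodies):
        self.type_name = type_name
        self.cil_bodies = cil_bodies   # method name -> stand-in for CIL code
        self.jit_calls = 0             # how many times the "JIT compiler" ran
        # Every entry initially points to a stub that triggers compilation.
        self.entries = {name: self._make_stub(name) for name in cil_bodies}

    def _make_stub(self, name):
        def stub(*args):
            compiled = self._jit_compile(name)
            # Replace the stub address with the address of the compiled code.
            self.entries[name] = compiled
            return compiled(*args)
        return stub

    def _jit_compile(self, name):
        self.jit_calls += 1
        # Stand-in for translating CIL into machine code.
        return self.cil_bodies[name]

    def call(self, name, *args):
        return self.entries[name](*args)


if __name__ == "__main__":
    console = MethodTable("Console", {"WriteLine": print})
    console.call("WriteLine", "Kaspersky")   # first call goes through the stub
    console.call("WriteLine", "Lab")         # second call uses compiled code
    print("JIT compiler calls:", console.jit_calls)
```

After the first call the stub is no longer reachable through the table, which is why the second call incurs no compilation cost.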
It follows from the description that the function works “slowly” only at the time of the first call, when the JIT compiler is converting the CIL code into processor instructions. In all other instances, the code is already in memory and is provided as optimized for the given processor. However, if yet another program is started in another process, the JIT compiler will be called up again for this same method.
The native images mentioned above solve the problem of the slow working of the function at the time of the first call. When the assembly is loaded, an image will be loaded from which the machine code will be executed. Using this technology, it is possible to speed up the loading and running of an application because the JIT compiler does not need to compile anything or create the data structures anew each time. All of this is taken from the image. An image can be created for any given .NET assembly regardless of whether or not it is installed in the GAC. For the compilation, in one example, one uses the utility ngen.exe, located at the path %WINDIR%\Microsoft.NET\Framework\<Framework_version>\ngen.exe. When ngen.exe is launched, machine code is created for the CIL code of the assembly using the JIT compiler, and the result is saved to disk in the Native Image Cache (“NIC”). The compilation is done on the local device, taking into account its software and hardware configuration, and, therefore, the image should be used only in the environment for which it was compiled. The purpose of creating such images is to increase the effectiveness of the managed applications, that is, the finished assembly in machine code is loaded in place of the JIT compilation.
If the code of the assembly is used by many applications, the creation of an image substantially increases the speed of launching and executing the application, since the image can be used by many applications at the same time, while the code generated on the fly by the JIT compiler is used only by the copy of the application for which it is being compiled.
The path to the compiled image is formed as follows, for example: C:\Windows\assembly\NativeImages_v4.0.30319_32\Kaspersky\9c87f327866f53aec68d4fee40cde33d\Kaspersky.ni.dll, where
C:\Windows\assembly\NativeImages is the path to the image cache in the system;
v4.0.30319_32 is <.NET Framework version>_<processor architecture (32 or 64)>;
Kaspersky is the friendly name of the assembly;
9c87f327866f53aec68d4fee40cde33d is the hash of the application; and
Kaspersky.ni.dll is <friendly name of the assembly>.ni.<extension>.
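The structure of an image path can be made explicit with a short parser. The Python sketch below assumes the layout of the example above; the function name, the regular expression and the keys of the returned dictionary are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# Cache folder name: NativeImages_<.NET Framework version>_<32 or 64>.
CACHE_DIR = re.compile(r"^NativeImages_(?P<framework>v[\d.]+)_(?P<arch>32|64)$")


def parse_native_image_path(path: str) -> dict:
    """Split a native image path into framework, architecture, name and hash."""
    parts = PureWindowsPath(path).parts
    if len(parts) < 5:
        raise ValueError("path is too short to be a native image path")
    cache_dir, friendly_name, image_hash, filename = parts[-4:]
    match = CACHE_DIR.match(cache_dir)
    if match is None:
        raise ValueError("not a native image cache folder: " + cache_dir)
    stem, marker, extension = filename.rpartition(".ni.")
    if not marker:
        raise ValueError("file name lacks the .ni. marker: " + filename)
    return {
        "framework": match.group("framework"),
        "architecture": match.group("arch"),
        "friendly_name": friendly_name,
        "hash": image_hash,
        "image_stem": stem,
        "extension": extension,
    }


if __name__ == "__main__":
    example = (r"C:\Windows\assembly\NativeImages_v4.0.30319_32\Kaspersky"
               r"\9c87f327866f53aec68d4fee40cde33d\Kaspersky.ni.dll")
    for key, value in parse_native_image_path(example).items():
        print(key, "=", value)
```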
When ngen.exe creates a machine-code image of an assembly, related data can be saved in the registry branch HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots for 64-bit applications, and in the branch HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v2.0.50727\NGenService\Roots\ for 32-bit applications.
If the image was installed for an assembly located in the GAC, the branch can be called: . . . \Roots\Accessibility, Version=2.0.0.0, Culture=Neutral, PublicKeyToken=b03f5f7f11d50a3a, processorArchitecture=msil. But if the assembly was not installed in the GAC, then it can be called: . . . \Roots\C:/Program Files (x86)/ATI Technologies/ATI.ACE/Core-Static/A4.Foundation.DLL
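The two forms of branch name can be told apart mechanically. The Python sketch below treats a name containing comma-separated key=value fields as a GAC assembly and anything else as a file path; this rule, the function name and the dictionary keys are assumptions made for illustration, and a file path that itself contains commas is not handled.

```python
def parse_root_name(name: str) -> dict:
    """Classify a Roots branch name as a GAC display name or a file path."""
    head, *fields = [part.strip() for part in name.split(",")]
    if fields and all("=" in field for field in fields):
        properties = dict(field.split("=", 1) for field in fields)
        return {"kind": "gac", "name": head, **properties}
    return {"kind": "file", "path": name.strip()}


if __name__ == "__main__":
    print(parse_root_name(
        "Accessibility, Version=2.0.0.0, Culture=Neutral, "
        "PublicKeyToken=b03f5f7f11d50a3a, processorArchitecture=msil"
    ))
    print(parse_root_name(
        "C:/Program Files (x86)/ATI Technologies/ATI.ACE/Core-Static/A4.Foundation.DLL"
    ))
```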
Prior to Windows 8®, the developer always had to initiate the creation, updating and removal of the images of assemblies, making use of ngen.exe (or by configuring the installer). With Windows 8®, images could be created automatically for certain Windows® assemblies.
In one particular case, the Native Image Service is used to control the images. This allows the developers to postpone the installation, updating and removal of images in machine code, these procedures being carried out later on, when the device is idle. The Native Image Service is launched by the program installing the application or the update. This is done by means of the utility ngen.exe. The service works with a queue of requests saved in the Windows® registry, each of the requests having its own priority. The established priority determines when the task will be performed.
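A deferred, prioritized queue of this kind can be sketched with a binary heap. The model below is an assumption-laden illustration: it keeps the queue in memory rather than in the registry, treats a lower number as a higher priority, and takes the idle check as a caller-supplied function. The class and method names are hypothetical.

```python
import heapq
import itertools


class ImageTaskQueue:
    """In-memory model of a prioritized queue of image compilation requests."""

    def __init__(self):
        self._heap = []
        # The counter keeps requests of equal priority in arrival order.
        self._counter = itertools.count()

    def enqueue(self, assembly: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), assembly))

    def run_when_idle(self, device_is_idle) -> list:
        """Process requests in priority order for as long as the device is idle."""
        processed = []
        while self._heap and device_is_idle():
            _, _, assembly = heapq.heappop(self._heap)
            processed.append(assembly)  # stand-in for compiling the image
        return processed


if __name__ == "__main__":
    queue = ImageTaskQueue()
    queue.enqueue("Reports.dll", priority=3)
    queue.enqueue("Core.dll", priority=1)
    queue.enqueue("Plugins.dll", priority=3)
    print(queue.run_when_idle(lambda: True))
```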
In another particular instance, images in machine code are created not only on the initiative of the developers or administrators, but also automatically by the .NET Framework platform. The .NET Framework platform automatically creates an image by tracking the work of the JIT compiler. In general, creating an image during the operation of an application takes too much time, and, therefore, this operation is often carried out later on, for which purpose the CLR environment places the tasks in a queue and executes them while the device is idle.
The CLR environment uses the assembly binding module (i.e., the Assembly Binder) to find assemblies for loading at the moment of executing the corresponding assembly. The CLR may use several kinds of binding modules. An image binding module (i.e., a Native Binder) is used to search for images. The search for a required image is performed in two stages: first, the given module identifies the assembly and the image in the file system and, second, the given module checks the correspondence of the image to the assembly.
FIG. 3 illustrates a method of operating the binding module. As shown, in step 310, the assembly binding module searches for the assembly. The search is performed in the GAC, which presupposes that the sought assembly is signed and in which case the content of the assembly is not read, and in the application catalog, where the assembly is opened and the metadata is read.
Next, in step 320, the image binding module searches the NIC for an image corresponding to the identified assembly. Step 330 checks whether an image was identified. If so, in step 340, the image binding module reads the necessary data and metadata from the image to ensure that the image satisfies certain criteria, for which a careful analysis is performed, including, but not limited to, reviewing:
the strong name;
the time of creation (the image should be more recent than the assembly);
the MVID of the assembly and the image;
the .NET Framework version;
the processor architecture; and
the version of related images (for example, the image mscorlib.dll).
If the assembly does not have a strong name, then the MVID is used for the check. At step 350, the image is analyzed to determine whether it is current and control is transferred to the JIT compiler in step 370 if it is not current. Otherwise, the code from the image is loaded in step 360.
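The decision made in steps 330 through 370 can be summarized in a simplified Python sketch. It covers only some of the listed criteria (identity by strong name or MVID, time of creation, framework version and processor architecture) and omits the check of related images; the data classes, field names and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssemblyInfo:
    strong_name: Optional[str]   # None if the assembly is not strongly named
    mvid: str
    modified_at: float           # timestamp of the assembly file


@dataclass(frozen=True)
class ImageInfo:
    strong_name: Optional[str]
    mvid: str
    created_at: float
    framework_version: str
    architecture: str


def image_is_current(assembly, image, framework_version, architecture) -> bool:
    """Check a subset of the criteria an image must satisfy (steps 340-350)."""
    if assembly.strong_name is not None:
        same_identity = image.strong_name == assembly.strong_name
    else:
        # Without a strong name, the MVID is used for the check.
        same_identity = image.mvid == assembly.mvid
    return (
        same_identity
        and image.created_at > assembly.modified_at
        and image.framework_version == framework_version
        and image.architecture == architecture
    )


def select_code_source(assembly, image, framework_version, architecture) -> str:
    """Return where the code comes from: the image (360) or the JIT compiler (370)."""
    if image is not None and image_is_current(
        assembly, image, framework_version, architecture
    ):
        return "native image"
    return "JIT compiler"


if __name__ == "__main__":
    assembly = AssemblyInfo(strong_name=None, mvid="mvid-1", modified_at=100.0)
    image = ImageInfo(None, "mvid-1", 200.0, "v4.0.30319", "32")
    print(select_code_source(assembly, image, "v4.0.30319", "32"))
    print(select_code_source(assembly, image, "v4.0.30319", "64"))
```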
It follows from the foregoing description that the number of native images substantially exceeds the number of assemblies and the native images generated by the same parent assembly may differ from one device to another and from one image version to another, all of which greatly complicates the task of categorizing the images. Some conventional file categorization methods use cloud services and the like, but no solutions have been created that are able to correctly and efficiently categorize an image.