The present invention relates to the field of computer, more particularly, to a method and apparatus for normalizing non-numeric features of files.
Most modern software uses configuration files to provide flexibility to users, enabling users to customize configuration items based on their specific usage scenarios. For example, the users may customize the value of the configuration item, MaxClients (the maximum number of clients) in the configuration file, httpd.conf, in order to adjust the maximum number of clients simultaneously connected to the Apache HTTP server.
Some routine IT operations, e.g., application or data backup and recovery, workload transfer, file disaster recovery, are becoming more and more complex and challenging, since they are highly dependent on the identification of configuration files in a distributed environment. Therefore, there is a great demand to identify these configuration files in the existing environment to accomplish these common IT operations.
Due to their variability, multi-presence and massive amount, identifying configuration files is challenging, labor-intensive and error-prone. Existing solutions for configuration file discovery highly depend on extensive expert knowledge or highly intensive human interaction.
A conceivable method for automatically identifying configuration files is to use a classifier. A classifier is an algorithm or a corresponding apparatus, which can, after learning by using training data, determine whether an object belongs to a specific category based on a combination of features values of the object. Therefore, it can be conceived that the classifier can determine whether a file belongs to a configuration file based on the metadata like path, access permissions, size etc. of the file. However, since the classifier can only receive numeric features as input and cannot receive non-numeric features, non-numeric features of configuration files like file path cannot be used by the classifier to identify configuration files.
Thus, a solution for normalizing non-numeric features of files like configuration files to numeric features in order to identify configuration files is needed in the art.