The present invention, in some embodiments thereof, relates to data analysis and, more specifically, but not exclusively, to methods and systems of identifying and selecting an efficient function for classifying data records and/or for estimating and/or ranking data according to the classification of the data records.
As organizations, sensors, diagnostic tools and monitoring systems generate and retain large amounts of data, it becomes increasingly important to classify this data effectively. Data may be classified using information about the data.
However, information about the data may not be useful to all users because such information lacks context. Therefore, in recent years, various data mining tools and algorithms have been used.
Gathered data usually includes a set of records, where each record includes a set of features, each representing an individual measurable heuristic property of a phenomenon or an event being monitored, for instance over time and/or in iterations. Identifying which of the features of the records are efficient for classification and/or evolution is a key step in any data mining process.
A typical feature engineering process for identifying efficient features begins with a training-set preparation in which data is structured and normalized. Then, manual analysis of the structured and normalized data is performed to identify patterns and regularities. Next, a hypothesis is generated by formalizing the observed patterns and implementing tools for their detection. The hypothesis is evaluated by measuring its statistical significance and correlation strength. The features are then integrated into a model, the performance of which is evaluated on a new dataset. Errors are analyzed to investigate the causes and regularities of repeating errors. According to this analysis, an iterative hypothesis generation process may be applied.
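By way of a non-limiting illustration, the hypothesis evaluation step described above may be sketched as follows. The sketch assumes that a hypothesis is expressed as a function mapping a record to a numeric value, that correlation strength is measured by the Pearson coefficient, and that statistical significance is estimated by a permutation test. These choices, the names `pearson`, `permutation_p_value` and `evaluate_feature`, the `reading` field, and the 0.05 threshold are assumptions made for illustration only and are not taken from the description above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels correlate at least as strongly
    (in absolute value) as the observed labels do."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def evaluate_feature(records, feature, labels, alpha=0.05):
    """Score a candidate feature (hypothesis) against the class labels."""
    values = [feature(record) for record in records]
    p_value = permutation_p_value(values, labels)
    return {
        "correlation": pearson(values, labels),
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Hypothetical toy records: one numeric field per record, binary labels.
records = [{"reading": float(i)} for i in range(1, 13)]
labels = [0] * 6 + [1] * 6

result = evaluate_feature(records, lambda r: r["reading"], labels)
print(result)
```

In this sketch a feature that passes the significance threshold would be a candidate for integration into the model, while a feature that fails it would be returned to the hypothesis generation step. A permutation test is used here only because it requires no distributional assumptions and no external library; any other significance test may be substituted.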