Data mining is a step in knowledge discovery in databases. Its purpose is to find hidden relationships in a large amount of data and to extract valuable information. Data mining generally combines methods and technologies from database technology, statistics, online analytical processing, and machine learning in order to process data from different perspectives.
A specific procedure of data mining includes the steps of business understanding, data understanding, data preparation, model establishment, model evaluation, and model deployment.
In the data preparation process, the obtained original data needs to be preprocessed. The original data is flat, wide-table data saved in a database or a data warehouse. Referring to Table 1, the original data includes a missing value (for example, the age of Li XX) and outliers (for example, the age and the call duration of Zhang XX). It further includes continuous values (the age column, the package fee column, and the call duration column) and discrete values (gender, region, and whether off-net). Each column in the original data is referred to as one characteristic. In an actual application, different characteristics may be selected as target characteristics according to different training needs.
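TABLE 1. Original data table

| User ID | Name | Age | Gender | Region | Package fee | Call duration (minute) | Whether off-net |
|---|---|---|---|---|---|---|---|
| 1651654 | Wang XX | 28 | Female | Guangzhou | 128 | 150 | No |
| 1651655 | Li XX | — | Male | Shenzhen | 328 | 450 | No |
| 1651656 | Zhang XX | 106 | Male | Beijing | 188 | −10 | Yes |
| . . . | | | | | | | |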
Preprocessing a characteristic of the original data may involve methods such as missing value filling, outlier processing, continuous value standardization, continuous value discretization, and discrete value combination.
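The following is a minimal sketch of four of the preprocessing methods named above, applied to the age and call duration values of Table 1. The function names, the choice of mean filling, the clipping bounds (0 to 100 years, 0 to 1440 minutes), and the bin count are illustrative assumptions and are not taken from the original text.

```python
from statistics import mean, pstdev

# Age and call duration values from Table 1; None marks the missing age.
raw_ages = [28, None, 106]
raw_call_minutes = [150, 450, -10]

def fill_missing(values):
    """Missing value filling: replace None with the mean of observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, low, high):
    """Outlier processing: clamp each value into [low, high] (bounds are assumptions)."""
    return [min(max(v, low), high) for v in values]

def standardize(values):
    """Continuous value standardization: z-score using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, n_bins):
    """Continuous value discretization: equi-width binning into n_bins bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so cap it at the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = clip_outliers(fill_missing(raw_ages), 0, 100)
call_minutes = clip_outliers(raw_call_minutes, 0, 1440)
age_z = standardize(ages)
age_bins = equal_width_bins(ages, 3)
```

With these assumed settings, the missing age is filled from the two observed ages, the age of 106 is clamped to 100, and the negative call duration is clamped to 0. A different filling strategy or different bounds would produce a different training data set, which is why the choice of method and parameter matters.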
Preprocessing of the original data in the data preparation process is a very important step in both the data mining procedure and the data modeling procedure. By means of preprocessing, the original data is transformed into a training data set suitable for a data modeling algorithm. More importantly, the result of preprocessing directly affects the effects of data mining and data modeling. However, in conventional data mining, data preparation is usually performed by an expert in the field of data mining. Data preparation therefore places a high requirement on model-establishment personnel and also requires manual participation in the preprocessing process. As a result, efficiency is relatively low, a long time is consumed, and a data preprocessing procedure cannot be reused.
Currently, preprocessing is usually performed on the original data by means of grid searching. When data preprocessing is performed by means of grid searching, all preprocessing methods and the parameter configuration of each method need to be set in advance. For example, continuous value discretization includes methods such as equi-width binning and equi-depth (also called equi-frequency) binning, and a parameter of the equi-width binning method may be 10, 50, 100, or the like. The search space is divided into a series of grids according to the different preprocessing methods and the different parameters. Each grid corresponds to one combination of preprocessing methods and parameter values, which is referred to as one preprocessing solution. The grids are calculated in sequence. The data output by each calculation is used as training data to perform model training. After the training, the effect of the model is assessed in order to generate an assessment indicator corresponding to each grid. The grids are then screened, and the result corresponding to the grid that has the optimal assessment indicator is used as the final result.
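The grid searching procedure described above can be sketched as follows. This is an illustration under stated assumptions: the search space, the step names, and in particular the `toy_assess` function are hypothetical placeholders. In a real procedure the assessment would preprocess the data according to the solution, train a model, and compute an assessment indicator, which is the expensive part.

```python
from itertools import product

# Hypothetical search space: each preprocessing step lists its candidate
# (method, parameter) options. Names and values are assumptions.
search_space = {
    "missing_value": [("mean", None), ("median", None)],
    "discretization": [
        ("equi_width", 10),
        ("equi_width", 50),
        ("equi_width", 100),
        ("equi_frequency", 10),
    ],
}

def toy_assess(solution):
    """Placeholder for 'preprocess, train a model, assess the model'.

    Returns an arbitrary deterministic score so the sketch is runnable;
    it does not reflect any real model quality.
    """
    score = 0.1 if solution["missing_value"][0] == "median" else 0.0
    _, param = solution["discretization"]
    return score + 1.0 / (1 + abs(param - 50))

def grid_search(space, assess):
    """Evaluate every grid (one preprocessing solution each) and keep the best."""
    steps = list(space)
    best_solution, best_score, evaluated = None, float("-inf"), 0
    # Each element of the Cartesian product is one grid.
    for combo in product(*(space[step] for step in steps)):
        solution = dict(zip(steps, combo))
        score = assess(solution)  # one full modeling run per grid
        evaluated += 1
        if score > best_score:
            best_solution, best_score = solution, score
    return best_solution, best_score, evaluated

best_solution, best_score, evaluated = grid_search(search_space, toy_assess)
```

Note that `assess` is called once per grid, so even this small space of two steps requires eight full modeling runs.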
When grid searching is used, all feasible preprocessing solutions need to be exhaustively listed, and modeling is performed on the preprocessing result of each solution in order to obtain the final data preprocessing solution. There are many methods for preprocessing the original data, and each method may correspond to different parameter values. Therefore, a relatively large quantity of preprocessing solutions is generated by combination: the quantity of solutions grows exponentially with the quantity of preprocessing steps, and the calculation amount is large. In addition, a complete data modeling procedure needs to be performed each time a preprocessing solution is assessed. The calculation time of one data modeling procedure is long, and the calculation amount of repeated modeling is large. Consequently, the operating load of a computer is increased, computing resources are wasted, and the work efficiency of the computer is reduced.
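To make the growth concrete, the quantity of preprocessing solutions is the product of the quantity of options available at each step. The sketch below uses made-up option counts and a made-up modeling time purely for illustration; neither figure comes from the original text.

```python
from math import prod

def count_solutions(options_per_step):
    """Quantity of grids: the product of the option counts of all preprocessing steps."""
    return prod(options_per_step)

def total_modeling_seconds(options_per_step, seconds_per_model):
    """Total cost when one complete modeling run is performed per grid."""
    return count_solutions(options_per_step) * seconds_per_model

# Assumed example: every step offers 4 (method, parameter) options.
five_steps = count_solutions([4] * 5)   # 4 ** 5
six_steps = count_solutions([4] * 6)    # 4 ** 6

# Assumed modeling time of 60 seconds per run.
cost_five_steps = total_modeling_seconds([4] * 5, 60)
```

Under these assumptions, adding a single step with four options multiplies the quantity of grids, and therefore the quantity of modeling runs, by four.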