As voice recognition has expanded into an online service over cellular mobile networks, much research has also been conducted on the configuration of voice recognition servers. In particular, Google proposed a clustered acoustic model method when it introduced its mobile voice search service. Google's model classification criterion rests on the assumption that the weights of the multiple Gaussians in a model state vary with the actual acoustic environment. The proposed method therefore first converts collected voice data into a model-state string through recognition, computes the KL-divergence between the Gaussian weights of each model state and the Gaussian weights of the model states in each cluster centroid, and assigns the data to the closest centroid. A clustered acoustic model is generated by repeating this process in the manner of vector quantization (VQ). This method exploits the fact that Gaussian weights vary with the acoustic condition, but it requires a 2-pass system, performing recognition first to determine the model-state string before the clustered acoustic model can be used, and it does not represent speaker variation particularly well. Moreover, as the number of clustered models generated by partitioning the data increases, the amount of data available for modeling each cluster decreases.
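The clustering step described above can be sketched as a VQ-style loop that assigns Gaussian-weight vectors to centroids by KL-divergence and re-estimates the centroids. This is a minimal illustrative sketch, not Google's actual implementation; the function names and the renormalized-mean centroid update are assumptions for the example.

```python
import math
import random

def kl_divergence(p, q, eps=1e-10):
    """Discrete KL-divergence D(p || q) between two Gaussian-weight vectors.
    eps guards against log(0) for near-zero weights."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalize(v):
    """Rescale a weight vector so its entries sum to 1."""
    s = sum(v)
    return [x / s for x in v]

def vq_cluster(weight_vectors, k, iterations=20, seed=0):
    """VQ-style clustering: assign each weight vector to the centroid with
    the smallest KL-divergence, re-estimate centroids, and repeat."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(weight_vectors, k)]
    assignments = [0] * len(weight_vectors)
    for _ in range(iterations):
        # Assignment step: closest centroid under KL-divergence.
        for i, v in enumerate(weight_vectors):
            assignments[i] = min(range(k),
                                 key=lambda c: kl_divergence(v, centroids[c]))
        # Update step: centroid = renormalized mean of its members
        # (an assumed update rule for this sketch).
        for c in range(k):
            members = [v for v, a in zip(weight_vectors, assignments) if a == c]
            if members:
                dim = len(members[0])
                centroids[c] = normalize(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
    return centroids, assignments
```

With well-separated weight distributions, the loop converges to centroids that reflect distinct acoustic conditions; each centroid would then anchor one clustered acoustic model.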
Microsoft (MS) has also proposed a method of classifying acoustic models. MS integrated the eigenVoice vector and the eigenChannel vector, technologies originally used for speaker adaptation and speaker recognition, into a single equation and expressed the result as an i-vector, proposing that speaker and channel variability can be represented jointly with a single matrix. Acoustic characteristics are then hierarchically classified using the differences between the i-vectors, which are generated to be distinct for each utterance. However, it is questionable whether a speaker factor, an environmental factor, a channel characteristic, and the like can be adequately modeled together in one equation. It is also unclear whether the effect of the clustered acoustic model arises simply from capturing acoustic differences, or from robustness against speaker variation, environmental noise, and the like. This method likewise must compute an i-vector value in order to select a classification model, and therefore also requires a 2-pass recognition system.
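The hierarchical classification by i-vector differences can be illustrated with a simple average-linkage agglomerative clustering over per-utterance i-vectors. This is an assumed sketch for clarity, not MS's actual procedure; the cosine distance and the linkage rule are choices made for the example.

```python
import math

def cosine_distance(a, b):
    """Distance between two fixed-dimension per-utterance i-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def agglomerative_cluster(ivectors, num_clusters):
    """Average-linkage agglomerative clustering: start with one cluster per
    utterance i-vector and repeatedly merge the two closest clusters until
    the requested number of clusters remains."""
    clusters = [[i] for i in range(len(ivectors))]

    def linkage(c1, c2):
        # Average pairwise distance between the two clusters' members.
        return sum(cosine_distance(ivectors[i], ivectors[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters
```

Each resulting cluster would correspond to one node of the acoustic-model hierarchy; at recognition time an utterance's i-vector must first be extracted to select the matching model, which is why a 2-pass system is required.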