1. Field of Invention
The present invention relates to a voice recognition learning data creation method and an apparatus which creates learning data in order to learn the voice model used for unspecified speaker voice recognition.
2. Description of Related Art
One voice recognition technology used for unspecified speakers employs the Dynamic Recurrent Neural Network (DRNN) voice recognition model. Applicants have already filed applications concerning voice recognition technology based on DRNN, as Japanese Laid-Open Patents Hei 6-4079 and Hei 6-119476.
In the DRNN voice model, the characteristic vector series of a word is input as time series data. So that an appropriate output is obtained for that word, the connection weights between the units and the bias of each unit are determined in advance in accordance with a learning procedure. As a result, for voice data spoken by an unspecified speaker, an output is obtained which is close to the taught output for the word.
For example, suppose the time series data of the characteristic vector series of the words "ohayo" (good morning) spoken by some unspecified speaker is input. In order to obtain an output close to the taught output which would ideally be output for the words "ohayo" (good morning), the data for each of the respective dimensions of the characteristic vector at each time is applied to the corresponding input unit and converted using the weights and biases established by the learning procedure. Time series processing is performed in this way on the time series data of the characteristic vector series of each single input word. As a result, an output close to the taught output for the word is obtained for voice data spoken by an unspecified speaker.
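The time series processing described above can be illustrated with a minimal sketch. It is a generic fully recurrent layer with sigmoid units, not the DRNN of the cited applications; the function name `recurrent_forward`, the weight layout, and all numeric values are assumptions chosen only for illustration.

```python
import math


def recurrent_forward(frames, w_in, w_rec, bias):
    """Run a small recurrent layer over a characteristic vector series.

    frames     : list of characteristic vectors, one per time step
    w_in[j][i] : weight from input dimension i to unit j (assumed layout)
    w_rec[j][k]: weight from unit k at the previous step to unit j
    bias[j]    : bias of unit j
    Returns the activation of unit 0 at every time step.
    """
    n_units = len(bias)
    state = [0.0] * n_units  # unit activations before the first frame
    outputs = []
    for x in frames:
        new_state = []
        for j in range(n_units):
            total = bias[j]
            # Each dimension of the frame feeds its corresponding input.
            total += sum(w_in[j][i] * x[i] for i in range(len(x)))
            # Recurrent connections carry the previous time step forward.
            total += sum(w_rec[j][k] * state[k] for k in range(n_units))
            new_state.append(1.0 / (1.0 + math.exp(-total)))
        state = new_state
        outputs.append(state[0])
    return outputs


# Hypothetical two-dimensional frames and hand-picked weights.
frames = [[0.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
w_in = [[2.0, 1.0], [0.5, -0.5]]
w_rec = [[1.0, 0.0], [0.0, 1.0]]
bias = [-1.0, 0.0]
outputs = recurrent_forward(frames, w_in, w_rec, bias)
```

In an actual model the weights and biases would come from the learning procedure rather than being set by hand, and the output series would be compared with the taught output for the word.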
A DRNN voice model is prepared for each of the words which should be recognized. The learning procedure which changes the weights so that an appropriate output is obtained for the respective words is described on pages 17-24 of the speech technical report of the Institute of Electronics, Information and Communication Engineers, "Technical Report of IEICE sp 92-125 (1993-01)."
The present invention is not limited to the DRNN voice model. When a voice model for unspecified speaker voice recognition is created, a database is generally used in which learning data has been created from the speech data of several hundred people speaking a set of words (for example, about 200 words). Ordinarily, the voice model is created by learning on the basis of the learning data included in this database.
However, there are cases in which a voice model must be created for words which are not in the database, such as words requested by a user. Prior to this invention, when a voice model was created for words not in the database, several hundred persons were asked to say the words, learning data for the words was created from their spoken data, and the voice model was then created on the basis of that learning data.
Thus, whenever a voice model was created for new words, it was necessary to gather several hundred people in order to create the learning data for learning the voice model. Consequently, a great amount of time was required to create the voice model, and the cost was also high.
In accordance with the present invention, a voice recognition model for words which are not included in the database can be created by generating learning data corresponding to several hundred people from the spoken data of one selected individual or of a few people. It is thus an object of the present invention to provide a voice model learning data creation method and a voice model learning data creation apparatus that make it possible to generate a voice model for new words in a short period of time and at low cost.
The voice model learning data creation method according to the present invention creates learning data in order to learn the voice model used in voice recognition. In the method, the spoken data of at least one individual, selected from among the spoken data of a number of speakers held in a preestablished database, is used as standard speaker data, and the spoken data of the other speakers in the database is used as learning speaker data. Using the preestablished word data, a conversion coefficient is created for converting standard speaker data into learning speaker data. In order to create the learning data for new words, data is obtained from standard speakers who speak the new words, and the data is converted to the learning speaker data space using the conversion coefficient. Thus, learning data is created for the new words.
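The overall flow of the method can be sketched as follows. The form of the conversion coefficient is an assumption made only for this illustration: one offset vector per learning speaker, taken as the difference between mean characteristic vectors over the preestablished words. The names `estimate_offset` and `create_learning_data`, the speaker labels, and the numeric values are likewise hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def estimate_offset(standard_frames, learner_frames):
    """Assumed conversion coefficient: learner mean minus standard mean,
    both computed over the same preestablished words."""
    s_mean = mean_vector(standard_frames)
    l_mean = mean_vector(learner_frames)
    return [l - s for s, l in zip(s_mean, l_mean)]


def create_learning_data(new_word_frames, offsets):
    """Convert one standard-speaker utterance of a new word into one
    artificial utterance per learning speaker."""
    return {
        speaker: [[x + o for x, o in zip(frame, offset)]
                  for frame in new_word_frames]
        for speaker, offset in offsets.items()
    }


# Preestablished word data (hypothetical two-dimensional vectors).
standard_known = [[1.0, 2.0], [3.0, 4.0]]
learners_known = {
    "speaker_a": [[2.0, 2.0], [4.0, 4.0]],
    "speaker_b": [[0.0, 3.0], [2.0, 5.0]],
}
offsets = {name: estimate_offset(standard_known, frames)
           for name, frames in learners_known.items()}

# A new word spoken only by the standard speaker.
new_word = [[10.0, 10.0], [11.0, 12.0]]
learning_data = create_learning_data(new_word, offsets)
```

With several hundred learning speakers in the database, the same single utterance would yield several hundred converted utterances, which is the point of the method.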
When a voice model is created for new words which do not exist in the database, the learning data for those words can be created on the basis of the spoken data of only a few standard speakers. It is therefore no longer necessary, as it was with the conventional art, to collect the spoken data of several hundred individuals in order to create learning data for the new words, and the voice model can be created in a short time and at low cost.
In addition, the data which exists in the standard speaker data space and the learning speaker data space is stored as characteristic vectors for the respective words, obtained through the frequency analysis of voice signals. The process for converting the data obtained from standard speakers who say new words is accomplished by using differential vectors between the characteristic vectors representing the respective words in the standard speaker data space and those in the learning speaker data space.
If a characteristic vector obtained by the frequency analysis of voice signals (for example, data expressed as LPC cepstrum coefficients having 10 dimensions) is used, high precision data is obtained. Furthermore, since the conversion from the standard speaker data space to the learning speaker data space makes use of differential vectors obtained in advance, the data conversion is accomplished simply and with high precision.
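A conversion based on differential vectors might look like the sketch below. The source does not specify how a differential vector is selected for a new characteristic vector; the nearest-neighbor lookup used here is an assumption, as is the pairing of standard and learning speaker vectors (which presumes they have already been time-aligned). All names and values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def differential_vectors(standard_vectors, learner_vectors):
    """Difference (learner minus standard) for each pair of characteristic
    vectors taken from the same preestablished words."""
    return [[l - s for s, l in zip(sv, lv)]
            for sv, lv in zip(standard_vectors, learner_vectors)]


def convert_frame(frame, standard_vectors, diffs):
    """Shift a new standard-speaker frame by the differential vector of
    its nearest known standard-speaker vector (assumed selection rule)."""
    nearest = min(range(len(standard_vectors)),
                  key=lambda i: squared_distance(frame, standard_vectors[i]))
    return [x + d for x, d in zip(frame, diffs[nearest])]


# Paired vectors from words already in the database (hypothetical).
standard_vectors = [[0.0, 0.0], [10.0, 10.0]]
learner_vectors = [[1.0, -1.0], [12.0, 10.0]]
diffs = differential_vectors(standard_vectors, learner_vectors)

# Frames of a new word spoken by the standard speaker.
new_word = [[1.0, 1.0], [9.0, 8.0]]
converted = [convert_frame(f, standard_vectors, diffs) for f in new_word]
```

Because the differential vectors are computed once from the existing database, converting a new word reduces to a lookup and a vector addition per frame.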
In addition, the data existing in the standard speaker data space and the learning speaker data space may be code data obtained by vector quantizing the characteristic vectors for each of the respective words, the characteristic vectors being obtained through the frequency analysis of the voice signal. In this case, the process for converting the data obtained from the speech of standard speakers for new words to the learning speaker data space using the conversion coefficient operates on code data. The standard speaker data for the new words is vector quantized to obtain code data in the standard speaker data space, and that code data is converted to the learning speaker data space by mapping it onto code data in the learning speaker data space.
In other words, the invention accomplishes the processing after vector quantizing the characteristic vectors obtained through the frequency analysis of voice signals. Although the data becomes slightly coarser, the processing is simplified and the processing time is shortened.
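The code-based conversion can be sketched as a quantization step followed by a table lookup. The codebook contents and the code-to-code table `code_map` are hypothetical; how such a table would be derived from the preestablished words is not specified here and is left as an assumption.

```python
def nearest_code(frame, codebook):
    """Index of the codebook entry closest to the frame (vector quantization)."""
    return min(range(len(codebook)),
               key=lambda c: sum((x - y) ** 2
                                 for x, y in zip(frame, codebook[c])))


def quantize(frames, codebook):
    """Replace every characteristic vector with its code."""
    return [nearest_code(f, codebook) for f in frames]


def map_codes(codes, code_map):
    """Map codes in the standard speaker data space to codes in the
    learning speaker data space."""
    return [code_map[c] for c in codes]


# Hypothetical standard-speaker codebook with three entries.
standard_codebook = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
# Hypothetical correspondence: standard code -> learning speaker code.
code_map = {0: 2, 1: 0, 2: 1}

# Frames of a new word spoken by the standard speaker.
new_word = [[0.5, -0.2], [4.0, 6.0], [9.0, 1.0], [5.5, 4.5]]
standard_codes = quantize(new_word, standard_codebook)
learner_codes = map_codes(standard_codes, code_map)
```

Each frame is reduced to a single integer, which is why this variant needs less memory and less computation than storing and converting full characteristic vectors, at the cost of quantization error.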
In addition, the voice model learning data creation apparatus of the present invention creates learning data in order to learn the voice model used in voice recognition. The apparatus is provided with a standard speaker data storage component which stores the spoken data of at least one individual selected from the spoken data obtained from many individuals which is held in a preestablished database. A learning speaker data storage component stores spoken data of other than standard speakers as a learning speaker database. An artificial learning word data creation component has a data conversion component which, using a preobtained conversion coefficient, accomplishes data conversion from the standard speaker data space to the learning speaker data space. An effective learning data component stores the data created by the artificial learning word data creation component. At the time of creating the learning data for the new words, data which is obtained from the speech of standard speakers including the new words is converted into the learning speaker data space using the conversion coefficient. Learning data is created using the artificial learning word data for the new words.
When a voice model is created for new words which are not included in the database, the learning data for those words can be created on the basis of the data spoken by a small number of standard speakers. It is therefore no longer necessary, as it was with the conventional art, to gather several hundred people and obtain their spoken data in order to create learning data for the new words, and a voice model can be created in a short period of time.
In addition, the standard speaker data which is stored in the standard speaker data storage component and the learning speaker data which is stored in the learning speaker data storage component are characteristic vectors for the respective words obtained through the frequency analysis of the voice signals. The process of converting the data obtained from the speech of standard speakers for the new words into the learning speaker data space using the conversion coefficient is accomplished using the differential vectors between the characteristic vectors representing the respective words in the standard speaker data space and the characteristic vectors representing the respective words in the learning speaker data space.
The data creation method uses the characteristic vectors obtained through the frequency analysis of voice signals and obtains data with high precision. By using the differential vector which has been preobtained, data conversion is accomplished from the standard speaker data space to the learning speaker data space. Thus, data conversion can be accomplished simply and with high precision.
In addition, the standard speaker data which is housed in the standard speaker data storage component and the learning speaker data which is housed in the learning speaker data storage component may be vector quantized code data of the characteristic vectors for each of the respective words obtained through the frequency analysis of the voice signals. In this case, the process for converting the data obtained from the speech of standard speakers for new words into the learning speaker data space using the conversion coefficient obtains vector quantized code data in the standard speaker data space for the new word data, and accomplishes the data conversion from the standard speaker data space to the learning speaker data space by mapping that code data onto code data in the learning speaker data space.
In other words, the invention vector quantizes the characteristic vectors obtained through the frequency analysis of voice signals. Although the data becomes slightly coarser, the simplified processing shortens the processing time, and the memory which houses the standard speaker data and the learning speaker data can have a small capacity.