The present technology relates to a voice processing apparatus, a voice processing method, and a program, and more particularly to a voice processing apparatus, a voice processing method, and a program which are capable of easily performing a voice quality conversion desired by a user, for example.
In recent years, studies have been made on lifelogs, which continuously record an individual's life over a long period using a wearable camera and a microphone.
In addition to the voice of the user wearing the apparatus, the voice of another person is sometimes picked up by the microphone. In this case, the voice of the other person is recorded in the lifelog together with the voice of the user.
Assuming that, in practical use, the user makes the lifelog public, it is not appropriate from the viewpoint of privacy protection to make public the voice of the other person recorded in the lifelog as it is, without processing.
One method of protecting the other person's privacy is to erase the other person's voice from the lifelog.
However, when, for example, a conversation between the user and the other person has been recorded in the lifelog, erasing only the other person's voice makes the conversation unnatural (or the conversation no longer holds together), which sometimes undermines the significance of the lifelog.
Therefore, as a method of privacy protection, there is an increasing demand for a personality erasing method that processes the voice so as to erase only the other person's personality while retaining the context information of the conversation. One example of such a personality erasing method for a voice is voice quality conversion, which converts the voice quality of the voice.
For example, Japanese Patent Application Laid-Open No. 2008-058696 describes a technology for voice quality conversion that does not need to hold conversion coefficients for every pair of a reference speaker, whose voice quality is to be converted, and a target speaker, whose voice quality is the target of the conversion. In this technology, the voice of at least one of one or more reference speakers and target speakers is used to conduct learning for generating a voice quality conversion model; a predetermined adapting method is used to adapt the voice quality conversion model to the voice of at least one of an arbitrary reference speaker and an arbitrary target speaker; and the voice of the arbitrary or specified reference speaker is converted into a voice having the voice quality of the specified or arbitrary target speaker.