1. Field of the Invention
The present invention relates to a speech processing apparatus and method that perform sound source recognition, speaker recognition, or speech recognition from input speech.
2. Description of the Related Art
Speaker recognition technology, which recognizes a speaker from a characteristic amount of input speech, is known as one type of individual authentication technology. In Furui et al., “Speech Information Processing”, Morikita Publishing Co., Ltd., 1998, speaker recognition technology is classified into three types, i.e., a text-dependent type, a text-independent type, and a text-specified type.
In a text-dependent type speaker recognition system, a speaker is recognized by comparing a characteristic amount of speech uttered for a specific text by the speaker to be recognized (a user) with characteristic amounts of speech uttered for the same text by many speakers, which the system prepares in advance.
In a text-independent type speaker recognition system, the user is free to speak any text. That is, the system recognizes a speaker by collating a characteristic amount obtained by normalizing speech from the user with characteristic amounts of previously recorded speech from a plurality of speakers. Accurate recognition is therefore known to be more difficult than in text-dependent type speaker recognition.
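By way of illustration only, the collation described above may be sketched as follows. The sketch assumes that each characteristic amount is a fixed-length numeric vector, that normalization is a scaling to unit length, and that similarity is measured by the cosine of the angle between vectors; none of these choices is specified in the text, and the names `identify_speaker`, `enrolled`, and `threshold` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (assumed form of normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def identify_speaker(input_features, enrolled, threshold=0.8):
    """Collate input features with previously recorded ones.

    Returns (speaker_id, score) for the best match, or (None, score)
    when the best score is below the (hypothetical) threshold.
    """
    best_id, best_score = None, -1.0
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(input_features, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Illustrative, made-up characteristic amounts for two enrolled speakers.
enrolled = {"alice": [1.0, 0.2, 0.1], "bob": [0.1, 0.9, 0.4]}
result = identify_speaker([0.9, 0.25, 0.05], enrolled)
```

Because the comparison in this sketch does not depend on what was spoken, a reproduction of recorded speech would yield the same score as live speech.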
In a text-specified type speaker recognition system, the system specifies the text that the user is requested to speak. The user speaks the specified text, and the system recognizes the speaker based on a comparison between a characteristic amount of this speech and previously recorded characteristic amounts.
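By way of illustration only, the text-specified flow described above may be sketched as follows. The sketch assumes that a separate speech recognizer has already produced `recognized_text` and that a speaker comparison has already produced `speaker_score`; both inputs, the function names, and the threshold value are hypothetical and are not taken from the text.

```python
import random


def choose_prompt(candidates, rng=None):
    """Pick the text the user is requested to speak (specified per attempt)."""
    if rng is None:
        rng = random.Random()
    return rng.choice(candidates)


def verify_text_specified(prompt, recognized_text, speaker_score, threshold=0.8):
    """Accept only if the specified text was spoken AND the speaker matches.

    recognized_text: output of a speech recognizer (assumed to exist).
    speaker_score:   similarity between the characteristic amount of the
                     input speech and the recorded one (assumed to exist).
    """
    text_ok = recognized_text.strip().lower() == prompt.strip().lower()
    speaker_ok = speaker_score >= threshold
    return text_ok and speaker_ok


prompts = ["seven three one nine", "blue river stone", "open the north gate"]
prompt = choose_prompt(prompts, rng=random.Random(0))
accepted = verify_text_specified(prompt, prompt.upper(), speaker_score=0.93)
rejected_text = verify_text_specified(prompt, "some other text", speaker_score=0.93)
rejected_speaker = verify_text_specified(prompt, prompt, speaker_score=0.41)
```

In this sketch a misread text is rejected even when the speaker score is high.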
In the text-dependent type and text-independent type systems, deception by “impersonation”, i.e., an attempt to pass as the genuine person, may be carried out when, e.g., a loudspeaker is used to reproduce recorded speech of another person. In the text-specified type speaker recognition system, on the other hand, the text is specified at the time of authentication, and robustness against “impersonation” is therefore considered to be higher than in the text-dependent type and text-independent type speaker recognition systems. However, in view of recent advances in digital signal processing technology, it is necessary to assume a situation where speech of the specified text is produced on site by applying speech synthesis technology to recorded speech of another person. Further, the text-dependent type and text-specified type speaker recognition systems have a problem that usability is poor since the user must not misread the text.
Furthermore, a technique of identifying a sound source by comparing spectral shapes of input speech or changes of the spectral shapes over time is also known. Although this technique can distinguish clearly different sound sources, e.g., a dog and a person, it has difficulty distinguishing actual speech from recorded speech.
Moreover, not only in speaker recognition but also in speech recognition, an environmental sound around the user (e.g., a sound output from a loudspeaker of a television or a radio) may be mixed into the input speech, possibly inducing erroneous recognition.