The present invention relates to a method for analyzing a speech signal to extract emphasized portions from speech, a speech processing scheme for implementing the method, an apparatus embodying the scheme and a program for implementing the speech processing scheme.
It has been proposed to determine those portions of speech content emphasized by the speaker as being important and automatically provide a summary of the speech content. For example, Japanese Patent Application Laid-Open Gazette No. 39890/98 describes a method in which: a speech signal is analyzed to obtain speech parameters in the form of an FFT spectrum or LPC cepstrum; DP matching is carried out between the speech parameter sequences of an arbitrary voiced portion and another voiced portion to determine the distance between the two sequences; and when the distance is shorter than a predetermined value, the two voiced portions are judged to be phonemically similar and are tagged with temporal position information to provide important portions of the speech. This method makes use of the phenomenon that words repeated in speech are important in many cases.
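The DP matching step described above can be illustrated with a minimal sketch. It is not the implementation of the cited gazette: the function names `dp_matching_distance` and `similar_portions`, the Euclidean frame distance, the length normalization and the threshold value are all assumptions chosen for illustration, and each voiced portion is assumed to be given as a start time plus a sequence of per-frame feature vectors (for example, cepstral coefficients).

```python
import math


def dp_matching_distance(seq_a, seq_b):
    """Length-normalized DP matching (dynamic time warping) distance.

    seq_a, seq_b: non-empty sequences of equal-dimension feature vectors.
    The local cost is the Euclidean distance between frames (an assumption).
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    # cost[i][j] = minimum accumulated cost of aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of seq_a stretched
                cost[i][j - 1],      # frame of seq_b stretched
                cost[i - 1][j - 1],  # both sequences advance
            )
    # Normalize so that long and short portions are comparable (assumption).
    return cost[n][m] / (n + m)


def similar_portions(portions, threshold):
    """Return start-time pairs of voiced portions closer than `threshold`.

    portions: list of (start_time, frames) tuples (hypothetical layout).
    """
    hits = []
    for i in range(len(portions)):
        for j in range(i + 1, len(portions)):
            d = dp_matching_distance(portions[i][1], portions[j][1])
            if d < threshold:
                hits.append((portions[i][0], portions[j][0]))
    return hits
```

In this sketch, two portions that contain the same frames spoken at different speeds yield a distance of zero, which is the property that lets repeated words be detected despite differences in duration.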
Japanese Patent Application Laid-Open Gazette No. 284793/00 discloses a method in which: speech signals in a conversation between at least two speakers, for instance, are analyzed to obtain FFT spectra or LPC cepstra as speech parameters; the speech parameters are used to recognize phoneme elements to obtain a phonetic symbol sequence for each voiced portion; DP matching is performed between the phonetic symbol sequences of two voiced portions to determine the distance between them; voiced portions separated by a short distance, that is, phonemically similar voiced portions, are judged to be important portions; and a thesaurus is used to estimate a plurality of topic contents.
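When DP matching is applied to phonetic symbol sequences rather than to feature vectors, one common formulation is the edit (Levenshtein) distance. The sketch below uses that formulation as an assumption; the cited gazette does not state which cost function it uses, and the function name `symbol_distance` and the unit costs are hypothetical.

```python
def symbol_distance(seq_a, seq_b):
    """Edit distance between two phonetic symbol sequences.

    Insertion, deletion and substitution each cost 1 (an assumption).
    """
    # prev[j] = distance between the symbols of seq_a processed so far
    # and the first j symbols of seq_b.
    prev = list(range(len(seq_b) + 1))
    for i, sym_a in enumerate(seq_a, 1):
        cur = [i]
        for j, sym_b in enumerate(seq_b, 1):
            cur.append(min(
                prev[j] + 1,                     # delete sym_a
                cur[j - 1] + 1,                  # insert sym_b
                prev[j - 1] + (sym_a != sym_b),  # match or substitute
            ))
        prev = cur
    return prev[-1]
```

Two voiced portions whose symbol sequences differ by only a few symbols would be treated as phonemically similar under a suitably chosen threshold.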
To determine or spot a sentence or word in speech, a method has been proposed that utilizes a phenomenon common in Japanese: the frequency of a pitch pattern, composed of a tone and an accent component of the sentence or word in speech, starts low, rises to its highest point near the end of the first half of the utterance, gradually falls in the second half, and drops sharply to zero at the end of the word. This method is disclosed in Itabashi et al., “A Method of Utterance Summarization Considering Prosodic Information,” Proc. I 239-240, Acoustical Society of Japan 200 Spring Meeting.
Japanese Patent Application Laid-Open Gazette No. 80782/91 proposes using a speech signal to determine or spot an important scene in video information accompanied by speech. In this case, the speech signal is analyzed to obtain speech parameters such as spectrum information of the speech signal and its sharply rising, briefly sustained signal level; the speech parameters are compared with preset models, for example, speech parameters of a speech signal obtained when an audience cheers; and speech signal portions whose speech parameters are similar or close to the preset parameters are extracted and joined together.
The method disclosed in Japanese Patent Application Laid-Open Gazette No. 39890/98 is not applicable to speech signals of unspecified speakers or to conversations between an unidentified number of speakers, since speech parameters such as the FFT spectrum and the LPC cepstrum are speaker-dependent. Further, the use of spectrum information makes it difficult to apply the method to natural spoken language or conversation; that is, this method is difficult to implement in an environment where a plurality of speakers speak at the same time.
The method proposed in Japanese Patent Application Laid-Open Gazette No. 284793/00 recognizes an important portion as a phonetic symbol sequence. Hence, as is the case with Japanese Patent Application Laid-Open Gazette No. 39890/98, this method is difficult to apply to natural spoken language and, consequently, difficult to implement in an environment of simultaneous utterance by a plurality of speakers. Further, while adapted to provide a summary of a topic through the use of phonetically similar portions of speech and a thesaurus, this method does not perform a quantitative evaluation and is based on the assumption that important words occur frequently and are long in duration. Hence, because linguistic information is not used, the method may spot words that are irrelevant to the topic concerned.
Moreover, since natural spoken language is often grammatically irregular and since utterance is speaker-specific, the aforementioned method proposed by Itabashi et al. has difficulty determining speech blocks, as units for speech understanding, from the fundamental frequency.
The method disclosed in Japanese Patent Application Laid-Open Gazette No. 80782/91 requires presetting models for obtaining speech parameters, and the specified voiced portions are so short that, when they are joined together, the speech parameters become discontinuous at the joints and the resulting speech is consequently difficult to listen to.