1. Field of the Invention
The present invention relates to an indexing apparatus, an indexing method, and a computer program product that allocate an index to a speech signal.
2. Description of the Related Art
Speaker indexing (hereinafter, “indexing”) has been used to assist viewing of and listening to content with multiple speakers, such as conferences, TV or radio programs, and panel discussions. Indexing is a technology that allocates indexes to relevant portions of a speech signal representative of an utterance of a speaker. An index includes speech information, such as who made the utterance, when it was made, and how long it lasted. Such indexing is helpful in various ways. For example, it facilitates searching for utterances of a particular speaker and detecting time periods during which that speaker was actively participating in a discussion.
When performing the indexing, a speech signal is subdivided into numerous smaller strings; strings having the same or similar characteristic features are grouped into a longer segment, and each segment is regarded as an utterance of one speaker. JP-A 2006-84875 (KOKAI), for example, discloses a technique for calculating the characteristic feature. Concretely, JP-A 2006-84875 (KOKAI) teaches creating an acoustic model representative of speech features from each of the segments that are created by subdividing a speech signal. Subsequently, the likelihood of each subdivided speech signal with respect to each acoustic model is acquired as a measure of similarity. Then, a vector including these likelihoods as components is used as an index that indicates a speech feature of the speech signal. Because utterances of the same speaker have a high likelihood with respect to a specific acoustic model, similar vectors are obtained from such utterances. In other words, similar vectors are taken to have originated from the same speaker.
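The likelihood-vector idea described above can be illustrated with a minimal sketch. It is not the implementation of JP-A 2006-84875 (KOKAI): the single one-dimensional Gaussian used as an "acoustic model", the synthetic feature values, and all function names (fit_gaussian, avg_log_likelihood, likelihood_vector, distance, synth) are assumptions chosen only for illustration.

```python
import math
import random


def fit_gaussian(frames):
    """Fit a 1-D Gaussian (mean, variance) to a list of feature values.

    Assumption: a single Gaussian stands in for an acoustic model.
    """
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return mean, max(var, 1e-6)  # floor the variance so it is never zero


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of the frames under one model."""
    mean, var = model
    total = sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )
    return total / len(frames)


def likelihood_vector(frames, models):
    """Build the feature vector: one likelihood component per model."""
    return [avg_log_likelihood(frames, m) for m in models]


def distance(u, v):
    """Euclidean distance between two likelihood vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def synth(center, n=200):
    """Synthetic feature values for a hypothetical speaker."""
    return [rng.gauss(center, 1.0) for _ in range(n)]


# One model per training segment; each segment holds a single speaker here.
models = [fit_gaussian(synth(0.0)), fit_gaussian(synth(5.0))]

vec_a1 = likelihood_vector(synth(0.0), models)  # speaker A, utterance 1
vec_a2 = likelihood_vector(synth(0.0), models)  # speaker A, utterance 2
vec_b1 = likelihood_vector(synth(5.0), models)  # speaker B, utterance 1

same_speaker = distance(vec_a1, vec_a2)
diff_speaker = distance(vec_a1, vec_b1)
print(same_speaker, diff_speaker)
```

In this toy setting the two vectors from speaker A lie much closer to each other than to the vector from speaker B, which is the property the grouping step relies on.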
However, the technology described in JP-A 2006-84875 (KOKAI) has a problem: when the speech signals used to create the acoustic models include utterances of multiple speakers, utterances of different speakers sometimes erroneously indicate a high likelihood with respect to a common acoustic model. In this case, the resulting feature (vector) is unsuitable for distinguishing utterances of different speakers, with the result that indexing accuracy is degraded.
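The effect of a model built from mixed-speaker speech can be seen in a small sketch under the same kind of simplifying assumptions: one-dimensional synthetic features, a single Gaussian per model, and hypothetical helper names. It compares how well a model trained on one speaker and a model trained on two speakers separate the same pair of utterances; it does not reproduce the cited technique.

```python
import math
import random


def fit_gaussian(frames):
    """Fit a 1-D Gaussian (mean, variance); an assumed stand-in for an acoustic model."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return mean, max(var, 1e-6)


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of the frames under one model."""
    mean, var = model
    total = sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )
    return total / len(frames)


rng = random.Random(1)  # fixed seed so the sketch is repeatable


def synth(center, n=200):
    """Synthetic feature values for a hypothetical speaker."""
    return [rng.gauss(center, 1.0) for _ in range(n)]


# Model from a segment containing only speaker A.
pure_model = fit_gaussian(synth(0.0))
# Model from a segment containing both speaker A and speaker B.
mixed_model = fit_gaussian(synth(0.0) + synth(5.0))

utt_a = synth(0.0)  # test utterance of speaker A
utt_b = synth(5.0)  # test utterance of speaker B

# Gap between the two speakers' likelihoods under each model.
gap_pure = abs(avg_log_likelihood(utt_a, pure_model)
               - avg_log_likelihood(utt_b, pure_model))
gap_mixed = abs(avg_log_likelihood(utt_a, mixed_model)
                - avg_log_likelihood(utt_b, mixed_model))
print(gap_pure, gap_mixed)
```

Under the mixed model the two speakers receive almost the same likelihood, so that vector component carries little information for telling them apart, whereas the single-speaker model separates them clearly.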