With the widespread adoption of multimedia content, such as voice and video, there is a demand for highly precise multimedia retrieval technology. One such technology under study is voice retrieval, which identifies the portion of a sound signal in which the voice corresponding to a retrieval term (query) is uttered.
Unlike character string retrieval based on image recognition, voice retrieval does not yet have an established scheme with sufficient performance. Hence, various technologies have been studied in order to realize voice retrieval with sufficient performance.
For example, Non-patent Literature 1 (Y. Zhang and J. Glass, “An inner-product lower-bound estimate for dynamic time warping”, in Proc. ICASSP, 2011, pp. 5660-5663) discloses a method of comparing sound signals with each other at high speed. This method enables fast identification of the portion of a sound signal subjected to retrieval that corresponds to a query input by voice.
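For context, the dynamic time warping (DTW) distance named in the title of Non-patent Literature 1 can be sketched as follows. This is a plain textbook DTW over one-dimensional toy sequences, not the inner-product lower-bound estimate of the cited paper; the function name `dtw_distance`, the absolute-difference cost, and the sample data are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two 1-D sequences (hypothetical helper).

    Uses the absolute difference as the local cost, which is an
    assumption; real systems compare acoustic feature vectors.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning
    # the first i elements of a with the first j elements of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy data (assumed): a short query and two longer candidate segments.
query = [0.0, 1.0, 2.0, 1.0, 0.0]
candidate_a = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
candidate_b = [2.0, 2.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 2.0, 2.0]

print(dtw_distance(query, candidate_a))
print(dtw_distance(query, candidate_b))
```

A smaller distance indicates a closer match between the query and the candidate segment. The full table computed here costs time proportional to the product of the two sequence lengths, which is why faster estimates of this kind of comparison are of interest.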
However, according to the technology disclosed in Non-patent Literature 1, the retrieval precision decreases when the utterance rate of the voice subjected to retrieval differs from the utterance rate of the person who has input the query.
The present disclosure has been made in order to address the aforementioned technical problem, and it is an objective of the present disclosure to provide a voice retrieval apparatus, a voice retrieval method, and a non-transitory recording medium which are capable of retrieving a retrieval term with high precision from a sound signal whose utterance rate differs from that of the query.