Automatic speech recognition (ASR) has evolved to the point where searching over audio data has become commonplace. To date, however, this capability has been limited to a handful of commonly used languages (e.g., English, German, Hindi). For a less widely used language, there are very often not enough resources to train an ASR system.
More particularly, audio search has proven to be a very useful interface, especially in situations where one wants to find relevant audio segments in a huge volume of content. Many efforts have been made in the past to solve this problem. One approach employs large vocabulary continuous speech recognition (LVCSR), which generates text transcripts of the audio data, and then applies standard textual information retrieval approaches to generate search results. In the context of speech, it has been found useful to use word lattices instead of 1-best text transcripts. Such an approach is discussed in “Lattice-based search for spoken utterance retrieval,” HLT-NAACL 2004, and “Rapid and accurate spoken term detection,” INTERSPEECH 2007. The limitations of such an approach become evident on two fronts: a fixed vocabulary restricts the searchable terms even within the same language, and the approach cannot be extended to languages other than the target language.
Accordingly, a phonetic representation of the audio data has been found to be desirable in situations where it is anticipated that the search queries may involve words which were never seen during the training process. Such approaches have been explored as well, e.g., as discussed in “Query by example spoken term detection for OOV terms,” ASRU 2009. However, since most of the needed components, such as a dictionary, acoustic models and language models, are language-dependent, it is not clear how the approach can be applied in situations where resources for a language are not available.
Other approaches have been investigated which do not depend on a speech recognition process at all. These approaches rely on an intermediate representation, such as a posteriorgram of the speech signal, and some form of dynamic time warping to capture relevant patterns. While these approaches are suitable for language-independent search, their memory and computational requirements are prohibitive.
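By way of illustration only, the following minimal sketch shows one way dynamic time warping could be applied to two posteriorgrams, each taken here to be a sequence of per-frame posterior probability vectors. The function names `frame_distance` and `dtw_cost`, the negative-log inner-product distance, and the unconstrained step pattern are assumptions chosen for the sketch and are not drawn from the approaches referenced above.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Assumed local distance: negative log of the inner product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))  # eps avoids log(0) for non-overlapping frames


def dtw_cost(query, utterance):
    """Minimum cumulative cost of aligning two frame sequences end to end."""
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i query frames
    # with the first j utterance frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], utterance[j - 1])
            # allowed steps: advance the query, the utterance, or both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

In this sketch the table holds (n + 1) × (m + 1) entries and every entry requires a frame-level distance computation, so both memory and computation grow with the product of the query length and the utterance length. Repeating such an alignment against every utterance in a large collection is one way to see why the requirements noted above become burdensome.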