The present invention relates to searching audio signals for query strings. In particular, the present invention relates to performing such searches without prior knowledge of the vocabulary of the query string.
With increases in the storage capacity of computing devices, it has become easier to store large amounts of recorded speech in a digital format. To help users find a particular segment of recorded speech, systems have been developed that allow a user to search the recorded speech for particular keywords. These systems typically perform speech recognition on the recorded speech to identify words represented by the speech. Text strings representing the search query are then compared to the recognized words to identify the portion of the audio signal that contains the query terms.
One challenge to these audio search systems is that speech recognition is imperfect. Because of this, if a system uses a single speech recognition output, it will have poor recall when the recognizer makes an error. For example, if the recognizer identifies the word “ball” but the speech signal actually contained the word “doll”, the audio search system will not return a match for the query term “doll” even though it is present in the speech signal.
To avoid this problem, many systems have utilized a lattice of possible speech recognition results instead of a single speech recognition result. Although this lattice approach improves recall, it greatly increases the amount of time needed to search for a query term. In addition, existing text-level indexing methods cannot be trivially applied.
To speed up the search, it has been proposed that indexes should be generated from the lattice before the search query is received. Such indexes identify the location of particular sets of keywords in the audio signal. When the query is received, the index is consulted to find the location of the keywords of the query.
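The idea of building an index ahead of time can be illustrated with a minimal sketch. The arc format, the sample data, and the names `build_index` and `lookup` are assumptions made for illustration only; they do not come from any particular prior art system.

```python
from collections import defaultdict

# Hypothetical lattice arcs: (segment_id, word, start_sec, posterior).
# A real lattice also carries graph structure, which is omitted here.
LATTICE_ARCS = [
    ("seg1", "ball", 1.2, 0.6),
    ("seg1", "doll", 1.2, 0.3),
    ("seg2", "house", 0.4, 0.9),
]


def build_index(arcs):
    """Map each hypothesized word to the places where it may occur."""
    index = defaultdict(list)
    for segment_id, word, start, posterior in arcs:
        index[word].append((segment_id, start, posterior))
    return dict(index)


def lookup(index, query):
    """Return the indexed locations of each keyword in the query."""
    return {word: index.get(word, []) for word in query.lower().split()}


# The index is built once, before any query arrives.
INDEX = build_index(LATTICE_ARCS)
hits = lookup(INDEX, "doll")
```

Because the lattice keeps the competing hypothesis "doll" alongside "ball", the lookup can still locate it, while a keyword absent from the index returns no locations.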
Such indexes must be very large in order to cover all possible query terms that may appear. In addition, it has been found that such indexes typically lack query terms that are the most useful in differentiating one portion of the audio signal from another. In particular, terms that are infrequently used, and thus are less likely to be included in the index, are more likely to differentiate two audio segments.
To overcome this problem, the prior art has suggested using indexes of sequences of sub-word units, such as phonemes, instead of full keywords. For example, the sequences can be formed of 4-grams of phonemes. Because these sequences are typically shorter than keywords, there are fewer possible sequences that need to be included in the index.
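As a rough illustration of forming such sequences, the sketch below slides a window of four phonemes over a phoneme string. The function name `phoneme_ngrams` and the sample pronunciation are hypothetical and are used only to show the windowing.

```python
def phoneme_ngrams(phonemes, n=4):
    """Return every run of n consecutive phonemes as a tuple."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


# Assumed pronunciation, for illustration only.
query_phonemes = ["k", "ae", "t", "ax", "l", "aa", "g"]
grams = phoneme_ngrams(query_phonemes)
```

A phoneme string of length 7 yields four 4-grams, and a string shorter than four phonemes yields none, so very short queries would need separate handling in this sketch.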
In one prior art system (discussed in C. Allauzen et al., General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval, Proc. HLT '04), each sequence of sub-word tokens is placed in the index with an expected term frequency of the sequence. This expected term frequency, also known as the expected count, is an indication of the number of times that the sequence of sub-word tokens appears in a lattice associated with a segment of the audio signal. When a query is received, it is divided into sub-word tokens and sequences of sub-word tokens are identified in the query. The expected term frequency of the query is then determined by using the expected term frequency of the sequence of sub-word tokens that has the lowest expected term frequency in the index. This determination is made for each of a plurality of segments of the speech signal, and the segment with the highest expected term frequency is identified as containing the query term.
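The minimum-based approximation described above can be sketched as follows. This is a simplified illustration, not the cited system's implementation: the index layout, the expected counts, the use of 2-token sequences for brevity, and the names `query_score` and `best_segment` are all assumptions.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive sub-word tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def query_score(segment_counts, query_tokens, n):
    """Approximate the query's expected count by its lowest-count sequence."""
    grams = ngrams(query_tokens, n)
    if not grams:
        return 0.0
    # Sequences missing from the index are treated as expected count 0.
    return min(segment_counts.get(g, 0.0) for g in grams)


def best_segment(index, query_tokens, n):
    """Score every segment and return the highest-scoring one with all scores."""
    scores = {seg: query_score(counts, query_tokens, n)
              for seg, counts in index.items()}
    return max(scores, key=scores.get), scores


# Hypothetical per-segment expected counts of 2-token sequences.
INDEX = {
    "seg_a": {("d", "aa"): 0.9, ("aa", "l"): 0.1},
    "seg_b": {("d", "aa"): 0.4, ("aa", "l"): 0.5},
}
winner, scores = best_segment(INDEX, ["d", "aa", "l"], n=2)
```

With these assumed counts, `seg_a` scores 0.1 and `seg_b` scores 0.4, so `seg_b` is selected even though `seg_a` has the single strongest sequence match.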
One problem with this prior art technique is that approximating the expected term frequency of the query by the minimum expected term frequency of all of its sub-word token sequences causes the worst-matching sub-word token sequence to dominate the estimate. In addition, the sequential relationship between the individual sub-word token sequences is not exploited under the prior art.