Today, there are a variety of systems that enable the detection of a word or phrase spoken in an audio signal. The first step in digital processing of any analog audio signal is to convert it to a sampled digital form. For signals of telephone bandwidth, typically 8000 16-bit waveform samples are taken per second, resulting in a “linear pulse code modulated” (linear PCM) representation. If the signal is to be transmitted over a telecommunications network, further analysis of the signal may be used to reduce the bit rate required while retaining as much speech intelligibility as possible. The signal is encoded into a lower bit rate form, transmitted and then decoded, with the encoding and decoding algorithms together described as a “codec”.
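The sampling figures above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `pcm_bit_rate`, `quantize_to_pcm16` and `sample_tone` are hypothetical, and a synthetic sine tone stands in for a real analog signal.

```python
import math

SAMPLE_RATE_HZ = 8000   # telephone-bandwidth sampling rate given in the text
BITS_PER_SAMPLE = 16    # linear PCM sample width given in the text


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate in bits per second of an uncompressed linear PCM stream."""
    return sample_rate_hz * bits_per_sample


def quantize_to_pcm16(x: float) -> int:
    """Map an amplitude in [-1.0, 1.0] to a signed 16-bit integer sample."""
    clipped = max(-1.0, min(1.0, x))  # out-of-range input is clipped
    return int(round(clipped * 32767))


def sample_tone(freq_hz: float, duration_s: float,
                sample_rate_hz: int = SAMPLE_RATE_HZ) -> list:
    """Sample a sine tone (a stand-in for an analog signal) as 16-bit PCM."""
    n_samples = int(duration_s * sample_rate_hz)
    return [
        quantize_to_pcm16(math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
        for i in range(n_samples)
    ]


bit_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)  # 128000 bits per second
half_second = sample_tone(440.0, 0.5)                     # 4000 samples
```

At 8000 samples per second and 16 bits per sample, the uncompressed stream occupies 128,000 bits per second, which is the rate a codec would aim to reduce before transmission.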
A phoneme is a sound unit in a language that is capable of conveying a change of meaning. For example, the English words sing and ring differ in the first phoneme. A single phoneme may be represented by different letters in a language. For example, in English, the k in key and the c in car are regarded as the same phoneme because both letters are pronounced with the same sound. Different languages have different sets of phonemes.
Audio search systems typically work in two phases. An initial “indexing” phase is applied as the signal is received or as soon as convenient thereafter; if the signal has been encoded for transmission, it is first decoded to a linear PCM representation and then processed as if it had never been encoded. The second phase occurs when a search of the audio is required: one or more search terms are supplied and the system uses the stored “index” data to locate occurrences of those search terms in the audio. The index data may be stored between indexing and search or may be streamed from an indexing process into a search process.
Some audio search systems take an audio signal and use Large Vocabulary Continuous Speech Recognition (LVCSR) as the indexing phase, resulting in a text representation of the audio content. The text representation is usually more than a simple text transcript—it may include time markers and alternative transcriptions for parts of the audio signal. Based on the text representation of the audio signal, at search time the system can detect a specific word or phrase spoken in the audio signal. One drawback to these types of systems is that a large amount of processing resources is necessary to process an audio signal in real-time. A second is that any errors made by the LVCSR system will limit the accuracy of all subsequent searches involving affected words.
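A minimal sketch of searching such a text representation follows. It assumes, purely for illustration, that the index is a flat sequence of time-marked slots, each holding alternative transcriptions; the names `IndexedSlot` and `find_phrase` are hypothetical and do not describe any particular LVCSR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedSlot:
    """One stretch of audio with its time markers and candidate words."""
    start_s: float
    end_s: float
    alternatives: tuple  # alternative transcriptions, best first


def find_phrase(index, phrase):
    """Return (start_s, end_s) for each place where the phrase matches.

    A word matches a slot if it appears among that slot's alternatives.
    """
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for i in range(len(index) - len(words) + 1):
        window = index[i:i + len(words)]
        if all(w in slot.alternatives for w, slot in zip(words, window)):
            hits.append((window[0].start_s, window[-1].end_s))
    return hits


# Hypothetical index produced by the indexing phase.
index = [
    IndexedSlot(0.0, 0.5, ("recognize",)),
    IndexedSlot(0.5, 1.0, ("speech", "beach")),
]

hits_alt = find_phrase(index, "recognize beach")  # found via an alternative
hits_missing = find_phrase(index, "hello")        # never recognized, so no hit
```

The second query shows the limitation noted above: a word that the recognizer did not place in the index cannot be found by any later search.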
Other systems take a different approach. The indexing phase computes “distances” representing the similarity of each short time-slice of the audio to one or more models. Each model corresponds to a phoneme or part of a phoneme. These distances are then stored in an index file. At search time, arbitrary phrases may be entered and compared to the stored distances, resulting in “search hits” for the specified phrase, where each hit comprises a phrase identity, location and match score. Although this approach requires much less processing than LVCSR indexing, it still requires significant processing resources during the indexing phase. Further, it can produce an index file that is sometimes larger than the audio signal itself, so that large amounts of disk space are used if a large quantity of audio data is analyzed and stored.
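The two phases of this distance-based approach can be sketched as follows. This is a toy illustration under loud assumptions: each model is reduced to a single mean vector, the distance is a squared Euclidean distance, and the search is a simple dynamic-programming alignment in which each phoneme spans one or more consecutive frames. The names `build_index` and `best_hit` and the feature values are hypothetical.

```python
def build_index(frames, models):
    """Indexing phase: store the distance from every frame to every model."""
    index = []
    for frame in frames:
        row = {}
        for name, mean in models.items():
            row[name] = sum((x - m) ** 2 for x, m in zip(frame, mean))
        index.append(row)
    return index


def best_hit(index, phonemes):
    """Search phase: find the frame span that best matches the phoneme sequence.

    Uses only the stored distances. The score is the average distance per
    frame along the best alignment (smaller is better).
    """
    if not phonemes:
        return None
    inf = float("inf")
    best = None
    for start in range(len(index)):
        acc = [inf] * len(phonemes)  # acc[j]: best total with phoneme j active
        for t in range(start, len(index)):
            d = [index[t][p] for p in phonemes]
            if t == start:
                new = [d[0]] + [inf] * (len(phonemes) - 1)
            else:
                new = [acc[0] + d[0]]
                for j in range(1, len(phonemes)):
                    # Stay in phoneme j or advance from phoneme j - 1.
                    new.append(min(acc[j], acc[j - 1]) + d[j])
            acc = new
            if acc[-1] == inf:
                continue  # the last phoneme has not been reached yet
            score = acc[-1] / (t - start + 1)
            if best is None or score < best["score"]:
                best = {"phrase": " ".join(phonemes), "start_frame": start,
                        "end_frame": t, "score": score}
    return best


# Hypothetical one-dimensional features and phoneme models.
models = {"k": (0.0,), "ae": (5.0,), "t": (10.0,)}
frames = [(2.0,), (0.1,), (5.2,), (4.9,), (10.0,), (7.0,)]

index = build_index(frames, models)
hit = best_hit(index, ["k", "ae", "t"])
```

Each returned hit carries the phrase identity, a location and a match score, as described above. The index holds one distance per frame per model, which suggests why such an index can grow larger than the audio it describes.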
Both of the above approaches involve statistical models previously trained on large amounts of speech. Typically these are hidden Markov models (HMMs) based on phonemic transcriptions of the training speech. Each model comprises one or more “states” and each state has an associated statistical distribution over a “feature” space corresponding to a representation of possible input audio. Many variants on this theme are known—in particular:
(i) A phoneme can comprise a sequence of distinct acoustic segments. For example, a pronunciation of the phoneme for the letter t in English has up to three distinct segments that are together perceived as the sound corresponding to the letter t. By analogy with this, the hidden Markov models typically used to represent and detect phonemes in audio search systems are constructed with multiple states for each phoneme.
(ii) The models may be built using different levels of detail other than the phoneme, including word level or any “sub-word” level such as syllable, demi-syllable, phoneme, sub-phoneme etc.
(iii) A given system may include models at more than one of these levels—one key benefit of using sub-word models is that such models may be combined in order to match and search for words which are not included in the training data.
(iv) The models may take account of context, so that for example different models may be used for the vowels in the English words “bad” and “bat”—this is typical of LVCSR systems and results in a much larger total number of states in the system.
(v) There may be sharing (or “tying”) of parameters among the models in many different ways—in particular, multiple HMM states may share a given probability distribution.
(vi) Although usually described in terms of speech, similar approaches and models may be used for non-speech sound patterns, such as music.
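Several of the variants listed above (multi-state phoneme models, combining sub-word models into words, and parameter tying) can be illustrated with a small structural sketch. Everything in it is an assumption made for illustration: three states per phoneme, a fixed self-loop probability, and the `TIED` table, which ties the first states of two stop consonants to one hypothetical shared distribution.

```python
STATES_PER_PHONEME = 3  # assumed here; real systems vary


def phoneme_states(phoneme, n_states=STATES_PER_PHONEME):
    """Name the states of one phoneme model, e.g. 't.1', 't.2', 't.3'."""
    return [f"{phoneme}.{i}" for i in range(1, n_states + 1)]


def word_model(pronunciation):
    """Concatenate phoneme models into a model for a whole word.

    The word itself need not have occurred in the training data.
    """
    states = []
    for phoneme in pronunciation:
        states.extend(phoneme_states(phoneme))
    return states


def left_to_right_transitions(states, self_loop=0.5):
    """Each state either repeats or advances to the next state."""
    transitions = {}
    for i, state in enumerate(states):
        nxt = states[i + 1] if i + 1 < len(states) else "<exit>"
        transitions[state] = {state: self_loop, nxt: 1.0 - self_loop}
    return transitions


# Hypothetical tying table: two states share one probability distribution.
TIED = {"t.1": "closure", "k.1": "closure"}


def distribution_id(state):
    """Return the identifier of the distribution used by a state."""
    return TIED.get(state, state)


cat_states = word_model(["k", "ae", "t"])  # 9 states in sequence
cat_transitions = left_to_right_transitions(cat_states)
n_distributions = len({distribution_id(s) for s in cat_states})  # 8 after tying
```

Because of the tying, the nine states of the word model draw on only eight distinct distributions, so fewer scores need to be computed per frame than there are states.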
Whatever model structure is chosen, there is a set of sound model elements, each represented by a distinct probability distribution. A key component of the indexing algorithms is the assessment of similarity: generating a numeric “score” which reflects how well each successive short time-slice (or “frame”) of incoming audio data matches each of the (possibly very many) sound model elements. That assessment is typically in the form of “distances” (where smaller distances represent better matches) or “likelihoods” (where smaller likelihoods represent worse matches). The computation of these scores requires significant processing resource, even in those LVCSR systems which employ sophisticated algorithms to restrict the computation for each time frame to some subset of the possible probability distributions.
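The relationship between likelihoods and distances can be made concrete with a short sketch. It assumes, for illustration only, that each sound model element is a Gaussian with a diagonal covariance and that the distance is taken as the negative log-likelihood; the text does not specify either choice, and the names and values below are hypothetical.

```python
import math


def log_likelihood(frame, mean, variance):
    """Log-likelihood of a frame under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, variance)
    )


def distance(frame, mean, variance):
    """Negative log-likelihood: smaller values mean a better match."""
    return -log_likelihood(frame, mean, variance)


def score_frame(frame, elements):
    """Score one frame against every sound model element."""
    return {
        name: distance(frame, mean, variance)
        for name, (mean, variance) in elements.items()
    }


# Hypothetical two-dimensional model elements: (mean, variance).
elements = {
    "a": ((0.0, 0.0), (1.0, 1.0)),
    "b": ((3.0, 3.0), (1.0, 1.0)),
}

scores = score_frame((0.2, -0.1), elements)
best_element = min(scores, key=scores.get)  # the element with the smallest distance
```

Every frame is scored against every element, so the cost grows with the product of the number of frames and the number of distinct distributions, which is why restricting the computation to a subset of distributions is attractive.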
The search need not be restricted to words or phrases. One or more instances of any sound segment (speech, non-speech or a combination) may be captured and used to build a single hidden Markov model in order to search in incoming audio data for further occurrences similar to that segment or those segments. The term “sound bite” is used in this document for such an approach. As for sub-word HMMs, the searching process requires considerable resource to compute scores reflecting the similarities between incoming sound frames and states of the target model(s). (The use of single instances to represent models is also known in the art as “template matching”. It is known that template matching is a special case of hidden Markov modeling, wherein there is a one-to-one correspondence between frames of the template and states of a single HMM which represents the whole sound segment, each state represents a unique sound model element, and the corresponding probability distributions have a particularly simple form.)
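Template matching of the kind mentioned above is commonly carried out with dynamic time warping, which the following sketch uses as an assumed stand-in; the text does not name a specific algorithm. Each template frame plays the role of one state, and a squared Euclidean distance serves as the simple per-state score. The function name and feature values are hypothetical.

```python
def dtw_distance(template, segment):
    """Accumulated distance of the best time alignment between two sequences.

    Both inputs are non-empty sequences of equal-length feature vectors.
    """
    inf = float("inf")
    n, m = len(template), len(segment)
    # cost[i][j]: best total for aligning template[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((a - b) ** 2 for a, b in zip(template[i - 1], segment[j - 1]))
            cost[i][j] = d + min(
                cost[i - 1][j],      # template frame advances alone
                cost[i][j - 1],      # template frame repeats (like a self-loop)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical one-dimensional feature sequences.
template = [(0.0,), (1.0,), (2.0,)]
stretched = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]  # same sound, spoken slower
different = [(5.0,), (5.0,), (5.0,)]

same_score = dtw_distance(template, stretched)   # 0.0: a perfect warped match
other_score = dtw_distance(template, different)  # 50.0: a poor match
```

The inner loop computes one distance per pair of template frame and incoming frame, which mirrors the considerable scoring cost described above.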