The message domain of many word-spotting applications, such as personal memo and dictation retrieval, tends to be very user-specific and liable to change over time. An unrestricted keyword vocabulary is therefore important to allow the user to search for any term in the audio database. However, if an unrestricted keyword set is used, the location of keyword hits in the speech data cannot be determined in advance of a keyword retrieval request. Since the user expects to receive the results of a keyword search in a reasonably short time, the retrieval process must operate much faster than real time, that is, it must complete in far less time than the duration of the speech being searched. For example, to achieve a response time of three seconds for one minute of speech data, the processing needs to be twenty times faster than real time (60 seconds of speech divided by 3 seconds of processing).
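The speed requirement in the example above is simple arithmetic, and may be sketched as follows. This is an illustrative sketch only; the function names `required_speedup` and `max_searchable_audio` are hypothetical and do not appear in the specification.

```python
def required_speedup(audio_seconds: float, response_seconds: float) -> float:
    """Return how many times faster than real time a search must run.

    audio_seconds    -- duration of the speech data to be searched
    response_seconds -- response time the user is prepared to wait
    (Both names are assumptions made for this illustration.)
    """
    if audio_seconds < 0 or response_seconds <= 0:
        raise ValueError("audio duration must be >= 0 and response time > 0")
    return audio_seconds / response_seconds


def max_searchable_audio(speedup: float, response_seconds: float) -> float:
    """Return the longest audio duration searchable within the response time."""
    if speedup <= 0 or response_seconds <= 0:
        raise ValueError("speedup and response time must be > 0")
    return speedup * response_seconds


if __name__ == "__main__":
    # One minute of speech, three-second response, as in the example above.
    print(required_speedup(60.0, 3.0))      # 20.0
    # The inverse view: a 20x search covers 60 s of audio in 3 s.
    print(max_searchable_audio(20.0, 3.0))  # 60.0
```

The same relation shows why the requirement grows with the size of the audio database: for a fixed response time, doubling the amount of stored speech doubles the speed factor needed.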
It is well known in speech processing to use Hidden Markov Models to model acoustic data. A textbook on the topic is "Readings in Speech Recognition" by A. Waibel and K. F. Lee; Palo Alto: Morgan Kaufmann.
There are known fast implementation approaches, such as lattice-based word-spotting systems of the type described in the paper by James, D. A. and Young, S. J. entitled "A fast lattice-based approach to vocabulary independent wordspotting", Proc. ICASSP '94, Adelaide, 1994, but these require a large amount of memory for lattice storage.
Less memory-intensive word-spotting techniques are required for implementation in low-cost, portable devices where memory space is restricted.
A known alternative approach is to search the acoustic data directly, rather than using a lattice model. A 'filler model' and a 'keyword model' are used together to identify the locations of putative keywords in the acoustic data. This known approach is described in more detail with reference to FIG. 1.
The present invention aims to provide a method for finding a keyword in acoustic data which is faster than known methods as well as being memory-efficient.
The term 'phone' is used in this specification to denote a small unit of speech. Often a phone will be a phoneme, but it need not always comply with the strict definition of a phoneme used in the field of speech recognition.