Increasingly, data is being stored electronically, and there is a growing need to be able to search such electronic data quickly and accurately. For electronic data which represents textual information, searching can be relatively easy. However, for audio data containing speech, interrogating the data to find specific spoken words is not so easy.
Speech recognition technology has been applied to the searching of audio information and various approaches have been proposed.
One approach, termed word spotting, processes the audio data after the search term has been defined to determine whether or not that particular search term occurs in the audio. Whilst this approach does allow searching for any search term, it requires processing of each and every audio data file each time a search is performed. This can limit the speed of searching and is computationally very expensive.
An alternative approach is to process the audio data file once and create a metadata file which can be linked to the audio data. This metadata can then be searched quickly to locate a desired search term.
The usual approach to creating the metadata is to create a transcript of the audio file using a large vocabulary speech recogniser. Whilst very fast searching is then possible, since the metadata file representing a textual transcript can be searched in the usual fashion, there are limitations with this approach. For instance, the large vocabulary speech recogniser makes hard choices when producing the transcript, which can lead to errors therein. For example, in English, the phrase “a grey day” is usually indistinguishable acoustically from “a grade A”. A speech recogniser acting on such an audio input will ultimately decide on one option, using contextual and grammatical clues as appropriate. If the wrong option is chosen, the transcript will contain an error and a search on the metadata for the correct search term cannot generate a hit.
Also, large vocabulary speech recognisers are inherently limited by their vocabulary database in that they can only identify sound patterns for words they have previously been programmed with. Therefore, when audio data is processed, the resulting metadata transcript file can only contain words which the recogniser had knowledge of at the time of processing. Thus, where an audio data file contains a spoken word that the recogniser has no knowledge of (i.e. is not in the recogniser's dictionary), for instance the name of a new product, company or person, the metadata transcript will not contain that word, and again a search for that term can never generate a hit. This is especially an issue when searching the data archives of news organisations and the like. Although the database of words the speech recogniser has available can be updated, audio data files processed before the update remain limited by the database at the time the metadata was created. To incorporate the new words, the audio would have to be re-processed, which is a time consuming task.
A more recent approach retains phonetic information when creating a metadata file for searching; see, for example, K. Ng and V. Zue, “Phonetic Recognition for Spoken Document Retrieval,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, USA, pp. 325-328, 1998. In this approach, the speech recogniser does not attempt to identify words in the audio file but instead represents its phonetic content. The metadata file then consists of a representation of the phones identified in the audio file.
This approach offers more flexibility in that, in effect, the metadata file represents the identified sounds in the speech, and the speech recogniser has not made any hard decisions about what words these sounds correspond to. The concept of a word is only realised at search time, when an input search term (e.g. a text string representing one or more words) is converted into a phone sequence and a search is performed on the metadata file to identify instances of that phone sequence. This approach does require more processing during searching than the large vocabulary transcription approach, but it avoids problems such as the “grade A” vs. “grey day” choice. The vocabulary of such phonetic systems is therefore not limited by a dictionary of known words used at pre-processing time; it is limited only by the set of phones which can be identified, which is generally unchanging in a given language. Searches for words added to the search-time dictionary after the audio was indexed can be carried out without the need to re-process the audio. The search can identify all instances of similar sound patterns, allowing the user to quickly verify whether the identified speech is of relevance.
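The search-time behaviour described above can be sketched in a few lines of Python. This is an illustrative assumption of how such a system might operate, not the method of any particular recogniser: the toy phone labels, the PRONUNCIATIONS table and the function names are all invented for the example.

```python
# Hypothetical pronunciation table, used only at search time to convert
# a text query into a phone sequence (a real system might instead use
# letter-to-sound rules). The phone labels are illustrative.
PRONUNCIATIONS = {
    "grey": ["g", "r", "ey"],
    "day": ["d", "ey"],
    "grade": ["g", "r", "ey", "d"],
    "a": ["ey"],
}

def text_to_phones(query):
    """Convert a text search term into a flat phone sequence."""
    phones = []
    for word in query.lower().split():
        phones.extend(PRONUNCIATIONS[word])
    return phones

def find_phone_sequence(index, target):
    """Return the start offsets where the target phone sequence occurs.

    `index` is the stored phone sequence for an audio file (the metadata);
    `target` is the phone sequence derived from the search term.
    """
    hits = []
    for i in range(len(index) - len(target) + 1):
        if index[i:i + len(target)] == target:
            hits.append(i)
    return hits

# A stored phone sequence for one utterance. Because "a grey day" and
# "a grade A" produce the same phones, both queries hit the same region,
# with no hard word-level decision ever having been made.
stored = ["ey", "g", "r", "ey", "d", "ey"]
print(find_phone_sequence(stored, text_to_phones("a grey day")))  # [0]
print(find_phone_sequence(stored, text_to_phones("a grade a")))   # [0]
```

Note that the word-level ambiguity survives in the index: either query finds the utterance, and it is the user who verifies relevance at search time.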
If the speech recogniser is configured simply to output the most likely sequence of phones for a given section of speech, this sequence is likely to contain many phone recognition errors. Any such errors will lower the search accuracy of the system and may require the use of additional search techniques to compensate for the likely recognition errors.
There exists a known technique that addresses these problems by storing a lattice representing multiple possible phone matches in the index file, rather than just storing information regarding the most likely phone sequence a piece of spoken audio represents: D. A. James and S. J. Young, “A Fast Lattice-Based Approach to Vocabulary Independent Wordspotting”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, Vol. 1, pp. 377-380, 1994. Other lattice based approaches are described in:
Foote J. T. et al, “Unconstrained keyword spotting using phone lattices with application to spoken document retrieval”, Computer Speech and Language, Academic Press London, Vol. 11, No. 3, July 1997, pp. 207-224;
Seide F. et al, “Vocabulary-independent search in spontaneous speech”, Proceedings of the 2004 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 17 May 2004, pp. 253-256; and
Gelin P. et al, “Keyword spotting for multimedia document indexing”, Multimedia Storage and Archiving Systems II, Vol. 3229, 3 Nov. 1997, pp. 366-377.
A lattice comprises a series of nodes, each representing a point in time during the utterance. The nodes are connected by different pathways, each pathway representing a different possible phone or phone sequence. The lattice file stores the N most likely phones/phone sequences between nodes, i.e. a series of pathways of the lattice. Thus the lattice file contains an indication of the possible phones at different times in the speech between the start and end nodes, and of the connectivity between those possible phones. The choice of N, i.e. how many different hypotheses to store, sometimes referred to as the depth of the lattice, involves a trade-off between accuracy, storage and computational load. A lattice with a high depth has more possible hypotheses available, and hence the potential for improved accuracy, but requires much more storage, especially as a modest increase in depth can increase the number of possible pathways significantly. This also makes searching a high depth lattice much more computationally intensive.
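The lattice structure and the effect of depth can be illustrated with a small Python sketch. All names and phone labels here are assumptions made for the example; a real lattice would also carry timing and likelihood information on each arc, and its connectivity need not be a simple linear chain.

```python
# Illustrative phone lattice: nodes are points in time, and each pair of
# consecutive nodes is connected by up to N candidate phones, where N is
# the depth of the lattice. Here N = 2 across three arcs.
from itertools import product

lattice = [
    ["g", "k"],    # hypotheses between node 0 and node 1
    ["r", "l"],    # hypotheses between node 1 and node 2
    ["ey", "eh"],  # hypotheses between node 2 and node 3
]

def all_paths(lattice):
    """Enumerate every phone sequence the lattice encodes."""
    return ["".join(p) for p in product(*lattice)]

paths = all_paths(lattice)
# With depth N and M arcs, this simple lattice encodes up to N**M
# pathways, which is why a modest increase in depth inflates both the
# storage required and the cost of searching the lattice.
print(len(paths))  # 8, i.e. 2**3
```

Even this toy case shows the exponential growth behind the accuracy/storage trade-off: doubling the depth of each arc would raise the path count from 8 to 64.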