Speech recognition technology makes it possible to design computer systems that recognize spoken words. Speech recognition systems accept audio speech data, which are digitized audio speech signals, and output textual information. A number of speech recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent. More recently, speech recognition systems have been developed that can recognize speech without being trained on a particular voice and accent. Such systems may recognize the speech of most or all speakers, and are said to be speaker independent.
Audio speech data may be treated like any other data and stored and organized in a database. In the case of textual or numeric data, searches may be readily performed on the data by a database management system for the database. However, unlike textual or numeric data, there is no simple and efficient way to search audio speech data. In prior systems, developers who wished to search audio speech data had to develop complex software procedures in order to perform the searching. For example, in a typical search, a user will want to know which audio or video assets satisfy given text query search criteria and the time offsets within each matched media asset where matches occurred, and the user may also want to know the speech recognition confidence of each match. Conventionally, this required development of software to extract the relevant text, time offset, and confidence data from the speech recognition results, build appropriate B-tree indices on this extracted data, and associate time offsets and confidence values with their corresponding text data. In addition, procedures would have to be developed that would use the indices to search through the text data for matched rows, and then search through the matched rows for time offsets into the media asset where matches occurred.
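The conventional workflow described above can be illustrated with a minimal sketch. The data below is purely hypothetical: the asset names, words, offsets, and confidence values are invented for illustration, and a plain in-memory dictionary stands in for the B-tree indices a developer would otherwise have to build by hand.

```python
from collections import defaultdict

# Hypothetical speech recognition results: for each media asset, a list of
# (recognized_word, time_offset_in_seconds, confidence) tuples. These values
# are illustrative, not real recognizer output.
RESULTS = {
    "asset_a.wav": [("hello", 0.5, 0.92), ("world", 1.1, 0.88), ("goodbye", 7.3, 0.71)],
    "asset_b.wav": [("world", 2.4, 0.95), ("peace", 3.0, 0.67)],
}

def build_index(results):
    """Extract the relevant text, time-offset, and confidence data from the
    recognition results and associate them, keyed by word. This stands in for
    the custom index-building code a developer would conventionally write."""
    index = defaultdict(list)
    for asset, rows in results.items():
        for word, offset, confidence in rows:
            index[word].append((asset, offset, confidence))
    return index

def search(index, word):
    """Return, for a text query term, the matching assets together with the
    time offset and recognition confidence of each match."""
    return index.get(word, [])

index = build_index(RESULTS)
for asset, offset, confidence in search(index, "world"):
    print(asset, offset, confidence)
```

Even this toy version shows why the conventional approach is burdensome: the extraction, indexing, and matched-row scanning logic all had to be written and maintained by each developer, rather than being handled by the database management system.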
What is needed is a technique by which simple and efficient searches may be performed on audio speech data, and which reduces development effort.