An audio indexing system normally includes a speech recognition subsystem, which converts the audio information into textual form, and an indexing subsystem, which extracts the features to be used for searching and browsing. Thus, in a conventional automatic audio indexing arrangement, an example of which (100) is schematically illustrated in FIG. 1, the input audio speech signal (102) is processed by a speech recognizer (104) to convert it into a raw textual form, which may then proceed to a feature extraction arrangement (105) that, for example, resolves the raw text output from the speech recognizer into “morphs” or “stems” (see infra). The resulting text (106) is then stored in an audio indexing database (108), where it can be accessed by an audio indexing subsystem (110) that provides retrieval, summarization and other indexing functions. The feature extraction function could also be performed within the speech recognizer (104) itself, thus obviating the need for a separate arrangement (105) for performing that function. Alternatively, feature extraction may be lacking altogether, in which case the words recognized by the speech recognizer (104) are transferred in their original form (without “morphing”, “stemming” or the like) directly to the audio indexing database (108). A general discussion of audio indexing systems is provided in J. S. Garofolo, E. M. Voorhees, V. M. Stanford, “TREC-6 1997 Spoken Document Retrieval Track Overview and Results”, in E. M. Voorhees, D. K. Harman, editors, The Proceedings of the Sixth Text Retrieval Conference, NIST Special Publication 500-240, pp 98-91, and M. Vishwanathan, H. S. M. Beigi, S. Dharanipragada, A. Tritchler, “Retrieval from Spoken Documents Using Context and Speaker Information”, in The Proceedings of International Conference on Document Analysis and Recognition, (ICDAR 99), Bangalore, India, 1999, pp 567-572.
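By way of illustration only, the conventional arrangement of FIG. 1 might be sketched as follows. The sketch assumes that the speech recognizer (104) has already produced a transcript for each audio segment. The suffix-stripping `toy_stem` function, the in-memory dictionary standing in for the audio indexing database (108), and all other names are hypothetical and are not taken from any particular system.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, deliberately crude stand-in for feature extraction (105)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip a suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(transcript):
    """Resolve raw recognizer text into lower-cased toy stems."""
    return [toy_stem(w) for w in transcript.lower().split()]


def build_index(transcripts):
    """Map each stem to the set of segment ids containing it (stand-in for 108)."""
    index = defaultdict(set)
    for segment_id, text in transcripts.items():
        for term in extract_features(text):
            index[term].add(segment_id)
    return index


def search(index, query):
    """Return the segment ids that contain every stem of the query (stand-in for 110)."""
    terms = extract_features(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Assumed recognizer output for two audio segments.
transcripts = {
    "seg1": "flights leaving boston",
    "seg2": "weather in denver",
}
index = build_index(transcripts)
matches = search(index, "flight leaving")
```

Because the query and the stored text pass through the same feature extraction, the query "flight leaving" matches the segment that contains "flights leaving". Any recognition error in a stored transcript would be carried into the index unchanged.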
One of the weak points of the approach described above is that the text data created by the speech recognizer and/or feature extraction arrangement typically contains numerous errors (e.g., word insertions, deletions and substitutions), caused by the noisy character of the incoming audio and by inherent imperfections of the speech recognition and/or feature extraction systems. Such errors are necessarily reflected in the resulting audio indexing database and cause problems when the database is searched.
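The three error types named above can be counted by aligning a recognizer output against a reference transcript. The following is a minimal sketch, assuming whitespace-separated words and a standard minimum-edit-distance alignment. The function names and the sample sentences are hypothetical.

```python
def count_word_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    # cost[i][j] holds (total, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, subs, dels, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, dels, ins)  # match, no cost
            else:
                diagonal = (total + 1, subs + 1, dels, ins)
            total, subs, dels, ins = cost[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = cost[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            # Tuples compare by total first, so the minimum total always wins.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """Total word errors divided by the number of reference words."""
    subs, dels, ins = count_word_errors(reference, hypothesis)
    return (subs + dels + ins) / len(reference.split())


errors = count_word_errors("the flight leaves boston at noon",
                           "the flight leave boston noon today")
```

Each miscounted word in such an alignment corresponds to a term that is wrongly present in, or wrongly absent from, the audio indexing database.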
A need, therefore, has been recognized in conjunction with providing an audio indexing system that mitigates the above-described errors.
The papers “A Fertility Channel Model for Post-correction of Continuous Speech Recognition” (James Allen et al., International Conference on Spoken Language Processing, 1996) and “Error Correction via a Post-processor for Continuous Speech Recognition” (Eric K. Ringger and James F. Allen, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996) describe the use of a statistically trained translation model in an air travel (ATIS) dialog/natural language understanding system. The statistical translation model was trained to translate from speech recognition output to hand-corrected (clean) text.
A primary objective of Allen et al., in particular, was to reduce the word error rate and hence to improve the readability of the speech recognizer output. However, no provision was made for directly improving audio indexing effectiveness, as measured through standard information retrieval metrics such as average precision. Also, the system disclosed in Allen et al. was forced to operate in real time, which restricted its capabilities.
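For reference, average precision for a single query is commonly computed as the mean of the precision values observed at the rank of each relevant document retrieved, divided over all relevant documents. The following is a minimal sketch under that common definition. It assumes that the ranked list contains no duplicate identifiers, and the function name and sample identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero to the sum.
    return precision_sum / len(relevant)


# Relevant segments appear at ranks 1 and 3 of a four-item result list.
score = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Unlike word error rate, this measure depends only on where the relevant items fall in the ranked results, so a reduction in word error rate does not by itself guarantee a higher average precision.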
Thus, a need has also been recognized in conjunction with providing an audio indexing system that improves upon the shortcomings of previous efforts in the field, including those discussed above.