The present invention relates to speech recognition systems and, more particularly, to methods and apparatus for detecting non-target languages in a monolingual speech recognition system.
Speech recognition and audio indexing systems are generally developed for a specific target language. The lexica, grammar and acoustic models of such monolingual systems reflect the typical properties of the target language. In practice, however, these monolingual systems may be exposed to other, non-target languages, leading to poor performance, including improper transcription or indexing, potential misinterpretations, or false system reactions.
For example, many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information, for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user.
If the source audio information includes non-target language references, however, the speech recognition system may improperly transcribe the non-target language references, potentially leading to improper classification or indexing of the source information. A need therefore exists for a method and apparatus for detecting non-target language references in an audio transcription or speech recognition system.
With the trend in globalizing communication technologies and providing services to a wide, multilingual public, the ability to distinguish between languages has become increasingly important. The language-rejection problem is closely related to this ability and thus to the problem of automatic language identification (ALI). For a detailed discussion of automatic language identification techniques, see, for example, Y. K. Muthusamy et al., "Reviewing Automatic Language Identification," IEEE Signal Processing Magazine, 11(4):33-41 (October 1994); J. Navrátil and W. Zühlke, "Phonetic-Context Mapping in Language Identification," Proc. of the EUROSPEECH-97, Vol. 1, 71-74 (1997); and J. Navrátil and W. Zühlke, "An Efficient Phonotactic-Acoustic System for Language Identification," Proc. of the Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2, 781-84, Seattle, Wash., IEEE (May 1998), each incorporated by reference herein.
A number of automatic language identification techniques have been proposed or suggested for distinguishing languages based on various features contained in the speech signal. Several sources of language-discriminative information have been identified as relevant for the task of language identification including, for example, the prosody, the acoustics, and the grammatical and lexical structure. Automatic language identification techniques based on the prosody or acoustics of speech attempt to identify a given language based on typical melodic and pronunciation patterns, respectively.
Due to the complexity of automatic language identification techniques based on the grammatical and lexical structure, however, most proposals have advanced techniques based on acoustic-prosodic information or derived lexical features in order to represent the phonetic structure in a less complex manner. ALI techniques have been developed that model statistical dependencies inherent in phonetic chains, referred to as the phonotactics. In the statistical sense, phonotactics can be viewed as a subset of grammatical and lexical rules of a language. Since these rules differ among languages, the ability to discriminate among languages is naturally reflected in the phonotactic properties.
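By way of illustration only, the statistical dependencies in phonetic chains may be sketched as a bigram model over phone labels, with an utterance scored by its average log-likelihood under each language's model. The phone inventory, the toy training sequences, the add-one smoothing, and the function names `train_bigram` and `avg_log_likelihood` are assumptions made for this sketch and do not appear in the cited references.

```python
import math
from collections import Counter


def train_bigram(sequences, inventory):
    """Return a function giving add-one-smoothed log P(next | prev) over phones."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    size = len(inventory)

    def log_prob(prev, nxt):
        # Add-one smoothing keeps unseen phone pairs from having zero probability.
        return math.log((pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + size))

    return log_prob


def avg_log_likelihood(seq, log_prob):
    """Average per-transition log-likelihood of a phone sequence."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        raise ValueError("sequence must contain at least two phones")
    return sum(log_prob(p, n) for p, n in pairs) / len(pairs)


# Hypothetical four-phone inventory and toy training data for two languages.
INVENTORY = ["a", "k", "s", "t"]
model_a = train_bigram([["k", "a", "t", "a"], ["t", "a", "k", "a"]], INVENTORY)
model_b = train_bigram([["s", "t", "k", "s"], ["k", "s", "t", "k"]], INVENTORY)

utterance = ["k", "a", "t", "a"]
score_a = avg_log_likelihood(utterance, model_a)
score_b = avg_log_likelihood(utterance, model_b)
```

In this toy setting the utterance follows the consonant-vowel alternation of the first language, so `score_a` exceeds `score_b`; a practical system would train such models on phone sequences decoded from real speech.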
Generally, methods and apparatus are disclosed for detecting non-target language references in an audio transcription or speech recognition system using confidence scores. The confidence score may be based on (i) a probabilistic engine score provided by a speech recognition system, (ii) additional scores based on background models, or (iii) a combination of the foregoing. The engine score provided by the speech recognition system for a given input speech utterance reflects the degree of acoustic and linguistic match of the utterance with the trained target language. In one illustrative implementation, the probabilistic engine score provided by the speech recognition system is combined with the background model scores to normalize the engine score as well as to account for the potential presence of a non-target language. The normalization narrows the variability range of the scores across speakers and channels.
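The combination of the engine score with the background model scores is not given here as a formula. One possible reading, offered only as a sketch, treats all scores as log-likelihoods and subtracts the best background score from the engine score, in the manner of a log-likelihood ratio. The function name `normalized_confidence`, the `weight` parameter, and the choice of the maximum over the background scores are assumptions of this sketch.

```python
def normalized_confidence(engine_score, background_scores, weight=1.0):
    """Engine log-score normalized by the best-matching background model.

    Assumption: all scores are log-likelihoods on a comparable scale, so an
    offset shared by all models (e.g. a noisy channel) cancels in the difference.
    """
    if not background_scores:
        raise ValueError("at least one background score is required")
    return engine_score - weight * max(background_scores)


# The same utterance over a clean channel and over a channel that lowers
# every log-score by 2.0 (hypothetical values).
clean = normalized_confidence(-4.0, [-6.0, -5.0])
noisy = normalized_confidence(-6.0, [-8.0, -7.0])
```

Both calls yield the same confidence, which illustrates how a normalization of this kind can narrow the variability of scores across speakers and channels. A low or negative value indicates that some background model explains the utterance at least as well as the target-language engine.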
The present invention identifies a non-target language utterance within an audio stream when the confidence score falls below a predefined criterion. According to one aspect of the invention, a language rejection mechanism interrupts or modifies the transcription process when speech in the non-target language is detected. In this manner, the present invention prevents improper transcription and indexing and false interpretations of the speech recognition output.
In the presence of non-target language utterances, the transcription system is not able to find a good match based on its native vocabulary, language models and acoustic models. The resulting recognized text will therefore have lower associated engine scores. Thus, the engine score alone may be used to identify a non-target language when the engine score is below a predefined threshold.
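A threshold test on the engine score alone may be sketched as follows. The `Segment` record, the function names, and the numeric scores and threshold are hypothetical and chosen only for illustration; in practice the threshold would be tuned on held-out data.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str            # recognized text for one stretch of audio
    engine_score: float  # engine score reported for that text (assumed log-domain)


def flag_non_target(segments, threshold):
    """Return True for each segment whose engine score falls below the threshold."""
    return [seg.engine_score < threshold for seg in segments]


def keep_target(segments, threshold):
    """Drop flagged segments so they are not transcribed or indexed."""
    return [seg for seg in segments if seg.engine_score >= threshold]


segments = [
    Segment("the market closed higher today", -2.0),
    Segment("gut and mo again", -9.5),  # poorly matched, likely non-target speech
]
flags = flag_non_target(segments, threshold=-5.0)
kept = keep_target(segments, threshold=-5.0)
```

Here only the first segment is retained. Dropping a flagged segment is one way of modifying the transcription process; an implementation could instead mark the segment or halt transcription.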
The background models are created or trained based on speech data in several languages, which may or may not include the target language itself. A number of types of background language models may be employed for each modeled language, including one or more of (i) prosodic models; (ii) acoustic models; (iii) phonotactic models; and (iv) keyword spotting models.