The present invention relates in general to telecommunication systems, and more particularly, to cross-speaker speech recognition for telecommunication applications.
As telecommunication systems grow more sophisticated, speech recognition technology becomes increasingly important. For example, speech recognition technology is useful for various automated intelligent network functions, such as for a voice controlled intelligent personal agent that handles a wide variety of call and message functions. The voice controlled intelligent personal agent designed and implemented by Lucent Technologies, for example, includes natural language, voice controlled services such as automatic name (voice) dialing, name (voice) message retrieval and playback, voice messaging, and call screening.
Many current implementations of speech recognition technology are limited to same-speaker recognition. For example, current state-of-the-art voice name dialing requires a subscriber to "train" a set of names, repeatedly speaking the set of names to form a name list. Subsequently, constrained by this set, the speech recognizer will recognize another spoken sample as one of these names in the set, and dial a corresponding associated directory number. Such current systems do not provide for voice name dialing from untrained names or lists. In addition, such current systems do not provide for cross-speaker recognition, in which a name spoken by a subscriber may be recognized as the same name spoken by an incoming caller of the subscriber.
Many current types of speaker-trained speech recognition technologies are also whole word or template-based, rather than sub-word (phoneme) based. Such whole word or template-based speech recognition technologies attempt to match one acoustic signal to another acoustic signal, generating a distinct and separate statistical model for every word in the recognizer vocabulary set. Such speech recognition technology is highly user specific, and generally does not provide for recognition of the speech of a different speaker. In addition, such template-based speech recognition is impractical, expensive and difficult to implement, especially in telecommunication systems.
As a consequence, a need remains for an apparatus, method and system for speech recognition that is capable of recognizing the speech of more than one user, namely, having capability for cross-speaker speech recognition. In addition, such cross-speaker recognition should be sub-word or phoneme-based, rather than whole word or template-based. Such cross-speaker speech recognition should also have high discrimination capability, high noise immunity, and should be user friendly. Preferably, such cross-speaker speech recognition should also utilize a "hidden Markov model" for greater accuracy. In addition, such cross-speaker speech recognition technology should be capable of cost-effective implementation in advanced telecommunication applications and services, such as automatic name (voice) dialing, message management, call return management, and incoming call screening.
The apparatus, method and system of the present invention provide sub-word, phoneme-based, cross-speaker speech recognition, and are especially suited for telecommunication applications such as automatic name dialing, automatic message creation and management, incoming call screening, call return management, message playback management, and name list generation.
The various embodiments of the present invention provide for such cross-speaker speech recognition utilizing a methodology that provides both high discrimination and high noise immunity, utilizing a matching or collision of two different speech models. A phoneme or subword-based pattern matching process is implemented, utilizing a "hidden Markov model" ("HMM"). First, a phoneme-based pattern or transcription of incoming speech, such as a spoken name, is created utilizing an HMM-based recognizer with speaker-independent phoneme models and an unconstrained grammar, in which any phoneme may follow any other phoneme. In addition, utilizing an HMM-based recognizer with a constrained grammar, the incoming speech is utilized to select or "recognize" a closest match of the incoming speech to an already existing phoneme pattern representing a name or word, if any, i.e., recognition is constrained by existing patterns, such as phoneme patterns representing names. The methodology of the invention then determines likelihood of fit parameters, namely, a likelihood of fit of the incoming speech to the unconstrained, speaker-independent model, and a likelihood of fit of the incoming speech to the selected or recognized existing pattern. Based upon a comparison of these likelihood of fit parameters, the various embodiments of the present invention determine whether the incoming speech matches or, as used equivalently herein, collides with a particular name or word. Such matches or "collisions" are then utilized for various telecommunication applications, such as automatic voice (name) dialing, call return management, message management, and incoming call screening.
A method for cross-speaker speech recognition for telecommunication systems, in accordance with the present invention, includes receiving incoming speech, such as a caller name, generating a phonetic transcription of the incoming speech with an HMM-based, speaker-independent model having an unconstrained phoneme grammar, and determining a transcription parameter as a likelihood of fit of the incoming speech to the speaker-independent, unconstrained grammatical model. The method also selects a first existing phoneme pattern, if any, from a plurality of existing phoneme patterns, as having a highest likelihood of fit to the incoming speech, and also determines a recognition parameter as a likelihood of fit of the incoming speech to the first existing phoneme pattern. The method then determines whether the incoming speech matches the first existing phoneme pattern based upon a correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion, such as whether a ratio of the two parameters is above or below a predetermined, empirical threshold.
In the various embodiments, the plurality of existing phoneme patterns are generated from a plurality of speakers, such as from subscribers and incoming callers. The incoming speech may also be from any speaker of a plurality of speakers. The plurality of phoneme patterns, in the preferred embodiment, form lists for use by a subscriber, such as a name list, a message list, or both. Any given name may be associated with a variety of phoneme patterns or samples, generated by different speakers, such as by the subscriber and by various incoming callers.
Cross-speaker recognition is provided when a name, as a phoneme pattern spoken by one individual, matches (or collides with) a phoneme pattern spoken by another individual. For example, a name as spoken by an incoming caller (a person who calls a subscriber) may be recognized as the same name as spoken by the subscriber for automatic call returning.
In the preferred embodiment, the matching or collision determination is performed by comparing the transcription parameter to the recognition parameter to form a confidence ratio. When the confidence ratio is less than a predetermined threshold, the method determines that the incoming speech matches the first existing phoneme pattern; and when the confidence ratio is not less than the predetermined threshold, the method determines that the incoming speech does not match the first existing phoneme pattern.
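The threshold test described above may be sketched as follows. This is a minimal illustration only, not the implementation of the preferred embodiment. It assumes that both parameters are positive likelihood-of-fit scores and that the confidence ratio is the transcription parameter divided by the recognition parameter; the orientation of the ratio, the function name `collides`, and the example threshold value are assumptions made for illustration.

```python
def collides(transcription_score: float,
             recognition_score: float,
             threshold: float) -> bool:
    """Return True when the incoming speech is deemed to match the selected pattern.

    Assumptions (not stated in the text): both scores are positive
    likelihood-of-fit values, and the confidence ratio is formed as
    transcription_score / recognition_score.
    """
    if transcription_score <= 0 or recognition_score <= 0:
        raise ValueError("scores are assumed to be positive likelihood values")
    confidence_ratio = transcription_score / recognition_score
    # A ratio below the threshold is a match (collision); a ratio that is
    # not less than the threshold is a non-match.
    return confidence_ratio < threshold


# Hypothetical threshold; the text describes the threshold as empirical.
print(collides(1.2, 1.0, threshold=1.5))  # ratio 1.2 is below 1.5
print(collides(3.0, 1.0, threshold=1.5))  # ratio 3.0 is not below 1.5
```

Under these assumptions, a ratio near one indicates that the constrained pattern explains the incoming speech almost as well as the unconstrained transcription does.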
The embodiments are also utilized to generate various lists, such as a name list for automatic name dialing. Generating the name list includes receiving as incoming speech a first sample of a name, and performing collision or matching determination on the first sample. When the first sample does not match the first existing phoneme pattern, a transcription of the first sample is (initially) included within the plurality of existing phoneme patterns. In the preferred embodiment, for increased reliability, this is followed by receiving as incoming speech a second sample of the name, and again performing collision or matching determination on the second sample. The embodiments determine whether the second sample matches the first sample and, when the second sample does match the first sample, the various embodiments include the name in the name list, and include corresponding transcriptions of both the first sample and the second sample in the plurality of existing phoneme patterns.
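The two-sample name list generation described above may be sketched as follows, under stated assumptions. Phoneme patterns are represented here as space-separated phoneme strings, and `exact_match` is a toy stand-in for the HMM-based collision determination; both are hypothetical. The text does not state what occurs when the first sample collides with an existing pattern or when the second sample fails to match the first, so rejecting the enrollment in those cases is an assumption.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical signature: returns the colliding existing pattern, or None.
Matcher = Callable[[str, Sequence[str]], Optional[str]]


def exact_match(sample: str, patterns: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the collision test: only identical patterns collide."""
    return sample if sample in patterns else None


def enroll_name(name_list: Dict[str, List[str]], label: str,
                first: str, second: str,
                find_match: Matcher = exact_match) -> bool:
    """Add `label` to the name list when two spoken samples agree."""
    existing = [p for patterns in name_list.values() for p in patterns]
    # Step 1: the first sample must not collide with an existing pattern.
    if find_match(first, existing) is not None:
        return False  # assumption: a colliding name is rejected
    # Step 2: the first transcription is tentatively included, and the
    # second sample must collide with it.
    if find_match(second, existing + [first]) != first:
        return False  # assumption: disagreeing samples are discarded
    # Step 3: keep the name and both transcriptions.
    name_list[label] = [first, second]
    return True


names: Dict[str, List[str]] = {}
print(enroll_name(names, "john", "jh aa n", "jh aa n"))
print(names)
```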
The various embodiments of the present invention also include generating a caller name and message list, to track names and messages left by incoming callers. Generating the message list includes receiving as incoming speech a caller name and performing collision or matching determination on the caller name. When the caller name does not match a first existing phoneme pattern, the various embodiments include the caller name in the message list and indicate that one call has been received from the caller name. When the caller name does match the first existing phoneme pattern, the various embodiments increment a count of calls received from the caller name.
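The message list bookkeeping described above may be sketched as follows. The dictionary keyed by phoneme pattern, the function name `record_call`, and the `exact_match` helper (a toy stand-in for the HMM-based collision determination) are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical signature: returns the colliding existing pattern, or None.
Matcher = Callable[[str, Sequence[str]], Optional[str]]


def exact_match(sample: str, patterns: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the collision test: only identical patterns collide."""
    return sample if sample in patterns else None


def record_call(call_counts: Dict[str, int], caller_pattern: str,
                find_match: Matcher = exact_match) -> str:
    """Update the per-caller call count and return the list entry used."""
    matched = find_match(caller_pattern, list(call_counts))
    if matched is None:
        # No collision: new caller name, one call received so far.
        call_counts[caller_pattern] = 1
        return caller_pattern
    # Collision: same caller name as an existing entry, so increment.
    call_counts[matched] += 1
    return matched


counts: Dict[str, int] = {}
for spoken in ["jh aa n", "m eh r iy", "jh aa n"]:
    record_call(counts, spoken)
print(counts)
```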
The various embodiments of the present invention also perform message playback, typically utilizing cross-speaker speech recognition. For example, a calling party may leave their name, providing a phoneme pattern (placed in a message list) for cross-speaker recognition from a phoneme pattern subsequently spoken by the subscriber. Performing message playback includes receiving incoming speech; selecting the first existing phoneme pattern, from a subset of the plurality of existing phoneme patterns corresponding to the message list, as the highest likelihood of fit to the incoming speech; and playing a first message associated with the first existing phoneme pattern. When a plurality of messages are associated with the first existing phoneme pattern, the various embodiments also sequentially play the plurality of messages.
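The selection step for message playback may be sketched as follows. The scoring function `toy_fit`, which counts shared phonemes, is a hypothetical placeholder for the HMM-based likelihood of fit, and the mapping from phoneme pattern to an ordered list of message identifiers is likewise an assumption.

```python
from typing import Callable, Dict, List


def toy_fit(speech: str, pattern: str) -> float:
    """Hypothetical likelihood of fit: number of phonemes shared."""
    return float(len(set(speech.split()) & set(pattern.split())))


def messages_to_play(speech: str,
                     message_list: Dict[str, List[str]],
                     fit: Callable[[str, str], float] = toy_fit) -> List[str]:
    """Return, in stored order, the messages of the best-fitting pattern."""
    if not message_list:
        return []
    # Recognition is constrained to the patterns in the message list.
    best = max(message_list, key=lambda pattern: fit(speech, pattern))
    # All messages under the selected pattern are played sequentially.
    return list(message_list[best])


stored = {"jh aa n": ["message 1", "message 2"], "m eh r iy": ["message 3"]}
print(messages_to_play("jh ao n", stored))
```

In this sketch the subscriber's pronunciation differs from the caller's by one phoneme, and the caller's pattern is still selected because it remains the closest of the stored patterns.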
The various embodiments also include performing call return, also utilizing cross-speaker recognition. For example, a calling party may leave their name, providing a phoneme pattern (placed in a message list) for cross-speaker recognition from a phoneme pattern subsequently spoken by the subscriber to return the call. Performing call return includes receiving incoming speech; selecting the first existing phoneme pattern, from a subset of the plurality of existing phoneme patterns corresponding to a name list and a message list, as the highest likelihood of fit to the incoming speech; and transmitting a telecommunication number associated with the first existing phoneme pattern.
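Call return differs from playback in that recognition is constrained by the patterns of both the name list and the message list. A minimal sketch follows, assuming each list maps a phoneme pattern to a telecommunication number. The scoring function `toy_fit`, the precedence of name list entries over identical message list entries, and the example numbers are hypothetical.

```python
from typing import Callable, Dict, Optional


def toy_fit(speech: str, pattern: str) -> float:
    """Hypothetical likelihood of fit: number of phonemes shared."""
    return float(len(set(speech.split()) & set(pattern.split())))


def number_to_dial(speech: str,
                   name_list: Dict[str, str],
                   message_list: Dict[str, str],
                   fit: Callable[[str, str], float] = toy_fit) -> Optional[str]:
    """Return the number associated with the best-fitting pattern, if any."""
    # Assumption: when the same pattern is in both lists, the name list wins.
    candidates = {**message_list, **name_list}
    if not candidates:
        return None
    best = max(candidates, key=lambda pattern: fit(speech, pattern))
    return candidates[best]


subscriber_names = {"jh aa n": "555-0100"}
caller_names = {"m eh r iy": "555-0101"}
print(number_to_dial("m eh r iy", subscriber_names, caller_names))
```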
The various embodiments also perform incoming call screening. The subscriber selects a plurality of names to be on a call screening list, which have a corresponding plurality of existing phoneme patterns. Performing incoming call screening then includes receiving an incoming call leg; and receiving as incoming speech a caller name and performing collision or match determination on the caller name. When the caller name does not match the first existing phoneme pattern, the various embodiments transfer the incoming call leg to a message system, and when the caller name does match the first existing phoneme pattern, the various embodiments transfer the incoming call leg to the subscriber.
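The screening decision described above reduces to a single collision test against the call screening list. A minimal sketch follows; the destination labels and the `exact_match` helper (a toy stand-in for the HMM-based collision determination) are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: returns the colliding existing pattern, or None.
Matcher = Callable[[str, Sequence[str]], Optional[str]]


def exact_match(sample: str, patterns: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the collision test: only identical patterns collide."""
    return sample if sample in patterns else None


def route_incoming_call(caller_pattern: str,
                        screening_patterns: Sequence[str],
                        find_match: Matcher = exact_match) -> str:
    """Choose the destination of the incoming call leg."""
    if find_match(caller_pattern, screening_patterns) is None:
        # Caller name is not on the screening list: send to messaging.
        return "message_system"
    # Caller name collides with a screening list entry: connect the call.
    return "subscriber"


screening_list = ["jh aa n", "m eh r iy"]
print(route_incoming_call("jh aa n", screening_list))
print(route_incoming_call("b aa b", screening_list))
```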
Numerous other advantages and features of the present invention will become readily apparent from the following detailed description of the invention and the embodiments thereof, from the claims and from the accompanying drawings.