1. Technical Field
This invention relates to the field of speech recognition software, and more particularly, to a system for transcribing telephone answering machine voice mail messages.
2. Description of the Related Art
An ever growing number of people have begun using personal computers as a source of voice mail services. By alleviating the need and expense of a separate telephone answering machine, the use of a personal computer to record voice mail messages over a telephone line allows consumers to save money. This trend is likely to continue due to the impressive amount of computing power presently available to consumers in affordable multimedia personal computers. Moreover, the components enabling personal computers to provide voice mail services, such as sound cards and modems, have become standard equipment on most high-speed multimedia personal computers.
Beyond the savings afforded to consumers, computer voice mail systems provide consumers with increased flexibility over their telephone answering machine counterparts. The increasing speed and storage capacity of personal computers enables these machines to record longer messages and store far more messages than conventional telephone answering machines. Moreover, voice mail messages left by callers can be recorded in any of a variety of standardized multimedia or audio file types such as Wave or MP3 files. Such digital files can be manipulated, copied, stored, or transmitted.
Despite the many advantages of using a personal computer for voice mail, however, there exist disadvantages. One such disadvantage is that although the storage capacity of modern personal computers may seem limitless, audio files themselves can be quite large. Thus, storing or archiving old voice mail messages may consume far more storage capacity than what is available in any particular personal computer. In a network context, where storage capacity is often obtained at a premium, the large size of voice mail audio files can become even more problematic.
Another disadvantage inherent to computer based voice mail systems is that the large size of audio files can hinder rapid transmission of the files over networks and can cause network congestion. Such congestion often results in decreased network performance or even a network service outage. Further contributing to the problem is that compression of a Wave or MP3 file typically does not reduce the file size substantially. Thus, for a detailed voice mail message of three to four minutes in length, saved as a Wave or MP3 file, uploading and transmitting the audio file via a conventional 28.8 kbps modem connection can take much longer than the three or four minute playing time.
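The transmission-time claim above can be checked with simple arithmetic. The following is a minimal sketch only; the audio format (8 kHz, 8-bit, mono, uncompressed) is an assumption chosen to represent telephone-quality audio and is not specified in this document, and the function name `transfer_seconds` is hypothetical. File headers and protocol overhead are ignored.

```python
def transfer_seconds(duration_s, sample_rate_hz, bits_per_sample, channels, link_bps):
    """Estimate the time to send uncompressed PCM audio over a link.

    Ignores file headers and protocol overhead, so real transfers
    would take somewhat longer than this estimate.
    """
    audio_bits = duration_s * sample_rate_hz * bits_per_sample * channels
    return audio_bits / link_bps


# Assumed format: a 3-minute message at 8 kHz, 8-bit, mono,
# sent over a 28.8 kbps (28,800 bits per second) modem link.
seconds = transfer_seconds(180, 8000, 8, 1, 28_800)
print(seconds)  # 400.0 seconds, versus 180 seconds of playing time
```

Under these assumptions the upload takes more than twice the playing time of the message; higher-quality recordings would widen the gap.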
Another disadvantage, inherent to all voice mail systems, is that undoubtedly an occasion will arise in which the user would find a textual transcription of the voice mail message convenient. Such is the case when a voice mail contains directions to a location. Whether the voice mail message containing the directions is left on a conventional telephone answering machine or on a computer based voice mail system, the user must transcribe the voice mail message manually to obtain an accurate transcription of the voice mail message.
Another known technology, referred to as speech recognition, is the process by which an acoustic signal received by a microphone is converted to a set of text words by a computer. These recognized words may then be used in a variety of computer software applications for purposes such as document preparation, data entry, and command and control. Recently, speech recognition has been applied to recording technology. Specifically, voice recorders have been designed to record audio input which subsequently can be supplied to a speech recognition engine for conversion to text. Still, in order to convert recorded audio to text, the speech recognition engine must first be trained to recognize the speaker supplying the originally recorded audio input.
Speaker Recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech signals. Speaker Recognition can be divided into Speaker Identification and Speaker Verification. Speaker Identification determines which registered speaker provides a given utterance from amongst a set of known speakers. By comparison, Speaker Verification accepts or rejects the identity claim of a speaker: is the speaker the person they say they are? Speaker Recognition technology has been applied to the problem of using a speaker's voice to control access to restricted services, for example, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Both technologies require users to "enroll" in the system, that is, to give examples of their speech to a system so that it can characterize (or learn) their voice patterns. Speaker Recognition methods can be divided into text-dependent and text-independent methods.
Paramount to text-independent speaker identification systems is the extraction of features from a given utterance which uniquely belong to a speaker and do not change with time. Specifically, when collecting enrollment data in a speech recognition system, the features of a speaker's speech can be extracted and associated with a known speaker and stored in a database along with a reference, for example a name or identifier associated with the known speaker. Typically, during feature extraction, a speaker-independent phoneme detector can recognize a phoneme that is distinctive from speaker to speaker. The enrollment data subsequently can be retrieved using the reference and compared with features extracted from an unknown speaker voice. If the features extracted from the unknown speaker voice favorably compare with the features of the retrieved enrollment data, the unknown speaker can be identified as the speaker who had provided the retrieved enrollment data.
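The comparison of extracted features against stored enrollment data can be sketched as follows. This is an illustration only: the document does not say how features are represented or what "favorably compare" means, so the fixed-length feature vectors, the cosine similarity measure, the 0.9 threshold, and the names `cosine_similarity` and `identify_speaker` are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(features, enrollments, threshold=0.9):
    """Return the reference of the best-matching enrolled speaker.

    `enrollments` maps a reference (for example a name) to the feature
    vector stored at enrollment time. Returns None when no enrolled
    speaker scores at or above the (assumed) threshold.
    """
    best_ref, best_score = None, threshold
    for ref, enrolled in enrollments.items():
        score = cosine_similarity(features, enrolled)
        if score >= best_score:
            best_ref, best_score = ref, score
    return best_ref


# Hypothetical enrollment database with toy three-dimensional features.
enrollments = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
print(identify_speaker([0.9, 0.1, 0.2], enrollments))  # alice
print(identify_speaker([0.5, 0.5, 0.0], enrollments))  # None
```

A practical system would use far richer features and a statistical model per speaker; the sketch only shows the store, retrieve, and compare flow described above.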
Notwithstanding advances in Speaker Recognition technology, voice mail systems have yet to incorporate Speaker Recognition technology beyond access control. Moreover, although both computer based voice mail systems and speech recognition systems employing Speaker Recognition technology exist, there has yet to be a union of the two technologies to better serve the user. Accurate and efficient transcription of voice mail messages based on Speaker Recognition technology would greatly enhance the usefulness of a computer based voice mail system. As a result, there has arisen a need for a system of transcribing computer voice mail messages.
The invention disclosed herein for transcribing computer voice mail messages in accordance with the inventive arrangements satisfies the long-felt need of the prior art by using a speech recognition system equipped with Speaker Recognition technology in conjunction with a computer based voice mail system. The invention can receive or import a voice mail message stored in an audio file from a computer voice mail system. After importation of the voice mail message, the system can identify the speaker of the voice mail message. Using enrollment data corresponding to the identified speaker, the system can convert to text, or transcribe, the audio contained in the audio file. Finally, the text can be stored in a text file. The resulting text file is much smaller in size than the imported audio file from which the text was converted. The decreased file size is especially beneficial for saving storage space and reducing the resources needed to transmit the file. Moreover, the resulting text file can be made available to the user in a variety of forms including, but not limited to, displaying the text on a video display terminal, printing the text, transmitting the text file, or storing the text file for use at a later time.
The invention concerns a method and a system for transcribing a voice mail message. The method of the invention involves a plurality of steps including, first, providing a computer voice mail message stored in an audio file to a computer speech recognition system and, second, submitting the computer voice mail message to a speaker identification process in the speech recognition system. Notably, the speaker identification process can identify an enrolled speaker as a source of the computer voice mail message. Finally, responsive to the identification of the enrolled speaker, the computer voice mail message can be submitted to a speech conversion process in the speech recognition system. The speech conversion process can perform speech-to-text conversion of the computer voice mail message using speaker enrollment data corresponding to the identified enrolled speaker. Furthermore, the speech-to-text conversion can produce a transcription of the computer voice mail message. In one embodiment of the present invention, the transcription further can be displayed.
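The sequence of steps in the method above can be sketched as a small function. This is a hedged outline, not the claimed implementation: `transcribe_voice_mail`, the `identify` and `convert` callables, and the dictionary of enrollment data are hypothetical names introduced for illustration, and the stand-in functions below do no real audio processing.

```python
from typing import Any, Callable, Dict, Optional


def transcribe_voice_mail(
    audio: bytes,
    identify: Callable[[bytes, Dict[str, Any]], Optional[str]],
    convert: Callable[[bytes, Any], str],
    enrollments: Dict[str, Any],
) -> Optional[str]:
    """Identify the speaker, then convert speech to text with that
    speaker's enrollment data. Returns None if nobody is identified."""
    # Step 2: submit the message to the speaker identification process.
    speaker = identify(audio, enrollments)
    if speaker is None:
        return None
    # Final step: speech-to-text conversion using the enrollment data
    # that corresponds to the identified enrolled speaker.
    return convert(audio, enrollments[speaker])


# Stand-in processes (assumptions for the demo, not real recognizers).
def fake_identify(audio, enrollments):
    return "alice" if audio.startswith(b"A") and "alice" in enrollments else None


def fake_convert(audio, enrollment_data):
    return f"transcribed with {enrollment_data['model']}"


enrollments = {"alice": {"model": "alice-model"}}
print(transcribe_voice_mail(b"A-message", fake_identify, fake_convert, enrollments))
```

Passing the identification and conversion processes in as callables keeps the sketch independent of any particular speech recognition engine.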
The speaker identification process can identify an enrolled speaker having speaker enrollment data as a source of the voice mail message using text-independent speaker identification. Alternatively, the speaker identification process can provide to a user a list of enrolled speakers, each enrolled speaker having corresponding enrollment data. The speaker identification process can accept a selection by the user of one of the enrolled speakers in the list; and, subsequently, can identify the selected enrolled speaker as a source of the voice mail message.
The speaker identification process can create a speaker enrollment if the speaker identification process fails to identify an enrolled speaker as a source of the computer voice mail message. Furthermore, the created speaker enrollment can be associated with a non-enrolled speaker. Finally, when the created speaker enrollment has been associated with the non-enrolled speaker, the associated speaker can be identified as a source of the voice mail message. Significantly, the step of creating an enrollment can include performing an unsupervised enrollment of the associated speaker.
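The fallback described above, in which an enrollment is created when no enrolled speaker is identified, can be sketched as follows. The names `resolve_speaker`, `enroll_unsupervised`, and `new_reference` are hypothetical, and the stand-in functions only mimic the control flow; how unsupervised enrollment actually derives enrollment data from the message is not covered here.

```python
def resolve_speaker(audio, enrollments, identify, enroll_unsupervised, new_reference):
    """Return a speaker reference for the message, creating an
    enrollment from the message itself when identification fails."""
    speaker = identify(audio, enrollments)
    if speaker is None:
        # No enrolled speaker matched: build enrollment data from the
        # message and associate it with the previously non-enrolled speaker.
        enrollments[new_reference] = enroll_unsupervised(audio)
        speaker = new_reference
    return speaker


# Stand-in processes (assumptions for the demo only).
def never_identify(audio, enrollments):
    return None


def toy_enroll(audio):
    return {"samples": len(audio)}


db = {}
print(resolve_speaker(b"hello", db, never_identify, toy_enroll, "caller-1"))  # caller-1
print(db)  # {'caller-1': {'samples': 5}}
```

Once the new enrollment exists, later messages from the same caller could be matched by the ordinary identification path rather than the fallback.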
Notably, the invention can be a system for transcribing a voice mail message. The system can include a voice mail system for recording a voice mail message spoken by a caller; a speaker identification processor for identifying a source speaker associated with the recorded voice mail message; and, a speech recognition system for performing speech-to-text conversion of the recorded voice mail message using speaker enrollment data corresponding to the identified source speaker associated with the recorded voice mail message. Significantly, the speech-to-text conversion can produce a transcription of the voice mail message. Moreover, the system can further include display means for displaying the transcription. Additionally, the display means can be either a printer for printing the transcription or a user interface for visually displaying said transcription.
Significantly, the speaker identification processor can perform text-independent speaker identification. In addition, the system can further include an unsupervised enrollment processor for creating speaker enrollment data associated with a source of the voice mail message not identified by the speaker identification processor. The speech recognition system can perform speech-to-text conversion of a voice mail message spoken by the unknown speaker using the created speaker enrollment data.
The present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program means or computer program in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.