1. Field of the Invention
This invention relates to a mass-scale, user-independent, device-independent voice messaging system that converts unstructured voice messages into text for display on a screen. It is worthwhile initially looking at the challenges facing such a system. First, ‘mass-scale’: this means that the system should be scalable to very large numbers, for example 500,000+ subscribers (typically these are subscribers to a mobile telephone operator), and still allow effective and fast processing times; a message is generally only useful if received within 2-5 minutes of being left. This is far more demanding than most automatic speech recognition (ASR) implementations. Second, ‘user-independent’: this means that there is absolutely no need for a user to train the system to recognise his or her voice or speech patterns (unlike conventional voice dictation systems). Third, ‘device-independent’: this means that the system is not tied to receiving inputs from a particular input device; some prior art systems require input from, say, a touch tone telephone. Fourth, ‘unstructured’: this means that messages have no pre-defined structure, unlike responses to voice prompts. Fifth, ‘voice messages’: this is a very specific and quite narrow application field that raises different challenges to those faced by many conventional ASR systems. For example, voice mail messages for a mobile telephone frequently include hesitations, ‘ers’ and ‘ums’. A conventional ASR approach would be to faithfully convert all utterances, even meaningless sounds. The mindset of accurate or verbose transcription characterises the approach of most workers in the ASR field. But it is in fact not appropriate at all for the voice messaging domain. In the voice messaging domain, the challenge is not accurate, verbose transcription at all, but instead capturing meaning in the most helpful manner for the intended recipient(s).
Only by successfully addressing all five of these requirements is it possible to have a successful implementation.
2. Description of the Prior Art
Conversion from speech to text (STT) uses automatic speech recognition (ASR) and has, up until now, been applied mainly to dictation and command tasks. The use of ASR technology to convert voicemail to text is a novel application with several characteristics that are task specific. Reference may be made to WO 2004/095821 (the contents of which are incorporated by reference) which discloses a voice mail system from SpinVox Limited that allows voicemail for a mobile telephone to be converted to SMS text and sent to the mobile telephone. Managing voicemail in text form is an attractive option. It is usually faster to read than to listen to messages and, once in text form, voicemail messages can be stored and searched as easily as email or SMS text. In one implementation, subscribers to the SpinVox service divert their voicemail to a dedicated SpinVox phone number. Callers leave voicemail messages as usual for the subscriber. SpinVox then converts the messages from voice to text, aiming to capture the full meaning as well as stylistic and idiomatic elements of the message but without necessarily converting it word-for-word. Conversion is done with a significant level of human input. The text is then sent to the subscriber either as SMS text or email. As a result, subscribers can manage voicemail as easily and quickly as text and email messages and can use client applications to integrate their voicemail (now in searchable and archivable text form) with their other messages.
The problem with transcription systems that are significantly human based, however, is that they can be costly and difficult to scale to the mass market, e.g. to a user base of 500,000 or more. Consequently, it is impractical for major mobile or cell phone operators to offer them to their subscriber base because, for the required fast response times, it is just too expensive to have human operators listening to and transcribing the entirety of every message; the cost per message transcribed would be prohibitively high. The fundamental technical problem therefore is to design an IT-based system that enables the human transcriptionist to operate very efficiently.
WO 2004/095821 envisaged some degree of ASR front-end processing combined with human operators: in essence it was a hybrid system; the present invention develops this and defines specific tasks that the IT system can do that greatly increase the efficiency of the entire system.
Hybrid systems are known in other contexts, but the conventional approach to voice conversion is to eliminate the human element entirely; this is the mindset of those skilled in the ASR arts, especially the STT arts. We will therefore consider now some of the technical background to STT.
The core technology of speech-to-text (STT) is classification. Classification aims to determine to which ‘class’ some given data belongs. Maximum likelihood estimation (MLE), like many statistical tools, makes use of an underlying model of the data-generating process, be it the toss of a coin or the human speech production system. The parameters of the underlying model are estimated so as to maximize the probability that the model generated the data. Classification decisions are then made by comparing features obtained from the test data with model parameters obtained from training data for each class. The test data is then classified as belonging to the class with the best match. The likelihood function describes how the probability of observing the data varies with the parameters of the model. The maximum likelihood can be found from the turning points in the likelihood function if the function and its derivatives are available or can be estimated. Methods for maximum likelihood estimation include simple gradient descent as well as faster Gauss-Newton methods. However, if the likelihood function and its derivatives are not available, algorithms based on the principles of Expectation-Maximization (EM) can be employed which, starting from an initial estimate, converge to a local maximum of the likelihood function of the observed data.
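By way of illustration only, the coin-toss case mentioned above can be sketched in a few lines of Python. The class names, training data and function names below are hypothetical and are not taken from any described system; for a coin, the maximum likelihood estimate of the bias has a closed form (the fraction of heads), so no iterative search is needed.

```python
import math

def estimate_bias(tosses):
    """Maximum likelihood estimate of P(heads) from a list of 0/1 outcomes."""
    return sum(tosses) / len(tosses)

def log_likelihood(tosses, p_heads):
    """Log-probability that a coin with bias p_heads generated the tosses.

    Assumes 0 < p_heads < 1 so that both logarithms are defined.
    """
    heads = sum(tosses)
    tails = len(tosses) - heads
    return heads * math.log(p_heads) + tails * math.log(1.0 - p_heads)

def classify(test_tosses, class_models):
    """Return the class whose model gives the test data the highest likelihood."""
    return max(class_models,
               key=lambda name: log_likelihood(test_tosses, class_models[name]))

# Hypothetical training data for two classes of coin (1 = heads, 0 = tails).
training = {
    "fair": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "biased": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
}
# One model parameter per class, estimated from that class's training data.
models = {name: estimate_bias(data) for name, data in training.items()}
label = classify([1, 1, 0, 1, 1, 1], models)
print(models, label)
```

The test sequence is assigned to the class with the best match, exactly as in the description above; speech recognisers apply the same principle with far richer models.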
In the case of STT, supervised classification is used in which the classes are defined by training data, most commonly as triphone units, meaning a particular phoneme spoken in the context of the preceding and following phoneme. (Unsupervised classification, in which the classes are deduced by the classifier, can be thought of as clustering of the data.) Classification in STT is required not only to determine which triphone class each sound in the speech signal belongs to but, very importantly, what sequence of triphones is most likely. This is usually achieved by modelling speech with a hidden Markov model (HMM), which represents the way in which the features of speech vary with time. The parameters of the HMM can be found using the Baum-Welch algorithm, which is a form of EM.
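As a hedged illustration of finding the most likely sequence of classes, the following Python sketch applies the Viterbi algorithm, the standard dynamic-programming search for HMMs, to a toy model. It shows decoding only, not Baum-Welch training. The two states, the two observation symbols and all probabilities are invented for the example; a real recogniser would use triphone states and continuous acoustic features.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximises the path score into s.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Toy model: all names and numbers are illustrative assumptions.
states = ["vowel", "stop"]
start_p = {"vowel": 0.6, "stop": 0.4}
trans_p = {"vowel": {"vowel": 0.7, "stop": 0.3},
           "stop": {"vowel": 0.4, "stop": 0.6}}
emit_p = {"vowel": {"voiced": 0.8, "burst": 0.2},
          "stop": {"voiced": 0.1, "burst": 0.9}}

path = viterbi(["voiced", "voiced", "burst"], states, start_p, trans_p, emit_p)
print(path)
```

Working in log-probabilities avoids numerical underflow on long sequences, which is why the sketch adds logarithms rather than multiplying probabilities.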
The classification task addressed by the SpinVox system can be stated in a simplified form as: “Of all the possible strings of text that could be used to represent the message, which string is the most likely given the recorded voicemail speech signal and the properties of language used in voicemail?” It is immediately clear that this is a classification problem of enormous dimension and complexity.
Automatic speech recognition (ASR) engines have been under development for more than twenty years in research laboratories around the world. In the recent past, the driving applications for continuous speech, wide vocabulary ASR have included dictation systems and call centre automation of which “Naturally Speaking” (Nuance) and “How May I Help You” (AT&T) are important examples. It has become clear that successful deployment of voice-based systems depends as heavily on system design as it does on ASR performance and, possibly because of this factor, ASR-based systems have not yet been taken up by the majority of IT and telecommunications users.
ASR engines have three main elements.
1. Feature extraction is performed on the input speech signal about every 20 ms to extract a representation of the speech that is compact and as free as possible of artefacts including phase distortion and handset variations. Mel-frequency cepstral coefficients are often chosen and it is known that linear transformations can be performed on the coefficients prior to recognition in order to improve their capability for discrimination between the various sounds of speech.
2. ASR engines employ a set of models, often based on triphone units, representing all the various speech sounds and their preceding and following transitions. The parameters of these models are learnt by the system prior to deployment using appropriate training examples of speech. The training procedure estimates the probability of occurrence of each sound, the probability of all possible transitions and a set of grammar rules that constrain the word sequence and sentence structure of the ASR output.
3. ASR engines use a pattern classifier to determine the most probable text given the input speech signal. Hidden Markov model classifiers are often preferred since they can classify a sequence of sounds independently of the rate of speaking and have a structure well suited to speech modelling.
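The first of these elements, frame-based feature extraction, can be sketched as follows. This is a simplified illustration under stated assumptions: the frames are non-overlapping, the signal is a synthetic tone sampled at an assumed 8 kHz rate, and a single log-energy value per frame stands in for the mel-frequency cepstral coefficients that a practical front end would compute. The function names are hypothetical.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """Log of the mean squared amplitude; a small floor avoids log(0)."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-10)

# One second of a synthetic 440 Hz tone at an assumed 8 kHz sampling rate.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
features = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

Each 20 ms frame is reduced to a compact feature, so one second of audio becomes a short sequence of feature values that the models and the classifier then operate on.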
An ASR engine outputs the most likely text in the sense that the match between the features of the input speech and the corresponding models is optimized. In addition, however, ASR must also take into account the likelihood of occurrence of the recognizer output text in the target language. As a simple example, “see you at the cinema at eight” is a much more likely text in common English usage than “see you at the cinema add eight”, although analysis of the speech waveform alone would more likely detect ‘add’ than ‘at’. The study of the statistics of occurrence of elements of language is referred to as language modelling. It is common in ASR to use both acoustic modelling, referring to analysis of the speech waveform, and language modelling to improve the recognition performance significantly.
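The interplay of acoustic and language modelling in the cinema example can be sketched as below. The probabilities are made up purely for illustration, and the function name and the `lm_weight` parameter are assumptions; the sketch simply selects the text that maximises the sum of the acoustic and language log-probabilities.

```python
import math

def best_hypothesis(hypotheses, lm_weight=1.0):
    """Pick the text maximising log P(speech | text) + lm_weight * log P(text)."""
    return max(
        hypotheses,
        key=lambda h: math.log(h["acoustic"]) + lm_weight * math.log(h["language"]),
    )

# Invented scores: the waveform slightly favours 'add', the language model
# strongly favours 'at'.
hypotheses = [
    {"text": "see you at the cinema at eight", "acoustic": 0.4, "language": 0.01},
    {"text": "see you at the cinema add eight", "acoustic": 0.6, "language": 0.0001},
]

acoustic_only = max(hypotheses, key=lambda h: h["acoustic"])["text"]
combined = best_hypothesis(hypotheses)["text"]
print(acoustic_only)
print(combined)
```

With the acoustic score alone the ‘add’ variant wins; once the language score is included the ‘at’ variant is selected.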
The simplest language model is a unigram model which contains the frequency of occurrence of each word in the vocabulary. Such a model would be built by analysing extensive texts to estimate the likelihood of occurrence of each word. More sophisticated modelling employs n-gram models that contain the frequency of occurrence of strings of n elements in length. It is common to use n=2 (bigram) or n=3 (trigram). Such language models are substantially more computationally expensive but are able to capture language usage much more specifically than unigram models. For example, bigram word models are able to indicate a high likelihood that ‘degrees’ will be followed by ‘centigrade’ or ‘fahrenheit’ and a low likelihood that it is followed by ‘centipede’ or ‘foreigner’. Research on language modelling is underway worldwide. Issues include improvement of the intrinsic quality of the models, introduction of syntactic structural constraints into the models and the development of computationally efficient ways to adapt language models to different languages and accents.
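A bigram model of the kind described can be sketched by counting adjacent word pairs. The three-sentence corpus below is invented for the example, the function names are hypothetical, and no smoothing is applied, so unseen pairs receive a probability of exactly zero; practical language models are built from extensive texts and smooth these estimates.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams and the number of times each word occurs as a context."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev is unseen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Tiny illustrative corpus (an assumption, not real training data).
corpus = [
    "it is twenty degrees centigrade today",
    "water boils at 212 degrees fahrenheit",
    "minus five degrees centigrade overnight",
]
contexts, bigrams = train_bigrams(corpus)

p_centigrade = bigram_prob("degrees", "centigrade", contexts, bigrams)
p_centipede = bigram_prob("degrees", "centipede", contexts, bigrams)
print(p_centigrade, p_centipede)
```

Even on this small corpus, ‘centigrade’ receives a high probability after ‘degrees’ while ‘centipede’ receives none.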
The best wide vocabulary speaker independent continuous speech ASR systems claim recognition rates above 95%, meaning fewer than one word error in every twenty words. However, this error rate is much too high to win the user confidence necessary for large scale take up of the technology. Furthermore, ASR performance falls drastically when the speech contains noise or if the characteristics of the speech do not match well with the characteristics of the data used to train the recognizer models. Specialized or colloquial vocabulary is also not well recognized without additional training.
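To make the arithmetic concrete, recognition performance of this kind is commonly measured as a word error rate: the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below is a minimal illustration with an invented 20-word message; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word deleted
                curr[j - 1] + 1,         # extra word inserted
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

# Invented 20-word message with a single substitution ('at' heard as 'add').
reference = ("hi it is me just calling to say i will be at the cinema "
             "at eight see you there bye")
hypothesis = reference.replace("at eight", "add eight")

wer = word_error_rate(reference, hypothesis)
print(wer)
```

One wrong word in a 20-word message gives a word error rate of 0.05, that is, a 95% recognition rate, so even the best quoted performance implies roughly one error in a message of this length.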
To build and deploy successful ASR-based voice systems clearly requires specific optimization of the technology to the application and added reliability and robustness obtained at the system level.
To date, no-one has fully explored the practical design requirements for a mass-scale, user-independent, hybrid voice messaging system that can convert unstructured voice messages into text. Key applications are for converting voicemail sent to a mobile telephone to text and email; other applications where a user wishes to speak a message instead of typing it out on a keyboard (of any format) are also possible, such as instant messaging, where a user speaks a response that is captured as part of an IM thread; speak-a-text, where a user speaks a message that he intends to be sent as a text message, whether as an originating communication, or a response to a voice message or a text or some other communication; speak-a-blog, where a user speaks the words he wishes to appear on a blog and those words are then converted to text and added to the blog. In fact, wherever there is a requirement for, or potential benefit to be gained from, enabling a user to speak a message instead of having to directly input that message as text, and having that message converted to text and appear on screen, then mass-scale, user-independent, hybrid voice messaging systems of the kind described in the present specification may be used.