This invention relates to an interactive voice response system and in particular relates to speech recognition processing within an interactive voice response (IVR) system.
In today's business environment, the telephone is used for many purposes: placing catalogue orders; checking airline schedules; querying prices; reviewing account balances; notifying customers of price or schedule changes; and recording and retrieving messages. Often, each telephone call involves a service representative talking to a caller, asking questions, entering responses into a computer, and reading information to the caller from a terminal screen. This process can be automated by substituting an interactive voice response system with speech recognition for the operator.
A business may rely on providing up-to-date inventory information to retailers across the country and an interactive voice response system can be designed to receive orders from customers and retrieve the data they request from a local or host-based database via a business application. The IVR updates the database to reflect any inventory activity resulting from calls. The IVR enables communications between a main office business application and a marketing force. A sales representative can obtain product release schedules or order product literature anytime, anywhere, simply by using the telephone. A customer can inquire about a stock item, and the IVR can determine availability, reserve the stock, and schedule delivery.
A banking application using an IVR with speech recognition includes the following sequence of steps. A prompt is played to the caller and the caller voices a response. The voice signal is acquired and speech recognition is performed on the voice signal to create text. Only once the speech recognition is finished and the text is formed is the text response analyzed and processed for a result which may be played to the caller. For instance, the user may ask how much money is in his savings account, and the speech recognition engine processes the whole signal so that a Natural Language Understanding (NLU) module can extract the relevant meaning of 'savings account balance'. This result is passed to a banking application to search and provide the answer.
One problem with the above voice activated database query type application is that two time intensive tasks are performed one after the other. It is known for a voice input to be processed to text and for this text to be used as the basis for a query to be processed on a database. Each process can take time of the order of seconds, and the total time of the combined processes can be noticeable to the user. It would be desirable to reduce the total time for a voice recognition and database query. Moreover, this delay has a related cost.
It is known to predict certain speech elements in advance of a full analysis. U.S. Pat. No. 5,745,873 discloses a method for recognising speech elements (e.g. phonemes) in utterances including the following steps. Based on acoustic frequency, at least two different acoustic representations are isolated for each of the utterances. From each acoustic representation, tentative decision information on the speech element in the corresponding utterance is derived. A final decision on the speech element in the utterance is then generated, based on the tentative decision information from more than one of the acoustic representations. Advantage is taken of redundant cues present in elements of speech at different acoustic frequencies to increase the likelihood of correct recognition. Speech elements are identified by making tentative decisions using frequency-based representations of an utterance, and then by combining the tentative decisions to reach a final decision. This publication discloses that a given sub-band section (in this case a frequency band) of speech contains information which may be used to predict the next sub-band section. One aspect is to get a more accurate recognition result by separately processing frequency bands.
However the above solution is still a sequential process in a voice application and the total time taken is still the combined time of the speech recognition and the later processing.
In one aspect of the invention there is provided a method for processing in an interactive voice processing system comprising: receiving a voice signal from user interaction; extracting a plurality of measurements from the voice signal; calculating an average of said measurements; locating a reference characteristic matching said average; and using text associated with the closest reference characteristic as an estimate of the text of the voice signal.
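The steps of this aspect can be illustrated with a minimal sketch. It is not the claimed implementation: the reference table, its (F1, F2) values in Hz, the function names and the use of Euclidean distance to pick the closest reference characteristic are all assumptions made for illustration.

```python
from math import dist
from statistics import fmean

# Hypothetical reference table: averaged (F1, F2) values in Hz mapped to
# the text associated with each reference characteristic.
REFERENCES = {
    (520.0, 1480.0): "savings account balance",
    (610.0, 1720.0): "current account balance",
    (450.0, 1250.0): "transfer funds",
}


def average_measurements(frames):
    """Average a sequence of per-frame measurement tuples, column by column."""
    return tuple(fmean(column) for column in zip(*frames))


def estimate_text(frames, references=REFERENCES):
    """Return the text of the reference closest to the averaged measurements.

    Closeness is Euclidean distance here, which is an assumption.
    """
    average = average_measurements(frames)
    closest = min(references, key=lambda ref: dist(ref, average))
    return references[closest]


# Three hypothetical frames of (F1, F2) measurements from a voice signal.
frames = [(500.0, 1450.0), (530.0, 1500.0), (540.0, 1490.0)]
print(estimate_text(frames))
```

In practice the measurements would come from a frequency analysis of the acquired signal; here they are hard-coded so the lookup itself can be followed.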
Thus the invention requires only acoustic analysis of a voice signal to determine a response. Since it does not require phonetic analysis of the signal to convert it into text, followed by a natural language analysis to extract the useful meaning from the text, considerable processing time is saved.
In a preferred embodiment the acoustic feature is a non-phonetic feature extracted from a frequency analysis of the voice signal. More than one non-phonemic feature of the voice signal may be acquired and used to determine the response. For instance, it is known that the acoustic effects of nasalisation, including increased bandwidths, will spread forward and backward through many segments.
Certain predictive qualities can be found in speech signals. In purely linguistic terms, articulatory settings shift during speech towards those of the prosodically marked element within a given breath group. There exist significant differences between average formant frequencies and related acoustic parameters in the "same" carrier phrase according to the segmental content of a prosodically marked (stressed) item within the breath-group. When presented with very short, and increasing, portions of speech, subjects are able to predict what a complete utterance or word may be. These predictive capabilities, based either on significant differences within the signal itself, or on top-down (i.e. prior) knowledge, or both, can be used to enhance the performance of advanced ASR and NLU/Dialogue Management-enabled services: during recognition, predictions can be made as to what the most likely and significant information is in the phrase. Such predictions could be used to activate natural language understanding (NLU) and the task specific dialogue management modules, which would therefore be able to return a possible result (or N-Best possibilities) to the application or service before the speaker has finished speaking. This would lead to reduced response times, but could also be used to cater for poor transmission rates by offering the most likely responses even before the complete signal has been received, effectively subsetting the total number of possible responses. Further, a running check between predictions and the result of actual recognition would help provide a dynamic indicator of how effective (i.e. how accurate) such predictions were for a specific instance, allowing greater or lesser reliance on such predicted responses on a case-by-case basis.
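One way to picture the running check between predictions and actual recognition is the sketch below. The class name, the sliding window of recent calls and the 0.8 trust threshold are hypothetical choices, not taken from this description; the text above only calls for some dynamic indicator of prediction accuracy.

```python
from collections import deque


class PredictionTracker:
    """Track how often predicted text agreed with the recognised text."""

    def __init__(self, window=20):
        # Only the most recent `window` outcomes are kept (an assumption).
        self._outcomes = deque(maxlen=window)

    def record(self, predicted, recognised):
        """Store whether one prediction matched the recognition result."""
        self._outcomes.append(predicted == recognised)

    def accuracy(self):
        """Fraction of recent predictions that were correct (0.0 if none)."""
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def trust_prediction(self, minimum=0.8):
        """Decide whether to rely on predictions; the threshold is hypothetical."""
        return self.accuracy() >= minimum


tracker = PredictionTracker(window=4)
for predicted, recognised in [
    ("savings balance", "savings balance"),
    ("savings balance", "savings balance"),
    ("transfer funds", "savings balance"),
    ("savings balance", "savings balance"),
]:
    tracker.record(predicted, recognised)
print(tracker.accuracy(), tracker.trust_prediction())
```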
In the preferred embodiment only that portion of the speech signal received to date is analyzed. Significant differences in that signal as compared with other signals with the same linguistic content (the "same" sounds and words) are used to predict the later, as yet unanalyzed, portion of the signal.
The acoustic processing is performed to acquire characteristics of the voice signal which have the relevant predictive characteristics and are more accessible and quicker to calculate. For instance, to perform full speech analysis on a phrase the whole phrase must be acquired, so that even a rough speech analysis of a ten second phrase using a 400 MHz processor can take additional time on top of the 10 seconds of speech, whereas the initial acoustic characteristics of a voice signal may be obtained in the first second or seconds, before the speech is even completed.
Due to the relatively low variation of the queries input to the voice system (compared with, say, a general Internet search engine) the probability of selecting the correct text on the basis of the acoustic signal can be expected to be high. This is increasingly so for a voice application with limited functionality, where the expected voice inputs do not vary significantly.
In the preferred embodiment the acoustic feature is average formant information of frequency samples of the voice signal. For instance the first, second, third and fourth formant values give unique and valuable characteristics about a voice signal. An average first and second formant (formant centroid) and average excursion from the centroid are also unique and valuable characteristics.
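The formant centroid and average excursion can be sketched as follows. The function names and sample values are hypothetical, and measuring excursion as the mean Euclidean distance of each frame from the centroid in the F1/F2 plane is an assumption, since the text does not define the distance measure.

```python
from math import dist
from statistics import fmean


def formant_centroid(frames):
    """Return the (average F1, average F2) of a sequence of (F1, F2) frames."""
    f1 = fmean(frame[0] for frame in frames)
    f2 = fmean(frame[1] for frame in frames)
    return (f1, f2)


def average_excursion(frames):
    """Mean distance of each frame from the centroid.

    Euclidean distance in the F1/F2 plane is assumed here.
    """
    centroid = formant_centroid(frames)
    return fmean(dist(frame, centroid) for frame in frames)


# Two hypothetical frames of first and second formant values in Hz.
samples = [(400.0, 1000.0), (600.0, 1000.0)]
print(formant_centroid(samples))   # (500.0, 1000.0)
print(average_excursion(samples))  # 100.0
```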
In a preferred embodiment look ahead text is associated with a particular acoustic feature or feature set. More preferably, the text comprises the key words necessary to determine a response. This avoids the associated text being processed by natural language understanding methods to extract the key words.
One of the problems with the above approach is that the acoustic information may not be sufficiently unique to make the best match of predicted phrases. A solution therefore is to make a prediction of the phrase, querying a business application with predicted keywords in the phrase to get a predicted result while at the same time performing full speech recognition on the voice signal. The predicted text or keywords are compared with the speech recognition result and if the comparison is close enough the predicted result is used. If not close enough, the speech recognition text is processed to extract key words and the business application is queried again. With this iterative approach, the time saving of not performing speech recognition and natural language understanding is maintained for correctly predicted phrases.
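The predict-then-verify flow described above can be sketched as below. Every callable passed in (`predict`, `recognise`, `extract_keywords`, `query`) is a hypothetical stand-in for the acoustic predictor, the speech recogniser, the NLU module and the business application. The keyword-overlap measure and the 0.75 threshold are likewise assumptions, since the text only requires the comparison to be "close enough".

```python
from concurrent.futures import ThreadPoolExecutor


def keyword_overlap(keywords, text):
    """Fraction of predicted keywords that occur as words in the text."""
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    return sum(keyword.lower() in words for keyword in keywords) / len(keywords)


def answer(signal, predict, recognise, extract_keywords, query, threshold=0.75):
    """Query with predicted keywords while full recognition runs in parallel."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Full speech recognition starts in the background.
        recognition = pool.submit(recognise, signal)
        # Meanwhile, predict keywords acoustically and query the application.
        predicted = predict(signal)
        predicted_result = query(predicted)
        recognised_text = recognition.result()
    # Keep the predicted result only if it agrees with the recognised text.
    if keyword_overlap(predicted, recognised_text) >= threshold:
        return predicted_result
    # Otherwise fall back to keywords extracted from the recognised text.
    return query(extract_keywords(recognised_text))


# Hypothetical stand-ins used only to exercise the flow.
result = answer(
    "raw-signal",
    predict=lambda s: ["savings", "balance"],
    recognise=lambda s: "what is my savings account balance",
    extract_keywords=lambda t: ["savings", "balance"],
    query=lambda keywords: "result for " + " ".join(keywords),
)
print(result)
```

When the prediction is correct, the business application is queried once and the answer is ready as soon as recognition confirms it; a wrong prediction costs one extra query.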
In another aspect of the invention there is provided a method for processing in an interactive voice processing system comprising: receiving a voice signal from user interaction; extracting a plurality of characteristics from a first portion of the voice signal; locating a reference characteristic matching said plurality of characteristics; and using text associated with the closest reference characteristic as an estimate of the text of the whole voice signal.
Advantageously the characteristics from the first portion of the voice signal include average formant information. More advantageously the characteristics from the first portion of the voice signal include a first phoneme or first group of phonemes. Furthermore, the formant information and the phoneme information are used together to estimate the text of the whole voice signal.
According to a further aspect of the invention there is provided an interactive voice response (IVR) system comprising: means for receiving a voice signal from user interaction; means for extracting a plurality of measurements from the voice signal; means for calculating an average of said measurements; means for locating a reference characteristic matching said average; and means for using text associated with the closest reference characteristic as an estimate of the text of the voice signal.