1. Field of the Invention
The present invention relates to voice processing apparatus and the like, and more particularly to voice processing systems that use speech recognition.
2. Description of the Related Art
Voice processing systems whereby callers interact over a telephone network (e.g. PSTN or Internet) with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller questions using prompts formed from one or more prerecorded audio segments, and the caller inputs answers by pressing dual tone multiple frequency (DTMF) keys on their telephones. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use speech recognition in order to augment DTMF input (N.B. the terms speech recognition and voice recognition are used interchangeably herein to denote the act of converting a spoken audio signal into text). The utilisation of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the above, WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running speech recognition software. This card is connected to a line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a speech recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks be performed accurately, since otherwise there may be significant caller dissatisfaction, and also as quickly as possible, both to improve caller throughput and because the owner of the voice processing system is often paying for the call, either via some freephone mechanism (e.g. an 0800 number) or because an outbound application is involved.
(Note that as used herein, the term “caller” simply indicates the party at the opposite end of a telephone connection to the voice processing system, and does not specify which party actually initiated the telephone connection).
There has been an increase in recent years in the complexity of input permitted from the caller. This is supported firstly by the use of large vocabulary recognition systems, and secondly by supporting natural language understanding and dialogue management. As a simple example of this, a pizza ordering application several years ago might have gone through a menu to determine the desired pizza size, topping etc., with one prompt to elicit each property of the pizza from a caller. Now however, such an application may simply ask: “What type of pizza would you like?”. The caller response is passed to a large vocabulary speech recognition unit, with the recognised text then being processed in order to extract the relevant information describing the pizza.
The extraction of such information is typically performed by a natural language understanding (NLU) unit working in conjunction with a dialogue manager. These units have knowledge of grammar and syntax, which allows them to parse a caller response such as “I would like a large pizza with pepperoni” to extract the particular information desired by the application, namely that the desired pizza (a) is large, and (b) has a pepperoni topping.
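By way of illustration only, the extraction of the two information elements from the example response can be sketched as follows. This is a minimal sketch that uses simple keyword spotting as a stand-in for a real NLU unit, which would apply knowledge of grammar and syntax; the vocabularies SIZES and TOPPINGS and the function name extract_pizza_elements are hypothetical and are not taken from any particular system.

```python
# Hypothetical vocabularies for a pizza ordering application (assumptions).
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushroom", "ham"}


def extract_pizza_elements(text):
    """Return the size and topping mentioned in the recognised text, if any.

    Keyword spotting only: a stand-in for grammar-based parsing.
    """
    elements = {"size": None, "topping": None}
    for word in text.lower().replace(",", " ").split():
        if word in SIZES:
            elements["size"] = word
        elif word in TOPPINGS:
            elements["topping"] = word
    return elements


print(extract_pizza_elements("I would like a large pizza with pepperoni"))
# prints {'size': 'large', 'topping': 'pepperoni'}
```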
Such natural language processing and dialogue management is described in “Building Intelligent Dialog Systems” by S McRoy, S Ali, A Restificar, and S Channarukul, in Intelligence, Spring 99, p14-23, and (at a more practical level) in “An Object-Oriented Approach to Dialogue Management in Spoken Language Systems” by R Sparks, L Meiskey, and H Brunner, in Human Factors in Computing Systems, 1994, p211-217.
The above approach presents a much more natural interface for callers, provides greater flexibility, and potentially can significantly reduce call handling time. However, large vocabulary speech recognition is still not completely reliable in all cases. It is therefore common, when the statistical confidence in the recognition result is low, to play back to the caller their selection for confirmation. Whilst this leads to a more robust system, the approach taken by current systems can seem rather robotic, and does not lead to the most efficient dialogue with the caller.
Accordingly, the invention provides a method of operating a voice processing system comprising the steps of:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
identifying at least two information elements in said text equivalent, each having an uncertainty associated therewith;
selecting a prompt according to which of said at least two information elements has the greatest uncertainty associated therewith; and
playing out said selected prompt to the user.
Voice applications involving speech recognition frequently play back user input for confirmation. According to the invention, where the user has input two or more items, these are ranked in terms of uncertainty (typically based on the speech recognition confidence level associated with each information element). The playback prompt is then structured according to which information element has the greatest uncertainty associated with it, which helps to maximise the efficiency of the confirmation process.
Thus in particular, in the preferred embodiment, the prompt is selected from a set of two or more possible prompts, each possible prompt containing the same information but having different emphasis. The different emphasis is typically achieved by varying prompt word order, but other parameters can be used instead, such as varying the volume and/or duration and/or pitch of one or more words in the prompt. This selection can be generally achieved by providing multiple pre-recorded prompts, by re-ordering or processing of individual audio segments forming a prompt, or by using a text to speech synthesis system with appropriate parameter settings.
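The ranking and prompt selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: uncertainty is taken to be one minus a recognition confidence in the range 0 to 1, emphasis is conveyed purely by word order, and the emphasised element is placed at the end of the prompt. The class InformationElement, the table PROMPTS, the prompt wordings and the function select_prompt are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class InformationElement:
    name: str          # e.g. "size" or "topping"
    value: str         # value recognised for this element
    confidence: float  # assumed recognition confidence in [0, 1]

    @property
    def uncertainty(self):
        # Assumption: uncertainty is the complement of the confidence.
        return 1.0 - self.confidence


# Hypothetical prompts: same information, different word order.
# The key names the element that the prompt emphasises (placed last).
PROMPTS = {
    "size": "You asked for a {topping} pizza, in a {size} size. Is that correct?",
    "topping": "You asked for a {size} pizza, with {topping}. Is that correct?",
}


def select_prompt(elements):
    """Pick the prompt that emphasises the most uncertain element."""
    most_uncertain = max(elements, key=lambda e: e.uncertainty)
    values = {e.name: e.value for e in elements}
    return PROMPTS[most_uncertain.name].format(**values)


elements = [
    InformationElement("size", "large", confidence=0.9),
    InformationElement("topping", "pepperoni", confidence=0.4),
]
print(select_prompt(elements))
# prints "You asked for a large pizza, with pepperoni. Is that correct?"
```

The same selection could equally drive pre-recorded prompts or text to speech parameters such as volume, duration or pitch; only the table of candidate prompts would change.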
The effect of such different emphasis is to focus user attention on the information element having the greatest uncertainty. This means that he/she is more likely to notice if a mistake has been made in the recognition or identification of the information elements. In addition, he/she is also more likely to stress the incorrect element in any repetition of the input, thereby improving the chances of performing a correct recognition the next time.
Where such repeat input is received, this is typically processed as for the previous input, in other words by performing speech recognition and identifying at least two information elements from this repeat input. In the preferred embodiment, an error condition is obtained if the uncertainty associated with the information elements in such repeat input is greater than that for the previous input, since this indicates that the recognition performance is diverging (deteriorating). Typically at this point automated interaction with the user is terminated, and they are passed to a human agent for handling.
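The divergence test described above can be sketched as follows. This is a minimal sketch: the text does not state how the uncertainties of the individual information elements are combined for comparison, so the sketch assumes that the largest per-element uncertainty is compared between the previous and the repeat input. The function names is_diverging and next_action and the returned action labels are hypothetical.

```python
def is_diverging(previous, repeat):
    """Return True if the repeat input is more uncertain than the previous one.

    previous, repeat: non-empty lists of per-element uncertainties.
    Assumption: inputs are compared by their largest element uncertainty.
    """
    return max(repeat) > max(previous)


def next_action(previous, repeat):
    """Decide how to continue after a repeat input has been processed."""
    if is_diverging(previous, repeat):
        # Error condition: stop automated interaction, pass to a human agent.
        return "transfer_to_agent"
    return "confirm_with_caller"


print(next_action([0.2, 0.5], [0.3, 0.7]))  # prints transfer_to_agent
print(next_action([0.2, 0.5], [0.1, 0.4]))  # prints confirm_with_caller
```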
The invention typically finds application in a telephony environment, in which the voice processing system and the user communicate with each other over a telephone network. In this situation, the spoken input is received over a telephone connection, and the prompt is played out over the telephone connection. Note that the audio qualities of a telephone connection can be somewhat variable, which is one of the reasons why speech recognition in this environment is often subject to uncertainty and error. (In fact, errors associated with the transmission channel are currently worse for Internet-based services where the telephony voice data is transmitted over a data network).
The invention further provides voice processing apparatus comprising:
an input device which receives spoken input from a user;
a speech recognition unit which converts said spoken input into text equivalent;
a natural language understanding unit which identifies at least two information elements in said text equivalent, each having an uncertainty associated therewith;
a prompt generator which selects a prompt according to which of said at least two information elements has the greatest uncertainty associated therewith; and
an output device which plays out said selected prompt to the user.
Such voice processing apparatus may be adapted for connection to the telephone network (conventional PSTN or the Internet), or for use in a customer server kiosk or in any other appropriate device. Note that the speech recognition means and/or any natural language understanding may or may not be integral to the voice processing system itself (as will be more clearly apparent from the preferred embodiments described below).
The invention further provides a computer readable medium containing instructions readable by a computer system operating as a voice processing system, said instructions including instructions for causing the computer system to perform the following steps:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
identifying at least two information elements in said text equivalent, each having an uncertainty associated therewith;
selecting a prompt according to which of said at least two information elements has the greatest uncertainty associated therewith; and
playing out said selected prompt to the user.
The computer readable medium may comprise a magnetic or optical disk, solid state memory device, tape, or other appropriate storage apparatus. In some cases this medium may be physically loadable into the storage device. In other cases, this medium may be fixed in the voice processing system, and the instructions loaded onto the medium via some wired or wireless network connection. Another possibility is for the medium to be remote from the voice processing system itself, with the instructions being downloaded over a wired or wireless network connection for execution by the voice processing system.
It will be appreciated that the computer program and apparatus of the invention will benefit from substantially the same preferred features as the method of the invention.