1. The Present Invention
The present invention relates to voice processing apparatus and the like, and more particularly to voice processing systems that use speech recognition.
2. Description of the Related Art
Voice processing systems whereby callers interact over a telephone network (e.g. PSTN or Internet) with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller questions using prompts formed from one or more prerecorded audio segments, and the caller inputs answers by pressing dual tone multiple frequency (DTMF) keys on their telephones. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use speech recognition in order to augment DTMF input (N.B. the term speech recognition is used herein to denote the act of converting a spoken audio signal into text). The utilisation of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the above, WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a speech recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running speech recognition software. This card is connected to a line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a speech recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks be performed accurately, otherwise there may be significant caller dissatisfaction, and also as quickly as possible, both to improve caller throughput, and because the owner of the voice processing system is often paying for the call via some free phone mechanism (e.g. an 0800 number), or because an outbound application is involved.
(Note that as used herein, the term “caller” simply indicates the party at the opposite end of a telephone connection to the voice processing system, rather than specifying which party actually initiated the telephone connection).
There has been an increase in recent years in the complexity of input permitted from the caller. This is supported firstly by the use of large vocabulary recognition systems, and secondly by supporting natural language understanding and dialogue management. As a simple example of this, a pizza ordering application several years ago might have gone through a menu to determine the desired pizza size, topping etc., with one prompt to elicit each property of the pizza from a caller. Now however, such an application may simply ask: “What type of pizza would you like”. The caller response is passed to a large vocabulary continuous speech recognition unit, with the recognised text then being processed in order to extract the relevant information describing the pizza.
The extraction of such information is typically performed by a natural language understanding (NLU) unit working in conjunction with a dialogue manager. These units have knowledge of grammar and syntax, which allows them to parse a caller response such as “I would like a large pizza with pepperoni” to extract the particular information desired by the application, namely that the desired pizza (a) is large, and (b) has a pepperoni topping. The dialogue manager further provides flexibility in terms of generating prompts (perhaps using text-to-speech synthesis) to acquire specific information from a caller.
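By way of illustration only, the kind of information extraction described above might be sketched as follows. This is a deliberately simplified keyword matcher, not a real NLU unit with knowledge of grammar and syntax; the function name extract_pizza_order and the SIZES and TOPPINGS vocabularies are hypothetical and do not come from any particular system.

```python
from typing import Dict, List, Optional, Union

# Hypothetical vocabularies, chosen only for this example.
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushroom", "ham"}


def extract_pizza_order(text: str) -> Dict[str, Union[Optional[str], List[str]]]:
    """Pull the pizza size and toppings out of recognised text by keyword lookup."""
    # Lower-case each word and strip trailing punctuation before matching.
    words = [w.strip(".,!?").lower() for w in text.split()]
    # The first size word found wins; None means the caller gave no size.
    size = next((w for w in words if w in SIZES), None)
    toppings = [w for w in words if w in TOPPINGS]
    return {"size": size, "toppings": toppings}
```

A dialogue manager could inspect the returned dictionary and generate a follow-up prompt for any property that is still missing, for example when the size is None.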
The above approach presents a much more natural interface for callers, provides greater flexibility, and potentially can significantly reduce call handling time. However, the increased flexibility also increases the scope for caller confusion. In such cases call efficiency can actually be reduced, and a lost call may result. Prior art voice processing systems have not addressed the problem of caller confusion or uncertainty that is sometimes an inevitable consequence of trying to support a caller interface that is more natural, but at the same time also more complex.
Accordingly, the invention provides a method of operating a voice processing system comprising the steps of:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
analysing at least one semantic or prosodic property of said spoken input by looking for task words in the text equivalent of the spoken input; and
responsive to said analysis, determining that the user input has effectively completed if there has not been a task word for more than a predetermined period of time.
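By way of illustration only, the steps above might be sketched as follows. The sketch assumes that the speech recognition step delivers words together with timestamps; the names RecognisedWord and effective_completion_time, and the choice to measure the period from the start of the caller's turn, are assumptions made for this example rather than features taken from the method itself.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class RecognisedWord:
    text: str
    end_time: float  # seconds since the start of the caller's turn


def effective_completion_time(
    words: Iterable[RecognisedWord],
    task_words: Set[str],
    timeout_s: float,
) -> Optional[float]:
    """Return the time at which input is deemed effectively complete.

    Returns None if a task word always arrived within timeout_s of the
    previous one (or of the start of the turn).
    """
    last_task_time = 0.0  # assumption: the turn start is the first reference point
    for word in words:
        # More than timeout_s has elapsed without any task word being spoken.
        if word.end_time - last_task_time > timeout_s:
            return last_task_time + timeout_s
        if word.text.lower() in task_words:
            last_task_time = word.end_time
    return None
```

In this sketch a caller who says "large pizza" and then talks for several seconds without any further task word would be treated as having effectively completed, even though he or she is still speaking.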
The invention typically finds application in a telephony environment, in which the voice processing system and the user communicate with each other over a telephone network. In this situation, the spoken input is received over a telephone connection, and the voice processing system may itself play out prompts over the telephone connection, such as in response to a determination that the caller input has effectively been completed. The particular prompt played back to the caller in these circumstances may of course be dependent on what information the caller has so far provided to the voice processing system.
Underlying the present invention is the fact that conventional human dialogue is regulated by the concept of turn-taking, with linguistic cues that indicate when one party has finished speaking, and is expecting or inviting the other party to take over. Prior art voice processing systems have not been sensitive to such cues, and so seem extremely artificial in terms of the dialogue that they support. This in turn can cause difficulties for callers trying to use such systems, particularly if they have relatively little experience with such man-machine interfaces.
Unlike prior art systems, the present invention allows a determination of when the caller input has effectively (rather than actually) been completed. In other words, it detects not when the caller has stopped speaking altogether, but rather when the caller has stopped saying anything useful or relevant. This is achieved by analysing at least one semantic or prosodic property of said spoken input. The intention is firstly to give quicker assistance to callers who are in difficulty (whether or not they are conscious of the fact), and secondly to speed up call handling by interrupting callers who are giving a great deal of irrelevant information. The naturalness of the caller interface can also be improved by this approach, since the techniques employed mirror to a certain extent what happens in normal conversation.
Since playing out a prompt before the caller has stopped speaking altogether effectively represents an interruption of the user, it must be timed so as to minimise any confusion or indeed offense to the user. Thus if the system starts its prompt at a particular moment when the user is still talking, the user is likely to miss the start of the system prompt. Therefore, in the preferred embodiment, once it has been determined that the user input is effectively completed, the system then plays out the next prompt when there is some break in the spoken input. This ensures that the user is likely to hear all of the system prompt, and reinforces the naturalness of the dialogue.
Note that typically the duration of a break to trigger the interruption will be much shorter than the time-out period for user input into conventional voice processing systems. Thus such prior art systems rely on silence to determine completion of user input, and so need a relatively long time-out period to discriminate input completion from the short, transitory breaks which are natural in any spoken input due to breathing, etc. By contrast, in the present invention, the determination of effective caller completion is made from semantic and/or prosodic properties, thereby allowing a much shorter interval to then be used to trigger the interruption.
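By way of illustration only, the difference between the two intervals might be sketched as follows. The function name should_play_prompt and the default values of 0.3 seconds and 2.0 seconds are hypothetical; no particular durations are specified above, only that the break used after effective completion is much shorter than a conventional time-out.

```python
def should_play_prompt(
    silence_s: float,
    effectively_complete: bool,
    short_break_s: float = 0.3,   # hypothetical value for illustration
    long_timeout_s: float = 2.0,  # hypothetical value for illustration
) -> bool:
    """Decide whether the current run of silence justifies starting a prompt.

    Once the input is judged effectively complete, a short break suffices.
    Otherwise only a conventional, much longer silence time-out applies.
    """
    threshold = short_break_s if effectively_complete else long_timeout_s
    return silence_s >= threshold
```

With these assumed values, a 0.4 second pause triggers the next prompt only if the semantic or prosodic analysis has already indicated effective completion; otherwise it is treated as an ordinary transitory break.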
This clearly has advantages even when the user does not in fact intend to continue after the break, in that the quicker determination of caller completion both accelerates call handling, and also prevents an artificially long period of silence intruding upon the dialogue.
Other possible cues for an interruption by the voice processing system (which would typically be employed in addition to the use of a break) include when the caller is dwelling on an extra-linguistic word, such as “um”. Such an approach does however require a fast response from the speech recognition system to be effective, and so would be more difficult to use as a trigger than a break, which can be detected very quickly by the voice processing interface software.
There are a wide variety of techniques available for the step of analysing at least one semantic or prosodic property of said spoken input. These can be used either singly, or in conjunction with one another. One possibility based on semantics is to look for task words in the text equivalent of the spoken input. In this case the determination can be made that the user input has effectively completed when the caller has not spoken any task word within a last predetermined period of time, or according to any other suitable criteria to indicate a reduction or termination in useful information from the caller.
Another possibility, based this time on prosody, is to look for a prolonged pitch fall as representing effective completion of the spoken input. Such a combination of increased duration and falling pitch is a good indication in normal conversation that the speaker has completed their contribution. A further possibility is to look for a reset of the pitch excursion envelope; this typically indicates that the caller is psychologically re-starting their input, which in turn suggests that he or she has become confused.
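By way of illustration only, a test for a prolonged pitch fall might be sketched as follows. The sketch assumes that a pitch tracker has already produced a fundamental frequency value in Hz for each voiced frame; the function name has_prolonged_pitch_fall, the frame length and both thresholds are hypothetical. Requiring a strictly falling contour is a simplification, since real pitch tracks are noisy and would normally be smoothed first.

```python
from typing import Sequence


def has_prolonged_pitch_fall(
    f0_hz: Sequence[float],
    frame_s: float = 0.01,        # assumed spacing between pitch frames
    min_duration_s: float = 0.3,  # hypothetical minimum length of the fall
    min_drop_hz: float = 20.0,    # hypothetical minimum total pitch drop
) -> bool:
    """Return True if the contour ends with a long, strictly falling run."""
    if len(f0_hz) < 2:
        return False
    # Walk backwards from the last frame while the pitch keeps falling.
    i = len(f0_hz) - 1
    while i > 0 and f0_hz[i - 1] > f0_hz[i]:
        i -= 1
    duration_s = (len(f0_hz) - 1 - i) * frame_s
    drop_hz = f0_hz[i] - f0_hz[-1]
    return duration_s >= min_duration_s and drop_hz >= min_drop_hz
```

Both conditions are required together, reflecting the combination of increased duration and falling pitch described above.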
Since the prosodic indicators are independent of the textual equivalent of the caller input, they can be calculated in parallel with the speech recognition. However, since such prosodic indicators may not be completely reliable by themselves, in the preferred embodiment they are used in conjunction with semantic properties. As an example of this, one good indication that the caller has effectively completed is that the caller is asking the machine a question. This can be detected both semantically, in terms of word order or words such as “how” etc., and also prosodically, typically by virtue of a final rise in caller pitch.
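By way of illustration only, the combined detection of a caller question might be sketched as follows. The function name looks_like_question, the QUESTION_WORDS list, the tail length and the rise threshold are all hypothetical. Requiring both cues to agree is one possible way of using them in conjunction; a weighted combination would be an equally plausible choice.

```python
from typing import Sequence

# Hypothetical list of words that suggest a question when they open an utterance.
QUESTION_WORDS = {"how", "what", "when", "where", "which", "who", "why"}


def looks_like_question(
    text: str,
    f0_hz: Sequence[float],
    tail_frames: int = 10,      # assumed number of final pitch frames to inspect
    min_rise_hz: float = 15.0,  # hypothetical minimum final pitch rise
) -> bool:
    """Combine a semantic cue and a prosodic cue for a caller question."""
    words = text.lower().split()
    # Semantic cue: the utterance opens with a question word.
    semantic = bool(words) and words[0] in QUESTION_WORDS
    # Prosodic cue: pitch rises over the final frames of the utterance.
    tail = f0_hz[-tail_frames:]
    prosodic = len(tail) >= 2 and (tail[-1] - tail[0]) >= min_rise_hz
    return semantic and prosodic
```

Because the prosodic cue does not depend on the recognised text, it could be computed in parallel with the speech recognition and combined with the semantic cue once the text becomes available.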
The invention further provides a voice processing apparatus comprising:
an input device to receive spoken input from a user;
a speech recognition unit to convert said spoken input into text equivalent; and
means for analysing at least one semantic or prosodic property of said spoken input, wherein responsive to said analysis, it is determined that the user input has effectively completed.
Such voice processing apparatus may be adapted for connection to the telephone network (conventional PSTN or the Internet), in a customer server kiosk, or in any other appropriate device. Note that the speech recognition means and/or any natural language understanding may or may not be integral to the voice processing system itself (as will be apparent more clearly from the preferred embodiments described below).
The invention further provides a computer readable medium containing instructions readable by a computer system operating as a voice processing system, said instructions including instructions for causing the computer system to perform the following steps:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
analysing at least one semantic or prosodic property of said spoken input;
and responsive to said analysis, determining that the user input has effectively completed.
The computer readable medium may comprise a magnetic or optical disk, solid state memory device, tape, or other appropriate storage apparatus. In some cases this medium may be physically loadable into the storage device. In other cases, this medium may be fixed in the voice processing system, and the instructions loaded onto the medium via some wired or wireless network connection. Another possibility is for the medium to be remote from the voice processing system itself, with the instructions being downloaded over a wired or wireless network connection for execution by the voice processing system.
It will be appreciated that the computer program and apparatus of the invention will benefit from substantially the same preferred features as the method of the invention.