In addition to providing printed telephone directories, many companies also provide a variety of information services to their telephone network subscribers and users. These services may include stock quotes, directory assistance and many others. In most of these applications, when the information requested can be expressed as a number or number sequence, the user is required to enter a request via a touch-tone telephone. This is often aggravating for the user, who is usually obliged to make repetitive entries in order to obtain a single answer. The situation becomes even more difficult when the input information is a word or phrase. In such situations, the involvement of a human operator may be required to complete the desired task.
Because companies are likely to handle a very large number of calls per year, the associated labor costs become very significant. Consequently, such companies, along with telephone equipment manufacturers, have devoted considerable effort to the development of systems that reduce the labor costs associated with providing information services on the telephone network. These efforts include the development of sophisticated systems that can be used in the context of telephone networks. Of particular interest is the field of Computer Telephony (CT), which provides telephone users with applications offering value-added functionality.
In order to provide a wide range of services to the telephone users, a high level of integration between a variety of different technologies is required. Typically, the integration of technologies such as interactive voice response, fax, telephone network access, telephone network features, interface to the internet and others is key to obtaining a successful application. A specific area of interest is the co-existence of voice resources, providing play/record and tone detection/generation functionality, and automatic speech recognition (ASR) resources, for applications using speech recognition as the main user interface over the telephone network.
In typical Computer Telephony (CT) systems, the application software interacts with the different types of resources through a multi-media server. Commonly, the multi-media server manages and arbitrates between different types of resources that provide specialized services. In a specific example, these services include automatic speech recognition, fax, voice interface and others, where each service-providing resource is commonly decoupled from all of the other resources. In the CT field, a number of specifications have been adopted in order to regulate the operations between the application program and the server, for example the Enterprise Computer Telephony Forum (ECTF) S.100 software specification (1996). A number of specifications have also been adopted in order to regulate the operations between the server and the resources, such as the ECTF S.300 software specification (1996).
A typical Computer Telephony (CT) platform comprises an application, a CT server, a voice resource and an automatic speech recognition (ASR) resource. In a typical interaction, the CT server or the application sends a "play" request to the voice resource. The voice resource receives the request and plays the appropriate message. Once the message is complete, the voice resource sends a "play complete" message to the CT server. The CT server, upon reception of the "play complete" message, sends a "start recognition" request to the ASR resource. The ASR resource, upon receipt of the "start recognition" request, initiates the recognition process and, upon completion, sends a "recognition done" message to the CT server. The CT server then completes the process by requesting the recognition results from the ASR resource.
A problem with systems of the type described above is that if a user responds to the prompt before the latter is complete, the system does not detect the user's response. This may be inconvenient for a frequent user of the service who does not need to hear the end of the prompt before providing an answer.
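The sequential exchange described above, and the way it misses an early response, can be sketched as follows. This is a minimal illustration only: the `Call` class, the function name and the timing values are assumptions made for the example, and message delays are taken to be zero.

```python
from dataclasses import dataclass


@dataclass
class Call:
    prompt_duration: float  # seconds needed to play the whole prompt (assumed value)
    speech_start: float     # seconds after prompt start when the user begins to talk


def sequential_flow(call):
    """Return (message trace, True if the user's response is picked up).

    The ASR resource is started only after "play complete", so anything
    spoken before the end of the prompt goes undetected.
    """
    trace = [
        (0.0, "CT server -> voice resource", "play"),
        (call.prompt_duration, "voice resource -> CT server", "play complete"),
        (call.prompt_duration, "CT server -> ASR resource", "start recognition"),
    ]
    asr_listening_from = call.prompt_duration
    heard = call.speech_start >= asr_listening_from
    if heard:
        trace.append((call.speech_start, "ASR resource -> CT server", "recognition done"))
        trace.append((call.speech_start, "CT server -> ASR resource", "get recognition results"))
    return trace, heard


# A frequent user who answers 2 s into a 5 s prompt is not detected.
_, early_heard = sequential_flow(Call(prompt_duration=5.0, speech_start=2.0))
# A user who waits for the end of the prompt is detected.
_, late_heard = sequential_flow(Call(prompt_duration=5.0, speech_start=6.0))
```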
A known solution to this problem is to create a system that allows premature responses to a prompt, herein designated as barge-in responses. In this case, the CT server sends a recognition request to the ASR resource before the play request is sent to the voice resource. The ASR resource is therefore ready to receive a spoken utterance before the voice resource has started playing the prompt. In a typical interaction, the CT server or the application sends a "start recognition" request to the ASR resource. The ASR resource, upon receipt of the "start recognition" request, initiates the recognition process and attempts to detect speech. The CT server then sends a "play" request to the voice resource. The voice resource receives the request and plays the appropriate message. When the ASR resource detects speech, it sends a "speech detected" message to the CT server or the application. If speech is detected before the voice resource has finished playing the prompt, the server sends a "stop playing" request to the voice resource, which terminates the playing of the prompt. The ASR resource, upon detection of the speech, carries out the recognition and, upon completion, sends a "recognition done" message to the CT server. The CT server or the application then completes the process by requesting the recognition result from the ASR resource. If the voice resource finishes playing the prompt before the speech is detected, then the "stop playing" request of the CT server is either not sent or has no effect on the voice resource.
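The barge-in message order described above can be sketched as a short trace generator. The function name and arguments are hypothetical, and the sketch assumes that the "stop playing" request is simply not sent when the prompt ends first, which is one of the two behaviours the text allows.

```python
def barge_in_flow(prompt_duration, speech_start):
    """Return the ordered (sender, receiver, message) exchanges for one prompt.

    prompt_duration and speech_start are in seconds from the start of the
    prompt; both are illustrative inputs, not values from any specification.
    """
    trace = [
        ("CT server", "ASR resource", "start recognition"),  # ASR listens first
        ("CT server", "voice resource", "play"),
    ]
    if speech_start < prompt_duration:
        # Barge-in: speech arrives while the prompt is still playing.
        trace.append(("ASR resource", "CT server", "speech detected"))
        trace.append(("CT server", "voice resource", "stop playing"))
    else:
        # Prompt finished on its own; no "stop playing" is sent (assumption).
        trace.append(("voice resource", "CT server", "play complete"))
        trace.append(("ASR resource", "CT server", "speech detected"))
    trace.append(("ASR resource", "CT server", "recognition done"))
    trace.append(("CT server", "ASR resource", "get recognition result"))
    return trace


interrupted = [msg for _, _, msg in barge_in_flow(prompt_duration=5.0, speech_start=2.0)]
uninterrupted = [msg for _, _, msg in barge_in_flow(prompt_duration=5.0, speech_start=6.0)]
```

In both cases "start recognition" precedes "play", which is the ordering that lets the ASR resource hear a response given before the prompt ends.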
In order to implement speech recognition systems which support barge-in responses, it is crucial that the ASR resource be capable of providing line Echo Cancellation (EC). This feature allows the ASR resource to remove from the incoming signal the echo of the prompt that the voice resource is playing on the external line. This prompt echo exists at the receiving end of the CT system when the system provides externally accessible services to analog devices, for instance analog telephones, and can significantly degrade the system's speech recognition accuracy. A system which provides Echo Cancellation can therefore detect the user's speech even in the presence of relatively strong line echo, which is necessary for the barge-in response feature, where the user can interrupt the system's voice resource prompts at any time.
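The text does not say how the echo is cancelled. One common approach, used here purely as an assumption, is a normalized least-mean-squares (NLMS) adaptive filter that learns the echo path from the prompt samples and subtracts the estimated echo from the incoming line signal. The echo path, tap count and step size below are hypothetical values chosen for the illustration.

```python
import random


def nlms_echo_cancel(prompt, line_in, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `prompt` from `line_in`.

    prompt  : samples sent out by the voice resource (reference signal)
    line_in : samples received from the line (echo plus any user speech)
    Returns the residual signal that would be passed on to the recognizer.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent prompt samples, newest first
    residual = []
    for x, d in zip(prompt, line_in):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step along the reference, scaled by its energy.
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def energy(samples):
    return sum(v * v for v in samples)


random.seed(0)
prompt = [random.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for prompt audio
echo_path = [0.5, 0.3, -0.1]  # hypothetical line echo impulse response
echo = [
    sum(echo_path[k] * prompt[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(prompt))
]
residual = nlms_echo_cancel(prompt, echo)  # no user speech present in this run
```

Once the filter has adapted, the residual carries far less energy than the raw echo, so speech added on top of the echo would stand out clearly. A practical canceller would also need to pause adaptation while the user is speaking; that is omitted here.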
A problem with systems of the type described above is that between the time the speech is detected and the time at which the voice resource receives the "stop playing" request, the user must talk at the same time as the prompt is playing, a situation herein referred to as talk-over. Furthermore, if the server and application are operating under a heavy load and are managing multiple calls simultaneously, this talk-over period may be quite long. If the talk-over period is too long, it may degrade the performance of the recognizer and annoy the user.
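The length of the talk-over period can be estimated as the sum of the delays between speech detection and the arrival of the "stop playing" request. The breakdown into three delays and the figures used below are assumptions made for illustration only.

```python
def talk_over_period(asr_to_server, server_queueing, server_to_voice, prompt_remaining):
    """Seconds during which the user and the prompt are heard together.

    asr_to_server    : delay of the "speech detected" message
    server_queueing  : time the loaded server takes to act on that message
    server_to_voice  : delay of the "stop playing" request
    prompt_remaining : prompt audio left to play when speech is detected
    All values are in seconds and are illustrative.
    """
    stop_latency = asr_to_server + server_queueing + server_to_voice
    # The prompt cannot overlap the user for longer than it has left to play.
    return min(stop_latency, prompt_remaining)


light_load = talk_over_period(0.01, 0.0, 0.01, prompt_remaining=3.0)
heavy_load = talk_over_period(0.01, 0.5, 0.01, prompt_remaining=3.0)
```

With these assumed figures, server queueing under heavy load dominates the talk-over period, which matches the concern raised above.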
Thus, there exists a need in the industry to refine the user prompting process adopted by systems with speech recognition enabled services, so as to obtain an improved speech recognition user interface.