Particularly for implementing voice or speech controlled information services which are accessible from mobile or fixed communication terminals via telephone networks and/or voice over IP links, voice or speech processing systems need to be configured to provide concurrently complex voice interaction services to a large number of users, e.g. to hundreds or thousands of concurrent users. For example, such voice or speech controlled information services include public information services, such as telephone directories, public transportation schedules, weather forecasts, sports results or other public information or databases, or personal information services, such as voice memos, text messages, contact lists or other personal information or databases. Specifically, these voice or speech processing systems need to provide voice or speech recognition services, voice or speech synthesis services, as well as dialogue control functions.
When implementing voice or speech controlled service platforms which support complex voice interaction for a large number of users, it is common practice to dynamically allocate to each active user who requests a service a fixed resource (a “port”) through which the user is connected to the service platform. However, the software and/or hardware resources required for the actual voice interaction (e.g. different types of automatic speech recognition, speech synthesis, dialogue management, access to user-specific grammars and languages, etc.) are typically accessed on demand from various networked resource pools dedicated to specific functions. In other words, a user will first be connected to a port, and subsequently, depending on the details of the interaction and/or a user profile, the port requests (dynamically or ad-hoc) voice or speech processing resources for supporting the user, e.g. speech recognition resources from a dedicated Automatic Speech Recognition (ASR) server, speech synthesis resources from a separate dedicated Text-to-Speech (TTS) server, etc. While this combined approach of fixed allocation of connectivity resources and on-demand allocation of speech processing resources may be efficient for cases where there is little a priori knowledge about, for example, the statistical requirements for speech recognition and synthesis functions, it may have significant drawbacks otherwise. Particularly, requesting and accessing signal processing resources from remote servers requires the transmission of control signals as well as the speech signals to be processed, in both directions and possibly over long distances with corresponding delays. It also involves various signal exchange protocols, formatting and de-formatting functions, fluctuations in the transmission delays of individual data packets, signal buffering for compensation of these delays, and response times of the different parts of the distributed system.
The protocols for requesting, providing and allocating resources represent a significant overhead with some amount of inertia. They are generally designed for efficient operation under stable on-demand conditions. In the case of voice interaction between a human user and a system, however, extraneous conditions (e.g. misrecognition, lack of user familiarity with system dialogue rules, ambient noise, other forms of distraction or disturbance, barge-in, etc.) will often lead to unscheduled cancellations or interruptions—conditions which slow down overall system response and use up significant resource allocation and management time. When a user calls the service, voice interaction will be secured only if ASR and TTS resources are available whenever the user requires them. Failure in the availability of any one single resource will normally lead to a negative user experience and often to the user session being aborted—this in spite of the fact that a port had actually been dedicated, i.e. specifically allocated to the user, leading to the expectation that the service is fully available.
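The failure mode described above, in which a user holds a dedicated port but the session still fails because a shared speech resource is unavailable, can be illustrated with a minimal sketch. The class and function names (`ResourcePool`, `Port`, `simulate`) and the pool sizes are hypothetical and chosen only for illustration; they are not taken from any of the systems discussed here, and real platforms would involve network protocols and timing that this sketch omits.

```python
from dataclasses import dataclass


class ResourceUnavailable(Exception):
    """Raised when a shared pool has no free resource for a request."""


@dataclass
class ResourcePool:
    """Hypothetical networked pool of identical resources (e.g. ASR engines)."""
    name: str
    capacity: int
    in_use: int = 0

    def acquire(self) -> None:
        # On-demand request: fails if every resource is already taken.
        if self.in_use >= self.capacity:
            raise ResourceUnavailable(f"{self.name} pool exhausted")
        self.in_use += 1

    def release(self) -> None:
        self.in_use = max(0, self.in_use - 1)


@dataclass
class Port:
    """Fixed connectivity resource held by one user for the whole session."""
    port_id: int
    asr_pool: ResourcePool

    def begin_recognition(self) -> None:
        # The port itself is guaranteed; the ASR resource is not.
        self.asr_pool.acquire()

    def end_recognition(self) -> None:
        self.asr_pool.release()


def simulate(num_ports: int, asr_capacity: int) -> dict:
    """All connected users speak at the same time; report who gets served."""
    asr_pool = ResourcePool("ASR", asr_capacity)
    ports = [Port(i, asr_pool) for i in range(num_ports)]
    outcomes = {}
    for port in ports:
        try:
            port.begin_recognition()
            outcomes[port.port_id] = "served"
        except ResourceUnavailable:
            # The user is connected to a port, yet the interaction fails.
            outcomes[port.port_id] = "aborted"
    return outcomes


if __name__ == "__main__":
    print(simulate(num_ports=3, asr_capacity=2))
```

In this sketch every user is successfully connected to a port, but the third simultaneous recognition request finds the pool exhausted, which corresponds to the negative user experience described in the text.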
A change in the offered services will often lead to a change in the statistics of the resources to be provided centrally, in terms of processing power and/or in terms of the time requirements to be accommodated. This change affects the overhead for resource allocation in a complex way, with an impact on performance which cannot always be predicted in a simple fashion. The result may be either a systematic overdesign of the system (with the aim of preventing resource congestion) or unexpected performance bottlenecks.
Generally speaking, a system for real-time allocation of resources supporting an unpredictable voice interaction is extremely complex, and the inherent complexity of such a system is ultimately reflected in costs incurred when setting up and operating the system.
US 2002/0143551 describes a spoken dialogue system that switches between various architectural configurations for implementing speech recognition functions based on user functionality and network conditions. According to US 2002/0143551, a client device, particularly a mobile device such as a cellular phone, is connected via a network link, e.g. a telephone network, to a server computer. Depending on the architectural configuration, speech recognition functions such as feature extraction and small vocabulary decoding are performed partly or entirely on the client device or on the server, whereas speech recognition functions such as large vocabulary decoding and natural language processing are performed typically on the server. While the dialogue system of US 2002/0143551 may be advantageous for distributing speech recognition processing over a client device and a server, it does not appear particularly suitable for large scale speech recognition processing involving speech controlled service requests from thousands of callers using a variety of different client devices.
WO 02/27708 describes a call processing system connected to a Public Switched Telephone Network (PSTN) and comprising a plurality of signal processing cards. The signal processing cards provide interactive voice response (IVR) functions and are each configured to handle twenty-four telephone calls simultaneously. For further services, the signal processing cards are connected via a data network to resource servers, e.g. a speech recognition server. While the call processing system of WO 02/27708 may be scalable to handle a large number of calls for IVR functions, it does not address the issue of how to efficiently provide speech recognition services concurrently to a large number of callers placing these calls.
U.S. Pat. No. 6,237,047 describes a voice processing system comprising a plurality of signal processing cards which are accessible to remote host computers via a data network. According to U.S. Pat. No. 6,237,047, the signal processing cards perform functions such as playing or recording sound, data/voice compression, voice recognition, or speaker authentication in accordance with commands received from the host computers. In operation, a user is connected via a PSTN to a signal processing card which supports several phone lines. The respective processing card answers the phone call from the user and establishes communication with a remote host computer issuing the commands. While allocating the remote host computers dynamically among the signal processing cards makes more efficient use of the remote host computers' processing power, it does not address the issue of how to efficiently provide speech recognition services concurrently to a large number of users.
U.S. Pat. No. 6,119,087 describes a system for voice processing which receives telephone calls via a telephone network and determines the grammar-type of a pending utterance from a caller. According to U.S. Pat. No. 6,119,087, the grammar-type indicates an expected type of speech such as a string of numbers, a person's name, a date, a stock quote, etc. According to U.S. Pat. No. 6,119,087, telephone lines are coupled in each case to a recognition client which has coupled thereto a speech application. The speech application causes the recognition client to play a user prompt and determines the grammar-type of incoming utterances. The voice processing system further comprises a load balancing resource manager which continually monitors speech recognition server devices with regard to their relative loading and relative efficiencies in handling a particular grammar-type. Based on the relative loading and relative efficiencies, the resource manager assigns a pending utterance for processing to a particular one of the speech recognition server devices, depending on the grammar-type of the utterance. While the resource manager of U.S. Pat. No. 6,119,087 may be advantageous in selecting a suitable speech recognition server, the required up-front determination of the grammar-type may not be suitable for handling speech controlled service requests from a large number of users.
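The kind of grammar-type-dependent server selection described above can be sketched as follows. This is only an illustration under stated assumptions: the cited document is summarized here as using relative loading and relative efficiencies, but the scoring rule below (efficiency multiplied by free capacity), the `RecognitionServer` type, and the `assign_server` function are hypothetical and are not taken from U.S. Pat. No. 6,119,087.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecognitionServer:
    """Hypothetical view of one speech recognition server device."""
    name: str
    load: float                   # fraction of capacity in use, 0.0 to 1.0
    efficiency: Dict[str, float]  # grammar-type -> relative efficiency


def assign_server(servers: List[RecognitionServer],
                  grammar_type: str) -> Optional[RecognitionServer]:
    """Pick a server for a pending utterance of the given grammar-type.

    Assumed scoring rule (not from the cited patent):
    score = efficiency for the grammar-type * free capacity.
    """
    candidates = [
        s for s in servers
        if grammar_type in s.efficiency and s.load < 1.0
    ]
    if not candidates:
        return None  # no server can handle this grammar-type right now
    return max(
        candidates,
        key=lambda s: s.efficiency[grammar_type] * (1.0 - s.load),
    )


if __name__ == "__main__":
    servers = [
        RecognitionServer("A", load=0.9, efficiency={"digits": 1.0, "date": 0.5}),
        RecognitionServer("B", load=0.2, efficiency={"digits": 0.6}),
        RecognitionServer("C", load=0.5, efficiency={"date": 0.8}),
    ]
    chosen = assign_server(servers, "digits")
    print(chosen.name if chosen else "no server available")
```

Note that the selection cannot start until the grammar-type is known, which is the up-front determination that the text identifies as a potential limitation at scale.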