1. Technical Field
The present application relates generally to systems and methods for providing conversational networking and, more particularly, to conversational protocols for implementing DSR (distributed speech recognition) applications over a computer network.
2. Description of Related Art
The computing world is evolving towards an era where billions of interconnected pervasive clients communicate with powerful information servers. Indeed, this millennium will be characterized by the availability of multiple information devices that make ubiquitous information access an accepted fact of life. The evolution of the computer world towards billions of pervasive devices interconnected via the Internet, wireless networks or spontaneous networks (such as Bluetooth and Jini) will revolutionize the principles underlying man-machine interaction. In the near future, personal information devices will offer ubiquitous access, bringing with them the ability to create, manipulate and exchange any information anywhere and anytime using interaction modalities (e.g., speech and/or GUI) most suited to the user's current needs and abilities. Such devices will include familiar access devices such as conventional telephones, cell phones, smart phones, pocket organizers, PDAs and PCs, which vary widely in the interface peripherals they use to communicate with the user.
The information being manipulated via such devices may reside on the local device or be accessed from a remote server via a communications network using open, interoperable protocols and standards. The implementation of such open standards also leads to a seamless integration across multiple networks and multiple information sources such as an individual's personal information, corporate information available on private networks, and public information accessible via the global Internet. The availability of a unified information source will define productivity applications and tools of the future. Indeed, users will increasingly interact with electronic information, as opposed to interacting with platform-specific software applications as is currently done in the world of the desktop PC.
With the pervasiveness of computing causing information appliances to merge into the user's environment, the user's mental model of these devices is likely to undergo a dramatic shift. Today, users regard computing as an activity that is performed at a single device like the PC. As information appliances abound, user interaction with these multiple devices will be grounded in a different set of abstractions. The most intuitive and effective user model for such interaction will be based on what users are already familiar with in today's world of human-intermediated information interchange, where information transactions are modeled as a conversation amongst the various participants in the conversation.
Indeed, it is expected that information-centric computing carried out over a plethora of multi-modal information devices will be essentially conversational in nature and will foster an explosion of conversational devices and applications. It is to be noted that the term “conversational” is used to mean more than speech interaction; it encompasses all forms of information interchange, where such interchange is typically embodied by one participant posing a request that is fulfilled by one or more participants in the conversational interaction. The core principle behind the conversational interaction is that any interaction between the user and the machine be handled as a dialog similar to human-human dialog. Accordingly, the increasing availability of information over a communications network, along with the rise in the computational power available to each user to manipulate this information, brings with it a concomitant need to increase the bandwidth of man-machine communication so that the increased human-machine interaction that will result from the pervasive use of such information devices will be as natural and simple as if the user were having a conversation with another individual.
With the increased deployment of conversational systems, however, new technical challenges and limitations must be addressed. For instance, currently available pervasive clients typically do not have the required memory and/or processing power to support complex conversational tasks such as recognition and presentation. Indeed, even with the rapid evolution of embedded processor capabilities (low power or regular processors), one cannot expect that all the processing power or memory is available for executing complex conversational tasks such as speech recognition (especially when the vocabulary size is large or specialized or when domain-specific/application-specific language models or grammars are needed), NLU (natural language understanding), NLG (natural language generation), TTS (text-to-speech synthesis), audio capture and compression/decompression, playback, dialog generation, dialog management, speaker recognition, topic recognition, and audio/multimedia indexing and searching.
Moreover, even if a networked device is “powerful” enough (in terms of CPU and memory) to execute all these conversational tasks, the device may not have access to the appropriate domain-specific and application-specific data files or appropriate algorithms (e.g., engines) to adequately execute such tasks. Indeed, vendors and service providers typically do not allow for open exchange of the algorithms (conversational engines) for executing conversational tasks and/or the data files (conversational arguments) utilized by such algorithms (e.g., grammars, language models, vocabulary files, parsing, tags, voiceprints, TTS rules, etc.) to execute such tasks, which they consider to be crown jewels of their intellectual property, business logic and technology. In addition, some conversational functions may be too specific to a given service, thereby requiring back end information that is only available from other devices or machines on the network.
Furthermore, the network infrastructure may not provide adequate bandwidth for rapidly exchanging data files needed by conversational engines for executing conversational tasks. For example, NLU and NLG services on a client device typically require server-side assistance since the complete set of conversational arguments or functions needed to generate the dialog (e.g., parser, tagger, translator, etc.) may be too extensive (in terms of communication bandwidth) for transmission from the server to the client over the network connection. In addition, even if such data files can be transmitted over the network, such transmission may introduce long delays before the client device is able to commence an application or process an input, thereby preventing or delaying real-time interactions. Examples of this are cases where a speech recognition engine must load some dialog-specific grammars (i.e., grammars that are a function of the state of the dialog) after receiving and recognizing/processing an input from the user.
These problems may be solved through implementation of distributed architectures, assuming that such architectures are implemented in appropriately managed networks to guarantee quality of service for each active dialog and data exchange. Indeed, the problems associated with a distributed architecture and distributed processing between clients and servers require new methods for conversational networking. Such methods comprise management of traffic and resources distributed across the network to guarantee appropriate dialog flow for each user engaged in a conversational interaction across the network.
Security and privacy concerns and proprietary considerations can also justify the need to distribute the speech processing. For example, it is inappropriate for a bank to send a grammar of the names of its customers to a client-side speech recognition engine. Speech grammars and other data files can also sometimes be considered as intellectual property or trade secrets that should not be distributed across networks. Indeed, these are often the key elements that make the difference between successful and failed speech applications.
Accordingly, systems and methods that provide conversational networking through implementation of, e.g., distributed speech recognition (DSR), distributed conversational architectures and conversational protocols for transport, coding and control, are highly desirable. Indeed, it would be advantageous to allow network devices with limited resources to perform complex conversational tasks (preferably in real-time) using networked resources in a manner which is automatic and transparent to the users of such devices.
Examples of applications that could rely on a DSR framework include communication assistance (e.g., name dialing, service portal, directory assistance), information retrieval (e.g., obtaining stock quotes, checking local weather reports, flight schedules, movie/concert show times and locations), M-Commerce and other transactions (e.g., buying movie/concert tickets, stock trades, banking transactions), personal information manager (PIM) functions (e.g., making/checking appointments, managing contacts list, address book, etc.), messaging (IM, unified messaging, etc.), information capture (e.g., dictation of short memos), multi-modal applications with a GUI user agent on the terminal synchronized with a DSR automated voice service, and telephony or VoIP IVR implemented by deploying a DSR framework between the gateway (IVR telephony card or VoIP gateway) and the speech engines.