The computing world is evolving towards an era where billions of interconnected pervasive clients will communicate with powerful information servers. Indeed, this millennium will be characterized by the availability of multiple information devices that make ubiquitous information access an accepted fact of life. This evolution towards billions of pervasive devices being interconnected via the Internet, wireless networks or spontaneous networks (such as Bluetooth and Jini) will revolutionize the principles underlying man-machine interaction. In the near future, personal information devices will offer ubiquitous access, bringing with them the ability to create, manipulate and exchange any information anywhere and anytime using the interaction modalities most suited to an individual's current needs and abilities. Such devices will include familiar access devices such as conventional telephones, cell phones, smart phones, pocket organizers, PDAs and PCs, which vary widely in the interface peripherals they use to communicate with the user.
The increasing availability of information, along with the rise in the computational power available to each user to manipulate this information, brings with it a concomitant need to increase the bandwidth of man-machine communication. The ability to access information via a multiplicity of appliances, each designed to suit the individual's specific needs and abilities at any given time, necessarily means that these interactions should exploit all available input and output (I/O) modalities to maximize the bandwidth of man-machine communication. Indeed, users will come to demand such multi-modal interaction in order to maximize their interaction with information devices in hands-free, eyes-free environments.
VoiceXML is a markup language designed to facilitate the creation of speech applications such as IVR (Interactive Voice Response) applications. Compared to conventional IVR programming frameworks that employ proprietary scripts and programming languages over proprietary/closed platforms, the VoiceXML standard provides a declarative programming framework based on XML (eXtensible Markup Language) and ECMAScript (see, e.g., the W3C XML specifications (www.w3.org/XML) and the VoiceXML Forum (www.voicexml.org)). VoiceXML is designed to run on web-like infrastructures of web servers and web application servers (i.e., the Voice browser). VoiceXML is a key component for providing a voice interface to mobile e-business. Indeed, VoiceXML allows information to be accessed by voice through a regular phone or a mobile phone whenever it is difficult or not optimal to interact through a wireless GUI micro-browser.
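By way of illustration only, the declarative character of VoiceXML can be sketched with a short program that assembles a minimal VoiceXML 1.0 document. The dialog itself (a form named "order" with a single field "drink"), the function name build_drink_dialog and the inline grammar text are hypothetical and are not taken from any particular application; the grammar format in particular is platform-dependent and is assumed here only for readability.

```python
import xml.etree.ElementTree as ET


def build_drink_dialog() -> str:
    """Return a minimal VoiceXML 1.0 document as a string.

    The dialog content is a hypothetical example: one form with one
    field that asks the caller to choose between two drinks.
    """
    # Root element of a VoiceXML 1.0 document.
    vxml = ET.Element("vxml", version="1.0")

    # A form groups the fields to be collected from the caller.
    form = ET.SubElement(vxml, "form", id="order")
    field = ET.SubElement(form, "field", name="drink")

    # The prompt is rendered by text-to-speech or recorded audio.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Would you like coffee or tea?"

    # Inline grammar constraining what the recognizer accepts.
    # Assumption: the grammar syntax shown is illustrative; the actual
    # format depends on the voice browser platform.
    grammar = ET.SubElement(field, "grammar")
    grammar.text = "coffee | tea"

    # The filled element runs once the field has been recognized.
    filled = ET.SubElement(field, "filled")
    reply = ET.SubElement(filled, "prompt")
    reply.text = "You chose "
    # The value element speaks the ECMAScript expression "drink".
    ET.SubElement(reply, "value", expr="drink")

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_drink_dialog())
```

The author states what is to be collected (a field, its prompt and its grammar) and leaves the order of interaction to the voice browser, which contrasts with the procedural scripts of conventional IVR frameworks.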
More importantly, VoiceXML is a key component for building multi-modal systems such as multi-modal and conversational user interfaces or mobile multi-modal browsers. Multi-modal e-business solutions exploit the fact that different interaction modes are more efficient for different user interactions. For example, depending on the interaction, talking may be easier than typing, whereas reading may be faster than listening. Multi-modal interfaces combine the use of multiple interaction modes, such as voice, keypad and display, to improve the user interface to e-business. Advantageously, multi-modal browsers can rely on VoiceXML browsers and authoring to describe and render the voice interface.
There are still key inhibitors to the deployment of compelling multi-modal e-business applications. Most arise out of the current infrastructure and device platforms. In particular, the current networking infrastructure is not configured for providing seamless, multi-modal access to information. Although a plethora of information can be accessed from servers over a communications network using an access device (e.g., personal information and corporate information available on private networks and public information accessible via a global computer network such as the Internet), the availability of such information may be limited by the modality of the client/access device or by the platform-specific software applications with which the user is interacting to obtain such information. For instance, current wireless network infrastructure and handsets do not provide simultaneous voice and data access. Middleware, interfaces and protocols are needed to synchronize and manage the different channels.
Currently, application authoring methodologies are being developed to support the development of rich multi-modal applications. It is anticipated that most multi-modal mobile deployments will rely on wireless PDAs that can overcome the above challenges by hosting a VoiceXML browser on the client (fat client configuration) or by relying on sequential or notification-based multi-modal scenarios, where the user switches connectivity when he or she wants to interact through another modality.
Because conversational engines (e.g., speech recognizers) present inherent challenges, such as their reliance on data files (e.g., grammars), it is important to provide mechanisms and tools that hide this level of complexity. It is also important that such mechanisms and tools overcome some of the limitations imposed by VoiceXML (e.g., the VoiceXML execution model). Thus, while it is anticipated that voice (alone, for multi-channel applications) and multi-modal interfaces will be key catalysts for the wide adoption of mobile e-business, it is believed that widespread adoption of such voice and multi-modal interfaces will remain challenging until tools for building applications using voice-based reusable dialog components are available to non-speech specialists.
The VoiceXML Forum has submitted VoiceXML 1.0 to the W3C Voice Browser activity (see, e.g., W3C voice browser activity, www.w3.org/voice/). As part of its activities, the W3C Voice Browser working group has identified reusable dialog components as an item worth studying and has published a set of associated requirements (see, e.g., the W3C reusable dialog requirements for voice markup language (www.w3.org/TR/reusable-dialog-reqs)).
Accordingly, VoiceXML frameworks for reusable dialog components (server-centric and client-centric) that, for example, satisfy the published W3C reusable dialog component requirements while remaining compliant with the VoiceXML specifications would be a significant component for building voice interfaces and multi-modal applications that seamlessly operate across a plurality of channels.