Many mobile employees spend a considerable amount of time in cars or in other venues where a voice telephone is the only viable means of communication and the only way to access remote information sources. As self-service access to business applications becomes essential to more and more jobs, automated voice access becomes a key requirement. It is estimated that around half of cellular phone calls originate from automobiles. For a large segment of the professional workforce, the mobile phone has opened up hours of weekly commuting time for productive business purposes. Companies that offer telephone access derive a competitive advantage over those that do not. While new mobile computing devices offer remote access, their small visual displays and limited input capabilities often result in a frustrating and tedious experience. For example, the selection of items from a long list or menu is much more efficient by voice, simplifying actions such as finding a name in an address book, selecting a date on a calendar, or finding a note with a specific subject line.
Some employees with disabilities are not able to use visual interface devices, and others cannot use input devices that depend on fine hand control. For these individuals, voice access is more than a competitive advantage; it is a fundamental requirement for doing their jobs. Providing voice access is much more than just voice-enabling a visual interface; it requires a basic redesign of an application for conversational interaction.
A major stumbling block for the voice interface has been the unnatural and difficult-to-understand nature of computer-generated voices. Recent breakthroughs in the use of concatenative text-to-speech technology have eliminated this limitation and resulted in voice quality comparable to human speech. Speech recognition accuracy has also continued to improve, so that millions of people daily use their voice to “dial” phone numbers by saying a person's name, manage their investment portfolios, and access weather information, sports scores and other information. In addition to technology improvements, the steady refinement of conversational dialogue design has resulted in a much more efficient and pleasant user experience than was provided by earlier voice-activated systems. Advances in hardware have also made it possible to deploy automated support for large numbers of simultaneous callers without large capital investments. In particular, the costs of CPU processing power, memory, and telephony interface cards have been falling in line with Moore's Law.
An important piece to fall into place has been the availability of VoiceXML, an open standards-based voice application design protocol that is supported by all major speech technology suppliers. This standard was designed to allow voice applications to run on all enterprise-quality computer hardware and operating system platforms. Companies can be sure that their investment in a VoiceXML application infrastructure won't lock them into a single supplier for critical system components. Voice application development had traditionally required a variety of skills, knowledge and programming techniques, including: specific Interactive Voice Response (IVR) application development environments; interfacing between a specific IVR environment and middleware applications; using speech recognition and speech synthesis technologies; conversational design; and middleware design.
VoiceXML was introduced specifically to eliminate the need for proprietary IVR application design environments, to automatically provide the integration to middleware using the view-and-form based model of Web application design, and to create a standardized interface to speech recognition and speech synthesis technologies. VoiceXML enables voice application servers to integrate voice interface capabilities in the same way that web application servers integrate HTML interface capabilities. These protocols provide a modular application design environment with common components sharable across all access modalities.
It is not just voice technology that has been developing; so too has user interface technology, in the form of web portals. Portals serve as a simple, unified access point to several web applications at once. Portals provide a runtime platform and tools that give a consistent presentation view across multiple pages, navigation control to access applications, and personalized selection and customization of content. IBM WebSphere Portal Server infrastructure accomplishes this by providing functions that: provide access to information across a spectrum of users, devices, and customization options; integrate and automate business processes; and build, connect and manage applications. Pervasive portal offerings are part of a new generation of applications designed to obtain information and execute transactions from a variety of remote access devices. In addition, the portal platform is ideal for supporting both voice and visual access through a common personalization store and shared business logic.
Most existing automated voice solutions have been created using proprietary voice application environments combined with custom interfaces to back-end business logic and data. These custom interfaces are difficult to integrate with traditional GUI Web access solutions. However, IBM WebSphere Voice Application Access (WVAA) combines the modular application design of IBM WebSphere Portal Server with VoiceXML to add voice access to the other modalities supported by WebSphere Portal Server. Because WVAA builds on VoiceXML, the growing community of voice application developers is able to leverage the WVAA platform directly, and platform customers should also be able to choose between leading speech recognition and text-to-speech offerings.
Voice interfaces, such as those provided with WVAA, have significant advantages over pure visual web applications in a portal. Graphical user interfaces (GUIs) tend to have a large amount of text on every screen, which can overwhelm the user. Most people follow spoken dialogs more easily than written instructions. Perhaps the greatest advantage is dialogue focus, which means that prompts lead users through a conversation step by step. On the other hand, in natural conversations people answer even simple questions in a large variety of ways, often outside the scope of the question. For example, they may answer a question and then explain their answer. Designing automated systems to be able to “understand” most of these arbitrary inputs would generally be quite complex and impractical. Consequently, it is important to channel people's spoken input to match the computer's voice recognition strengths.
Voice interfaces designed for telephony access have evolved significantly over the past few years based on the experiences of many application deployments. One of the most important lessons learned is that conversational flow must be efficient, consistent, and intuitive: use confidence scores to avoid confirming every entry; make sure navigational commands are consistent throughout all applications in the portal; and ensure that conversational flow “makes sense” to most users. Prompts must be carefully crafted, short but not ambiguous. It should be clear to most users exactly what to say to the system. Help prompts must be short. Users simply cannot remember much more than one piece of information per prompt. The system should “reveal itself” to users at appropriate times. Context-dependent help can be used when the conversation bogs down, and shortcuts can be offered when things are going well, to help users learn the system incrementally.
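The confidence-score guideline above can be illustrated with a simple three-way policy. This is only a minimal sketch, not the WVAA implementation: the threshold values, the function name next_action, and the action labels are assumptions chosen for the example, and a real deployment would tune the thresholds per application and recognizer.

```python
# Hypothetical thresholds; real values would be tuned per application.
ACCEPT_THRESHOLD = 0.80   # at or above: accept the entry without confirming
REJECT_THRESHOLD = 0.45   # below: discard the entry and re-prompt


def next_action(confidence: float) -> str:
    """Map a recognizer confidence score (0.0 to 1.0) to a dialogue action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"      # high confidence: skip the confirmation step
    if confidence >= REJECT_THRESHOLD:
        return "confirm"     # middling confidence: ask "Did you say ...?"
    return "reprompt"        # low confidence: ask the question again
```

With a policy of this kind, only middling-confidence entries cost the caller an extra confirmation turn, which keeps the conversational flow efficient.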
Another difference between visual and voice interfaces is portal navigation. Visual portal design is based on the concept of presenting a top-level view in a single viewable page, but most users do not care that components on a page are made from different portlets. The navigational problem for visual portlets is finding the page that contains the right application. In order to support a large number of applications, the portal can group similar pages into a page group. These visual concepts are not useful to a voice interface. While there will be some overlap, for example in that major categories may be the same between visual and voice, the navigational menu structure for voice is likely to be quite different for several reasons: there will be some visual-only portlets and some voice-only portlets; applications may be put on a page because they fit well together visually, but a different organization will make more sense in a voice menu; and many voice targets may be implemented as shortcuts rather than normal menu choices in order to keep prompts short. In other words, a voice interface is much more than simply a voice enablement of a visual interface.
The majority of voice applications will be directed dialogue designs, as these are the simplest to create and in many cases the easiest to use. Directed dialogue designs are controlled by the automated system, which offers a specific set of choices. This paradigm, also known as system initiative, is the easiest for users to learn, but for complex applications it can be inefficient and tedious. Mixed initiative dialogue designs allow both the system and the user to take control of the dialogue as appropriate. Because the majority of voice access applications will be directed dialogues, IBM WebSphere Voice Application Access provides a superior application design environment for directed dialogue applications. The emphasis is on tools that facilitate iterative implementation, debugging and design enhancement, using best practices for conversational dialogues.
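The system-initiative paradigm described above can be sketched as a loop in which the system asks each question in a fixed order and accepts only the choices it offered. This is an illustrative sketch under stated assumptions, not how WVAA or VoiceXML implements dialogues: the function name run_directed_dialogue and its data shapes are hypothetical, and caller utterances are assumed to arrive already recognized as text.

```python
def run_directed_dialogue(steps, answers):
    """Walk a fixed sequence of prompts, accepting only the offered choices.

    steps: list of (prompt, allowed_choices) pairs, asked in order.
    answers: iterable of caller utterances, already recognized as text.
    Returns a list of (prompt, accepted_choice) pairs.
    """
    answers = iter(answers)
    results = []
    for prompt, choices in steps:
        for utterance in answers:
            if utterance in choices:
                results.append((prompt, utterance))
                break
            # Out-of-scope input: the system keeps the initiative and re-asks.
        else:
            # The caller stopped answering before giving a valid choice.
            raise ValueError(f"no valid answer for prompt: {prompt!r}")
    return results
```

For example, with the steps "Say email or calendar." and "Say get email or compose.", the utterances "um, the weather", "email", "compose" yield the accepted choices "email" and "compose"; the out-of-scope first utterance simply causes the first prompt to be repeated.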
FIG. 1 shows a Web browser's rendering of an example web portal 20 for two portlets 24 and 26 on a page 1 of the portal. The web aggregator has rendered the title banner and the menu of pages on the left-hand side, provided a title bar (the ‘skin’) for each portlet, and asked each portlet to render itself. Pages 2 and 3 are not selected and are shown grayed out. Portlet 24 is an email portlet; menu items 24A, 24B and 24C are ‘get email’, ‘compose’ and ‘move to folder’ respectively. Portlet 26 is a calendar portlet; menu items 26A, 26B and 26C are ‘new entry’, ‘new web conference’ and ‘edit’ respectively. Other menu items can be seen in the figure but are not referenced. Functional and sample voice application portlets are included in the system installation or are available by download. An example of the key functional portlets is Lotus Notes R5 access to e-mail and calendar. Other portlets could be Lotus Notes R5 access to contact information, and Microsoft Exchange 2000 access.
FIG. 2 shows the hierarchy of elements for the example web portal 20 having pages 1, 2, and 3 in FIG. 1. Page 1 comprises two portlets 24 and 26. Both portlets comprise at least three menu items: 24A, 24B, 24C and 26A, 26B, 26C respectively. Further portlets 27 and 28 and corresponding menu items are not elaborated on but could exist in many combinations. Existing voice aggregation of the web portal would generate a voice portal menu that follows the same hierarchy. For instance, the telephone user is given a first choice of portal pages 1, 2, or 3. After, for example, page 1 is chosen, the user is given a second choice of portlet 21, 22, or 23. After, for example, portlet 21 is chosen, the user is given a third choice of which menu item in the portlet, e.g. 26A, 26B or 26C. In a voice environment, having three consecutive groups of menu choices to get to the right menu item can be tedious.
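The three-level hierarchy described above, and the cost of walking it by voice, can be modeled with a small nested structure. This is a hypothetical sketch for illustration only: the dictionary layout and the function names hierarchical_path and flat_shortcuts are assumptions, only the menu items named in the text are filled in, and the flat mapping stands in for the shortcut idea mentioned earlier rather than for any specific aggregation method.

```python
# Hypothetical model of the FIG. 2 hierarchy: page -> portlet -> menu items.
# Only the items named in the text are filled in; other entries are omitted.
PORTAL = {
    "page 1": {
        "email": ["get email", "compose", "move to folder"],
        "calendar": ["new entry", "new web conference", "edit"],
    },
    "page 2": {},
    "page 3": {},
}


def hierarchical_path(portal, target):
    """Return the sequence of menu choices a caller must make to reach target."""
    for page, portlets in portal.items():
        for portlet, items in portlets.items():
            if target in items:
                return [page, portlet, target]
    return None


def flat_shortcuts(portal):
    """Map every menu item directly to its portlet, so one utterance reaches it."""
    return {
        item: portlet
        for portlets in portal.values()
        for portlet, items in portlets.items()
        for item in items
    }
```

Here hierarchical_path(PORTAL, "compose") returns a three-element path (page, portlet, menu item), one spoken choice per level, whereas a lookup in flat_shortcuts(PORTAL) reaches the same item in a single step. A flat mapping like this assumes menu item names are unique across portlets.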