1. Field of the Invention
The present invention is related to the field of communications networks, and, more particularly, to network browsers used in communications networks.
2. Description of the Related Art
It is expected that in the not-too-distant future personal computers and other small computing devices, such as smart phones and personal digital assistants (PDAs), will be able to run Web applications that are configured to support multimodal interactions. As is well known, a Web application, generally, is a software program that effects the delivery of data from a Web site to a client, the Web site serving as the front-end of the particular application, and client-supplied data inducing the logical operations by which the Web site transfers data or information to the client. A Web application configured to support multimodal interactions is typically referred to as a multimodal Web application.
With a multimodal application, a user can interact with the application through different modes, a particular mode characterizing the particular form of input and/or output related to the application. The different modes of user interaction with the application, including voice and visual modes, can be used interchangeably during the same interaction. Thus, with a multimodal Web application, the user is able to interact with the application using, for example, human speech as well as the more traditional graphical user interface (GUI).
The form factor of smart phones, PDAs, and other small computing devices makes data entry via a keypad more difficult than the entry of data via human speech. Nonetheless, there are situations, such as when the device is being utilized in a noisy environment, in which it is helpful to interact via a non-voice mode, such as with a keypad, stylus, and/or GUI. It is not surprising, therefore, that the option of interacting with an application via multiple modes interchangeably, as afforded by a multimodal Web application, makes the use of multimodal Web applications especially advantageous in the context of smart phones, PDAs, and other relatively small computing devices.
Nonetheless, the relatively small size of such computing devices, however desirable in other respects, can make the local running of different modes of a multimodal Web application on such devices problematic. The relatively small size of a smart phone, PDA, or similar computing device can constrain the amount of processing and/or memory resources that can be accommodated in such a device. Accordingly, such resource-constrained devices are often incapable of supporting more than one mode of interaction running locally on the device itself. The rendering of data in the form of speech, for example, is a mode that requires processing resources exceeding those of many types of resource-constrained devices. Thus, in running a multimodal Web application on a small device, it may be that the visual mode can be successfully run locally on the device, yet it may be necessary to run the speech rendering component of the same application on a remote device, namely, the server distributing the application. Alternatively, a relatively rudimentary speech rendering component residing on the small device can provide basic speech recognition input, while a more robust, albeit remotely located, speech rendering component can be used to process more complex speech input.
In the present context, it is worth recalling that speech processing typically comprises two components: speech recognition and speech synthesis. Speech recognition encompasses automatically recognizing speech input by comparing speech segments to a grammar, or grammars. Speech synthesis encompasses text-to-speech (TTS) processing by which synthetic speech output is rendered based upon input in the form of text. Even though a complete voice-mode application typically requires both automatic speech recognition and TTS processing, each can be performed at a different location. In the specific context of a hand-held or other small device, the device may have sufficient resources to locally process TTS, but may not have sufficient resources to perform speech recognition locally.
Additionally, even though there may be devices that can process both TTS and speech recognition locally, a particular Web application may require that speech recognition be performed on the Web server. This latter situation may arise when the relatively greater resources of the Web server are needed to support a large grammar, or multiple grammars, as well as to perform post-processing functions related to a natural language understanding (NLU) model.
Notwithstanding these different scenarios, conventional multimodal Web browsers lack a dynamic capacity for switching among alternate speech configurations depending on the resources of the particular host that is tasked with running the application and/or depending on the nature of the particular application itself. That is, conventional multimodal Web browsers lack a dynamic capacity to switch, for example, from a speech configuration in which the speech recognition and TTS processing are performed locally to one in which both are performed remotely at the server or in which the TTS processing is performed locally while speech recognition is performed remotely.
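By way of illustration only, the alternate speech configurations discussed above can be sketched in Python as follows. The sketch does not describe any existing browser; the names `Location`, `SpeechConfig`, and `choose_speech_config`, together with the simple selection rule, are hypothetical and are introduced solely to make the scenarios concrete.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    """Where a speech component runs (hypothetical names)."""
    LOCAL = "local"    # on the client device
    REMOTE = "remote"  # on the server


@dataclass(frozen=True)
class SpeechConfig:
    """One speech configuration: a location for each component."""
    recognition: Location
    tts: Location


def choose_speech_config(device_can_recognize: bool,
                         device_can_synthesize: bool,
                         app_requires_server_recognition: bool) -> SpeechConfig:
    """Select a configuration from host resources and application needs.

    Assumed rule: recognition runs locally only if the device is capable
    of it and the application does not require server-side recognition;
    TTS runs locally whenever the device is capable of it.
    """
    if device_can_recognize and not app_requires_server_recognition:
        recognition = Location.LOCAL
    else:
        recognition = Location.REMOTE
    tts = Location.LOCAL if device_can_synthesize else Location.REMOTE
    return SpeechConfig(recognition=recognition, tts=tts)
```

Under these assumptions, a device capable of TTS but not of speech recognition would be assigned local TTS and remote recognition, and a fully capable device running an application that requires a large server-side grammar would be assigned the same configuration.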