Currently, the following are available as markup language specifications for describing a speech user interface (to be referred to as a “speech UI” hereinafter):
(1) VoiceXML (see http://www.w3.org/TR/voicexml20/)
(2) SALT (see http://www.saltforum.org/)
(3) XHTML+Voice (see http://www.w3.org/TR/xhtml+voice/)
Having a browser read in contents written in accordance with such a specification makes it possible to implement a speech UI between a user and a device (or service).
In general, an author (content creator) creates these speech UI contents by using a dedicated authoring tool (see, e.g., Japanese Patent No. 3279684 and Japanese Patent Laid-Open No. 09-114623).
In order to implement a speech UI, a speech recognition technique for recognizing speech is required. Speech recognition is a process for selecting, from among the word sequences that satisfy designated language constraints, the one nearest to the utterance, by using acoustic statistics of human speech called an acoustic model. The language constraints are also called a speech recognition grammar.
For a general speech recognition grammar, such as one for recognizing “Yes” or “No”, an existing grammar can be used. However, the author needs to create grammars specialized for other applications. W3C has worked on standardizing such speech recognition grammars, and the result is now a Recommendation titled “Speech Recognition Grammar Specification Version 1.0” (to be referred to as “SRGS” hereinafter). The SRGS specification is disclosed at http://www.w3.org/TR/speech-grammar/. FIGS. 3 and 4 show an example of a speech recognition grammar written in SRGS.
Standardization of the specification “Semantic Interpretation for Speech Recognition” (to be referred to as “SISR” hereinafter) is also in progress. This is a specification for specifying the semantic structure of a speech recognition result. Using this specification makes it possible to extract the semantic information contained in an utterance as a speech recognition result. Referring to FIG. 3, reference numeral 302 denotes an example of an SISR semantic structure generating rule. As in this case, the semantic structure generating rule is described between <tag> and </tag> in SRGS, or in a “tag” attribute. Note that the SISR specification is disclosed at http://www.w3.org/TR/semantic-interpretation/.
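The placement of semantic structure generating rules inside <tag> elements can be illustrated with a minimal sketch. The grammar below is a hypothetical one written for this illustration and is not the grammar of FIGS. 3 and 4; the rule name “number” and the tag contents are assumptions, and the exact tag syntax depends on the SISR version in use. The sketch only locates each <tag> element and pairs it with the words of its enclosing <item>; it does not evaluate the rules.

```python
import xml.etree.ElementTree as ET

# XML namespace used by SRGS grammars in XML form.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar for illustration only (not the one in FIGS. 3 and 4).
GRAMMAR = """
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="number" tag-format="semantics/1.0">
  <rule id="number">
    <one-of>
      <item>one <tag>out = 1;</tag></item>
      <item>two <tag>out = 2;</tag></item>
      <item>three <tag>out = 3;</tag></item>
    </one-of>
  </rule>
</grammar>
"""


def semantic_rules(grammar_xml):
    """Return (spoken words, tag content) pairs found in the grammar."""
    root = ET.fromstring(grammar_xml.strip())
    rules = []
    for item in root.iter(f"{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        if tag is not None:
            words = (item.text or "").strip()   # text preceding the <tag>
            rule = (tag.text or "").strip()     # semantic structure generating rule
            rules.append((words, rule))
    return rules


if __name__ == "__main__":
    for words, rule in semantic_rules(GRAMMAR):
        print(f"{words!r}: {rule}")
```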
Consider, for example, a case wherein the utterance “I would like a coca cola and three large pizzas with pepperoni and mushrooms.” is made in speech recognition processing using the speech recognition grammar shown in FIGS. 3 and 4. As a result, structured data like that shown in FIG. 5 is generated. In this specification, a data structure (501) generated in accordance with a user input is called a “semantic structure”, and each piece of data (502) constituting the semantic structure is called a “semantic structural element”. In general, an application which receives a recognition result can use such a semantic structure more easily than the character string “I would like a coca cola and three large pizzas with pepperoni and mushrooms.” received as a recognition result.
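As a rough illustration of why a semantic structure is easier for an application to use than the raw character string, the following sketch models the structure as nested Python dictionaries. FIG. 5 is not reproduced here, so the layout is an assumption: only the “pizza” and “number” element names appear in this text, while “drink”, “liquid”, “pizzasize” and “topping” are hypothetical names chosen for the example.

```python
# Raw recognition result as a character string.
utterance = ("I would like a coca cola and three large pizzas "
             "with pepperoni and mushrooms.")

# Assumed semantic structure for the same utterance. Element names other
# than "pizza" and "number" are hypothetical.
semantic_structure = {
    "drink": {"liquid": "coke"},
    "pizza": {
        "number": 3,
        "pizzasize": "large",
        "topping": ["pepperoni", "mushrooms"],
    },
}


def pizza_summary(structure):
    """Build an order line directly from semantic structural elements."""
    pizza = structure["pizza"]
    toppings = " and ".join(pizza["topping"])
    return "{} {} pizza(s) with {}".format(
        pizza["number"], pizza["pizzasize"], toppings)


if __name__ == "__main__":
    # The application reads fields directly and never parses the sentence.
    print(pizza_summary(semantic_structure))
```

With the character string alone, the application would have to work out for itself that “three” is a quantity and that it applies to the pizzas rather than the drink.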
FIG. 6A shows an example of a speech recognition application window before data input. This application is designed for ordering pizzas by speech or GUI input. A user may fill in each form by GUI input or may make the utterance “I would like a coca cola and three large pizzas with pepperoni and mushrooms.” after clicking a speech input button 602. When the above utterance is made, each form is automatically filled with data, as shown in FIG. 6B.
Such a speech UI is generally created by using a UI authoring tool. FIG. 7 shows an example of a UI authoring tool window. Many general UI authoring tools provide, for example, a form palette 702 and a GUI window 703 that is being edited. An application author creates a GUI window by dragging and dropping desired form controls from the form palette onto the GUI window.
In order for the value of each form control to be updated in accordance with the user's utterance, as indicated by a window 603 in FIG. 6B, the application author needs to bind each form to a semantic structural element of a speech recognition result. For example, the application author must bind the data 502 (the number of pizzas) in the semantic structure of the speech recognition result to a form 704 in which the number of pizzas is stored. When each form or object is to be bound to a semantic structural element of a speech recognition result in this manner, the simplest implementation is a UI like that shown in FIG. 8. That is, a semantic structure bind dialog 801 is presented to the author, who enters, by text input, a speech recognition grammar name (802) and a path (803) to a specific structural element generated by speech recognition. In this case, a path to such a semantic structural element is called a “semantic structural path”. The “/” written in the semantic structural path represents a parent-child relationship. Therefore, “/pizza/number” represents the “number” element that is a child of the “pizza” element, i.e., the data 502.
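The resolution of a semantic structural path and its use in a binding can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular authoring tool: the semantic structure is modeled as nested dictionaries, the function names resolve_path and fill_forms and the form name “pizza_count” are hypothetical, and the speech recognition grammar name (802) is omitted for brevity.

```python
def resolve_path(structure, path):
    """Follow a semantic structural path such as "/pizza/number".

    Each "/" descends from a parent element to one of its children.
    """
    node = structure
    for name in path.strip("/").split("/"):
        if not isinstance(node, dict) or name not in node:
            raise KeyError(f"no element {name!r} on path {path!r}")
        node = node[name]
    return node


def fill_forms(structure, bindings):
    """Return the value for each form from its bound path."""
    return {form: resolve_path(structure, path)
            for form, path in bindings.items()}


# Assumed semantic structure; only "/pizza/number" comes from the text.
semantic_structure = {"pizza": {"number": 3}}

# Hypothetical binding typed in by the author: form name -> path.
bindings = {"pizza_count": "/pizza/number"}

if __name__ == "__main__":
    print(fill_forms(semantic_structure, bindings))
```

In this sketch a mistyped path such as “/pizza/numbr” is only detected when the path is resolved, which is one way in which free-text entry of paths burdens the author.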
As shown in FIG. 8, letting the author enter a semantic structural path of a speech recognition result by text input makes it possible to bind each form control (or object) to a semantic structural element of the speech recognition result.
Such text input imposes a load on the author. It is therefore required to reduce such load on the author.