1. Field of the Invention
The present invention relates to a voice browser apparatus for processing documents written in a predetermined markup language by voice interaction, a method therefor, and a program therefor.
2. Related Background Art
Conventionally, access has been made to Web contents by means of a browser using the graphical user interface (GUI). Recently, voice browsers for making access to Web contents by means of voice interaction have come into use for the purpose of making access via telephones, and so on.
In the voice browser, Web contents are voice-outputted. For voice output, there are cases where contents written in text are converted into voices through voice synthesis and are outputted, and cases where contents prepared as voice data through recording are played back and outputted. This voice output is equivalent to display of pages in the browser in the graphical user interface.
In the browser in the graphical user interface, movement to next contents and input in a form are performed through mouse operation and keyboard entry, but in the voice browser, they are done through voice input. That is, a user's voice input is voice-recognized, and the recognition result is used to perform movement to next contents and input in the form.
One method is to write contents for voice browsers in a dedicated markup language. With this method, however, a browser using the graphical user interface cannot access those contents, and the voice browser cannot access the large body of contents that already exists for the graphical user interface. Thus, there is another method in which HTML, the markup language used by browsers of the graphical user interface, is also used in the voice browser.
In this method, the voice output contents and the voice input candidates, namely the voice recognition vocabulary and the processing to be performed in response to each input, are determined from the contents written in HTML according to a specific rule. For example, there is a voice browser apparatus that uses the rules described below.
First, the output contents shall be the text ranging from the head to the end of the HTML document to be browsed. However, if the URL indicates a midpoint in the HTML document, the output contents shall start from that point, and if there is an <HR> tag at some midpoint, the output contents shall end at that tag. The input candidates shall be the anchors in the same range (the text surrounded by <A> tags). When a word included in the input candidates is inputted, the target to which it is linked becomes the new object of browsing, and similar processing is performed on it.
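The rule above can be illustrated with a minimal sketch. It is not the implementation of any actual apparatus. The names `RangeExtractor`, `browse_range`, and `SAMPLE_HTML` are hypothetical, the sample document is invented for illustration, and the sketch assumes that a midpoint in the document is marked by an <A> tag carrying a `name` attribute.

```python
from html.parser import HTMLParser

# Hypothetical sample document, invented for illustration only.
SAMPLE_HTML = """<html><body>
<p>Please select a genre of shops from the following.</p>
<a href="#french">French</a>.
<a href="#italian">Italian</a>.
<hr>
<a name="french"></a>
<p>Please select a shop.</p>
<a href="shop1.html">Shop A</a>.
<a href="shop2.html">Shop B</a>.
<hr>
<a name="italian"></a>
<p>Please select a shop.</p>
<a href="shop3.html">Shop C</a>.
<a href="shop4.html">Shop D</a>.
</body></html>"""


class RangeExtractor(HTMLParser):
    """Collects output text and anchor candidates inside the browsing range."""

    def __init__(self, fragment=None):
        super().__init__()
        self.fragment = fragment
        self.active = fragment is None  # no fragment: start at the head
        self.done = False               # set once the closing <HR> is seen
        self.text_parts = []
        self.candidates = {}            # anchor text -> href
        self._href = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        # Assumption: the midpoint is an <A name="..."> matching the fragment.
        if tag == "a" and not self.active and attrs.get("name") == self.fragment:
            self.active = True
            return
        if not self.active:
            return
        if tag == "hr":
            self.done = True            # the range ends at the first <HR>
        elif tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            if text:
                self.candidates[text] = self._href
            self._href = None

    def handle_data(self, data):
        if self.active and not self.done:
            self.text_parts.append(data)
            if self._href is not None:
                self._link_text.append(data)


def browse_range(html, fragment=None):
    """Return (text to be spoken, {input candidate: link target})."""
    parser = RangeExtractor(fragment)
    parser.feed(html)
    parser.close()
    spoken = " ".join("".join(parser.text_parts).split())
    return spoken, parser.candidates
```

Under these assumptions, `browse_range(SAMPLE_HTML)` yields the genre prompt with the candidates "French" and "Italian", while `browse_range(SAMPLE_HTML, "italian")` yields only the range that starts at the `italian` anchor.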
For example, the case where the HTML document shown in FIG. 4 is targeted will be discussed. Assume that the URL of this HTML document is “http://guide/index.html”. First, the voice browser outputs “Please select a genre of shops from the following. French. Italian.” by voice, and waits for the user's input. When the user inputs “Italian” by voice, for example, the voice browser performs similar processing from the position in the HTML document indicated by “http://guide/index.html#italian”. In other words, it outputs “Please select a shop. ∇∇. □□.”, and waits for the user's input. When the user inputs “∇∇”, for example, it obtains the HTML document of “http://guide/shop3.html” and carries out similar processing.
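The transition from a selected anchor to the next object of browsing can be sketched as follows. This is only an illustration using Python's standard `urllib.parse` module. The function name `next_target` is hypothetical, and the relative link values `#italian` and `shop3.html` are assumed from the URLs given in the example, since the markup of FIG. 4 is not reproduced here.

```python
from urllib.parse import urljoin, urldefrag


def next_target(current_url, href):
    """Resolve a selected link against the current URL.

    Returns (document URL, fragment), where fragment is None when the
    link does not point at a midpoint of the document.
    """
    absolute = urljoin(current_url, href)
    doc_url, fragment = urldefrag(absolute)
    return doc_url, (fragment or None)


# The user says "Italian": same document, starting at the "italian" midpoint.
print(next_target("http://guide/index.html", "#italian"))

# The user then selects a shop: a different document, read from its head.
print(next_target("http://guide/index.html", "shop3.html"))
```

A non-empty fragment means that output starts from the midpoint of the document, and an empty one means that output starts from the head of the newly obtained document.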
However, in the conventional apparatus described above, contents must be written in accordance with a specific rule, which has the disadvantage of reducing flexibility when the same contents are also created for the graphical user interface.