1. Field of the Invention
The present invention relates to an audio control method for controlling a data processor with a group of audio commands, in which method information is presented on the display device of the data processor, and at least one control field, to which a predetermined function is assigned, is formed on the display device. The present invention also relates to a device controlled with audio commands, which comprises an audio recognition device, a data processor, a display device for presenting information, means for forming at least one control field on the display device, and means for assigning a predetermined function to said control field.
2. Brief Description of Related Developments
Generally, the purpose of audio control is to facilitate the use of various devices. Such audio control applications include, for example, different devices controlled with speech. Speech control applications are developed, for instance, for computers and telecommunication terminals, such as mobile phones and landline network telephones. With speech control, the user can control a computer by uttering different command words aloud, wherein the user does not have to use the keyboard of the computer for entering these commands. In a speech controlled telecommunication terminal, the user can select the telephone number by saying it aloud, typically one number at a time.
Instructing the computer with voice commands rather than with a pointing device such as a mouse also has significant benefits, especially in small, communicator-type terminal devices such as the Nokia 9110 Communicator, where the keyboard and/or the pointing device may not be as convenient to use as the larger keyboard and/or pointing device of, e.g., a desktop computer.
These speech recognizers are generally based on fixed vocabulary speech recognition or phoneme-based speech recognition. In fixed vocabulary speech recognition, the device tries to select from a specified vocabulary the word which best corresponds to the word uttered by the user. It is also possible to implement such speech controlled devices in such a way that the user can teach the device the command words with his/her own voice, wherein the device best recognizes the words uttered by that user. The purpose of the speech control methods based on phoneme recognition is to recognize phonemes uttered by the user and to form words on the basis of these phonemes. Such devices based on phoneme recognition are, however, more complex and more expensive than fixed vocabulary speech recognition devices. Furthermore, especially in noisy circumstances, recognition is not as reliable with speech recognition devices based on phonetic recognition as with fixed vocabulary recognizers.
For implementing speech control in devices, in which it is necessary to use only a few command words or numbers, such fixed vocabulary recognizers are well suited. Nevertheless, the aim has been to accomplish speech control also in devices, during the use of which it might be necessary to utter a variety of command words, the command words varying in different situations. For example, when utilizing the Internet data network, it is possible to set up a connection by using several different addresses. Such a browser program for the Internet data network, so-called www browser (world wide web), has recently become common in computers. It is even possible to equip mobile telecommunication devices with such a www browser program, for examining data accessible via the Internet data network. Such a www browser program contains certain standard functions which are largely similar, irrespective of where the data is retrieved from. However, the data retrieved from the Internet data network, for example HTML pages (HyperText Markup Language), may contain active locations, for example links to other Internet addresses, e.g. URL (Uniform Resource Locator), option buttons etc. with varying names, contents and references. In practice, the recognition of these variable names is not possible when using recognition methods and devices of prior art, based on fixed vocabulary speech recognition. On the other hand, especially links can be composed of very long character strings, which the user has to be able to define accurately without misspellings when s/he wants to move to the location indicated by the link. Thus, the speech recognition methods and devices of prior art, based on phonetic recognition are not sufficiently reliable for implementing practicable speech recognition in said browser applications.
Using voice control for www browsing has the difficulty that the links are often long and complex, frequently containing numbers and other non-text symbols. This makes them unnatural for a user to say in voice controlled browsing. This problem was solved earlier by instructing the user to say the name of the link (e.g., "Microsoft" for "www.microsoft.com", or "Nokia" for "www.nokia.com", ...), and then the technique known as speech recognition from text (SRFT) can be used to find the closest match of the input utterance to the currently displayed web links.
The SRFT method creates speech recognition models based on text input. From each text entity, an acoustic model representing the spoken equivalent of that text entity is created. The acoustic models are then used to recognize which of the alternative text entities, if any, is uttered. The SRFT method relies on knowing (or creating) the phonetic structure of the links, making it possible to identify how the user should utter each link name.
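The matching step described above can be illustrated with a minimal sketch. It is not an SRFT implementation: it assumes the utterance has already been turned into text and substitutes plain string similarity (Python's `difflib`) for acoustic models. The function names `spoken_name` and `closest_link`, the rule for deriving a link name from a URL, and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib


def spoken_name(url: str) -> str:
    """Derive a sayable name from a URL (hypothetical rule, not from the source).

    Drops a leading 'www' label and the final domain label, so that
    'www.nokia.com' becomes 'nokia'.
    """
    host = url.lower().split("/")[0]
    labels = host.split(".")
    if labels[0] == "www":
        labels = labels[1:]
    if len(labels) > 1:
        labels = labels[:-1]
    return " ".join(labels)


def closest_link(utterance_text, urls, cutoff=0.6):
    """Return the displayed URL whose name is closest to the utterance, or None.

    String similarity stands in for acoustic matching here; a real
    recognizer would compare the utterance against acoustic models.
    """
    names = {spoken_name(u): u for u in urls}
    hits = difflib.get_close_matches(
        utterance_text.lower(), list(names), n=1, cutoff=cutoff
    )
    return names[hits[0]] if hits else None


displayed = ["www.microsoft.com", "www.nokia.com"]
print(closest_link("Nokia", displayed))
```

Returning `None` when nothing scores above the cutoff mirrors the "if any" case in the text, where the user utters none of the alternatives.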
Wireless Application Protocol (WAP) is a system architecture specifically designed for use in low-bandwidth environments using terminals with varying, often limited, capabilities. Not all terminals are able to display images, for example. A central object of WAP is the WAP gateway (WAP gw), through which all of the traffic between communicating parties (e.g. the terminal and a content server) flows.
The WAP is capable of displaying normal HTML files to the user by converting the HTML to Wireless Markup Language (WML), which is a markup language specifically designed for WAP, in the WAP gateway. Of course the WML can be used independently from the HTML.
Because a small terminal, such as a portable phone, usually cannot display images, it is necessary to offer a textual replacement for an image. This can be done by using the ALT attribute of the image in the HTML, if one exists (e.g. <a href="main.html"><img border=0 src="img00253.gif" ALT="Jack's photo"></a>). The ALT attribute of the IMG tag will be displayed when the pointing device is placed on top of the image containing the link. If an image is used as a link, a text tag, very similar to a voice tag, must be created to be used as the link name if no ALT attribute (or equivalent) directive exists.
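The ALT-based replacement described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the class name `LinkNameParser`, the helper `link_names`, and the choice of "[IMAGE]" as the placeholder for links with neither text nor an ALT attribute are not taken from any particular browser or gateway.

```python
from html.parser import HTMLParser


class LinkNameParser(HTMLParser):
    """Collect (href, name) pairs, using ALT text for image-only links."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, name) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element
        self._alt = None   # ALT text of an image inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases attribute names
        if tag == "a" and "href" in attrs:
            self._href, self._text, self._alt = attrs["href"], [], None
        elif tag == "img" and self._href is not None:
            self._alt = attrs.get("alt") or self._alt

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            # Prefer visible text, then ALT text, then an assumed placeholder.
            self.links.append((self._href, text or self._alt or "[IMAGE]"))
            self._href = None


def link_names(html):
    """Return the (href, name) pairs for all links in an HTML fragment."""
    parser = LinkNameParser()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="main.html"><img border=0 src="img00253.gif" ALT="Jack\'s photo"></a>'
print(link_names(page))
```

When the image carries no ALT attribute, the sketch falls back to the placeholder, which is exactly the uninformative case that a generated text tag is meant to replace.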
When terminals with text and voice I/O are used for www browsing, for example in a WAP environment, it is impossible for the user to distinguish between different pictures which are used as hyperlinks (i.e. '<a href=foo.html><img src=linkpic.jpg></a>' type of links), since it is impossible to tell what the picture would convey to the user. Thus, it is very difficult to make a voice tag out of it, and the link name would be either the actual URL the link points to, or something very uninformative like '[IMAGE]'. The fact that the name of the image usually provides little information does not make it any easier. Too often the target URL is useless as well, since the target page may be accessed through a common gateway interface (cgi), which can have multiple arguments, or the URL contains multiple random digits and letters, which are difficult to speak and provide no information about the page the link points to. The common gateway interface means computer programs running on a web server that can be invoked from a www page at the browser.
There is also a possibility that the user of the www browser selects a page which contains multiple links with the same link name (i.e. numerous 'click <a href=foo.html>here</a> for info' type of links). In this case it is impossible to use the word 'here' (or whatever is the conflicting word or phrase) as a voice tag.
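The conflict described above can be detected before any voice tags are created. The following is a minimal sketch, assuming links are available as (href, name) pairs and that names are compared case-insensitively; the function name `ambiguous_names` is hypothetical.

```python
from collections import Counter


def ambiguous_names(links):
    """Return the link names that occur more than once on a page.

    `links` is a list of (href, name) pairs. Names are compared after
    trimming whitespace and lowercasing (an assumption of this sketch).
    """
    counts = Counter(name.strip().lower() for _, name in links)
    return {name for name, n in counts.items() if n > 1}


links = [("foo.html", "here"), ("bar.html", "Here"), ("news.html", "News")]
print(ambiguous_names(links))
```

Any name in the returned set cannot serve as a voice tag on its own, because uttering it would not identify a single link.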
German publication DE-4440598 discloses a speech controlled hypertext navigation system. The aim of the system presented in the publication is to use the content of a hypertext document retrieved into the computer, such as an HTML page containing links, to define the possible phonetic form of the links included in it. When the user utters a link, the recognizer compares the phonetic forms produced of these links to the speech of the user, in order to find out which link the user uttered. Thus, the recognition is based on phonetic recognition. A drawback in the system presented in this publication is, for instance, that an HTML page can contain several links with nearly identical content, wherein it can be difficult or even impossible to distinguish them from each other. Moreover, the links can be long character strings, which complicates the recognition.
U.S. Pat. No. 5,465,378 discloses a report generating system. The system is based on report material which is stored in a computer and can contain text and images, and on command words connected to this material. The speech recognition device tries to recognize the command words uttered by the user and to retrieve from the memory the material corresponding to these command words, to generate a report. Also here the problem is that certain command words are linked with a particular function, wherein for introducing new functions, the recognition device has to be trained to recognize these new functions.
The above-mentioned inventions provide neither a user-friendly nor an informative tag if the link name is difficult to pronounce, or if the link is an image and the terminal is unable to display such an image.
One purpose of the present invention is to produce an audio recognition method and a device in which fixed vocabulary audio recognition, such as speech recognition, can also be used in a situation where control commands can vary. An audio recognition method according to the present invention is characterized in that in the method, one audio command from said group of audio commands is assigned to said control field, and the audio command assigned to said control field is presented on the display device, wherein when the user gives an audio command assigned to the control field, the audio command is recognized and the function corresponding to the audio command is conducted. An audio controlled device according to the present invention is characterized in that the device also comprises means for assigning an audio command to said control field, means for presenting the audio command assigned to the control field on the display device, means for recognizing the audio command, and means for conducting the function corresponding to the recognized audio command. The invention is based on the idea that part of the voice storage, such as the vocabulary, of a fixed vocabulary speech recognizer is reserved for certain standard commands, while the other commands in the vocabulary can be set for addressing variable control functions. Hereinbelow in this description, the invention will be primarily illustrated with examples relating to speech control, but it is obvious that other sounds can also be used in audio control. Examples of such audio signals include different clapping and knocking sounds.
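The idea of splitting a fixed vocabulary into standard commands and assignable commands can be illustrated with a minimal sketch. The vocabulary contents, the function names, and the bracketed label format are all assumptions chosen for illustration, not details taken from this description.

```python
# Hypothetical fixed vocabulary: one part is reserved for standard
# functions, the rest can be assigned to whatever control fields
# (for example links) the current page happens to contain.
STANDARD_COMMANDS = {"back", "home", "reload"}
ASSIGNABLE_WORDS = ["one", "two", "three", "four", "five"]


def assign_commands(control_fields, words=ASSIGNABLE_WORDS):
    """Assign one assignable vocabulary word to each control field."""
    if len(control_fields) > len(words):
        raise ValueError("more control fields than assignable words")
    return dict(zip(words, control_fields))


def display_labels(assignments):
    """Build the text shown on the display so the user knows what to say."""
    return [f"[{word}] {field}" for word, field in assignments.items()]


def dispatch(recognized_word, assignments):
    """Map a recognized word to a standard function or an assigned field."""
    if recognized_word in STANDARD_COMMANDS:
        return ("standard", recognized_word)
    if recognized_word in assignments:
        return ("follow", assignments[recognized_word])
    return None  # word is not active on this page


assignments = assign_commands(["News", "Weather"])
print(display_labels(assignments))
print(dispatch("two", assignments))
```

Because the recognizer only ever has to distinguish words from its own fixed vocabulary, the names, length, or pronounceability of the underlying links do not affect recognition in this sketch.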
Considerable advantages are achieved with the present invention compared with audio control systems of prior art, such as speech control methods and devices. With the method according to the invention, it is possible to implement control functions with a more advantageous fixed vocabulary speech recognizer also in a variable environment, without having to teach the new words to the speech recognition device. When using a speech recognition device according to the invention, the number of the commands to be selected at a time can be varied by joining several command words one after the other to select a particular function.
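Joining command words one after the other enlarges the set of selectable commands without enlarging the vocabulary. A minimal sketch of one way to enumerate such joined commands follows; the function name `command_sequences` and the policy of using the shortest sequence length that yields enough distinct commands are assumptions of this example.

```python
from itertools import product


def command_sequences(words, count):
    """Return `count` distinct commands built by joining vocabulary words.

    Uses the shortest sequence length for which len(words) ** length
    is at least `count` (an assumed policy, chosen for brevity of speech).
    """
    if count > len(words) and len(words) < 2:
        raise ValueError("vocabulary too small to build distinct sequences")
    length = 1
    while len(words) ** length < count:
        length += 1
    sequences = product(words, repeat=length)
    return [" ".join(seq) for _, seq in zip(range(count), sequences)]


print(command_sequences(["one", "two", "three"], 5))
```

With three words, single-word commands cover at most three functions, while two-word sequences cover up to nine, so five functions require sequences of length two.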
Using this invention, it is possible to generate meaningful tags even if the link is an image, is ambiguous, or has a name and a URL that are both difficult to pronounce. This invention also allows more powerful link name generation for voice-only www browsers.