1. Field of the Invention
The present invention generally relates to a multi-modal audio-visual content renderer and, more particularly, to a multi-modal content renderer that simultaneously renders content visually and verbally in a synchronized manner.
2. Background Description
In the current art, content renderers (e.g., Web browsers) do not directly synchronize audio and visual presentation of related material and, in most cases, they are exclusive of each other. The presentation of HyperText Markup Language (HTML) encoded content on a standard browser (e.g., Netscape or Internet Explorer) is primarily visual. The rate and method of progression through the presentation are under user control. The user may read the entire content from beginning to end, scrolling as necessary if the rendered content is scrollable (that is, the visual content extends beyond the bounds of the presentation window). The user may also sample or scan the content and read, for example, only the beginning and end. Fundamentally, all of the strategies available for perusing a book, newspaper, or other printed item are available to the user of a standard browser.
Presentation of audio content tends to be much more linear. Normal conversational spoken content progresses from a beginning, through a middle, and to an end; the user has no direct control over this progression. This can be overcome to some degree on recorded media via indexing and fast searching, but the same ease of random access available with printed material is difficult to achieve. Voice controlled browsers are typically concerned with voice control of browser input or various methods of audibly distinguishing an HTML link during audible output. Known prior art browsers are not concerned with general synchronization issues between the audio and visual components.
There are several situations where a person may be interested in simultaneously receiving synchronized audio and visual presentations of particular subject matter. For example, in an automotive setting a driver and/or a passenger might be interfacing with a device. While driving, the driver obviously cannot visually read a screen or monitor on which the information is displayed. The driver could, however, select options pertaining to which information he or she wants the browser to present audibly. The passenger, however, may want to follow along by reading the screen while the audio portion is read aloud.
Also, consider the situation of an illiterate or semi-literate adult. He or she can follow along when the browser is reading the text, and use it to learn how to read and recognize new words. Such a browser may also assist the adult in learning to read by providing content written for adults, rather than content aimed at a child learning to read. Finally, a visually impaired person who wants to interact with the browser can "see" and find highlighted text, although he or she may not be able to read it.
There are several challenges in the simultaneous presentation of content between the audio and video modes. The chief one is synchronizing the two presentations. For example, a long piece of content may be visually rendered on multiple pages. The present invention provides a method and system such that when some section of that content is being heard by the user, that section is visible on the screen and, furthermore, the specific visual content (e.g., the word or phrase) being audibly rendered is somehow highlighted visually. This implies automatic scrolling as the audio presentation progresses, as well as word-to-word highlighting.
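The behavior described above (keeping the spoken section on screen and highlighting the word being spoken) can be sketched in Python as follows. This is only an illustrative sketch, not the disclosed implementation: it assumes that a page holds a fixed number of words, and the names `HighlightEvent`, `sync_events`, `scroll_points`, and `words_per_page` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HighlightEvent:
    word_index: int  # position of the word within the whole content
    page: int        # page that must be visible while the word is spoken
    word: str        # the word to highlight


def sync_events(text, words_per_page):
    """Pair every spoken word with the page that has to be on screen.

    Assumption: pagination is a fixed word count per page; a real
    renderer would derive pages from the actual layout.
    """
    if words_per_page < 1:
        raise ValueError("words_per_page must be at least 1")
    return [
        HighlightEvent(index, index // words_per_page, word)
        for index, word in enumerate(text.split())
    ]


def scroll_points(events):
    """Return the word indices at which the display must scroll."""
    return [
        current.word_index
        for previous, current in zip(events, events[1:])
        if current.page != previous.page
    ]
```

With these assumptions, the audio side would step through the events in order, and the visual side would scroll whenever the word index reaches one of the scroll points.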
A further complication is that the visual presentation and audible presentation may not map one-to-one. Some applications may want some portions of the content to be rendered only visually, without being spoken. Some applications may require content to be spoken, with no visual rendering. Other cases lie somewhere in between. For example, an application may want a person's full name to be read while a nickname is displayed visually.
U.S. Pat. No. 5,884,266 issued to Dvorak, entitled "Audio Interface for Document Based on Information Resource Navigation and Method Therefor", embodies the idea that markup links are presented to the user using audibly distinct sounds, or speech characteristics such as a different voice, to enable the user to distinguish the links from the non-link markup.
U.S. Pat. No. 5,890,123 issued to Brown et al., entitled "System and Method for Voice Controlled Video Screen Display", concerns verbal commands for the manipulation of the browser once content is rendered. This patent primarily focuses on digesting the content as it is displayed, and using this to augment the possible verbal interaction.
U.S. Pat. No. 5,748,186 issued to Raman, entitled "Multimodal Information Presentation System", concerns obtaining information, modeling it in a common intermediate representation, and providing multiple ways, or views, into the data. However, the Raman patent does not disclose how the synchronization is done.
It is therefore an object of the present invention to provide a multi-modal renderer that simultaneously renders content visually and verbally in a synchronized manner.
Another object of the invention is to provide a multi-modal renderer that allows content encoded using an eXtensible Markup Language (XML) based markup tag set to be audibly read to the user.
The present invention provides a system and method for simultaneously rendering content visually and verbally in a synchronized manner. The invention renders a document both visually and audibly to a user. The desired behavior for the content renderer is that when some section of that content is being heard by the user, that section is visible on the screen and, furthermore, the specific visual content (e.g., the word or phrase) being audibly rendered is highlighted visually. The invention also reacts to multi-modal input (either tactile input or voice input). In addition, the invention allows an application or server to be accessible to someone via audio instead of visual means by having the renderer handle Embedded Browser Markup Language (EBML) code so that it is audibly read to the user. EBML statements can also be combined so that what is audibly read to the user is related to, but not identical to, the visual text. The present invention also solves the problem of synchronizing audio and visual presentation of changing content via markup language changes rather than by application code changes.
The EBML contains a subset of Hypertext Markup Language (HTML), which is a well-known collection of markup tags used primarily in association with the World Wide Web (WWW) portion of the Internet. EBML also integrates several tags from a different tag set, Java Speech Markup Language (JSML). JSML contains tags to control audio rendering. The markup language of the present invention provides tags for synchronizing and coordinating the visual and verbal components of a web page. For example, text appearing between <SILENT> and </SILENT> tags will appear on the screen but not be audibly rendered. Text appearing between <INVISIBLE> and </INVISIBLE> tags will be spoken but not seen. A <SAYAS> tag, adapted from JSML, allows text (or recorded audio such as WAV files, the native digital audio format used in the Microsoft Windows operating system) that differs from the visually rendered content to be spoken (or played).
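The effect of the three tags can be illustrated with a short Python sketch that splits a marked-up fragment into the text that is displayed and the text that is spoken. It is a sketch under stated assumptions, not the EBML grammar itself: this section does not say how a SAYAS tag carries its spoken replacement, so the `SUB` attribute used here is a hypothetical choice, as are the names `EbmlSplitter` and `split_ebml` and the sample names in the usage note.

```python
from html.parser import HTMLParser


class EbmlSplitter(HTMLParser):
    """Split EBML-like markup into a visual stream and a spoken stream."""

    def __init__(self):
        super().__init__()
        self.visual = []       # fragments shown on screen
        self.spoken = []       # fragments sent to speech output
        self._silent = 0       # nesting depth inside <SILENT>
        self._invisible = 0    # nesting depth inside <INVISIBLE>
        self._sayas = []       # one flag per open <SAYAS>: replaced or not

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "silent":
            self._silent += 1
        elif tag == "invisible":
            self._invisible += 1
        elif tag == "sayas":
            sub = dict(attrs).get("sub")  # hypothetical attribute name
            if sub:
                self.spoken.append(sub)   # speak the replacement text
            self._sayas.append(bool(sub))

    def handle_endtag(self, tag):
        if tag == "silent" and self._silent:
            self._silent -= 1
        elif tag == "invisible" and self._invisible:
            self._invisible -= 1
        elif tag == "sayas" and self._sayas:
            self._sayas.pop()

    def handle_data(self, data):
        if not self._invisible:
            self.visual.append(data)      # everything except INVISIBLE text
        if not self._silent and not any(self._sayas):
            self.spoken.append(data)      # skip SILENT and replaced text


def split_ebml(markup):
    """Return (visual_text, spoken_text) with whitespace normalized."""
    parser = EbmlSplitter()
    parser.feed(markup)
    parser.close()

    def normalize(parts):
        return " ".join("".join(parts).split())

    return normalize(parser.visual), normalize(parser.spoken)
```

For instance, under these assumptions `split_ebml('<SAYAS SUB="William Smith">Bill</SAYAS>')` yields a visual text of `Bill` and a spoken text of `William Smith`, which mirrors the full-name and nickname case mentioned earlier.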
The method for synchronizing an audio and visual presentation in the multi-modal browser includes the steps of receiving a document via a computer network, parsing the text in the document, providing an audible component associated with the text, and simultaneously transmitting to output the text and the audible components.
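The four steps of the method can be sketched as a small Python pipeline. The sketch makes several assumptions that are not in the text: the document arrives as a plain string from a caller-supplied `fetch` function, the text is parsed into sentence-like units by splitting on periods, and the audible component is produced by a caller-supplied `to_audio` function. All function and parameter names are hypothetical, and in this sequential sketch "simultaneous" output means only that both outputs are issued for the same unit before moving on to the next one.

```python
def render_multimodal(fetch, url, to_audio, show, speak):
    """Run the four steps for one document and return the unit count.

    fetch(url)     -> document text (stands in for the network receive)
    to_audio(unit) -> audible component for one text unit
    show(unit)     -> visual output sink
    speak(audio)   -> audio output sink
    """
    # Step 1: receive the document via the network (delegated to fetch).
    document = fetch(url)

    # Step 2: parse the text; here, naively, into period-delimited units.
    units = [part.strip() for part in document.split(".") if part.strip()]

    for unit in units:
        # Step 3: provide the audible component associated with the text.
        audible = to_audio(unit)

        # Step 4: transmit the text and its audible component together.
        show(unit)
        speak(audible)

    return len(units)
```

A caller could pass `list.append` methods as the two sinks to inspect the order in which text and audio are emitted.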