The aural assimilation of information is useful in ways that visual assimilation of information may not be. Thus, speech interfaces now facilitate aural presentations of information in a variety of environments, including computer-based screen readers, portable electronic devices, and phone-based information systems. Speech interfaces are a great aid in freeing visual attention in cognitively overloaded environments. Having a file, mail, or web page read out while composing a document, replying to a mail, doing exercises, etc. enables multitasking by freeing visual attention. Speech interfaces are also an effective way of promoting folk computing. The terms “aural” and “auditory,” applied for instance in the phrases “aural skimming and/or scrolling” and “auditory skimming and/or scrolling,” are used interchangeably herein, unless expressly noted otherwise.
With the availability of portable devices like PDAs, mobile phones, and iPods, speech interfaces are likely to witness increased use. Today's speech interfaces may comprise both speech input and speech output. Speech input is handled through speech recognition and speech output through speech synthesis. The inputs to streaming speech applications need not necessarily be speech but can be any input interface, including keyboard, keypad, media player control, optical recognizer, and so on. Potential applications of speech synthesis include email readers, RSS to Podcast conversions, news readers, and so on.
One challenge to the more widespread proliferation of devices that deliver information aurally is the sequential nature of aural presentations. This sequential nature makes it much harder to skip predictable information and locate specific information within an aural presentation than within a visual presentation. For instance, suppose a user wanted to convert the following example email to speech:
From: John <john@domain1.com>
To: Sue <sue@domain2.com>, Joe <joe.domain3.com>
Cc: chae@domain4.com
Subject: Re: Annual day
> Please send 10 iPods.
Please mention the model number.
If this email were visually assimilated by Sue, for example, she would hardly read the more or less routine and/or predictable information like “john@domain1.com.” Instead, she would visually skim over most of the message. The format of text provides cues to her so that she recognizes which parts of the text are important. First, the text is divided into sentences and lines, giving Sue a hierarchical structure with which to process the message. Second, the start of each line contains an identifying marker such as “From” or “>” to help Sue quickly recognize the context of the line. If Sue were reading this message to determine what John's response is, she would use these cues to skip straight to the first line that appears to be John's response: “Please mention the model number.” If she were to read the response and not remember what the response was in reply to, she might then scan backwards in the message to the line marked with a “>” character, or perhaps even to the line marked “Subject.” If the email were longer, for instance seven pages, she might find it easier to search for the information she needs by scanning the topic sentence of each paragraph or looking for certain keywords and numbers.
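The visual cues described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a method taken from this description: the label names ("quoted", "body"), the helper functions, and the choice of header fields are all hypothetical. The sketch tags each line of the example email by its leading marker and then jumps to the first unmarked line (the response) and to the first line marked with ">" (what the response replies to).

```python
import re

# The example email, one line per header or body line.
EMAIL = """From: John <john@domain1.com>
To: Sue <sue@domain2.com>, Joe <joe.domain3.com>
Cc: chae@domain4.com
Subject: Re: Annual day
> Please send 10 iPods.
Please mention the model number."""

# Assumed set of header markers; a real message may carry many more.
HEADER = re.compile(r"^(From|To|Cc|Subject):")


def classify(line):
    """Label a line by the marker at its start (hypothetical label names)."""
    match = HEADER.match(line)
    if match:
        return match.group(1).lower()
    if line.startswith(">"):
        return "quoted"
    return "body"


def label_lines(text):
    """Return (label, line) pairs for every non-empty line."""
    return [(classify(line), line) for line in text.splitlines() if line.strip()]


def first_of(labelled, kind):
    """Return the first line carrying the given label, or None."""
    for label, line in labelled:
        if label == kind:
            return line
    return None


labelled = label_lines(EMAIL)
response = first_of(labelled, "body")    # the line a skimming reader jumps to
context = first_of(labelled, "quoted")   # the line the reader scans back to
print(response)
print(context)
```

A plain speech synthesizer has no counterpart to these labels, so every line would be read out with equal weight.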
On the other hand, if this email were assimilated aurally through a speech synthesizer, all of its parts would be given equal importance. Sue would have no choice but to listen to the whole message to find the information she was seeking. If she missed important information the first time, she would, just like a person who missed a phone number left in a voicemail message, have to listen to the aural presentation all over again.
Computer interfaces support another feature that facilitates more efficient assimilation of a visual information source: scrolling. Scrolling may be defined as producing faster output which closely corresponds to the original information. Scrolling helps facilitate even more efficient skimming. For example, if an individual were looking for a small section of a very long document, the individual could use a computer-based application to visually scroll through the document with keys on a keyboard or the scroll wheel of a mouse. The document would rapidly progress before the individual's eyes, allowing the individual to look for key headers, words, bolded text, or other formatting that might help the individual locate the section that the individual is searching for. In this respect, scrolling works much like searching for a scene in a movie using fast forward and rewind buttons. Unfortunately, aurally presented information cannot be scrolled in this fashion, since, in contrast to visually presented information, aurally presented information cannot be comprehended in traditional “fast forward” and “rewind” modes.
Of course, there are many simple approaches to progressing through aurally presented content without having to listen to the entire aural presentation. For instance, a device might allow a user to skip forward or backward a predetermined amount of time in a presentation. A device might also allow a user to skip to predetermined segments, tracks, or files. However, these approaches share a drawback: unless someone has already identified for the user exactly where in the presentation the user can expect to find the information the user is looking for, there is no way for the user to know whether a particular segment is relevant or should be skipped. The user must actually listen to the whole segment. Thus, neither of these approaches can match the efficiency of the above-described context-driven scrolling and skimming methods employed by typical persons assimilating visual information.
Another approach may be to segment a presentation based on acoustic cues such as pause and pitch. This approach provides some context, but fails to provide the same level of logical context that can be gleaned in visually presented information from cues such as headers, text formatting, punctuation, key words, and other aforementioned markers.
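As a rough illustration of pause-based segmentation, the following Python sketch splits a sequence of per-frame energy values wherever a sufficiently long run of low-energy frames occurs. The function name, the threshold, and the minimum pause length are assumptions chosen for the example, and the energy values are made up; a real system would compute them from audio and would likely consider pitch as well.

```python
def split_on_pauses(energies, threshold=0.1, min_pause_frames=3):
    """Return (start, end) frame ranges, end exclusive, of non-silent segments.

    A frame counts as silent when its energy is below `threshold`; a run of
    at least `min_pause_frames` silent frames closes the current segment.
    """
    segments = []
    start = None       # first frame of the segment being built, if any
    silent_run = 0     # consecutive silent frames seen inside that segment
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # Close the segment at the frame where the pause began.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Close the final segment, trimming any trailing silent frames.
        segments.append((start, len(energies) - silent_run))
    return segments


# Made-up energies: a long pause (frames 2-4) and a short one (frame 7).
frames = [0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.9]
segments = split_on_pauses(frames)
print(segments)
```

The resulting boundaries mark where the speaker paused, but nothing in them indicates what each segment is about, which is the limitation noted above.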
Another approach may be to translate the speech to text and allow the user to skim through the textual transcript. Once the user identifies the portion of the textual transcript the user wants to hear, the user may begin listening to the corresponding portion of the aural presentation. Because this approach is insensitive to the context of the information in the transcript, however, the user must actually read the transcript and search for the desired information. Thus, the user is deprived of the ability to assimilate the information aurally without requiring visual attention, or to assimilate the information aurally with minimal visual attention. This approach also has the drawback of requiring a device that contains a screen large enough for viewing a transcript.
Another approach to producing a faster output of an information source may be to time-compress the audio stream using signal processing techniques. Using such an approach, an audio presentation is sped up so that a voice appears to be speaking at a faster rate, thus creating a different playback speed. However, such an approach is limited in that speech comprehension rapidly degrades the faster a message is sped up.
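One crude form of time compression can be sketched in Python as follows. This is an assumption-laden illustration rather than the technique any particular product uses: it simply keeps a short chunk of samples, discards the next chunk, and repeats. The function and parameter names are hypothetical, and practical systems typically smooth the joins between kept chunks (for example with overlap-add methods) instead of abutting them directly.

```python
def time_compress(samples, keep_len, drop_len):
    """Shorten a sample sequence by keeping `keep_len` samples and then
    discarding `drop_len` samples, repeatedly, from start to end."""
    if keep_len <= 0 or drop_len < 0:
        raise ValueError("keep_len must be positive and drop_len non-negative")
    period = keep_len + drop_len
    compressed = []
    for start in range(0, len(samples), period):
        compressed.extend(samples[start:start + keep_len])
    return compressed


# Integers stand in for audio samples so the dropped positions are visible.
original = list(range(12))
faster = time_compress(original, keep_len=3, drop_len=1)
speedup = len(original) / len(faster)
print(faster, speedup)
```

Raising `drop_len` relative to `keep_len` increases the speedup, but more of the signal is discarded, which is consistent with the loss of comprehension at higher rates described above.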
Another approach may be to develop a rule-based system for scrolling and skimming an aural presentation. Unfortunately, skimming and scrolling a visual information source are complex phenomena involving higher-level cognitive processes. While it may be possible to mimic these cognitive operations through a rule-based system for aural presentations, such a system would be enormously complex and not likely to reflect the needs and objectives of most listeners.
Another approach to producing a faster output of an information source may be summarization. However, with existing summarization processes it is difficult to establish a sequential correspondence between the original information and the summary. For example, a summary may juxtapose concepts that appear far apart in the original information, or altogether neglect minor facts that may be of interest to a researcher. Thus, summarization does not provide an aural scrolling effect similar to visual scrolling.
Based on the foregoing, a mechanism to overcome the lack of context-sensitive skimming and scrolling in aural presentations of information would be useful. Such a mechanism could make it easier for users to locate and comprehend specific information in an aural presentation.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.