Absorbing and processing time-based information is more difficult than absorbing spatially distributed information. This is, in part, due to the difficulty in creating a "sense of the whole" from time-based information. For instance, it is a common experience that visual reading of printed text is much easier and faster than listening to recorded speech. The focus of the eye (foveal vision) can move quickly from one place to another; the peripheral vision fuzzily views a large area around the fovea, so a reader knows where the foveal vision is with respect to the rest of the printed text (e.g., the rest of the page, the rest of the sentence, the rest of the word) at any given moment. It is the peripheral vision that enables the foveal vision to quickly locate the places the reader wants to examine. Therefore, visual reading can be thought of as "instant arbitrary accessing".
In contrast, the conventional devices which people use to access time-based information, for instance, audio and/or video players, only display such information sequentially (e.g., playing back recorded audio or video frames). In other words, the conventional devices present time-based information only along a time axis. Since time is sequential, at any moment, the person who is accessing such information is exposed to only a small segment of the information. Given that a human's temporal memory is short and has poor resolution of time, it is difficult for users to determine the relative timing mark of a particular information segment with respect to the entire piece of information being displayed or accessed.
Even if the underlying devices allow users to quickly move from one part of the speech to another (e.g., a digital storage device that allows random access), a listener normally would not do so, simply because he/she does not know where he/she is with respect to the entire information at any moment and therefore does not know where to move. Thus, the difficulty of accessing time-based information stems chiefly from the inability to generate a sense of the whole, which results from the combination of the characteristics of human sight, hearing, and memory and the way conventional devices work. This temporal limitation makes it much more difficult to construct a general outlook of the set of information.
If time-based information, such as recorded speech, could be translated precisely into printed text or some other spatial visual codes such as pictures, accessing the time-based information could be turned into reading. However, automatic speech recognition is far from perfect. Thus alternative ways to absorb and present recorded speech are needed.
Many prior art systems have attempted to solve this problem. Most of the previous research has emphasized one of the following approaches:
1. Condensing the information by extracting important (situation dependent) portions or by throwing out insignificant portions. This approach is not effective in all circumstances since it assumes that some of the time-based information is, in fact, insignificant and that an efficient method exists for prioritizing the information.
2. Utilizing the characteristics of the information, such as the amplitude of the waveform of recorded speech, and structural patterns to visually annotate the information so that different segments can be easily distinguished. Thus, this approach relies upon an effective automatic feature extraction system. This approach is also difficult to implement and not effective for all sources of time-based information.
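As a rough illustration of the second approach, the sketch below computes a per-frame amplitude envelope of a sampled waveform and marks each frame as speech or pause, so that segments can be told apart visually. It is a minimal sketch under stated assumptions, not the method of any particular prior art system: the function names (rms_envelope, label_segments, render_strip), the frame length, and the fixed threshold are all hypothetical choices made for this example.

```python
import math


def rms_envelope(samples, frame_len):
    """Return the RMS amplitude of each consecutive frame of `samples`."""
    envelope = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        envelope.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return envelope


def label_segments(envelope, threshold):
    """Group consecutive frames into (label, first_frame, last_frame) runs.

    A frame is labeled "speech" when its RMS level reaches `threshold`
    (an assumed, hand-picked value) and "pause" otherwise.
    """
    segments = []
    for i, level in enumerate(envelope):
        label = "speech" if level >= threshold else "pause"
        if segments and segments[-1][0] == label:
            # Extend the current run to include this frame.
            segments[-1] = (label, segments[-1][1], i)
        else:
            segments.append((label, i, i))
    return segments


def render_strip(envelope, threshold):
    """Render one character per frame: '#' for speech, '.' for pause."""
    return "".join("#" if level >= threshold else "." for level in envelope)


# Hypothetical input: silence, a short burst of sound, then silence again.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
envelope = rms_envelope(samples, frame_len=4)
segments = label_segments(envelope, threshold=0.1)
strip = render_strip(envelope, threshold=0.1)
```

The sketch also shows why this approach depends on reliable feature extraction: a single fixed threshold separates speech from pauses only when the recording is clean, and it says nothing about what the segments contain.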
Other prior art systems have addressed the presentation of recorded speech based upon the prior knowledge of the structure of the underlying text. An example of one such system is the multimedia package "From Alice to Ocean" marketed by Apple Computers. This package presents the story of a woman's journey across the Australian desert. The text of the story, as read by the author, has been stored in convenient segments. These segments of text are accessed by a user along with corresponding visual images based upon visual cues displayed on a computer screen.
A second prior art system of this kind is a CD-ROM game/reader, "Just Grandma and Me," marketed by Broderbund. This package allows a user to play back prerecorded spoken words corresponding to the words of a story displayed on a computer screen.
These systems and their underlying methods do not address the problem of accessing time-based information without the additional a priori information provided by the presence of the text. If the underlying text is known, the solution to the problem is trivialized, since the text itself could be used to directly access the information, e.g., by reading it directly or by text-to-speech translation. This is not an effective approach for time-based information for which no textual interpretation or other similar script is available a priori.