The digital multimedia revolution has spawned a vast array of products and devices upon which media can be viewed, such as personal digital assistants (PDAs), digital picture frames, cellular phones, liquid crystal displays (LCDs), cathode-ray tubes (CRTs), projection devices, plasma screens, and the capture devices themselves. The multimedia/imaging industry will continue to embrace ways other than hardcopy prints to view and share imagery. This fact, combined with the proliferation of digital media stored in memory devices and repositories as diverse as the displays themselves, presents a significant challenge in terms of organization, search and retrieval of images of interest.
As the number of these digital images continues to grow, much effort in industry and academia is spent on technologies that analyze image data to understand the content, context, and meaning of the media without human intervention. This area of technology is called semantic understanding, and its algorithms are becoming more and more sophisticated in how they analyze audiovisual data and non-audiovisual data, called metadata, within a media file. For example, face detection/recognition software can identify faces present in a scene. Speech recognition software can transcribe what is said in a video or audio file, sometimes with excellent accuracy depending on the quality of the sound and attributes of the speech. Speaker recognition software is capable of measuring the characteristics of an individual's voice and applying heuristic algorithms to guess the speaker's identity from a database of characterized speakers. Natural language processing methods bring artificial intelligence to bear as an automated way of understanding speech and text without human intervention. These methods produce very useful additional metadata that often is re-associated with the media file and used for organization, search and retrieval of large media collections.
There have been many innovations in the consumer electronics industry that marry media files such as digital still photographs with sound. For example, U.S. Pat. No. 6,496,656 teaches how to embed an audio waveform in a hardcopy print. U.S. Pat. No. 6,993,196 teaches how to store audio data as non-standard metadata at the end of a digital image file.
U.S. Pat. No. 6,833,865 teaches an automated system for real-time embedded metadata extraction that can be scene or audio related so long as the audio already exists in the audio-visual data stream. The process can be done in parallel with image capture or subsequently. U.S. Pat. No. 6,665,639 teaches a speech recognition method and apparatus that can recognize utterances of specific words, independent of who is speaking, in audio signals according to a pre-determined list of words.
That said, there often is no substitute for human intuition and reason, and a real person viewing media will almost always understand and recognize things that computers have a hard time with. There are those who maintain that computers will one day equal or surpass the processing and reasoning power of the human brain, but this level of artificial intelligence technology lies far into the future. As an example, consider a system that analyzes an image with several people in a scene. The system may use face detection algorithms to locate faces, and recognition algorithms to identify the people. Extending this example into the video space, additional algorithms to detect and identify speech can be employed to produce a transcript, or to augment metadata through recognition of specific words in a list. While the existing technology is promising, it is arguable that such algorithms will compare unfavorably with a human performing these tasks for the foreseeable future.
Suppose two people are viewing images as a slideshow on a digital picture frame or other display device. The people can, and often do, comment on who is in the image and on the circumstances in which the image was captured. Typically this commentary is ephemeral and has no lasting value beyond the viewing moment. By the time the next image is displayed, the commentary has faded from the minds of the viewers.
There has been much activity related to annotating image data with descriptive text. Some approaches use variations on a simple text entry interface, where the viewer enters textual information through a keyboard input device and the text is subsequently associated with the image data. For example, Google has a web application called Google Image Labeler, based on the ESP Game developed at Carnegie Mellon University. It is a collaborative real-time application that turns the task of image keyword tagging into a game. The system takes a “distributed human processing” approach, where individuals spend their own time viewing and tagging randomly chosen images. The words are then saved as keywords in the image file, to aid in future search queries.
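As an illustration only, the keyword-tagging idea described above can be sketched as a small in-memory index that associates viewer-supplied keywords with images and later answers search queries. The class name `KeywordIndex`, its methods, and the image identifiers are hypothetical and are not taken from any of the cited systems; a real system would persist the keywords with the image file or in a database.

```python
from collections import defaultdict


class KeywordIndex:
    """Hypothetical in-memory index mapping keywords to image identifiers."""

    def __init__(self):
        # keyword -> set of image identifiers tagged with that keyword
        self._images_by_keyword = defaultdict(set)

    def tag(self, image_id, keywords):
        """Associate each keyword (case-insensitive) with the given image."""
        for word in keywords:
            normalized = word.strip().lower()
            if normalized:
                self._images_by_keyword[normalized].add(image_id)

    def search(self, *keywords):
        """Return a sorted list of images tagged with every given keyword."""
        if not keywords:
            return []
        matches = [
            self._images_by_keyword.get(word.strip().lower(), set())
            for word in keywords
        ]
        return sorted(set.intersection(*matches))
```

Under these assumptions, tagging `"beach.jpg"` with `["Beach", "family"]` and then calling `search("family", "beach")` would return that image, since the query requires every keyword to match.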
Other methods for annotating images with additional metadata take advantage of audio, specifically speech. U.S. Pat. No. 7,202,838 teaches a graphical user interface which allows a picture database user to annotate digital pictures to promote efficient picture database browsing, where annotation can take the form of comments spoken by the user. U.S. Pat. No. 7,202,838 describes a system for showing medical imagery on a display, through which additional data can be gathered in several forms, including written annotation and speech, and associated with the imagery for diagnostic and other purposes. Another medically related patent, U.S. Pat. No. 6,518,952, describes a system and device for displaying medical images and controlling a way of recording, synchronizing, and playing back dictation associated with the imagery.
Similarly, U.S. Pat. No. 7,225,131 describes a system and method of capturing user input comprising speech, pen, and gesture, or any combination thereof, describing a medical condition, and associating the user input with a bodily location via a multi-modal display that shows a schematic representation of the human body.