The present invention relates to characterizing media, and more specifically, to characterizing digital media with voice tags.
Digital libraries, photo sharing sites, image search engines, on-line encyclopedias, and other computer systems all hold large numbers of images in file systems or databases. Users accessing these sites may have trouble finding desired images because, unlike documents, images (and other digital media) do not include words or phrases that can be indexed.
One solution to the problem of finding desired images is image recognition, but this approach is prohibitively expensive for user-generated content and is not highly accurate. Another known method is to group images in named categories (such as folders) to facilitate access. This, however, requires manual effort, and the images must be known ahead of time.
There are many ways of organizing these images, including collections, sets, and hierarchies. One common method to organize a collection is tagging. When a user sees the image, the user may type in a word or phrase to “tag” (describe) the image. Multiple users can add one or more tags to the same image. When another user accesses the site, they can then navigate to the images labeled by particular tags.
There are various ways in which image navigation using tags may be accomplished. For instance, the user may type in a word or phrase that is an existing tag for a set of one or more images. Alternatively, a user may see the tags arranged in various ways (alphabetically, by popularity, etc.) and then choose a tag which describes the image(s). Text tagging for social navigation is widely used, and its efficacy is well understood.
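By way of illustration only, the text-tag navigation described above may be sketched as a simple mapping from tags to image identifiers. The class name `TagIndex`, its method names, the case-insensitive normalization, and the example file names are assumptions made for this sketch and are not taken from any particular system.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index mapping text tags to image identifiers."""

    def __init__(self):
        # tag -> set of image identifiers labeled with that tag
        self._images_by_tag = defaultdict(set)

    @staticmethod
    def _normalize(tag):
        # Assumption: tags match case-insensitively, ignoring extra whitespace.
        return " ".join(tag.lower().split())

    def add_tag(self, image_id, tag):
        """Record that a user tagged image_id with the given word or phrase."""
        self._images_by_tag[self._normalize(tag)].add(image_id)

    def find(self, tag):
        """Return the images labeled with a typed-in tag, in sorted order."""
        return sorted(self._images_by_tag.get(self._normalize(tag), set()))

    def tags_alphabetical(self):
        """List all known tags alphabetically."""
        return sorted(self._images_by_tag)

    def tags_by_popularity(self):
        """List tags by number of tagged images, most popular first."""
        return sorted(
            self._images_by_tag,
            key=lambda t: (-len(self._images_by_tag[t]), t),
        )


index = TagIndex()
index.add_tag("img1.jpg", "Beach")
index.add_tag("img2.jpg", "beach")
index.add_tag("img2.jpg", "sunset")
print(index.find("BEACH"))
print(index.tags_by_popularity())
```

In this sketch, a user either types a tag (`find`) or browses an ordered list of tags (`tags_alphabetical`, `tags_by_popularity`). Both paths presuppose that the user can read and type text.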
There are also ways of presenting digital media so that users can scan and identify items (collages, grids, visualizations). A major drawback of these approaches is that they are not scalable: the display becomes cluttered and the screen may run out of pixels, particularly on small screens such as those of mobile devices.
There are also ways to ‘automatically’ process digital media to derive metadata that can then be used for searching. Metadata (location, time) may be captured at the time of image acquisition and subsequently used to navigate to the visual digital media.
However, there are many situations in which creating or using textual tags is impossible or inconvenient. Examples include users who: are using mobile phones (typing a word or phrase takes a long time or diverts attention from a visual task); are physically disabled (cannot type the words or phrases); are illiterate or semi-literate because of limited education (have only a limited ability to read or write); have vision problems (cannot see the words or phrases); or are subject to a combination of these situations.