The digital revolution has radically changed the way we access media. Most digital photo viewing is now done by looking at a screen. Furthermore, it is now possible to store many thousands of photographic, video and other media items on a common mass storage device such as a memory stick, SD card or hard drive and to easily share these items by email, uploading to a website or other electronic techniques. While digital media files can theoretically be assigned distinctive names to identify their respective content, media files are instead usually named automatically by the device that created them without regard for content. For example, a digital camera might automatically name a snapshot “IMG_5467.jpg” meaning the 5,467th photo taken by that particular digital camera. Although such automatic naming, whether sequential or otherwise, ensures that each media item is assigned a unique name, automatically generated numerical file names are not particularly helpful in identifying media item content.
To address this problem, many photo sharing websites permit users to electronically “tag” images with identifying information. As one popular sharing site (Flickr) explains, “Tags are like keywords or labels that you add to a photo to make it easier to find later. You can tag a photo with phrases like ‘Catherine Yosemite hiking mountain trail.’ Later if you look for pictures of Catherine, you can just click that tag and get all photos that have been tagged that way. You may also have the right to add tags to your friends' photos, if your friends set that option in the privacy settings for their photos.”
Unfortunately, manually tagging images in this way can be time-consuming and labor-intensive. Imagine typing in tags for each of the 3000 photos you took on your last vacation. Machine tagging techniques that automatically analyze media items and identify their content are known. Some machine-tagging approaches use pattern recognition and pattern matching techniques. For example, automatic face identification algorithms can be used to identify portions of digital photos or videos that contain faces. However, even with machine-tagging approaches, a human is generally asked to identify the person to whom a detected face belongs. Automatic algorithms may then abstract patterns associated with the identified elements, and use pattern matching and recognition to automatically identify additional occurrences of the same pattern in other media items within a collection. These techniques, while partially effective, do not completely solve the tagging problem. In particular, a machine can never replace the human factor when it comes to memory, emotion and the human connection.
Additionally, while collaborative tagging (with or without machine assistance) is a useful concept, it can raise privacy concerns. For example, you may not want your friends or acquaintances being able to create captions or tags for cherished photos. Also, it may be entirely appropriate and desirable to share photos taken at a party or other event with others who attended the party or event. However, it may be inappropriate or undesirable to share such photos with people who did not attend the party or other event. Current infrastructure allows some degree of control over who sees what, but the automatic controls tend to be coarse and often ineffective. There exists a compelling need to facilitate sharing of media items with some people or groups while preventing those same media items from being shared with other people or groups.
An easy, interesting and innovative way to manipulate and tag photos while viewing the photos using display devices with processing and sound receiving capability is to apply a voice tag. Voice tagging in the context of real time capture of voice information with a smart phone or other device is generally known. However, further improvements are desirable.
In one example illustration, if a user is looking at a photo on a display device and wishes to tag the photo, the user can touch the photo on the screen and speak a voice tag, or utter a command and then say the voice tag. As one example, if the user is looking at a photo of Gerilynn on the screen and wishes to tag the photo, the user can touch the photo on the touch screen and say “Gerilynn”, or alternatively just say “Tag Gerilynn.” That photo has now been tagged. The action identifies the people or objects in the photo and also applies a voice tag to the photo.
Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary; voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. In such implementations, the device could respond to additional voice commands such as “IPAD Gerilynn” by recognizing the word “Gerilynn” and starting to show photos that had previously been tagged with “Gerilynn”. Any keyword used during the tagging operation(s) could be uttered to call up and cause display of items tagged with that particular keyword.
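By way of non-limiting illustration only, the voice-command flow described above might be sketched as follows. The sketch assumes the utterance has already been converted to text by some speech recognizer (not shown), and the names `TagIndex`, `handle_utterance` and the `wake_word` parameter are hypothetical, introduced here purely for illustration.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index mapping keywords to tagged media items."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def tag(self, item, keyword):
        # Normalize the keyword so "Gerilynn" and "gerilynn" match.
        self._items_by_tag[keyword.strip().lower()].add(item)

    def lookup(self, keyword):
        # .get() avoids creating empty entries for unknown keywords.
        return sorted(self._items_by_tag.get(keyword.strip().lower(), set()))


def handle_utterance(index, transcript, displayed_item, wake_word="ipad"):
    """Interpret an already-transcribed utterance (assumed input).

    "tag: <keyword>"         tags the item currently displayed.
    "<wake_word> <keyword>"  returns items previously tagged with <keyword>.
    """
    words = transcript.replace(":", " ").split()
    if len(words) < 2:
        return ("ignored", [])
    command, keyword = words[0].lower(), " ".join(words[1:])
    if command == "tag":
        index.tag(displayed_item, keyword)
        return ("tagged", [displayed_item])
    if command == wake_word:
        return ("show", index.lookup(keyword))
    return ("ignored", [])
```

In this sketch, uttering “tag: Gerilynn” while a photo is displayed records the tag, and a later “IPAD Gerilynn” returns the items carrying that keyword for display.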
Any type of device could be commanded in such a manner. For example, one implementation provides a digital photo frame that is hanging on the wall. The digital photo frame includes a microphone. If the user utters the phrase “Photoframe: Antarctica”, the digital photo frame could automatically recognize the phrase and begin displaying a single image, a slide show or a stream of images that had previously been tagged with “Antarctica” (e.g., an Antarctica vacation).
Other non-limiting implementations provide additional photoframe functionality. For example, the user could utter the phrase “Photoframe: Free.” This can place the photoframe into a free recognition mode where the photoframe begins to attempt to recognize words that are being spoken in the room. If the people in the room just happened to be talking about Antarctica, the photoframe can recognize the word and, when it determines that it has an inventory of photos or other images that were previously tagged with that term, it can begin to display such tagged photos or other images.
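As a minimal sketch of the free recognition mode, and assuming a speech recognizer (not shown) supplies a stream of recognized words, the photoframe could compare each word against its inventory of tags and select only tags for which it actually holds images. The function name `free_mode_matches` and the dictionary layout of the inventory are hypothetical.

```python
def free_mode_matches(recognized_words, tagged_inventory):
    """Return inventory tags heard in conversation, in the order first heard.

    recognized_words: iterable of words from a speech recognizer (assumed).
    tagged_inventory: dict mapping a tag to the list of images carrying it.
    """
    # Only tags that actually have images are candidates for display.
    known = {tag.lower(): tag for tag, images in tagged_inventory.items() if images}
    heard = []
    for word in recognized_words:
        key = word.strip(".,!?").lower()
        if key in known and known[key] not in heard:
            heard.append(known[key])
    return heard
```

A tag is reported at most once, so repeated mentions of “Antarctica” in the room would not restart the same image stream.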
In other example implementations, when photos or other images are being displayed, the displaying device can record what people are saying while the photos are being displayed. For example, while a photo stream of a vacation is being displayed, a person viewing the photo stream may describe the photos as they are being displayed. The conversation could for example include comments about important photos, such as comments relating to family history, historical events or the like. The comments can be recorded in association with the photos for synchronized playback when the photos are shown again. Such voice comments may be invaluable content in the future. They could be stored in a repository for example and distributed like videos or podcasts are today. A widely distributed application for a commonly-available device could be used to collect memories and narration of many people and store those memories and narrations in association with the photos or other images in the form of voice tags.
In some implementations, searching for voice tags can be performed in the audio domain, for example by using pattern recognition techniques that match uttered audio tags with previously stored audio tags. In other implementations, off-line or on-line processing can be used to recognize uttered speech and to store the resulting text, data or other information in association with images for later comparison. In some implementations, it will be possible to recognize who the speakers are in the neighborhood of the device and to play photo streams appropriate to or customized for those particular speakers.
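One way audio-domain matching could be sketched, without any speech-to-text step, is with dynamic time warping (DTW), a common sequence-matching technique that is used here only as an illustrative stand-in; the text above does not specify a particular algorithm. The sketch assumes each voice tag has already been reduced to a sequence of per-frame feature values (for example, energy), and the names `dtw_distance`, `best_matching_tag` and `max_distance` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_matching_tag(query, stored_tags, max_distance):
    """Return the label of the closest stored voice tag, or None if none is close enough.

    stored_tags: dict mapping a label to its stored feature sequence (assumed layout).
    """
    best_label, best = None, math.inf
    for label, features in stored_tags.items():
        d = dtw_distance(query, features)
        if d < best:
            best_label, best = label, d
    return best_label if best <= max_distance else None
```

DTW tolerates differences in speaking rate, which is why a slower utterance of the same tag can still align with the stored version; a practical system would likely use richer features than a single value per frame.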
In other implementations, the recorded voice comments can be processed and automatically converted into text for storage and presentation as a written transcript. In other implementations, it may be desirable to store the voice tags separately from the images and simply associate the two on an on-demand basis.
Exemplary illustrative non-limiting technology herein provides innovative tagging technology that makes it fun for users to tag media items such as photos and videos with information relating to people, groups, time, event and other relevant criteria. A user interface provides access to automatic features providing fun and efficient tagging of media items. The items may then be automatically shared based on the tags, e.g., only to members of a particular group, based on age of the media item, or other criteria.
Additionally, an innovative use of tagged media items is to use the tags to automatically communicate or share: media items can be automatically shared or otherwise presented based on their tags. For example, particular photo and/or video streams can be tagged as being associated with a particular person, time and event and made available for sharing over a communications network. When that person initiates or establishes a communication over the network, network-connected components can automatically access and retrieve media items tagged to that person, event and/or time and present them to the recipient of the communication.
As one particular example, establishing a voice call or other connection between two parties could cause media items to be accessed based on their tags and automatically presented to call participants. The tagged media items could be transmitted over the voice call connection, or they could otherwise be accessed such as from a video and photo sharing website or other network-based storage. The tagging technology could be based on group sharing techniques. For example, photos taken during a party or other event could be tagged with the event, the people who attended the event and the time of the event. The tagging technology could be used to automatically share recent photos and/or videos based on such tagging so that for example a phone call or text from one of the party participants to another could cause automatic sharing or retrieval for sharing of a photo or video stream associated with that party.
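As a hedged sketch of the call-triggered sharing just described, the selection step might look like the following. It assumes each media item carries tags for its event, its attendees and its capture time, and it shares an item only when both call participants are tagged as attendees and the item is recent. The `MediaItem` fields, the function `items_for_call` and the 30-day default for `max_age` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MediaItem:
    """Hypothetical tagged media item."""
    name: str
    event: str
    attendees: frozenset
    taken_at: datetime


def items_for_call(store, caller, callee, now, max_age=timedelta(days=30)):
    """Select items to present when a call is established between two parties.

    An item qualifies only if both parties are tagged as attendees and the
    item is no older than max_age; results are ordered newest first.
    """
    shared = [
        item for item in store
        if caller in item.attendees
        and callee in item.attendees
        and now - item.taken_at <= max_age
    ]
    return sorted(shared, key=lambda item: item.taken_at, reverse=True)
```

Because membership in the attendee tag is checked for both parties, a call involving someone who did not attend the event would not retrieve that event's photos.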
In one exemplary illustrative non-limiting implementation, a communications arrangement provides a network that permits user devices to communicate. At least one tagging store stores tagged multi-media items, and a tagging server coupled to said network and to said tagging store can automatically access at least one tagged media item for presentation at least in part in response to a communication over said network.
The tagged media item may comprise a video or photo stream presented during communication, said stream being tagged to at least one person, group, time or event.