1. Field of the Invention
The present invention relates generally to digital capture devices, and more particularly, to digital still cameras, digital video cameras, digital video encoders and other media capture devices.
2. Description of the Related Technology
The distinction between still devices and motion devices is becoming blurred as many of these devices can perform both functions, or combine audio capture with still image capture. The capture of digital content is expanding rapidly due to the proliferation of digital still cameras, digital video cameras, and digital television broadcasts. Users of this equipment generally also use digital production and authoring equipment. Storing, retrieving, and manipulating the digital content represent a significant problem in these environments. The use of various forms of metadata (data about the digital content) has emerged as a way to organize the digital content in databases and other storage means such that a specific piece of content may be easily found and used.
Digital media asset management systems (DMMSs) from several vendors are being used to perform the storage and management function in digital production environments. Examples include Cinebase, WebWare, EDS/MediaVault, Thomson Teams, and others. Each of these systems exploits metadata to allow constrained searches for specific digital content. The metadata is generated during a logging process when the digital content is entered into the DMMS. Metadata generally falls into two broad categories:
Collateral metadata: information such as date, time, camera properties, and user labels or annotations, and so forth;
Content-based metadata: information extracted automatically by analyzing the audiovisual signal and extracting properties from it, such as keyframes, speech-to-text, speaker ID, visual properties, face identification/recognition, optical character recognition (OCR), and so forth.
Products such as the Virage VideoLogger perform the capture and logging of both of these types of metadata. The VideoLogger interfaces with the DMMS to provide the metadata to the storage system for later use in search and retrieval operations. These types of systems can operate with digital or analog sources of audiovisual content.
The capture of digital content offers an opportunity which is not present in analog capture devices. What is desired is a capability to embed a content-based analysis function in the capture device for extracting metadata from the digital signals in real-time as the content is captured. This metadata could then be later exploited by DMMSs and other systems for indexing, searching, browsing, and editing the digital media content. A central benefit of this approach would be that the metadata is captured as far “upstream” as possible, where it is most valuable. This would allow the metadata to be exploited throughout the lifecycle of the content, thereby reducing costs and improving access to and utilization of the content. Such an approach would be in contrast to the current practice of performing a separate logging process at some point in time after the capture of the content. Therefore, it would be desirable to capture the metadata at the point of content capture, and to perform the analysis in real-time by embedding metadata engines inside of the physical capture devices such as digital still cameras, digital audio/video cameras, and other media capture devices.
Some previous efforts at capturing metadata at the point of content capture have focused on the capture of collateral metadata, such as date/time, or user annotations. Examples of these approaches can be found in U.S. Pat. No. 5,335,072 (sensor information attached to photographs), U.S. Pat. No. 4,574,319 (electronic memo for an electronic camera), U.S. Pat. No. 5,633,678 (camera allowing for user categorization of images), U.S. Pat. No. 5,682,458 (camera that records shot data on a magnetic recording area of the film), and U.S. Pat. No. 5,506,644 (camera that records GPS satellite position information on a magnetic recording area of the film). In addition, professional digital cameras being sold today offer certain features for annotating the digital content. An example of this is the Sony DXC-D30 (a Digital Video Cassette camera, or DVC) which has a ClipLink feature for marking video clips within the camera prior to transferring data to an editing station.
Many aspects of digital capture devices are well understood and practiced in the state of the art today. Capture sensors, digital conversion and sampling, compression algorithms, signal levels, filtering, and digital formats are common functions in these devices, and are not the object of the present invention. Much information can be found in the literature on these topics. For example, see Video Demystified by Keith Jack, published by Harris Semiconductor, for an in-depth description of digital composite video, digital component video, MPEG-1 and MPEG-2.
The present invention is based on technologies relating to the automatic extraction of metadata descriptions of digital multimedia content such as still images and video. The present invention also incorporates audio analysis engines that are available from third parties within an extensible metadata “engine” framework. These engines perform sophisticated analysis of multimedia content and generate metadata descriptions that can be effectively used to index the content for downstream applications such as search and browse. Metadata generated may include:
Image Feature Vectors
Keyframe storyboards
Various text attributes (closed-captioned (CC) text, teletext, time/date, media properties such as frame-rates, bit-rates, annotations, and so forth)
Speech-to-text and keyword spotting
Speaker identification (ID)
Audio classifications and feature vectors
Face identification/recognition
Optical Character Recognition (OCR)
Other customized metadata via extensibility mechanisms: GPS data; camera position and properties; any external collateral data; and so forth.
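The extensible engine framework described above can be illustrated with a minimal sketch. It assumes a registry of named extractor functions that each analyze the same digital sample and contribute one kind of metadata. The names `MetadataEngine`, `register`, `analyze`, `mean_level`, and `coarse_histogram` are hypothetical, and the two toy extractors merely stand in for the far more sophisticated engines listed above.

```python
from typing import Any, Callable, Dict


class MetadataEngine:
    """Hypothetical extensible engine: a registry of named extractors."""

    def __init__(self) -> None:
        self._extractors: Dict[str, Callable[[bytes], Any]] = {}

    def register(self, name: str, extractor: Callable[[bytes], Any]) -> None:
        # New metadata types are added by registering another extractor.
        self._extractors[name] = extractor

    def analyze(self, sample: bytes) -> Dict[str, Any]:
        # Run every registered extractor over the same digital sample.
        return {name: fn(sample) for name, fn in self._extractors.items()}


def mean_level(sample: bytes) -> float:
    """Toy extractor: average 8-bit sample value."""
    return sum(sample) / len(sample) if sample else 0.0


def coarse_histogram(sample: bytes) -> list:
    """Toy extractor: 4-bin histogram of 8-bit values (a crude feature vector)."""
    bins = [0, 0, 0, 0]
    for value in sample:
        bins[value // 64] += 1
    return bins


engine = MetadataEngine()
engine.register("mean_level", mean_level)
engine.register("histogram", coarse_histogram)
result = engine.analyze(bytes([0, 64, 128, 255]))
```

Keeping extractors behind a common call signature is one way a third-party engine (for example, an audio analysis engine) could be plugged in without changing the rest of the device software.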
A central theme of the technical approach is that it is most valuable to capture this type of metadata as far “upstream” as possible. This allows the metadata to be exploited throughout the lifecycle of the content, thereby reducing costs and improving access and utilization of the content. The natural conclusion of this approach is to extract the metadata at the point of content capture. Thus, the present invention embeds metadata engines inside of the physical capture devices such as digital still cameras, digital audio/video cameras, and so forth.
Digital cameras are rapidly advancing in capabilities and market penetration. Megapixel cameras are commonplace. This results in an explosion of digital still content, and the associated problems of storage and retrieval. The visual information retrieval (VIR) image engine available from Virage, Inc. has been used effectively in database environments for several years to address these problems. The computation of image feature vectors used in search and retrieval has to date been part of the back-end processing of images. The present invention pushes that computation to the cameras directly, with the feature vectors naturally associated with the still image throughout its life. A practical “container” for this combined image+feature vector is the FlashPix image format, which is designed to carry various forms of metadata along with the image. Image feature vectors may also be stored separately from the image.
Digital video cameras are also advancing rapidly, and are being used in a number of high-end and critical applications. They are also appearing at the consumer level. Digital video itself suffers from the same problems that images do, to an even greater degree since video data storage requirements are many times larger than those of still images. The search and retrieval problems are further compounded by the more complex and rich content contained in video (audio soundtracks, temporal properties, motion properties, all of which are in addition to visual properties).
The present invention is based on a sophisticated video engine to automatically extract as much metadata as possible from the video signal. This involves visual analysis, audio analysis, and other forms of metadata extraction that may be possible in particular situations. The present invention embeds this video engine directly inside the camera equipment such that the output is not only the digital video content, but a corresponding package of metadata which is time indexed to describe the video content. Promising “containers” for this combined video and metadata are the proposed MPEG-4 and MPEG-7 digital multimedia formats, which, like FlashPix for still images, are designed and intended to embed rich metadata directly in the video format to allow indexing and non-linear access to the video. The current version of QuickTime (on which MPEG-4 is based) is also an ideal container format. These standards are still under development (and MPEG-7 is several years away) and are not in widespread use, but these mechanisms are not required for the present invention. The metadata may be packaged in any form as long as an association with the original content is maintained as the video and metadata are downloaded from the camera into subsequent asset management and post-processing applications.
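As an illustration of a time-indexed metadata package that stays associated with its video, the following minimal sketch keeps metadata entries sorted by timestamp and supports non-linear access by time range. It is not a container format such as MPEG-4, MPEG-7, or QuickTime. The class `MetadataTrack`, its methods, and the `content_id` association key are assumptions made for this example.

```python
import bisect


class MetadataTrack:
    """Hypothetical time-indexed metadata package tied to one piece of content."""

    def __init__(self, content_id: str) -> None:
        self.content_id = content_id   # association with the original video
        self._times = []               # sorted timestamps, in seconds
        self._entries = []             # (time_s, kind, value), same order

    def add(self, time_s: float, kind: str, value) -> None:
        # Insert in timestamp order so range queries can use binary search.
        i = bisect.bisect_right(self._times, time_s)
        self._times.insert(i, time_s)
        self._entries.insert(i, (time_s, kind, value))

    def between(self, start_s: float, end_s: float) -> list:
        # Return all entries with start_s <= time <= end_s.
        lo = bisect.bisect_left(self._times, start_s)
        hi = bisect.bisect_right(self._times, end_s)
        return self._entries[lo:hi]


track = MetadataTrack("clip-001")
track.add(12.0, "keyframe", "kf_0003")
track.add(3.5, "speech", "hello")
track.add(7.0, "speaker", "A")
window = track.between(3.0, 8.0)
```

Because every entry carries its own timestamp and the track carries the content identifier, the package can travel separately from the video and still be matched to it downstream.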
A novel aspect and benefit of this embedded approach is that “clip marking” can become an automatic part of the videography process. Today, clips (defined by marking IN and OUT points in a video) must be defined in a post-process, usually involving a human to discern the clip boundaries and to add some additional metadata describing the clip. Some camera manufacturers (such as Sony) have enhanced their digital camera offerings to automatically generate clip boundaries based on the start and stop of recording segments. In the present invention, this type of automatic clip definition is a starting point for gathering and packaging video metadata. In addition to automatically marking the IN/OUT points, other collateral data may be associated with the clip and become part of the metadata. Often this metadata is already available to the camera electronics, or can be entered by the camera operator. Examples include:
Time/Date
Location
In a Hollywood-type setting, the Scene # and Take #
Any other alphanumeric information that could be entered or selected by the camera operator
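Automatic clip marking of this kind can be sketched as follows, assuming the camera reports record start and stop events with a timecode in seconds. The `Clip` and `ClipMarker` names and the collateral field names (`scene`, `take`, `location`) are hypothetical and serve only to illustrate how IN/OUT points and collateral data could be bundled together.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    in_s: float                      # IN point, seconds
    out_s: float                     # OUT point, seconds
    collateral: dict = field(default_factory=dict)


class ClipMarker:
    """Hypothetical helper: turns record start/stop events into clips."""

    def __init__(self) -> None:
        self.clips = []
        self._open_in = None
        self._collateral = {}

    def record_start(self, time_s: float, **collateral) -> None:
        if self._open_in is not None:
            raise RuntimeError("already recording")
        self._open_in = time_s
        # Collateral data known at start (date, location, scene, take, ...).
        self._collateral = dict(collateral)

    def record_stop(self, time_s: float) -> Clip:
        if self._open_in is None:
            raise RuntimeError("not recording")
        clip = Clip(self._open_in, time_s, self._collateral)
        self.clips.append(clip)
        self._open_in = None
        self._collateral = {}
        return clip


marker = ClipMarker()
marker.record_start(0.0, scene=12, take=3, location="Stage 4")
clip = marker.record_stop(41.5)
```

The resulting clip record is a natural place to attach any further metadata that the embedded engines extract between the IN and OUT points.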
In one aspect of the present invention, there is an integrated data and real-time metadata capture system, comprising a digital capture device producing a digital representation of one or more forms of media content; a feature extraction engine integrated with the digital capture device, the feature extraction engine having a plurality of feature extractors to automatically extract metadata in real-time from the digital content simultaneously with the capture of the content; and a storage device capable of storing the media content and the metadata, wherein selected portions of the metadata are associated with selected portions of the media content.
In another aspect of the present invention, there is an integrated data and real-time metadata capture method, comprising sensing analog signals, converting the analog signals to a digital representation of one or more forms of media content, compressing the digital media content, automatically extracting metadata in real-time from the digital media content simultaneously with the compressing of the digital media content, and storing the digital media content and the metadata, wherein selected portions of the metadata are associated with selected portions of the digital media content.
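The steps of this method can be sketched in software under several simplifying assumptions: the analog signal is modeled as lists of floats in the range 0.0 to 1.0, zlib stands in for the device's compression algorithm, and compression and extraction run one after the other within each frame rather than truly simultaneously as they would in a capture device. The function names `digitize`, `extract_metadata`, and `capture` are hypothetical.

```python
import zlib


def digitize(analog_samples, levels: int = 256) -> bytes:
    """Quantize analog values in [0.0, 1.0] to 8-bit digital samples."""
    return bytes(min(levels - 1, max(0, int(s * levels))) for s in analog_samples)


def extract_metadata(frame: bytes) -> dict:
    """Toy extractor standing in for the real-time metadata engines."""
    return {"sample_count": len(frame), "peak_level": max(frame, default=0)}


def capture(analog_frames) -> list:
    """Convert, compress, extract, and store; metadata is keyed per frame."""
    store = []
    for index, analog in enumerate(analog_frames):
        digital = digitize(analog)             # analog-to-digital conversion
        compressed = zlib.compress(digital)    # stand-in for the device codec
        metadata = extract_metadata(digital)   # taken from uncompressed data
        # Storing both under one frame index associates each portion of
        # metadata with its portion of the media content.
        store.append({"frame": index, "content": compressed, "metadata": metadata})
    return store


store = capture([[0.0, 0.5, 1.0], [0.25, 0.25]])
```

Extracting from the uncompressed digital samples, as done here, is one design choice; the method as stated does not require it.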