1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a computer program, and, more particularly, to an information processing apparatus, an information processing method, and a computer program that are configured to generate and record metadata usable for, among other purposes, the classification of content including still and moving images.
More specifically, the present invention relates to an information processing apparatus, an information processing method, and a computer program that are configured to interact with a user while content including still and moving images is reproduced for appreciation, observe the interaction through a camera and a microphone, for example, generate metadata from the information obtained by the observation, and analyze the content by use of the metadata set through the interaction, thereby enhancing the accuracy of the metadata obtained by the analysis.
2. Description of the Related Art
Recently, digital cameras and video cameras have become increasingly popular. Users can store content, such as still and moving images taken with these devices, in storage media, such as the hard disk drive of a PC (Personal Computer), a DVD, or a flash memory, for example. In order to reproduce or print content stored in these storage media, the user needs to search for desired pieces of content. However, as the number of pieces of content increases, extracting the desired pieces of content becomes increasingly difficult.
Normally, each piece of content records attribute information (or metadata), such as a name of the content, a date of shooting, and a place of shooting, for example, in correspondence with the substance data of the content including still and moving images. The user uses these pieces of metadata to search for desired pieces of content.
Metadata are broadly divided into those automatically generated in the course of shooting content and those given by the user as information corresponding to the shot data. For example, information such as a date of shooting is metadata automatically generated by each camera on the basis of its clock capability. On the other hand, user-generated metadata include various kinds of information, such as the place and persons that are the subjects of a particular shooting operation and the episodes involved therein, in addition to content titles, for example.
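The division described above can be illustrated with a minimal sketch. The record layout, the field names (`auto_metadata`, `user_metadata`, `shot_on`, `place`, `persons`), the file names, and the `search` helper are all hypothetical and introduced only for illustration; they are not taken from any of the cited documents.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ContentItem:
    """Hypothetical record pairing substance data with its metadata."""
    path: str  # location of the still or moving image (illustrative)
    auto_metadata: dict = field(default_factory=dict)  # e.g. set from the camera clock
    user_metadata: dict = field(default_factory=dict)  # e.g. title, place, persons


def search(items, **criteria):
    """Return the items whose combined metadata matches every key/value given."""
    results = []
    for item in items:
        merged = {**item.auto_metadata, **item.user_metadata}
        if all(merged.get(key) == value for key, value in criteria.items()):
            results.append(item)
    return results


# Two pieces of content shot on the same day; only the first was annotated by the user.
library = [
    ContentItem(
        "img_0001.jpg",
        auto_metadata={"shot_on": date(2006, 4, 1)},
        user_metadata={"place": "park", "persons": ("father",)},
    ),
    ContentItem(
        "img_0002.jpg",
        auto_metadata={"shot_on": date(2006, 4, 1)},
    ),
]

same_day = search(library, shot_on=date(2006, 4, 1))
in_park = search(library, place="park")
```

In this sketch, a search on the automatically generated date returns both items, whereas a search on the user-provided place returns only the item that the user annotated. Content without user-given metadata is therefore reachable only through the formal information.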
However, providing user-generated metadata is a very laborious operation because each user needs to provide the above-mentioned metadata for the personal content shot and recorded by the user every time shooting and recording are performed. In the case of broadcast content, such as television programs, by contrast, a configuration is normally employed in which the transmission source of the content or a third party provides various kinds of metadata to viewers as users. Each user can efficiently search for desired programs by use of the provided metadata. For personal content obtained by shooting and recording, however, metadata other than formal information, such as the date of shooting, has to be set by the user, which becomes a very cumbersome task as the volume of content increases.
A configuration intended to enable the user to efficiently execute a metadata providing task such as that described above is disclosed in Japanese Patent Laid-open No. 2001-229180 (hereinafter referred to as Patent Document 1). To be specific, Patent Document 1 proposes a configuration in which voice recognition or image recognition is executed on the audio data or image data contained in recorded content, such as shot video data, and the information obtained by the recognition is associated with the content as metadata, the content and the metadata being automatically recorded together. In addition, a configuration in which morphological analysis is executed on text information describing non-text content, such as images, to extract a keyword, and the extracted keyword is provided as the metadata corresponding to the content, is disclosed in Japanese Patent Laid-open No. 2003-228569 (hereinafter referred to as Patent Document 2).
A method in which audio scenario information prepared in association with content is used to provide words extracted by scenario voice recognition processing as metadata is disclosed in Japanese Patent Laid-open No. 2004-153764 (hereinafter referred to as Patent Document 3). Further, a method in which a biological reaction of a viewer during content reproduction is processed to provide the resultant data as sensory metadata is disclosed in Japanese Patent Laid-open No. 2003-178078 (hereinafter referred to as Patent Document 4).
The configuration described in Patent Document 1, namely, a method in which voice recognition and image recognition are applied to content, is convenient because metadata are provided automatically. However, unlike professionally shot data, personal content shot by amateur users is often low in image or audio quality. This presents a problem in that it is difficult to extract data, such as keywords usable as metadata, from such low-quality content by means of voice recognition or image recognition.
The method described in Patent Document 2, in which text information describing non-text content is used, has a problem in that it cannot be applied to personal content to which no text information is given. The scenario-based configuration disclosed in Patent Document 3 has a problem in that it is unavailable for content for which no scenario is recorded beforehand. The method using biological reactions disclosed in Patent Document 4 requires a device that is attached to each user to obtain biological information, such as blood pressure and blood flow, and a device for analyzing the observed values so obtained. These cannot be realized with a general-purpose PC, which pushes up the cost of realizing the method.